I've been talking with ML engineers, infrastructure leads and CTOs about a problem that keeps coming up in slightly different forms.

This is a typical perspective:

"We trained a model three months ago. It performs well. We don't know exactly how we made it."

Feeling similarly frustrated? Others opened up about their fears of deploying safety-critical ML systems into production:

"The thing that keeps me up at night is test contamination. I don't fully trust that we didn't train on something we shouldn't have."

These same problems show up again and again all across ML engineering, whether you're fine-tuning models, prompt engineering, or working on agents. It's hard to maintain a full picture of exactly how complicated ML systems are built.

From a software engineering perspective, that feels like a huge step backward. Version control and CI gave us clear provenance for code. I, or anyone else in my company, could easily go onto GitHub and answer for my org:

  • What changed?
  • Who changed it?
  • When did it change?
  • What version is running in production?

But many systems today contain more and more ML models and far less code. We've traded lines of code for parameters, prompts, and retrieval configurations. These models, and the new ways of developing them, depend not only on source files but also on datasets, preprocessing steps, environment configuration, distributed execution, and often non-deterministic training behavior. A change to any of these can change the behavior of the model.

Provenance is knowing the origin and evolution of a system — how it came to be. Largely solved for software. Hard for ML systems. Erosion of trust in provenance in ML is a systemic problem. Loss of provenance happens quietly.

Recently I had a long conversation with one of our advisors, Christian Kästner, Professor at CMU, about provenance in ML systems. Christian has spent years studying software engineering practices for ML systems. His perspective helped frame the landscape more clearly. We talked about:

  • How do ML systems deal with provenance today?
  • Where does provenance typically break down?
  • The tradeoffs between velocity, visibility and flexibility.

Three Approaches to Provenance Today

If you zoom out, most of today's provenance systems fall into one of three buckets:

1. Walled gardens (SageMaker, Vertex, Azure ML). ML infrastructure platforms can track lineage because they control the entire development environment. Datasets, jobs, and storage all live inside one infrastructure boundary, which makes uniform tracking feasible.

Walled gardens are powerful, but only if you stay inside the garden. The more deeply lineage is embedded in a particular infrastructure stack, the more tightly your engineering workflows become coupled to it. Over time, provenance, governance, and cost structure all align with the platform's abstractions.

2. Explicit pipelines (DVC, Airflow, Dagster, etc.). Everything is declared up front. Pipelines are structured, so artifacts can be versioned. Provenance exists because you declare it.

Explicit declarations work, provided you don't change the pipeline frequently and you've got an expert to maintain it. If you want to iterate quickly, though, they can be burdensome. Engineers tweak configs, try alternative preprocessors, fork scripts, launch one-off runs. When exploration is fast, updating YAML files, maintaining DAG definitions, or restructuring pipelines becomes unwanted overhead.
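To make "provenance exists because you declare it" concrete, here is a minimal sketch in Python of what a declared stage boils down to. This is not the API of DVC, Airflow, or Dagster; the `Stage` class, its `fingerprint` method, and the file names in the usage note are hypothetical and exist only for illustration. The assumption is that a stage is identified by its command plus the contents of the inputs it declares.

```python
import hashlib
import json
from dataclasses import dataclass
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


@dataclass(frozen=True)
class Stage:
    """A hypothetical declared pipeline stage: a command plus its declared inputs."""
    name: str
    command: str
    deps: tuple[Path, ...] = ()

    def fingerprint(self) -> str:
        """Hash the command and the contents of every declared dependency."""
        record = {
            "command": self.command,
            "deps": {str(p): file_digest(p) for p in sorted(self.deps)},
        }
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

A stage declared as `Stage("prep", "python prep.py", deps=(Path("raw.csv"),))` gets a new fingerprint whenever `raw.csv` changes. If `prep.py` also reads a file that was never declared, that file can change without the fingerprint moving, which is exactly the kind of gap that opens up when people iterate faster than they update declarations.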

3. Manual experiment logging (Weights & Biases, MLflow, etc.). Experiment trackers take a lighter-weight approach. You log runs. You name them. You add comments.

But this creates its own friction. You're asked to supply a run label before you know how the run will perform, so labels chosen up front rarely carry meaning. The result is often dozens (or hundreds) of runs with hard-to-remember names and inconsistent annotations.

And still, all three approaches are leaky. Worse, they leak quietly: nothing tells you when the record has gone incomplete, so you can never fully rely on it.

"Provenance tracking today feels like filing compliance paperwork and the result is still often patchy and unreliable when a question actually comes up. No wonder nobody wants to do it."

— Prof. Christian Kästner, CMU

How Provenance Quietly Erodes

Accurate provenance has multiple points of failure, and most of them are of the silent but deadly sort. These failures can happen under any of the three approaches, because many ML systems don't live entirely within one of them:

  • A preprocessing step overwrites a file.
  • A config change isn't committed.
  • A junior engineer accidentally trains on the test set.
  • A model improves dramatically, but no one knows which change mattered.

These failures often occur at the "seams" — when teams move outside the declared pipeline, mix infrastructure, or experiment quickly.
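One of the failures above, accidentally training on the test set, can at least be spot-checked after the fact. The sketch below is a minimal Python example under stated assumptions: examples are plain strings, and "contaminated" means a test example matches a training example exactly once case and whitespace are normalized. The function names are hypothetical. It will not catch paraphrases or near-duplicates.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't hide a match."""
    return " ".join(text.lower().split())


def example_hash(text: str) -> str:
    """Hash the normalized text so large datasets can be compared as sets of digests."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_contamination(train: list[str], test: list[str]) -> list[int]:
    """Return indices of test examples whose normalized text also appears in train."""
    train_hashes = {example_hash(t) for t in train}
    return [i for i, t in enumerate(test) if example_hash(t) in train_hashes]


train = ["The cat sat on the mat.", "Dogs bark."]
test = ["the cat  sat on the mat.", "Birds sing."]
print(find_contamination(train, test))  # [0]
```

A check like this only answers the question for the files you hand it. It says nothing about which files a given training run actually read, which is the harder provenance question.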

No team enacts process perfectly, 100% of the time. So gaps accumulate. The system begins to feel opaque. That's when "black box" becomes less a metaphor and more a lived experience.

Velocity, Visibility or Flexibility?

One ML engineer at a large and mature organization told me:

"Flying blind doesn't feel like engineering."

What struck me wasn't just the frustration. It was the sense of regression.

Traditional software engineering gave us leverage over quality. Version control and CI gave us strong guarantees about what's running and who made it (or broke it), to the point that we now take them for granted. Those guarantees have helped teams close the loop and improve quality over time.

Provenance gives you knowledge of a system's origin and evolution, and that knowledge is what lets you improve quality systematically rather than accidentally: engineering a system instead of relying on luck. In ML, where it's hard to see exactly what changed in data, configuration, execution, or environment, we lose that leverage over quality. Improvements become harder to isolate; regressions harder to debug.

Yet, to get provenance in ML we're forced into a choice:

  • Constrain yourself to a single infrastructure boundary.
  • Introduce more process and orchestration.
  • Or just accept ambiguity and rely on institutional memory.

This is a tension between velocity, visibility and flexibility. Flexibility here means two things: the freedom to choose your infrastructure without lock-in, and the freedom that comes from not having to declare everything explicitly.

The Runtime Gap

Is there a way to preserve visibility and velocity — without losing flexibility? The core engineering questions haven't changed:

  • What data fed this model?
  • What code actually ran?
  • What changed between this run and the last?
  • Why did this metric move?

In our old-school software systems, all these answers lived in version control. In modern ML systems, they often don't. They emerge at runtime — and runtime is rarely treated as something we version, inspect, or reason about with the same rigor as source code.

Yet the questions engineers care about — what was read, what was written, what changed — are answered there. If traditional software engineering matured around version control for code, perhaps ML engineering needs an equivalent layer for execution.
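As a rough illustration of what a record of execution might contain, here is a minimal Python sketch, not a description of any existing tool. It snapshots what a run read, what it wrote, and the config it used, then diffs two such snapshots. `build_manifest` and `diff_manifests` are hypothetical names, and the sketch assumes the caller already knows which files were read and written; capturing that automatically at runtime is the hard part and is out of scope here.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(inputs: list[Path], outputs: list[Path], config: dict) -> dict:
    """Snapshot what a run read, what it wrote, and the config it ran with."""
    return {
        "inputs": {str(p): file_digest(p) for p in inputs},
        "outputs": {str(p): file_digest(p) for p in outputs},
        "config": dict(config),
    }


def diff_manifests(old: dict, new: dict) -> dict:
    """Report, per section, which keys were added, removed, or changed between two runs."""
    report = {}
    for section in ("inputs", "outputs", "config"):
        before, after = old.get(section, {}), new.get(section, {})
        shared = before.keys() & after.keys()
        changes = {
            "added": sorted(after.keys() - before.keys()),
            "removed": sorted(before.keys() - after.keys()),
            "changed": sorted(k for k in shared if before[k] != after[k]),
        }
        # Only include sections where something actually differs.
        if any(changes.values()):
            report[section] = changes
    return report
```

Given manifests from two runs, `diff_manifests(run_a, run_b)` answers "what changed between this run and the last?" in terms of file contents and config values. It cannot by itself answer "why did this metric move?", but it narrows the list of suspects.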

Not to enforce compliance, but to restore inspectability. To make "flying blind" feel less normal.

Closing

Christian put it simply in one of our conversations:

"In software engineering, we've developed habits and tools to track changes and test systems incrementally, maintaining tight control over what affects what. We call it version control, but at its core it's provenance for code. In high-assurance systems, we even establish traceability from requirements to code to tests. In ML, we're now rediscovering many of these engineering practices as we move beyond ad-hoc experimentation."

— Prof. Christian Kästner, CMU

If provenance is what lets engineering close the loop on quality, then ML needs a way to close the loop on execution.

And that means treating runtime as a first-class artifact.

Disagree? Where do you think provenance breaks down in ML systems?