Blog

Train Different.

Thinking out loud about ML engineering, provenance, and what it means to build reliable systems in a world that runs on learned behavior.

Benchmarks · Provenance

Tracing a Training Run Costs 1%. The PCIe H100 "Discount" Costs 89%.

Purify and Valgrind made tracing infamous. We measured roar on real H100 pretraining: about 1% on a three-hour run. The expensive variable turned out to be the one nobody had put a number on.

Jul 14, 2026 · 6 min read

Provenance · Source of Truth

Git Solved Provenance. AI Quietly Broke It.

Git made provenance objective and automatic — history as a byproduct of doing the work. Then the models crept in. Every MLOps tool since is a crutch that rebuilds provenance before or after the run. The fix is during.

Jun 25, 2026 · 5 min read

Provenance · ML Engineering

How Provenance Quietly Erodes in ML Engineering

Every ML system is a chain — data, pre-processing, extraction, model. Trust lives or dies at the seams. Three approaches to provenance today, and why all of them leak.

Feb 26, 2026 · 8 min read

Benchmarks · Provenance

Tracing a Training Run Costs 1%. The PCIe H100 "Discount" Costs 89%.

Runtime observation of ML training sounds ruinously expensive — if you've ever used Valgrind, you just winced. So we measured it on real H100 pretraining. The wince is obsolete, and the expensive variable was one nobody was watching.

Chris Geyer · July 14, 2026 · 6 min read

Did you know Reed Hastings' first company sold software that watched other software run?

Yes, the Netflix guy. Before the DVDs-by-mail, before the streaming wars, Hastings co-founded Pure Software, whose flagship product — Purify — did one thing: it caught memory leaks in the act, on Sun workstations, in the early '90s.

Wait, a memory leak, what's that, you say? Well, sonny, back when the dinosaurs roamed and we walked uphill both ways to school, we worried about memory leaks. (How quaint!) You'd allocate a chunk of memory, forget to hand it back, and your program would bloat until the system keeled over. The bug was invisible: nothing in your source code looked wrong, and nothing in the crash told you where the memory went.

So Purify — and later Valgrind, its open-source descendant on Linux — ran your program under observation. Every memory operation intercepted, every byte in and every byte out accounted for, and a report at the end telling you exactly where you'd dropped one. Developers paid real money for this, because the alternative was guessing.

Hold that thought.

The crazy idea

Last time I argued that models broke the source of truth git gave us — and that the fix isn't reconstructing provenance after the run or declaring it before, but capturing it during: automatically, objectively, as a byproduct of doing the work.

Because that's what provenance does in ML: it leaks. Not with a crash — quietly, the way memory used to. A dataset gets renamed, a preprocessing step goes uncommitted, a config lives only in someone's shell history — and nothing in your source looks wrong. There is no lint for a provenance leak. But there can be a Purify.

So here's the crazy idea. Catch provenance leaking at runtime, the way Purify caught memory. Intercept every file a training run reads and writes. Account for every byte in and every byte out. Hand back a complete record of what the job actually touched.

There's a catch, and anyone who ever ran Valgrind just winced.

Valgrind makes your program crawl — 20x, 50x, go get lunch and it still isn't done. Do that to a training run and you've turned an expensive job into an unaffordable one. On a multi-node GPU cluster running hundreds of dollars an hour, time is money in the most literal way. If runtime observation doubled runtime, the conversation would be over before it started.

Runtime observation — that's crazy!

And it would be, if it cost what Valgrind cost. Which is exactly why we didn't argue about it. We measured it.

We built it twice, on purpose

We built the tool — Runtime Observation & Artifact Registration, which abbreviates to roar, and yes, we worked backward from the animal, and no, I'm not sorry. It records every file a run reads and writes.

We built more than one version of it deliberately: if the answer depends on how you happen to observe, it isn't really an answer. One tracer watches file activity from inside the operating system kernel; another intercepts file operations out in user space, before they ever reach the kernel — two vantage points on the same events, with different overhead profiles. (The mechanics are a rabbit hole worth falling into another day: ptrace is the one to avoid like the plague, eBPF and LD_PRELOAD are the fast paths. Full detail in the tracer docs.)
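To make the user-space vantage point concrete, here is a toy sketch in Python. It is not how roar works: it uses the standard library's `sys.addaudithook` (Python 3.8+) to log `open` events inside a single Python process, so it misses anything that bypasses the Python runtime. The names `accessed` and `record_opens` and the file `weights.bin` are made up for illustration.

```python
import os
import sys
import tempfile

accessed = []  # hypothetical in-memory log of (path, mode) pairs

def record_opens(event, args):
    # CPython raises an "open" audit event with (path, mode, flags)
    # whenever a file is opened through open(), io.open() or os.open().
    if event == "open":
        path, mode, _flags = args
        accessed.append((str(path), mode))

# Audit hooks stay installed for the life of the process.
sys.addaudithook(record_opens)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "weights.bin")
    with open(target, "wb") as f:   # a write the "run" performs
        f.write(b"\x00" * 16)
    with open(target, "rb") as f:   # a read the "run" performs
        f.read()

# Keep only the events for our file; the interpreter opens other files too.
touched = [(os.path.basename(p), m) for p, m in accessed
           if p.endswith("weights.bin")]
print(touched)
```

The point of the sketch is the shape of the idea: the run does its normal work, and the record of what it touched accumulates on the side without the training code declaring anything.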

The first benchmark failed

Then we had to point it at something real — nanochat pretraining on 2×H100 GPUs — and the first thing we learned is that we'd measured nothing at all.

We'd compared a traced run on one cloud instance against an untraced run on another, supposedly identical instance. The numbers looked interesting, and for a little while we believed them — until it dawned on us that the two machines differed by nearly 10% on their own, before any instrumentation entered the picture. The signal we were chasing was smaller than that. We'd been admiring noise.

That wince turned out to be the most useful result of the whole exercise. To measure the overhead at all, we had to corner it: traced and untraced workloads back-to-back on the same hardware, over and over, same seed, same config. Do that and the noise collapses to about a quarter of a percent within a machine — and the tracer signal, finally, steps out of the fog.
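A minimal sketch of that back-to-back comparison, assuming you already have wall-clock times for alternating untraced and traced runs on one machine. The function name and the timings are hypothetical placeholders, not roar's benchmark harness or measured data.

```python
from statistics import mean, stdev

def overhead_and_noise(untraced_s, traced_s):
    """Return (relative tracing overhead, relative noise of the baseline).

    Both arguments are lists of wall-clock seconds from runs on the SAME
    machine with the same seed and config.
    """
    if len(untraced_s) < 2 or not traced_s:
        raise ValueError("need at least two untraced runs and one traced run")
    base = mean(untraced_s)
    overhead = mean(traced_s) / base - 1.0   # extra time caused by tracing
    noise = stdev(untraced_s) / base         # run-to-run spread without tracing
    return overhead, noise

# Hypothetical timings in seconds, for illustration only.
untraced = [1000.0, 1002.0, 998.0]
traced = [1015.0, 1017.0, 1013.0]

overhead, noise = overhead_and_noise(untraced, traced)
print(f"overhead {overhead:.2%}, baseline noise {noise:.2%}")
```

The overhead is only worth reporting when it is clearly larger than the noise. Comparing across two different machines folds the machine-to-machine gap into the overhead figure, which is exactly the mistake described above.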

Observation is cheap

So how cheap is it? Cheaper than I'd braced for. On shorter runs, tracing added 1.5–4% of wall-clock depending on which tracer; on a roughly three-hour run, that fell to about 1.1%. The shape of it makes sense once you see it: the tracer pays a roughly fixed cost while the GPUs spend more and more time doing real work, so the longer the run, the smaller a slice observation takes. (Full numbers in the benchmarks.)

Why doesn't the ghost of Valgrind apply? Because Purify and Valgrind stood inside the factory, inspecting every operation on every machine — each memory access of a CPU-bound program passed through the observer. Lineage doesn't need the factory floor. It needs the loading dock — what came in, what went out — and on a modern training run the factory floor is the GPU, while the loading dock is comparatively quiet. Valgrind watched every machine. roar watches the doors.

The headline isn't that tracing is free. It's that it isn't remotely expensive enough to be the thing that kills the idea.

The thing we didn't go looking for

And then, chasing that first number, we tripped over a second one we hadn't gone looking for.

Needing another pod, I grabbed whatever was available on RunPod — and what was left that day happened to be PCIe H100s rather than the NVLink SXM hardware we'd been on. Same GPU. Same 80 GB. Different interconnect. If you're deep into GPUs you're already rolling your eyes — of course PCIe is slower. But I hadn't put a number on it, and the number was startling: the same workload ran nearly twice as long — 89% more wall-clock. That gap was about twenty times larger than the worst tracing overhead we'd measured.

Do the economics and it gets worse. The PCIe H100 rents for roughly 10% less per hour than its NVLink sibling. A 10% discount on hardware that takes 89% longer means each completed run costs you roughly 70% more. The discount has negative value. Better still: the same RunPod catalog listing landed different silicon on different days — one pod SXM-class, one PCIe. Nominally identical, priced identically, half the speed.
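The arithmetic behind that figure, as a small sketch. The function names are my own; the only inputs are the two numbers quoted above (roughly 10% cheaper per hour, 89% more wall-clock).

```python
def relative_cost_per_run(price_discount, slowdown):
    """Cost of one completed run relative to the baseline hardware.

    price_discount: fractional hourly discount (0.10 means 10% cheaper).
    slowdown: fractional extra wall-clock (0.89 means 89% longer).
    """
    return (1.0 - price_discount) * (1.0 + slowdown)

def break_even_discount(slowdown):
    """Hourly discount at which the slower hardware costs the same per run."""
    return 1.0 - 1.0 / (1.0 + slowdown)

ratio = relative_cost_per_run(price_discount=0.10, slowdown=0.89)
needed = break_even_discount(slowdown=0.89)

print(f"cost per run: {ratio:.3f}x baseline")
print(f"break-even discount: {needed:.1%}")
```

Cost per run is hourly price times hours, so the two factors multiply: 0.90 × 1.89 ≈ 1.70. By the same arithmetic, the hourly discount would have to be close to half before the slower hardware merely broke even.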

Now, this doesn't mean hardware doesn't matter; engineers sweat it for good reason. What stopped me was noticing where my own attention had gone: I'd spent weeks fretting over a few percent of instrumentation cost while waving off hardware effects twenty times bigger as background scenery. The expensive variable was the one I'd never put a number on — and if you're not tracking which hardware your runs actually landed on, it's going to bite you too.

Which changes the question

The standing objection to runtime observation is that visibility costs speed. Having actually measured it, I don't think that's the live question anymore. On modern training workloads, observation is cheap enough to be practical — so the interesting question was never whether we can afford to observe a run. It's whether we can trust what we observed.

Remember the PhD student from last time, the one whose model already knew the answer? He didn't recover by trusting his results more — he recovered by refusing to, and then verifying. Cheap observation buys you a faithful record of what happened. It doesn't yet tell you whether that record can be believed. That's the harder half of the problem, and the one I'll pick up next.

What about you — what's the biggest variable in your training stack that you've never actually put a number on?

Provenance · Source of Truth

Git Solved Provenance. AI Quietly Broke It.

History used to accrue as a byproduct of doing the work. Models broke that — and every MLOps tool since has been a crutch.

Chris Geyer · June 25, 2026 · 5 min read

The other day I was speaking with a PhD student working in machine learning. When he was just getting started, he'd built a few-shot model that performed really well — almost too well.

Stop me if you've heard this story before.

He kept building on it. But the contradictions started piling up: results that didn't agree across tests. He changed the architecture and found some bugs, but got stranger results, so he changed more things and went in circles for a long time before he finally instrumented the model directly and found the culprit. The target label had been leaking into the model the whole time.

It performed well because it already knew the answer.

Like others before him, he'd been burned by test leakage. His takeaway: don't trust your model. In his words, "don't be happy about a good result." Be skeptical, and stress-test a good-looking result before you believe it.

That conversation wasn't unique. I'd heard variations of it over and over during the past year, and eventually it led me to a question: How do you build trust back into the process of building AI? There's a sizable industry around MLOps tooling and AI governance, but do we have a source of truth that underlies trust?

Software 1.0's Source of Truth — Objective and Automatic

In the world of Software 1.0, we developers have operated around a common source of truth centered on Git, later extended by GitHub, Docker, Terraform, package managers, CI systems, and other tooling. Every line, every change, every author, every moment a thing went from working to broken, all of it in one place you can see. Everyone from the newest engineer right up to your VP of Software can answer "What's running in production?" and "Who broke main?"

How did git give us a source of truth?

It was objective and automatic.

Objective and complete: the artifact and its history are the same object. There is no interpretation and there's no choosing which code changes to track. Git doesn't merely store files, it makes the history checkable — anyone can verify that this state came from that sequence of changes, cryptographically, without having to trust a person who tells them so.

Changes are logged automatically: history accrues as a byproduct of doing the work, not as a chore you perform alongside it. Nobody declares their commit graph in advance or writes it up afterward — you commit, and the provenance is simply there.
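The "checkable" half of that can be sketched in a few lines. `git_blob_id` follows Git's real SHA-1 blob format (a `blob <size>\0` header plus the content). `chain_id` is a deliberately simplified stand-in for commit objects, not Git's actual commit encoding, and the config strings are invented.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # Git names a blob by hashing a short header followed by the bytes.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def chain_id(parent_id: str, blob_id: str) -> str:
    # Simplified "commit": its id depends on its parent's id and its content.
    return hashlib.sha1(f"{parent_id}:{blob_id}".encode()).hexdigest()

# Two honest versions of a config file.
v1 = chain_id("", git_blob_id(b"lr = 0.1\n"))
v2 = chain_id(v1, git_blob_id(b"lr = 0.01\n"))

# Rewrite the first version and replay the same second change.
tampered_v1 = chain_id("", git_blob_id(b"lr = 0.5\n"))
tampered_v2 = chain_id(tampered_v1, git_blob_id(b"lr = 0.01\n"))

print(v2 == tampered_v2)
```

Because every id folds in its parent's id, changing any earlier state changes every id after it. That is what lets anyone verify the history without trusting whoever handed it to them.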

But just as it seemed like we were on top of things, the models started creeping in.

Today's MLOps Tools: Crutches for Our Lost Source of Truth

The story of models creeping in has been told before: see Andrej Karpathy's Software 2.0 story.

I won't repeat the story other than to say that a model, a binary blob, is the output of a factory with many inputs — data, code, config, environment, repeated calls to a random number generator — and the blob itself keeps no receipt of any of them. Stare at the weights all you like; they won't tell you what went in.

With Software 2.0 we lost the automatic transparency that we had in Software 1.0. We lost traceability, reproducibility, and accountability, and we made it harder to collaborate, because only a few have access to the "truth".

Today's MLOps tools are the crutches trying to make up for that loss of trust.

We've got pipeline orchestrators like Airflow, Dagster and others that are great if your factory doesn't change. Great for mature pipelines, not great if you need to be flexible.

We've got experiment trackers that are great at showing loss curves, MFU, and GPU temperatures, but can't guarantee completeness. Good for experimenting, but not great if you need to eventually believe in your production model.

We've got "walled gardens" or end-to-end ML tools like SageMaker, Vertex and others, which are powerful (if not always performant) tools, but they tie provenance to an infrastructure boundary.

And we've got AI governance tools that audit a record they didn't capture and can't check — they sit downstream of a source of truth that was never there.

Because these solutions are incomplete and force these tradeoffs, every organization adds a bit of DIY too. So on top of all our other losses, pile on a loss of standardization and added complexity.

Gosh! Are we just stuck here?

Tools Build Provenance Before or After Runtime

We're not stuck. We're looking in the wrong place.

Every tool on that list — the orchestrators, the trackers, governance tools, the walled gardens, the bit of DIY glue every team adds — is trying to reconstruct provenance after the fact, or make you declare it up front. They treat the lost source of truth as something you have to build by hand — before or after. That's why none of them quite work: they fail the two tests that made Software 1.0 trustworthy in the first place.

Build Provenance During Runtime — Automatically and Objectively

But here's the thing we keep walking past: the run already knows what it touched. Git never asked you to declare your commit history — it fell out of doing the work. The data a training job reads, the code it runs, the artifacts it writes: all of that is really happening, on a real machine, whether or not anyone writes it down. The provenance isn't gone. It's just not being captured.

That has made me wonder if we've just had the problem backwards. The question was never which crutch do I buy to limp along without a source of truth. It's whether a model's provenance can be automatic and objective again — recorded as a byproduct of training, checkable by anyone, the way git's was.

I think it can. I'll show you how next time.

What about you — where do you think the source of truth in ML actually breaks down? Walled garden, pipeline, tracker, DIY? Where does your team's source of truth actually live today?

Provenance · ML Engineering

How Provenance Quietly Erodes in ML Engineering

Every ML system is a chain — data, pre-processing, extraction, model. Trust lives — or dies — at the seams.

Chris Geyer · February 26, 2026 · 8 min read

I've been talking with ML engineers, infrastructure leads and CTOs about a problem that keeps coming up in slightly different forms.

This is a typical perspective:

"We trained a model three months ago. It performs well. We don't know exactly how we made it."

Others, similarly frustrated, opened up about their fears of deploying safety-critical ML systems into production:

"The thing that keeps me up at night is test contamination. I don't fully trust that we didn't train on something we shouldn't have."

These same problems show up again and again all across ML engineering, whether you're fine-tuning models, prompt engineering, or working on agents. It's hard to maintain a full picture of exactly how complicated ML systems are built.

From a software engineering perspective, that feels like a huge step backward. Version control and CI gave us clear provenance for code. I, or anyone else in my company, could easily go onto GitHub and answer for my org:

What changed?
Who changed it?
When did it change?
What version is running in production?

But many systems today contain more and more ML models and far less code. We've traded lines of code for parameter counts, prompts, and retrieval configurations. These new models, and the new modes of development around them, depend not only on source files but on datasets, pre-processing steps, environment configuration, distributed execution, and often non-deterministic training behavior. A change to any of these can change the behavior of the model.

Provenance is knowing the origin and evolution of a system — how it came to be. Largely solved for software. Hard for ML systems. Erosion of trust in provenance in ML is a systemic problem. Loss of provenance happens quietly.

Recently I had a long conversation with one of our advisors, Christian Kästner, Professor at CMU, about provenance in ML systems. Christian has spent years studying software engineering practices for ML systems. His perspective helped frame the landscape more clearly. We talked about:

How do ML systems deal with provenance today?
Where does provenance typically break down?
The tradeoffs between velocity, visibility and flexibility.

Three Approaches to Provenance Today

If you zoom out, most of today's provenance systems fall into one of three buckets:

1. Walled gardens (SageMaker, Vertex, Azure ML). Infrastructure ML platforms can track lineage because they control the entire development environment. Datasets, jobs, and storage all live inside one infrastructure boundary, which makes uniform tracking feasible.

Walled gardens are powerful, but only if you stay inside the garden. The more deeply lineage is embedded in a particular infrastructure stack, the more tightly your engineering workflows become coupled to that infrastructure stack. Over time, provenance, governance, and cost structure all align with the platform's abstractions.

2. Explicit pipelines (DVC, Airflow, Dagster, etc.). Everything is declared up front. Pipelines are structured, so artifacts can be versioned. Provenance exists because you declare it.

Explicit declarations work, provided you don't change the pipeline frequently and you've got an expert to maintain them. If you want to iterate quickly, though, they can be burdensome. Engineers tweak configs, try alternative pre-processors, fork scripts, launch one-off runs. When exploration is fast, updating YAML files, maintaining DAG definitions, and restructuring pipelines all become unwanted overhead.

3. Manual experiment logging (Weights & Biases, MLflow, etc.). Experiment trackers take a lighter-weight approach. You log runs. You name them. You add comments.

But this creates its own friction. You're asked to supply a run label before you fully understand how the run will perform — so pre-specified labels often lack meaning. The result is often dozens (or hundreds) of runs with hard-to-remember names and inconsistent annotations.

And still, these approaches are leaky. Worse, they leak quietly, so you don't know it's happening, and the provenance you end up with is never completely reliable.

"Provenance tracking today feels like filing compliance paperwork and the result is still often patchy and unreliable when a question actually comes up. No wonder nobody wants to do it."
— Prof. Christian Kästner, CMU

How Provenance Quietly Erodes

Accurate provenance has multiple points of failure, and most of them are of the silent-but-deadly sort. These failures can happen under any of the three approaches, because many ML systems don't live perfectly within any one of them:

A preprocessing step overwrites a file.
A config change isn't committed.
A junior engineer accidentally trains on the test set.
A model improves dramatically, but no one knows which change mattered.

These failures often occur at the "seams" — when teams move outside the declared pipeline, mix infrastructure, or experiment quickly.
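The first failure on that list, a silently overwritten file, is the easiest to make visible. Here is a minimal sketch that hashes a directory before and after a step and reports what moved. The helper names `snapshot` and `diff` and the file names are hypothetical, and a real tool would capture this at runtime rather than by rescanning a directory.

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(root: Path) -> dict:
    """Map each file under root (by relative path) to the SHA-256 of its bytes."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def diff(before: dict, after: dict) -> dict:
    """Report files added, removed, or changed between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in shared if before[k] != after[k]),
    }

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train.csv").write_text("x,y\n1,2\n")
    before = snapshot(root)

    # A preprocessing step overwrites its input in place and adds an output.
    (root / "train.csv").write_text("x,y\n1,3\n")
    (root / "features.csv").write_text("f\n0.5\n")

    report = diff(before, snapshot(root))

print(report)
```

The file name never changed, so nothing in a directory listing or a run label would flag it. Only the content hash shows that `train.csv` is no longer the file the earlier run read.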

No team enacts process perfectly, 100% of the time. So gaps accumulate. The system begins to feel opaque. That's when "black box" becomes less a metaphor and more a lived experience.

Velocity, Visibility or Flexibility?

One ML engineer at a large and mature organization told me:

"Flying blind doesn't feel like engineering."

What struck me wasn't just the frustration. It was the sense of regression.

Traditional software engineering gave us leverage over quality. Version control and CI tools made it easy to take for granted the benefits of stronger guarantees on what's running and who made it — or broke it. All of which has helped teams close the loop and improve quality over time.

With provenance you have knowledge about the origin and evolution of a system, which allows you to make systematic quality improvements — with extra emphasis on systematic rather than accidental. Engineering a system instead of relying on luck. In ML, where it's hard to see exactly what changed — in data, configuration, execution, or environment — we lose our leverage over quality. Improvements become harder to isolate; regressions harder to debug.

Yet, to get provenance in ML we're forced into a choice:

Constrain yourself to a single infrastructure boundary.
Introduce more process and orchestration.
Or just accept ambiguity and rely on institutional memory.

This is a tension between velocity, visibility and flexibility. Flexibility here means two things: the freedom to choose your infrastructure without lock-in, and the freedom of not having to declare everything explicitly.

The Runtime Gap

Is there a way to preserve visibility and velocity — without losing flexibility? The core engineering questions haven't changed:

What data fed this model?
What code actually ran?
What changed between this run and the last?
Why did this metric move?

In our old-school software systems, all these answers lived in version control. In modern ML systems, they often don't. They emerge at runtime — and runtime is rarely treated as something we version, inspect, or reason about with the same rigor as source code.

Yet the questions engineers care about — what was read, what was written, what changed — are answered there. If traditional software engineering matured around version control for code, perhaps ML engineering needs an equivalent layer for execution.

Not to enforce compliance, but to restore inspectability. To make "flying blind" feel less normal.

Closing

Christian put it simply in one of our conversations:

"In software engineering, we've developed habits and tools to track changes and test systems incrementally, maintaining tight control over what affects what. We call it version control, but at its core it's provenance for code. In high-assurance systems, we even establish traceability from requirements to code to tests. In ML, we're now rediscovering many of these engineering practices as we move beyond ad-hoc experimentation."
— Prof. Christian Kästner, CMU

If provenance is what lets engineering close the loop on quality, then ML needs a way to close the loop on execution.

And that means treating runtime as a first-class artifact.

Disagree? Where do you think provenance breaks down in ML systems?