The Early-Stopping Trap: How I Killed RL Runs Before They Could Learn

Early stopping is supposed to save you from burning compute on a model that’s finished learning. For a long stretch, it was the thing burning my compute — by killing good agents before they ever got good.

More of my reinforcement-learning runs died to my own early-stopping rule than to bad ideas. The kill switch I’d built to be disciplined turned out to be the most undisciplined thing in the system.

[Figure: illustrative noisy reward curve that dips and is stopped early, with a dashed continuation showing the climb it never reached. Illustrative, not real performance data.]

The agent was stopped during a normal dip, right before the part where it would have climbed.

The kill switch that backfired

I borrowed early stopping from supervised learning, where it’s gospel: watch the validation loss, and when it stops improving for a while, stop — you’re only overfitting from here. It’s clean, it’s safe, and in supervised learning it usually works.
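
For reference, the supervised-style rule looks roughly like the sketch below. It is a minimal illustration, not the code from the actual system: the class name PatienceStopper, the patience and min_delta parameters, and the loss values are all made up for the example.

```python
class PatienceStopper:
    """Classic supervised-style early stopping on a loss (lower is better).

    Hypothetical helper for illustration only.
    """

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience      # evaluations tolerated without improvement
        self.min_delta = min_delta    # smallest change that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            # New best: reset the counter.
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def stopped_at(losses, patience=3):
    """Return the index of the evaluation where the rule fires, or None."""
    stopper = PatienceStopper(patience=patience)
    for i, loss in enumerate(losses):
        if stopper.should_stop(loss):
            return i
    return None


# Made-up validation losses: improvement, then a slow drift upward.
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
stop_index = stopped_at(val_losses, patience=3)
```

On a smooth loss curve this does what it promises: three evaluations without a new best, and the run ends.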

So I wired the same idea into RL training: watch the metric, lose patience after a plateau, stop the run, save the GPU-hours. It felt responsible. In practice it quietly threw away run after run, many of them early, before the agent had found anything at all. I ended up with a folder full of runs that had simply never started learning.

Why RL breaks the rule

A supervised validation curve is a fairly smooth, mostly monotone thing. A reinforcement-learning agent’s progress is not. Reward is noisy by construction: it depends on a policy that is itself changing, interacting with an environment full of variance. Improvement comes in fits and starts: long flats, sudden jumps, dips that recover. A plateau in RL is often not the end of learning. It’s the part right before it.

So a patience tuned for supervised smoothness fires during perfectly normal RL variance. You don’t catch a model that’s done; you execute one that was about to get interesting. And the “savings” are an illusion — you still pay for every run that died early and returned nothing.
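
To make that concrete, here is a toy sketch of a naive patience rule applied to a reward series (higher is better). The function name patience_stop_index and the reward numbers are invented for illustration; they are not real performance data, and the shape (a noisy flat followed by a climb) is simply assumed.

```python
def patience_stop_index(rewards, patience):
    """Return the eval index where a naive patience rule stops, or None.

    Hypothetical helper: stops after `patience` evaluations in a row
    that fail to beat the best reward seen so far.
    """
    best = float("-inf")
    bad = 0
    for i, r in enumerate(rewards):
        if r > best:
            best = r
            bad = 0
        else:
            bad += 1
            if bad >= patience:
                return i
    return None


# Made-up rewards: a noisy flat stretch, then a climb.
rewards = [0.0, 0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.0, 1.5, 3.0, 4.2, 5.0]

stop = patience_stop_index(rewards, patience=5)
best_before_stop = max(rewards[: stop + 1])   # best reward the killed run ever saw
best_if_continued = max(rewards)              # what the run would have reached
```

With a patience of 5 the rule fires inside the flat stretch, two evaluations before the climb begins in this invented series. The stopped run's best reward is 0.3; left alone, it reaches 5.0.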

Stop stopping early

The fix wasn’t a better number. It was inverting the default. Early stopping should be rare, not eager: a guard against genuine, sustained overfitting, not a hair-trigger on noise. That means putting the bar where noise can’t realistically reach it, and judging degradation over a window long enough to average out the run’s natural variance, not over a handful of evaluations. The default has to be: keep going. The burden of proof belongs on stopping, not on continuing.
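
One way to encode that inverted default is sketched below. This is an assumption-laden illustration, not the rule the actual system uses: the class name ConservativeStopper and the parameters window, margin, and sustain are hypothetical, and margin would have to be set from the run's own measured noise rather than from the toy value here.

```python
from collections import deque
from statistics import mean


class ConservativeStopper:
    """Stop only on sustained degradation of a windowed mean reward.

    Hypothetical sketch. `margin` is assumed to sit well above the
    noise level of the windowed mean; `sustain` is how many consecutive
    evaluations the degradation must persist before stopping.
    """

    def __init__(self, window: int = 50, margin: float = 1.0, sustain: int = 50):
        self.window = window
        self.margin = margin
        self.sustain = sustain
        self.recent = deque(maxlen=window)
        self.best_mean = float("-inf")
        self.degraded_for = 0

    def should_stop(self, reward: float) -> bool:
        self.recent.append(reward)
        if len(self.recent) < self.window:
            return False  # not enough evidence yet: keep going
        current = mean(self.recent)
        if current > self.best_mean:
            self.best_mean = current
        if current < self.best_mean - self.margin:
            self.degraded_for += 1   # clearly below the best window
        else:
            self.degraded_for = 0    # inside the noise band: reset
        return self.degraded_for >= self.sustain


def first_stop(rewards, **kwargs):
    """Return the index where the stopper fires, or None if it never does."""
    stopper = ConservativeStopper(**kwargs)
    for i, r in enumerate(rewards):
        if stopper.should_stop(r):
            return i
    return None


# Tiny windows so the toy series are readable; real settings would be far larger.
noisy_then_climb = [0.0, 0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.0, 1.5, 3.0, 4.2, 5.0]
sustained_collapse = [5.0, 5.0, 5.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

survives = first_stop(noisy_then_climb, window=3, margin=1.0, sustain=2)
collapses = first_stop(sustained_collapse, window=3, margin=1.0, sustain=2)
```

The design choice is that every path through should_stop defaults to returning False. Stopping requires the windowed mean to fall clearly below its own best and to stay there, so a single dip or a flat stretch never counts as evidence on its own.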

Put bluntly: patience is a hyperparameter too, and most beginners — me included — set it as if RL were supervised learning. It isn’t.

Where humans still belong

This rhymes with the lesson from the first post. I don’t decide, run by run, when to pull the plug — that’s just me freezing a guess into the system again. I decide what “genuinely getting worse” means versus “normal noise,” and I let that definition do the stopping. I’m designing the judge, not delivering each verdict by hand. Get the judge right and you stop discarding the runs that would have worked.

The takeaway

In RL, the most expensive bug is often impatience. Prematurely optimizing your compute budget doesn’t save compute — it costs you the agent, and the cruel part is you never see the result you killed.

So if your runs keep “not working,” check whether they ever got the chance to. Sometimes the model wasn’t failing. You were.


This is part 2 of an ongoing, anonymous log of building a reinforcement-learning trading system. It’s about method and mistakes, not signals — nothing here is investment advice, and no strategy details are shared.