AI trends move fast, and writing about them makes sense if it’s your job. I’m picking this one up because the RL environment conversation has escalated recently and some discussion seems needed. This is me trying to make sense of RL environments (and this understanding may evolve).

This was sparked by a few things: some RL environment startups are getting hot, others are using environments as an open-source growth engine, and some have cautioned against building an RL environment startup at all. I’ll leave you to read the linked posts to understand the basics.

tl;dr: Environments are valuable until priors saturate. Durable moats come from fresh, proprietary feedback loops or predictive reward models that reflect shifting reality.

What is an RL environment?

An environment supplies observations, accepts actions, emits rewards, and transitions state for a given RL setup. Algorithms (PPO, RLHF/DPO variants) optimize behavior within that environment. Evals are environments without learning turned on.
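
For concreteness, here’s a minimal sketch of that contract in Python, loosely following the Gymnasium-style reset/step interface. The toy task and reward below are placeholders of my own, not anything from a real training stack:

```python
import random
from typing import Tuple, Dict

class ToyEnv:
    """Minimal RL-environment contract: supply observations, accept actions,
    emit rewards, transition state. Toy task: move a counter to a hidden target."""

    def reset(self) -> int:
        self.target = random.randint(1, 10)    # hidden state for this episode
        self.pos = 0
        return self.pos                        # first observation

    def step(self, action: int) -> Tuple[int, float, bool, Dict]:
        self.pos += action                     # state transition
        done = self.pos == self.target
        reward = 1.0 if done else 0.0          # verifiable reward
        return self.pos, reward, done, {}

# An eval is the same environment with learning switched off:
# roll out a frozen policy, record the rewards, never update weights.
```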

Why does verifiable-reward RL work now?

RL can be understood as a function of three levers: the environment (where the reward comes from), the algorithm (how you propagate the reward), and the experience/training set of the base model (referred to as the ‘prior’ in both Bayesian and RL literature). Throughout RL research we have focused heavily on algorithms, coming up with new ways to propagate the reward while always starting from scratch. In his excellent post The Second Half, Shunyu breaks down why priors are the most important of the three, and why we previously had no way of getting them.

Over the last five years, we scaled pretraining to the entire internet and more. Models knew about all topics and could respond intelligently with details. But something was missing: an ability to make sense of all that pretrained knowledge. This is where Chain of Thought (CoT) came in. It lets models (not unlike humans) connect the dots and generalize from what they know. When you give a model compute to think before acting, it can use its knowledge in important ways. As Shunyu says:

> language generalizes through reasoning in agents

So the problem of making models better at a given task reduces to providing the right priors for that task[1]. With enough knowledge and the ability to reason, the model will generalize and get better at solving those tasks.

Environments and algorithms are tools to elicit or update these priors[2]. A language model with strong priors and enough test-time compute to reason will be able to finish related tasks. Moreover, we can teach a model new priors by designing the right set of tasks or evals.

Environments can generate training examples

Environments are particularly good at generating training examples quickly (which then become priors). The catch is that once the model has enough training to generalize, the environment’s marginal utility collapses: it goes from being critical to being little more than an eval harness. It remains useful for evaluations, safety/regression tests, and incremental improvements, but not nearly as valuable. Take web browsing[3]:

  • Base models have scant pretraining on DOM trajectories → Priors are almost zero.
  • SFT is expensive because annotating “click at (x,y) because CSS selector …” is labor-heavy and site-specific.
  • An environment (Playwright sandbox + reward = task success) can auto-generate thousands of trajectories per GPU-day (sketched after this list).
  • With ballpark priors, the model can generalize to new sites after ≈10k env steps (order of magnitude).
  • Hence, today, you need the environment to create the prior[4].
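
As a rough illustration of the Playwright bullet above, here’s a hedged sketch of what auto-generating trajectories in a sandbox could look like. The site, goal string, and random "policy" are hypothetical stand-ins, not a real pipeline:

```python
# Hypothetical sketch: roll out random actions in a Playwright sandbox and
# reward task success (here, "reach a URL containing the goal string").
import random
from playwright.sync_api import sync_playwright

GOAL = "checkout"                    # hypothetical task spec
START = "https://shop.example"       # hypothetical sandbox site

def rollout(max_steps: int = 10) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(START)
        trajectory = []
        for _ in range(max_steps):
            candidates = page.query_selector_all("a, button")
            if not candidates:
                break
            el = random.choice(candidates)               # stand-in for the base policy
            trajectory.append((page.url, el.inner_text()[:40]))
            try:
                el.click(timeout=2000)
                page.wait_for_load_state()
            except Exception:
                continue                                 # dead or hidden element; try again
            if GOAL in page.url:                         # verifiable success check
                browser.close()
                return {"trajectory": trajectory, "reward": 1.0}
        browser.close()
        return {"trajectory": trajectory, "reward": 0.0}
```

Successful rollouts become SFT data; all rollouts can feed RL.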

Alternatively, you can go through other routes to build these priors:

  • Synthetic text distillation without an environment: a teacher model generates DOM-action pairs, filters them with static checks, and adds them to the SFT mix (Adept did this and called it a verbal web simulator); see the sketch after this list.
  • Cross-modal transfer: recording videos of people clicking through websites (YouTube plus some data companies in India) creates a visual prior without needing an env.
  • Another way to obtain priors is to simply use a model that has already been trained on web browsing or on particular websites. First movers are at a disadvantage in this space; catching up is fairly quick.
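
For the distillation route (first bullet above), a hedged sketch of the static-check filter: a teacher proposes (DOM, action) pairs, and we keep only pairs whose click target actually exists in the DOM. The samples, action schema, and check logic are made up for illustration:

```python
# Hypothetical sketch: filter teacher-generated (DOM, action) pairs with static
# checks before adding them to SFT data -- no environment or live site needed.
from html.parser import HTMLParser

class SelectorIndex(HTMLParser):
    """Collect element ids and tag names so we can sanity-check proposed actions."""
    def __init__(self):
        super().__init__()
        self.ids, self.tags = set(), set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        for name, value in attrs:
            if name == "id":
                self.ids.add(value)

def passes_static_checks(dom_html: str, action: dict) -> bool:
    idx = SelectorIndex()
    idx.feed(dom_html)
    target = action.get("target", "")
    if target.startswith("#"):          # id selector must exist in the DOM
        return target[1:] in idx.ids
    return target in idx.tags           # otherwise require a matching tag name

teacher_samples = [
    {"dom": "<button id='buy'>Buy</button>", "action": {"op": "click", "target": "#buy"}},
    {"dom": "<div>No button here</div>",     "action": {"op": "click", "target": "#buy"}},
]
sft_data = [s for s in teacher_samples if passes_static_checks(s["dom"], s["action"])]
# Only the first sample survives the filter.
```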

Once a model has enough priors, it just needs reasoning ability and interaction with live websites. You don’t need special environments for every new website. They are now only useful for evaluations or safety training.

When environments create value vs. when they collapse to an eval harness

I picked the web browsing example deliberately because it’s a simple, crisp, short-horizon problem once the action prior exists. Multi-turn agentic work is more complex, but the same principle holds, just applied differently: you need orders of magnitude more steps, and you have to figure out long-horizon credit assignment and the final reward. The recipe remains the same:

cold start (no priors) → environment manufactures trajectories → SFT/RL to build skill priors → measure generalization on unseen tasks → environment reduces to eval harness + safety. This pattern recurs in coding (compile/tests as reward), compliance (approval as reward), and sales (response/outcome as reward proxy).
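
On the long-horizon credit assignment piece: one common (though certainly not the only) choice is to spread a single terminal reward over earlier steps with discounted returns. A minimal sketch:

```python
# Turn one verified terminal reward into per-step learning targets.
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """G_t = r_t + gamma * G_{t+1}: earlier steps get credit, shrunk by distance."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A 50-step agentic episode where only the final step is verified as a success:
episode_rewards = [0.0] * 49 + [1.0]
print(discounted_returns(episode_rewards)[:3])   # early steps get ~0.61 credit
```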

So, does this mean all environments eventually become worthless? It depends on whether the priors the AI needs to learn stay constant or keep changing. This brings us to an interesting type of environment:

Where moats can be created

Predictive Reward Environments

The technical term for these is Surrogate Reward Models (SRMs)

When rewards are delayed or subjective, you need a surrogate reward model (SRM) trained on real outcomes. That model is only as good as the breadth, freshness, and exclusivity of the data you feed it; maintaining that data stream is the key.

In cases like creating an effective sales pitch, knowing if a strategy document is good enough, or getting a compliance report approved, rewards are subjective and depend on multiple factors. You need environments that can model either human behavior or complex system interactions.

Here’s where predictive reward models come in. In drug discovery, we have models that can look at a protein structure and predict its binding probability, and assign that as a reward, instead of testing the structure in a wet lab. It’s a cheap method that is instant, scalable, and able to model delayed outcomes.

In business contexts, we need models that can predict the probability of a generated compliance report getting approved, the likelihood of a business committee preferring one strategy report over another, and so on.
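
A minimal sketch of what such a surrogate reward model could look like, assuming a simple text classifier trained on historical approval outcomes. The data, fields, and model choice are illustrative, not a recommendation:

```python
# Hypothetical SRM: predict "will this compliance report get approved?" and use
# that probability as the reward during RL, instead of waiting for the committee.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

historical_reports = [
    "Controls tested quarterly; two exceptions remediated with evidence attached.",
    "Controls not tested; no remediation plan provided.",
]
approved = [1, 0]            # real outcomes collected from the approval workflow

srm = make_pipeline(TfidfVectorizer(), LogisticRegression())
srm.fit(historical_reports, approved)

def reward(generated_report: str) -> float:
    """Cheap, instant stand-in for the delayed 'did it get approved?' signal."""
    return float(srm.predict_proba([generated_report])[0, 1])
```

The defensible part is not the model class; it’s the stream of fresh, exclusive outcomes feeding that fit.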

However, LLMs trained against such surrogate rewards are prone to Goodhart’s law and reward hacking: models frequently overfit to the proxy rather than the real objective. Moreover, SRMs fail quietly without recalibration and drift checks.
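
One way to make "recalibration and drift checks" concrete, as a sketch: periodically score the SRM on a fresh batch of real outcomes it was not trained on, and stop trusting it as a reward when calibration degrades. The metric and threshold below are arbitrary choices for illustration:

```python
# Hypothetical drift check for an SRM with a scikit-learn-style predict_proba.
from sklearn.metrics import brier_score_loss

def srm_drift_check(srm, fresh_texts, fresh_outcomes, threshold: float = 0.2) -> bool:
    """Return True if the SRM still tracks reality on data it was not trained on."""
    probs = srm.predict_proba(fresh_texts)[:, 1]
    # Brier score: mean squared error between predicted probability and outcome.
    brier = brier_score_loss(fresh_outcomes, probs)
    if brier > threshold:
        print(f"SRM drift detected (Brier={brier:.3f}); recalibrate before further RL.")
        return False
    return True
```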

Once you manage all that, SRMs are more like individual products than datasets. An SRM plus a live data flywheel is a defensible moat.

Cursor, Mercor, and Real World “Environments”

While I was writing this, Cursor announced online RL trained via live user interactions, rolling out a new policy every two hours. Then Mercor’s CEO wrote an article on X about something similar. For the context of this post, treating the real world as the environment is a completely different ballgame: the failure modes are different, the moats are more about user scale, sampling, and data exclusivity, and the goal is to capture constantly changing priors. The product becomes the environment, continually updating priors with real interactions.

The changing-priors scenario, though, means that Mercor’s claim that “Teaching an AI once is a fixed cost that eliminates the corresponding variable human cost forever” is not going to be widely applicable. You have to keep teaching in many such scenarios.

Beyond SRMs and real-world environments, there’s another category where environments struggle: preference-driven tasks with hidden states.

Environments for simulating buying on Amazon or booking a flight/hotel

I built a very early AI travel chatbot in 2018. The biggest learning from that was that different users have different implicit preferences, and merely finding the optimal flight/hotel is not worth much; users want any AI to read their mind. Incorporating their preferences nearly doubled our conversions[5].

Travel booking and Amazon buying are both a composite of two things (a reward built along these lines is sketched after the list):

  • Execution correctness: book the thing, pay, receive ticket (easily verifiable)
  • Preference fit: personalized trade‑offs (subjective; needs platform signals + a good user preference model)
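
A hedged sketch of how those two pieces could combine into one reward; the field names and weighting are invented, and the preference score would come from a separate user-preference model:

```python
# Verifiable execution gate plus a subjective preference-fit score.
from typing import Dict

def booking_reward(booking: Dict, user_pref_score: float, w_pref: float = 0.5) -> float:
    """Execution correctness is binary and checkable; preference fit is a model score in [0, 1]."""
    executed = booking.get("paid") is True and booking.get("confirmation_id") is not None
    if not executed:
        return 0.0                                   # no partial credit for an unfinished booking
    return (1 - w_pref) + w_pref * user_pref_score   # blend in the subjective part

# A completed booking that fits the user's preferences reasonably well scores 0.9:
print(booking_reward({"paid": True, "confirmation_id": "ABC123"}, user_pref_score=0.8))
```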

I’ll posit that, given how much value Amazon’s recommendation engine already captures, these environments, or even the buying workflow itself, would not find many real-world takers (unless Amazon offers an environment) because of how varied the right outcome is for every user. Decision-making as a prior is hard to simulate without data from the providers. With flights and hotels, there is the added complexity of dynamic pricing and modeling hidden states (inventory, etc.). Building these models from scratch is hard, but there is huge alpha for anyone who can.

Conclusion

If you are thinking of building an RL environment startup because everyone is doing it, or because models need RL, know the game you are playing. It’s useful to think in terms of priors and how they change.

  • If you have access to a constant data feed that captures human behavior no one else sees, build it. That’s a clear moat.
  • If you can map out human preferences in a way that models their buying, approval, or adoption patterns, that’s probably worth a $100B company.
  • Or sometimes, if you are lucky, your environment becomes the product itself, e.g. Claude Code.

Build for shifting priors with fresh, verifiable feedback. Everything else decays into an eval harness with a shinier marketing term.


PS: I am not building an RL environment but working on some cool ideas on long horizon RL. Please reach out if you would like to know more.

  1. With thinking / test-time compute, these priors generalize in a given environment. I see the end goal as getting to a model that can do tasks, with or without RL; priors are the key, and thinking or generalization is an action. Priors can be about knowledge, skill, preference, or even context (retrieval). That is a separate post, though.

  2. With backpropagation and verifiable rewards in the right environment, the model learns the specifics of how to solve a given task and updates its priors.

  3. From the same Shunyu blog post: > Language pre-training created good priors for chatting, but not equally good for controlling computers or playing video games. Why? These domains are further from the distribution of Internet text, and naively doing SFT / RL on these domains generalizes poorly. So you need to add more to the training data.

  4. For a task to be done well, the model needs to learn and narrow down which trajectory of generalization helps. 

  5. I won’t go into much detail here on how. Long story short: we sent a 25-question survey to every new user, got 500 responses, interviewed them further, and built an engine to incorporate those preferences. It was more nuanced than it sounds, and directionally right. Perhaps I’ll cover it in another post.