How Agents Learn Through Trial and Error: Reinforcement Learning

Reinforcement learning (RL) and LLM-based multi-agent orchestration are often discussed in the same breath, and that conflation hides a meaningful engineering boundary. RL trains a policy through reward signals from environment interaction. LLM agent frameworks chain pre-trained language models through prompts, tools, and shared memory. They are different mechanisms with different failure modes, and choosing the wrong one for a problem is one of the more expensive errors we see in agentic AI work.

This article walks through the core mechanics of RL — Markov Decision Processes, the Bellman equation, value and policy methods, Q-learning — and then frames where RL belongs in the wider multi-agent landscape that dominates 2026 discussion. Our purpose is not to pick a winner. It is to give engineers a vocabulary precise enough to decide which technique their problem actually needs.

What is reinforcement learning, and how does it differ from LLM orchestration?

RL trains an agent to maximise cumulative reward by acting in an environment. The agent observes a state, picks an action, receives a reward, and moves to a new state. Over many episodes, training shifts the policy toward action choices that produce the highest expected long-term return. The training signal is the reward itself; there are no labelled examples.
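
The observe, act, reward cycle can be sketched as a short loop. This is a minimal illustration rather than a training algorithm: `CorridorEnv`, its `reset` and `step` methods, and the fixed `always_right` policy are hypothetical names invented for this example, loosely modelled on the shape of common environment interfaces.

```python
class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..length-1.

    The agent starts in cell 0 and the episode ends at the last cell.
    """

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small step cost, bonus at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the undiscounted cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # pick an action
        state, reward, done = env.step(action)  # receive reward, move to new state
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    """A fixed, hand-written policy; a learning agent would update this."""
    return 1
```

Calling `run_episode(CorridorEnv(5), always_right)` plays a single episode. An RL algorithm would wrap this loop and use the observed rewards to change the policy between steps or episodes.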

LLM-based multi-agent orchestration is structurally different. The “agents” are pre-trained language models wrapped in prompt templates, tool interfaces, and message-passing layers. There is no reward function, no policy update, and no training loop at inference time. Coordination emerges from prompting and orchestration code, not from reinforcement signals.

This distinction matters because the two paradigms break in different ways. RL fails when the reward function is misspecified, when the state space is too large to explore, or when training diverges. LLM orchestration fails when prompts drift under load, when tool calls cascade into deadlocks, or when one agent’s hallucinated output corrupts the shared context downstream. We discuss the orchestration-side failure surface in How Multi-Agent Systems Coordinate — and Where They Break; the present article focuses on the RL side.

Core concepts: MDPs and the Bellman equation

Markov Decision Process

A Markov Decision Process formalises the environment an RL agent acts in. An MDP is defined by four elements:

States — the situations the agent can be in.
Actions — the choices available in each state.
Transition probabilities — the likelihood of moving from one state to another given an action.
Rewards — the scalar feedback received after each transition.

The Markov property holds when the next state depends only on the current state and action, not the full history. That assumption is what makes RL tractable, and it is also where many real-world RL deployments break down: when the true environment has long-range dependencies that the state representation does not capture, the learned policy generalises poorly.
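
The four elements map naturally onto a small data structure. The sketch below is one possible encoding, assuming a finite MDP; the `MDP` class, the `validate` helper, and the two-state `toy` example are hypothetical and exist only to make the definition concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: tuple
    actions: tuple
    # transitions[(state, action)] -> list of (probability, next_state)
    transitions: dict
    # rewards[(state, action, next_state)] -> scalar reward
    rewards: dict


def validate(mdp, tol=1e-9):
    """Check that outgoing probabilities sum to 1 for every (state, action)."""
    for s in mdp.states:
        for a in mdp.actions:
            total = sum(p for p, _ in mdp.transitions[(s, a)])
            if abs(total - 1.0) > tol:
                return False
    return True


# Two-state example: "wait" stays put, "go" usually switches state.
toy = MDP(
    states=("A", "B"),
    actions=("wait", "go"),
    transitions={
        ("A", "wait"): [(1.0, "A")],
        ("A", "go"): [(0.9, "B"), (0.1, "A")],
        ("B", "wait"): [(1.0, "B")],
        ("B", "go"): [(0.9, "A"), (0.1, "B")],
    },
    rewards={("A", "go", "B"): 1.0},  # missing entries are treated as 0
)
```

Note that each transition entry is keyed only by the current state and action. That is the Markov property expressed as a data-modelling constraint: nothing about earlier states can influence the next-state distribution.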

The Bellman equation

The Bellman equation expresses the value of a state recursively: the value of the current state equals the expected immediate reward plus the discounted expected value of the next state, when following the chosen policy. This recursive structure is what lets RL algorithms compute expected long-term return without enumerating every possible trajectory.
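
Written out in standard textbook notation, the recursion looks as follows. The symbols are our assumption, since the text above does not fix a notation: V^π is the value function under policy π, P the transition probabilities, R the reward, and γ a discount factor between 0 and 1.

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
             \left[ R(s, a, s') + \gamma \, V^{\pi}(s') \right]
```

The optimality form replaces the average over the policy's actions with a maximum over actions:

```latex
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)
           \left[ R(s, a, s') + \gamma \, V^{*}(s') \right]
```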

In practice, the Bellman equation is the engine behind value iteration, policy iteration, and Q-learning. Each method differs in how it uses the recursion — but they all rely on the same fundamental decomposition.

Methods: dynamic programming, value iteration, policy iteration, Q-learning

Dynamic programming

Dynamic programming solves an MDP exactly when the full transition model is known. It applies the Bellman equation iteratively across the entire state space until value estimates stabilise. DP gives the optimal policy, but it requires complete knowledge of transition probabilities and reward dynamics — a requirement that rarely holds outside of simulated environments or small discrete domains.

Value iteration

Value iteration is a value-based method that refines state values until convergence. The agent starts from an arbitrary initial value function, then repeatedly applies the Bellman optimality update — picking, at each state, the action that maximises the expected return. Once values stabilise, the optimal policy is read off directly: in each state, pick the action with the highest expected value.

Value iteration works well when the state and action spaces are well-defined and small enough to enumerate. In a grid-world navigation task, for instance, the agent calculates the value of every cell with respect to the goal, then at each step moves to the neighbouring cell with the highest value until it reaches the goal.
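
A minimal tabular sketch of the procedure is shown below. It assumes a deterministic four-cell corridor rather than a full grid; the `value_iteration` function, the dictionary layout, and the corridor itself are illustrative choices, not a reference implementation.

```python
def value_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-8):
    """Return (values, greedy_policy) for a small tabular MDP.

    transitions[(s, a)] is a list of (probability, next_state) pairs;
    rewards[(s, a, s_next)] is a scalar, defaulting to 0 when absent.
    """
    def q_value(s, a, values):
        return sum(
            p * (rewards.get((s, a, s2), 0.0) + gamma * values[s2])
            for p, s2 in transitions[(s, a)]
        )

    values = {s: 0.0 for s in states}  # arbitrary initial value function
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality update: best expected return over actions.
            best = max(q_value(s, a, values) for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Read the policy off the converged values.
    policy = {s: max(actions, key=lambda a: q_value(s, a, values)) for s in states}
    return values, policy


# Hypothetical 4-cell corridor; cell 3 is an absorbing goal.
states = [0, 1, 2, 3]
actions = ["left", "right"]
transitions, rewards = {}, {}
for s in states:
    for a in actions:
        if s == 3:
            s2 = 3  # absorbing: stays at the goal, no further reward
        else:
            s2 = max(s - 1, 0) if a == "left" else s + 1
        transitions[(s, a)] = [(1.0, s2)]
        if s != 3 and s2 == 3:
            rewards[(s, a, s2)] = 1.0

values, policy = value_iteration(states, actions, transitions, rewards)
```

With a discount of 0.9, the values fall off geometrically with distance from the goal, and the greedy policy moves right from every non-goal cell.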

Policy iteration

Policy iteration alternates between two phases: policy evaluation (compute the value function under the current policy) and policy improvement (update the policy to be greedy with respect to the value function). The cycle continues until the policy stops changing.

The practical difference between value and policy iteration is convergence behaviour. Policy iteration often converges in fewer outer iterations, but each iteration is more expensive because it includes a full policy evaluation. The choice between them depends on the problem (the size of the state space and the cost of each evaluation) rather than on a universal rule.
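
The two phases can be sketched as nested loops. As before, this is an illustrative version on a hypothetical deterministic four-cell corridor; the function name, the returned round counter, and the small tie-breaking threshold are our own choices.

```python
def policy_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-8):
    """Return (values, policy, improvement_rounds) for a small tabular MDP."""
    def q_value(s, a, values):
        return sum(
            p * (rewards.get((s, a, s2), 0.0) + gamma * values[s2])
            for p, s2 in transitions[(s, a)]
        )

    policy = {s: actions[0] for s in states}  # arbitrary starting policy
    values = {s: 0.0 for s in states}
    improvement_rounds = 0
    while True:
        # Policy evaluation: iterate the Bellman expectation update to convergence.
        while True:
            delta = 0.0
            for s in states:
                new_value = q_value(s, policy[s], values)
                delta = max(delta, abs(new_value - values[s]))
                values[s] = new_value
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q_value(s, a, values))
            # Only switch on a strict improvement, so ties do not cause cycling.
            if q_value(s, best, values) > q_value(s, policy[s], values) + 1e-12:
                policy[s] = best
                stable = False
        improvement_rounds += 1
        if stable:
            return values, policy, improvement_rounds


# Hypothetical 4-cell corridor; cell 3 is an absorbing goal.
states = [0, 1, 2, 3]
actions = ["left", "right"]
transitions, rewards = {}, {}
for s in states:
    for a in actions:
        s2 = 3 if s == 3 else (max(s - 1, 0) if a == "left" else s + 1)
        transitions[(s, a)] = [(1.0, s2)]
        if s != 3 and s2 == 3:
            rewards[(s, a, s2)] = 1.0

values, policy, rounds = policy_iteration(states, actions, transitions, rewards)
```

The inner loop is where the extra per-iteration cost comes from: every outer round pays for a full evaluation of the current policy before a single improvement step is taken.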

Q-learning

Q-learning is one of the most widely used model-free RL algorithms. The agent maintains a Q-function (an estimate of the expected return for each state-action pair) and updates it through direct interaction, without ever modelling the environment’s transition probabilities.

The update rule uses the temporal-difference error between the observed reward plus discounted next-state value, and the current Q estimate.
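
In standard textbook notation, the update reads as follows. The symbols are our assumption: α is the learning rate, γ the discount factor, r the observed reward, and s' the next state.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

The bracketed term is the temporal-difference error: the gap between the bootstrapped target and the current estimate.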

[Figure: Q-learning update rule formula. Source: Medium]

When the state space is too large to tabulate Q-values directly (for example, high-dimensional sensor inputs in robotics or pixel frames in games), a neural network approximates the Q-function. This is deep Q-learning, the family of methods popularised by DeepMind’s Atari work and now widely used in robotic control and simulation environments. Common implementations sit on PyTorch or TensorFlow, with environments wrapped in interfaces like Gymnasium and training runs tracked with tools such as MLflow or Weights & Biases.

The exploration–exploitation trade-off is the central design choice. The agent must try new actions enough to discover their rewards (exploration) while exploiting actions known to be good (exploitation). Epsilon-greedy is the standard baseline: with probability epsilon, pick a random action; otherwise pick the current best. More sophisticated schemes — softmax, upper-confidence-bound, intrinsic motivation — adjust the balance based on uncertainty or novelty signals.
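
Tabular Q-learning with epsilon-greedy action selection fits in a few dozen lines. The sketch below assumes a hypothetical five-cell corridor with a reward of 1 at the right-hand end; the function names and the hyperparameter values are illustrative defaults, not recommendations.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon explore; otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties randomly so early all-zero rows do not bias towards action 0.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def step(state, action, n_states):
    """Hypothetical corridor dynamics: action 1 moves right, action 0 moves left."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), n_states - 1)
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


def train_q_table(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state, done = 0, False
        for _ in range(200):  # cap episode length
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = step(state, action, n_states)
            # Temporal-difference target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train_q_table()
# Greedy action in each non-terminal cell after training.
greedy_actions = [row.index(max(row)) for row in q_table[:-1]]
```

The agent never sees the `step` function's internals; it only observes the states and rewards that come back. Swapping `epsilon_greedy` for a softmax or upper-confidence-bound rule changes how the agent explores without touching the update itself.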

When does a problem genuinely need RL?

This is the decision we encounter most often when scoping agent projects, and it is also the most commonly mishandled. The default temptation is to reach for RL whenever “the system needs to learn”. That framing is too loose to be useful.

Signal	Use RL	Use LLM orchestration	Use plain automation
Reward function is well-defined and measurable	✓	—	—
Environment can be simulated cheaply	✓	—	—
Decisions are sequential with delayed feedback	✓	partial	—
Task is mostly language understanding and tool use	—	✓	—
Task is deterministic with known rules	—	—	✓
Training data exists but no environment	no (use supervised learning)	possible	—

RL is the right tool when there is a sequential decision problem, a measurable reward, and an environment (real or simulated) the agent can interact with at scale. It is the wrong tool when the actual problem is language understanding, code generation, or document routing — for those, LLM-based agents and conventional automation outperform RL on cost, latency, and engineering effort.

Multi-agent reinforcement learning (MARL) extends single-agent RL to settings where several learning agents interact. MARL is genuinely useful in domains like traffic control, market simulation, and cooperative robotics. It is rarely the right answer for the orchestration patterns most enterprise teams call “multi-agent” — those are almost always LLM-orchestration problems wearing RL vocabulary. We unpack that distinction in multi-agent system design principles.

Value-based, policy-based, and model-based RL

The three families are distinguished by what they learn: a value function, a policy, or a model of the environment.

Value-based methods (Q-learning, DQN) estimate a value function and derive the policy from it. They are sample-efficient in discrete action spaces and have a long track record in game-playing and grid-world domains.

Policy-based methods (REINFORCE, PPO, actor-critic variants) parameterise the policy directly and update it via gradient ascent on expected return. Actor-critic combines a policy network (actor) with a value estimator (critic), and is the dominant approach in continuous-control problems like robotic manipulation.
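
The core of a policy-gradient update can be shown on the smallest possible problem. The sketch below applies REINFORCE with a running-average baseline to a hypothetical two-armed bandit with fixed payouts. It has no states and no neural network, so it illustrates only the gradient step, not a full actor-critic system; all names and constants are assumptions made for the example.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(payouts=(0.2, 1.0), steps=2000, lr=0.1, seed=0):
    """REINFORCE on a hypothetical two-armed bandit with fixed payouts."""
    rng = random.Random(seed)
    prefs = [0.0 for _ in payouts]  # policy parameters (action preferences)
    baseline = 0.0                  # running reward average, reduces variance
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(payouts)), weights=probs)[0]
        reward = payouts[action]
        advantage = reward - baseline
        # Gradient of log pi(action) w.r.t. preference k is 1[k == action] - probs[k].
        for k in range(len(prefs)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += lr * advantage * grad_log_pi  # gradient ascent on return
        baseline += 0.05 * (reward - baseline)
    return softmax(prefs)


final_probs = reinforce_bandit()
```

Here the baseline plays a role similar to the critic in actor-critic methods: it gives the policy update a reference point, so that actions are reinforced for being better than expected rather than merely for earning a positive reward.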

Model-based RL learns a model of the environment’s dynamics and uses it to plan ahead through simulation. When the model is accurate, model-based methods are dramatically more sample-efficient than model-free ones. When the model is inaccurate, planning amplifies the errors — and recovering from a wrong model can be harder than learning model-free from scratch.
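
A minimal sketch of the learn-then-plan pattern: estimate transition probabilities and rewards from sampled transitions, then run value iteration on the estimate instead of the real environment. The slippery corridor in `true_step`, the sample budget, and the helper names are hypothetical. Practical systems typically use learned function approximators and interleave data collection with planning.

```python
import random
from collections import defaultdict


def true_step(state, action, rng):
    """Hypothetical 'real' environment: a 4-cell corridor with slippery moves."""
    if state == 3:
        return 3, 0.0  # absorbing goal
    intended = state + 1 if action == "right" else max(state - 1, 0)
    next_state = intended if rng.random() < 0.8 else state  # 20% chance to slip
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward


def learn_model(states, actions, samples_per_pair=200, seed=0):
    """Estimate transition probabilities and mean rewards from sampled transitions."""
    rng = random.Random(seed)
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    for s in states:
        for a in actions:
            for _ in range(samples_per_pair):
                s2, r = true_step(s, a, rng)
                counts[(s, a)][s2] += 1
                reward_sums[(s, a, s2)] += r
    model = {}
    for (s, a), nexts in counts.items():
        n = sum(nexts.values())
        # Each entry: (estimated probability, next state, estimated mean reward)
        model[(s, a)] = [
            (c / n, s2, reward_sums[(s, a, s2)] / c) for s2, c in nexts.items()
        ]
    return model


def plan(states, actions, model, gamma=0.9, sweeps=200):
    """Value iteration on the learned model instead of the real environment."""
    values = {s: 0.0 for s in states}

    def q_value(s, a):
        return sum(p * (r + gamma * values[s2]) for p, s2, r in model[(s, a)])

    for _ in range(sweeps):
        for s in states:
            values[s] = max(q_value(s, a) for a in actions)
    return {s: max(actions, key=lambda a: q_value(s, a)) for s in states}


states, actions = [0, 1, 2, 3], ["left", "right"]
learned_model = learn_model(states, actions)
planned_policy = plan(states, actions, learned_model)
```

All environment interaction happens inside `learn_model`; `plan` can then run as many sweeps as it likes at no interaction cost. The same structure shows the failure mode: if the estimated probabilities are wrong, `plan` optimises confidently against the wrong world.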

In our experience, the choice between these families is driven less by theoretical elegance and more by what the environment affords. If you can simulate cheaply, model-free policy gradients with PPO are a reliable default. If real-world interaction is expensive — robotic hardware, clinical settings, physical infrastructure — model-based methods or offline RL become structurally necessary.

Applications and where the boundary sits

RL is in production across a few well-bounded domains: robotic manipulation and locomotion (Boston Dynamics, ANYbotics), recommendation and bidding systems where reward signals are clean, supply-chain optimisation against simulation environments, and data-centre cooling control. In each case, the problem has a clean reward, a feasible simulation environment, and sequential structure.

In healthcare, deep RL has shown promise for adaptive treatment regimes and dosing optimisation, but production deployment is constrained by the difficulty of defining safe reward functions and by regulatory requirements around model auditability. We treat these as research-grade applications rather than production-ready ones.

The boundary worth holding in mind: RL trains decision-making behaviour from environmental interaction, and LLM orchestration composes pre-trained reasoning into workflows. Most agentic AI work in 2026 is the latter. RL remains essential where the former applies, and it is unhelpful — and expensive — when teams reach for it because “agents” feels like an RL problem.

References

Guide, S. (2023, January 7). The Q in Q-learning: A Comprehensive Guide.
Javatpoint. (2023, October). Reinforcement Learning Tutorial.
Neptune.ai. (2023). Markov Decision Process in Reinforcement Learning.
Singh, N. (2023, July 10). The Bellman Equation.
Thorat, R. (2023, October 29). Actor-Critic method explained.