In the companion essay, I argued that LLM-based “agents” don’t deserve the name. They have no beliefs, no uncertainty quantification, no principled way to decide whether a tool query is worth its cost. This post is the technical backing for that claim — the mathematics and code behind Credence, the benchmark I used to test it.
The Problem: Every Query Has a Price
A standard LangChain ReAct agent, faced with a question and four tools, will query most of them for most questions. It has no mechanism to reason about whether the next query is worth its cost. The prompt says “be helpful” and the agent interprets helpfulness as thoroughness.
This is fine if queries are free. They are not. Every API call costs tokens, latency, and — in a benchmark with explicit costs — points. The question isn’t “can this tool help me?” but “does the expected information gain from this tool exceed what I’m paying for it?”
Answering that requires three things the LangChain agent lacks: a model of tool reliability that updates from experience, a way to compute the expected value of information before committing to a query, and a decision rule that compares querying against submitting or abstaining. All three are undergraduate probability theory.
Modelling Tool Reliability with Beta Distributions
Each tool has a different reliability for each question category. A calculator is perfect for numerical questions and useless for everything else. A knowledge base is excellent for factual recall but expensive. We don’t know these reliabilities in advance — we learn them.
The natural model is Beta-Bernoulli. For each tool-category pair, maintain a Beta distribution over the probability that the tool returns the correct answer:
\[r_{t,c} \sim \text{Beta}(\alpha_{t,c},\, \beta_{t,c})\]
Start with \(\alpha = 1, \beta = 1\) — a uniform prior, total ignorance. The expected reliability is the familiar ratio:
\[\mathbb{E}[r_{t,c}] = \frac{\alpha_{t,c}}{\alpha_{t,c} + \beta_{t,c}}\]
After each question, once ground truth is known, we update: if the tool gave the correct answer, increment \(\alpha\); if it was wrong, increment \(\beta\). Each increment is weighted by the category posterior, so learning is spread across categories in proportion to how likely the question was to belong to each:
```python
import numpy as np

def update_reliability_table(
    table: ReliabilityTable,
    tool_idx: int,
    category_posterior: CategoryPosterior,
    tool_was_correct: bool | None,
    forgetting: float = 1.0,
) -> ReliabilityTable:
    new_table = table.copy()
    if tool_was_correct is None:
        # No ground truth for this question: nothing to learn
        return new_table
    params = new_table[tool_idx]
    if forgetting < 1.0:
        # Exponentially discount old evidence before adding new
        params[:, 0] = np.maximum(1e-10, forgetting * params[:, 0])
        params[:, 1] = np.maximum(1e-10, forgetting * params[:, 1])
    if tool_was_correct:
        params[:, 0] += category_posterior  # increment alpha
    else:
        params[:, 1] += category_posterior  # increment beta
    return new_table
```
The forgetting parameter (\(\lambda = 0.95\) in the drift experiments) implements exponential decay — multiplying \(\alpha\) and \(\beta\) by \(\lambda\) before each update. This prevents old observations from dominating when tool reliability changes mid-task. It’s what allows the agent to notice when a tool degrades and reallocate queries to alternatives within a few questions.
Two features make this work well. First, the update is exact. No gradient descent, no approximation, no convergence check. Observe, update, done. Second, the Beta distribution naturally encodes both the estimate and the uncertainty. A tool with \(\text{Beta}(2, 2)\) and one with \(\text{Beta}(20, 20)\) both have expected reliability 0.5, but the agent knows it’s far less certain about the first.
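The closed-form moments make that comparison concrete. A minimal sketch — `beta_stats` is an illustrative helper, not part of the benchmark code:

```python
def beta_stats(alpha: float, beta: float) -> tuple[float, float]:
    """Mean and variance of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, var

print(beta_stats(2, 2))    # mean 0.5, variance 0.05
print(beta_stats(20, 20))  # mean 0.5, variance ~0.0061
```

Same point estimate, an order of magnitude less variance — and the VOI calculation below can exploit that difference.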
Bayesian Updates on Answers
When a tool returns answer \(x_j\), we need to update our beliefs about which answer is correct. The likelihood model is simple: if the tool’s effective reliability is \(r\), it returns the correct answer with probability \(r\) and each wrong answer with probability \(\frac{1-r}{3}\):
\[P(\text{tool says } x_j \mid \text{true answer is } x_i, r) = \begin{cases} r & \text{if } i = j \\ \frac{1-r}{3} & \text{if } i \neq j \end{cases}\]
Bayes’ rule gives us the posterior over answers:
\[P(x_i \mid \text{tool says } x_j) \propto P(\text{tool says } x_j \mid x_i, r) \cdot P(x_i)\]
In code, this is a few lines of NumPy:
```python
def update_answer_posterior(
    prior: AnswerPosterior,
    response_idx: int,
    r_effective: float,
) -> AnswerPosterior:
    n = len(prior)
    wrong_likelihood = (1.0 - r_effective) / (n - 1)
    likelihood = np.full(n, wrong_likelihood)
    likelihood[response_idx] = r_effective
    updated = prior * likelihood
    total = updated.sum()
    if total < 1e-10:
        return prior.copy()  # degenerate case: keep the prior
    return updated / total
```
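A worked pass through this update, inlined so it runs standalone: a uniform prior over four candidates, and a tool with effective reliability 0.8 naming candidate 2. The numbers are illustrative, not from the benchmark:

```python
import numpy as np

prior = np.full(4, 0.25)  # total ignorance over four candidates
r = 0.8
likelihood = np.full(4, (1 - r) / 3)  # each wrong answer: (1 - r) / (n - 1)
likelihood[2] = r                     # the named answer gets probability r
post = prior * likelihood
post /= post.sum()
print(post)  # candidate 2 jumps from 0.25 to 0.8; the rest share the remainder
```

One query from a well-calibrated tool moves the best candidate from 0.25 to 0.8 — comfortably past the submission threshold derived below.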
But the effective reliability \(r\) is not known exactly — we have a distribution over it. We marginalise over category uncertainty:
\[r_{\text{eff}} = \sum_{c} P(c \mid \text{question}) \cdot \mathbb{E}[r_{t,c}]\]
The agent doesn’t know which category a question belongs to either, so it maintains a category posterior that updates as tools respond. A calculator returning “not applicable” eliminates the numerical category. A knowledge base returning nothing shifts probability toward categories with low coverage. Every piece of information tightens every belief simultaneously.
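That marginalisation is just a dot product between the category posterior and a tool's row of expected reliabilities. A sketch with hypothetical names — `effective_reliability` and the example numbers are illustrative, not the benchmark's API:

```python
import numpy as np

def effective_reliability(
    category_posterior: np.ndarray,  # P(c | question), sums to 1
    reliability_means: np.ndarray,   # E[r_{t,c}] per tool and category
    tool_idx: int,
) -> float:
    # r_eff = sum_c P(c | question) * E[r_{t,c}]
    return float(reliability_means[tool_idx] @ category_posterior)

cat_post = np.array([0.7, 0.2, 0.1])  # three categories
means = np.array([[0.9, 0.3, 0.5]])   # one tool's expected reliability per category
print(effective_reliability(cat_post, means, 0))  # 0.9*0.7 + 0.3*0.2 + 0.5*0.1 = 0.74
```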
Expected Utility: When to Answer
The scoring rule is: +10 for a correct answer, -5 for a wrong answer, 0 for abstaining. This makes the expected utility of submitting answer \(x_j\):
\[\text{EU}_{\text{submit}}(x_j) = P(x_j) \cdot 10 + (1 - P(x_j)) \cdot (-5) = 15 \cdot P(x_j) - 5\]
Setting \(\text{EU}_{\text{submit}} > \text{EU}_{\text{abstain}} = 0\) gives a clean threshold: only submit when your best candidate has posterior probability above \(\frac{1}{3}\). Below that, the expected cost of a wrong answer outweighs the expected gain.
```python
def eu_submit(answer_posterior: AnswerPosterior) -> float:
    p_best = float(np.max(answer_posterior))
    return p_best * REWARD_CORRECT + (1.0 - p_best) * PENALTY_WRONG

def eu_abstain() -> float:
    return REWARD_ABSTAIN

def eu_star(answer_posterior: AnswerPosterior) -> float:
    return max(eu_submit(answer_posterior), eu_abstain())
```
This is why the Bayesian agent abstains sometimes. Not because it “doesn’t know” — that’s a vibes judgement — but because the posterior probability of its best answer falls below the decision-theoretic threshold. Abstention is a calculated choice, not a failure mode.
The Decision Loop
The full loop ties these pieces together. At each step, the agent computes the EU of submitting its best answer, the EU of abstaining, and the net VOI of each unused tool. It picks the action with the highest expected utility:
```python
def select_action(state, reliability_table, tool_configs):
    eu_sub = eu_submit(state.answer_posterior)
    eu_abs = eu_abstain()
    best_action = Action(ActionType.ABSTAIN, eu=eu_abs)
    if eu_sub >= eu_abs:
        best_idx = int(np.argmax(state.answer_posterior))
        best_action = Action(ActionType.SUBMIT, answer_idx=best_idx, eu=eu_sub)
    for t_idx in range(len(tool_configs)):
        if t_idx in state.used_tools:
            continue  # each tool is queried at most once per question
        voi = compute_voi(
            state.answer_posterior, reliability_table,
            t_idx, state.category_posterior, tool_configs[t_idx],
        )
        net = voi - tool_configs[t_idx].cost
        if net > best_action.eu:
            best_action = Action(
                ActionType.QUERY, tool_idx=t_idx, eu=net,
            )
    return best_action
```
The loop terminates when the best action is submitting or abstaining — when no tool query has a net VOI exceeding the current best option. This is convergent by construction: each query either improves the posterior enough to justify its cost, or the agent stops. No maximum iteration count, no hardcoded “query at most N tools.” The mathematics decides.
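`select_action` calls `compute_voi`, which isn't shown above. A simplified, self-contained sketch of the preposterior calculation it performs — here the category marginalisation is collapsed into a single effective reliability `r_eff`, so the signature differs from the benchmark's:

```python
import numpy as np

REWARD_CORRECT, PENALTY_WRONG, REWARD_ABSTAIN = 10.0, -5.0, 0.0

def eu_star(p: np.ndarray) -> float:
    p_best = float(np.max(p))
    return max(p_best * REWARD_CORRECT + (1 - p_best) * PENALTY_WRONG,
               REWARD_ABSTAIN)

def compute_voi(answer_posterior: np.ndarray, r_eff: float) -> float:
    n = len(answer_posterior)
    baseline = eu_star(answer_posterior)
    expected_eu = 0.0
    for j in range(n):  # enumerate every answer the tool could return
        likelihood = np.full(n, (1 - r_eff) / (n - 1))
        likelihood[j] = r_eff
        p_response = float(likelihood @ answer_posterior)  # P(tool says x_j)
        post = answer_posterior * likelihood / p_response  # Bayes update
        expected_eu += p_response * eu_star(post)
    return expected_eu - baseline  # expected gain from querying, before cost

print(compute_voi(np.full(4, 0.25), r_eff=0.9))  # 8.5: a reliable tool is
                                                 # worth a lot under ignorance
```

On a uniform prior the baseline EU* is 0 (abstain), and a 90%-reliable tool lifts the expected post-query EU* to 8.5 — so any such tool costing less than 8.5 points is worth calling.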
What the Numbers Show
The benchmark runs 50 multiple-choice questions with four tools of varying cost and reliability. The results from the essay bear repeating with the mechanism made explicit:
| Agent | Accuracy | Total score | Tool queries/question |
| --- | --- | --- | --- |
| Bayesian | 59.6% | +112.6 | ~1.0 |
| LangChain ReAct | 63.7% | -8.0 | 3.22 |
| LangChain Enhanced | 66.0% | -68.2 | 3.94 |
| Random | 25.0% | -72.5 | 2.0 |
The LangChain Enhanced agent — the one with careful prompting about being selective — performed worse than basic LangChain and barely beat random. The prompt told it to be careful, and it interpreted “careful” as “thorough,” querying 3.94 tools per question. More sophisticated prompting produced worse outcomes because cost-benefit analysis is not a prompting problem.
The Bayesian agent’s lower accuracy is a feature, not a bug. It abstained on questions where its posterior confidence was too low to justify submission — choosing 0 points over an expected loss. The LangChain agents submitted every time, racking up wrong-answer penalties on questions they should have skipped.
Adaptation Under Drift
In the drift experiment, one tool’s reliability degrades mid-task. An agent that always trusts the degraded tool sees its score collapse by 69 points. The Bayesian agent’s forgetting mechanism (\(\lambda = 0.95\)) exponentially discounts old observations:
\[\alpha_{\text{new}} = 0.95 \cdot \alpha_{\text{old}} + \Delta\alpha\]
Within a few questions of the degradation, the posterior shifts, the VOI calculation redirects queries to more reliable alternatives, and the score barely dips. No change-detection heuristic, no special-case logic. Just Bayes’ rule with a discount factor.
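The effect of the discount is easy to verify in isolation. A toy sketch — `posterior_mean` is illustrative, not the benchmark's update function — where a tool answers correctly 30 times, then wrongly 30 times:

```python
def posterior_mean(forgetting: float, outcomes: list[bool]) -> float:
    """Beta posterior mean after a sequence of correct/wrong observations."""
    alpha, beta = 1.0, 1.0  # uniform prior
    for correct in outcomes:
        alpha *= forgetting  # discount old evidence
        beta *= forgetting
        if correct:
            alpha += 1.0
        else:
            beta += 1.0
    return alpha / (alpha + beta)

history = [True] * 30 + [False] * 30  # the tool degrades halfway through
print(posterior_mean(1.00, history))  # 0.5: blind average over the whole run
print(posterior_mean(0.95, history))  # ~0.18: tracks the degraded regime
```

Without forgetting, the estimate is stuck at the lifetime average; with \(\lambda = 0.95\), the effective memory is roughly the last \(\frac{1}{1-\lambda} = 20\) observations, so the posterior converges on the new regime within a handful of questions.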
What I Still Don’t Know
The benchmark has 50 questions, 4 tools, and 5 categories. The VOI calculation enumerates all possible responses for each tool — four candidates per tool, tractable here. In a production system with hundreds of endpoints, ambiguous objectives, and continuous rather than discrete responses, this enumeration explodes. Whether you can approximate the VOI cheaply enough to preserve the advantage is an open question.
There’s also the question of composability. The Bayesian agent treats each tool query as independent. But real tools have correlated errors — two endpoints hitting the same upstream database will fail together. Modelling those correlations requires moving from independent Betas to something like a Dirichlet or a full joint distribution, and the computational cost scales accordingly.
And perhaps the most interesting question: whether principled decision theory and the flexibility of language models are complementary rather than competing. A hybrid that uses Bayesian VOI to decide whether to call a tool, and an LLM to interpret the response, might combine the strengths of both. I haven’t built that yet, but the interfaces are clean enough that it wouldn’t be hard to try.
Code and full results: github.com/gfrmin/credence