Imagine you're telling someone how much your bike cost. To a friend over coffee, you'd probably say "around $500." To your insurance agent: "$489.99." The choice carries social meaning. Say "$489.99" to your friend and you sound pedantic. Tell your insurance agent "around $500" and you sound less competent. These shifts in social judgment are small but reliable across people, and in human experiments they show up consistently across six attributes.
But what happens if we ask three frontier LLMs to make the same judgments? The answer: GPT lands closest to the human pattern, though it tends to under-react. Claude is split: it overreacts to what speakers say but under-reacts to the social context. Gemini moves in the same direction as humans, but its shifts are about three times as large.
Why is this surprising? Most LLM evaluations ask whether the model got the right answer — did it translate the sentence, summarize the document, classify the sentiment? Social meaning often doesn't work that way. There is no fact of the matter about how pedantic "$489.99" makes the speaker sound — only the range of impressions real listeners form, and how those impressions shift across contexts.
So Gemini isn't simply "wrong." It moves impressions in the same direction as humans do — toward pedantic for precise speakers, away from competent for imprecise ones. What it gets wrong is the magnitude. This is the difference between directional accuracy and calibration, and it's the kind of error standard metrics miss. A model can be perfectly aligned with humans in direction and still be wildly miscalibrated.
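To make the distinction concrete, here is a minimal Python sketch. The numbers are invented for illustration and are not results from the study; `human_shift` and `model_shift` are hypothetical names for the change in mean rating on each attribute when a speaker switches from an imprecise to a precise utterance.

```python
# Hypothetical rating shifts (precise minus imprecise utterance) per attribute.
# These values are made up for illustration; they are NOT data from the study.
human_shift = {
    "competent": 0.20,
    "helpful": 0.10,
    "likeable": -0.10,
    "well-prepared": 0.30,
    "knowledgeable": 0.20,
    "pedantic": 0.40,
}
# A model that triples every human shift: same direction, larger magnitude.
model_shift = {attr: 3 * value for attr, value in human_shift.items()}


def directional_accuracy(human, model):
    """Share of attributes where the model's shift has the same sign as the human shift."""
    same_sign = [human[attr] * model[attr] > 0 for attr in human]
    return sum(same_sign) / len(same_sign)


def mean_abs_shift(shifts):
    """Average size of the shifts, ignoring direction."""
    return sum(abs(value) for value in shifts.values()) / len(shifts)


print(directional_accuracy(human_shift, model_shift))  # 1.0: every direction matches
print(mean_abs_shift(model_shift) / mean_abs_shift(human_shift))  # roughly 3
```

A direction-only score would call this toy model perfect, even though every one of its shifts is about three times too large.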
Here's what we did. We took scenarios like the bike example and varied two things: how precisely a speaker stated a number, and the social context (high vs. low need for precision: insurance agent vs. friend). Human participants rated each speaker, based on what they said, on six attributes: competent, helpful, likeable, well-prepared, knowledgeable, and pedantic. We then asked GPT, Claude, and Gemini to do the same task across a range of prompting conditions, from bare-minimum instructions to prompts that explicitly invited pragmatic reasoning about potential speaker motives for being more or less precise.
To compare them, we used a measure called the Effect Size Ratio (ESR): it captures how much bigger or smaller a model's effect is than the corresponding human effect. An ESR of 1.0 means perfect calibration; above 1.0 means the model overreacts, below 1.0 means it under-reacts.
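As a rough sketch of how such a ratio can be computed: the snippet below assumes the ESR is simply the model's effect estimate divided by the human effect estimate for the same attribute. That reading is an assumption based on the name and the interpretation above; the paper's exact estimator may differ. The function names, the tolerance band, the "opposite direction" label, and the example numbers are all hypothetical.

```python
def effect_size_ratio(model_effect: float, human_effect: float) -> float:
    """Assumed definition: model effect divided by the matching human effect."""
    if human_effect == 0:
        raise ValueError("ESR is undefined when the human effect is zero")
    return model_effect / human_effect


def describe(esr: float, tolerance: float = 0.1) -> str:
    """Label an ESR relative to 1.0. The tolerance band is an arbitrary choice."""
    if esr < 0:
        # Assumption: a negative ratio means the model moved the opposite way.
        return "opposite direction"
    if abs(esr - 1.0) <= tolerance:
        return "roughly calibrated"
    return "overreacts" if esr > 1.0 else "under-reacts"


# Invented example: humans shift a rating by 0.5, three hypothetical models differ.
for model_effect in (0.25, 0.5, 1.5):
    esr = effect_size_ratio(model_effect, human_effect=0.5)
    print(f"ESR = {esr:.1f}: {describe(esr)}")
```

With these made-up inputs the three lines read "under-reacts", "roughly calibrated", and "overreacts", matching the interpretation of values below, at, and above 1.0.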
The ESR heatmap separates two kinds of shift. The top rows track main effects: how much the choice of a precise vs. imprecise utterance moves impressions on average. The bottom rows track interactions: how much that shift itself depends on the social context.
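For readers who want the arithmetic, here is a sketch of how the two quantities can be read off a 2x2 table of mean ratings. It uses simple differences of cell means; the study's actual statistical model is not described here and may well differ. The cell values and all names are hypothetical.

```python
# Hypothetical mean "pedantic" ratings in a 2x2 design.
# Keys are (utterance, context). Values are invented for illustration.
mean_rating = {
    ("precise", "high_need"): 3.0,    # "$489.99" to the insurance agent
    ("imprecise", "high_need"): 2.5,  # "around $500" to the insurance agent
    ("precise", "low_need"): 4.5,     # "$489.99" to a friend
    ("imprecise", "low_need"): 2.0,   # "around $500" to a friend
}


def precision_shift(ratings, context):
    """Precise-minus-imprecise difference within one social context."""
    return ratings[("precise", context)] - ratings[("imprecise", context)]


def main_effect(ratings):
    """Average precision shift across the two contexts."""
    high = precision_shift(ratings, "high_need")
    low = precision_shift(ratings, "low_need")
    return (high + low) / 2


def interaction(ratings):
    """How much the precision shift changes between contexts (low need minus high need)."""
    return precision_shift(ratings, "low_need") - precision_shift(ratings, "high_need")


print(main_effect(mean_rating))   # 1.5
print(interaction(mean_rating))   # 2.0
```

In this toy table, being precise raises the "pedantic" rating in both contexts (the main effect), but far more with a friend than with the insurance agent (the interaction).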
GPT under-reacts on both. Its ESRs sit below 1.0 in most cells, and the shortfall is sharpest on interactions, where some values drop close to 0.1. GPT moves in the right direction, just by too little.
Claude shows a split pattern: it amplifies main effects (ESRs frequently above 2.0, peaking above 4.0) but under-reacts on interactions (mostly below 1.0). Claude exaggerates how much what you say matters, and underestimates how much who you're saying it to matters.
Gemini amplifies everything. Its ESRs sit consistently above 2.0 in both main effects and interactions, with many cells above 3.0 — the source of the "three times" headline from the opening.
Why these different profiles? The study shows that they differ; it doesn't show why.
Why does this matter? For LLM evaluation, directional accuracy can hide major magnitude errors. A model can agree with humans about whether an utterance sounds more pedantic and still differ by a factor of three on how much. Standard metrics won't catch that.
For applied NLP, the implication is concrete. If you're using LLMs for anything socially loaded — resume screening, writing assistants, register-aware translation — the model's intuitions may be amplified or muted relative to humans, and your accuracy scores won't show it.
Can prompting fix this? Not really. The prompting condition affects all three models strongly, but the direction isn't predictable: some prompts bring a model closer to human calibration; others push it further away. The latter case can be dramatic: Claude and Gemini both produce ESRs above 4 under certain prompting conditions. Prompting is a lever, but pulling it can shift calibration in either direction.
The paper appears in the CMCL 2026 proceedings (arXiv:2604.02512); the code, data, and replication notebook are at github.com/muehlenbernd/llm-social-calibration. A recorded talk on this work, given at Leibniz MMS Days 2026, is available on the TIB AV-Portal (DOI: 10.5446/72794).
A follow-up study, currently under review for EMNLP, takes the next step: instead of asking how LLMs rate speakers, it asks why — probing the pragmatic reasoning behind the judgments, not just the ratings.