LLM Evaluation · NLP · Pragmatics · Computational Linguistics

Roland Mühlenbernd

// ML Researcher · Architect of Language Models

I build ML models of language — and increasingly, I evaluate whether today's LLMs actually understand it. My current work develops novel calibration metrics (ESR, CDS) to benchmark frontier models (GPT, Claude, Gemini) on fine-grained social and pragmatic tasks.

My background spans Computer Science, Media Studies, and Computational Linguistics (BSc–MSc–PhD). I bring 15+ years of rigorous modeling — game theory, multi-agent RL, probabilistic NLP — to the question that matters most in applied AI: does the model actually get what humans mean?

I'm actively seeking roles in NLP research, LLM evaluation, or applied language AI where linguistic depth meets engineering ambition.

🏅 ERC Seal of Excellence · 💶 €75K Research Grant (NAWA) · 🎤 Invited Speaker, Stanford
40+ Publications
50+ Presentations
15+ Years Research
25+ Courses Taught
researcher_profile.py
# Roland Mühlenbernd
focus = [
    "LLM evaluation & calibration",
    "pragmatics & social meaning",
    "language dynamics",
]
stack = {
    "ml": ["PyTorch", "HuggingFace"],
    "stats": ["Python", "R"],
}
publications = 40  # and counting
Skills & Domain

Tech and Theory

Technical Stack
  • PyTorch · TensorFlow · scikit-learn
  • HuggingFace Transformers · fine-tuning
  • LLM evaluation · prompt engineering
  • Python (expert) · R · Julia · C++
  • spaCy · NLTK · Pandas · NumPy
  • Git · JupyterLab · Google Colab
  • Statistical modeling · Bayesian inference
Linguistic Expertise
  • Pragmatics · politeness theory · register
  • Language evolution & historical change
  • Sociolinguistics · network variation
  • Formal semantics · game-theoretic pragmatics
  • Experimental linguistics · behavioral data
  • Morphology · grammaticalization
  • Computational sociolinguistics
Current Work

What I'm Working On

Active research at the intersection of NLP, LLM evaluation, and computational pragmatics.

● Active

LLM Calibration Metrics for Social Meaning Tasks

Developing ESR and CDS metrics to evaluate GPT, Claude, and Gemini on politeness, precision, and register · published in CMCL 2026 proceedings · code & data

✓ Published
● Active

Motive-Mediated Social Inference in LLMs

Follow-up to the CMCL 2026 paper · two-task paradigm probing process-level alignment in LLM social judgments across 8 chat-format models · cross-model calibration differences trace to the quality of the motive-to-judgment mapping, not to surface-level motive attribution

Under review · EMNLP
● Active

Probabilistic Speaker Models for Pragmatic NLP

Bayesian/RSA models predicting human judgments of politeness and (im)precision with >90% accuracy · benchmarking against transformer-based baselines

In progress
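
For readers unfamiliar with RSA, here is a minimal sketch of the standard Rational Speech Acts recursion (literal listener, pragmatic speaker, pragmatic listener) on a toy (im)precision example. The states, utterances, uniform prior, zero utterance costs, and the `alpha` parameter are illustrative assumptions, not the models or data from this project.

```python
# Toy RSA sketch. All names and numbers below are hypothetical illustrations.
STATES = ["exactly_100", "approximately_100"]
UTTERANCES = ["exactly a hundred", "about a hundred"]

# LEXICON[u][s] is 1.0 when utterance u is literally true in state s.
LEXICON = {
    "exactly a hundred": {"exactly_100": 1.0, "approximately_100": 0.0},
    "about a hundred": {"exactly_100": 1.0, "approximately_100": 1.0},
}

# Assumed uniform prior over states.
PRIOR = {"exactly_100": 0.5, "approximately_100": 0.5}


def normalize(weights):
    """Scale non-negative weights so they sum to 1 (all zeros stay zeros)."""
    total = sum(weights.values())
    if total == 0:
        return {key: 0.0 for key in weights}
    return {key: value / total for key, value in weights.items()}


def literal_listener(utterance):
    """L0(s | u) is proportional to truth(u, s) * P(s)."""
    return normalize({s: LEXICON[utterance][s] * PRIOR[s] for s in STATES})


def pragmatic_speaker(state, alpha=1.0):
    """S1(u | s) is proportional to L0(s | u) ** alpha (costs assumed zero)."""
    return normalize(
        {u: literal_listener(u)[state] ** alpha for u in UTTERANCES}
    )


def pragmatic_listener(utterance, alpha=1.0):
    """L1(s | u) is proportional to S1(u | s) * P(s)."""
    return normalize(
        {s: pragmatic_speaker(s, alpha)[utterance] * PRIOR[s] for s in STATES}
    )


if __name__ == "__main__":
    for u in UTTERANCES:
        print(u, pragmatic_listener(u))
```

In this toy setup the pragmatic listener, on hearing "about a hundred", shifts probability toward the approximate state, because a speaker who meant the exact value had a more informative alternative available.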
Recent writing →
May 29, 2026
Do LLMs Judge People the Way We Do?
GPT, Claude, and Gemini form social impressions of speakers very differently — and standard metrics overlook how much.