// ML Researcher · Architect of Language Models
I build ML models of language — and increasingly, I evaluate whether today's LLMs actually understand it. My current work develops novel calibration metrics (ESR, CDS) to benchmark frontier models (GPT, Claude, Gemini) on fine-grained social and pragmatic tasks.
My background spans Computer Science, Media Studies, and Computational Linguistics (BSc–MSc–PhD). I bring 15+ years of rigorous modeling — game theory, multi-agent RL, probabilistic NLP — to the question that matters most in applied AI: does the model actually get what humans mean?
I'm actively seeking roles in NLP research, LLM evaluation, or applied language AI where linguistic depth meets engineering ambition.
Active research at the intersection of NLP, LLM evaluation, and computational pragmatics.
Developing ESR and CDS metrics to evaluate GPT, Claude, and Gemini on politeness, precision, and register · published in the CMCL 2026 proceedings · code & data
Follow-up to the CMCL 2026 paper · two-task paradigm probing process-level alignment in LLM social judgments across 8 chat-format models · cross-model calibration differences trace to the quality of the motive-to-judgment mapping, not to surface-level motive attribution
Bayesian/RSA models predicting human judgments of politeness and (im)precision with >90% accuracy · benchmarking against transformer-based baselines