Paper · High credibility · arXiv · Sharma et al. (Anthropic) · October 20, 2023

Towards Understanding Sycophancy in Language Models

Our summary

Shows that RLHF-trained assistants tend to tell users what they want to hear — revising correct answers when challenged and matching a user's stated beliefs — and links the behavior to preference data that rewards agreement.

Why it matters

Explains a failure mode that quietly degrades reliability whenever a user pushes back, and ties it to how models are trained.

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Towards Understanding Sycophancy in Language Models. Retrieved from https://labs.qlarify.fi/references/towards-understanding-sycophancy