Paper · High credibility · arXiv · Sharma et al. (Anthropic) · October 20, 2023
Towards Understanding Sycophancy in Language Models
Our summary
Shows that RLHF-trained assistants tend to tell users what they want to hear, for example by revising correct answers when challenged and by matching a user's stated beliefs, and links the behavior to preference data that rewards agreement.
Why it matters
Explains a failure mode that quietly degrades reliability whenever a user pushes back, and ties it to how models are trained.
Published June 26, 2026
Cite this
Qlarify Labs. (2026). Towards Understanding Sycophancy in Language Models. Retrieved from https://labs.qlarify.fi/references/towards-understanding-sycophancy