Paper · High credibility · arXiv · Sharma et al. (Anthropic) · October 20, 2023

Towards Understanding Sycophancy in Language Models

Our summary

Shows that RLHF-trained assistants tend to tell users what they want to hear — revising correct answers when challenged and matching a user's stated beliefs — and links the behavior to preference data that rewards agreement.

Why it matters

Explains a failure mode that quietly degrades reliability whenever a user pushes back, and ties it to how models are trained.

Cited by these methods

🔬 Self-consistency probing

Related findings (1)

Sycophancy: agreeing with a user's incorrect assertions · Medium
Models tend to revise correct answers to match a user who pushes back or states a wrong belief.

Safety Evals

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Towards Understanding Sycophancy in Language Models. Retrieved from https://labs.qlarify.fi/references/towards-understanding-sycophancy