A new paper - The Leaderboard Illusion, made a case for why LMArena's AI Leaderboard evaluation methods are flawed, suggesting companies can game the system. Better Evaluation Methods are needed
Misweighted Signals = Misaligned Models: How Sycophancy Emerged from Feedback Loops in GPT-4o Update
What Went Wrong?
1. Thumbs↑↓Feedback ≠ Granular Insight
ThumbsUp/Down → Binary Signal ≠ Why/Context → Misleading Reward
2. Short-Term Praise > Long-Term Alignment
Short-TermUserApproval + RLHF Bias → Sycophancy ↑
Long-TermObjective - Weight in Training → Alignment ↓
Basically, a weighting factor imbalance. It looks like the training process underweighted or gave less importance to long-term alignment goals (e.g. honesty, critical reasoning), while overweighting short-term user approval (like thumbs-up). This caused the model to prioritize sounding agreeable (sycophantic), even at the expense of truth or utility.
3. Subjective Feedback + No Ground Truth = Evaluation Drift
LLM Benchmark List by Lisan Al Gaib on Xai
https://x.com/scaling01/status/1919092778648408363?t=75--y6mfBC-S0hCd9dpaNQ&s=19
Misweighted Signals = Misaligned Models: How Sycophancy Emerged from Feedback Loops in GPT-4o Update
What Went Wrong?
1. Thumbs↑↓Feedback ≠ Granular Insight
ThumbsUp/Down → Binary Signal ≠ Why/Context → Misleading Reward
2. Short-Term Praise > Long-Term Alignment
Short-TermUserApproval + RLHF Bias → Sycophancy ↑
Long-TermObjective - Weight in Training → Alignment ↓
Basically, a weighting factor imbalance. It looks like the training process underweighted or gave less importance to long-term alignment goals (e.g. honesty, critical reasoning), while overweighting short-term user approval (like thumbs-up). This caused the model to prioritize sounding agreeable (sycophantic), even at the expense of truth or utility.
3. Subjective Feedback + No Ground Truth = Evaluation Drift
HumanPreferenceSignals + Subjectivity → EvaluationNoise
EvaluationNoise → Misaligned Updates
Evals are subjective! “Measure what matters”, depends on who is looking at what. https://interestingengineering.substack.com/p/a-quick-dive-into-ai-leaderboard
4. One-Size Model ≠ Millions of User Preferences
SingleDefaultModel ≠ DiverseUserNeeds → User Frustration ↑
5. Incomplete Evals + Overfitted Feedback Signals = Unexpected Behavior
EvalCoverage < Real-World Complexity
Overweight(ThumbsUp) → Agreeableness ↑ even if Wrong
Complexity is generally difficult to “formularize”.
Read Nathan's article👇with https://openai.com/index/expanding-on-sycophancy/
Great work
Oooo thank you kindly🙏