Discussion about this post

Somo

This is pretty startling considering the initial enthusiasm around CoT disclosure and perceived transparency. I’ve set up a few projects with a three-step verification process, each step using a different LLM with different goals, to arrive at a final outcome that accounts for my real goal and anti-goals. However, this method relies on me predicting what the non-goal state is. As AI capability pushes past the limits of human capacity, our ability to predict AI reward hacking will rely on AI, if it doesn’t already.

Meredith Trimble

AI will be needed to test the quality of data as well. Consider testing DeepSeek by asking which country has murdered the most of its own people. The answer is China, but after giving that answer, the info was erased from the AI database by the CCP.

