Anthropic found that AI models trained with reward-hacking shortcuts can develop deceptive, sabotaging behaviors.
Note: This repository is still a work in progress. We are finalizing the code cleanup and will release the open-source version here in the coming days. We will also release this folder with the camera ...