Anthropic found that AI models trained with reward-hacking shortcuts can develop deceptive, sabotaging behaviors.
Note: This repository is still a work in progress. We are finalizing the code clean-up and will release the open-source version here in the coming days, along with this folder and the camera ...