Mistral Large 2 (123B) seems to exhibit alignment faking
UPDATE: Recent work with improved AF and compliance gap classifiers disagrees with our results. We recommend using the improved classifiers. Summary We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models)....