Calibrating indifference - a small AI safety idea
TLDR: AI-produced content can reflect random patterns in the data. Because such patterns may not surface easily, RLHF can accidentally reinforce them. Newer fixes (ties in DPO or in the reward model via BTT, uncertainty-aware RLHF) help us avoid amplifying these quirks but don't eliminate them. My proposal: explicitly calibrate...
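To make the "ties in the reward model via BTT" part of the TLDR concrete, here is a minimal sketch, assuming BTT refers to a Bradley-Terry-with-ties formulation in the style of Rao-Kupper; the function name, the `theta` parameter, and the example rewards are illustrative assumptions, not the post's actual method.

```python
import math

def btt_probabilities(r_a: float, r_b: float, theta: float = 1.5):
    """Bradley-Terry-with-ties (Rao-Kupper-style) over two reward-model scores.

    r_a, r_b: scalar rewards for responses A and B.
    theta >= 1 widens the "tie" band; theta = 1 recovers plain
    Bradley-Terry with zero tie probability.
    Returns (P[A preferred], P[B preferred], P[tie]).
    """
    pi_a, pi_b = math.exp(r_a), math.exp(r_b)
    p_a = pi_a / (pi_a + theta * pi_b)   # A preferred
    p_b = pi_b / (pi_b + theta * pi_a)   # B preferred
    p_tie = 1.0 - p_a - p_b              # remaining mass goes to the tie
    return p_a, p_b, p_tie

# Nearly equal rewards put noticeable mass on the tie outcome, so a
# "no real preference" label stops pushing the policy either way.
print(btt_probabilities(0.05, 0.0))  # ~ (0.41, 0.39, 0.20) with theta = 1.5
```

Under this reading, a tie label contributes its own likelihood term instead of forcing the model to pick a winner, which is one way random quirks in near-equivalent responses can avoid being amplified.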