This is a response to Matt's earlier post. If you see "a large mixture of alignment proxies" when you look at a standard loss function, my post might save you from drawing silly conclusions from his. If you parse the world into non-overlapping magisteria of "validation losses" and "training losses", then you should skim this, but I didn't write it for you.
TLDR:
Proxies for alignment are numerous and varied, with many already existing within your training objective (very few things have exactly zero correlation with alignment).
Different proxies serve different purposes: some are better held in reserve, some are better used in training, and some are worth anti-training on. What you don't...
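Since the full post is cut off above, this is only a guess at the mechanics, but the TLDR's three roles map naturally onto signed loss weights. Below is a minimal sketch, not the author's code: every proxy name, weight value, and the toy model are hypothetical. It shows a combined objective where one proxy is trained on, one is anti-trained on via a negative weight, and one is held in reserve as a pure validation signal.

```python
import torch

torch.manual_seed(0)

# Toy model and data standing in for a real training setup.
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

def task_loss(m):
    # The base objective you were going to train on anyway.
    return torch.nn.functional.mse_loss(m(x), y)

# Hypothetical differentiable proxy scores; each is assumed to correlate
# (imperfectly) with some alignment-relevant property.
proxies = {
    "proxy_a": lambda m: m(x).mean(),       # trained on
    "proxy_b": lambda m: m(x).var(),        # anti-trained on
    "proxy_c": lambda m: m.weight.norm(),   # held in reserve
}

# Signed weights: positive = minimized alongside the task loss,
# negative = pushed the other way, absent = excluded from the gradient.
weights = {"proxy_a": 0.1, "proxy_b": -0.1}

total = task_loss(model)
reserved = {}
for name, proxy in proxies.items():
    value = proxy(model)
    w = weights.get(name, 0.0)
    if w != 0.0:
        total = total + w * value       # enters the training objective
    else:
        reserved[name] = value.item()   # measured but never optimized against

total.backward()  # gradients flow only through the weighted proxies
print(f"total objective: {total.item():.3f}, reserved: {reserved}")
```

The design point in this reading is that "held in reserve" means excluded from the gradient entirely, not merely down-weighted: a reserved proxy stays informative as a validation signal precisely because nothing is Goodharting against it.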
I am also very curious about why people can be so smart and nevertheless work on the "wrong" thing. Perhaps reflectivity is also a source of WIS. From my perspective, very smart people who work on the "wrong" thing simply don't realize that they can apply their big brain to figuring out what to do, and perhaps this is easier to realize when you see your brain's functioning as a mechanical thing, somewhat akin to the things you study at the object level.
Similarly, when I try to empathize with supergeniuses who work on some pointless problem as their world is speedrunning the apocalypse, I have trouble visualizing it. I think perhaps WIS is also the ability to have somewhat obvious things occur to you at the right time, and to maintain a unified view of yourself over time.