All of SimonBiggs's Comments + Replies

I'd be keen for others' thoughts on a "Socratic tale" of one particular way in which CIRL might be a helpful component of the alignment story.


Let's say we make leaps and bounds in mechanistic interpretability research, to the point where we have identified a primary-objective-style mesa optimiser within the transformer network. But when we look into its internalised loss function, we see that it is less than ideal.

But given that, in this make-believe future, we have built up sufficient mechanistic interpretability understanding, we now have a way that ...

This reminds me of the problems that STPA is trying to solve in safe systems design:
https://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf

And, for those who prefer video, here's a good video intro to STPA:


The approach is designed to handle complex systems by decomposing them into parts. The parts are not functions or tasks, though: the system is instead decomposed into a control structure.
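To make "control structure" a bit more concrete, here is a minimal sketch of how one might represent it as a graph, with control actions flowing down and feedback flowing back up. The class, the method names, and the node names (`operator`, `automation`, `plant`) are all hypothetical illustrations rather than anything taken from the STPA handbook, and the "open loop" check is just one simple question one might ask of such a graph.

```python
from dataclasses import dataclass, field


@dataclass
class ControlStructure:
    """Toy control structure: controllers linked by control actions and feedback."""

    # (controller, controlled process) -> label of the control action
    control_actions: dict = field(default_factory=dict)
    # (controlled process, controller) -> label of the feedback signal
    feedback: dict = field(default_factory=dict)

    def add_control_action(self, controller, process, action):
        self.control_actions[(controller, process)] = action

    def add_feedback(self, process, controller, signal):
        self.feedback[(process, controller)] = signal

    def open_loops(self):
        """Return (controller, process) pairs that have a control action
        but no feedback path from the process back to the controller."""
        return sorted(
            (controller, process)
            for (controller, process) in self.control_actions
            if (process, controller) not in self.feedback
        )


# Hypothetical three-level example.
structure = ControlStructure()
structure.add_control_action("operator", "automation", "set target")
structure.add_control_action("automation", "plant", "actuate")
structure.add_feedback("plant", "automation", "sensor reading")

# The operator commands the automation but receives nothing back from it.
print(structure.open_loops())
```

The point of the sketch is only that the unit of analysis is the controller-to-process relationship, not a function or a task.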

They approach this problem by treating a system as a graph of controllers (internal mesa optimisers which are pot...