Continuing the experiment from August, let's try another open thread for AI Alignment discussion. The goal is to be a place where researchers and up-and-coming researchers can ask small questions they are confused about, share early-stage ideas, and have lower-key discussions.
I think you get distributional shift in the mental configurations you have access to when you run for more steps. This means that, for the ML to line up with the ground truth, you either need training data from those regions of configuration-space, or you need well-characterized dynamics that you could correctly identify by training on 100,000 steps.
Arithmetic has these well-characterized dynamics, for example: if you have the right architecture and train on small multiplication problems, you can also perform well on big multiplication problems, because the underlying steps are the same, just repeated more times. This isn't true of piecewise-linear approximations to complicated functions, where your approximation is only good in regions where you had lots of training data. (Imagine trying to fit x^3 with a random forest.) If there are different 'modes of thought' that humans can employ, you need either complete coverage of those modes of thought or 'functional coverage', meaning that the response to any strange new mental configuration you enter can be easily predicted from the normal mental configurations you saw in training.
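For concreteness, here's what the x^3 / random forest parenthetical looks like in code (a minimal sketch, assuming scikit-learn; the training range and hyperparameters are arbitrary choices for illustration): the forest does fine inside the interval it was trained on, but its predictions saturate as soon as you step outside it, because they are piecewise-constant in the input.

```python
# Sketch: a random forest fit to x^3 on a narrow range flattens out
# (rather than continuing the cubic) once you leave the training region.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Training data only covers x in [-2, 2].
x_train = rng.uniform(-2, 2, size=(2000, 1))
y_train = x_train[:, 0] ** 3

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(x_train, y_train)

# Inside the training region the fit is fine; outside, it saturates
# near the boundary value instead of tracking x^3.
for x in [1.5, 2.0, 3.0, 5.0]:
    pred = model.predict(np.array([[x]]))[0]
    print(f"x={x:>4}: true={x**3:>8.1f}  forest={pred:>8.1f}")
```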
Like, consider moving in the opposite direction: if I train a model on a single step from questions, then I probably just have a model that's able to 'read' questions (or even just the starts of questions). Once I want to extend this to doing 100,000 steps, the model needs not just to read inputs but also to do something interesting with them, which probably requires not just 'more' training data from the same distribution, but data from a different, more general distribution.
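Here's a toy version of that worry in code (the dynamics, the quadratic one-step model, and the step counts are all invented for illustration, not anyone's proposed setup): a one-step model that looks essentially perfect on short-run training data still ends up somewhere different from the ground truth after a long rollout, because the rollout spends most of its time in states the training data never covered.

```python
# Sketch: train a one-step model on short runs, then roll it out for much longer.
import numpy as np

def true_step(x):
    # "Ground truth" dynamics; the trajectory creeps up toward a fixed point at pi/2.
    return x + 0.1 * np.cos(x)

# Training data: many short (10-step) runs starting near 0, so all
# one-step transitions have states in roughly [0, 0.9].
states = []
for start in np.linspace(0.0, 0.1, 50):
    x = start
    for _ in range(10):
        states.append(x)
        x = true_step(x)
x_train = np.array(states)
y_train = true_step(x_train)

# One-step model: a quadratic fit, very accurate inside the training region.
coeffs = np.polyfit(x_train, y_train, deg=2)

def learned_step(x):
    return np.polyval(coeffs, x)

print("max one-step error on training states:",
      np.max(np.abs(learned_step(x_train) - y_train)))

# Roll both out for 1,000 steps from the same start.
x_true, x_model = 0.0, 0.0
for _ in range(1000):
    x_true, x_model = true_step(x_true), learned_step(x_model)

print("training states covered up to x ~", round(x_train.max(), 2))
print("true state after 1,000 steps:   ", round(x_true, 3))   # converges to ~pi/2
print("model state after 1,000 steps:  ", round(x_model, 3))  # settles somewhere else
```

The one-step fit error is tiny, but the long rollout leaves the training region almost immediately, and out there the fitted quadratic has different dynamics than the true map, so the two trajectories settle at different places.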
Hence the underlying empirical uncertainty that this question sort of asks us to condition on: is there a meaningful difference between what happens in human brains / models trained this way in the first 10 minutes of thought and in the first 1,000 days of thought?