Instrumentality exists on the simulacra level, not the simulator level. This would suggest that corrigibility could be maintained by establishing a corrigible character in context. Not clear on the practical implications.
That one, yup. The moment you start conditioning (through prompting, fine tuning, or otherwise) the predictor into narrower spaces of action, you can induce predictions corresponding to longer term goals and instrumental behavior. Effective longer-term planning requires greater capability, so one should expect this kind of thing to be more apparent as models get stronger even as the base models can be correctly claimed to have 'zero' instrumentality.
In other words, the claims about simulators here are quite narrow. It's pretty easy to end up thinking that this is useless if the apparent-nice-property gets deleted the moment you use the thing, but I'd argue that this is actually still a really good foundation. A longer version was the goal agnosticism FAQ, and there's this RL comment poking at some adjacent and relevant intuitions, but I haven't written up how all the pieces come together. A short version would be that I'm pretty optimistic at the moment about what path to capabilities greedy incentives are going to push us down, and I strongly suspect that the scariest possible architectures/techniques are actually repulsive to the optimizer-that-the-AI-industry-is.
There are lots of little things when it's not at a completely untenable level. Stuff like:
When it progresses, it's hard to miss:
Hey, we met at EAGxToronto : )
🙋♂️
So my model of progress has allowed me to observe our prosaic scaling without surprise, but it doesn't allow me to make good predictions since the reason for my lack of surprise has been from Vingean prediction of the form "I don't know what progress will look like and neither do you".
This is indeed a locally valid way to escape one form of the claim—without any particular prediction carrying extra weight, and the fact that reality has to go some way, there isn't much surprise in finding yourself in any given world.
I do think there's value in another version of the word "surprise," here, though. For example: the cross-entropy loss between the predicted distribution with respect to the observed distribution. Holding to a high uncertainty model of progress will result in continuously high "surprise" in this sense, because it struggles to narrow to a better distribution generator. It's a sort of overdamped epistemological process.
I think we have enough information to make decent gearsy models of progress around AI. As a bit of evidence, some such models have already been exploited to make gobs of money. I'm also feeling pretty good[1] about many of my predictions (like this post) that contributed to me pivoting entirely into AI; there's an underlying model that has a bunch of falsifiable consequences which has so far survived a number of iterations, and that model has implications through the development of extreme capability.
What I have been surprised about has been governmental reaction to AI...
Yup! That was a pretty major (and mostly positive) update for me. I didn't have a strong model of government-level action in the space and I defaulted into something pretty pessimistic. My policy/governance model is still lacking the kind of nuance that you only get by being in the relevant rooms, but I've tried to update here as well. That's also part of the reason why I'm doing what I'm doing now.
In any case, I've been hoping for the last few years I would have time to do my undergrad and start working on the alignment without a misaligned AI going RSI, and I'm still hoping for that. So that's lucky I guess. 🍀🐛
May you have the time to solve everything!
... epistemically
I've got a fun suite of weird stuff going on[1], so here's a list of sometimes-very-N=1 data:
I'm probably forgetting some stuff.
"Idiopathic hypersomnia"-with-a-shrug was the sleep doctor's best guess on the sleep side, plus a weirdo kind of hypothyroidism, plus HEDs, plus something strange going on with blood sugar regulation, plus some other miscellaneous and probably autoimmune related nonsense.
I tend to take 300 mcg about 2-5 hours before my target bedtime to help with entrainment, then another 300-600 mcg closer to bedtime for the sleepiness promoting effect.
In the form of luminette glasses. I wouldn't say they have a great user experience; it's easy to get a headache and the nose doohicky broke almost immediately. That's part of why I didn't keep using them, but I may try again.
But far from sole!
While still being tired enough during the day to hallucinate on occasion.
Implementing this and maintaining sleep consistency functionally requires other interventions. Without melatonin etc., my schedule free-runs mercilessly.
Given that I was doing this independently, I can't guarantee that Proper Doctor-Supervised CPAP Usage wouldn't do something, but I doubt it. I also monitored myself overnight with a camera. I do a lot of acrobatics, but there was no sign of apneas or otherwise distressed breathing.
When I was younger, I would frequently ask my parents to drop the thermostat down at night because we lived in one of those climates where the air can kill you if you go outside at the wrong time for too long. They were willing to go down to around 73F at night. My room was east-facing, theirs was west-facing. Unbeknownst to me, there was also a gap between the floor and wall that opened directly into the attic. That space was also uninsulated. Great times.
It wasn't leaking!
The cooling is most noticeable at pressure points, so there's a very uneven effect. Parts of your body can feel uncomfortably cold while you're still sweating from the air temperature and humidity.
The "hmm my heart really isn't working right" issues were bad, but it also included some spooky brain-hijacky mental effects. Genuinely not sure I would have survived six months on it even with total awareness that it was entirely caused by the medication and would stop if I stopped taking it. I had spent some years severely depressed when I was younger, but this was the first time I viscerally understood how a person might opt out... despite being perfectly fine 48 hours earlier.
I'd say it dropped a little in efficacy in the first week or two, maybe, but not by much, and then leveled out. Does the juggling contribute to this efficacy? No idea. Caffeine and ritalin both have dopaminergic effects, so there's probably a little mutual tolerance on that mechanism, but they do have some differences.
Effect is still subtle, but creatine is one of the only supplements that has strong evidence that it does anything.
Beyond the usual health/aesthetic reasons for exercising, I also have to compensate for joint loosey-gooseyness related to doctor-suspected HEDs. Even now, I can easily pull my shoulders out of socket, and last week discovered that (with the help of some post-covid-related joint inflammation), my knees still do the thing where they slip out of alignment mid-step and when I put weight back on them, various bits of soft tissues get crushed. Much better than it used to be; when I was ~18, there were many days where walking was uncomfortable or actively painful due to a combination of ankle, knee, hip, and back pain.
Interesting note: my first ~8 years of exercise before starting cytomel, including deliberate training for the deadlift, saw me plateau at a 1 rep max on deadlift of... around 155 pounds. (I'm a bit-under-6'4" male. This is very low, like "are you sure you're even exercising" low. I was, in fact, exercising, and sometimes at an excessive level of intensity. I blacked out mid-rep once; do not recommend.)
Upon starting cytomel, my strength increased by around 30% within 3 months. Each subsequent dosage increase was followed by similar strength increases. Cytomel is not an anabolic steroid and does not have anabolic effects in healthy individuals.
I'm still no professional powerlifter, but I'm now at least above average within the actively-lifting population of my size. The fact that I "wasted" so many years of exercise was... annoying.
Going too long without food or doing a little too much exercise is a good way for me to enter a mild grinding hypoglycemic state. More severely, when I went a little too far with intense exercise, I ended up on the floor unable to move while barely holding onto consciousness.
But I disagree that there’s no possible RL system in between those extremes where you can have it both ways.
I don't disagree. For clarity, I would make these claims, and I do not think they are in tension:
It does still apply, though what 'it' is here is a bit subtle. To be clear, I am not claiming that a technique that is reasonably describable as RL can't reach extreme capability in an open-ended environment.
The precondition I included is important:
in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function
In my frame, the potential future techniques you mention are forms of optimizer guidance. Again, that doesn't make them "fake RL," I just mean that they are not doing a truly unconstrained search, and I assert that this matters a lot.
For example, take the earlier example of a hypercomputer that brute forces all bitstrings corresponding to policies and evaluates them to find the optimum with no further guidance required. Compare the solution space for that system to something that incrementally explores in directions guided by e.g. strong future LLM, or something. The RL system guided by a strong future LLM might achieve superhuman capability in open-ended domains, but the solution space is still strongly shaped by the structure available to the optimizer during training and it is possible to make much better guesses about where the optimizer will go at various points in its training.
It's a spectrum. On one extreme, you have the universal-prior-like hypercomputer enumeration. On the other, stuff like supervised predictive training. In the middle, stuff like MuZero, but I argue MuZero (or its more open-ended future variants) is closer to the supervised side of things than the hypercomputer side of things in terms of how structured the optimizer's search is. The closer a training scheme is to the hypercomputer one in terms of a lack of optimizer guidance, the less likely it is that training will do anything at all in a finite amount of compute.
Calling MuZero RL makes sense. The scare quotes are not meant to imply that it's not "real" RL, but rather that the category of RL is broad enough that it belonging to it does not constrain expectation much in the relevant way. The thing that actually matters is how much the optimizer can roam in ways that are inconsistent with the design intent.
For example, MuZero can explore the superhuman play space during training, but it is guided by the structure of the game and how it is modeled. Because of that structure, we can be quite confident that the optimizer isn't going to wander down a path to general superintelligence with strong preferences about paperclips.
I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.
Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps.
I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]
But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.[3]
KL divergence penalties are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution.
You can also make a far more direct argument about model-level goal agnosticism in the context of prediction.
I don't think this is likely, to be clear. They're just both pretty low probability concerns (provided the optimization space is well-constrained).
"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.
Consider two opposite extremes:
Clearly, number 2 is going to be easier to train, but it also constrains the solution space for the policy.
If number 1 somehow successfully trained, what's the probability that the solution it found would look like number 2's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world?
If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment.
It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from.
RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples.
But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function. It simply won't sample the reward often enough to produce useful gradients.[1]
In other words, practical applications of RL are computationally bounded to a pretty limited degree of reward sparsity/distance. All the examples of "RL" doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.[2]
Given these limitations, the added implementation-uncertainty of RL is usually not so massive that it's worth entirely banning it. Do be careful about what you're actually reinforcing, just as you must be careful with prompts or anything else, and if you somehow figure out a way to make from-scratch sparse/distant rewards work better without a hypercomputer, uh, be careful?
The above implicitly assumes online RL, where the policy is able to learn from new data generated by the policy as it interacts with the environment.
Offline RL that learns from an immutable set of data does not allow the optimizer as much room to explore, and many of the apparent risks of RL are far less accessible.
The important thing is that the artifact produced by a given optimization process falls within some acceptable bounds. Those bounds might arise from the environment, computability, or something else, but they're often available.
RL-as-it-can-actually-be-applied isn't that special here. The one suggestion I'd have is to try to use it in a principled way. For example: doing pretraining but inserting an additional RL-derived gradient to incentivize particular behaviors works, but it's just arbitrarily shoving a bias/precondition into the training. The result will be at some equilibrium between the pretraining influence and the RL influence. Perhaps the weighting could be chosen in an intentional way, but most such approaches are just ad hoc.
For comparison, you could elicit similar behavior by including a condition metatoken in the prompt (see decision transformers for an example). With that structure, you can be more explicit about what exactly the condition token is supposed to represent, and you can do fancy interpretability techniques to see what the condition is actually causing mechanistically.[3]
If you could enumerate all possible policies with a hypercomputer and choose the one that performs the best on the specified reward function, that would train, and it would also cause infinite cosmic horror. If you have a hypercomputer, don't do that.
Or in the case of RLHF on LLMs, the fine-tuning process is effectively just etching a precondition into the predictor, not building complex new functions. Current LLMs, being approximators of probabilistic inference to start with, have lots of very accessible machinery for this kind of conditioning process.
There are other options here, but I find this implementation intuitive.
I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.