Rob Miles' YouTube channel has some good explanations about why alignment is hard.
We can already do RLHF, the alignment technique that made ChatGPT and derivatives well-behaved enough to be useful, but we don't expect this to scale to superintelligence. RLHF adjusts the model's weights based on human feedback, but that stops working once the actions (or plans) being judged are too complex for humans to evaluate.
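The core feedback loop can be illustrated with a toy sketch. This is not real RLHF (which trains a reward model and then runs PPO or similar on a language model); it's a minimal REINFORCE-style policy update over three hypothetical candidate responses, where a scalar "human preference" reward nudges the policy toward the preferred output:

```python
import numpy as np

# Toy "policy": a softmax distribution over 3 candidate responses.
logits = np.zeros(3)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical human feedback: response 2 is the one humans prefer.
rewards = np.array([0.0, 0.0, 1.0])

lr = 1.0
for _ in range(50):
    probs = softmax(logits)
    # Gradient of expected reward w.r.t. the logits:
    # d/dlogit_j E[r] = p_j * (r_j - E[r]).
    baseline = probs @ rewards
    logits += lr * probs * (rewards - baseline)

probs = softmax(logits)
# After training, the preferred response dominates the policy.
```

The point of the toy matches the point of the comment: the update is only as good as the reward signal, and here the reward comes entirely from a human judgment.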
Not following. We can already update the weights. That's training, tuning, RLHF, etc. How does that help?
No. We're talking about aligning general intelligence. We need to avoid all the dangerous behaviors, not just a single example we can think of, or even numerous examples. We need the AI to output things we haven't thought of, or why is it useful at all? If there's a finite and reasonably small number of inputs/outputs we want, there's a simpler solution: that's not an AGI—it's a lookup table.
You can think of the LLM weights as a lossy compression of the corpus they were trained on. If you can predict text better than chance, you don't need as much capacity to store it, so an LLM could also serve as a component in a lossless text compressor. But the predictors produced by training generalize beyond the corpus, to text that hasn't been written yet: the model carries an internal model of the possible worlds that could have generated the corpus. That's intelligence.
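The prediction-compression link is concrete: under an ideal entropy coder (e.g. arithmetic coding), encoding a symbol with predicted probability p costs about -log2(p) bits, so a better-than-chance predictor means fewer total bits. A minimal sketch with a hand-set bigram predictor standing in for a trained LM:

```python
import math

text = "abababababababab"

# Baseline: a uniform code over the 2-symbol alphabet costs 1 bit/symbol.
uniform_bits = len(text) * math.log2(2)

# Toy "predictor": after 'a', 'b' is likely, and vice versa. The
# probabilities are hand-set here; a real LM would learn them from data.
def bigram_prob(prev, cur):
    if prev is None:
        return 0.5
    return 0.9 if cur != prev else 0.1

model_bits = 0.0
prev = None
for c in text:
    # Ideal code length for a symbol with probability p is -log2(p) bits.
    model_bits += -math.log2(bigram_prob(prev, c))
    prev = c

# model_bits is far below uniform_bits: prediction is compression.
```

This is the same relationship that lets people use LLM log-probabilities to measure compression ratios on held-out text.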
CBiddulph
If I understand correctly, you're basically saying:
* We can't know how long it will take for the machine to finish its task. In fact, it might take an infinite amount of time, due to the halting problem, which says that we can't know in advance whether a program will run forever.
* If our machine took an infinite amount of time, it might do something catastrophic in that infinite amount of time, and we could never prove that it doesn't.
* Since we can't prove that the machine won't do something catastrophic, the alignment problem is impossible.
The halting problem doesn't say that we can't know whether any program will halt, just that we can't determine the halting status of every single program. It's easy to "prove" that a program that runs an LLM will halt. Just program it to "run the LLM until it decides to stop; but if it doesn't stop itself after 1 million tokens, cut it off." This is what ChatGPT or any other AI product does in practice.
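The "cut it off" construction can be made explicit. A minimal sketch of a generation loop with a hard token cap (the generator function and stop token are hypothetical stand-ins for a real inference API):

```python
def run_llm(generate_token, max_tokens=1_000_000, stop_token="<eos>"):
    """Run a token generator, but guarantee termination with a hard cap."""
    output = []
    for _ in range(max_tokens):
        tok = generate_token(output)
        if tok == stop_token:
            break  # the model chose to stop on its own
        output.append(tok)
    return output  # always returns after at most max_tokens steps

# Even a "non-halting" generator that never emits the stop token halts:
tokens = run_llm(lambda out: "word", max_tokens=10)
```

Whatever `generate_token` does, the loop provably terminates, which is exactly why the halting problem doesn't bite here.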
Also, the alignment problem isn't necessarily about proving that an AI will never do something catastrophic. It's enough to have good informal arguments that it won't do something bad with (say) 99.99% probability over the length of its deployment.
ProgramCrafter
A problem is that
* we don't know the specific goal representation (the actual string in place of "A"),
* we don't know how to evaluate LLM output (in particular, how to check whether a suggested plan works for a goal),
* we have a large (presumably infinite, non-enumerable) set of behaviors B we want to avoid,
* we have explicit representations for some items in B, mentally understand a few more, and don't understand or even know about the rest of the unwanted things.
Introduction
Hello everyone,
I'm a long-time on-off lurker here. I made my way through the Sequences quite a while ago, with mixed success in implementing some of them. Many of the ideas are intriguing, and I would love to have enough spare cycles to play with them. Unfortunately, often enough, I find I don't have the capacity to do this properly, with life getting in the way. With that (and not only that) in mind, I'm going to take a sabbatical this summer for at least three months and try to do an update and generally tend to stuff I've been putting off.