Introduction

Hello everyone,

I'm a long-time on-and-off lurker here. I made my way through the Sequences quite a while ago, with mixed success in implementing some of them. Many of the ideas are intriguing, and I would love to have enough spare cycles to play with them. Unfortunately, often enough I find I don't have the capacity to do this properly, with life getting in the way. With that (and not only that) in mind, I'm taking a sabbatical this summer for at least three months to catch up and generally tend to stuff I've been putting off.
As the sabbatical approaches, I've been looking around and was hit by some information about the AGI alignment problem as a wake-up call of sorts. For now I'm going through the materials, though it is not a field I'm all that familiar with. I'm a programmer by trade, so I can parse most of the material, but some of the ideas are difficult to understand properly. I intend to dig deeper on a later pass; for now I'm trying to get an overall feel for the area.
This brings me to a question that popped into my mind, and I've yet to stumble upon anything even resembling an answer - possibly because I don't know where to look yet. If someone can point me in the right direction, it would be appreciated.
Looking for a clarification
Context:
As I understand it, the core of alignment is the question: "can we trust the machine to do what we want it to do, as opposed to something else?" This covers the whole business of the hidden complexity of wishes, the orthogonality thesis, etc. - basically, not handing control over to a potentially dangerous agent.
The machines we're currently most worried about are LLMs, or their successors, potentially turning into AGI/superintelligence.
We would like a method to ensure that these are aligned. Many of the proposed methods involve having one machine validate another machine's alignment, since we will run out of "human-based" capacity due to the intelligence disparity.
Since my background is in programming, I tend to see everything through that lens. So for me an LLM is "just" a large collection of weights that we feed some initial input[3] and watch what comes out the other side[1], plus a machine that does all the updates. If we don't mind the process being slow, this could be achieved by a single "crawler" machine that goes through the weight matrices field by field and does the updates. Since the machine is finite (albeit huge), this would work.
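To make the "crawler" picture concrete, here is a minimal Python sketch of what I have in mind - toy sizes, plain lists, nothing resembling a real LLM, just the field-by-field principle:

```python
# Toy "crawler": computes one dense layer field by field, like a head
# moving over a (finite) tape. Illustrative only - real models have
# billions of weights and more structure, but the principle holds.

def crawl_layer(weights, biases, inputs):
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias
        for w, x in zip(row, inputs):    # one field at a time
            acc += w * x
        outputs.append(max(0.0, acc))    # ReLU, just for concreteness
    return outputs

# 2 neurons, 3 inputs: slow but finite, so the pass always terminates.
print(crawl_layer([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [0.0, 0.1], [1.0, 2.0, 3.0]))
```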
Let's now rephrase the alignment problem. We have a goal A that we want to achieve and some behavior B that we want to avoid. So we do the whole training thing that I don't know much about[2], resulting in the "file with weights". During this process we steer the machine towards producing A while avoiding B, as far as we can observe.
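In caricature (and this is just my mental model, not a claim about how real training works), that steering might look like minimizing a loss that rewards A and penalizes B whenever our probe catches it:

```python
# Caricature of "steer towards A while avoiding B as far as we observe".
# One scalar "weight", a goal term for A, a penalty when the probe sees B.
# Purely illustrative - real training is nothing like this hill climbing.
import random

def behaves_badly(w):              # "B", as far as our probe can see
    return w > 2.0

def loss(w):
    goal = (w - 1.0) ** 2          # "A": we want w near 1.0
    return goal + (10.0 if behaves_badly(w) else 0.0)

w = 3.0
for _ in range(2000):              # crude hill climbing on the loss
    candidate = w + random.uniform(-0.1, 0.1)
    if loss(candidate) < loss(w):
        w = candidate
print(round(w, 3))                 # ends near 1.0; B avoided on probed cases
```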
Now we take the file of weights and write the small updating program (accepting the slowness for the sake of clarity). Pseudocode:
1. Starting from the input layer, go neuron by neuron to update the network.
2. If the output notes "A", stop.
3. Else feed the output of the network plus any subsequent input back into the input layer and go to 1.
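The same loop as runnable Python, where `run_network` and `notes_goal_A` are toy stand-ins I made up (hypothetical, not real APIs) for "one full crawler pass" and "the output notes A":

```python
# Runnable skeleton of the pseudocode above, with toy stand-ins.

def run_network(weights, state):
    return [sum(state) - 1.0]            # 1. one full neuron-by-neuron pass

def notes_goal_A(output):
    return output[0] <= 0.0              # 2. did the output note "A"?

def run_loop(weights, initial_input, more_input):
    state = list(initial_input)
    while True:                          # note: no a-priori bound here
        output = run_network(weights, state)
        if notes_goal_A(output):
            return output                # the machine halts
        state = output + next(more_input, [])   # 3. feed back, go to 1

print(run_loop(None, [3.0], iter([])))   # toy run: halts after a few steps
```

For these toy stand-ins the loop obviously halts; for a real network, whether step 2 is ever reached is exactly the open part.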
Of course, we want to avoid B. The only point in time when we can be sure that B is not on the table is when the machine is no longer moving, i.e. when the machine halts.
So the clarification I'm seeking is: how is alignment different from the halting problem we already know about?
I.e., given that we know we can't predict whether a machine will halt using a machine of similar power, why do we think alignment should follow a different set of rules?
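To spell out the parallel I see: suppose we had a perfect checker for "will this program ever produce B". The classic trick that sinks halting seems to apply directly. A sketch (every name here is hypothetical; the checker is exactly the thing that can't exist in general):

```python
# The halting-style diagonalization, transplanted to "will it ever do B?".
# would_ever_produce_B is the assumed perfect checker - hypothetical, and
# this construction is the reason a fully general one can't exist.
import inspect

def would_ever_produce_B(source: str) -> bool:
    raise NotImplementedError            # the checker we're assuming exists

def contrarian():
    me = inspect.getsource(contrarian)   # examine its own source
    if would_ever_produce_B(me):
        return "harmless output"         # verdict "does B" -> never does B
    else:
        return "B"                       # verdict "never does B" -> does B
    # Whatever the checker answers about `contrarian`, it is wrong.
```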
Afterword:
I'm aware this might be an obvious question for someone already in the field. Still, considering it sounds almost silly, I was somewhat dismayed not to find it spelled out anywhere. Maybe the answer follows from something I don't see; maybe there is just a hole in my reasoning.
It bothered me enough to write this post; at the same time, I'm not sure enough about my reasoning to post it as a full article, so I'm asking here in the introduction section instead. Any help is appreciated.
[1] Of course it is many orders of magnitude more complex under the hood, but stripped to the basics, this is it. There are no "weird magic-like" parts doing something inexplicable.
[2] I've fiddled with some neural networks, done some small training runs, and even tried implementing the basic logic from scratch myself - though that was quite some time ago. So I have some idea of what is going on; however, I'm not up to date on state-of-the-art approaches and I'm not an expert by any stretch of the imagination.
[3] All the input that we want to provide to the machine: it could be the first frame of a video, a text prompt, readings from sensors, or whatever else.