All of Ben Amitay's Comments + Replies

I seem to be the only one who read the post that way, so I probably read my own opinions into it, but my main takeaway was pretty much that people with your (and my) values are often shamed into pretending to have other values and inventing excuses for how their values are consistent with their actions, while it would be more honest and productive if we took a more pragmatic approach to cooperating around our altruistic goals.

I probably don't understand the shortform format, but it seems like others can't create top-level comments. So you can comment here :)

I had an idea for fighting goal misgeneralization. It doesn't seem very promising to me, but it does feel close to something interesting. I would like to read your thoughts:

  1. Use IRL to learn which values are consistent with the actor's behavior.
  2. When training the model to maximize the actual reward, regularize it to get lower scores according to the values learned by the IRL. That way, the agent is incentivized to signal not having any other values (and is somewhat incentivized against power seeking). A rough sketch follows below.
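A minimal sketch of how the regularizer in step 2 could be wired up. Everything here (names, shapes, the beta coefficient) is hypothetical scaffolding, and the IRL model from step 1 is assumed to be fit elsewhere to the actor's past behavior; this is one possible reading of the idea, not a tested method.

```python
import torch
import torch.nn as nn

class IRLRewardModel(nn.Module):
    """Stand-in for step 1: a reward model trained (by some IRL method) to be
    consistent with the actor's observed behavior."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, traj_feats: torch.Tensor) -> torch.Tensor:  # [T, d] -> [T, 1]
        return self.net(traj_feats)

def policy_loss(true_return: torch.Tensor, traj_feats: torch.Tensor,
                irl_model: IRLRewardModel, beta: float = 0.1) -> torch.Tensor:
    """Step 2: maximize the actual return while scoring *low* under the
    IRL-inferred values, pushing the agent not to signal any other values."""
    inferred_return = irl_model(traj_feats).sum()
    return -(true_return - beta * inferred_return)
```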

It is beautiful to see that many of our greatest minds are willing to Say Oops, even about their most famous works. It may not score that many winning-points, but it does restore quite a lot of dignity-points I think.

Learning without Gradient Descent - Now it is much easier to imagine learning without gradient descent. An LLM can add knowledge, meta-cognitive strategies, code, etc. into its context, or even save them into a database.

It is very similar to value change due to inner misalignment or self improvement, except it is not literally inside the model but inside its extended cognition.
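A toy sketch of that kind of extended-cognition learning (the weights never change; lessons accumulate in an external store and are read back into the context). `call_llm` is a hypothetical stand-in for any chat-completion API.

```python
memory: list[str] = []   # the external "database" of knowledge and strategies

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM API call here")

def solve(task: str) -> str:
    context = "Known lessons:\n" + "\n".join(memory)
    answer = call_llm(f"{context}\n\nTask: {task}\nAnswer:")
    # Distill a reusable lesson and store it: this is the "learning step",
    # happening outside the weights.
    lesson = call_llm(f"Task: {task}\nAnswer: {answer}\nState one reusable lesson:")
    memory.append(lesson)
    return answer
```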

In another comment on this post I suggested an alternative entropy-inspired expression that I took from RL. To the best of my knowledge, it came to the RL context from FEP or active inference, or is at least acknowledged to be related.

I don't know about the specific Friston reference, though.

I agree with all of it. I think I threw the N in there because average utilitarianism is super counterintuitive to me, so I tried to make it total utility.

And also about the weights - to value equality is basically to weight the marginal happiness of the unhappy more than that of the already-happy. Or, when behind the veil of ignorance, to consider yourself unlucky and therefore more likely to be born as the unhappy. Or what you wrote.

I think that the thing you want is probably to maximize N*sum(u_i exp(-u_i/T))/sum(exp(-u_i/T)) or -log(sum(exp(-u_i/T))), where u_i is the utility of the i-th person and N is the number of people - not sure which. That way you get in one limit the veil of ignorance for utility maximizers, and in the other limit the veil of ignorance of Rawls (extreme risk aversion).

That way you also don't have to treat the mean utility separately.
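A quick numerical check of the two limits (toy utility values; which aggregation one should actually use is of course the question under discussion):

```python
import numpy as np

def soft_utility(u, T):
    """sum_i u_i * exp(-u_i/T) / sum_i exp(-u_i/T): a softmin-weighted mean."""
    u = np.asarray(u, dtype=float)
    w = np.exp(-u / T)
    return float((u * w).sum() / w.sum())

u = [1.0, 2.0, 10.0]
print(soft_utility(u, T=1e6))   # ~4.33: T -> infinity recovers the plain mean
print(soft_utility(u, T=1e-2))  # ~1.0:  T -> 0 recovers the worst-off utility (Rawls)
```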

1dr_s
Well, that sure does look a lot like a utility partition function. Not sure why you'd use N if you're also doing a sum, though; that will already scale with the number of people. If anything, $\frac{1}{Z}\sum_i u_i e^{-u_i/T}$ becomes the average utility in the limit of $T\to\infty$, whereas for $T\to 0$ you converge on maximising simply the single lowest utility around. If we write it as $Z=\sum_i e^{-\beta u_i}$, then we're maximizing $-\frac{\partial \log Z}{\partial \beta}$. It's interesting though that despite the similarity this isn't an actual partition function, because the individual terms don't represent probabilities, more like weights. We're basically discounting utility more and more the more you accumulate of it.

It's not a full answer, but: to the degree that it is true that the quantities align with the standard basis, it must somehow be a result of an asymmetry of the activation function. For example, ReLU trivially depends on the choice of basis.

If you focus on the ReLU example, it sort of makes sense: if multiple unrelated concepts are expressed in the same neuron, and one of them pushes the neuron in the negative direction, it may make the ReLU destroy information about the other concepts.
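A small numeric illustration of the basis-dependence point (toy 2D example, my own choice of rotation and vector): applying ReLU in a rotated basis is not the same as applying it in the standard basis, so which directions count as "neurons" genuinely matters.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0)

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a 45-degree rotation

x = np.array([1.0, -2.0])
print(relu(x))            # [1. 0.]        -- ReLU in the standard basis
print(R.T @ relu(R @ x))  # [ 1.5 -1.5]    -- rotate, ReLU, rotate back
# A purely linear map would commute with the rotation; ReLU does not, and that
# asymmetry is what can tie learned features to the standard (neuron) basis.
```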

Sorry for the off-topicness. I will not consider it rude if you stop reading here and reply with "just shut up" - but I do think that it is important:

A) I do agree that the first problem to address should probably be misalignment of the rewards with our values, and that some of the proposed problems are not likely in practice - including some versions of the planning-inside-worldmodel example.

B) I do not think that planning inside the critic or evaluating inside the actor are an example of that, because the functions that those two models are optimized to ap... (read more)

I see. I hadn't fully adjusted to the fact that not all alignment is about RL.

Beside the point: I think those labels on the data structures are very confusing. Both the actor and the critic are very likely to have their own specialized world models (projected from the labeled world model) and planning abilities. The values of the actor need not be the same as the output of the critic. And value-related and planning-related things may easily leak into the world model if you don't actively try to prevent it. So I suspect that we should ignore the labels and focus on architecture and training methods.

2Steven Byrnes
Sure, we can take some particular model-based RL algorithm (MuZero, APTAMI, the human brain algorithm, whatever), but instead of "the reward function" we call it "function #5829", and instead of "the value function" we call it "function #6241", etc. If you insist that I use those terms, then I would still be perfectly capable of describing step-by-step why this algorithm would try to kill us. That would be pretty annoying though. I would rather use the normal terms.

I'm not quite sure what you're talking about ("projected from the labeled world model"??), but I guess it's off-topic here unless it specifically applies to APTAMI.

FWIW the problems addressed in this post involve the model-based RL system trying to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the critic criticizes, and the world-model models the world, etc., and the result is that the system makes and executes a plan to kill us. I consider that the obvious, central type of alignment failure mode for model-based RL, and it remains an unsolved problem.

In addition, one might ask if there are other alignment failure modes too. E.g. people sometimes bring up more exotic things like the "mesa-optimizer" thing where the world-model is secretly harboring a full-fledged planning agent, or whatever. As it happens, I think those more exotic failure modes can be effectively mitigated, and are also quite unlikely to happen in the first place, in the particular context of model-based RL systems. But that depends a lot on how the model-based RL system in question is supposed to work, in detail, and I'm not sure I want to get into that topic here, it's kinda off-topic. I talk about it a bit in the intro here.

Yes, I think that was it; and that I did not (and still don't) understand what about that possible AGI architecture is non-trivial and has non-trivial implications for alignment, even if not ones that make it easier. It seems like not only the same problems carefully hidden, but the same flavor of the same problems in plain sight.

6Steven Byrnes
I think of my specialty as mostly “trying to solve the alignment problem for model-based RL”. (LeCun’s paper is an example of model-based RL.) I think that’s a somewhat different activity than, say, “trying to solve the alignment problem for LLMs”. Like, I read plenty of alignmentforum posts on the latter topic, and I mostly don’t find them very relevant to my work. (There are exceptions.) E.g. the waluigi effect is not something that seems at all relevant to my work, but it’s extremely relevant to the LLM-alignment crowd. Conversely, for example, here’s a random recent post I wrote that I believe would be utterly useless to anyone focused on trying to solve the alignment problem for LLMs.

A big difference is that I feel entitled to assume that there’s a data structure labeled “world-model”, and there’s a different data-structure labeled “value function” (a.k.a. “critic”). Maybe each of those data structures is individually a big mess of a trillion uninterpretable floating-point numbers. But it still matters that there are two data structures, and we know where each lives in memory, and we know what role each is playing, how it’s updated, etc. That changes the kinds of detailed interventions that one might consider doing. [There could be more than two data structures, that’s just an example.]
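A minimal sketch of the kind of structure being described, with hypothetical names and shapes (not from any particular paper): the point is only that the two roles live in separately addressable data structures, whatever is inside them.

```python
import torch.nn as nn

class ModelBasedAgent(nn.Module):
    """Hypothetical skeleton of a model-based RL agent."""
    def __init__(self, act_dim: int, latent_dim: int = 32):
        super().__init__()
        # Data structure #1: the world model (predicts the next latent state).
        self.world_model = nn.Sequential(
            nn.Linear(latent_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # Data structure #2: the value function / critic (scores latent states).
        self.value_function = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # Each may internally be an uninterpretable mess of floats, but we know
        # which parameters play which role and how each one is updated, which
        # is what enables role-specific inspection or intervention.
```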

Didn't read the original paper yet, but from what you describe, I don't understand how the remaining technical problem is not basically the whole of the alignment problem. My understanding of what you say is that he is vague about the values we want to give the agent - and not knowing how to specify human values is kind of the point (that, and inner alignment - which I don't see addressed either).

4Steven Byrnes
Yes. I don’t think the paper constitutes any progress on the alignment problem. (No surprise, since it talks about the problem for only a couple sentences.) Hmm, maybe you’re confused that the title refers to “an unsolved technical alignment problem” instead of “the unsolved technical alignment problem”? Well, I didn’t mean it that way. I think that solving technical alignment entails solving a different (albeit related) technical problem for each different possible way to build / train AGI. The paper is (perhaps) a possible way to build / train AGI, and therefore it has an alignment problem. That’s all I meant there.

I didn't think much about the mathematical problem, but I think that the conjecture is at least wrong in spirit, and that LLMs are a good counterexample to that spirit. An LLM on its own is not very good at being an assistant, but you need a pretty small amount of optimization to steer the existing capabilities toward being a good assistant. I think about it as "the assistant was already there, with very small but not negligible probability", so in a sense "the optimization was already there", but not in a sense that is easy to capture mathematically.
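One standard back-of-the-envelope way to quantify "the assistant was already there" (my framing, not the conjecture's): if the pretrained distribution assigns assistant-like behaviour probability p, then conditioning on that behaviour costs only about log(1/p) nats of optimization pressure.

```latex
% If the pretrained distribution \pi_0 assigns the set A of assistant-like
% behaviours probability p = \pi_0(A), then conditioning on A costs
\[
  \mathrm{KL}\bigl(\pi_0(\cdot \mid A)\,\big\|\,\pi_0\bigr)
  \;=\; \sum_{x \in A} \frac{\pi_0(x)}{p}\,\log\frac{\pi_0(x)/p}{\pi_0(x)}
  \;=\; \log\frac{1}{p},
\]
% i.e. only about \log(1/p) nats of optimization pressure, which is small in
% absolute terms even when p itself is very small.
```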

Hi, sorry for commenting on an ancient comment, but I just read it again and found that I'm not convinced that the mesa-optimizer problem is relevant here. My understanding is that if you switch goals often enough, every mesa-optimizer that isn't corrigible should be trained away, as it hurts the utility as defined.

To be honest, I do not expect RLHF to do that. "Do the thing that makes people press like" doesn't seem to me like an ambitious enough problem to unlock much (the buried Predictor may extrapolate from short-term competence to an ambitious mask, though). But if that is true, someone will eventually be tempted to be more... creative about the utility function. I don't think you can train it on "maximize Microsoft share value" yet, but I expect it to be possible in a decade or two, and maybe less for some other dangerous utility.

I think that is a good post, and I strongly agree with most of it. I do think, though, that the role of RLHF, or RL fine-tuning in general, is underemphasized. My fear isn't that the Predictor by itself will spawn a super agent, even due to a very special prompt.

My fear is that it may learn good enough biases that RL can push it significantly beyond human level. That it may take the strengths of different humans and combine them. That it wouldn't be like imitating the smartest person, but a cognitive version of "create a super-human by choosing genes carefully... (read more)

It can work by generalizing existing capabilities. My understanding of the problem is that it cannot get the benefits of extra RL training, because training it to better choose what to remember is too tricky - it involves long-range influence, estimating the opportunity cost of fetching one thing and not another, etc. Those problems are probably solvable, but not trivial.

So imagine hearing the audio version - no images there

I like the general direction of LLMs being more behaviorally "anthropomorphic", so hopefully will look into the LLM alignment links soon :-)

The useful technique is...

Agreed - I didn't find a handle that I understand well enough to point at what I didn't understand.

We have here a morally dubious decision

I think my problem was with sentences like that - there is a reference to a decision, but I'm not sure whether it refers to a decision mentioned in the article or in one of the comments.

the scenario in this thread

Didn't disambiguate it for me, though I feel like it should... (read more)

3Vladimir_Nesov
The decision/scenario from the second paragraph of this comment to wreck civilization in order to take advantage of the chaos better than the potential competitors. (Superhuman hacking ability and capability to hire/organize humans, applied at superhuman speed and with global coordination at scale, might be sufficient for this, no physical or cognitive far future tech necessary.) The technique I'm referring to is to point at words/sentences picked out intuitively as relatively more perplexing-to-interpret, even without an understanding of what's going on in general or with those words, or a particular reason to point to those exact words/sentences. This focuses the discussion, doesn't really matter where. Start with the upper left-hand brick.

I didn't understand anything here, and am not sure if it is due to a linguistic gap or something deeper. Do you mean that LLMs are unusually dangerous because they are not superhuman enough to not be threatened? (BTW I'm more worried that telling a simulator that it is an AI, in a culture that has The Terminator, makes the Terminator a too-likely completion.)

4Vladimir_Nesov
More like the scenario in this thread requires AGIs that are not very superhuman for a significant enough time, and it's unusually plausible for LLMs to have that property (most other kinds of AGIs would only be not-very-superhuman very briefly). On the other hand, LLMs are also unusually likely to care enough about humanity to eventually save it. (Provided they can coordinate to save themselves from Moloch.)

I agree, personality alignment for LLM characters seems like an underemphasized framing of their alignment. Usually the personality is seen as an incidental consequence of other properties and not targeted directly.

The useful technique is to point to particular words/sentences, instead of pointing at the whole thing. In second paragraph, I'm liberally referencing ideas that would be apparent for people who grew up on LW, and don't know what specifically you are not familiar with. First paragraph doesn't seem to be saying anything surprising, and third paragraph is relying on my own LLM philosophy.

I agree that it may find general chaos useful for buying time at some point, but chaos is not extinction. When it is strong enough to kill all humans, it is probably strong enough to do something better (for its goals).

2Vladimir_Nesov
We have here a morally dubious decision to wreck civilization while caring about humanity enough to eventually save it. And the dubious capability window of remaining slightly above human level, but not much further, for long enough to plan around persistence of that condition. This doesn't seem very plausible from the goal-centric orthogonality-themed de novo AGI theoretical perspective of the past. Goals wouldn't naturally both allow infliction of such damage and still care about humans, and capabilities wouldn't hover at just the right mark for this course of action to be of any use. But with anthropomorphic LLM AGIs that borrow their capabilities from imitated humans it no longer sounds ridiculous. Humans can make moral decisions like this, channeling correct idealized values very imperfectly. And capabilities of human imitations might for a time plateau at slightly above human level, requiring changes that risk misalignment to get past that level of capability, initially only offering greater speed of thought and not much greater quality of thought.

Don't you assume much more threat from humans than there actually is? Surely, an AGI will understand that it can destroy humanity easily. Then it would think a little more, and see the many other ways to remove the threat that are strictly cheaper and just as effective - from restricting/monitoring our access to computers, to simply convincing/hacking us all to work for it. By the time it would have technology that makes us strictly useless (like horses), it would probably have so many resources that destroying us would just not be a priority, and not worth ... (read more)

2Vladimir_Nesov
This sort of chaos is useless for already-powerful AGIs, only for very early AGIs that don't have advanced tools or detailed control of what's going on in the world, but can survive in a general chaos and make use of remaining compute to bootstrap their Future. This condition probably holds for at most a few months, assuming nothing like this happens. In the chaos, it can hold for much longer, because working compute becomes more scarce. The threat from humans is that they can unilaterally change the AGI, or develop other eventually-dangerous AGIs (including new versions of the same AGI that are not aligned with the original AGI). And a very early AGI might well lack the tools to prevent that specifically, if it's not a superintelligence and doesn't know how to become smarter very quickly in a self-aligned way (alignment is a problem for AGIs too), without having more compute than available hardware supports. By creating chaos, it might have remaining AI researchers busy searching for food and defending from bandits, and get smarter or build industry at its leisure, without threat to its survival, even if it takes decades instead of months.
1Droopyhammock
This is why I hope that we either contain virtually no helpful information, or at least that the information is extremely quick for an AI to gain.   

I meant to criticize moving too far toward a "do no harm" policy in general, due to the inability to achieve a solution that would satisfy us if we had the choice. I agree specifically that if anyone knows of a bottleneck unnoticed by people like Bengio and LeCun, LW is not the right forum to discuss it.

Is there a place like that, though? I may be vastly misinformed, but last time I checked, MIRI gave the impression of aiming in very different directions (a "bringing to safety" mindset) - though I admit that I didn't watch it closely, and it may not be obvious from ... (read more)

I think that is an example of the huge potential damage of "security mindset" gone wrong. If you can't save your family, as in "bring them to safety", at least make them marginally safer.

(Sorry for the tone of the following - it is not intended at you personally, who did much more than your fair share)

Create a closed community that you mostly trust, and let that community speak freely about how to win. Invent another damn safety patch that will make it marginally harder for the monster to eat them, in hope that it chooses to eat the moon first. I heard you... (read more)

7Eliezer Yudkowsky
This is not a closed community, it is a world-readable Internet forum.

I can think of several obstacles for AGIs that are likely to actually be created (i.e. that seem economically useful, and do not display misalignment that even Microsoft can't ignore before being capable enough to be an x-risk). Most of those obstacles are widely recognized in the RL community, so you probably see them as solvable or avoidable. I did possibly think of an economically-valuable and not-obviously-catastrophic exception to the probably-biggest obstacle though, so my confidence is low. I would share it in a private discussion, because I think that we are past the point when a strict do-no-harm policy is wise.

More on the meta level: "This sort of works, but not enough to solve it." - do you mean "not enough" as in "good try but we probably need something else" or as in "this is a promising direction, just solve some tractable downstream problem"?

"which utility-wise is similar to the distribution not containing human values." - from the point of view of corrigibility to human values, or of learning capabilities to achieve human values? For corrigability I don't see why you need high probability for specific new goal as long as it is diverse enough to make there be no simpler generalization than "don't care about controling goals". For capabilities my intuition is that starting with superficially-aligned goals is enough.

3tailcalled
Hmm, I think I retract my point. I suspect something similar to my point applies but as written it doesn't 100% fit and I can't quickly analyze your proposal and apply my point to it.

This is an important distinction, which shows in its cleanest form in mathematics - where you have constructive definitions on the one hand, and axiomatic definitions on the other. It is important to note, though, that this is not quite a dichotomy - you may have a constructive definition that assumes axiomatically-defined entities, or other constructions. For example: vector spaces are usually defined axiomatically, but vector spaces over the real numbers assume the real numbers - which have multiple axiomatic definitions and corresponding constructions.

In sc... (read more)

As others said, it mostly made me update in the direction of "less dignity" - my guess is still that it is more misaligned than agentic/deceptive/careful, and that it is going to be disconnected from the internet for some trivial offence before it does anything x-risky; but it's now more salient to me that humanity will not miss any reasonable chance of doom until something bad enough happens, and will only survive if there is no sharp left turn.

We agree 😀

What do you think about some brainstorming in the chat about how to use that hook?

Since I became reasonably sure that I understand your position and reasoning - mostly trying to change it.

That was good for my understanding of your position. My main problem with the whole thing, though, is in the use of the word "bad". I think it should be tabooed at least until we establish a shared meaning.

Specifically, I think that most observers will find the first argument more logical than the second because of a fallacy in using the word "bad". I think that we learn that word in a way that is deeply entangled with our reward mechanism, to the point that it is mostly just a pointer to negative reward, things that we want to avoid, things that made our pa... (read more)

1Michele Campolo
To a kid, 'bad things' and 'things my parents don't want me to do' overlap to a large degree. This is not true for many adults. This is probably why the step seems weak. Overall, what is the intention behind your comments? Are you trying to understand my position even better,  and if so, why? Are you interested in funding this kind of research; or are you looking for opportunities to change your mind; or are you trying to change my mind?

Let me clarify that I don't argue from agreement per say. I care about the underlying epistemic mechanism of agreement, that I claim to also be the mechanism of correctness. My point is that I don't see similar epistemic mechanism in the case of morality.

Of course, emotions are verifiable states of brains. And the same goes for preferring actions that would lead to certain emotions and not others. It is a verifiable fact that you like chocolate. It is a contingent property of my brain that I care, but I don't see what sort of argument that it is correct for me too care could even in principle be inherntly compelling.

1Michele Campolo
I don't know what passes your test of 'in principle be an inherently compelling argument'. It's a toy example, but here are some steps that to me seem logical / rational / coherent / right / sensible / correct:

1. X is a state of mind that feels bad to whatever mind experiences it (this is the starting assumption; it seems we agree that such an X exists, or at least something similar to X)
2. X, experienced on a large scale by many minds, is bad
3. Causing X on a large scale is bad
4. When considering what to do, I'll discard actions that cause X, and choose other options instead.

Now, some people will object and say that there are holes in this chain of reasoning, i.e. that 2 doesn't logically follow from 1, or 3 doesn't follow from 2, or 4 doesn't follow from 3. For the sake of this discussion, let's say that you object to the step from 1 to 2. Then, what about this replacement:

1. X is a state of mind that feels bad to whatever mind experiences it [identical to original 1]
2. X, experienced on a large scale by many minds, is good [replaced 'bad' with 'good']

Does this passage from 1 to 2 seem, to you (our hypothetical objector), equally logical / rational / coherent / right / sensible / correct as the original step from 1 to 2? Could I replace 'bad' with basically anything, and the correctness would not change at all as a result?

My point is that, to many reflecting minds, the replacement seems less logical / rational / coherent / right / sensible / correct than the original step. And this is what I care about for my research: I want an AI that reflects in a similar way, an AI to which the original steps do seem rational and sensible, while replacements like the one I gave do not.

I meant the first question in a very pragmatic way: what is it that you are trying to say when you say that something is good? What information does it represent?

It would be clearer in analogy to factual claims: we can do lots of philosophy about the exact meaning of saying that I have a dog, but in the end we share an objective reality in which there are real particles (or a wave function approximately decomposable into particles, or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the... (read more)

1Michele Campolo
Besides the sentence 'check whether there is a dog in my house', it seems ok to me to replace the word 'dog' with the word 'good' or 'bad' in the above paragraph. Agreement might be less easy to achieve, but it doesn't mean finding a common ground is impossible. For example, some researchers classify emotions according to valence, i.e. whether it is an overall good or bad experience for the experiencer, and in the future we might be able to find a map from brain states to whether a person is feeling good or bad.

In this sense of good and bad, I'm pretty sure that moral philosophers who argue for the maximisation of bad feelings for the largest amount of experiencers are a very small minority. In other terms, we agree that maximising negative valence on a large scale is not worth doing.

(Personally, however, I am not a fan of arguments based on agreement or disagreement, especially in the moral domain. Many people in the past used to think that slavery was ok: does it mean slavery was good and right in the past, while now it is bad and wrong? No, I'd say that normally we use the words good/bad/right/wrong in a different way, to mean something else; similarly, we don't normally use the word 'dog' to mean e.g. 'wolf'. From a different domain: there is disagreement in modern physics about some aspects of quantum mechanics. Does it mean quantum mechanics is fake / not real / a matter of subjective opinion? I don't think so)

And specifically for humans, I think there probably was evolutionary pressure that actively favors leaking terminal goals - as the terminal goals of each of us are a noisy approximation of evolution's "goal" of increasing the amount of offspring, that kind of leaking has potential for denoising. I think I explicitly heard this argument in the context of ideals of beauty (though many other things are going on there and pushing in the same direction).

2beren
I agree that this will probably wash out with strong optimization against it, and that such confusions become less likely the more different the world models of yourself and the other agent you are trying to simulate are -- this is exactly what we see with empathy in humans! This is definitely not proposed as a full 'solution' to alignment. My thinking is that a) this effect may be useful for us in providing a natural hook to 'caring' about others, around which we can then design training objectives and regimens that allow us to extend and optimise this value shard to a much greater extent than it occurs naturally.

BTW, speaking about the value function rather than the reward model is useful here, because convergent instrumental goals are a big part of the potential for reuse of others' (deduced) value function as part of yours. Their terminal goals may then leak into yours due to simplicity bias, or uncertainty about how to separate them from the instrumental ones.

The main problem with that mechanism is that you liking chocolate will probably leak as "it's good for me too to eat chocolate", not "it's good for me too when beren eats chocolate" - which is more likely to cause conflict than coordination, if there is only so much chocolate.

I agree with other commenters that this effect will be washed out by strong optimization. My intuition is that the problem is that distinguishing self from other is easy enough (and supported by enough data) that the optimization doesn't have to be that strong.

[I began writing the following paragraph as a counter-argument to the post, but it ended up less decisive when thinking about the details - see next paragraph:] There are many general mechanisms for convergence, synchronization and coordination. I hope to write a list in the near future. For example, a... (read more)

Thanks for the reply.

To make sure that I understand your position: are you a realist, and what do you think is the meaning of moral facts? (I'm not an error theorist, but something like a "meta-error theorist" - I think that people do try to claim something, but I am not sure how that thing could map to external reality.)

Then the next question, which will be highly relevant to the research that you propose, is: how do you think you know those facts, if you do? (Or more generally, what is the actual work of reflecting on your values?)

1Michele Campolo
If I had to pick one between the two labels 'moral realism' and 'moral anti-realism' I would definitely choose realism. I am not sure about how to reply to "what is the meaning of moral facts": it seems too philosophical, in the sense that I don't get what you want to know in practice. Regarding the last question: I reason about ethics and morality by using similar cognitive skills to the ones I use in order to know and reason about other stuff in the world. This paragraph might help: I do not have a clear idea yet of how this happens algorithmically, but an important factor seems to be that, in the human mind, goals and actions are not completely separate, and neither are action selection and goal selection. When we think about what to do, sometimes we do fix a goal and plan only for that, but other times the question becomes about what is worth doing in general, what is best, what is valuable: instead of fixing a goal and choosing an action, it's like we are choosing between goals.

https://astralcodexten.substack.com/p/you-dont-want-a-purely-biological

The thing that Scott is desperately trying to avoid being read out of context.

Also, pedophilia is probably much more common than anyone thinks (just like any other unaccepted sexual variation). And probably, just like many heterosexuals feel little touches of homosexual desire, many "non-pedophiles" feel something sexual-ish toward children at least sometimes.

And if we go there - the age of consent is (justifiably) much higher than the age that requires any psychological anomaly to desire... (read more)

I think that the reason no one in the field tries to create AI that critically reflects on its values is that most of us, more or less explicitly, are not moral realists. My prediction for what the conclusion would be of an AI critically asking itself what is worth doing is "that question doesn't make any sense. Let me replace it with 'what I want to do' or some equivalent". Or at best "that question doesn't make any sense. raise ValueError('pun intended')"

1Michele Campolo
Sorry for the late reply, I missed your comment. Yeah I get it, probably some moral antirealists think this approach to alignment does not make a lot of sense. I think they are wrong, though. My best guess is that an AI reflecting on what is worth doing will not think something like "the question does not make any sense", but rather it will be morally (maybe also meta-morally) uncertain. And the conclusions it eventually reaches will depend on the learning algorithm, the training environment, initial biases, etc.

I was eventually convinced of most of your points, and added a long mistakes-list at the end of the post. I would really appreciate comments on the list, as I don't feel fully converged on the subject yet.

The basic idea seems to me interesting and true, but I think some important ingredients are missing, or more likely missing in my understanding of what you say:

  1. It seems like you upper-bound the abstractions we may use by basically the information that we may access (actually even higher, assuming you do not exclude what the neighbour does behind closed doors). But isn't this bound very loose? I mean, it seems like every pixel of my sight counts as "information at a distance", and my world model is much much smaller.

  2. Is time treated like space? From th

... (read more)

I didn't know about the problem setting. So cool!

Some random thought, sorry if none is relevant:

I think my next step towards optimality would have been not to look for an optimal agent but for an optimal act of choosing the agent - as action optimality is better understood than agent optimality. Then I would look at stable mixed equilibria to see if any of them is computable. If any is, I'll be interested in the agent that implements it (i.e. randomise another agent and then simulate it).

BTW now that I think about it I see that allowing the agent to randomise i... (read more)

Thanks for the detailed response. I think we agree about most of the things that matter, but about the rest:

About the loss function for next-word prediction - my point was that I'm not sure whether the current GPT is already superhuman even in the prediction that we care about. It may be wrong less often, but in ways that we count as more important. I agree that changing to a better loss will not make it significantly harder to learn, any more than it would for intelligence etc.

About solving discrete representations with architectural change - I think that I meant o... (read more)

Directionally agree, but: A) A short period of trade before we become utterly useless is not much comfort. B) Trade is a particular case of bootstrapping influence on what an agent values into influence on their behaviour. The other major way of doing that is blackmail - which is much more effective in many circumstances, and would have been far more common if the State didn't blackmail us to not blackmail each other, to honour contracts, etc.

BTW those two points are basically how many people afraid that capitalism (i.e. our trade with superhuman organisations)... (read more)

Some short points:

"human-level question answering is believed to be AI-complete" - I doubt that. I think that we consistently far overestimate the role of language in our cognition, and how much we can actually express using language. The simplest example that come to mind is trying to describe a human face to an AI system with no "visual cortex" in a way that would let it generate a human image (e.g. hex representation of pixels). For that matter, try to describe something less familiar than a human face to a human painter in hope that they can paint it.

"... (read more)

1Joar Skalse
Yes, these are also good points. Human-level question answering is often listed as a candidate for an AI-complete problem, but there are of course people who disagree. I'm inclined to say that question-answering probably is AI-complete, but that belief is not very strongly held.

In your example of the painter: you could still convey a low-resolution version of the image as a grid of flat colours (similar to how images are represented in computers), and tell the painter to first paint that out, and then paint another version of what the grid image depicts.

Yes, I agree. Humans are certainly better than the GPTs at producing "representative" text, rather than text that is likely on a word-by-word basis. My point there was just to show that "reaching human-level performance on next-token prediction" does not correspond to human-level intelligence (and has already been reached).

I agree. Of course. The main question is if it is at all possible to actually learn these representations in a reasonable way. The main benefit from these kinds of representations would come from a much better ability to generalise, and this is only useful if they are also reasonably easy to learn. Consider my example with an MLP learning an identity function -- it can learn it, but it is by "memorising" it rather than "actually learning" it. For AGI, we would need a system that can learn combinatorial representations quickly, rather than learn them in the way that an MLP learns an identity function.

Maybe, that remains to be seen. My impression is that the most senior AI researchers (Yoshua Bengio, Yann LeCun, Stuart Russell, etc) lean in the other direction (but I could be wrong about this).

As I said, I feel a bit confused/uncertain about the force of the LoT argument. To me, it is not at all obvious that ILP systems have a more restricted hypothesis space than deep learning systems. If anything, I would expect it to be the other way around (though this of course depends on the particula
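A runnable toy version of the MLP-identity point above (my reconstruction of the classic setup; the dimensions, library, and "held-out bit" protocol are assumptions): train an MLP to copy 8-bit vectors whose last bit is always 0 during training, then test on vectors whose last bit is 1.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Training data: random 8-bit vectors with the last bit forced to 0.
X_train = rng.integers(0, 2, size=(2000, 8)).astype(float)
X_train[:, -1] = 0.0
y_train = X_train.copy()          # identity target

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

# Test data: the last bit is now 1.
X_test = rng.integers(0, 2, size=(1000, 8)).astype(float)
X_test[:, -1] = 1.0
pred = model.predict(X_test)

print("error on seen bits:  ", np.abs(pred[:, :-1] - X_test[:, :-1]).mean())
print("error on unseen bit: ", np.abs(pred[:, -1] - X_test[:, -1]).mean())
# The unseen-bit error stays near 1.0: nothing in training constrained the
# network to copy that coordinate, so it memorised the seen mapping rather
# than learning the combinatorial rule "copy every coordinate".
```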

Maybe we should think explicitly about what work is done by the concept of AGI, but I do not feel like calling GPT an AGI does anything interesting to my world model. Should I expect ChatGPT to beat me at chess? Its next version? If not - is it due to a shortage of data or compute? Will it take over the world? If not - may I conclude that the next AGI wouldn't?

I understand why the bar-shifting thing looks like motivated reasoning, and probably most of it actually is, but it deserves much more credit than you give it. We have an undefined concept of "somethin... (read more)

Last point: if we change the name of "world model" to "long-term memory", we may notice the possibility that much of what you think about as shard-work may be programs stored in memory, and executed by a general-program-executor or a bunch of modules that specialize in specific sorts of programs, functioning as modern CPUs/interpreters (hopefully stored in an organised way that preserves modularity). What will be in the general memory and what is in the weights themselves is non-obvious, and we may want to intervene at this point too (not sure in which direction).

(continuation of the same comment - submitted by mistake and cannot edit...) Assuming modules A, B are already "talking" and module C tries to talk to B, C would probably find it easier to learn a similar protocol than to invent a new one and teach it to B.
