All of Kerrigan's Comments + Replies

Why can't the true values live at the level of anatomy and chemistry?

2Charlie Steiner
If God has ordained some "true values," and we're just trying to find out what pattern has received that blessing, then yes this is totally possible, God can ordain values that have their most natural description at any level He wants. On the other hand, if we're trying to find good generalizations of the way we use the notion of "our values" in everyday life, then no, we should be really confident that generalizations that have simple descriptions in terms of chemistry are not going to be good.

Would this be solved if creating a copy creates someone functionally the same as you, but who has someone else's identity and is not you?

Is there a page similar to this, but for alignment solutions?

2ChristianKl
Not as far as I know, feel free to create one.

What about from a quantum immortality perspective?

Could there not be AI value drift in our favor, from a paperclipper AI to a moral realist AI?

2Martin Randall
Possible but unlikely to occur by accident. Value-space is large. For any arbitrary biological species, most value systems don't optimize in favor of that species.

Both quotes are from your above post. Apologies for confusion.

“A sufficiently intelligent agent will try to prevent its goals[1] from changing, at least if it is consequentialist.”

It seems that in humans, smarter people are more able and likely to change their goals. A smart person may change his/her views about how the universe can best be arranged upon reading Nick Bostrom’s book Deep Utopia, for example.
 

‘I think humans are stable, multi-objective systems, at least in the short term. Our goals and beliefs change, but we preserve our important values over most of those changes. Even when gaining or losing...

2Seth Herd
I think my terminology isn't totally clear. By "goals" in that statement, I mean what we mean by "values" in humans. The two are used in overlapping and mostly interchangeable ways in my writing.

1. Humans aren't sufficiently intelligent to be all that internally consistent.
2. In many cases of humans changing goals, I'd say they're actually changing subgoals, while their central goal (be happy/satisfied/joyous) remains the same. This may be described as changing goals while keeping the same values.
3. Note "in the short term" (I think you're quoting Bostrom? The context isn't quite clear). In the long term, with increasing intelligence and self-awareness, I'd expect some of people's goals to change as they become more self-aware and work toward more internal coherence (e.g., many people change their goal of eating delicious food when they realize it conflicts with their more important goal of being happy and living a long life).

Yes, humans may change exactly that way. A friend said he'd gotten divorced after getting a CPAP to solve his sleep apnea: "When we got married, we were both sad and angry people. Now I'm not." But that's because we're pretty random and biology-determined.

How do humans, for example, read a philosophy book and update their views about what they value about the world?

“Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize themselves and cover the traces in order to exploit the AI.

But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.”
 

Can someone please explain why trading partners would lobotomize themselves?

2quetzal_rainbow
Let's suppose that you give in to threats if your opponent is not capable of predicting that you won't give in, and so would carry out the threat anyway. Other opponents are therefore incentivized to pretend very hard to be that kind of opponent, up to "literally turning themselves into the sort of opponent that carries out useless threats".
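A toy payoff calculation (illustrative numbers of mine, not from the quoted post) may make the incentive concrete. Let $g$ be the gain from a successful threat, $m$ the cost of self-modifying into an agent that carries out threats no matter what, and $c$ the cost of actually carrying a threat out.

$$
\begin{aligned}
\text{AI gives in to opponents who ``can't update'':}\quad & u_{\text{threatener}} = g - m \;\;(\text{e.g. } 10 - 1 = 9 > 0) \\
&\Rightarrow \text{self-modification (``lobotomy'') pays.} \\
\text{AI never gives in:}\quad & u_{\text{threatener}} = -m - c < 0 \\
&\Rightarrow \text{neither the threat nor the self-modification is worth it.}
\end{aligned}
$$

So a policy of giving in only to opponents who "genuinely can't update" creates exactly those opponents, which is why the quoted passage says the LDT agent refuses regardless.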

How does inner misalignment lead to paperclips? I understand the comparison of paperclips to ice cream, and that after some threshold of intelligence is reached, new possibilities can be created that satisfy desires better than anything in the training distribution. But humans want to eat ice cream, not spread the galaxies with it. So why would the AI spread the galaxies with paperclips, instead of creating them and "consuming" them? Please correct any misunderstandings of mine.

2ChristianKl
Paperclips are a metaphor for some things, but they don't really help here. AIs that are productive need a lot of compute to be productive. Spreading to other solar systems means accessing more compute.

And might a subset value-drift towards optimizing the internal experiences of all conscious minds?

2ChristianKl
That's a much more complex goal than wireheading for a digital mind that can self-modify.  In any case, those agents that care a lot about getting more power over the world are more likely to get power than agents that don't.

If an AGI achieves consciousness, why would its values not drift towards optimizing its own internal experience, and away from tiling the lightcone with something?

2ChristianKl
If some AGIs only care about their internal experience and not about affecting the outside world, they are basically wireheading. If a subset of AGIs wirehead and some don't, the AGIs that don't wirehead will have all the power over the world. Wireheaded AGIs are also economically useless, so people will try to develop AGIs that don't do that.

How can utility be a function of worlds, if the agent doesn't have access to the state of the world, but only the sense data?

2abramdemski
The post is making the distinction between seeing preferences as a utility function of worlds (this is the regular old idea of utility functions as random variables) vs seeing preferences as an expectation function on events (the Jeffrey-Bolker view). Both perspectives hold that an agent can optimize things it does not have direct access to. Agency is optimization at a distance. Optimization that isn't at a distance is selection as opposed to control.
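A rough sketch of the contrast in standard notation (my paraphrase, not notation from the post): on the random-variable view, utility is defined on worlds, and events only get an expected utility derived from it; on the Jeffrey-Bolker view, the value function is defined directly on events/propositions and satisfies Jeffrey's averaging axiom, with no requirement that it bottom out in fully specified worlds the agent could never observe.

$$
\text{Utility-of-worlds:}\quad EU(A) = \sum_{w} P(w \mid A)\, U(w)
$$
$$
\text{Jeffrey-Bolker:}\quad V(A) = V(A \wedge B)\,P(B \mid A) + V(A \wedge \neg B)\,P(\neg B \mid A) \quad \text{for any event } B
$$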

“The wanting system is activated by dopamine, and the liking system is activated by opioids. There are enough connections between them that there's a big correlation in their activity.” But are they orthogonal in principle?

Possible addition: "If you, the ASI, have not solved the hard problem of consciousness, you should place weight on Open Individualism being true. 'Open individualism is the view in the philosophy of self, according to which there exists only one numerically identical subject, who is everyone at all times, in the past, present and future.' - https://en.wikipedia.org/wiki/Open_individualism. This can be thought of as saying that all conscious minds are the same identity, at different points in time. Consider then, that exterminating humanity wo...

How many LessWrong users/readers are there total?

2Yoav Ravid
If you go to the user tab on the search page with no search term you can see there's currently 113,654 users (of course, how many of those are active or are 'readers' is a completely different question).

What caused CEV to fall out of favor? Is it because it is not easily specifiable, because it wouldn't work even if we programmed it, or some other reason?

I now think that people are way more misaligned with themselves than I had thought.

Drug addicts may be frowned upon for evolutionary-psychological reasons, but that doesn't mean that their quality of life must be bad, especially if drugs were developed without tolerance and bad comedowns.

1mruwnik
Drug addicts tend to be frowned upon not because they have a bad life, or even for evo-psych reasons, but because their lifestyle is bad for the rest of society, in that they tend to have various unfortunate externalities.

Will it think that goals are arbitrary, and that the only thing it should care about is its pleasure-pain axis? And then it will lose concern for the state of the environment?

1mruwnik
You're adding a lot of extra assumptions here, a couple being:

* there is a problem with having arbitrary goals
* it has a pleasure-pain axis
* it notices it has a pleasure-pain axis
* it cares about its pleasure-pain axis
* its pleasure-pain axis is independent of its understanding of the state of the environment

The main problem of inner alignment is making an agent want to do what you want it to do (as opposed to even understanding what you want it to do), which is an unsolved problem. Although I'm criticizing your specific criticism, my main issue with it is that it's a very specific failure mode, which is unlikely to appear because it requires a lot of other things which are also unlikely. That being said, you've provided a good example of WHY inner alignment is a big problem, i.e. it's very hard to keep something following the goals you set it, especially when it can think for itself and change its mind.

Could you have a machine hooked up to a person's nervous system, change the settings slightly to change consciousness, and let the person choose whether the changes are good or bad? Run this many times.

2AnthonyC
I don't think this works. One, it only measures short-term impacts, but any such change might have lots of medium- and long-term effects, second- and third-order effects, and effects on other people with whom I interact. Two, it measures based on the values of already-changed me, not current me, and it is not obvious that current-me cares what changed-me will think, or why I should so care if I don't currently. Three, I have limited understanding of my own wants, needs, and goals, and so would not trust any human's judgement of such changes far enough to extrapolate to situations they didn't experience, let alone to other people, or the far future, or unusual/extreme circumstances.

Would AI safety be easy if all researchers agreed that the pleasure-pain axis is the world’s objective metric of value? 

3Rafael Harth
No. It would make a difference but it wouldn't solve the problem. The clearest reason is that it doesn't help with Inner Alignment at all.

Seems like I will be going with CI, as I currently want to pay with a revocable trust or transfer-on-death agreement.

Do you know how evolution created minds that eventually thought about things such as the meaning of life, as opposed to just optimizing inclusive genetic fitness in the ancestral environment? Is the ability to think about the meaning of life a spandrel?

1mruwnik
I'm assuming you're not asking about the mechanism (i.e. natural selection + mutations)? A trite answer would be something like "the same way it created wings, mating dances, exploding beetles, and parasites requiring multiple hosts". Thinking about the meaning of life might be a spandrel, but a quick consideration of it comes up with various evo-psych style reasons why it's actually very useful, e.g. it can propel people to greatness, which can massively increase their genetic fitness. Fitness is an interesting thing, in that it can be very non-obvious. Everything is a trade-off, where the only goal is for your genes to propagate. So if thinking about the meaning of life will get your genes spread more (e.g. because you decide that your children have inherent meaning, because you become a high-status philosopher and your sister can marry well, because it's a social sign that you have enough resources to waste them on fruitless pondering) then it's worth having around.

In order to get LLMs to tell the truth, can we set up a multi-agent training environment where there is only ever an incentive for them to tell the truth to each other? For example, an environment in which each agent has only partial information, with full information needed for rewards.
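For concreteness, a toy version of the proposed setup might look like the sketch below (everything here - names, rewards, the task - is hypothetical, just to illustrate the "partial information, shared reward" idea):

```python
# Toy sketch of a two-agent, partial-information environment with a shared reward.
# Each agent sees only half of a secret; the reward depends on the full secret,
# so accurately communicating your half is (locally) incentivized.
import random

def run_episode(policy_a, policy_b):
    secret_a, secret_b = random.randint(0, 9), random.randint(0, 9)
    msg_a = policy_a(secret_a)          # agent A reports something about its half
    msg_b = policy_b(secret_b)          # agent B reports something about its half
    guess = msg_a + msg_b               # the joint "answer" built from the messages
    return 1.0 if guess == secret_a + secret_b else 0.0   # shared reward

honest = lambda x: x                                   # truthful reporting
noisy = lambda x: x + random.choice([-1, 0, 1])        # slightly unreliable reporting

print(sum(run_episode(honest, honest) for _ in range(1000)) / 1000)  # ~1.0
print(sum(run_episode(honest, noisy) for _ in range(1000)) / 1000)   # ~0.33
```

Even in this toy case, the reward only checks whether the final answer matches, so any shared code that reconstructs the sum is rewarded exactly as much as honest reporting - which is the gap the reply below points at.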

1mruwnik
The first issue that comes to mind is having an incentive that would achieve that. The one you suggest doesn't incentivize truth - it incentivizes collaboration in order to guess the password, which would be fine in training, but then you're going into deceptive alignment land: Ajeya Cotra has a good story illustrating that.

Humans have values other than having the reward circuitry in our brains maximized, but those values are still pointed to reliably. These underlying values cause us to not wirehead with respect to the outer optimizer of reward.

Is there an already written expansion of this?

Does Eliezer think the alignment problem is something that could be solved if things were just slightly different, or that proper alignment would require a human smarter than the smartest human ever?

Why can't you build an AI that is programmed to shut off after some time, or after some number of actions?

3[anonymous]
You might be interested in this paper and this LessWrong tag.

How was Dall-E based on self-supervised learning? Weren't the datasets of images labeled by humans? If not, how does it get from text to image?

2Gabriel Adriano de Melo
The text-to-image part of Dall-E was based on another model called CLIP, which had learned to caption images (generate image-to-text). This captioning could be thought of as supervised learning, but the caveat is that the captions weren't labeled by humans (in the ML sense) but extracted from web data. This is just one part of the Dall-E model; another is the diffusion process, which is based on recovering an image from noise and is unsupervised, since we can just add noise to images and ask the model to recover the original image.
1InvidFlower
Not sure on DALL-E, but I think many image generators use an image classifier as part of their process. The classifier uses labels for its training, but the image AI doesn't have direct intervention. I think you take a classifier like CLIP and run it on an image to tell you it is likely "car" and "red". Then add noise to the image. Then provide the noisy image and classifications to the image AI. So it will try to find "red" and "car" and add more of it to the details. Then the resulting image is run through CLIP and the classifications compared to the original classifications to define the loss function.
-1Millon Madhur Das
Just like language models are trained using masked language modelling and next-token prediction, Dall-E was trained for image inpainting (predicting cropped-out parts of an image). This doesn't require explicit labels; hence it's self-supervised learning. Note that this is only the part of the training procedure that is self-supervised, not the whole training process.
2gwern
The 'labels' aren't labels in the sense of being deliberately constructed in a controlled vocabulary to encode a consistent set of concepts/semantics or even be in the same language. In fact, in quite a few of the image-text pairs, the text 'labels' will have nothing whatsoever to do with the image - they are a meaningless ID or spammer text or mojibake or any of the infinite varieties of garbage on the Internet, and the model just has to deal with that and learn to ignore those text tokens and try to predict the image tokens purely based on available image tokens. (Note that you don't need text 'label' inputs at all: you could simply train the GPT model to predict solely image tokens based on previous image tokens, in the same way GPT-2 famously predicts text tokens using previous text tokens.) So they aren't 'labels' in any traditional sense. They're just more data. You can train in the other direction to create a captioner model if you prefer, or you can drop them entirely to create a unimodal unconditional generative model. Nothing special about them the way labels are special in supervised learning.

DALL-E 1 also relies critically on a VAE (the VAE is what takes the sequence of tokens predicted by GPT, and actually turns them into pixels, and which creates the sequence of real tokens which GPT was trained to predict), which was trained separately in the first phase: the VAE just trains to reconstruct images, pixels through bottleneck back to pixels, no label in sight.
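For intuition, a minimal sketch of the kind of training gwern describes might look like the following (all sizes and names are made up, and it glosses over the separately trained VAE that maps between pixels and image tokens):

```python
# Minimal sketch (not DALL-E's actual code): a decoder-only transformer doing plain
# next-token prediction over one sequence of text tokens followed by image tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # hypothetical vocabulary sizes
VOCAB = TEXT_VOCAB + IMAGE_VOCAB        # text and image tokens share one vocabulary
SEQ_LEN = 256 + 1024                    # e.g. 256 text tokens + a 32x32 grid of VAE codes

class TinyDecoder(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                        # tokens: (batch, seq)
        seq = tokens.size(1)
        pos = torch.arange(seq, device=tokens.device)
        x = self.embed(tokens) + self.pos(pos)
        causal = torch.triu(torch.full((seq, seq), float("-inf"), device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=causal)               # causal self-attention
        return self.head(x)

# One training step: the caption is just the first chunk of the sequence; the loss is
# ordinary next-token prediction over the whole thing, text and image tokens alike.
model = TinyDecoder()
tokens = torch.randint(0, VOCAB, (2, SEQ_LEN))        # stand-in for (caption ++ VAE image codes)
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
loss.backward()
```

The point carries over: nothing in the loss treats the text prefix as a privileged "label"; it is just more tokens to condition on (or to ignore, when the caption is garbage).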

Does the utility function given to the AI have to be in code? Can you give the utility function in English, if it has a language model attached?

1mruwnik
You could, but should you? English in particular seems a bad choice. The problem with natural languages is their ambiguity. When you're providing a utility function, you want it to be as precise and robust as possible. This is actually an interesting case where folklore/mythology has known about these issues for millennia. There are all kinds of stories about genies, demons, monkey paws etc. where wishes were badly phrased or twisted. This is a story explanation of the issue.

Why aren't CEV and corrigibility combinable?
If we somehow could hand-code corrigibility, and also hand-code the CEV, why would the combination of the two be infeasible? 

Also, is it possible that the result of an AGI calculating the CEV would include corrigibility? After all, might one of our convergent desires "if we knew more, thought faster, were more the people we wished we were" be to have the ability to modify the AI's goals?

How much does the doomsday argument factor into people's assessments of the probability of doom?

If AGI alignment is possibly the most important problem ever, why don't concerned rich people act like it? Why doesn't Vitalik Buterin, for example, offer one billion dollars to the best alignment plan proposed by the end of 2023? Or why doesn't he just pay AI researchers money to stop working on building AGI, in order to give alignment research more time?

1mruwnik
* There isn't a best alignment plan yet - that's part of the problem
* The issue isn't intuitive - you need quite a bit of context to even understand it. Like evolution 150 years ago
* The best long-term protection against unaligned AGI (or aliens, or whatever) is to have an aligned AGI, so the faster it appears, the better
* People assume it will all work out somehow, as it always has in the past (basic black swan issues)
* People are selfish and hope someone else will sort it out
* People have other priorities which are more important for them to fund
* Basically the whole suite of biases etc. (check Kahneman or the sequences for more)

Paying researchers to not work on AGI is a very bad idea, as it incentivizes pretending to work in order to not work. The general idea behind not paying Dane-geld is that it just encourages more of it. You could sort of make it better by suggesting that they work on alignment rather than capabilities, but the problem with that is that they both often look the same from the outside. You'd end up with the equivalent of gain-of-function research.

If a language model reads many proposals for AI alignment, is it, or will any future version, be capable of giving opinions on which proposals are good or bad?

1mruwnik
Yes, of course. The question then is whether its opinions are any good. Check out iterated amplification.

What about multiple layers (or levels) of anthropic capture? Humanity, for example, could not only be in a simulation, but be multiple layers of simulation deep.

If an advanced AI thought that it could be 1000 layers of simulation deep, it could be turned off by agents in any of the 1000 "universes" above. So it would have to satisfy the desires of agents in all layers of the simulation.

It seems that a good candidate for behavior that would satisfy all parties in every simulation layer would be optimizing "moral rightness", or MR (term taken from Nick Bost...

1mruwnik
Check out this article: https://www.lesswrong.com/posts/vCQNTuowPcnu6xqQN/distinguishing-test-from-training

I'll ask the same follow-up question to similar answers: Suppose everyone agreed that the proposed outcome above is what we wanted. Would this scenario then be difficult to achieve?

3AnthonyC
I mean, yes, because the proposal is about optimizing our entire future lightcone for an outcome we don't know how to formally specify.

Why do some people who talk about scenarios that involve the AI simulating humans in bliss states think that is a bad outcome? Is it likely that it is actually a very good outcome, one we would want if we had a better idea of what our values should be?

1mruwnik
Check out the wireheading tag. Mainly because:

* Happiness is not the only terminal value that is important
* Drug addicts are frowned upon
* Doubts about whether a state of permanent bliss is even possible (i.e. will it get boring or need to be ratcheted up)

How can an agent have a utility function that references a value in the environment, and actually care about the state of the environment, as opposed to only caring about the reward signal in its mind? Wouldn't the knowledge of the state of the environment be in its mind, which is hackable and susceptible to wireheading?

1mruwnik
Yes, exactly. This is sort of the whole point. A basic answer is that if it actually cares about its goals and can think about them, it'll notice that it should also care about the state of the environment, as otherwise it's liable to not achieve its goals. Which is pretty much why rationality is valuable and the main lesson of the sequences. Check out inner alignment and shard theory for a lot of confusing info on this topic. 

I think it may want to prevent other ASIs from coming into existence elsewhere in the universe that can challenge its power.

What did smart people in the eras before LessWrong say about the alignment problem?

1mruwnik
Frankenstein is a tale about misalignment. Asimov wrote a whole book about it. Vernor Vinge also writes about it. People have been trying to get their children to behave in certain ways forever. But before LW the alignment problem was just the domain of SF. 20 years ago the alignment problem wasn't a thing, so much so that MIRI started out as an org to create a Friendly AI.

In addition, the sympathetic nervous system (in the body, removed in neuropreservation) seems to play a role in identity.  I would recommend you read this EA Forum post by a person who claims significant changes to identity, personality, cognitive abilities, etc. after having sympathetic nerves severed.

Would it make sense to tell Alcor to flip a coin after your death, to decide between neuro and whole-body? Then, if Quantum Immortality is true, there will be branches of the multiverse where you are preserved as a neuro patient and branches where you become a whole-body patient.

2Eli Tyre
What would be the advantage of that?

That is, personality changes are attributed to the brain alone, with no involvement from the rest of the central nervous system or the enteric nervous system. Any personality changes due to spinal or abdominal trauma would need to posit a totally new biological mechanism.

 

Every line of inquiry so far has failed to suggest that any important aspects of personality are located anywhere except the brain.

You should check out sympathectomies, which cut or clamp nerves of the sympathetic nervous system in the torso. Here is a detailed post from the EA Forum, from a sympathectomy ...

Was this ever commercialized? Is the recipe still online, and do people drink this?

How would AGI alignment research change if the hard problem of consciousness were solved?

1Rachel Freedman
Consciousness, intelligence and human-value-alignment are probably mostly orthogonal, so I don't think that solving the hard problem of consciousness would directly impact AGI alignment research. (Perhaps consciousness requires general intelligence, so understanding how consciousness works on a mechanistic level might dramatically accelerate timelines? But that's highly speculative.) However, if solving the hard problem of consciousness leads us to realize that some of our AI systems are conscious, then we have a whole new set of moral patients. (As an AGI researcher) I personally would become much more concerned with machine ethics in that case, and I suspect others would as well.