eggsyntax

AI safety & alignment researcher


Thanks for doing this!

One suggestion: it would be very useful if people could interactively experiment with modifications, eg if they thought scalable alignment should be weighted more heavily, or if they thought Meta should receive 0% for training. An MVP version of this would just be a Google spreadsheet that people could copy and modify.

Update: I brought this up in a Twitter thread, one involving a lot of people with widely varied beliefs and epistemic norms.

A few interesting thoughts that came from that thread:

  • Some people: 'Claude says it's conscious!' Shoalstone: 'in other contexts, claude explicitly denies sentience, sapience, and life.' Me: 'Yeah, this seems important to me. Maybe part of any reasonable test would be "has beliefs and goals which it consistently affirms".'
  • Comparing to a tape recorder: 'But then the criterion is something like 'has context in understanding its environment and can choose reactions' rather than 'emits the words, "I'm sentient."''
  • 'Selfhood' is an interesting word that maybe could avoid some of the ambiguity around historical terms like 'conscious' and 'sentient', if well-defined.

That's extremely cool, seems worth adding to the main post IMHO!

the model isn't optimizing for anything, at training or inference time.

One maybe-useful way to point at that is: the model won't try to steer toward outcomes that would let it be more successful at predicting text.

And there's the potential complication of multiple parts, and of the specific applications a tool-oriented system is likely to be in - it'd be very odd if we decided the language-processing center of our own brain was independently sentient/sapient, separate from the rest of it, and that we should resent its exploitation.

 

Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for eg long-term memory, and then as you say it's difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system, eg just allowing weight updates at inference time would on its own (potentially) give these systems long-term memory and something much more similar to a persistent identity than current systems have.

 

In a general sense, though, there is an objective that's being optimized for

 

My quibble is that the trainers are optimizing for an objective, at training time, but the model isn't optimizing for anything, at training or inference time. I feel we're very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.

Maybe by the time we cotton on properly, they're somewhere past us at the top end.

 

Great point. I agree that there are lots of possible futures where that happens. I'm imagining a couple of possible cases where this would matter:

  1. Humanity decides to stop AI capabilities development or slow it way down, so we have sub-ASI systems for a long time (which could be at various levels of intelligence, from current to ~human). I'm not too optimistic about this happening, but there's certainly been increasing momentum on AI governance in the last year.
  2. Alignment is sufficiently solved that even > AGI systems are under our control. On many alignment approaches, this wouldn't necessarily mean that those systems' preferences were taken into account.

 

We can't "just ask" an LLM about its interests and expect the answer to soundly reflect its actual interests.

I agree entirely. I'm imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals.

 

LLMs specifically have a 'drive' to generate reasonable-sounding text

(not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren't well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they've been shaped (in fact the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them (ie picking a likely output, adding it onto the context, and running it again). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as being goal-ish, but I think the network itself can't.

That may be overly pedantic, and I don't feel like I'm articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.
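
To be a bit more concrete about the 'artifact of how we choose to use them' point, here's a rough sketch of the loop I have in mind (using Hugging Face transformers and GPT-2 purely as a stand-in; the particular model and sampling details aren't the point):

```python
# Minimal sketch of autoregressive sampling: the network itself only maps a
# context to a distribution over next tokens; "generating text" is us choosing
# to sample from that distribution and feed the result back in.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = tokenizer("The cat sat on the", return_tensors="pt").input_ids

for _ in range(20):
    with torch.no_grad():
        logits = model(context).logits             # the network's only output: scores over the vocab
    probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token
    next_token = torch.multinomial(probs, 1)       # we choose to sample from it...
    context = torch.cat([context, next_token.unsqueeze(0)], dim=1)  # ...and append it to the context

print(tokenizer.decode(context[0]))
```

The network is exactly the same object whether or not we run that loop; the loop is something we do with it.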

Horses are surely sentient and worthy of consideration as moral patients. Horses are also not exactly all free citizens.

I think I'm not getting what intuition you're pointing at. Is it that we already ignore the interests of sentient beings?

 

Additional consideration: Does the AI moral patient's interests actually line up with our intuitions? Will naively applying ethical solutions designed for human interests potentially make things worse from the AI's perspective?

Certainly I would consider any fully sentient being to be the final authority on its own interests. I think that mostly escapes the problem (although I'm sure there are edge cases) -- if (by hypothesis) we consider a particular AI system to be fully sentient and a moral patient, then whether it asks to be shut down, asks to be left alone, or asks for humans to only speak to it in Aramaic, I would take that to be what its moral interests are.

Would you disagree? I'd be interested to hear cases where treating the system as the authority on its interests would be the wrong decision. Of course, in the case of current systems, we've shaped them to only say certain things, and that presents problems; is that the issue you're raising?

Before AI gets too deeply integrated into the economy, it would be well to consider under what circumstances we would consider AI systems sentient and worthy of consideration as moral patients. That's hardly an original thought, but what I wonder is whether there would be any set of objective criteria that would be sufficient for society to consider AI systems sentient. If so, it might be a really good idea to work toward those being broadly recognized and agreed to, before economic incentives in the other direction are too strong. Then there could be future debate about whether/how to loosen those criteria. 

If such criteria are found, it would be ideal to have an independent organization whose mandate was to test emerging systems for meeting those criteria, and to speak out loudly if they were met.

Alternately, if it turns out that there is literally no set of criteria that society would broadly agree to, that would itself be important to know; it should in my opinion make us more resistant to building advanced systems even if alignment is solved, because we would be on track to enslave sentient AI systems if and when those emerged.

I'm not aware of any organization working on anything like this, but if it exists I'd love to know about it!

Got it, that makes sense. Thanks!

Now replace the word the transformer should define with a real, normal word and repeat the earlier experiment. You will see that it decides to predict [generic object] in a later layer

So, trying to come up with a concrete example of this: I imagine a prompt like "A typical definition of 'goy' would be: a person who is not a", where you would expect the natural completion to be " Jew" regardless of whether attention to " person" is suppressed (unlike in the empty-string case). Does that correctly capture what you're thinking of? ('goy' is a bit awkward here since it's an unusual & slangy word, but I couldn't immediately think of a better example.)
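
To make sure I'm picturing the same intervention, here's roughly the experiment I have in mind, sketched with TransformerLens (the hook names, the GPT-2 choice, and the " person" position lookup are my assumptions; treat this as illustrative rather than as your exact setup):

```python
# Rough sketch: zero out all attention paid to the " person" token and compare
# the top next-token prediction to the unmodified run. This is a crude ablation --
# the attention pattern isn't renormalized afterwards.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "A typical definition of 'goy' would be: a person who is not a"
tokens = model.to_tokens(prompt)
person_pos = model.to_str_tokens(prompt).index(" person")

def suppress_attention_to_person(pattern, hook):
    # pattern: [batch, head, query_pos, key_pos]; block all attention to " person"
    pattern[:, :, :, person_pos] = 0.0
    return pattern

def top_completion(logits):
    return model.tokenizer.decode(logits[0, -1].argmax().item())

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(lambda name: name.endswith("attn.hook_pattern"), suppress_attention_to_person)],
)

print("clean:  ", top_completion(clean_logits))    # expecting something like " Jew"
print("ablated:", top_completion(ablated_logits))  # the question is whether this changes
```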
