Tamsin Leake

I'm Tamsin Leake, co-founder and head of research at Orthogonal, doing agent foundations.


Hi!

ATA is extremely neglected. The field of ATA is at a very early stage, and currently there does not exist any research project dedicated to ATA. The present post argues that this lack of progress is dangerous and that this neglect is a serious mistake.

I agree it's neglected, but there is in fact at least one research project dedicated to at least designing alignment targets: the part of the formal alignment agenda dedicated to formal outer alignment, i.e. the design of math problems to which solutions would be world-saving. Our notable attempts at this are QACI and ESP (there was also some work on a QACI2, but it predates, and in my opinion is superseded by, ESP).

Those try to implement CEV in math. They only work for doing CEV of a single person or small group, but that's fine: just do CEV of {a single person or small group} which values all of humanity/moral-patients/whatever getting their values satisfied, rather than just that group's values. If you want humanity's values to be satisfied, then "satisfying humanity's values" is not opposed to "satisfying your own values"; it's simply the outcome of "satisfying your own values".

  1. I wonder how many of those seemingly idealistic people retained power when it was available because they were in fact only pretending to be idealistic. Assuming someone is genuinely idealistic at first but then gets corrupted in some way by having power, one thing they can do in CEV that they can't do in real life is reuse the CEV process to come up with even better CEV processes, which will be even more likely to retain/recover their just-before-launching-CEV values. Yes, many people would mess this up or fail in some other way inside CEV; but we only need one person or group who we'd be somewhat confident would do alright in CEV. Plausibly there are at least a few people, eg MIRIers, who would satisfy this. Importantly, to me, this reduces outer alignment to "find someone smart and reasonable and likely to have good goal-content integrity", which is a social & psychological question that seems much smaller than the initial full problem of formal outer alignment / alignment target design.

  2. One of the main reasons to do CEV is that we're gonna die of AI soon, and CEV is a way to buy infinite time to solve the necessary problems. Another is that even if we don't die of AI, we get eaten by various molochs instead of being able to safely solve the necessary problems at whatever pace is necessary.

the main arguments for the programmers including all of [current?] humanity in the CEV "extrapolation base" […] apply symmetrically to AIs-we're-sharing-the-world-with at the time

I think timeless values might help resolve this: if some of the {AIs that are around at the time} are moral patients, then sure, just like the other moral patients around, they should get a fair share of the future.

If an AI grabs more resources than is fair, you do the exact same thing as when a human grabs more resources than is fair: satisfy the values of moral patients (including ones who are no longer around), weighed not by how much leverage they currently have over the future, but by how much leverage they would have over the future if things had gone more fairly, i.e. if abuse/powergrabs/etc weren't the kind of thing that gets you more control of the future.

"Sorry clippy, we do want you to get some paperclips, we just don't want you to get as many paperclips as you could if you could murder/brainhack/etc all humans, because that doesn't seem to be a very fair way to allocate the future." — and in the same breath, "Sorry Putin, we do want you to get some of whatever-intrinsic-values-you're-trying-to-satisfy, we just don't want you to get as much as ruthlessly ruling Russia can get you, because that doesn't seem to be a very fair way to allocate the future."

And this can apply regardless of how much of clippy already exists by the time you're doing CEV.
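The weighting scheme above can be sketched as a toy aggregation, where outcomes are scored by each agent's utility weighted by counterfactual fair leverage rather than actual grabbed leverage. All names and numbers here are hypothetical illustrations, not anything from QACI/ESP:

```python
# Toy sketch (all names and numbers hypothetical): score outcomes by each
# agent's utility, weighted by the leverage that agent *would* have had
# under a fair baseline, not the leverage it actually grabbed.

def aggregate_utility(outcome, agents):
    """Sum of each agent's utility for the outcome, weighted by fair leverage."""
    return sum(a["fair_leverage"] * a["utility"](outcome) for a in agents)

# Suppose clippy grabbed nearly all actual resources, but under the fair
# baseline would only control a 10% share; that counterfactual share is
# what counts in the aggregation.
agents = [
    {"name": "clippy", "fair_leverage": 0.1, "utility": lambda o: o["paperclips"]},
    {"name": "humans", "fair_leverage": 0.9, "utility": lambda o: o["flourishing"]},
]

outcome_a = {"paperclips": 1.0, "flourishing": 0.0}  # clippy takes everything
outcome_b = {"paperclips": 0.1, "flourishing": 0.9}  # fair-ish allocation

best = max([outcome_a, outcome_b], key=lambda o: aggregate_utility(o, agents))
```

Under this weighting, clippy still gets some paperclips (its fair-leverage term is nonzero), but the murder-everyone outcome scores poorly regardless of how much of clippy already exists when CEV runs.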

trying to solve morality by themselves

It doesn't have to be by themselves; they can defer to others inside CEV, or come up with better schemes than their initial CEV inside CEV and then defer to those. Whatever solutions other than "solve everything on your own inside CEV" might exist, they can figure those out and defer to them from inside CEV. At least that's the case in my own attempts at implementing CEV in math (eg QACI).

Seems really wonky and like there could be a lot of things that could go wrong in hard-to-predict ways, but I guess I sorta get the idea.

I guess one of the main things I'm worried about is that it seems to require that we either:

  • Be really good at timing when we pause it to look at its internals, such that we look at the internals after it's had long enough to think about things that there are indeed such representations, but not long enough that it started optimizing really hard such that we either {die before we get to look at the internals} or {the internals are deceptively engineered to brainhack whoever would look at them}. If such a time interval even occurs for any amount of time at all.
  • Have an AI that is powerful enough to have powerful internals-about-QACI to look at, but corrigible enough that this power is not being used to do instrumentally convergent stuff like eat the world in order to have more resources with which to reason.

Current AIs are not representative of what dealing with powerful optimizers is like; when we start getting powerful optimizers, they won't sit around long enough for us to look at them and ponder, they'll just quickly eat us.

So the formalized concept is Get_Simplest_Concept_Which_Can_Be_Informally_Described_As("QACI is an outer alignment scheme consisting of…")? That is, an informal definition written in English?

It seems like "natural latent" here just means "simple (in some simplicity prior)". If I read the first line of your post as:

Has anyone thought about how QACI could be located in some simplicity prior, by searching the prior for concepts matching (in some way?) some informal description in English?

It sure sounds like I should read the two posts you linked (perhaps especially this one), despite how hard I keep bouncing off of the natural latents idea. I'll give that a try.

To me kinda the whole point of QACI is that it tries to actually be fully formalized. Informal definitions seem very much not robust to when superintelligences think about them; fully formalized definitions are the only thing I know of that keep meaning the same thing regardless of what kind of AI looks at it or with what kind of ontology.

I don't really get the whole natural latents ontology at all, and mostly expect it to be too weak for us to be able to get reflectively stable goal-content integrity even as the AI becomes vastly superintelligent. If definitions are informal, that feels to me like degrees of freedom in which an ASI can just pick whichever values make its job easiest.

Perhaps something like this lets us use current, non-vastly-superintelligent AIs to help design a formalized version of QACI or ESP which is itself robust enough to be passed to superintelligent optimizers; but my response to this is usually "have you tried formalizing CEV/QACI/ESP by hand first?", because it feels like we've barely tried, and like reasonable progress can be made on it that way.

Perhaps there are some cleverer schemes where the superintelligent optimizer is pointed at the weaker current-tech-level AI, itself pointed informally at QACI, and we tell the superintelligent optimizer "do what this guy says"; but that seems like it either leaves too many degrees of freedom to the superintelligent optimizer again, or it requires solving corrigibility (the superintelligent optimizer is corrigibly assisting the weaker AI) at which point why not just point the corrigibility at the human directly and ignore QACI altogether, at least to begin with.

The knightian in IB is related to limits on what hypotheses you can possibly find/write down, not (if I understand so far) to an adversary. The adversary stuff is, afaict, mostly to make the proofs work.

I don't think this makes a difference here? If you say "what's the best not-blacklisted-by-any-knightian-hypothesis action", then it doesn't really matter whether you're thinking of your knightian hypotheses as adversaries trying to screw you over by blacklisting actions that are fine, or as a more abstract worst-case scenario. In both cases, for any reasonable action, there's probably a knightian hypothesis which blacklists it.

Regardless of whether you think of it as "because adversaries" or just "because we're cautious", knightian uncertainty works the same way. The issue is fundamental to doing maximin over knightian hypotheses.
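The maximin structure being discussed can be made concrete with a toy sketch (the hypotheses and payoffs are made up for illustration; this is not IB's actual machinery): the chosen action is the one whose worst score across all knightian hypotheses is highest, so it behaves the same whether you narrate the hypotheses as adversaries or as caution.

```python
# Toy maximin over knightian hypotheses (all payoffs hypothetical).
# Each hypothesis assigns a worst-case score to each action; one
# hypothesis effectively "blacklists" acting, another blacklists passivity.

actions = ["do_nothing", "act_normally", "act_boldly"]

hypotheses = [
    {"do_nothing": 0.0,   "act_normally": -10.0, "act_boldly": -100.0},
    {"do_nothing": -50.0, "act_normally": 5.0,   "act_boldly": 10.0},
]

def maximin(actions, hypotheses):
    """Pick the action whose minimum score over all hypotheses is largest."""
    return max(actions, key=lambda a: min(h[a] for h in hypotheses))
```

Note that every action here is heavily penalized by *some* hypothesis; maximin just selects the least-bad one, and nothing in the computation cares whether the penalizing hypothesis was "an adversary" or "abstract caution".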

This is indeed a meaningful distinction! I'd phrase it as:

  • Values about what the entire cosmos should be like
  • Values about what kind of places one wants one's (future) selves to inhabit (eg, in an internet-like upload-utopia, "what servers does one want to hang out on")

"Global" and "local" is not the worst nomenclature. Maybe "global" vs "personal" values? I dunno.

my best idea is to call the former "global preferences" and the latter "local preferences", but that clashes with the pre-existing notion of locality of preferences as the quality of terminally caring more about people/objects closer to you in spacetime

I mean, it's not unrelated! One can view a utility function with both kinds of values as a combination of two utility functions: one part that only cares about the state of the entire cosmos, and one part that only cares about what's around the agent (see also "locally-caring agents").

(One might be tempted to say "consequentialist" vs "experiential", but I don't think that's right — one can still value contact with reality in their personal/local values.)
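The decomposition above can be sketched as a weighted sum of two component utility functions; the particular terms and weights are hypothetical, just to show the shape:

```python
# Sketch of the global/local decomposition (terms and weights hypothetical):
# one utility function viewed as a combination of a cosmos-directed part
# and a part about the agent's own surroundings.

def u_global(world):
    """Cares only about the state of the entire cosmos."""
    return world["total_flourishing"]

def u_local(world):
    """Cares only about what's around this agent, eg which servers
    its future selves get to hang out on in an upload-utopia."""
    return world["quality_of_my_neighborhood"]

def utility(world, w_global=0.7, w_local=0.3):
    """Full utility as a weighted combination of the two parts."""
    return w_global * u_global(world) + w_local * u_local(world)
```

Note that the local part can still value contact with reality, which is why "consequentialist vs experiential" isn't quite the right split.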

That is, in fact, a helpful elaboration! When you said

Most people who "work on AI alignment" don't even think that thinking is a thing.

my leading hypotheses for what you could mean were:

  • Using thought, as a tool, has not occurred to most such people
  • Most such people have no concept whatsoever of cognition as being a thing, the way people in the year 1000 had no concept whatsoever of javascript being a thing.

Now, instead, my leading hypothesis is that you mean:

  • Most such people are failing to notice that there's an important process, called "thinking", which humans do but LLMs "basically" don't do.

This is a bunch more precise! For one, it mentions AIs at all.
