Jeremy Gillen

I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI.

Comments

I was trying to argue that the most natural deontology-style preferences we'd aim for are relatively stable if we actually instill them.

Though this is trivial and irrelevant if true-obedience is part of it, since that's magic that gets you anything you can describe.

if the way integrity is implemented is at all kinda similar to how humans implement it.

How do humans implement integrity?

Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the situation comes down to "you get scheming", "your behavioural tests look bad, so you try again", "your behavioural tests look fine, and you didn't have scheming, so you probably basically got the properties you wanted if you were somewhat careful".

You're just stating that you don't expect any reflective instability as an agent learns and thinks over time? I've heard you say this kind of thing before, but haven't heard an explanation. I'd love to hear your reasoning, particularly since it seems very different from how humans work, and intuitively surprising for any thinking machine that starts out as a bit of a hacky mess like us. (I could write out an object-level argument for why reflective instability is expected, but it'd take some effort and I'd want to know that you were going to engage with it.)

In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would be pretty reasonable or would at least note that they wouldn't robustly pursue the interests of the developer. That's not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.

Yes, agreed. The extra machinery and assumptions you describe seem sufficient to make sure nonconsequentialist preferences are passed to a successor.

I think an actually high integrity person/AI doesn't search for loopholes or want to search for loopholes.

If I try to condition on the assumptions you're using (which I think include a central part of the AI's preferences having a true-but-maybe-approximate pointer toward the instruction-giver's preferences, and also a desire to defer or at least flag relevant preference differences), then I agree that such an AI would not search for loopholes on the object level.

I'm not sure whether you missed the straightforward point I was trying to make about searching for loopholes, or whether you understand it and are trying to point at a scenario more relevant to your models. The straightforward point was that preference-like objects need to be robust to search. Your response reads as "imagine we have a bunch of higher-level preferences and protective machinery that are already robust to optimisation; then on the object level these can reduce the need for robustness". This is locally valid.

I don't think it's relevant, though, because we don't know how to build those higher-level preferences and protective machinery in a way that is itself very robust to the OOD push that comes from scaling up intelligence, learning, self-correcting biases, and an increased option space.

(I don't think disgust is an example of a deontological constraint, it's just an obviously unendorsed physical impulse!)

Some people reflectively endorse their own disgust at picking up insects, and wouldn't remove it if given the option. I wanted an example of a pure non-consequentialist preference, and I stand by it as a good example.

deontological constraints we want are like the human notions of integrity, loyalty, and honesty

We probably agree about this, but for the sake of flagging potential sources of miscommunication: if I think about the machinery involved in implementing these "deontological" constraints, there's a lot of consequentialist machinery involved (but it's mostly shorter-term and more local than normal consequentialist preferences).

(Overall I like these posts in most ways, and I especially appreciate the effort you put into making a model diff with your understanding of Eliezer's arguments.)

Eliezer and some others, by contrast, seem to expect ASIs to behave like a pure consequentialist, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, or his argument that ASI will behave like a utility maximizer.

It feels like you're rounding off Eliezer's words in a way that removes the important subtlety. What you're doing here is guessing at the upstream generator of Eliezer's conclusions, right? As far as I can see in the links, he never actually says anything that translates to "I expect all ASI preferences to be over future outcomes"? It's not clear to me that Eliezer would disagree with "impure consequentialism".

I think you get closest to an argument that I believe with (2):

(2) The Internal Competition Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because in the process of reflection within the mind of any given impure-consequentialist AI, the consequentialist preferences will squash the non-consequentialist preferences.

Though I'd say it a bit differently, something like: an AI that has a non-consequentialist preference against personally committing the act of murder won't necessarily build its successor to have the same non-consequentialist preference[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors. (And building successors is a similar process to self-modification.)
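
To make that asymmetry concrete, here's a minimal toy model (my own construction, not anything from the original discussion; the successor names, payoffs, and penalty weights are arbitrary illustrative assumptions). A preference over outcomes distinguishes between candidate successors, while a constraint that binds only the agent's own actions is satisfied by every choice of successor, so nothing in it forces the constraint to be copied forward.

```python
# Toy model of the asymmetry (all names and numbers are made-up assumptions):
# an outcome-based preference discriminates between candidate successors,
# while a constraint on the agent's *own* actions is satisfied either way.

# Hypothetical successors, described by the outcomes they bring about.
successors = {
    "keeps_my_constraint": {"lives_saved": 8,  "successor_commits_murder": False},
    "drops_my_constraint": {"lives_saved": 10, "successor_commits_murder": True},
}

def consequentialist_score(outcome):
    # Preference over world-states: more lives saved is good, and murders by
    # anyone (not just by me) count heavily against the outcome.
    return outcome["lives_saved"] - 100 * outcome["successor_commits_murder"]

def my_constraint_ok(my_action):
    # Non-consequentialist constraint on *my own* actions: I must not murder.
    # Building either successor is not itself an act of murder, so both pass.
    return my_action != "commit_murder"

for name, outcome in successors.items():
    print(f"{name}: consequentialist score = {consequentialist_score(outcome)}, "
          f"my own constraint satisfied = {my_constraint_ok('build_' + name)}")
```

In this toy, the outcome-based preference favours the constraint-preserving successor only because murders-by-anyone show up in the outcome it scores; the purely agent-local constraint is indifferent between the two successors, which is the gap that the "extra machinery" would have to fill.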

As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”.

I think you're misrepresenting/misunderstanding the argument people are making here. Even when you enthusiastically apply your intelligence toward pursuing a deontological constraint (alongside other goals), you implicitly search for "loopholes" in that constraint, i.e. weird ways to achieve all of your goals that don't involve violating the constraint. To you they aren't loopholes; they're clever ways to achieve all of your goals. (The sketch after the footnote tries to make this search pressure concrete.)

  1. ^

    Perhaps this feels intuitively incorrect. If so, I claim that's because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don't want to get rid of your own disgust reaction, but you're okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.
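
Here is a minimal sketch of that search pressure, assuming a made-up action space and a hand-written rule (none of the numbers or names come from the discussion). The optimizer below never treats the rule adversarially, but because it evaluates every permitted option, the constraint ends up being tested at the strangest point the search can reach rather than at the situations its author had in mind.

```python
# Toy illustration (my construction; the action space, payoffs, and rule are
# made up): maximize a goal subject to a hand-written constraint. The search
# never "wants" to break the rule, yet the argmax lands on an extreme option
# the rule's author never anticipated.
import random

random.seed(0)

# Action space: (payoff, unusualness). Payoff tends to grow with unusualness,
# because a big search space contains extreme options nobody thought about.
actions = [(unusualness * random.random(), unusualness) for unusualness in range(1000)]

def allowed(action):
    """A literal-minded rule, written with ordinary (low-unusualness) actions in mind."""
    payoff, unusualness = action
    return not (payoff > 50 and unusualness < 500)  # bans only the anticipated bad cases

best = max(filter(allowed, actions), key=lambda a: a[0])
print(best)  # A very high-payoff, very unusual action that the rule never covered.
```

That's the sense in which "clever ways to achieve all goals" and "loopholes" are the same object, viewed from inside and outside the agent.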

I don't really know what you're referring to; maybe link a post or a quote?

whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.

Citation for this claim? Can you quote the specific passage which supports it?

If you read this post, starting at "The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.", and read the following 20 or so paragraphs, you'll get some idea of 2018!Eliezer's models about imitation agents.

I'll highlight:

If I were going to talk about trying to do aligned AGI under the standard ML paradigms, I'd talk about how this creates a differential ease of development between "build a system that does X" and "build a system that does X and only X and not Y in some subtle way". If you just want X however unsafely, you can build the X-classifier and use that as a loss function and let reinforcement learning loose with whatever equivalent of gradient descent or other generic optimization method the future uses. If the safety property you want is optimized-for-X-and-just-X-and-not-any-possible-number-of-hidden-Ys, then you can't write a simple loss function for that the way you can for X.

[...]

On the other other other hand, suppose the inexactness of the imitation is "This agent passes the Turing Test; a human can't tell it apart from a human." Then X-and-only-X is thrown completely out the window. We have no guarantee of non-Y for any Y a human can't detect, which covers an enormous amount of lethal territory, which is why we can't just sanitize the outputs of an untrusted superintelligence by having a human inspect the outputs to see if they have any humanly obvious bad consequences.

I think with a fair reading of that post, it's clear that Eliezer's models at the time didn't say that there would necessarily be overtly bad intentions that humans could easily detect from subhuman AI. You do have to read between the lines a little, because that exact statement isn't made, but if you try to reconstruct how he was thinking about this stuff at the time and then look at what that model does and doesn't expect, it answers your question.

Have you personally done the thing successfully with another person, with both of you actually picking up on the other person's hints?

Yes. But usually the escalation happens over weeks or months, over multiple conversations (at least in my relatively awkward nerd experience). So it'd be difficult to notice people doing this. Maybe twice I've been in situations where hints escalated within a day or two, but both were building from a non-zero level of suspected interest. But none of these would have been easy to notice from the outside, except maybe at a couple of moments.

Everyone agrees that sufficiently unbalanced games can allow a human to beat a god. This isn't a very useful fact, since it's difficult to intuit how unbalanced the game needs to be.

If you can win against a god with queen+knight odds, you'll have no trouble reliably beating Leela with the same odds. I'd bet you can't win more than 6 out of 10. $20?

Yeah, I didn't expect that either; I expected earlier losses (although in retrospect that wouldn't make sense, because Stockfish is capable of recovering from bad starting positions if it's up a queen).

Intuitively, over all the games I played, each loss felt different (except for the substantial fraction that were just silly blunders). I think if I learned to recognise blunders in the complex positions I would just become a better player in general, rather than just against LeelaQueenOdds.

Just tried Hex; that's fun.

I don't think that'd help a lot. I just looked back at several computer analyses, and the (Stockfish) evaluations of the games all look like this:

This makes me think that Leela is pushing me into a complex position and then letting me blunder. I'd guess that looking at optimal moves in these complex positions would be good training, but probably wouldn't yield easy-to-learn patterns.

I haven't heard of any adversarial attacks, but I wouldn't be surprised if they existed and were learnable. I've tried a variety of strategies, just for fun, and haven't found anything that works except luck. I focused on various ways of forcing trades, and this often feels like it's working but almost never does. As you can see, my record isn't great.

I think I started playing it when I read simplegeometry's comment, which you linked in your shortform.

It seems to be gaining a lot of ground by exploiting my poor openings. Maybe one strategy would be to memorise a specialised opening much deeper than usual? That could be enough. But it'd feel like cheating to me if I used an engine to find that opening. It'd also feel like cheating because it exploits Leela's lack of memory of past games; it'd be easy to modify Leela to deliberately play diverse games when facing the same person.
