I was trying to argue that the most natural deontology-style preferences we'd aim for are relatively stable if we actually instill them.
Though it's trivial and irrelevant if true obedience is part of it, since that's magic that gets you anything you can describe.
if the way integrity is implemented is at all kinda similar to how humans implement it.
How do humans implement integrity?
...Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the sit
In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would behave pretty reasonably, or would at least note that they wouldn't robustly pursue the interests of the developer. That's not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.
Yes, agreed. The extra machinery and assumptions you describe seem sufficient to make sure nonconsequentialist preferences are passed to a successor.
...I think an a
(Overall I like these posts in most ways, and especially appreciate the effort you put into making a model diff with your understanding of Eliezer's arguments)
Eliezer and some others, by contrast, seem to expect ASIs to behave like a pure consequentialist, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, or his argument that ASI will behave like a utility maximizer.
It feels like you're rounding off Eliezer's words in a way...
I don't really know what you're referring to, maybe link a post or a quote?
whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Citation for this claim? Can you quote the specific passage which supports it?
If you read this post, starting at "The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.", and read the following 20 or so paragraphs, you'll get some idea of 2018!Eliezer's models about imitation agents.
I'll highlight
...If I were going to
Have you personally done the thing successfully with another person, with both of you actually picking up on the other person's hints?
Yes. But usually the escalation happens over weeks or months, over multiple conversations (at least in my relatively awkward nerd experience). So it'd be difficult to notice people doing this. Maybe twice I've been in situations where hints escalated within a day or two, but both were building from a non-zero level of suspected interest. But none of these would have been easy to notice from the outside, except maybe at a couple of moments.
Everyone agrees that sufficiently unbalanced games can allow a human to beat a god. This isn't a very useful fact, since it's difficult to intuit how unbalanced the game needs to be.
If you can win against a god with queen+knight odds, you'll have no trouble reliably beating Leela with the same odds. I'd bet you can't win more than 6 out of 10? $20?
Yeah, I didn't expect that either; I expected earlier losses (although in retrospect that wouldn't make sense, because Stockfish is capable of recovering from bad starting positions if it's up a queen).
Intuitively, over all the games I played, each loss felt different (except for the substantial fraction that were just silly blunders). I think if I learned to recognise blunders in the complex positions I would become a better player in general, rather than just better against LeelaQueenOdds.
Just tried hex, that's fun.
I don't think that'd help a lot. I just looked back at several computer analyses, and the (Stockfish) evaluations of the games all look like this:
This makes me think that Leela is pushing me into a complex position and then letting me blunder. I'd guess that looking at optimal moves in these complex positions would be good training, but probably wouldn't yield easy-to-learn patterns.
I haven't heard of any adversarial attacks, but I wouldn't be surprised if they existed and were learnable. I've tried a variety of strategies, just for fun, and haven't found anything that works except luck. I focused on various ways of forcing trades, and this often feels like it's working but almost never does. As you can see, my record isn't great.
I think I started playing it when I read simplegeometry's comment you linked in your shortform.
It seems to be gaining a lot of ground by exploiting my poor openings. Maybe one strategy would be to memor...
I highly recommend reading the sequences. I re-read some of them recently. Maybe Yudkowsky's Coming of Age is the most relevant to your shortform.
One notable difficulty with talking to ordinary people about this stuff is that often, you lay out the basic case and people go "That's neat. Hey, how about that weather?" There's a missing mood, a sense that the person listening didn't grok the implications of what they're hearing.
I kinda think that people are correct to do this, given the normal epistemic environment. My model is this: Everyone is pretty frequently bombarded with wild arguments and beliefs that have crazy implications. Like conspiracy theories, political claims, spiritual claims, get-ric...
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
I think this might be wrong when it comes to our disagreements, because I don't disagree with this shortform.[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?
As long as "downstream performance" doesn't inclu
If you have an alternate theory of the likely form of first takeover-capable AGI, I'd love to hear it!
I'm not claiming anything about the first takeover-capable AGI, and I'm not claiming it won't be LLM-based. I'm just saying that there's a specific reasoning step that you're using a lot (current tech has property X, therefore AGI has property almost-X) which I think is invalid (when X is entangled with properties of AGI that LLMs don't currently have).
Maybe a slightly insulting analogy (sorry): That type of reasoning looks a lot like bad scifi ideas about...
(A small rant, sorry) In general, it seems you're massively overanchored on current AI technology, to an extent that it's stopping you from clearly reasoning about future technology. One example is the jailbreaking section:
There has been no noticeable trend toward real jailbreak resistance as LLMs have progressed, so we should probably anticipate that LLM-based AGI will be at least somewhat vulnerable to jailbreaks.
You're talking about AGI here. An agent capable of autonomously doing research, playing games with clever adversaries, detecting and patching i...
Good point, I shouldn't have said dishonest. For some reason, while writing the comment, I was thinking of it as deliberately throwing vaguely related math at the viewer and trusting that they won't understand it. But yeah, likely it's just a misunderstanding.
The way we train AIs draws on fundamental principles of computation that suggest any intellectual task humans can do, a sufficiently large AI model should also be able to do. [Universal approximation theorem on screen]
IMO it's dishonest to show the universal approximation theorem. Lots of hypothesis spaces (e.g. polynomials, sinusoids) have the same property. It's not relevant to predictions about how well the learning algorithm generalises. And that's the vastly more important factor for general capabilities.
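To spell out the "same property" claim (informal statements from memory; exact hypotheses vary by version): the universal approximation theorem says that for any continuous $f$ on a compact $K \subset \mathbb{R}^n$ and any $\varepsilon > 0$ there is a one-hidden-layer network $g$ with $\sup_{x \in K} |f(x) - g(x)| < \varepsilon$; the Weierstrass approximation theorem says the same with "network" replaced by "polynomial" on $[a, b]$. Identical logical form, and neither says anything about what a learner trained on finite data will actually find.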
I agree it’s not a valid argument. I’m not sure about ‘dishonest’ though. They could just be genuinely confused about this. I was surprised how many people in machine learning seem to think the universal approximation theorem explains why deep learning works.
If we can clearly tie the argument for AGI x-risk to agency, I think it won't have the same problem
Yeah agreed, and it's really hard to get the implications right here without a long description. In my mind, "entities" didn't trigger any association with agents, but I can see how it would for others.
This thread helped inspire me to write the brief post Anthropomorphizing AI might be good, actually.
I broadly agree that many people would be better off anthropomorphising future AI systems more. I sometimes push for this in arguments, because in my mind man...
This seems rhetorically better, but I think it is implicitly relying on instrumental goals and it's hiding that under intuitions about smartness and human competition. This will work for people who have good intuitions about that stuff, but won't work for people who don't see the necessity of goals and instrumental goals. I like Veedrac's better in terms of exposing the underlying reasoning.
I think it's really important to avoid making arguments that are too strong and fuzzy, like yours. Imagine a person reads your argument and now believes that intuitively...
Nice, you've expressed the generalization argument for expecting goal-directedness really well. Most of the post seems to match my beliefs.
I’m moderately optimistic about blackbox control (maybe 50-70% risk reduction on high-stakes failures?).
I want you to clarify what this means, and try to get at some of the latent variables behind it.
One interpretation is that you mean any specific high-stakes attempt to subvert control measures is 50-70% likely to fail. But if we kept doing approximately the same set-up after this, then an attempt would soon succeed...
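As a rough illustration (my numbers, and assuming roughly independent attempts, which is already generous): if each attempt fails with probability $p$, then the chance that all of $n$ attempts fail is $p^n$; with $p = 0.7$ and $n = 10$ that's about $0.03$, so at least one success becomes roughly 97% likely.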
It's not about building less useful technology; that's not what Abram or Ryan are talking about (I assume). The field of alignment has always been about strongly superhuman agents. You can have tech that is useful and also safe to use; there's no direct contradiction here.
Maybe one weak-ish historical analogy is explosives? Some explosives are unstable, and will easily explode by accident. Some are extremely stable, and can only be set off by a detonator. Early in the industrial chemistry tech tree, you only have access to one or two ways to make explosive...
Can you link to where RP says that?
Do you not see how they could be used here?
This one. I'm confused about what the intuitive intended meaning of the symbol is. Sorry, I see why "type signature" was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe is a boolean fact that is edited? But if so I don't know which fact it is, and I'm confused by the way you described it.
...Because we're talking about priors and their influence, all of
I'm not sure what the type signature of is, or what it means to "not take into account 's simulation". When makes decisions about which actions to take, it doesn't have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to "not take it into account"?
So the way in which the agent "gets its beliefs" about the structure of the decision theory problem is via these logical-counterfactual-conditional operation
I think you've misunderstood me entirely. Usual...
Well my response to this was:
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
But I'll expand: An agent doing that kind of game-theory reasoning needs to model the situation it's in. And to do that modelling it needs a prior. Which might be malign.
Malign agents in the prior don't feel like malign agents in the prior, from the perspective of the agent with the prior. They're just beliefs about the way the world is. You need beliefs in order to choose acti...
Yeah I know that bound, I've seen a very similar one. The problem is that mesa-optimisers also get very good prediction error when averaged over all predictions. So they exist well below the bound. And they can time their deliberately-incorrect predictions carefully, if they want to survive for a long time.
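For concreteness, the version I've seen is roughly (stated loosely from memory): $$\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\Big[\big(M(1 \mid x_{<t}) - \mu(1 \mid x_{<t})\big)^{2}\Big] \;\le\; \frac{\ln 2}{2}\, K(\mu),$$ where $\mu$ is the true environment and $M$ is the mixture, i.e. the total expected squared prediction error is bounded by a constant depending on the environment's complexity. A hypothesis that predicts correctly everywhere except on a handful of strategically timed bits only adds a finite amount to that sum, so the bound never rules it out.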
How does this connect to malign prior problems?
But why would you ever be able to solve the problem with a different decision theory? If the beliefs are manipulating it, it doesn't matter what the decision theory is.
To respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor.
Because we have the prediction error bounds.
Not ones that can rule out any of those things. My understanding is that the bounds are asymptotic or average-case in a way that makes them useless for this purpose. So if a mesa-inductor with a better prior is found first, it'll stick with the mesa-inductor. And if it has goals, it ...
You also want one that generalises well, doesn't make performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is actually about finding more effective approximate models for things in different regimes.
When we compare theories, we don't consi...
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
One thing to keep in mind is that time cut-offs will usually rule out our own universe as a hypothesis. Our universe is insanely compute inefficient.
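(A rough way to see the size of the effect, using a soft Levin-style penalty as a stand-in for a hard cut-off, which is an assumption on my part: score each program $p$ by something like $Kt(p) = |p| + \log_2 t(p)$. A faithful low-level simulation of our universe has a running time $t(p)$ so astronomically large that the $\log_2 t(p)$ term swamps everything else, so it loses to a cheap approximate predictor with a much longer description; a hard time cut-off just excludes it outright.)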
So the "hypotheses" inside your inductor won't actually end up corresponding to what we mean by a scientific hypothesis. The only reason this inductor will work at all is that it's done a brute force search over a huge space of programs until it finds one that works. Plausibly it'll just find a better efficient induction algorithm, with a sane prior.
I'm not sure whether it implies that you should be able to make a task-based AGI.
Yeah I don't understand what you mean by virtues in this context, but I don't see why consequentialism-in-service-of-virtues would create different problems than the more general consequentialism-in-service-of-anything-else. If I understood why you think it's different then we might communicate better.
(Later you mention unboundedness too, which I think should be added to difficulty here)
By unbounded I just meant the kind of task where it's always possible to do better by using...
It could still be a competent agent that often chooses actions based on the outcomes they bring about. It's just that that happens as an inner loop in service of an outer loop which is trying to embody certain virtues.
I think you've hidden most of the difficulty in this line. If we knew how to make a consequentialist sub-agent that was acting "in service" of the outer loop, then we could probably use the same technique to make a Task-based AGI acting "in service" of us. Which I think is a good approach! But the open problems for making a task-based AGI sti...
But in practice, agents represent both of these in terms of the same underlying concepts. When those concepts change, both beliefs and goals change.
I like this reason to be unsatisfied with the EUM theory of agency.
One of the difficulties in theorising about agency is that all the theories are flexible enough to explain anything. Each theory is incomplete and vague in some way, so this makes the problem worse, but even when you make a detailed model of e.g. active inference, it ends up being pretty much formally equivalent to EUM.
I think the solution to th...
I think the scheme you're describing caps the agent at moderate problem-solving capabilities. Not being able to notice past mistakes is a heck of a disability.
It's not entirely clear to me that the math works out for AIs being helpful on net relative to humans just doing it, because of the supervision required, and the trust and misalignment issues.
But on this question (for AIs that are just capable of "prosaic and relatively unenlightened ML research"), it feels like we're making shot-in-the-dark guesses. It's very unclear to me what is and isn't possible.
Thanks, I appreciate the draft. I see why it's not plausible to get started on now, since much of it depends on having AGIs or proto-AGIs to play with.
I guess I shouldn't respond too much in public until you've published the doc, but:
I think if the model is scheming it can behave arbitrarily badly in concentrated ways (either in a small number of actions or in a short period of time), but you can make it behave well in the average case using online training.
I think we kind of agree here. The cruxes remain: I think the metric for "behave well" won't be good enough for "real" large research acceleration. And "average case" means very little when it leaves room for occasional mistakes (deliberate or not) at moments when they can plausibly be gotten away with. [Edit: Or sabotage, escape, etc.]
Also, yo...
Yep this is the third crux I think. Perhaps the most important.
To me it looks like you're making a wild guess that "prosaic and relatively unenlightened ML research" is a very large fraction of the necessary work for solving alignment, without any justification that I know of?
For all the pathways to solving alignment that I am aware of, this is clearly false. I think if you know of a pathway that mostly involves "prosaic and relatively unenlightened ML research", you should write out this plan, explain why you expect it to work, and then ask OpenPhil to throw a billion dollars toward every available ML-research-capable human to do this work right now. Surely it'd be better to get started already?
I'm not entirely sure where our upstream cruxes are. We definitely disagree about your conclusions. My best guess is the "core mistake" comment below, and the "faithful simulators" comment is another possibility.
Maybe another relevant thing that looks wrong to me: you will still get slop when you train an AI to look like it's updating its beliefs in an epistemically virtuous way. You'll get outputs that look very epistemically virtuous, but it takes time and expertise to rank them in a way that reflects their actual level of epistemic virtue, just like other kinds of slop...
these are also alignment failures we see in humans.
Many of them have close analogies in human behaviour. But you seem to be implying "and therefore those are non-issues"???
There are many humans (or groups of humans) that, if you set them on the task of solving alignment, will at some point decide to do something else. In fact, most groups of humans will probably fail like this.
How is this evidence in favour of your plan ultimately resulting in a solution to alignment???
...but these systems empirically often move in reasonable and socially-beneficial
to the extent developers succeed in creating faithful simulators
There's a crux I have with Ryan which is "whether future capabilities will allow data-efficient long-horizon RL fine-tuning that generalizes well". As of last time we talked about it, Ryan says we probably will, I say we probably won't.
If we have the kind of generalizing ML that we can use to make faithful simulations, then alignment is pretty much solved. We make exact human uploads, and that's pretty much it. This is one end of the spectrum on this question.
There are weaker versions, which I...
My guess is that your core mistake is here:
When I say agents are “not egregiously misaligned,” I mean they mostly perform their work earnestly – in the same way humans are mostly earnest and vaguely try to do their job. Maybe agents are a bit sycophantic, but not more than the humans whom they would replace. Therefore, if agents are consistently “not egregiously misaligned,” the situation is no worse than if humans performed their research instead.
Obviously, all agents, having undergone training to look "not egregiously misaligned", will not look egregiousl...
(Some) acceleration doesn't require being fully competitive with humans while deference does.
Agreed. The invention of calculators was useful for research, and the invention of more tools will also be helpful.
I think AIs that can autonomously do moderate duration ML tasks (e.g., 1 week tasks), but don't really have any interesting new ideas could plausibly speed up safety work by 5-10x if they were cheap and fast enough.
Maybe some kinds of "safety work", but real alignment involves a human obtaining a deep understanding of intelligence and agency. The path ...
(vague memory from the in person discussions we had last year, might be inaccurate):
jeremy!2023: If you're expecting AI to be capable enough to "accelerate alignment research" significantly, it'll need to be a full-blown agent that learns stuff. And that'll be enough to create alignment problems because data-efficient long-horizon generalization is not something we can do.
joshc!2023: No way, all you need is AI with stereotyped skills. Imagine how fast we could do interp experiments if we had AIs that were good at writing code but dumb in other ways!
...
josh...
In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes?
We can't reduce the domain of the utility function without destroying some information. If we tried to change the domain variables from [g, h, shutdown] to [g, shutdown], we wouldn't get the desired behaviour. Maybe you have a particular translation method in mind?
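As a toy illustration of the kind of information loss I mean (a deliberately simplified stand-in, not the exact construction): suppose $u(g, h, \text{shutdown}) = g$ when shutdown is false and $u = h$ when shutdown is true. Any utility defined only on $(g, \text{shutdown})$ has to assign equal value to outcomes that differ only in $h$, so it can't express the "if shutdown, pursue $h$" half of the goal, and whatever behaviour depended on $h$ is lost in the translation.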
I don't mess up the medical test because true information is instrumentally useful to me, given my goals.
Yep that's what I meant. The goal u is constructed to make information abo...
With regards to the agent believing that it's impossible to influence the probability that its plan passes validation
This is a misinterpretation. The agent has entirely true beliefs. It knows it could manipulate the validation step; it just doesn't want to, because of the conditional shape of its goal. This is a common behaviour among humans: for example, you wouldn't mess up a medical test to make it come out negative, because you need to know the result in order to know what to do afterwards.
I agree that goals like this work well with self-modification and successors. I'd be surprised if Eliezer didn't. My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It's strawmanning. And it isn't supported by any of the links you cite. I think you must have some mistaken assumption about Eliezer's views that is lead...