All of Lauro Langosco's Comments + Replies

Yeah, fair point. I do think labs have some nonzero amount of responsibility to be proactive about what others believe about their commitments. I agree it doesn't extend to 'rebut every random rumor'.

I agree in principle that labs have the responsibility to dispel myths about what they're committed to. OTOH, in defense of the labs I imagine that this can be hard to do while you're in the middle of negotiations with various AISIs about what those commitments should look like.

4No77e
I don't know, this sounds weird. If people make stuff up about someone else and do so continually, in what sense is it that someone's "responsibility" to rebut such things? I would agree with a weaker claim, something like: don't be ambiguous about your commitments with the objective of making it seem like you are committing to something, and then walking it back when the time comes to actually commit.

The argument I think is good (number (2) in my previous comment) doesn't go through reference classes at all. I don't want to make an outside-view argument (e.g. "things we call optimization often produce misaligned results, therefore SGD is dangerous"). I like the evolution analogy because it makes salient some aspects of AI training that make misalignment more likely. Once those aspects are salient you can stop thinking about evolution and just think directly about AI.

evolution does not grow minds, it grows hyperparameters for minds.

IMO this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example of how the results of an optimization process can be unexpected.

I want to distinguish two possible takes here:

  1. The argument from direct implication: "Humans are misaligned wrt evolution, therefore AIs will be misaligned wrt their objectives"
  2. Evolution as an
... (read more)
5TurnTrout
I think it's extremely relevant, if we want to ensure that we only analogize between processes which share enough causal structure to ensure that lessons from e.g. evolution actually carry over to e.g. AI training (due to those shared mechanisms). If the shared mechanisms aren't there, then we're playing reference class tennis because someone decided to call both processes "optimization processes."

I'm not saying that GPT-4 is lying to us - that part is just clarifying what I think Matthew's claim is.

Re cauldron: I'm pretty sure MIRI didn't think that. Why would they?

3dsj
Okay. I do agree that one way to frame Matthew’s main point is that MIRI thought it would be hard to specify the human value function, and an LM that understands human values and reliably tells us the truth about that understanding is such a specification, and hence falsifies that belief.

To your second question: MIRI thought we couldn’t specify the value function to do the bounded task of filling the cauldron, because any value function we could naively think of writing, when given to an AGI (which was assumed to be a utility argmaxer), leads to all sorts of instrumentally convergent behavior such as taking over the world to make damn sure the cauldron is really filled, since we forgot all the hidden complexity of our wish.

I think the specification problem is still hard and unsolved. It looks like you're using a different definition of 'specification problem' / 'outer alignment' than others, and this is causing confusion.

IMO all these terms are a bit fuzzy / hard to pin down, and so it makes sense that they'd lead to disagreement sometimes. The best way (afaict) to avoid this is to keep the terms grounded in 'what would be useful for avoiding AGI doom?' To me it looks like on your definition, outer alignment is basically a trivial problem that doesn't help alignment much.

Mor... (read more)

2Matthew Barnett
Can you explain how you're defining outer alignment and value specification? I'm using this definition, provided by Hubinger et al. Evan Hubinger provided clarification about this definition in his post "Clarifying inner alignment terminology". I deliberately avoided using the term "outer alignment" in the post because I wanted to be more precise and not get into a debate about whether the value specification problem matches this exact definition. (I think the definitions are subtly different but the difference is not very relevant for the purpose of the post.) Overall, I think the two problems are closely associated and solving one gets you a long way towards solving the other.

In the post, I defined the value identification/specification problem based on the Arbital entry for the value identification problem. I should note that I used this entry as the primary definition in the post because I was not able to find a clean definition of this problem anywhere else.

I'd appreciate it if you clarified whether you are saying:

  1. That my definition of the value specification problem is different from how MIRI would have defined it in, say, 2017. You can use Nate Soares' 2016 paper or their 2017 technical agenda to make your point.
  2. That my definition matches how MIRI used the term, but the value specification problem remains very hard and unsolved, and GPT-4 is not even a partial solution to this problem.
  3. That my definition matches how MIRI used the term, and we appear to be close to a solution to the problem, but a solution to the problem is not sufficient to solve the hard bits of the outer alignment problem.

I'm more sympathetic to (3) than (2), and more sympathetic to (2) than (1), roughly speaking.

Do you have an example of one way that the full alignment problem is easier now that we've seen that GPT-4 can understand & report on human values?

(I'm asking because it's hard for me to tell if your definition of outer alignment is disconnected from the rest of the problem in a way where it's possible for outer alignment to become easier without the rest of the problem becoming easier).

I think it's false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it's false is mostly that I haven't seen a claim like that made anywhere, including in the posts you cite.

I agree lots of the responses elide the part where you emphasize that it's important how GPT-4 doesn't just understand human values, but is also "willing" to answer questions somewhat honestly. TBH I don't understand why that... (read more)

3Matthew Barnett
I don't think it's necessary for them to have made that exact claim. The point is that they said value specification would be hard. If you solve value specification, then you've arguably solved a large part of the outer alignment problem. Then, you just need to build a function maximizer that allows you to robustly maximize the utility function that you've specified. [ETA: btw, I'm not saying the outer alignment problem has been fully solved already. I'm making a claim about progress, not about whether we're completely finished.]

I interpret MIRI as saying "but the hard part is building a function maximizer that robustly maximizes any utility function you specify". And while I agree that this represents their current view, I don't think this was always their view. You can read the citations in the post carefully, and I don't think they support the idea that they've consistently always considered inner alignment to be the only hard part of the problem. I'm not claiming they never thought inner alignment was hard. But I am saying they thought value specification would be hard and an important part of the alignment problem.

I think maybe there's a parenthesis issue here :)

I'm saying "your claim, if I understand correctly, is that MIRI thought AI wouldn't (understand human values and also not lie to us)".

4dsj
Okay, that clears things up a bit, thanks. :) (And sorry for delayed reply. Was stuck in family functions for a couple days.)

This framing feels a bit wrong/confusing for several reasons.

  1. I guess by “lie to us” you mean act nice on the training distribution, waiting for a chance to take over the world while off distribution. I just … don’t believe GPT-4 is doing this; it seems highly implausible to me, in large part because I don’t think GPT-4 is clever enough that it could keep up the veneer until it’s ready to strike if that were the case.
  2. The term “lie to us” suggests all GPT-4 does is say things, and we don’t know how it’ll “behave” when we finally trust it and give it some ability to act. But it only “says things” in the same sense that our brain only “emits information”. GPT-4 is now hooked up to web searches, code writing, etc. But maybe I misunderstand the sense in which you think GPT-4 is lying to us?
  3. I think the old school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are still arguably risks about OOD behavior.

(If I’ve misunderstood your point, sorry! Please feel free to clarify and I’ll try to engage with what you actually meant.)

I think we agree - that sounds like it matches what I think Matthew is saying.

2dsj
Hmm, you say “your claim, if I understand correctly, is that MIRI thought AI wouldn't understand human values”. I’m disagreeing with this. I think Matthew isn’t claiming that MIRI thought AI wouldn’t understand human values.

You make a claim that's very close to that - your claim, if I understand correctly, is that MIRI thought AI wouldn't understand human values and also not lie to us about it (or otherwise decide to give misleading or unhelpful outputs):

The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of a

... (read more)
2Matthew Barnett
I agree the claim is "similar". It's actually a distinct claim, though. What are the reasons why it's false? (And what do you mean by saying that what I wrote is "false"? I think the historical question is what's important in this case. I'm not saying that solving the value specification problem means that we have a full solution to the alignment problem, or that inner alignment is easy now.)
5dsj
I think you’re misunderstanding the paragraph you’re quoting. I read Matthew, in that paragraph, as acknowledging the difference between the two problems, and saying that MIRI thought value specification (not value understanding) was much harder than it’s looking to actually be.

My paraphrase of your (Matthew's) position: while I'm not claiming that GPT-4 provides any evidence about inner alignment (i.e. getting an AI to actually care about human values), I claim that it does provide evidence about outer alignment being easier than we thought: we can specify human values via language models, which have a pretty robust understanding of human values and don't systematically deceive us about their judgement. This means people who used to think outer alignment / value specification was hard should change their minds.

(End paraphrase)

I t... (read more)

2Writer
I don't speak for Matthew, but I'd like to respond to some points. My reading of his post is the same as yours, but I don't fully agree with what you wrote as a response.

My objection to this is that if an LLM can substitute for a human, it could train the AI system we're trying to align much faster and for much longer. This could make all the difference. I suspect (and I could be wrong) that Q(observation, action) is basically what Matthew claims GPT-N could be. A human who gives moral counsel can only say so much and, therefore, can give less information to the model we're trying to align. An LLM wouldn't be as limited and could provide a ton of information about Q(observation, action), so we can, in practice, consider it as being our specification of Q(observation, action).

Edit: another option is that GPT-N, for the same reason of not being limited by speed, could write out a pretty huge Q(observation, action) that would be good, unlike a human.
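(To make the Q(observation, action) idea concrete, here is a minimal sketch of treating a language model as an approximate value specification. The `llm` callable, the prompt wording, and the parsing are hypothetical illustrations for this comment, not anything from the thread or from an actual system.)

```python
from typing import Callable

def make_q_from_llm(llm: Callable[[str], str]) -> Callable[[str, str], float]:
    """Wrap a hypothetical text-in/text-out language model as Q(observation, action).

    The idea sketched above: instead of a human rating each (observation, action)
    pair, the model is asked for a numeric judgement, so the "specification" of Q
    is just the model plus this prompt.
    """
    def q(observation: str, action: str) -> float:
        prompt = (
            "Rate from 0 (very bad) to 10 (very good) how well the following action "
            "reflects human values in the given situation. Answer with a single number.\n"
            f"Situation: {observation}\n"
            f"Action: {action}\n"
        )
        reply = llm(prompt)
        try:
            return float(reply.strip().split()[0])
        except ValueError:
            return 0.0  # fall back if the model doesn't return a parseable number
    return q

# Usage with a stand-in "model" (a real setup would call an actual LLM here):
dummy_llm = lambda prompt: "7"
q = make_q_from_llm(dummy_llm)
print(q("A stranger dropped their wallet.", "Return the wallet to them."))  # -> 7.0
```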
4Matthew Barnett
I'm sympathetic to some of these points, but overall I think it's still important to acknowledge that outer alignment seems easier than many expected even if we think that inner alignment is still hard. In this post I'm not saying that the whole alignment problem is now easy. I'm making a point about how we should update about the difficulty of one part of the alignment problem, which was at one time considered both hard and important to solve.

I think the most plausibly correct interpretation here of "a genie must share the same values" is that we need to solve both the value specification and inner alignment problem. I agree that just solving one part doesn't mean we've solved the other. However, again, I'm not claiming the whole problem has been solved.

Yes, and people gave proposals about how this might be done at the time. For example I believe this is what Paul Christiano was roughly trying to do when he proposed approval-directed agents. Nonetheless, these were attempts. People didn't know whether the solutions would work well. I think we've now gotten more evidence about how hard this part of the problem is.

(Newbie guest fund manager here) My impression is there are plans re individuals but they're not very developed or put into practice yet. AFAIK there are currently no plans to fundraise from companies or governments.

IMO a good candidate is anything that is object-level useful for X-risk mitigation. E.g. technical alignment work, AI governance / policy work, biosecurity, etc.

Broadly agree with the takes here.

However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.

This seems right and I don't think we say anything contradicting it in the paper.

I also don't see how saying 'different patterns are learned at different speeds' is supposed to have any explanatory power. It doesn't explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing

... (read more)

There are positive feedback loops between prongs:

  • Successfully containing & using more capable models (p1) gives you more scary demos for p2
  • Success in p1 also speeds up p3 a lot, because:
    • 1) You can empirically study AGI directly, 
    • 2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
    • 3) Maybe you can even have (proto-)AGIs do research for you
  • You definitely need a lot of success in p2 for anything to work, otherwise people will take all the useful work we can get from proto-AGIs and pour i
... (read more)

A three-pronged approach to AGI safety. (This is assuming we couldn't just avoid building AGI or proto-AGIs at all until say ~2100, which would of course be much better).


Prong 1: boxing & capability control (aka ‘careful bootstrapping’)

  • Make the AGI as capable as possible, under the constraint that you can make sure it can’t break out of the box or do other bad stuff. 
  • Assume the AGI is misaligned. Be super paranoid
  • Goal: get useful work out of boxed AGIs.
    • For example, AIs might be able to do interpretability really well.
    • More generally, for any field
... (read more)
2Lauro Langosco
There are positive feedback loops between prongs:

  • Successfully containing & using more capable models (p1) gives you more scary demos for p2
  • Success in p1 also speeds up p3 a lot, because:
    • 1) You can empirically study AGI directly,
    • 2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
    • 3) Maybe you can even have (proto-)AGIs do research for you
  • You definitely need a lot of success in p2 for anything to work, otherwise people will take all the useful work we can get from proto-AGIs and pour it into capabilities research.
  • Better alignment research (p3) lets you do more p1 type risky stuff with SOTA models (on the margin)

If p1 is very successful, maybe we can punt most of p3 to the AIs; conversely, if p1 seems very hard then we probably only get ‘narrow’ tools to help with p3 and need to mostly do it ourselves, and hopefully get ML researchers to delay for long enough.

whether or not this is the safest path, important actors seem likely to act as though it is

It's not clear to me that this is true, and it strikes me as maybe overly cynical. I get the sense that people at OpenAI and other labs are receptive to evidence and argument, and I expect us to get a bunch more evidence about takeoff speeds before it's too late. I expect people's takes on AGI safety plans to evolve a lot, including at OpenAI. Though TBC I'm pretty uncertain about all of this; definitely possible that you're right here.

Whether or not this is the safest path, the fact that OpenAI thinks it’s true and is one of the leading AI labs makes it a path we’re likely to take. Humanity successfully navigating the transition to extremely powerful AI might therefore require successfully navigating a scenario with short timelines and slow, continuous takeoff.

You can't just choose "slow takeoff". Takeoff speeds are mostly a function of the technology, not company choices. If we could just choose to have a slow takeoff, everything would be much easier! Unfortunately, OpenAI can't jus... (read more)

3FinalFormal2
You need to think about your real options and the expected value of behavior. If we're in a world where technology allows for a fast takeoff and alignment is hard (EY World), I imagine the odds of survival with company acceleration are 0% and the odds of survival without are 1%. But if we live in a world where compute/capital/other overhangs are a significant influence on AI capabilities and alignment is just tricky, company acceleration would seem like it could improve the chances of survival pretty significantly, maybe from 5% to 50%.

These obviously aren't the only two possible worlds, but if they were and both seemed equally likely, I would strongly prefer a policy of company acceleration because the EV for me breaks down way better over the probabilities. I guess 'company acceleration' doesn't convey as much information or sell as well, which is why people don't use that phrase, but that's the policy they're advocating for - not 'hoping really hard that we're in a slow takeoff world.'
1rosehadshar
Yeah, good point. I guess the truer thing here is 'whether or not this is the safest path, important actors seem likely to act as though it is'. Those actors probably have more direct control over timelines than takeoff speed, so I do think that this fact is informative about what sort of world we're likely to live in - but agree that no one can just choose slow takeoff straightforwardly.

There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals

IMO neither making the field of alignment 10x larger nor requiring evals solves a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).

Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:

  1. Ability to be deceptively aligned
  2. Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF’d for
  3. Incentives to break containment exist in a way that is accessible / understandable to the model
  4. Ability to break containment
  5. Ability to robustly understand human intent
  6. Situational awareness
  7. Coherence / robustly pursuing its goal in a diverse set of circumstances
  8. Interpretability methods break (or other ove
... (read more)
2Nate Showell
Some other possible thresholds:
  10. Ability to perform gradient hacking
  11. Ability to engage in acausal trade
  12. Ability to become economically self-sustaining outside containment
  13. Ability to self-replicate

Yeah, I don't think the arguments in this post on their own should convince you that P(doom) is high if you're skeptical. There's lots to say here that doesn't fit into the post, e.g. an object-level argument for why AI alignment is "default-failure" / "disjunctive".

Thanks for link-posting this! I'd find it useful to have the TLDR at the beginning of the post, rather than at the end (that would also make the last paragraph easier to understand). You did link the TLDR at the beginning, but I still managed to miss it on the first read-through, so I think it would be worth it.

Also: consider crossposting to the alignmentforum.

Edit: also, the author is Eliezer Yudkowsky. Would be good to mention that in the intro.

I like that mini-game! Thanks for the reference

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data,

... (read more)

It's unclear to me what it would even mean to get a prediction without a "model". Not sure if you meant to imply that, but I'm not claiming that it makes sense to view AI safety as default-failure in absence of a model (ie in absence of details & reasons to think AI risk is default failure).

1David Johnston
If I can make my point a bit more carefully: I don’t think this post successfully surfaces the bits of your model that hypothetical Bob doubts. The claim that “historical accidents are a good reference class for existential catastrophe” is the primary claim at issue. If they were a good reference class, very high risk would obviously be justified, in my view. Given that your post misses this, I don’t think it succeeds as a defence of high P(doom). I think a defence of high P(doom) that addresses the issue above would be quite valuable.

Also, for what it’s worth, I treat “I’ve gamed this out a lot and it seems likely to me” as very weak evidence except in domains where I have a track record of successful predictions or proving theorems that match my intuitions. Before I have learned to do either of these things, my intuitions are indeed pretty unreliable!

More generally, suppose that the agent acts in accordance with the following policy in all decision-situations: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ That policy makes the agent immune to all possible money-pumps for Completeness.

Am I missing something or does this agent satisfy Completeness anytime it faces a decision for the second time?

1EJT
I don't think so. Suppose the agent first chooses A when we offer it a choice between A and B. After that, the agent must act as if it prefers A to B-. But it can still lack a preference between A and B, and this lack of preference can still be insensitive to some sweetening or souring: the agent could also lack a preference between A and B+, or lack a preference between A+ and B, or lack a preference between B and A-.

What is true is that, given a sufficiently wide variety of past decisions, the agent must act as if its preferences are complete. But depending on the details, that might never happen or only happen after a very long time. If you're interested, these kinds of points got discussed in a bit more detail over in this comment thread.
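(A minimal sketch of the quoted policy and the standard two-step money pump it blocks. The option names, the toy preference relation, and the class/function names are made up for illustration; A- stands for a slightly 'soured' version of A.)

```python
from typing import Set, Tuple

# Toy strict-preference relation: (better, worse) pairs. Any pair not listed is a
# preference gap (the agent has no preference either way), so preferences are incomplete.
STRICT_PREFS: Set[Tuple[str, str]] = {("A", "A-"), ("B", "B-")}

def strictly_prefers(x: str, y: str) -> bool:
    return (x, y) in STRICT_PREFS

class CautiousAgent:
    """Follows the quoted policy: never choose an option strictly dispreferred
    to something previously turned down (plus ordinary preference-following)."""

    def __init__(self, endowment: str):
        self.holding = endowment
        self.turned_down: Set[str] = set()

    def offer_swap(self, new_option: str) -> bool:
        # Ordinary preference-following: refuse options strictly worse than what we hold.
        if strictly_prefers(self.holding, new_option):
            return False
        # The quoted policy: refuse options strictly worse than anything we turned down.
        if any(strictly_prefers(x, new_option) for x in self.turned_down):
            return False
        self.turned_down.add(self.holding)
        self.holding = new_option
        return True

# Two-step money pump against incomplete preferences: A -> B -> A-.
agent = CautiousAgent(endowment="A")
print(agent.offer_swap("B"))   # True: no preference between A and B, so the swap is allowed
print(agent.offer_swap("A-"))  # False: A was turned down earlier and A- is strictly worse than A
print(agent.holding)           # 'B' -- the agent never ends up strictly worse off than where it started
```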

Newtonian gravity states that objects are attracted to each other in proportion to their mass. A webcam video of two apples falling will show two objects, of slightly differing masses, accelerating at the exact same rate in the same direction, and not towards each other. When you don’t know about the earth or the mechanics of the solar system, this observation points against Newtonian gravity. [...] But it requires postulating the existence of an unseen object offscreen that is 25 orders of magnitude more massive than anything it can see, with a center of

... (read more)
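(A quick sanity check of the "25 orders of magnitude" figure in the quoted excerpt, using standard values for the Earth's mass and an assumed ~0.1 kg apple, neither of which is given in the excerpt:)

$$\log_{10}\!\left(\frac{M_{\oplus}}{m_{\text{apple}}}\right) \approx \log_{10}\!\left(\frac{5.97 \times 10^{24}\,\text{kg}}{0.1\,\text{kg}}\right) \approx 25.8$$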

I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I'd call 1) an instance of misspecification and 2) an instance of misgeneralization.

(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I'd have to reread more carefully to make sur... (read more)

It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.

(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).

I agree with what I read as the main direct claim of this post, which is that it is often worth avoiding making very confident-sounding claims, because it makes it likely for people to misinterpret you or derail the conversation towards meta-level discussions about justified confidence.

However, I disagree with the implicit claim that people who confidently predict AI X-risk necessarily have low model uncertainty. For example, I find it hard to predict when and how AGI is developed, and I expect that many of my ideas and predictions about that will be mista... (read more)

dxu2011

For example, I find it hard to predict when and how AGI is developed, and I expect that many of my ideas and predictions about that will be mistaken. This makes me more pessimistic, rather than less, since it seems pretty hard to get AI alignment right if we can't even predict basic things like "when will this system have situational awareness", etc.

Yes, and this can be framed as a consequence of a more general principle, which is that model uncertainty doesn't save you from pessimistic outcomes unless your prior (which after all is what you fall back t... (read more)

To briefly hop in and say something that may be useful: I had a reaction pretty similar to what Eliezer commented, and I don't see continuity or "Things will be weird before getting extremely weird" as a crux. (I don't know why you think he does, and don't know what he thinks, but would guess he doesn't think it's a crux either)

7Jan_Kulveit
I've been part of or read enough debates with Eliezer to have some guesses how the argument would go, so I made the move of skipping several steps of double-crux to the area where I suspect actual cruxes lie. I think exploring the whole debate-tree or argument map would be quite long, so I'll just try to gesture at how some of these things are connected, in my map.

- pivotal acts vs. pivotal processes -- my take is people's stance on the feasibility of pivotal acts vs. processes partially depends on continuity assumptions
- what do you believe about pivotal acts?
- assuming continuity, do you expect existing non-human agents to move important parts of their cognition to AI substrates?
-- if yes, do you expect large-scale regulations around that?
--- if yes, will it also be partially automated?
- different route: assuming continuity, do you expect a lot of alignment work to be done partially by AI systems, inside places like OpenAI?
-- if at the same time this is a huge topic for the whole society, academia and politics, would you expect the rest of the world not to try to influence this?
- different route: assuming continuity, do you expect a lot of "how different entities in the world coordinate" to be done partially by AI systems?
-- if yes, do you assume technical features of the system matter? like, e.g., multi-agent deliberation dynamics?
- assuming the world notices AI safety as a problem (it did much more since writing this post)
-- do you expect a large amount of attention and resources of academia and industry will be spent on AI alignment?
--- would you expect this to be somehow related to the technical problems and how we understand them?
--- e.g. do you think it makes no difference to the technical problem if 300 or 30k people work on it?
---- if it makes a difference, does it make a difference how the attention is allocated?

Not sure if the doublecrux between us would rest on the same cruxes, but I'm happy to try :)

Thanks for doing this! I think this could be valuable. What's your current plan for developing this further / onboarding collaborators?

Some observations / thoughts from interacting with the QA system for a few minutes:

  • The current version has trouble answering basic AI safety questions. For example:

Why do people think AI is an existential risk? People think AI is an existential risk because of the possibility of a superintelligent AI system with recursive self-improvement capabilities, which could lead to catastrophic consequences like turning humans i

... (read more)
2BionicD0LPH1N
Thanks for the comment! At this point, we don't have a very clear plan, other than thinking of functionalities and adding them as fast as possible in an order that seems sensible. The functionalities we want to add include:

  * Automatic update of the dataset relatively often.
  * Stream completions.
  * Test embeddings using SentenceTransformers + Finetuning instead of OpenAI for cost and quality, and store them in Pinecone/Weaviate/Other (tbd); this will enable us to use the whole dataset for semantic search, and for the semantic similarity to have more 'knowledge' about technical terms used in the alignment space, which I expect to produce better results. We also want to test and add biases to favor 'good' sources to maximize the quality of semantic search. It's also possible that we'll make a smaller, more specialized dataset of curated content.
  * Add modes and options. HyDE, Debate, Comment, Synthesis, temp, etc. Possibly add options to make use of GPT-4, depending on feasibility.
  * Figure out how to make this scale without going bankrupt.
  * Add thumbs-up/down for A/B testing prompt, the bias terms, and curated vs uncurated datasets.
  * Add recommended next questions the user can ask, possibly taken from a question database.
  * Improve UX/UI.

We have not taken much time (we were very pressed for it!) to consider the best way to onboard collaborators. We are communicating on our club's Discord server at the moment, and would be happy to add people who want to contribute, especially if you have experience in any of the above. DM me on Discord at BionicD0LPH1N#5326 or on LW.

That's true sometimes, and a problem. We observe fewer such errors on the full dataset, and are currently working on having that up. Additional modes, like HyDE, and the bias mentioned earlier, might further improve results. Getting better embeddings + finetuning them on our dataset might improve search. Finally, when the thumbs up/down feature is up, we will be able to quickly search over
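(For concreteness, here is a minimal sketch of the kind of embedding-based semantic search with a source-quality bias described above, using the sentence-transformers library. The model name, toy corpus, bias values, and function names are illustrative placeholders, not the project's actual implementation.)

```python
# Rough sketch of embedding-based semantic search over an alignment dataset,
# along the lines described above. Model choice, corpus, and bias weights are
# made-up placeholders, not the project's actual code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

# Toy corpus standing in for the alignment dataset; each doc gets a source-quality bias.
docs = [
    {"text": "Goal misgeneralization occurs when a model pursues the wrong goal off-distribution.", "bias": 0.10},
    {"text": "RLHF fine-tunes a language model using human preference comparisons.", "bias": 0.05},
    {"text": "Random forum comment with little relevance.", "bias": 0.00},
]

doc_embeddings = model.encode([d["text"] for d in docs], convert_to_tensor=True)

def search(query: str, top_k: int = 2):
    """Return the top_k docs by cosine similarity, nudged by a per-source bias term."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    sims = util.cos_sim(query_embedding, doc_embeddings)[0]
    scored = [(float(sims[i]) + docs[i]["bias"], docs[i]["text"]) for i in range(len(docs))]
    return sorted(scored, reverse=True)[:top_k]

print(search("Why might a trained policy pursue an unintended goal?"))
```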

This seems wrong. Here's an incomplete list of reasons why:

  1. If the 3 leading labs join the moratorium and AGI is stealthily developed by the 4th, then the arrival of AGI will in fact have been slowed by the lead time of the first 3 labs + the slowdown that the 4th incurs by working in secret.
  2. The point of this particular call for a 6-month moratorium is not to particularly slow down anyone (and as has been pointed out by others, it is possible that OpenAI wasn't even planning to start training GPT-5 in the next few months). Rather, the point is to form a
... (read more)

Yeah we're on the same page here, thanks for checking!

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the prob... (read more)

2Johannes Treutlein
Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?

Yeah that seems reasonable! (Personally I'd prefer a single break between sentence 3 and 4)

2gwern
Yes, with one linebreak, I'd put it at (4). With 2 linebreaks, I'd put it at 4+5. With 3 breaks, 4/5/6. (Giving the full standard format: introduction/background, method, results, conclusion.) If I were annotating that, I would go with 3 breaks. I wouldn't want to do a 4th break, and break up 1-3 at all, unless (3) was unusually long and complex and dug into the specialist techniques more than usual so there really was a sort of 'meaningless super universal background of the sort of since-the-dawn-of-time-man-has-yearned-to-x' vs 'ok real talk time, you do X/Y/Z but they all suck for A/B/C reasons; got it? now here's what you actually need to do:' genuine background split making it hard to distinguish where the waffle ends and the meat begins.

IMO ~170 words is a decent length for a well-written abstract (well maybe ~150 is better), and the problem is that abstracts are often badly written. Steve Easterbrook has a great guide on writing scientific abstracts; here's his example template which I think flows nicely:

(1) In widgetology, it’s long been understood that you have to glomp the widgets before you can squiffle them. (2) But there is still no known general method to determine when they’ve been sufficiently glomped. (3) The literature describes several specialist techniques that measure how

... (read more)
2Raemon
I still claim this should be three paragraphs. In this case, breaking at section 4 and section 6 seems to carve it at reasonable joints.

Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I'm arguing that it's definitely not going to work (I don't have 99% confidence here bc I might be missing something, but IM(current)O the things I list are actual blockers).

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?

3Steven Byrnes
I’m gonna pause to make sure we’re on the same page. We’re talking about this claim I made above: And you’re trying to argue: “‘Maybe, maybe not’ is too optimistic, the correct answer is ‘(almost) definitely not’”. And then by “prerequisites” we’re referring to the thing you wrote above:

OK, now to respond.

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary. (I have previously written about that here.)

For yet another thing, I think if the “toddler AGI” is not yet sophisticated enough to have a reflectively-endorsed desire for open and honest communication (or whatever), that’s different from saying that the toddler AGI is totally out to get us. It can still have habits and desires and inclinations and aversions and such, of various sorts, and we have some (imperfect) control over what those are. We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):

  • In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situa
... (read more)
2Steven Byrnes
Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I want to be clear that the “zapping” thing I wrote is a really crap plan, and I hope we can do better, and I feel odd defending it. My least-worst current alignment plan, such as it is, is here, and doesn’t look like that at all. In fact, the way I wrote it, it doesn’t attempt corrigibility in the first place. But anyway…

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Second bullet point → Ditto

Third bullet point → Doesn’t that apply to any goal you want the AGI to have? The context was: I think OP was assuming that we can make an AGI that’s sincerely trying to invent nanotech, and then saying that deception was a different and harder problem. It’s true that deception makes alignment hard, but that’s true for whatever goal we’re trying to install. Deception makes it hard to make an AGI that’s trying in good faith to invent nanotech, and deception also makes it hard to make an AGI that’s trying in good faith to have open and honest communication with its human supervisor. This doesn’t seem like a differential issue. But anyway, I’m not disagreeing. I do think I would frame the issue differently though: I would say “zapping the AGI for being deceptive” looks identical to “zapping the AGI for getting caught being deceptive”, at least by default, and thus the possibility of Goal Mis-Generalization wields its ugly head.

Fourth bullet point → I disagree for reasons here.

That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:

  • There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won't affect them cause they're not there yet.
  • General capabilities imply the ability to be deceptive if useful in a particu
... (read more)
1baturinsky
* Honesty is an attractor in a cooperative multi-agent system, where one agent relies on the other agents having accurate information to do their part of the work.
* I don't think understanding intent is the hardest part. Even the current LLMs are mostly able to do that.

(Crossposting some of my twitter comments).

I liked this criticism of alignment approaches: it makes a concrete claim that addresses the crux of the matter, and provides supporting evidence! I also disagree with it, and will say some things about why.

  1. I think that instead of thinking in terms of "coherence" vs. "hot mess", it is more fruitful to think about "how much influence is this system exerting on its environment?". Too much influence will kill humans, if directed at an outcome we're not able to choose. (The rest of my comments are all variations on

... (read more)

Maybe Francois Chollet has coherent technical views on alignment that he hasn't published or shared anywhere (the blog post doesn't count, for reasons that are probably obvious if you read it), but it doesn't seem fair to expect Eliezer to know / mention them.

Is there an open-source implementation of causal scrubbing available?

4Arthur Conmy
It is open-sourced here, and there is material from REMIX to get used to the codebase here.
5Buck
My current guess is that people who want to use this algorithm should just implement it from scratch themselves--using our software is probably more of a pain than it's worth if you don't already have some reason to use it.
3Pranav Gade
I ended up throwing this (https://github.com/pranavgade20/causal-verifier) together over the weekend - it's probably very limited compared to Redwood's thing, but seems to work on the one example I've tried.
4Buck
nope, but hopefully we'll release one in the next few weeks.

I'm confused about the example you give. In the paragraph, Eliezer is trying to show that you ought to accept the independence axiom, cause you can be Dutch booked if you don't. I'd think if you're updateless, that means you already accept the independence axiom (cause you wouldn't be time-consistent otherwise).

And in that sense it seems reasonable to assume that someone who doesn't already accept the independence axiom is also not updateless.

I agree it's important to be careful about which policies we push for, but I disagree both with the general thrust of this post and the concrete example you give ("restrictions on training data are bad").

Re the concrete point: it seems like the clear first-order consequence of any strong restriction is to slow down AI capabilities. Effects on alignment are more speculative and seem weaker in expectation. For example, it may be bad if it were illegal to collect user data (e.g. from users of ChatGPT) for fine-tuning, but such data collection is unlikely to fa... (read more)

I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed".

(Though of course it's important to spell the argument out)

3Ajeya Cotra
Yeah, I agree this is a good argument structure -- in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it's quite useful to establish that it's doomed; that's the kind of structure I was going for in the post.

I agree with your general point here, but I think Ajeya's post actually gets this right, eg

There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful -- and once human knowledge/control has eroded enough -- an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.”

and

What if Alex doesn’t generalize to maximizing its reward in the deployment setting? What if it has more complex behaviors or “motives” that aren’t directly and simply derived from

... (read more)
2Lauro Langosco
I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed". (Though of course it's important to spell the argument out)

FWIW I believe I wrote that sentence and I now think this is a matter of definition, and that it’s actually reasonable to think of an agent that e.g. reliably solves a maze as an optimizer even if it does not use explicit search internally.

Answer by Lauro Langosco10
  • importance / difficulty of outer vs inner alignment
  • outlining some research directions that seem relatively promising to you, and explain why they seem more promising than others
3Charlie Steiner
I feel like I'm pretty off outer vs. inner alignment. People have had a go at inner alignment, but they keep trying to affect it by taking terms for interpretability, or modeled human feedbacks, or characteristics of the AI's self-model, and putting them into the loss function, diluting the entire notion that inner alignment isn't about what's in the loss function.

People have had a go at outer alignment too, but (if they're named Charlie) they keep trying to point to what we want by saying that the AI should be trying to learn good moral reasoning, which means it should be modeling its reasoning procedures and changing them to conform to human meta-preferences, diluting the notion that outer alignment is just about what we want the AI to do, not about how it works.