All of Chris van Merwijk's Comments + Replies

Overall, compared to the previous question, there was more of a consensus, with 55% of people responding that there is a 0% chance that technologically induced vacuum decay is possible.

Since anywhere near 0% seems way overconfident to me at first sight, just a random highly speculative unsubstantiated thought: Could this be partly motivated reasoning, that they're afraid of a backlash against physics funding or something?

Their stated justification was primarily that the Standard Model of particle physics predicts metastability

Just to be sure, does this mean 
1. That the standard model predicts that metastability is possible? i.e. it is consistent with the standard model for there to be metastability; or
2. If the standard model is correct, and certain empirical observations are correct, then we must be in a metastable state. i.e. the standard model together with certain empirical observations implies our actual universe is metastable?

I may be confused somehow. Feel free to ignore. But:
* At first I thought you meant the input alphabet to be the colors, not the operations.
* Instead, am I correct that "the free operad generated by the input alphabet of the tree automaton" is an operad with just one color, and the "operations" are basically all the labeled trees where labels of the nodes are the elements of the alphabet, such that the number of children of a node is always equal to the arity of that label in the input alphabet?
* That would make sense, as the algebra would then I guess assi... (read more)
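To make the second bullet concrete, here is a rough sketch in Python of that reading (the alphabet and its arities are made up for illustration): one color, and the "operations" are exactly the labeled trees in which every node has as many children as the arity of its label.

```python
# Illustrative only: a hypothetical tree-automaton input alphabet with fixed arities.
ARITY = {"f": 2, "g": 1, "c": 0}

def is_operation(tree):
    """tree = (label, [subtrees]). Valid element of the free operad iff every
    node's child count equals the arity of its label in the input alphabet."""
    label, children = tree
    if ARITY.get(label) != len(children):
        return False
    return all(is_operation(child) for child in children)

# f(g(c), c) respects the arities; f(c) does not.
print(is_operation(("f", [("g", [("c", [])]), ("c", [])])))  # True
print(is_operation(("f", [("c", [])])))                      # False
```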

2Vanessa Kosoy
You now understand correctly. The reason I switch to colored operads is to add even more generality. My key use case is when the operad consists of terms-with-holes in a programming language, in which case the colors are the types of the terms/holes.

More precisely, they are algebras over the free operad generated by the input alphabet of the tree automaton

Wouldn't this fail to preserve the arity of the input alphabet? i.e. you can have trees where a given symbol occurs multiple times, and with different amounts of children? That wouldn't be allowed from the perspective of the tree automaton right?

2Vanessa Kosoy
No? The elements of an operad have fixed arity. When defining a free operad you need to specify the arity of every generator.

Noosphere, why are you responding for a second time to a false interpretation of what Eliezer was saying, directly after he clarified this isn't what he meant? 

2Noosphere89
Okay, maybe he clarified that there was no thermodynamics-like blocker to getting a plan in principle to align AI, but I didn't interpret Eliezer's clarification to rule that out immediately, so I wanted to rule that interpretation out. I didn't see the interpretation as false when I wrote it, because I believed he only ruled out a decisive blocker to getting behaviors you don't know how to verify.

Here is an additional reason why it might seem less useful than it actually is: Maybe the people whose research direction is being criticized do process the criticism and change their views, but do not publicly show that they change their mind because it seems embarrassing. It could be that it takes them some time to change their mind, and by that time there might be a bigger hurdle to letting you know that you were responsible for this, so they keep it to themselves. Or maybe they themselves aren't aware that you were responsible. 

but note that the gradual problem makes the risk of coups go up.

Just a request for editing the post to clarify: do you mean coups by humans (using AI), coups by autonomous misaligned AI, or both?

EDIT 3/5/24: In the comments for Counting arguments provide no evidence for AI doom, Evan Hubinger agreed that one cannot validly make counting arguments over functions. However, he also claimed that his counting arguments "always" have been counting parameterizations, and/or actually having to do with the Solomonoff prior over bitstrings.

As one of Evan's co-authors on the mesa-optimization paper from 2019 I can confirm this. I don't recall ever thinking seriously about a counting argument over functions. 

I'm trying to figure out to what extent the character/ground layer distinction is different from the simulacrum/simulator distinction. At some points in your comment you seem to say they are mutually inconsistent, but at other points you seem to say they are just different ways of looking at the same thing.

"The key difference is that in the three-layer model, the ground layer is still part of the model's "mind" or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics - it's not a mind at all, but rather the rul... (read more)

Minor quibble: It's a bit misleading to call B "experience curves", since it is also about capital accumulation and shifts in labor allocation. Without any additional experience/learning, if demand for candy doubles, we could simply build a second candy factory that does the same thing as the first one, and hire the same number of workers for it.

I just want to register a prediction that I think something like Meta's Coconut will in the long run in fact perform much better than natural-language CoT. Perhaps not in this time-frame, though.

I suspect you're misinterpreting EY's comment.

Here was the context:
"I think controlling Earth's destiny is only modestly harder than understanding a sentence in English - in the same sense that I think Einstein was only modestly smarter than George W. Bush. EY makes a similar point.

You sound to me like someone saying, sixty years ago: "Maybe some day a computer will be able to play a legal game of chess - but simultaneously defeating multiple grandmasters, that strains credibility, I'm afraid." But it only took a few decades to get from point A to point B.... (read more)

"It's fine to say that this is a falsified prediction"

I wouldn't even say it's falsified. The context was: "it only took a few decades to get from [chess computer can make legal chess moves] to [chess computer beats human grandmaster]. I doubt that going from "understanding English" to "controlling the Earth" will take that long."

So insofar as we believe ASI is coming in less than a few decades, I'd say EY's prediction is still on track to turn out correct.

NEW EDIT: After reading three giant history books on the subject, I take back my previous edit. My original claims were correct.

Could you edit this comment to add which three books you're referring to?

2Daniel Kokotajlo
"The Conquest of Mexico and the Conquest of Peru" "1493" and... the one about Afonso I forget the title... it might have been this History of Portugal : Marques, Antonio Henrique R. de Oliveira : Free Download, Borrow, and Streaming : Internet Archive or this https://www.amazon.com/gp/search?index=books&tag=NYTBSREV-20&field-keywords=Conquerors%3A+How+Portugal+Forged+the+First+Global+Empire+Roger+Crowley I also ended up reading "1491" and "1492."  

One of the more interesting dynamics of the past eight-or-so years has been watching a bunch of the people who [taught me my values] and [served as my early role models] and [were presented to me as paragons of cultural virtue] going off the deep end.

I'm curious who these people are.

We should expect regression towards the mean only if the tasks were selected for having high "improvement from small to Gopher-7". Were they?

The reasoning was given in the comment prior to it, that we want fast progress in order to get to immortality sooner.

"But yeah, I wish this hadn't happened."

Who else is gonna write the article? My sense is that no one (including me) is starkly stating publicly the seriousness of the situation.

"Yudkowsky is obnoxious, arrogant, and most importantly, disliked, so the more he intertwines himself with the idea of AI x-risk in the public imagination, the less likely it is that the public will take those ideas seriously"
 

I'm worried about people making character attacks on Yudkowsky (or other alignment researchers) like this. I think the people who think they can ... (read more)

4Daniel Kokotajlo
I agree that there's a need for this sort of thing to be said loudly. (I've been saying similar things publicly, in the sense of anyone-can-go-see-that-I-wrote-it-on-LW, but not in the sense of putting it into major news outlets that are likely to get lots of eyeballs) I do agree with that. I think Yudkowsky, despite his flaws,* is a better human being than most people, and a much better rationalist/thinker. He is massively underrated. However, given that he is so disliked, it would be good if the Public Face of AI Safety was someone other than him, and I don't see a problem with saying so. (*I'm not counting 'being disliked' as a flaw btw, I do mean actual flaws--e.g. arrogance, overconfidence.)

"We finally managed to solve the problem of deceptive alignment while being capabilities competitive"

??????

-1Noosphere89
Good question to ask, and I'll explain. So one of the prerequisites of deceptive alignment is that it optimizes for non-myopic goals. In particular, these are goals that are about the long term. So in order to avoid deceptive alignment, one must find a goal that is both myopic and ideally scales to arbitrary capabilities. And in a sense, that's what Pretraining from Human Feedback found, in that the goal of cross-entropy from a feedback-annotated webtext distribution is a myopic goal, and it's either on the capabilities frontier or outright the optimal goal for AIs. In particular, it has far lower alignment taxes than other schemes. In essence, the goal avoids deceptive alignment by removing one of the prerequisites of deceptive alignment. At the very least, it doesn't incentivize deceptive alignment.

"But I don't think you even need Eliezer-levels-of-P(doom) to think the situation warrants that sort of treatment."

Agreed. If a new state develops nuclear weapons, this isn't even close to creating a 10% x-risk, yet the idea of airstrikes on nuclear enrichment facilities, even though it is very controversial, has for a long time very much been an option on the table.

"if I thought the chance of doom was 1% I'd say "full speed ahead!"

This is not a reasonable view. Not on Longtermism, nor on mainstream common sense ethics. This is the view of someone willing to take unacceptable risks for the whole of humanity. 

2pseud
Why not ask him for his reasoning, then evaluate it? If a person thinks there's 10% x-risk over the next 100 years if we don't develop superhuman AGI, and only a 1% x-risk if we do, then he'd suggest that anybody in favour of pausing AI progress was taking "unacceptable risks for the whole of humanity".

Also, there is a big difference between "Calling for violence", and "calling for the establishment of an international treaty, which is to be enforced by violence if necessary". I don't understand why so many people are muddling this distinction.

You are muddling the meaning of "pre-emptive war", or even "war". I'm not trying to diminish the gravity of Yudkowsky's proposal, but a missile strike on a specific compound known to contain WMD-developing technology is not a "pre-emptive war" or "war". Again I'm not trying to diminish the gravity, but this seems like an incorrect use of the term.

"For instance, personally I think the reason so few people take AI alignment seriously is that we haven't actually seen anything all that scary yet. "

And if this "actually scary" thing happens, people will know that Yudkowsky wrote the article beforehand, and they will know who the people are that mocked it.

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from my memory it seems like they emerge sooner than you would expect if this was the only reason (given the size of the context window of gpt3).

2abramdemski
A good question. I've never seen it happen myself; so where I'm standing, it looks like short emergence examples are cherry-picked.

Therefore, the waluigi eigen-simulacra are attractor states of the LLM

It seems to me like this informal argument is a bit suspect. Actually I think this argument would not apply to Solomonoff Induction.

Suppose we have two programs that define distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes. (Equivalently, p1 samples each bit i.i.d. Bernoulli(1/2) from {0,1}, while p2 samples 0 with probability 100%.)


Suppose we use a perfect Bayesian reasoner to sample bitstrings, bu... (read more)
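The martingale point at issue here can be checked numerically. Below is a rough simulation of this setup (my own illustrative Python, not from the original comment): a perfect Bayesian reasoner over p1 and p2 samples each next bit from its posterior predictive. Individual runs drift toward one hypothesis, but the average posterior stays at the prior, since the posterior is a martingale.

```python
import random

def posterior_after(bits, prior2=0.5):
    # p1: each bit 1/2 (uniform over bitstrings); p2: all zeroes, deterministic.
    # Returns the posterior probability of p2 after observing `bits`.
    if any(b == 1 for b in bits):
        return 0.0  # a single 1 rules out p2
    like1 = 0.5 ** len(bits)
    return prior2 / (prior2 + (1 - prior2) * like1)

def sample_run(n_bits, seed):
    rng = random.Random(seed)
    bits = []
    for _ in range(n_bits):
        p2 = posterior_after(bits)
        p_zero = p2 * 1.0 + (1 - p2) * 0.5  # posterior predictive of "next bit is 0"
        bits.append(0 if rng.random() < p_zero else 1)
    return posterior_after(bits)

# Each run ends near 0 or near 1, but the mean over runs stays at the prior (0.5).
runs = [sample_run(20, seed) for seed in range(20000)]
print(sum(runs) / len(runs))  # ≈ 0.5
```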

2Cleo Nardo
Yep, you're correct. The original argument in the Waluigi mega-post was sloppy.
* If μ updated the amplitudes in a perfectly bayesian way and the context window was infinite, then the amplitudes of each premise must be a martingale. But the finite context breaks this.
* Here is a toy model which shows how the finite context window leads to the Waluigi Effect. Basically, the finite context window biases the Dynamic LLM towards premises which can be evidenced by short strings (e.g. waluigi), and biases away from premises which can't be evidenced by short strings (e.g. luigis).
* Regarding your other comment, a long context window doesn't mean that the waluigis won't appear quickly. Even with an infinite context window, the waluigi might appear immediately. The assumption that the context window is short/finite is only necessary to establish that the waluigi is an absorbing state but luigi isn't.

Linking to my post about Dutch TV: https://www.lesswrong.com/posts/TMXEDZy2FNr5neP4L/datapoint-median-10-ai-x-risk-mentioned-on-dutch-public-tv

"When LessWrong was ~dead"

Which year are you referring to here?

5Ben Pace
2016-17

Added: To give context, here's a list of number of LW posts by year:
* 2009: 852
* 2010: 1143
* 2011: 3002
* 2012: 2583
* 2013: 1973
* 2014: 1797
* 2015: 2002 (<– This should be ~1880, as we added all ~120 HPMOR posts and backdated them to 2015)
* 2016: 1303 (<– This is the most 'dead' year according to me, and the year with the fewest posts)
* 2017: 1671 (<– LW 2.0 revived in the second half of this year)
* 2018: 1709
* 2019: 2121
* 2020: 3099
* 2021: 3230
* 2022: 4538
* First quarter of 2023: 1436; if you 4x that it is 5744

(My, it's getting to be quite a lot of posts these days.)

A lot of people in AI Alignment I've talked to have found it pretty hard to have clear thoughts in the current social environment, and many of them have reported that getting out of Berkeley, or getting social distance from the core of the community has made them produce better thoughts.

What do you think is the mechanism behind this?

habryka113

I think the biggest thing is a strong, high-stakes but still quite ambiguous status-hierarchy in the Bay Area.

I think there are lots of contributors to this, but I definitely feel a very strong sense of needing to adopt certain views, to display "good judgement", and to conform to a bunch of epistemic and moral positions in order to operate in the space. This has been particularly harsh since the fall of FTX: funding is less abundant, a lot of projects are more in peril, and the stakes of being perceived as reasonable and competent, by a process that is very messy and in substantial part social, are even higher.

There is a general phenomenon where:

  • Person A has mental model X and tries to explain X with explanation Q
  • Person B doesn't get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn't actually contain the insights, but P does.
  • Person C doesn't get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: ...

It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contai... (read more)

2TurnTrout
I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I'm making an upwards update on these points having been understood by at least some thinkers, although I've also made a lot of downward updates for other reasons.

Very late reply, sorry.

Regarding "even though reward is not a kind of objective": this is a terminological issue. In my view, calling an "antecedent-computation reinforcement criterion" an "objective" matches my definition of "objective". The term "objective" is ill-defined enough that "even though reward is not a kind of objective" is a terminological claim about the word "objective", not a claim about math/the world.

The idea that RL agents "reinforce antecedent computations" is completely core to our story of deception. You could not ... (read more)

2TurnTrout
I think most ML practitioners do have implicit models of how reward chisels computation into agents, as seen with how they play around with e.g. reward shaping and such. It's that I don't perceive this knowledge to be engaged when some people reason about "optimization processes" and "selecting for high-reward models" on e.g. LW. I just continue to think "I wouldn't write RFLO the way it was written, if I had deeply and consciously internalized the lessons of OP", but it's possible this is a terminological/framing thing. Your comment does update me some, but I think I mostly retain my view here. I do totally buy that you all had good implicit models of the reward-chiseling point. FWIW, I think a bunch of my historical frustration here has been an experience of:
1. Pointing out the "reward chisels computation" point
2. Having some people tell me it's obvious, or already known, or that they already invented it
3. Seeing some of the same people continue making similar mistakes (according to me)
4. Not finding instances of other people making these points before OP
5. Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) being ~the only one doing so.
   1. If I found several comments explaining what is clearly the "reward chisels computation" point, where the comments were posted before this post, by people who weren't me or downstream of my influence, I would update against my points being novel and towards my points using different terminology.
   2. IIRC there's one comment from Wei_Dai from a few years back in this vein, but IDK of others.

The core point in this post is obviously correct, and yes people's thinking is muddled if they don't take this into account. This point is core to the Risks from learned optimization paper (so it's not exactly new, but it's good if it's explained in different/better ways).

Is the following a typo?
"So, the  ( works"

First sentence of "CoCo Equilibria".

2Diffractor
It was a typo! And it has been fixed.

Maybe you have made a gestalt-switch I haven't made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.

Is there a difference between saying:

  • A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn't ac
... (read more)
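The "reinforcement of pre-existing computations" framing above can be illustrated with a toy REINFORCE-style update (my own illustrative sketch, not a formalism anyone in this thread proposed): the reward number is never content the policy represents or pursues; it only scales the credit assigned to whatever computation (here, the sampled action) actually ran.

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, rng=random):
    """One policy-gradient step: reward only multiplies the gradient of the
    computation that actually fired (the sampled action)."""
    probs = softmax(logits)
    a = rng.choices(range(len(logits)), weights=probs)[0]
    r = reward_fn(a)  # reward function: states -> numbers
    new_logits = [l + lr * r * ((1.0 if i == a else 0.0) - p)
                  for i, (l, p) in enumerate(zip(logits, probs))]
    return new_logits, a

reward = lambda a: 1.0 if a == 2 else 0.0
logits = [0.0, 0.0, 0.0]
rng = random.Random(0)
for _ in range(500):
    logits, _ = reinforce_step(logits, reward, rng=rng)
print(softmax(logits))  # probability mass concentrates on action 2
```

Note that unrewarded actions receive no update at all here: the reward function shapes the policy only through which computations it happens to reinforce.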
4TurnTrout
Where did RFLO point it out? RFLO talks about a mesa objective being different from the "base objective" (even though reward is not a kind of objective). IIRC on my skim most of the arguments were non-mechanistic reasoning about what gets selected for. (Which isn't a knockdown complaint, but those arguments are also not about the mechanism.) Also see my comment to Evan. Like, from my POV, people are reliably reasoning about what RL "selects for" via "lots of optimization pressure" on "high reward by the formal metric", but who's reasoning about what kinds of antecedent computations get reinforced when credit assignment activates? Can you give me examples of anyone else spelling this out in a straightforward fashion? Yeah, I think it just doesn't communicate the mechanistic understanding (not even imperfectly, in most cases, I imagine). From my current viewpoint, I just wouldn't call reward an objective at all, except in the context of learned antecedent-computation-reinforcement terminal values. It's like if I said "My cake is red" when the cake is blue, I guess? IMO it's just not how to communicate the concept. Why is this reasonable? 

It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?

(Note: I am still sometimes surprised that people think certain wireheading scenarios make sense despite having read RFLO, so it's plausible to me that we really didn't communicate everything that's in my head about this.)

4TurnTrout
"Wireheading is improbable" is only half of the point of the essay. The other main point is "reward functions are not the same type of object as utility functions." I haven't reread all of RFLO recently, but on a skim, RFLO consistently talks about reward functions as "objectives", which is reasonable parlance, given that everyone else uses it, but I don't find that terminology very useful for thinking about what kinds of inner cognition will be developed in the network. Reward functions + environmental data provide a series of cognitive updates to the network, in the form of reinforcement schedules. The reward function is not necessarily an 'objective' at all. (You might have privately known about this distinction. Fine by me! But I can't back it out from a skim of RFLO, even already knowing the insight and looking for it.)

"I think in the defense-offense case the actions available to both sides are approximately the same"

If attacker has the action "cause a 100% lethal global pandemic" and the defender has the task "prevent a 100% lethal global pandemic", then clearly these are different problems, and it is a thesis, a thing to be argued for, that the latter requires largely the same skills/tech as the former (which is what this offense-defense symmetry thesis states). 

If you build an OS that you're trying to make safe against attacks, you might do e.g. what the seL4 mic... (read more)

1Rohin Shah
Yes, all of that mostly sounds right to me. I agree the formal strategy-stealing argument relies on literal symmetry; I would say the linked post is applying it to asymmetric situations, where you can recover something roughly symmetric, by assuming that both players need to first accumulate resources and power. (I think this is basically what you said.)

Kind of a delayed response, but: Could you clarify what you think is the relation between that post and mine? I think they are somehow sort of related, but not sure what you think the relation is. Are you just trying to say "this is sort of related", or are you trying to say "the strategy stealing assumption and this defense-offense symmetry thesis is the same thing"?

In the latter case: I think they are not the same thing, neither in terms of their actual meaning nor their intended purpose:

  • Strategy-stealing assumption is (in the context of AI alignment): f
... (read more)
1Rohin Shah
... And the humans have a majority of the resources / power, which requires having competitive aligned AI systems. More broadly strategy-stealing is "the player with majority resources / power can just copy the strategy of the other player". I wouldn't say the strategy-stealing assumption is about a symmetric game; it's symmetric only in that the actions available to both sides are approximately the same. The goals of the two sides are pretty different and aren't zero-sum. Similarly I think in the defense-offense case the actions available to both sides are approximately the same but the goals are pretty different (defend X vs attack X). The strategy-stealing argument as applied to defense-offense would say something like "whatever offense does to increase its resources / power is something that defense could also do to increase resources / power". E.g. if the terrorists secretly go around shooting people to decrease state power, the state could also go around secretly shooting terrorists to decrease terrorist power. Often the position with majority resources / power (i.e. the state) will have a better action than that available, and so you'll see the two groups doing different things, but "use the same strategy as the less-resourced group" is an available baseline that helps you preserve your majority resources / power. This isn't the same as your thesis. Your thesis says "the defender needs to have the same capabilities as the attacker". The strategy-stealing argument directly assumes that the defender has the same capabilities (i.e. assumes the conclusion of your thesis), and then uses that to argue that there is a lower bound on how well the majority-resourced player can do. So anyway I'd say the relation is that both theses are talking about the same sort of game / environment, and defense-offense is a central example application of the strategy-stealing argument (especially in AI alignment, where humanity + aligned AI are defending against misaligned AI at

I just had a very quick look at that site, and it seems to be a collection of various chip models with pictures of them? Is there actual information on quantities sold, etc? I couldn't find it immediately.

5Lone Pine
I just found this: http://www.transistorcount.com/
1Lone Pine
Nope. It's a site by and for collectors, and apparently what they care about is reference images of the face of old chips. You'd think that ChipDB would be a database of chips, but this one is sorely lacking. I added this comment in hopes that someone knows of a more useful (to us) database.

Yeah, I know they don't understand them comprehensively. Is this the point though? I mean they understand them at a level of abstraction necessary to do what they need, and the claim is they have basically the same kind of knowledge of computers. Hmm, I guess that isn't really communicated by my phrasing though, so maybe I should edit that

I think I communicated unclearly and it's my fault, sorry for that: I shouldn't have used the phrase "any easily specifiable task" for what I meant, because I didn't mean it to include "optimize the entire human lightcone w.r.t. human values". In fact, I was being vague and probably there isn't really a sensible notion that I was trying to point to. However, to clarify what I really was trying to say: What I mean by "hard problem of alignment" is : "develop an AI system that keeps humanity permanently safe from misaligned AI (and maybe other x risks), and ... (read more)

I'm surprised if I haven't made this clear yet, but the thing that (from my perspective) seems different between my and your view is not that Step 1 seems easier to me than it seems to you, but that the "melt the GPUs" strategy (and possibly other pivotal acts one might come up with) seems way harder to me than it seems to you. You don't have to convince me of "'any easily human-specifiable task' is asking for a really mature alignment", because in my model this is basically equivalent to fully solving the hard problem of AI alignment. 

Some reasons:

  • I
... (read more)
2Rob Bensinger
This seems very implausible to me. One task looks something like "figure out how to get an AGI to think about physics within a certain small volume of space, output a few specific complicated machines in that space, and not think about or steer the rest of the world". The other task looks something like "solve all of human psychology and moral philosophy, figure out how to get an AGI to do arbitrarily specific tasks across arbitrary cognitive domains with unlimited capabilities and free reign over the universe, and optimize the entire future light cone with zero opportunity to abort partway through if you screw anything up". The first task can be astoundingly difficult and still be far easier than that. If you're on the Moon, on Mars, deep in the Earth's crust, etc., or if you've used AGI to build fast-running human whole-brain emulations, then you can go without AGI-assisted modeling like that for a very long time (and potentially indefinitely). None of the pivotal acts that seem promising to me involve any modeling of humans, beyond the level of modeling needed to learn a specific simple physics task like 'build more advanced computing hardware' or 'build an artificial ribosome'. If humanity has solved the weak alignment problem, escaped imminent destruction via AGI proliferation, and ended the acute existential risk period, then we can safely take our time arguing about what to do next, hashing out whether the pivotal act that prevented the death of humanity violated propriety, etc. If humanity wants to take twenty years to hash out that argument, or for that matter a hundred years, then go wild! I feel optimistic about the long-term capacity of human civilization to figure things out, grow into maturity, and eventually make sane choices about the future, if we don't destroy ourselves. I'm much more concerned with the "let's not destroy ourselves" problem than with the finer points of PR and messaging when it comes to discussing afterwards whatever it was so

"you" obviously is whoever would be building the AI system that ended up burning all the GPU's (and ensuring no future GPU's are created). I don't know such a sequence of events, just as I don't know the sequence of events for building the "burn all GPU's" system, except at the level of granularity of "Step 1. build a superintelligent AI system that can perform basically any easily human-specifiable task without destroying the world. Step 2. make that system burn all GPU's indefinitely/build security services that prevent misaligned AI from destroying the worl... (read more)

2Rob Bensinger
I'd guess this is orders of magnitude harder than, e.g., 'build an AGI that can melt all the GPUs, build you a rocket to go to the Moon, and build you a Moon base with 10+ years of supplies'. Both sound hard, but 'any easily human-specifiable task' is asking for a really mature alignment science in your very first AGI systems -- both in terms of 'knowing how to align such a wide variety of tasks' (e.g., you aren't depending on 'the system isn't modeling humans' as a safety assumption), and in terms of 'being able to actually do the required alignment work on fairly short timescales'. If we succeed in deploying aligned AGI systems, I expect the first such systems to be very precariously aligned -- just barely able to safely perform a very minimal, limited set of tasks. I expect humanity, if it survives at all, to survive by the skin of our teeth. Adding any extra difficulty to the task (e.g., an extra six months of work) could easily turn a realistic success scenario into a failure scenario, IMO. So I actually expect it to matter quite a lot exactly how much extra research and engineering work and testing we require; we may not be able to afford to waste a month.

I wonder if there is a bias induced by writing this on a year-by-year basis, as opposed to some random other time interval, like 2 years. I can somehow imagine that if you take 2 copies of a human, and ask one to do this exercise in yearly intervals, and the other to do it in 2-year intervals, they'll basically tell the same story, but the second one's story takes twice as long (i.e. the second one's predictions for 2022/2024/2026 are the same as the first one's predictions for 2022/2023/2024). It's probably not that extreme, but I would be surprised if there was zero such effect, which would mean these timelines are biased downwards or upwards.

2Daniel Kokotajlo
Probably there's all sorts of subtle biases, yeah. It would be cool to see a more rigorous evaluation of them by e.g. getting a bunch of humans to generate stories with different methodologies.

yeah, I probably overstated. Nevertheless:

"CEV seems way harder to me than ..."
yes, I agree it seems way harder, and I'm assuming we won't need to do it and that we could instead "run CEV" by just actually continuing human society and having humans figure out what they want, etc. It currently seems to me that the end game is to get to an AI security service (in analogy to state security services) that protects the world from misaligned AI, and then let humanity figure out what it wants (CEV). The default is just to do CEV directly by actual human brains, b... (read more)

2Rob Bensinger
Who is "you"? What sequence of events are you imagining resulting in a permanent security service (= a global surveillance and peacekeeping force?) that prevents AGI from destroying the world, without an AGI-enabled pivotal act occurring?

Ok, I admit I read over it. I must say though that this makes the whole thing more involved than it sounded at first, since it would maybe require essentially escalating a conflict with all major military powers and still coming out on top? One possible outcome of this would be that the entire global intellectual public opinion turns against you, meaning you also possibly lose access to a lot of additional humans working with you on further alignment research? I'm not sure if I'm imagining it correctly, but it seems like this plan would either require so many elements that I'm not sure if it isn't just equivalent to solving the entire alignment problem, or otherwise it isn't actually enough.

2Rob Bensinger
This seems way too extreme to me; I expect the full alignment problem to take subjective centuries to solve. CEV seems way harder to me than, e.g., 'build nanotech that helps you build machinery to relocate your team and your AGI to the Moon, then melt all the GPUs on Earth'. Leaving the Earth is probably overkill for defensive purposes, given the wide range of defensive options nanotech would open up (and the increasing capabilities gap as more time passes and more tasks become alignable). But it provides another proof of concept that this is a much, much simpler engineering feat than aligning CEV and solving the whole of human values. Separately, I do in fact think it's plausible that the entire world would roll over (at least for ten years or so) in response to an overwhelming display of force of that kind, surprising and counter-intuitive as that sounds. I would feel much better about a plan that doesn't require this assumption; but there are historical precedents for world powers being surprisingly passive and wary-of-direct-conflict in cases like this.

But assuming that law enforcement figures out that you did this, then puts you in jail, you wouldn't be able to control the further use of such nanotech, i.e. there would just be a bunch of systems indefinitely destroying GPUs, or maybe you set a timer or some conditions on it or something. I certainly see no reason why Iceland or anyone in Iceland could get away with this unless those systems rely on completely unchecked nanosystems to which the US military has no response. Maybe all of this is what Eliezer means by "melt the GPUs", but I thought he did... (read more)

3Rob Bensinger
This would violate Eliezer's condition "including the reaction of existing political entities to that event". If Iceland melts all the GPUs but then the servers its AGI is running on get bombed, or its AGI researchers get kidnapped or arrested, then I assume that the attempted pivotal act failed and we're back to square one. (I assume this because (a) I don't expect most worlds to be able to get their act together before GPUs proliferate again and someone destroys the world with AGI; and (b) I assume there's little chance of Iceland recovering from losing its AGI or its AGI team.)

I meant: is there a link to where you've written this down somewhere? Maybe you just haven't written it down.

2Daniel Kokotajlo
I'll send you a DM.

I would be interested in reading a draft and giving feedback (FYI I'm currently a researcher in the AI safety team at FHI). 

1Darren McKee
Thank you.  I'll follow up. 

I'm also interested to read the draft, if you're willing to send it to me.
