All of Rob Bensinger's Comments + Replies

I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. EG my recent attempt to operationalize a bet with Nate went nowhere. Paul trying to get Eliezer to bet during the MIRI dialogues also went nowhere, or barely anywhere—I think they ended up making some random bet about how long an IMO challenge would take to be solved by AI. (feels pretty weak and unrelated to me. lame. but huge props to Paul for being so ready to bet, that made me take him a lot more seriously.)

This paragrap... (read more)

Your comments' points seem like further evidence for my position. That said, your comment appears to serve the function of complicating the conversation, and that happens to have the consequence of diffusing the impact of my point. I do not allege that you are doing so on purpose, but I think it's important to notice. I would have been more convinced by a reply of "no, you're wrong, here's the concrete bet(s) EY made or was willing to make but Paul balked." 

I will here repeat a quote[1] which seems relevant: 

[Christiano][12:29] 

my desir

... (read more)

If I was misreading the blog post at the time, how come it seems like almost no one ever explicitly predicted at the time that these particular problems would be trivial for systems below or at human-level intelligence?!?

Quoting the abstract of MIRI's "The Value Learning Problem" paper (emphasis added):

Autonomous AI systems’ programmed goals can easily fall short of programmers’ intentions. Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended. We discuss early ideas on how one might design smarte

... (read more)
4Matthew Barnett
I think you missed my point: my original comment was about whether people are updating on the evidence from instruction-tuned LLMs, which seem to actually act on human values (i.e., our actual intentions) quite well, as opposed to mis-specified versions of our intentions. I don't think the Value Learning Problem paper said that it would be easy to make human-level AGI systems act on human values in a behavioral sense, rather than merely understand human values in a passive sense. I suspect you are probably conflating two separate concepts:

  1. It is easy to create a human-level AGI that can passively learn and understand human values. (I am not saying people said this would be difficult in the past.)
  2. It is easy to create a human-level AGI that acts on human values, in the sense of actually executing instructions that follow our intentions, rather than following a dangerously mis-specified version of what we asked for.

I do not think the Value Learning Problem paper asserted that (2) was true. To the extent it asserted that, I would prefer to see quotes that back up that claim explicitly. Your quote from the paper illustrates that it's very plausible that people thought (1) was true, but that seems separate from my main point: that people thought (2) was not true. (1) and (2) are separate and distinct concepts, and my comment was about (2), not (1). There is simply a distinction between a machine that actually acts on and executes your intended commands, and a machine that merely understands your intended commands but does not necessarily act on them as you intend. I am talking about the former, not the latter. From the paper: [...] Indeed, GPT-4 does not base its decisions on a misrepresentation of its programmers' intentions, most of the time. It generally both correctly understands our intentions and, more importantly, actually acts on them!

But the benefit of a Pause is that you use the extra time to do something in particular. Why wouldn't you want to fiscally sponsor research on problems that you think need to be solved for the future of Earth-originating intelligent life to go well? 

MIRI still sponsors some alignment research, and I expect we'll sponsor more alignment research directions in the future. I'd say MIRI leadership didn't have enough aggregate hope in Agent Foundations in particular to want to keep supporting it ourselves (though I consider its existence net-positive).

My mo... (read more)

3Raemon
I realize if you had a good answer here the org would be doing different stuff, but, do you (or other MIRI folk) have any rough sense of the sort of alignment work that'd plausibly be in the left two quadrants there? (also, when you say "high EV", are you setting the "high" bar at a level that means "good enough that anyone should be prioritizing?" or "MIRI is setting a particularly high bar for alignment research right now because it doesn't seem like the most important thing to be focusing on?")
9Ebenezer Dukakis
In terms of "improve the world's general understanding of the situation", I encourage MIRI to engage more with informed skeptics. Our best hope is if there is a flaw in MIRI's argument for doom somewhere. I would guess that e.g. Matthew Barnett has spent something like 100x as much effort engaging with MIRI as MIRI has spent engaging with him, at least publicly. He seems unusually persistent -- I suspect many people are giving up, or gave up long ago. I certainly feel quite cynical about whether I should even bother writing a comment like this one.

I don't find this convincing. I think the target "dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment" is narrow (or just nonexistent, using the methods we're likely to have on hand).

Even if this exists, verification seems extraordinarily difficult: how do we know that the system is being honest? Separately, how do we verify that its solutions are correct? Checking answers is sometimes easier than generating them, but only to a limited degree, and alignment seems like a case where ch

... (read more)

one positive feature it does have: it proposes to rely on a multitude of "limited weakly-superhuman artificial alignment researchers" and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable.

I don't find this convincing. I think the target "dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment" is narrow (or just nonexistent, using the methods we're likely to have on hand).

Even if this exists, verification seems extraordinarily difficu... (read more)

5Rob Bensinger
It's also important to keep in mind that on Leopold's model (and my own), these problems need to be solved under a ton of time pressure. To maintain a lead, the USG in Leopold's scenario will often need to figure out some of these "under what circumstances can we trust this highly novel system and believe its alignment answers?" issues in a matter of weeks or perhaps months, so that the overall alignment project can complete in a very short window of time. This is not a situation where we're imagining having a ton of time to develop mastery and deep understanding of these new models. (Or mastery of the alignment problem sufficient to verify when a new idea is on the right track or not.)

As a start, you can prohibit sufficiently large training runs. This isn't a necessary-and-sufficient condition, and doesn't necessarily solve the problem on its own, and there's room for debate about how risk changes as a function of training resources. But it's a place to start, when the field is mostly flying blind about where the risks arise; and choosing a relatively conservative threshold makes obvious sense when failing to leave enough safety buffer means human extinction. (And when algorithmic progress is likely to reduce the minimum dangerous train... (read more)

Alternatively, they either don't buy the perils, or believe there's a chance the other side may not?

If they "don't buy the perils", and the perils are real, then Leopold's scenario is falsified and we shouldn't be pushing for the USG to build ASI.

If there are no perils at all, then sure, Leopold's scenario and mine are both false. I didn't mean to imply that our two views are the only options.

Separately, Leopold's model of "what are the dangers?" is different from mine. But I don't think the dangers Leopold is worried about are dramatically easier to und... (read more)

2O O
It is well known that nuclear weapons result in MAD, or localized annihilation. They were still built. But my more important point is that this sort of thinking requires most people to be convinced there is a high p(doom) and, more importantly, also convinced that the other side believes there is a high p(doom). If either of those is false, then not building doesn't work. If the other side is building it, then you have to build it anyway, just in case your theoretical p(doom) arguments are wrong. Again, this is just arguing your way around a pretty basic prisoner's dilemma.

And consider that we will develop AGIs (note: not ASI) anyway, and alignment (or at least control) will almost certainly work for them.[1] The prisoner's dilemma indicates you have to match the drone-warfare capabilities of the other side regardless of p(doom). In the world where the USG understands there are risks but thinks of it as something with decent odds of being solvable, we build it anyway. The gameboard is a 20% chance of dying and an 80% chance of handing the light cone to your enemy, if the other side builds it and you do not. I think this is the most probable option, making all Pause efforts doomed. High-p(doom) folks can't even convince low-p(doom) folks on LessWrong, the subset of optimists most likely to be receptive to their arguments, that they are wrong. There is no chance you won't simply be a faction in the USG, like environmentalists are.

But let's pretend for a moment that the USG buys the high-risk doomer argument for superintelligence. The USG and CCP are both rushing to build AGIs regardless, since AGI can be controlled and not having a drone swarm means you lose military relevance. Because of how fuzzy the line between ASI and AGI will be in this world, I think it's very plausible that enough people will be convinced the CCP isn't convinced alignment is too hard, and will build it anyway.

Even people with high p(doom)'s might have a nagging part of their mind saying that what if alignment
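(A minimal sketch, in Python, of the expected-value comparison this comment gestures at. The 20%/80% split comes from the comment itself; the utility numbers and parameter names are illustrative placeholders I've added, not claims by either side.)

```python
# Sketch of the payoff comparison in the comment above.
# The 20%/80% split is the comment's; the utilities are illustrative placeholders.

p_doom = 0.20        # chance that a push to ASI ends with everyone dying
u_doom = 0.0         # everyone dies
u_rival_wins = 0.1   # the rival power controls the light cone
u_we_win = 1.0       # we control the light cone

# Expected value if the other side builds ASI and we do not:
ev_abstain = p_doom * u_doom + (1 - p_doom) * u_rival_wins

# Expected value if we race, assuming (for illustration) the same doom odds either way:
ev_race = p_doom * u_doom + (1 - p_doom) * u_we_win

print(f"EV(abstain while rival builds) = {ev_abstain:.2f}")  # 0.08
print(f"EV(race)                       = {ev_race:.2f}")     # 0.80
# Under these assumptions racing dominates abstaining -- the prisoner's-dilemma
# structure the comment appeals to, and the premise the replies dispute.
```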

Why? 95% risk of doom isn't certainty, but seems obviously more than sufficient.

For that matter, why would the USG want to build AGI if they considered it a coinflip whether this will kill everyone or not? The USG could choose the coinflip, or it could choose to try to prevent China from putting the world at risk without creating that risk itself. "Sit back and watch other countries build doomsday weapons" and "build doomsday weapons yourself" are not the only two options.

Why? 95% risk of doom isn't certainty, but seems obviously more than sufficient.

If AI itself leads to doom, it likely doesn't matter whether it was developed by US Americans or by the Chinese. But if it doesn't lead to doom (the remaining 5%) it matters a lot which country is first, because that country is likely to achieve world domination.

The USG could choose the coinflip, or it could choose to try to prevent China from putting the world at risk without creating that risk itself.

Short of choosing a nuclear war with China, the US can't do much to d... (read more)

Leopold's scenario requires that the USG come to deeply understand all the perils and details of AGI and ASI (since they otherwise don't have a hope of building and aligning a superintelligence), but then needs to choose to gamble its hegemony, its very existence, and the lives of all its citizens on a half-baked mad science initiative, when it could simply work with its allies to block the tech's development and maintain the status quo at minimal risk.

Success in this scenario requires a weird combination of USG prescience with self-destructiveness: enough... (read more)

1O O
Alternatively, they either don't buy the perils, or believe there's a chance the other side may not? I think there is an assumption made in this statement and in a lot of proposed strategies in this thread. If not everyone is being cooperative and doesn't buy the high p(doom) arguments, then this all falls apart. Nuclear weapons essentially have a localized p(doom) of 1, yet both superpowers still built them. I am highly skeptical of any potential solution to any of this. It requires everyone (and not just, say, half) to buy the arguments to begin with.

Indeed, forecasters have been surprised by how slowly safety/robustness/etc. have progressed in recent years

Interesting, do you have a link to these safety predictions? I was not aware of this.

[anonymous]127

when it would potentially be vastly easier to spearhead an international alliance to prohibit this technology.

I would be interested in reading more about the methods that could be used to prohibit the proliferation of this technology (you can assume a "wake-up" from the USG). 

I think one of the biggest fears would be that any sort of international alliance would not have perfect/robust detection capabilities, so you're always risking the fact that someone might be running a rogue AGI project.

Also, separately, there's the issue of "at some point, doesn... (read more)

Responding to Matt Reardon's point on the EA Forum:

Leopold's implicit response as I see it:

  1. Convincing all stakeholders of high p(doom) such that they take decisive, coordinated action is wildly improbable ("step 1: get everyone to agree with me" is the foundation of many terrible plans and almost no good ones)
  2. Still improbable, but less wildly, is the idea that we can steer institutions towards sensitivity to risk on the margin and that those institutions can position themselves to solve the technical and other challenges ahead

Maybe the key insight is that

... (read more)

I do have a lot of reservations about Leopold's plan. But one positive feature it does have: it proposes to rely on a multitude of "limited weakly-superhuman artificial alignment researchers" and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable. So his plan does seem to have a good chance to overcome the factor that AI existential safety research is a

field that has not been particularly productive or fast in the past

and also to overcome other factors requiring overreliance on humans and on current ... (read more)

5Rob Bensinger
Leopold's scenario requires that the USG come to deeply understand all the perils and details of AGI and ASI (since they otherwise don't have a hope of building and aligning a superintelligence), but then needs to choose to gamble its hegemony, its very existence, and the lives of all its citizens on a half-baked mad science initiative, when it could simply work with its allies to block the tech's development and maintain the status quo at minimal risk. Success in this scenario requires a weird combination of USG prescience with self-destructiveness: enough foresight to see what's coming, but paired with a weird compulsion to race to build the very thing that puts its existence at risk, when it would potentially be vastly easier to spearhead an international alliance to prohibit this technology.

As is typical for Twitter, we also signal-boosted a lot of other people's takes. Some non-MIRI people whose social media takes I've recently liked include Wei Dai, Daniel Kokotajlo, Jeffrey Ladish, Patrick McKenzie, Zvi Mowshowitz, Kelsey Piper, and Liron Shapira.

The stuff I've been tweeting doesn't constitute an official MIRI statement — e.g., I don't usually run these tweets by other MIRI folks, and I'm not assuming everyone at MIRI agrees with me or would phrase things the same way. That said, some recent comments and questions from me and Eliezer:

  • May 17: Early thoughts on the news about OpenAI's crazy NDAs.
  • May 24: Eliezer flags that GPT-4o can now pass one of Eliezer's personal ways of testing whether models are still bad at math.
  • May 29: My initial reaction to hearing Helen's comments on the TED AI podcast. Inc
... (read more)
9Rob Bensinger
As is typical for Twitter, we also signal-boosted a lot of other people's takes. Some non-MIRI people whose social media takes I've recently liked include Wei Dai, Daniel Kokotajlo, Jeffrey Ladish, Patrick McKenzie, Zvi Mowshowitz, Kelsey Piper, and Liron Shapira.

Every protest I've witnessed seemed to be designed to annoy and alienate its witnesses, making it as clear as possible that there was no way to talk to these people, that their minds were on rails. I think most people recognize that as cult shit and are alienated by that.

In the last year, I've seen a Twitter video of an AI risk protest (I think possibly in continental Europe?) that struck me as extremely good: calm, thoughtful, accessible, punchy, and sensible-sounding statements and interview answers. If I find the link again, I'll add it here as a model ... (read more)

Could we talk about a specific expert you have in mind, who thinks this is a bad strategy in this particular case?

AI risk is a pretty weird case, in a number of ways: it's highly counter-intuitive, not particularly politically polarized / entrenched, seems to require unprecedentedly fast and aggressive action by multiple countries, is almost maximally high-stakes, etc. "Be careful what you say, try to look normal, and slowly accumulate political capital and connections in the hope of swaying policymakers long-term" isn't an unconditionally good strategy, i... (read more)

1Erich_Grunewald
I don't really have a settled view on this; I'm mostly just interested in hearing a more detailed version of MIRI's model. I also don't have a specific expert in mind, but I guess the type of person that Akash occasionally refers to -- someone who's been in DC for a while, focuses on AI, and has encouraged a careful/diplomatic communication strategy.

I agree with this. I also think that being more outspoken is generally more virtuous in politics, though I also see drawbacks with it. Maybe I'd have wished the OP mentioned some of the possible drawbacks of the outspoken strategy and whether there are sensible ways to mitigate them, or just made clear that MIRI thinks they're outweighed by the advantages. (There's some discussion, e.g., the risk of being "discounted or uninvited in the short term", but this seems to be mostly drawn from the "ineffective" bucket, not from the "actively harmful" bucket.)

Yeah, I guess this is a difference in worldview between me and MIRI, where I have longer timelines, am less doomy, and am more bullish on forceful government intervention, causing me to think increased variance is probably generally bad.

That said, I'm curious why you think AI risk is highly counterintuitive (compared to, say, climate change) -- it seems the argument can be boiled down to a pretty simple, understandable (if reductive) core ("AI systems will likely be very powerful, perhaps more powerful than humans; controlling them seems hard; and all that seems scary"), and it has indeed been transmitted like that successfully in the past, in films and other media. I'm also not sure why it's relevant here that AI risk is relatively unpolarized -- if anything, that seems like it should make it more important not to cause further polarization (at least if highly visible moral issues being relatively unpolarized represent unstable equilibria)?

I'm interpreting "realize" colloquially, as in, "be aware of". I don't think the people discussed in the post just haven't had it occur to them that pre-singularity wealth doesn't matter because a win singularity society very likely wouldn't care much about it. Instead someone might, for example...

  • ...care a lot about their and their people's lives in the next few decades.
  • ...view it as being the case that [wealth mattering] is dependent on human coordination, and not trust others to coordinate like that. (In other words: the "stakeholders" would have to
... (read more)

Two things:

  • For myself, I would not feel comfortable using language as confident-sounding as "on the default trajectory, AI is going to kill everyone" if I assigned (e.g.) 10% probability to "humanity [gets] a small future on a spare asteroid-turned-computer or an alien zoo or maybe even star". I just think that scenario's way, way less likely than that.
    • I'd be surprised if Nate assigns 10+% probability to scenarios like that, but he can speak for himself. 🤷‍♂️
    • I think some people at MIRI have significantly lower p(doom)? And I don't expect those people to u
... (read more)
5ryan_greenblatt
Thanks, this is clarifying from my perspective. My remaining uncertainty is why you think AIs are so unlikely to keep humans around and treat them reasonably well (e.g. let them live out full lives). From my perspective, the argument that it is plausible that humans are treated well [even if misaligned AIs end up taking over the world and gaining absolute power] goes something like this:

  • If it only cost <1/million of overall resources to keep a reasonable fraction of humans alive and happy, it's reasonably likely that misaligned AIs with full control would keep humans alive and happy, due to either:
    • acausal trade/decision theory, or
    • the AI terminally caring at least a bit about being nice to humans (perhaps because it cares a bit about respecting existing nearby agents, or perhaps because it has at least a bit of human-like values).
  • It is pretty likely that it costs <1/million of overall resources (from the AI's perspective) to keep a reasonable fraction of humans alive and happy. Humans are extremely cheap to keep around asymptotically, and I think it can be pretty cheap even initially, especially if you're a very smart AI. (See links in my prior comment for more discussion.)

(I also think the argument goes through for 1/billion, but I thought I would focus on the higher value for now.)

Where do you disagree with this argument?
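(A minimal sketch of the argument's lower-bound structure in Python; both probabilities are hypothetical placeholders I've added for illustration, not numbers from the comment.)

```python
# Sketch of the lower-bound structure of the argument above.
# Both probabilities are hypothetical placeholders, not claims from the comment.

p_cost_below_one_millionth = 0.8  # P(keeping many humans alive costs < 1e-6 of resources)
p_ai_pays_given_cheap = 0.5       # P(AI pays that much, via trade or slight niceness | it is that cheap)

# A lower bound on P(humans kept alive and treated well | misaligned AI takeover),
# counting only this one route to a good outcome:
p_lower_bound = p_cost_below_one_millionth * p_ai_pays_given_cheap
print(f"P(humans treated well | takeover) >= {p_lower_bound:.2f}")  # 0.40
```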

Note that "everyone will be killed (or worse)" is a different claim from "everyone will be killed"! (And see Oliver's point that Ryan isn't talking about mistreated brain scans.)

Some of the other things you suggest, like future systems keeping humans physically alive, do not seem plausible to me.

I agree with Gretta here, and I think this is a crux. If MIRI folks thought it were likely that AI will leave a few humans biologically alive (as opposed to information-theoretically revivable), I don't think we'd be comfortable saying "AI is going to kill everyone". (I encourage other MIRI folks to chime in if they disagree with me about the counterfactual.)

I also personally have maybe half my probability mass on "the AI just doesn't stor... (read more)

FWIW I do think "don't trust this guy" is warranted; I don't know that he's malicious, but I think he's just exceptionally incompetent relative to the average tech reporter you're likely to see stories from.

Like, in 2018 Metz wrote a full-length article on smarter-than-human AI that included the following frankly incredible sentence:

During a recent Tesla earnings call, Mr. Musk, who has struggled with questions about his company’s financial losses and concerns about the quality of its vehicles, chastised the news media for not focusing on the deaths that a

... (read more)

FWIW, Cade Metz was reaching out to MIRI and some other folks in the x-risk space back in January 2020, and I went to read some of his articles and came to the conclusion in January that he's one of the least competent journalists -- like, most likely to misunderstand his beat and emit obvious howlers -- that I'd ever encountered. I told folks as much at the time, and advised against talking to him just on the basis that a lot of his journalism is comically bad and you'll risk looking foolish if you tap him.

This was six months before Metz caused SSC to shu... (read more)

Sounds like a lot of political alliances! (And "these two political actors are aligned" is arguably an even weaker condition than "these two political actors are allies".)

At the end of the day, of course, all of these analogies are going to be flawed. AI is genuinely a different beast.

It's pretty sad to call all of these end states you describe "alignment", as "alignment" is an extremely natural word for "actually terminally has good intentions".

Aren't there a lot of clearer words for this? "Well-intentioned", "nice", "benevolent", etc.

(And a lot of terms, like "value loading" and "value learning", that are pointing at the research project of getting good intentions into the AI.)

To my ear, "aligned person" sounds less like "this person wishes the best for me", and more like "this person will behave in the right ways".

If I hear that Russia an... (read more)

Aren't there a lot of clearer words for this? "Well-intentioned", "nice", "benevolent", etc.

Fair enough. I guess it just seems somewhat incongruous to say, "Oh yes, the AI is aligned. Of course it might desperately crave murdering all of us in its heart (we certainly haven't ruled this out with our current approach), but it is aligned because we've made it so that it wouldn't get away with it if it tried."

"Should" in order to achieve a certain end? To meet some criterion? To boost a term in your utility function?

In the OP: "Should" in order to have more accurate beliefs/expectations. E.g., I should anticipate (with high probability) that the Sun will rise tomorrow in my part of the world, rather than it remaining night.

Suppose someone draws a "personal identity" line to exclude this future sunrise-witnessing person.  Then if you claim that, by not anticipating, they are degrading the accuracy of the sunrise-witness's beliefs, they might reply that you are begging the question.

Why would the laws of physics conspire to vindicate a random human intuition that arose for unrelated reasons?

We do agree that the intuition arose for unrelated reasons, right? There's nothing in our evolutionary history, and no empirical observation, that causally connects the mechanism you're positing and the widespread human hunch "you can't copy me".

If the intuition is right, we agree that it's only right by coincidence. So why are we desperately searching for ways to try to make the intuition right?

It also doesn't force us to believe that a bunch of w

... (read more)

You're missing the bigger picture and pattern-matching in the wrong direction. I am not saying the above because I have a need to preserve my "soul" due to misguided intuitions. On the contrary, the reason for my disagreement is that I believe you are not staring into the abyss of physicalism hard enough. When I said I'm agnostic in my previous comment, I said it because physics and empiricism lead me to consider reality as more "unfamiliar"  than you do (assuming that my model of your beliefs is accurate). From my perspective, your post and your conc... (read more)

Yeah, at some point we'll need a proper theory of consciousness regardless, since many humans will want to radically self-improve and it's important to know which cognitive enhancements preserve consciousness.

Yeah. My point was, we can't even be sure which behavior-preserving optimizations (of the kind done by optimizing compilers, say) will preserve consciousness. It's worrying because these optimizations can happen innocuously, e.g. when your upload gets migrated to a newer CPU with fancier heuristics. And yeah, when self-modification comes into the picture, it gets even worse.

You can easily clear this confusion if you rephrase it as "You should anticipate having any of these experiences". Then it's immediately clear that we are talking about two separate screens.

This introduces some other ambiguities. E.g., "you should anticipate having any of these experiences" may make it sound like you have a choice as to which experience to rationally expect.

And it's also clear that our curiosity isn't actually satisfied: the question "which one of these two will actually be the case" is still very much on the table.

... And the answer... (read more)

My first issue with your post is that this initial ontological assumption is neither mentioned explicitly nor motivated. Nothing in your post can be used as proof of this initial assumption.

There are always going to be many different ways someone could object to a view. If you were a Christian, you'd perhaps be objecting that the existence of incorporeal God-given Souls is the real crux of the matter, and if I were intellectually honest I'd be devoting the first half of the post to arguing against the Christian Soul.

Rather than trying to anticipate these o... (read more)

First off, would you agree with my model of your beliefs? Would you consider it an accurate description?

Also, let me make clear that I don't believe in cartesian souls. I, like you, lean towards physicalism, I just don't commit to the explanation of consciousness  based on the idea of the brain as a **classical** electronic circuit. I don't fully dismiss it either, but I think it is worse on philosophical grounds than assuming that there is some (potentially minor) quantum effect going on inside the brain that is an integral part of the explanation fo... (read more)

Wouldn't it follow that in the same way you anticipate the future experiences of the brain that you "find yourself in" (i.e. the person reading this) you should anticipate all experiences, i.e. that all brain states occur with the same kind of me-ness/vivid immediacy?

What's the empirical or physical content of this belief?

I worry that this may be another case of the Cartesian Ghost rearing its ugly head. We notice that there's no physical thingie that makes the Ghost more connected to one experience or the other; so rather than exorcising the Ghost entirel... (read more)

1Brent
I'll take a stab at explaining this with a simple thought experiment. Say there are two people, Alice and Bob, each with their own unique brain states. If Alice's brain state changes slightly, from getting older, learning something new, losing some neurons to a head injury, etc., she will still be Alice. Changing, adding, or removing a neuron does not change this fact. Now what if instead part of her brain state was changing slowly to match Bob's? You could think of this as incrementally removing Alice's neurons and replacing them with a copy of Bob's. I find it hard to believe that any discrete small change will make Alice's conscious experience suddenly disappear, and by the end of it she will have the exact same brain state as Bob.

If you believe that when Bob steps into a teleporter that also makes a copy, they are both the same Bob, then it is reasonable to assume that this transformed Alice is also Bob. Then, for the same reason your older self is the same "self" as your younger self, the younger Alice is also Bob. The transition between their brain states doesn't even need to happen; it just has to be possible. From here it is easy to extrapolate that all brain states are the same "self".
1Edralis
I apologize for not getting back to you sooner, I didn't notice your reply until yesterday. And I apologize for the length of my response, too - I bolded the most important parts.

Re: Whether there is an empirical difference between worlds where OI is true and where OI is false.

The difference between all experiences being mine and only some being mine is that if all experiences are mine, then they all exist in the same way this experience now exists, i.e. for me (where me = just this immediacy/this-here-now character, i.e. the way it exists, NOT Edralis's memories, personality etc.). There is no empirical difference in the usual sense, since the way experiences exist cannot be objectively assessed. I can't be sure that you even have any experiences – this is not something that is available for empirical investigation in the way I can assess e.g. the number of someone's fingers. And I can't know, given there are experiences from that point of view, that they exist in the same way as this experience does, i.e. for me. That is only clear in those experiences. If I am there, I do ultimately know that I am there (obviously) – but I have no way to know that when experiencing this person, Edralis. So the empirical difference in the usual sense between OI being true and not being true is none.

However, there are facts other than empirical ones. The existential difference between those two worlds is vast. If OI is true, then I (i.e. the thisness, the here-now-this that at least Edralis's experiences have) am Rob Bensinger, and everybody else – if it's not true, then I am not. The difference is in the being of those experiences, in how they exist. But since experiences (consciousness) don't exist empirically (or better: objectively), there is no empirical (objective) difference. There is an existential, subjective difference, though.

That is not what I mean when I think about anticipating a future brain state. What I am interested in is not the content of experience, but how t

As a test, I asked a non-philosopher friend of mine what their view is. Here's a transcript of our short conversation: https://docs.google.com/document/d/1s1HOhrWrcYQ5S187vmpfzZcBfolYFIbeTYgqeebNIA0/edit 

I was a bit annoyingly repetitive with trying to confirm and re-confirm what their view is, but I think it's clear from the exchange that my interpretation is correct at least for this person.

Is there even anybody claiming there is an experiential difference?

Yep! Ask someone with this view whether the current stream of consciousness continues from their pre-uploaded self to their post-uploaded self, like it continues when they pass through a doorway. The typical claim is some version of "this stream of consciousness will end, what comes next is only oblivion", not "oh sure, the stream of consciousness is going to continue in the same way it always does, but I prefer not to use the English word 'me' to refer to the later parts of that stream of ... (read more)

1cubefox
This doesn't show they believe there is a difference in experience. It can be simply a different analysis of the meaning of "the current stream of consciousness continuing". That's a semantic difference, not an empirical one.
5Rob Bensinger
As a test, I asked a non-philosopher friend of mine what their view is. Here's a transcript of our short conversation: https://docs.google.com/document/d/1s1HOhrWrcYQ5S187vmpfzZcBfolYFIbeTYgqeebNIA0/edit  I was a bit annoyingly repetitive with trying to confirm and re-confirm what their view is, but I think it's clear from the exchange that my interpretation is correct at least for this person.

The problem was that you first seemed to belittle questions about word meanings ("self") as being "just" about "definitions" that are "purely verbal".

I did no such thing!

Luckily now you concede that the question about the meaning of "I" isn't just about (arbitrary) "definitions"

Read the blog post at the top of this page! It's my attempt to answer the question of when a mind is "me", and you'll notice it's not talking about definitions.

But we already know all the empirical facts: Someone goes into the teleporter, a bit later someone comes out at the other e

... (read more)
-1cubefox
Is there even anybody claiming there is an experiential difference? It seems you may be attacking a strawman. The alternative to this is that there is a disagreement about the appropriate semantic interpretation/analysis of the question, e.g. about what we mean when we say "I will (not) experience such and such". That seems more charitable than hypothesizing beliefs in "ghosts" or "magic".

You're also free to define "I" however you want in your values.

Sort of!

  • It's true that no law of nature will stop you from using "I" in a nonstandard way; your head will not explode if you redefine "table" to mean "penguin".
  • And it's true that there are possible minds in abstract mindspace that have all sorts of values, including strict preferences about whether they want their brain to be made of silicon vs. carbon.
  • But it's not true that humans alive today have full and complete control over their own preferences.
  • And it's not true that humans can never be m
... (read more)
1Signer
Why not both? I can imagine that someone would be persuaded to accept teleportation/uploading if they stopped believing in a physical Cartesian Ghost. But it's possible that if you remind them that continuity of experience, like a table, is just a description of a physical situation and not a divinely blessed necessary value, that would be enough to tip the balance toward them valuing carbon or whatever. It's bad to be wrong about Cartesian Ghosts, but it's also bad to think that you don't have a choice about how you value experience.

FWIW, I typically use "alignment research" to mean "AI research aimed at making it possible to safely do ambitious things with sufficiently-capable AI" (with an emphasis on "safely"). So I'd include things like Chris Olah's interpretability research, even if the proximate impact of this is just "we understand what's going on better, so we may be more able to predict and finely control future systems" and the proximate impact is not "the AI is now less inclined to kill you".

Some examples: I wouldn't necessarily think of "figure out how we want to airgap the... (read more)

It's pretty sad to call all of these end states you describe "alignment", as "alignment" is an extremely natural word for "actually terminally has good intentions". So, it makes me sad to call this alignment research. Of course, this type of research may be instrumentally useful for making AIs more aligned, but so will a bunch of other stuff (e.g. earning to give).

Fair enough if you think we should just eat this terminology issue and then coin a new term like "actually real-alignment-targeting-directly alignment research". Idk what the right term is obviously.

But that isn't an experience. It's two experiences. You will not have an experience of having two experiences. Two experiences will experience having been one person.

Sure; from my perspective, you're saying the same thing as me.

Are you going to care about 1000 different copies equally?

How am I supposed to choose between them?

3TAG
By "equally" I meant: "in the same ways (and to the same degree)". If you actually believe in florid many worlds, you would end up pretty insuoucient, since everything possible happens, and nothing can be avoided.

Why? If "I" is arbitrary definition, then “When I step through this doorway, will I have another experience?" depends on this arbitrary definition and so is also arbitrary.

Which things count as "I" isn't an arbitrary definition; it's just a fuzzy natural-language concept.

(I guess you can call that "arbitrary" if you want, but then all the other words in the sentence, like "doorway" and "step", are also "arbitrary".)

Analogy: When you're writing in your personal diary, you're free to define "table" however you want. But in ordinary English-language discourse... (read more)

3cubefox
The problem was that you first seemed to belittle questions about word meanings ("self") as being "just" about "definitions" that are "purely verbal". Luckily, now you concede that the question about the meaning of "I" isn't just about (arbitrary) "definitions", which makes calling it a "purely verbal" (read: arbitrary) question inappropriate. Now of course the meaning of "self" is no more arbitrary than the meaning of "I"; indeed, those terms are clearly meant to refer to the same thing (like "me" or "myself"). The wider point is that the following seems not true: [...]

When we evaluate statements or questions of any kind, including the one above, we need to know two things: 1) its meaning, in particular the meaning of the involved terms, and 2) what the empirical facts are. But we already know all the empirical facts: someone goes into the teleporter, and a bit later someone comes out at the other end and sees something. So the issue can only be about the semantic interpretation of that question, about what we mean with expressions like "I will see x". Do we mean "A future person that is psychologically continuous with current-me sees x"? That's not an empirical question, it's a semantic one, but it's not in any way arbitrary, as expressions like "just about definitions" or "purely verbal" would suggest. Conceptual analysis is neither arbitrary nor trivial.
9Signer
You're also free to define "I" however you want in your values. You're only wrong if your definitions imply a wrong physical reality. But defining "I" and "experiences" in such a way that you will not experience anything after teleportation is possible without implying anything physically wrong.

You can be wrong about the physical reality of teleportation. But even after you've figured out that there is no additional physical process going on that kills your soul, except for the change of location, you can still move from "my soul crashes against an asteroid" to "soul-death in my values means sudden change in location" instead of to "my soul remains alive".

It's not like I even expect you specifically to mean "not liking teleportation is necessarily irrational" much. It's just that saying there should be an actual answer to questions about "I" and "experiences" makes people moral-realist.

The problem is another way to phrase this is a superintelligent weapon system - "ending a risk period" by "reliably, and efficiently doing a small number of specific concrete tasks" means using physical force to impose your will on others.

The pivotal acts I usually think about actually don't route through physically messing with anyone else. I'm usually thinking about using aligned AGI to bootstrap to fast human whole-brain emulation, then using the ems to bootstrap to fully aligned CEV AI.

If someone pushes a "destroy the world" button then the ems or CEV ... (read more)

6Lao Mein
Are people actually working on human enhancement? Many talk about how it's the best chance humanity has, but I see zero visible efforts other than Neuralink. No one's even seriously trying to clone von Neumann!

To pick out a couple of specific examples from your list, Wei Dai:

14. Human-controlled AIs causing ethical disasters (e.g., large scale suffering that can't be "balanced out" later) prior to reaching moral/philosophical maturity

This is a serious long-term concern if we don't kill ourselves first, but it's not something I see as a premise for "the priority is for governments around the world to form an international agreement to halt AI progress". If AI were easy to use for concrete tasks like "build nanotechnology" but hard to use for things like CEV, I'd ... (read more)

Yep, before I saw orthonormal's response I had a draft-reply written that says almost literally the same thing:

we just call 'em like we see 'em

[...]

insofar as we make bad predictions, we should get penalized for it. and insofar as we think alignment difficulty is the crux for 'why we need to shut it all down', we'd rather directly argue against illusory alignment progress (and directly acknowledge real major alignment progress as a real reason to be less confident of shutdown as a strategy) rather than redirect to something less cruxy

I'll also add: Nate (u... (read more)

4Wei Dai
The items on my list are of roughly equal salience to me. I don't have specific suggestions for people who might be interested in spreading awareness of these risks/arguments, aside from picking a few that resonate with you and are also likely to be well received by the intended audience. And maybe link back to the list (or some future version of such a list) so that people don't think the ones you choose to talk about are the only risks. For me personally, I tend to talk about "philosophy is hard" (which feeds into "alignment is hard" and beyond) and "humans aren't safe" (humans suffer from all kinds of safety problems just like AIs do, including being easily persuaded of strange beliefs and bad philosophy, calling "alignment" into question even as a goal). These might not work well on a broader audience though, the kind that MIRI is presumably trying to reach. Some adjacent messages might, for example, "even if alignment succeeds, humans can't be trusted with God-like powers yet; we need to become much wiser first" and "AI persuasion will be a big problem" (but honestly I have little idea due to lack of experience talking outside my circle).
3Rob Bensinger
To pick out a couple of specific examples from your list, Wei Dai:

This is a serious long-term concern if we don't kill ourselves first, but it's not something I see as a premise for "the priority is for governments around the world to form an international agreement to halt AI progress". If AI were easy to use for concrete tasks like "build nanotechnology" but hard to use for things like CEV, I'd instead see the priority as "use AI to prevent anyone else from destroying the world with AI", and I wouldn't want to trade off probability of that plan working in exchange for (e.g.) more probability of the US and the EU agreeing in advance to centralize and monitor large computing clusters. After someone has done a pivotal act like that, you might then want to move more slowly insofar as you're worried about subtle moral errors creeping in to precursors to CEV.

I currently assign very low probability to humans being able to control the first ASI systems, and redirecting governments' attention away from "rogue AI" and toward "rogue humans using AI" seems very risky to me, insofar as it causes governments to misunderstand the situation, and to specifically misunderstand it in a way that encourages racing. If you think rogue actors can use ASI to achieve their ends, then you should probably also think that you could use ASI to achieve your own ends; misuse risk tends to go hand-in-hand with "we're the Good Guys, let's try to outrace the Bad Guys so AI ends up in the right hands". This could maybe be justified if it were true, but when it's not even true it strikes me as an especially bad argument to make.

I expect it makes it easier, but I don't think it's solved.

Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal like "maximize diamonds in an aligned way", why not a bunch of small, grounded ones?

  1. "Plan the factory layout of the diamond synthesis plant with these requirements".
  2. "Order the equipment needed, here's the payment credentials".
  3. "Supervise construction this workday comparing to original plans"
  4. "Given this step of the plan, do it"
  5. (Once the factory is built) "remove the output from diamond synthesis machine A53 and clean it".

That is how MIRI imagines a sane developer using just-b... (read more)

2faul_sname
Thanks for the reply. This sounds like a good and reasonable approach, and also not at all like the sort of thing where you're trying to instill any values at all into an ML system. I would call this "usable and robust tool construction", not "AI alignment". I expect standard business practice will look something like this: even when using LLMs in a production setting, you generally want to feed them the minimum context to get the results you want, and to have them produce outputs in some strict and usable format.

"How can I build a system powerful enough to stop everyone else from doing stuff I don't like" sounds like more of a capabilities problem than an alignment problem.

Yeah, this sounds right to me. I expect that there's a lot of danger inherent in biological gain-of-function research, but I don't think the solution to that is to create a virus that will infect people and cause symptoms that include "being less likely to research dangerous pathogens". Similarly, I don't think "do research on how to make systems that can do their own research even faster" is a promising approach to solving the "some research results can be misused or dangerous" problem.

To be clear: The diamond maximizer problem is about getting specific intended content into the AI's goals ("diamonds" as opposed to some random physical structure it's maximizing), not just about building a stable maximizer.

2faul_sname
Thanks for the clarification! If you relax the "specific intended content" constraint, and allow for maximizing any random physical structure, as long as it's always the same physical structure in the real world and not just some internal metric that has historically correlated with the amount of that structure that existed in the real world, does that make the problem any easier / is there a known solution? My vague impression was that the answer was still "no, that's also not a thing we know how to do".

From briefly talking to Eliezer about this the other day, I think the story from MIRI's perspective is more like:

  • Back in 2001, we defined "Friendly AI" as "The field of study concerned with the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals."

We could have defined the goal more narrowly or generically than that, but that just seemed like an invitation to take your eye off the ball: if we aren't going to think about the question of how... (read more)

Roko13-7

getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.

The problem is another way to phrase this is a superintelligent weapon system - "ending a risk period" by "reliably, and efficiently doing a small number of specific concrete tasks" means using physical force to impose your will on others.

On reflection, I do not think that it is a wise idea to factor the path to a good future through a global AI-assisted coup.

Instead one should tr... (read more)

In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:

example thing that would be relatively impressive to me: specific, comprehensive understanding of models (with the caveat that that knowledge may lend itself more (and sooner) to capabilities before alignment). demonstrated e.g. by the ability to precisely predict the capabilities and quirks of the next generation (before running it)

i'd also still be impressed by simple theories of aimable co

... (read more)

I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.

You're changing the topic to "can you do X without wanting Y?", when the original question was "can you do X without wanting anything at all?".

Nate's answer to nearly all questions of the form "can you do X without wanting Y?" is "yes", hence his second claim in the OP: "the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular".

I do need to answer that question u

... (read more)

When the post says:

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".

It seems like it's saying that if you prompt an LM with "Could you suggest a way to get X in light of all the obstacles that reality has thrown in my way," and if it does that reasonably well and if you hook it up to actuators, then it definitionally has wants and desires.

Which is a fine definition to pick. But the point is that in thi... (read more)

3David Johnston
A system that can, under normal circumstances, explain how to solve a problem won't necessarily solve a problem that gets in the way of its explaining the solution. The notion of wanting that Nate proposes is "solving problems in order to achieve the objective", and this need not apply to the system that explains solutions. In short: yes.
4Seth Herd
Thinking about it a little more, there may be a good reason to consider how humans pursue mid-horizon goals. I think I do make a goal of answering Paul's question. It's not a subgoal of my primary values of getting food, status, etc, because backward-chaining is too complex. It's based on a vague estimate of the value (total future reward) of that action in context. I wrote about this in Human preferences as RL critic values - implications for alignment, but I'm not sure how clear that brief post was.

I was addressing a different part of Paul's comment than the original question. I mentioned that I didn't have an answer to the question of whether one can make long-range plans without wanting anything. I did try an answer in a separate top-level response: it doesn't matter much whether a system can pursue long-horizon tasks without wanting, because agency is useful for long-horizon tasks, and it's not terribly complicated to implement. So AGI will likely have it built in, whether or not it would emerge from adequate non-agentic training.

I think people will rapidly agentize any oracle system. It's useful to have a system that does things for you. And to do anything more complicated than answer one email, the user will be giving it a goal that may include instrumental subgoals. The possibility of emergent wanting might still be important in an agent scaffolded around a foundation model.

Perhaps I'm confused about the scenarios you're considering here. I'm less worried about LLMs achieving AGI and developing emergent agency, because we'll probably give them agency before that happens.

The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.

See my reply to Bogdan here. The issue isn't "inelegance"; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.

Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or "ancient culinary arts and medicine shortly after a cultural reboot", such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than... (read more)

Some of Nate’s quick thoughts (paraphrased), after chatting with him:

Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-me... (read more)

I read and responded to some pieces of that post when it came out; I don't know whether Eliezer, Nate, etc. read it, and I'm guessing it didn't shift MIRI, except as one of many data points "person X is now loudly in favor of a pause (and other people seem receptive), so maybe this is more politically tractable than we thought".

I'd say that Kerry Vaughan was the main person who started smashing this Overton window, and this started in April/May/June of 2022. By late December my recollection is that this public conversation was already fully in swing and MI... (read more)

7Ben Pace
Don't forget lc had a 624 karma post on it on April 4th 2022.