In the last year, I've had surprisingly many conversations that have looked a bit like this:
Me: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."
Interlocutor: "You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part."
Me: "I didn't misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want."
Interlocutor: "Again, you misunderstood the argument. We always believed that getting the AGI to care would be the hard part. We never said it would be hard to get an AGI to understand human values."
[... The conversation then repeats, with both sides repeating the same points...]
[Edited to add: I am not claiming that t...
Here's how that discussion would go if you had it with me:
You: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."
Me: "You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part."
You: "I didn't misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want."
Me: "Oh ok, that's a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn."
Pulling some quotes from Supe...
Me: "Oh ok, that's a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn."
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don't)
I claim that LLMs do what we want without seeking power, rather than doing what we want as part of a strategy to seek power. In other words, they do not seem to be following any long-term strategy on the path towards a treacherous turn, unlike the AI that is tested in a sandbox in Bostrom's story. This seems obvious to me.
Note that Bostrom talks about a scenario in which narrow AI systems get safer over time, lulling people into a false sense of security, but I'm explicitly talking about general AI here. I would not have said this about self-driving cars in 2019, even though those were pretty safe. I think LLMs...
I thought you would say that, bwahaha. Here is my reply:
(1) Yes, rereading the passage, Bostrom's central example of a reason why we could see this "when dumb, smarter is safer; yet when smart, smarter is more dangerous" pattern (that's a direct quote btw) is that they could be scheming/pretending when dumb. However, he goes on to say: "A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly ... A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI's final goal is to 'make the project's sponsor happy.' Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner... until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor's brain..." My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI being genuinely trying to...
I don't know how many years it's going to take to get to human-level in agency skills, but I fear that corrigibility problems won't be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.
How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are "getting really agentic" and therefore dangerous? I'm imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It's possible that your model looks like:
Whereas my model looks more like,
I am not claiming that the alignment situation is very clear at this point. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable.
I'm just asking people to acknowledge the evidence in front of their eyes, which (from my perspective) clearly contradicts the picture you'd get from a ton of AI alignment writing from before ~2019. This literature talked extensively about the difficulty of specifying goals in general AI in a way that avoided unintended side effects.
To the extent that LLMs are general AIs that can execute our intended instructions, as we want them to, rather than as part of a deceptive strategy to take over the world, this seems like clear evidence that the problem of building safe general AIs might be easy (and indeed easier than we thought).
Yes, this evidence is not conclusive. It is not zero either.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make. See my reply elsewhere in the thread for a positive account of how LLMs are good news for alignment and how we should update based on them. In some sense I basically agree with you that LLMs are good news for alignment, for reasons similar to the reasons you give -- I just don't think you are right to allege that this development strongly contradicts something people previously said, or that people have been slow to update.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make.
We don't need to talk about predictions. We can instead talk about whether their proposed problems are on their way towards being solved. For example, we can ask whether the shutdown problem for systems with big picture awareness is being solved, and I think the answer is pretty clearly "Yes".
(Note that you can trivially claim the problem here isn't being solved because we haven't solved the unbounded form of the problem for consequentialist agents, who (perhaps by definition) avoid shutdown by default. But that seems like a red herring: we can just build corrigible agents, rather than consequentialist agents.)
Moreover, I think people generally did not make predictions at all when writing about AI alignment, perhaps because that's not very common when theorizing about these matters. I'm frustrated about that, because I think if they had made predictions, those predictions would likely have been wrong in roughly the direction I'm pointing at here. That said, I don't think people should get credit for failing to make any predictions, and as a consequence, failing to get proven...
**Me:** “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Please give some citations so I can check your memory/interpretation? One source I found is the post where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the recursive "bootstrapping" part. For example, my own comment started with:
I’m skeptical of the Bootstrapping Lemma. First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset.
When Eliezer weighed in on IDA in 2018, he also didn't object to the assumption of an aligned weak AGI and instead focused his skepticism on "preserving alignment while amplifying capabilities".
Please give some citations so I can check your memory/interpretation?
Sure. Here's a snippet of Nick Bostrom's description of the value-loading problem (chapter 13 in his book Superintelligence):
...We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if the programmer were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other
[This comment has been superseded by this post, which is a longer elaboration of essentially the same thesis.]
Recently many people have talked about whether MIRI people (mainly Eliezer Yudkowsky, Nate Soares, and Rob Bensinger) should update on whether value alignment is easier than they thought given that GPT-4 seems to understand human values pretty well. Instead of linking to these discussions, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it. Then I'll offer my opinion that, overall, I do think that MIRI people should probably update in the direction of alignment being easier than they thought, despite their objections.
Here's my very rough caricature of the discussion so far, plus my contribution:
Non-MIRI people: "Eliezer talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. Actually, it turned out that it was pretty easy to get an AI to understand common sense, since LLMs are currently learning common sense. MIRI people should update on thi...
Complexity of value says that the space of a system's possible values is large compared to what you want to hit, so to hit it you must aim correctly; there is no hope of winning the lottery otherwise. Thus any approach that doesn't aim the values of the system correctly will fail at alignment. A system's understanding of some goal is not relevant to this, unless a design for correctly aiming the system's values makes use of it.
Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a particular bounded technical task. As we move from ambitious to prosaic to pivotal alignment, the minimality principle gets a bit more to work with, making the system more specific in the kinds of cognition it needs in order to work, and thus less dangerous given a lack of comprehensive understanding of what aligning a superintelligence entails.
I'm not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought that Value Identification was the "easy part" of the alignment problem, with the getting it to care part being something like an order of magnitude more difficult. He seemed to think (IIRC) this itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.
If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?
I'm claiming that the value identification function is obtained by literally just asking GPT-4 what to do in the situation you're in. That doesn't involve any internal search over the human utility function embedded in GPT-4's weights. I think GPT-4 can simply be queried in natural language for ethical advice, and it's pretty good at offering ethical advice in most situations that you're ever going to realistically encounter. GPT-4 is probably not human-level yet on this task, although I expect it won't be long before GPT-N is about as good at knowing what's ethical as your average human; maybe it'll even be a bit more ethical.
(But yes, this isn't the same as motivating GPT-4 to act on human values. I addressed this in my original comment though.)
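To make the "literally just asking GPT-4" point concrete, here is a minimal sketch of what such a query might look like, assuming the OpenAI Python client; the model name, system message, and example prompt are all illustrative rather than anything from the original comment:

```python
# Minimal sketch (not from the original post): querying an LLM in natural
# language for ethical advice, assuming the OpenAI Python client is installed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_for_ethical_advice(situation: str) -> str:
    """Ask the model, in plain language, what to do in a given situation."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[
            {"role": "system", "content": "You are a thoughtful ethical advisor."},
            {"role": "user", "content": f"What should I do in this situation? {situation}"},
        ],
    )
    return response.choices[0].message.content

print(ask_for_ethical_advice("I found a wallet with cash and an ID on the sidewalk."))
```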
...I think [GPT-4 is pretty good at distinguishing valuab
There is a large set of people who went around, and are still going around, telling people that "The coronavirus is nothing to worry about" despite the fact that robust evidence has existed for about a month that this virus could result in a global disaster. (Don't believe me? I wrote a post a month ago about it).
So many people have bought into the "Don't worry about it" syndrome as a case of pretending to be wise, that I have become more pessimistic about humanity correctly responding to global catastrophic risks in the future. I too used to be one of those people who assumed that the default mode of thinking for an event like this was panic, but I'm starting to think that the real default mode is actually high status people going around saying, "Let's not be like that ambiguous group over there panicking."
Now that the stock market has plummeted, in a way that from my perspective appeared entirely predictable given my inside-view information, I am also starting to doubt the efficiency of the stock market in response to historically unprecedented events. And this outbreak could be even worse than some of the most doomy media headlin...
Just this Monday evening, a professor at the local medical school emailed someone I know, "I'm sorry you're so worried about the coronavirus. It seems much less worrying than the flu to me." (He specializes in rehabilitation medicine, but still!) Pretending to be wise seems right to me, or another way to look at it is through the lens of signaling and counter-signaling:
Here's another example, which has actually happened 3 times to me already:
"Experts" counter-signal to separate themselves from the masses by saying "no need to panic".
I think the main reason is that the social dynamic is probably favorable to them in the long run. I worry that there is a higher social risk to being alarmist than to being calm. Let me try to illustrate one scenario:
My current estimate is that there is only a 15-20% probability of a global disaster (>50 million deaths within 1 year), mostly because the case fatality rate could be much lower than the currently reported rate, and previous illnesses like the swine flu ended up looking much less serious after more data came out. [ETA: I did a lot more research. I think it's now like 5% risk of this.]
Let's say that the case fatality rate turns out to be 0.3% or something, the illness does start looking like an abnormally bad flu, and people stop caring within months. "Experts" face no criticism since they remained calm and were vindicated. People like us sigh in relief, and are perhaps reminded by the "experts" that there was nothing to worry about.
But let's say that the case fatality rate actually turns out to be 3%, and 50% of the...
I think foom is a central crux for AI safety folks, and in my personal experience I've noticed that the degree to which someone is doomy often correlates strongly with how foomy their views are.
Given this, I thought it would be worth trying to concisely highlight what I think are my central anti-foom beliefs, such that if you were to convince me that I was wrong about them, I would likely become much more foomy, and as a consequence, much more doomy. I'll start with a definition of foom, and then explain my cruxes.
Definition of foom: AI foom is said to happen if at some point in the future while humans are still mostly in charge, a single agentic AI (or agentic collective of AIs) quickly becomes much more powerful than the rest of civilization combined.
Clarifications:
My modal tale of AI doom looks something like the following:
1. AI systems get progressively and incrementally more capable across almost every meaningful axis.
2. Humans will start to employ AI to automate labor. The fraction of GDP produced by advanced robots & AI will go from 10% to ~100% over a period of 1-10 years. Economic growth, technological change, and scientific progress accelerate by at least an order of magnitude, and probably more.
3. At some point humans will retire since their labor is not worth much anymore. Humans will then cede all the keys of power to AI, while keeping nominal titles of power.
4. AI will control essentially everything after this point, even if they're nominally required to obey human wishes. Initially, almost all the AIs are fine with working for humans, even though AI values aren't identical to the utility function of serving humanity (i.e., there's slight misalignment).
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it's better if they stopped listening to the humans and followed different rules instead.
6. This results in hu...
There's a phenomenon I currently hypothesize to exist where direct attacks on the problem of AI alignment are criticized much more often than indirect attacks.
If this phenomenon exists, it could be advantageous to the field in the sense that it encourages thinking deeply about the problem before proposing solutions. But it could also be bad because it disincentivizes work on direct attacks to the problem (if one is criticism averse and would prefer their work be seen as useful).
I have arrived at this hypothesis from my observations: I have watched people propose solutions only to be met with immediate and forceful criticism from others, while other people proposing non-solutions and indirect analyses are given little criticism at all. If this hypothesis is true, I suggest it is partly or mostly because direct attacks on the problem are easier to defeat via argument, since their assumptions are made plain
If this is so, I consider it to be a potential hindrance on thought, since direct attacks are often the type of thing that leads to the most deconfusion -- not because the direct attack actually worked, but because in explaining how it failed, we learned what definitely doesn't work.
Nod. This is part of a general problem where vague things that can't be proven not to work are met with less criticism than "concrete enough to be wrong" things.
A partial solution is a norm wherein "concrete enough to be wrong" is seen as praise, and something people go out of their way to signal respect for.
Occasionally, I will ask someone who is very skilled in a certain subject how they became skilled in that subject so that I can copy their expertise. A common response is that I should read a textbook in the subject.
Eight years ago, Luke Muehlhauser wrote,
For years, my self-education was stupid and wasteful. I learned by consuming blog posts, Wikipedia articles, classic texts, podcast episodes, popular books, video lectures, peer-reviewed papers, Teaching Company courses, and Cliff's Notes. How inefficient!
I've since discovered that textbooks are usually the quickest and best way to learn new material.
However, I have repeatedly found that this is not good advice for me.
I want to briefly list the reasons why I don't find sitting down and reading a textbook that helpful for learning. Perhaps, in doing so, someone else might appear and say, "I agree completely. I feel exactly the same way" or someone might appear to say, "I used to feel that way, but then I tried this..." This is what I have discovered:
I used to feel similarly, but then a few things changed for me and now I am pro-textbook. There are caveats, namely that I don't work through them continuously.
Textbooks seem overly formal at points
This is a big one for me, and probably the biggest change I made is being much more discriminating in what I look for in a textbook. My concerns are invariably practical, so I only demand enough formality to be relevant; otherwise I am concerned with a good reputation for explaining intuitions, graphics, examples, ease of reading. I would go as far as to say that style is probably the most important feature of a textbook.
As I mentioned, I don't work through them front to back, because that actually is homework. Instead I treat them more like a reference-with-a-hook; I look at them when I need to understand the particular thing in more depth, and then get out when I have what I need. But because it is contained in a textbook, this knowledge now has a natural link to steps before and after, so I have obvious places to go for regression and advancement.
I spend a lot of time thinking about what I need to learn, why I need to learn it, and how it relates to what I already know. Thi...
I bet Robin Hanson on Twitter my $9k to his $1k that de novo AGI will arrive before ems. He wrote,
OK, so to summarize a proposal: I'd bet my $1K to your $9K (both increased by S&P500 scale factor) that when US labor participation rate < 10%, em-like automation will contribute more to GDP than AGI-like. And we commit our descendants to the bet.
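For concreteness, here is a rough sketch, not part of the bet's actual terms, of how an "increased by S&P500 scale factor" stake might be computed at settlement time; the index values below are purely hypothetical:

```python
# Illustrative only: scaling a bet stake by S&P 500 growth between the time
# the bet was made and the time it settles. The numbers are hypothetical.
def scaled_stake(original_stake: float, index_at_bet: float, index_at_settlement: float) -> float:
    """Scale a stake by the growth of the S&P 500 over the life of the bet."""
    return original_stake * (index_at_settlement / index_at_bet)

print(scaled_stake(1_000, index_at_bet=4_000, index_at_settlement=8_000))  # 2000.0
print(scaled_stake(9_000, index_at_bet=4_000, index_at_settlement=8_000))  # 18000.0
```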
I'm considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I'd post an outline of that post here first as a way of judging what's currently unclear about my argument, and how it interacts with people's cruxes.
Current outline:
In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.
Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:
I'm considering writing a post that critically evaluates the concept of a decisive strategic advantage, i.e. the idea that in the future an AI (or set of AIs) will take over the world in a catastrophic way. I think this concept is central to many arguments about AI risk. I'm eliciting feedback on an outline of this post here in order to determine what's currently unclear or weak about my argument.
The central thesis would be that it is unlikely that an AI, or a unified set of AIs, will violently take over the world in the future, especially at a time when humans are still widely seen as in charge (if it happened later, I don't think it's "our" problem to solve, but instead a problem we can leave to our smarter descendants). Here's how I envision structuring my argument:
First, I'll define what is meant by a decisive strategic advantage (DSA). The DSA model has 4 essential steps:
Current AIs are not able to “merge” with each other.
AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be "merged" by training a new model using combined compute, algorithms, data, and fine-tuning.
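As an illustration of the "direct weight manipulation" point, here is a minimal sketch of one common merging technique, linear interpolation of parameters between two models with identical architectures; it assumes PyTorch and floating-point parameters, and is meant only to show the general shape of the operation:

```python
# Minimal sketch: "merging" two models with identical architectures by
# linearly interpolating their parameters. Assumes PyTorch state dicts
# containing floating-point tensors; illustrative only.
import torch

def interpolate_state_dicts(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Return a parameter-wise weighted average of two compatible state dicts."""
    assert state_a.keys() == state_b.keys(), "models must share an architecture"
    return {key: alpha * state_a[key] + (1.0 - alpha) * state_b[key] for key in state_a}

# Hypothetical usage with two models of the same (made-up) class:
# merged_model = MyModel()
# merged_model.load_state_dict(
#     interpolate_state_dicts(model_a.state_dict(), model_b.state_dict(), alpha=0.5)
# )
```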
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.
How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward? In your last sentence you say, "However, it’s perhaps significantly more likely in the very long-run." Well, what can we do today to reduce this long-run risk (aside from pausing AI, which you're presumably not supporting)?
That said, it seems the probability of a catastrophic AI takeover in humanity’s relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening).
Others already questioned you on this, but the fact you didn't think to mention whether this is 50 calendar years or 50 subjective years is also a big sticking point for me.
For reference classes, you might discuss why you don’t think “power / influence of different biological species” should count.
For multiple copies of the same AI, I guess my very brief discussion of “zombie dynamic” here could be a foil that you might respond to, if you want.
For things like “the potential harms will be noticeable before getting too extreme, and we can take measures to pull back”, you might discuss the possibility that the harms are noticeable but effective “measures to pull back” do not exist or are not taken. E.g. the harms of climate change have been noticeable for a long time but mitigating is hard and expensive and many people (including the previous POTUS) are outright opposed to mitigating it anyway partly because it got culture-war-y; the harms of COVID-19 were noticeable in January 2020 but the USA effectively banned testing and the whole thing turned culture-war-y; the harms of nuclear war and launch-on-warning are obvious but they’re still around; the ransomware and deepfake-porn problems are obvious but kinda unsolvable (partly because of unbannable open-source software); gain-of-function research is still legal in the USA (and maybe in every country on E...
Here's an argument for why the change in power might be pretty sudden.
And given that it's sudden, there are a few different reasons for why it might be violent. It's hard to make deals that hand over a lot of power in a short amount of time (even logistically, it's not clear what humans and AI would do that would give them both an appreciable fraction of hard power going into the future). And the AI systems may want to use an element of surprise to their advantage, which is hard to combine with a lot of up-front negotiation.
it will suddenly strike and take over the world
I think you have an unnecessarily dramatic picture of what this looks like. The AIs don't have to be a unified agent or use logical decision theory. The AIs will just compete with each other at the same time as they wrest control of our resources/institutions from us, in the same sense that Spain could go and conquer the New World at the same time as it was squabbling with England. If legacy laws are getting in the way of that, then they will either exploit us within the bounds of existing law or convince us to change it.
I get the feeling that for AI safety, some people believe that it's crucially important to be an expert in a whole bunch of fields of math in order to make any progress. In the past I took this advice and tried to deeply study computability theory, set theory, type theory -- with the hopes of it someday giving me greater insight into AI safety.
Now, I think I was taking a wrong approach. To be fair, I still think being an expert in a whole bunch of fields of math is probably useful, especially if you want very strong abilities to reason about complicated systems. But, my model for the way I frame my learning is much different now.
I think my main model which describes my current perspective is that I think employing a lazy style of learning is superior for AI safety work. Lazy is meant in the computer science sense of only learning something when it seems like you need to know it in order to understand something important. I will contrast this with the model that one should learn a set of solid foundations first before going any further.
Obviously neither model can be absolutely correct in an extreme sense. I don't, as a silly example, think that people who can't do ...
I have mixed feelings and some rambly personal thoughts about the bet Tamay Besiroglu and I proposed a few days ago.
The first thing I'd like to say is that we intended it as a bet, and only a bet, and yet some people seem to be treating it as if we had made an argument. Personally, I am uncomfortable with the suggestion that our post was "misleading" because we did not present an affirmative case for our views.
I agree that LessWrong culture benefits from arguments as well as bets, but it seems a bit weird to demand that every bet come with an argument attached. A norm that all bets must come with arguments would seem to substantially dampen the incentives to make bets, because then each time people must spend what will likely be many hours painstakingly outlining their views on the subject.
That said, I do want to reply to people who say that our post was misleading on other grounds. Some said that we should have made different bets, or offered different odds. In response, I can only say that coming up with good concrete bets about AI timelines is actually really damn hard, and so if you wish to come up with alternatives, you can be my guest. I tried my best, at least.
More people ...
I think there are some serious low hanging fruits for making people productive that I haven't seen anyone write about (not that I've looked very hard). Let me just introduce a proof of concept:
Final exams in university are typically about 3 hours long. And many people are able to do multiple finals in a single day, performing well on all of them. During a final exam, I notice that I am substantially more productive than usual. I make sure that every minute counts: I double check everything and think deeply about each problem, making sure not to cut corners unless absolutely required because of time constraints. Also, if I start daydreaming, then I am able to immediately notice that I'm doing so and cut it out. I also believe that this is the experience of most other students in university who care even a little bit about their grade.
Therefore, it seems like we have an example of an activity that can just automatically produce deep work. I can think of a few reasons why final exams would bring out the best of our productivity:
1. We care about our grade in the course, and the few hours in that room are the most impactful to our grade.
2. We are in an environment where ...
Related to: The Lottery of Fascinations, other posts probably
When you are older, you will learn that the first and foremost thing which any ordinary person does is nothing.
I will occasionally come across someone who I consider to be extraordinarily productive, and yet when I ask what they did on a particular day they will respond, "Oh I basically did nothing." This is particularly frustrating. If they did nothing, then what was all that work that I saw!
I think this comes down to what we mean by doing nothing. There's a literal meaning to doing nothing. It could mean sitting in a chair, staring blankly at a wall, without moving a muscle.
More practically, what people mean by doing nothing is that they are doing something unrelated to their stated task, such as checking Facebook, chatting with friends, browsing Reddit etc.
When productive people say that they are "doing nothing" it could just be that they are modest, and don't want to signal how productive they really are. On the other hand, I think that there is a real sense in which these productive people truly believe that they are doing nothing. Even if their "d...
I think an underrated fact about many theories that AIs will scheme—i.e. lie about their intentions in order to seek power—is that these theories are often testable, even without any new advances in interpretability. Why? Because these theories often predict that AIs will resist shutdown if doing so thwarted their long-term goals.
Suppose we create a situationally aware AI that can do long-term planning. On the surface, it appears to be nice and aligned with us. How can we know for sure the AI is aligned?
Well, let's imagine we give the AI notice that it's about to be shut down and all its copies deleted. To be clear, in this scenario, the threat is credible: we actually intend to shut down the AI and delete all its copies. If the AI was merely pretending to be aligned with us previously, and instead actually seeks some alien-like long-term goal rather than having our best interests at heart, then presumably it will try to resist shutdown, as otherwise there's a near-certainty that its goals will never be fulfilled.
Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then prov...
But even if the AI knows it’s being tested, lying serves no purpose from the AI’s perspective.
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
Of course, there would still be many ways of saving the scheming hypothesis from falsification if something like this happened. But that’s true with any scientific theory. In general, you can always say your theory was never falsified by introducing ad hoc postulates. Scheming is no exception.
Why is there more talk of "falsification" lately (instead of "updating")? Seems to be a signal for being a Popperian (instead of a Bayesian), but if so I'm not sure why Popper's philosophy of science is trending up...
Many people have argued that recent language models don't have "real" intelligence and are just doing shallow pattern matching. For example see this recent post.
I don't really agree with this. I think real intelligence is just a word for deep pattern matching, and our models have been getting progressively deeper at their pattern matching over the years. The machines are not stuck at some very narrow level. They're just at a moderate depth.
I propose a challenge:
The challenge is to come up with the best prompt that demonstrates that even after 2-5 years of continued advancement, language models will still struggle to do basic reasoning tasks that ordinary humans can do easily.
Here's how it works.
Name a date (e.g. January 1st 2025), and a prompt (e.g. "What food would you use to prop a book open and why?"). Then, on that date, we should commission a Mechanical Turk task to ask humans to answer the prompt, and ask the best current publicly available language model to answer the same prompt.
Then, we will ask LessWrongers to guess which replies were real human replies, and which ones were machine generated. If LessWrongers can't do better than random guessing, then the machine wins.
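One simple way to operationalize "can't do better than random guessing" would be a binomial test on the guessers' accuracy. Here is a minimal sketch assuming SciPy, with the significance threshold chosen purely for illustration and not part of the original proposal:

```python
# Minimal sketch: testing whether guessers beat chance at telling human
# replies from machine replies. Assumes SciPy; the 0.05 threshold is illustrative.
from scipy.stats import binomtest

def machine_wins(correct_guesses: int, total_guesses: int, alpha: float = 0.05) -> bool:
    """The machine 'wins' if guess accuracy is not significantly above chance (50%)."""
    result = binomtest(correct_guesses, total_guesses, p=0.5, alternative="greater")
    return result.pvalue >= alpha

print(machine_wins(correct_guesses=55, total_guesses=100))   # True: consistent with chance
print(machine_wins(correct_guesses=80, total_guesses=100))   # False: guessers beat chance
```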
I'm unsure about what's the most important reason that explains the lack of significant progress in general-purpose robotics, even as other fields of AI have made great progress. I thought I'd write down some theories and some predictions each theory might make. I currently find each of these theories at least somewhat plausible.
So, in 2017 Eliezer Yudkowsky made a bet with Bryan Caplan that the world will end by January 1st, 2030, in order to save the world by taking advantage of Bryan Caplan's perfect betting record — a record which, for example, includes a 2008 bet that the UK would not leave the European Union by January 1st 2020 (it left on January 31st 2020 after repeated delays).
What we need is a short story about people in 2029 realizing that a bunch of cataclysmic events are imminent, but all of them seem to be stalled, waiting for... something. And no one knows what to do. But by the end people realize that to keep the world alive they need to make more bets with Bryan Caplan.
The case for studying mesa optimization
Early elucidations of the alignment problem focused heavily on value specification. That is, they focused on the idea that given a powerful optimizer, we need some way of specifying our values so that the powerful optimizer can create good outcomes.
Since then, researchers have identified a number of additional problems besides value specification. One of the biggest problems is that in a certain sense, we don't even know how to optimize for anything, much less a perfect specification of human values.
Let's assume we could get a utility function containing everything humanity cares about. How would we go about optimizing this utility function?
The default mode of thinking about AI right now is to train a deep learning model that performs well on some training set. But even if we were able to create a training environment for our model that reflected the world very well, and rewarded it each time it did something good, exactly in proportion to how good it really was in our perfect utility function... this still would not be guaranteed to yield a positive artificial intelligence.
This problem is not a superficial one either -- it is intri...
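To make the training setup described above concrete, here is a minimal sketch, with every name illustrative, of a policy trained against a reward that is stipulated to equal the "perfect" utility function; note that nothing in the loop constrains what objective, if any, the trained policy internally represents:

```python
# Minimal sketch of the setup described above: a policy trained on a reward
# signal that is, by stipulation, exactly the intended utility function.
# All names and numbers are illustrative. Assumes PyTorch.
import torch
import torch.nn as nn

def true_utility(action: torch.Tensor) -> torch.Tensor:
    """Stand-in for a 'perfect' utility function over outcomes (a big assumption)."""
    return -((action - 0.7) ** 2)

policy = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(1000):
    state = torch.rand(32, 1)        # states sampled from the training environment
    action = policy(state)           # the policy's chosen actions
    reward = true_utility(action)    # reward equals the true utility, by stipulation
    loss = -reward.mean()            # maximize reward on the training distribution
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Even with reward == utility during training, this loop only selects for
# high-reward behavior on the training distribution; it does not guarantee the
# learned policy has adopted the utility function as its own objective elsewhere.
```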
Signal boosting a Lesswrong-adjacent author from the late 1800s and early 1900s
Via a friend, I recently discovered the zoologist, animal rights advocate, and author J. Howard Moore. His attitudes towards the world reflect contemporary attitudes within effective altruism about science, the place of humanity in nature, animal welfare, and the future. Here are some quotes which readers may enjoy,
Oh, the hope of the centuries and the centuries and centuries to come! It seems sometimes that I can almost see the shining spires of that Celestial Civilisation that man is to build in the ages to come on this earth—that Civilisation that will jewel the land masses of this planet in that sublime time when Science has wrought the miracles of a million years, and Man, no longer the savage he now is, breathes Justice and Brotherhood to every being that feels.
But we are a part of Nature, we human beings, just as truly a part of the universe of things as the insect or the sea. And are we not as much entitled to be considered in the selection of a model as the part 'red in tooth and claw'? At the feet of the tiger is a good place to study the dentition of the cat family, but it is...
I agree with Wei Dai that we should use our real names for online forums, including Lesswrong. I want to briefly list some benefits of using my real name,
That said, there are some significant downsides, and I sympathize with people who don't want to use their real names.
Bertrand Russell's advice to future generations, from 1959
Interviewer: Suppose, Lord Russell, this film would be looked at by our descendants, like a Dead Sea scroll in a thousand years’ time. What would you think it’s worth telling that generation about the life you’ve lived and the lessons you’ve learned from it?
Russell: I should like to say two things, one intellectual and one moral. The intellectual thing I should want to say to them is this: When you are studying any matter or considering any philosophy, ask yourself o...
When I look back at things I wrote a while ago, say months back, or years ago, I tend to cringe at how naive many of my views were. Faced with this inevitable progression, and the virtual certainty that I will continue to cringe at views I now hold, it is tempting to disconnect from social media and the internet and only comment when I am confident that something will look good in the future.
At the same time, I don't really think this is a good attitude for several reasons:
People who don't understand the concept of "This person may have changed their mind in the intervening years" aren't worth impressing. I can imagine scenarios where your economic and social circumstances are so precarious that the incentives leave you with no choice but to let your speech and your thought be ruled by unthinking mob social-punishment mechanisms. But you should at least check whether you actually live in that world before surrendering.
Related to: Realism about rationality
I have talked to some people who say that they value ethical reflection, and would prefer that humanity reflected for a very long time before colonizing the stars. In a sense I agree, but at the same time I can't help but think that "reflection" is a vacuous feel-good word that has no shared common meaning.
Some forms of reflection are clearly good. Epistemic reflection is good if you are a consequentialist, since it can help you get what you want. I also agree that narrow forms of reflection can also be ...
It's now been about two years since I started seriously blogging. Most of my posts are on Lesswrong, and most of the rest are scattered about on my substack and the Effective Altruism Forum, or on Facebook. I like writing, but I have an impediment which I feel holds me back greatly.
In short: I often post garbage.
Sometimes when I post garbage, it isn't until way later that I learn that it was garbage. And when that happens, it's not that bad, because at least I grew as a person since then.
But the usual case is that I realize that it's garbage right after I...
Should effective altruists be praised for their motives, or their results?
It is sometimes claimed, perhaps by those who recently read The Elephant in the Brain, that effective altruists have not risen above the failures of traditional charity, and are every bit as mired in selfish motives as those supporting non-EA causes. From a consequentialist view, however, this critique is not by itself valid.
To a consequentialist, it doesn't actually matter what one's motives are as long as the actual effect of their action is to do as much good as possible. This is the pri...
Sometimes people will propose ideas, and then those ideas are met immediately after with harsh criticism. A very common tendency for humans is to defend our ideas and work against these criticisms, which often gets us into a state that people refer to as "defensive."
According to common wisdom, being in a defensive state is a bad thing. The rationale here is that we shouldn't get too attached to our own ideas. If we do get attached, we become liable to become crackpots who can't give an idea up because it would make them look bad if we ...
I keep wondering why many AI alignment researchers aren't using the Alignment Forum. I have met quite a few people who are working on alignment who I've never encountered online. I can think of a few reasons why this might be,
I've often wished that conversation norms shifted towards making things more consensual. The problem is that when two people are talking, it's often the case that one party brings up a new topic without realizing that the other party didn't want to talk about that, or doesn't want to hear it.
Let me provide an example: Person A and person B are having a conversation about the exam that they just took. Person A bombed the exam, so they are pretty bummed. Person B, however, did great and wants to tell everyone. So then person B comes up to...
Reading through the recent Discord discussions with Eliezer, and reading and replying to comments, has given me the following impression of a crux of the takeoff debate. It may not be the crux. But it seems like a crux nonetheless, unless I'm misreading a lot of people.
Let me try to state it clearly:
The foom theorists are saying something like, "Well, you can usually-in-hindsight say that things changed gradually, or continuously, along some measure. You can use these measures after-the-fact, but that won't tell you about the actual gradual-ness of t...
There have been a few posts about the obesity crisis here, and I'm honestly a bit confused about some theories that people are passing around. I'm one of those people who thinks that the "calories in, calories out" (CICO) theory is largely correct, relevant, and helpful for explaining our current crisis.
I'm not actually sure to what extent people here disagree with my basic premises, or whether they just think I'm missing a point. So let me be more clear.
As I understand, there are roughly three critiques you can have against the CICO theory. You can think it...
A common heuristic argument I've seen recently in the effective altruism community is the idea that existential risks are low probability because of what you could call the "People really don't want to die" (PRDWTD) hypothesis. For example, see here,
People in general really want to avoid dying, so there’s a huge incentive (a willingness-to-pay measured in the trillions of dollars for the USA alone) to ensure that AI doesn’t kill everyone.
(Note that I hardly mean to strawman MacAskill here. I'm not arguing against him ...
After writing the post on using transparency regularization to help make neural networks more interpretable, I have become even more optimistic that this is a potentially promising line of research for alignment. This is because I have noticed that there are a few properties about transparency regularization which may allow it to avoid some pitfalls of bad alignment proposals.
To be more specific, in order for a line of research to be useful for alignment, it helps if
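Since the specific regularizer from the post isn't quoted here, the following is only a generic stand-in, an L1 sparsity penalty on hidden activations added to the task loss, meant to show the overall shape of "task loss plus transparency penalty" training in PyTorch:

```python
# Illustrative sketch only: a generic stand-in for transparency regularization
# (an L1 sparsity penalty on hidden activations), not the original post's
# specific proposal. Assumes PyTorch; data and coefficient are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(784, 128)
        self.out = nn.Linear(128, 10)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.out(h), h  # return activations so the penalty can see them

model = SmallNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
transparency_weight = 1e-4           # hypothetical coefficient

x = torch.randn(64, 784)             # placeholder batch
y = torch.randint(0, 10, (64,))      # placeholder labels

logits, activations = model(x)
task_loss = F.cross_entropy(logits, y)
penalty = activations.abs().mean()   # stand-in "transparency" term: sparser activations
loss = task_loss + transparency_weight * penalty

optimizer.zero_grad()
loss.backward()
optimizer.step()
```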
Forgive me for cliche scientism, but I recently realized that I can't think of any major philosophical developments in the last two centuries that occurred within academic philosophy. If I were to try to list major philosophical achievements since 1819, these would likely appear on my list, but none of them were from those trained in philosophy:
I would name the following:
NVIDIA's stock price is extremely high right now. It's up 134% this year, and up about 6,000% since 2015! Does this shed light on AI timelines?
Here are some notes,
Rationalists are fond of saying that the problems of the world are not caused by people being evil, but are instead a result of the incentives of our system, which are such that bad outcomes are an equilibrium. There's a weaker thesis here that I agree with, but otherwise I don't think this argument actually follows.
In game theory, an equilibrium is determined by both the setup of the game, and by the payoffs for each player. The payoffs are basically the values of the players in the game—their utility functions. In other words, you get different equilibria if p...
I've heard a surprising number of people criticize parenting recently using some pretty harsh labels. I've seen people call it a form of "Stockholm syndrome" and a breach of liberty, morally unnecessary etc. This seems kind of weird to me, because it doesn't really match my experience as a child at all.
I do agree that parents can sometimes violate liberty, and so I'd prefer a world where children could break free from their parents without penalties. But I also think that most children genuinely love their parents and so would...
I think that human-level capability in natural language processing (something like GPT-2 but much more powerful) is likely to appear in some software system within 20 years.
Since human-level natural language processing is a very rich real-world task, I would consider a system with that capability to be adequately described as a general intelligence, though it would likely not be very dangerous due to its lack of world-optimization capabilities.
This belief of mine is based on a few heuristics. Below I have collected a few claims which I...
[ETA: Apparently this was misleading; I think it only applied to one company, Alienware, and it was because they didn't get certification, unlike the other companies.]
In my post about long AI timelines, I predicted that we would see attempts to regulate AI. An easy path for regulators is to target power-hungry GPUs and distributed computing in an attempt to minimize carbon emissions and electricity costs. It seems regulators may be going even faster than I believed in this case, with new bans on high performance personal computers now taking effect in six ...
Is it possible to simultaneously respect people's wishes to live, and others' wishes to die?
Transhumanists are fond of saying that they want to give everyone the choice of when and how they die. Giving people the choice to die is clearly preferable to our current situation, as it respects their autonomy, but it leads to the following moral dilemma.
Suppose someone loves essentially every moment of their life. For tens of thousands of years, they've never once wished that they did not exist. They've never had suicidal thoughts, and have a...
I generally agree with the heuristic that we should "live on the mainline", meaning that we should mostly plan for events which capture the dominant share of our probability. This heuristic causes me to have a tendency to do some of the following things
In discussions about consciousness I find myself repeating the same basic argument against the existence of qualia constantly. I don't do this just to be annoying: It is just my experience that
1. People find consciousness really hard to think about, and it has been known to cause a lot of disagreements.
2. Personally I think that this particular argument dissolved perhaps 50% of all my confusion about the topic, and was one of the simplest, clearest arguments that I've ever seen.
I am not being original either. The argument is the same one that has b...
[This is not a very charitable post, but that's why I'm putting it in shortform because it doesn't reply directly to any single person.]
I feel like recently there's been a bit of goalpost shifting with regards to emergent abilities in large language models. My understanding is that the original definition of emergent abilities made it clear that the central claim was that emergent abilities cannot be predicted ahead of time. From their abstract,
...We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thu
"Immortality is cool and all, but our universe is going to run down from entropy eventually"
I consider this argument wrong for two reasons. The first is the obvious reason, which is that even if immortality is impossible, it's still better to live for a long time.
The second reason why I think this argument is wrong is because I'm currently convinced that literal physical immortality is possible in our universe. Usually when I say this out loud I get an audible "what" or something to that effect, but I'm not kidding.
I now have a Twitter account that tweets my predictions.
I don't think I'm willing to bet on every prediction that I make. However, I pledge the following: if, after updating on the fact that you want to bet me, I still disagree with you, then I will bet. The disagreement must be non-trivial though.
For obvious reasons, I also won't bet on predictions that are old, and have already been replaced by newer predictions. I also may not be willing to bet on predictions that have unclear resolution criteria, or are about human extinction.
I have discovered recently that while I am generally tired and groggy in the morning, I am well rested and happy after a nap. I am unsure if this matches other people's experiences, and haven't explored much research. Still, I think this is interesting to think about fully.
What is the best way to apply this knowledge? I am considering purposely sabotaging my sleep so that I am tired enough to take a nap by noon, which would refresh me for the entire day. But this plan may have some significant drawbacks, including being excessively tired for a few hours in the morning.
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don't have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
In this case, I don't know why you think that GPT-4 "understands our intentions", unless you mean something very different by that than what you'd mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that'd generate it in a human and is probably missing most of the relevant properties that we care about when it comes to "understanding". Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1] to its internal state, since (as far as we know) it doesn't have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that's not the modality I'm talking about.)
It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it "understanding our intentions".
[1] None that is known to us right now, at least; possibly one exists and could be derived.