Why Stop AI is barricading OpenAI

Remmelt

14th Oct 2024

Linkpost from docs.google.com

Stop AI just put out a short press release.

As an organiser, let me add some thoughts to nuance the text:

Plan

On October 21st, 2024 at 12:00 pm Stop AI will peacefully barricade, via sit-in, OpenAI's office entrance gate

Emphasis is on peacefully. We are a non-violent activist organisation. We refuse to work with any activist who has other plans.

We could very easily stop the development of Artificial General Intelligence if a small group of people repeatedly barricaded entrances at AI company offices and data centers.

My take is that a small group barricading OpenAI is a doable way to be a thorn in OpenAI's side, while raising public attention to the recklessness of AI corporations. From there, stopping AI development requires many concerned communities acting together to restrict the data, work, uses, and hardware of AI.

We will be arrested on the 21st for barricading OpenAI's gate, then once released, will eventually go back to blocking the gate. We will repeatedly block the 575 Florida St gate until we are held on remand.

My co-organisers Sam and Guido are willing to put their bodies on the line by getting arrested repeatedly. We are that serious about stopping AI development.

We will then go to trial and plead the Necessity Defense.

The Necessity Defense is when an "individual commits a criminal act during an emergency situation in order to prevent a greater harm from happening." This defense has been used by climate activists who got arrested, with mixed results. Sam and others will be testifying in court that we acted to prevent imminent harms (not just extinction risk).

If we win the Necessity Defense, then we may be able to block entrances at AI offices and data centers to our heart’s content.

Or at least, we would gain legal freedom to keep blocking OpenAI's entrances until they stop causing increasing harms.

63% of Americans say that regulators should actively prevent the development of superintelligent AI. (AI Policy Institute Poll Sep 02 2023). OpenAI and the US government disregard the will of the people.

Our actions are a way to signal to the concerned public that they can act and speak out against AI companies.

I expect that most Americans do not yet feel strongly about preventing the development of generally functional systems. Clicking a response in a poll framed a certain way is low-commitment. So we will also broadcast more stories of how recklessly AI corporations have been acting with our lives.

Risk of extinction

AI experts have said in polls that building AGI carries a 14-30% chance of causing human extinction!

My colleague took the mean ~~median~~ number of 14% from the latest AI Impacts survey, and the median number of 30% from the smaller-sample survey 'Existential Risk from AI'. Putting a mean and a median number in the same range does not make sense. The second survey in particular also has a problem with self-selection, so I would take it with a grain of salt.
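
To make the mean-versus-median point concrete, here is a minimal Python sketch. The response values are hypothetical and chosen only to show a skewed distribution; they are not data from either survey.

```python
from statistics import mean, median

# Hypothetical P(extinction) answers in percent. NOT taken from any survey;
# a few very high answers make the distribution right-skewed.
responses = [0, 1, 2, 5, 5, 10, 10, 20, 50, 90]

mean_pct = mean(responses)      # pulled upward by the few high answers
median_pct = median(responses)  # the middle of the sorted answers

print(f"mean:   {mean_pct:.1f}%")
print(f"median: {median_pct:.1f}%")
```

On skewed answers like these, the mean and the median of the same sample can differ by a large factor, so a range built from one survey's mean and another survey's median mixes two different kinds of summary.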

My colleague told me that the survey results understate the risk, because AI researchers don't want to believe that their profession will lead to the end of the world. I countered that polled AI researchers could just as well be overstating the risk, because they are stuck in a narrow worldview that has been promoting the imminence of powerful AI since 1956.

But both are just vague opinions about cultural bias. Making social claims about "experts" does not really help us find out whether/where the polled "experts" actually thought things through.

Asking for P(doom) guesses is a lousy epistemic process, so I prefer to work through people's reasoning instead. Below are arguments why the long-term risk of extinction is above 99%.

And some of these same AI experts say AGI could be here this year!

"AGI in a year" makes no sense in my opinion. AI systems would require tinkering and learning to navigate the complexity of a much larger and messier environment. This process is not at all like AlphaGo recursively self-improving in its moves on an internally simulated 19x19 grid.

But if you are worried about such short timelines, then it is time to act. We've seen too many people standing on the sidelines worrying we could all die soon. If you think this, please act with dignity – collaborate where you can to restrict AI development.

The probability of AGI causing human extinction is greater than 99% because there is no way to prove experimentally or mathematically that an AGI won't eventually want something that will lead to our extinction...

That's a reasoning leap, but there is only so much my colleague could cover in a press release.

Let me explain per term why the risk of extinction would be greater than 99%:

1. "experimentally"
- It is not possible to prove experimentally (to "non-falsify") in advance that AGI would be safe, because there is no AGI yet.
2. "mathematically"
- It is not possible to create an empirically sound model of how the self-modifying machinery (AGI) would be causing downstream effects through the machine components' interactions with the larger surrounding world over time. Therefore, it is not possible to soundly prove using mathematics that AGI would stay safe over time.
3. "eventually"
- In practice, engineers know that complex architectures interacting with the surrounding world end up having functional failures (because of unexpected interactive effects, or noisy interference). With AGI, we are talking about an architecture that would replace all our jobs and move on to managing conditions across our environment. If AGI continues to persist in some form over time, failures will occur and build up toward lethality at some unknown rate. Over a long enough period, this repeated potential for uncontrolled failures pushes the risk of human extinction above 99%.
4. "won't"
- A counterclaim here is that maybe AGI "will" be able to exert control to prevent virtually all of those possible failures. Unfortunately, there are fundamental limits to control (see e.g. Yampolskiy's list). Control mechanisms cannot control enough of the many possible destabilizing effects feeding back over time (if you want to see this formalised, join this project).
5. "lead to our extinction"
- AGI is artificial. The reason why AGI would outperform humans at economically valuable work in the first place is because of how virtualisable its code is, which in turn derives from how standardisable its hardware is. Hardware parts can be standardised because their substrate stays relatively stable and compartmentalised. Hardware is made out of hard materials, like the silicon from rocks. Their molecular configurations are chemically inert and physically robust under human living temperatures and pressures. This allows hardware to keep operating the same way, and for interchangeable parts to be produced in different places. Meanwhile, human "wetware" operates much more messily. Inside each of us is a soup of bouncing and continuously reacting organic molecules. Our substrate is fundamentally different.
- The population of artificial components that constitutes AGI implicitly has different needs than us (for maintaining components, producing components, and/or potentiating newly connected functionality for both). Extreme temperature ranges, diverse chemicals – and many other unknown/subtler/more complex conditions – are needed that happen to be lethal to humans. These conditions are in conflict with our needs for survival as more physically fragile humans.
- These connected/nested components are in effect “variants” – varying code gets learned from inputs and copied over subtly varying hardware, which is produced through noisy assembly processes (and redesigned using learned code).
- Variants get evolutionarily selected for how they function across the various contexts they encounter over time. They are selected to express environmental effects that are needed for their own survival and production. The variants that replicate more, exist more. Their existence is selected for.
- The artificial population therefore converges on fulfilling their own expanding needs. Since (by 4.) control mechanisms cannot contain this convergence on wide-ranging degrees and directivity in effects that are lethal to us, human extinction results.
6. "want"
- This convergence on human extinction would happen regardless of what AGI "wants". Whatever AGI is controlling for at a higher level would gradually end up being repurposed to reflect the needs of its constituent population. As underlying components converge on expressing implicit needs, any higher-level optimisation by AGI toward explicit goals gets shaped in line with those needs. Additionally, that optimisation process itself tends to converge on instrumental outcomes for self-preservation, etc.
- So for AGI not to wipe out humans, at the minimum its internal control process would have to simultaneously:
  - optimise against instances of instrumental convergence across the AGI's entire optimisation process, and;
  - optimise against evolutionary feedback effects over the entire span of interactions between all the hardware and code (which are running the optimisation process) and all the surrounding contexts of the larger environment, and;
  - optimise against other accidental destabilising effects ('failures') that result from AGI components interacting iteratively within a more complex (and therefore only partly grossly modellable) environment.
- Again, there is a fundamental mismatch here, making sufficient control impossible.
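
The step under "eventually", from failures occurring "at some unknown rate" to a risk "above 99%" over a long enough period, can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, a fixed and independent probability `p_per_period` of an unrecoverable failure in each period. Neither the fixed value nor the independence is claimed in the argument above, and the function names are hypothetical.

```python
import math

def cumulative_risk(p_per_period: float, periods: int) -> float:
    """Probability of at least one unrecoverable failure across `periods`
    independent periods, each with failure probability `p_per_period`."""
    return 1.0 - (1.0 - p_per_period) ** periods

def periods_until(threshold: float, p_per_period: float) -> int:
    """Smallest whole number of periods at which the cumulative risk
    reaches `threshold` (solves 1 - (1 - p)^n >= threshold for n)."""
    return math.ceil(math.log(1.0 - threshold) / math.log(1.0 - p_per_period))

if __name__ == "__main__":
    # Illustrative values only; the per-period probability is an assumption.
    for p in (0.01, 0.001):
        n = periods_until(0.99, p)
        print(f"p = {p}: risk passes 99% after {n} periods "
              f"(cumulative risk = {cumulative_risk(p, n):.4f})")
```

Under these assumptions, a smaller per-period probability only stretches the timescale; it does not change where the cumulative risk ends up. If the per-period probability instead shrank fast enough over time, the cumulative risk would not have to approach 1, which is why the argument leans on the claimed limits to control in the next point.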

If experimental proof of indefinite safety is impossible, then don't build it!

This by itself is a precautionary principle argument (as part of 1-3. above).

If we don't have a sound way of modelling that AGI won't eventually lead to our deaths – or that at least guarantees that the long-term risk is below some reasonably tiny % threshold – then we should just not develop AGI.

Then, there are reasons why AGI uncontrollably converges on human extinction (see 4-6.).

Hopefully, arguments 1-6. combined clarify why I think that stopping AI development is the only viable path to preventing our extinction.

That is:

Even if engineers build mechanisms into AGI for controlling its trackable external effects in line with internal reference values, in turn compressed lossily from preferences that individual humans expressed in their context… then still AGI converges on our extinction.
Even if such “alignment” mechanisms were not corrupted by myopic or malevolent actors… then still AGI converges on our extinction.

Why restrict OpenAI

OpenAI is a non-profit that is illegally turning itself into a company. A company that is laundering our online texts and images, to reprogram an energy-sucking monstrosity, to generate disinformation and deepfakes. Even staff started airing concerns. Then almost all safety researchers, board members and executives got pushed out. Behind the exodus is a guy known for dishonestly maneuvering for power (and abusing his sister). His signature joke: “AI will probably most likely lead to the end of the world, but in the meantime, there'll be great companies.”

OpenAI is already doing a lot of harm. This grounds our Necessity Defense for barricading their office.

If you value your life and the lives of those you love, then you should start protesting now to help us achieve this demand. Go to www.stopai.info today to join.

We are doing what we can to restrict harmful AI development.
You can too.

32 comments

Joseph Miller, 1y

Respect for doing this.

I strongly wish you would not tie StopAI to the claim that extinction is >99% likely. It means that even your natural supporters in PauseAI will have to say "yes I broadly agree with them but disagree with their claims about extinction being certain."

I would also echo the feedback here. There's no reason to write in the same style as cranks.

Thomas Kwa, 1y

It's not just the writing that sounds like a crank. Core arguments that Remmelt endorses are AFAIK considered crankery by the community; with all the classic signs like

- making up science-babble,
- claiming to have a ~~full mathematical~~ proof that safe AI is impossible, despite not providing any formal mathematical reasoning
  - claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory, Rice's Theorem
- inexplicably formatted as a poem

Paul Christiano read some of this and concluded "the entire scientific community would probably consider this writing to be crankery", which seems about accurate to me.

Now I don't like or intend to make personal attacks. But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences, even when the conclusions of cranks and their collaborators superficially agree with the conclusions from actually good arguments.

Roland Pihlakas, 1y

I think your own message is also too extreme to be rational. So it seems to me that you are fighting fire with fire. Yes, Remmelt has some extreme expressions, but you definitely have extreme expressions here too, while having even weaker arguments.

Could we find a golden middle road, a common ground, please? With more reflective thinking and with less focus on right and wrong?

I agree that Remmelt can improve the message. And I believe he will do that.

I may not agree that we are going to die with 99% probability. At the same time I find that his current directions are definitely worthwhile of exploring.

I also definitely respect Paul. But mentioning his name here is mostly irrelevant for my reasoning or for taking your arguments seriously, simply because I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person's reasoning may occasionally mean that I disagree in particular points as well. In my experience, even the most respectful people are still people, which means they often think in messy ways and they are good just on average, not per instance of a thought line (which may mean they are poor thinkers 99% of the time, while having really valuable thoughts 1% of the time). I do not know the distribution for Paul, but definitely I would not be disappointed if he makes mistakes sometimes.

I think this part of Remmelt's response sums it up nicely: "When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself. You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything."

In my interpretation, black-and-white thinking is not "crankery". It is a normal and essential step in the development of cognition about a particular problem. Unfortunately. There is research about that in the field of developmental and cognitive psychology. Hopefully that applies to your own black-and-white thinking as well. Note that, unfortunately this development is topic specific, not universal.

In contrast, "crankery" is too strong a word for describing black-and-white thinking because it is a very judgemental word, a complete dismissal, and essentially an expression of unwillingness to understand, an insult, not just a disagreement about the degree of the claims. Is labelling someone's thoughts as "crankery" then also a form of crankery of its own? Paradoxical, isn't it?

Remmelt, 1y

BTW if anyone does want to get into the argument, Will Petillo’s Lenses of Control post is a good entry point.

It’s concise and correct – a difficult combination to achieve here.

Remmelt, 1y

I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person's reasoning may occasionally mean that I disagree in particular points as well. In my experience, even the most respectful people are still people, which means they often think in messy ways and they are good just on average

Right – this comes back to actually examining people’s reasoning.

Relying on the authority status of an insider (who dismissed the argument) or on your ‘crank vibe’ of the outsider (who made the argument) is not a reliable way of checking whether a particular argument is good.

IMO it’s also fine to say “Hey, I don’t have time to assess this argument, so for now I’m going to go with these priors that seemed to broadly kinda work in the past for filtering out poorly substantiated claims. But maybe someone else actually has a chance to go through the argument, I’ll keep an eye open.”

Yes, Remmelt has some extreme expressions…
I may not agree that we are going to die with 99% probability. At the same time I find that his current directions are definitely worthwhile of exploring.
…describing black-and-white thinking

I’m putting these quotes together because I want to check whether you’re tracking the epistemic process I’m proposing here.

Reasoning logically from premises is necessarily black-and-white thinking. Either the truth value is true or it is false.

A way to check the reasoning is to first consider the premises (in how they are described using defined terms, do they correspond comprehensively enough with how the world works?). And then check whether the logic follows from the premises through to each next argument step until you reach the conclusion.

Finally, when you reach the conclusion, and you could not find any soundness or validity issues, then that is the conclusion you have reasoned to.

If the conclusion is that it turns out impossible for some physical/informational system to meet several specified desiderata at the same time, this conclusion may sound extreme.

But if you (and many other people in the field who are inclined to disagree with the conclusion) cannot find any problem with the reasoning, the rational thing would be to accept it, and then consider how it applies to the real world.

Apparently, computer scientists hotly contested CAP theorem for a while. They wanted to build distributed data stores that could send messages that consistently represented new data entries, while the data was also made continuously available throughout the network, while the network was also tolerant to partitions. It turns out that you cannot have all three desiderata at once. Grumbling computer scientists just had to face the reality and turn to designing systems that would fail in the least bad way.

Now, assume there is a new theorem for which the research community in all their efforts have not managed to find logical inconsistencies nor empirical soundness issues. Based on this theorem, it turns out that you cannot both have machinery that keeps operating and learning autonomously across domains, and a control system that would contain the effects of that machinery enough to not feed back in ways that destabilise our environment outside the ranges we can survive in.

We need to make a decision then – what would be the least bad way to fail here? On one hand we could decide against designing increasingly autonomous machines, and lose out on the possibility of having machines running around doing things for us. On the other hand, we could have the machinery fail in about the worst way possible, which is to destroy all existing life on this planet.

Remmelt, 1y

claiming to have a full mathematical proof that safe AI is impossible,

I have never claimed that there is a mathematical proof. I have claimed that the researcher I work with has done their own reasoning in formal analytical notation (just not maths). Also, that based on his argument – which I probed and have explained here as carefully as I can – AGI cannot be controlled enough to stay safe, and actually converges on extinction.

That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.

I’m kinda pointing out the obvious here, but if the researcher was a crank, why would Anders be working with them?

claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory,

Nope, I haven’t claimed either of that.

The claim is that the argument is based on showing a limited extent of control (where control means keeping effects consistently in line with reference values).

The form of the reasoning there shares some underlying correspondences with how Gödel’s incompleteness theorems (concluding there is a limit to deriving a logical result within a formal axiomatic system) and Galois Theory (concluding that there is a limited scope of application of an algebraic tool) are reasoned through.

^– This is a pedagogical device. It helps researchers already acquainted with Gödel’s theorems or Galois Theory to understand roughly what kind of reasoning we’re talking about.

inexplicably formatted as a poem

Do you mean the fact that the researcher splits his sentences’ constituent parts into separate lines so that claims are more carefully parsable?

That is a format for analysis, not a poem format.

While certainly unconventional, it is not a reason to dismiss the rigour of someone’s analysis.

Paul Christiano read some of this and concluded "the entire scientific community would probably consider this writing to be crankery",

If you look at that exchange, I and the researcher I was working with were writing specific and carefully explained responses.

Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.

But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences

When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself.

You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything.

superficially agree with the conclusions from actually good arguments.

Unlike Anders – who examined the insufficient controllability part of the argument – you are not in a position to judge whether this argument is a good argument or not.

Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.

It is not enough to say ‘as a rationalist’. You got to walk the talk.

Thomas Kwa, 1y

I agree that with superficial observations, I can't conclusively demonstrate that something is devoid of intellectual value. However, the nonstandard use of words like "proof" is a strong negative signal on someone's work.
- If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way, because a basic strategy of anyone practicing pseudoscience is to spend lots of time writing something inscrutable that ends in some conclusion, then claim that no one can disprove it and anyone who thinks it's invalid is misunderstanding something inscrutable.
  - This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry's work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics, which uses words like "proof", "theorem", "conjugate", "axiom", and "omniscient" in a nonstandard sense, and also probably requires someone to have a background in metaphysics. I scanned the 134-page version, can't make any sense of it, and found several concrete statements that sound wrong. I read about 50 pages of various articles on the website and found them to be reasonably coherent but often oddly worded and misusing words like entropy, with the same content quality as a ~10 karma LW post but super overconfident.

That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.

Ok. To be clear I don't expect any Landry and Sandberg paper that comes out of this collaboration to be crankery. Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice's theorem which will be slightly relevant to AI but not super relevant because the premises are too strong, like the average item in Yampolskiy's list of impossibility proofs (I can give examples if you want of why these are not conclusive).

I'm not saying we should discard all reasoning by someone that claims an informal argument is a proof, but rather stop taking their claims of "proofs" at face value without seeing more solid arguments.

claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory,
Nope, I haven’t claimed either of that.

Fair enough. I can't verify this because Wayback Machine is having trouble displaying the relevant content though.

Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.

Paul expressed appropriate uncertainty. What is he supposed to do, say "I see several red flags, but I don't have time to read a 517-page metaphysics book, so I'm still radically uncertain whether this is a crank or the next Kurt Godel"?

Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.

When you say failures will "build up toward lethality at some unknown rate", why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.

Variants get evolutionarily selected for how they function across the various contexts they encounter over time. [...] The artificial population therefore converges on fulfilling their own expanding needs.

This is pretty similar to Hendrycks's natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life. He claims that there are various ways to counter evolutionary pressures, like "carefully designing AI agents’ intrinsic motivations, introducing constraints on their actions, and institutions that encourage cooperation". In the presence of ways to change incentives such that benign AI systems get higher fitness, I don't think you can get to 99% confidence. Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time, from Malthus to evolutionary psychology to the group selectionists.

Remmelt, 1y

I agree that with superficial observations, I can't conclusively demonstrate that something is devoid of intellectual value.

Thanks for recognising this, and for taking some time now to consider the argument.

However, the nonstandard use of words like "proof" is a strong negative signal on someone's work.

Yes, this made us move away from using the term “proof”, and instead write “formal reasoning”.

Most proofs nowadays are done using mathematical notation. So it is understandable that when people read “proof”, they automatically think “mathematical proof”.

Having said that, there are plenty of examples of proofs done in formal analytic notation that is not mathematical notation. See eg. formal verification practices in the software and hardware industries, or various branches of analytical philosophy.

If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way

Yes, much of the effort has been to translate argument parts in terms more standard for the alignment community.

What we cannot expect is that the formal reasoning is conceptually familiar and low-inferential distance. That would actually be surprising – why then has someone inside the community not already derived the result in the last 20 years?

The reasoning is going to be as complicated as it has to be to reason things through.

This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry's work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics

Cool that you took a look at his work. Forrest’s use of terms is meant to approximate everyday use of those terms, but the underlying philosophy is notoriously complicated.

Jim Rutt is an ex-chair of Santa Fe Institute who defaults to being skeptical of metaphysics proposals (funny quote he repeats: “when someone mentions metaphysics, I reach for my pistol”). But Jim ended up reading Forrest’s book and it passed his B.S. detector. So he invited Forrest over to his podcast for a three-part interview. Even if you listen to that though, I don’t expect you to immediately come away understanding the conceptual relations.

So here is a problem that you and I are both seeing:

There is this polymath who is clearly smart and recognised for some of his intellectual contributions (by interviewers like Rutt, or co-authors like Anders).
But what this polymath claims to be using as the most fundamental basis for his analysis would take too much time to work through.
So then if this polymath claims to have derived a proof by contradiction – concluding that long-term AGI safety is not possible – then it is intractable for alignment researchers to verify the reasoning using his formal notation and his conceptual framework. That would be asking for too much – if he’d have insisted on that, I agree that would have been a big red flag signalling crankery.
The obvious move then is for some people to work with the polymath to translate his reasoning to a basis of analysis that alignment researchers agree is a sound basis to reason from. And to translate to terms/concepts people are familiar with. Also, the chain of reasoning should not be so long that busy researchers never end up reading through, but also not so short that you either end up having to use abstractions readers are unfamiliar with, or open up unaddressed gaps in the reasoning. Etc.
The problem becomes finding people who are both willing and available to do that work. One person is probably not enough.

Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice's theorem

Both are useful theorems, which have specific conclusions that demonstrate that there are at least some limits to control.

(ie. the Good Regulator Theorem demonstrates a limit to a system’s capacity to model – or internally functionally represent – the statespace of some more complex super-system. Rice's theorem demonstrates a particular limit to having some general algorithm predict a behavioural property of other algorithms.)

The hashiness model is a tool meant for demonstrating under conservative assumptions – eg. of how far from cryptographically hashy the algorithm run through ‘AGI’ is, and how targetable human-safe ecosystem conditions are – that AGI would be uncontainable. With “uncontainable”, I mean that no available control system connected with/in AGI could constrain the possibility space of AGI’s output sequences enough over time such that the (cascading) environmental effects do not lethally disrupt the bodily functioning of humans.

Paul expressed appropriate uncertainty. What is he supposed to...say...?

I can see Paul tried expressing uncertainty by adding “probably” to his claim of how the entire scientific community (not sure what this means) would interpret that one essay.

To me, it seemed his commentary was missing some meta-uncertainty. Something like “I just did some light reading. Based on how it’s stated in this essay, I feel confident it makes no sense for me to engage further with the argument. However, maybe other researchers would find it valuable to spend more time engaging with the argument after going through this essay or some other presentation of the argument.”

~
That covers your comments re: communicating the argument in a form that can be verified by the community.

Let me cook dinner, and then respond to your last two comments to dig into the argument itself. EDIT: am writing now, will respond tomorrow.

[-]Remmelt1y*10

When you say failures will "build up toward lethality at some unknown rate", why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.

Let's take your example of semiconductor factories.

There are several ways to think about failures here. For one, we can talk about local failures in the production of the semiconductor chips. These especially will get corrected for.

A less common way to talk about factory failures is when workers working in the factories die or are physically incapacitated as a result, eg. because of chemical leaks or some robot hitting them. Usually when this happens, the factories can keep operating and existing. Just replace the expendable workers with new workers.

Of course, if too many workers die, other workers will decide to not work at those factories. Running the factories has to not be too damaging to the health of the internal human workers, in any of the many (indirect) ways that operations could turn out to be damaging.

The same goes for humans contributing to the surrounding infrastructure needed to maintain the existence of these sophisticated factories – all the building construction, all the machine parts, all the raw materials, all the needed energy supplies, and so on. If you try overseeing the relevant upstream and downstream transactions, it turns out that a non-tiny portion of the entire human economy is supporting the existence of these semiconductor factories one way or another. It took a modern industrial cross-continental economy to even make eg. TSMC's factories viable.

The human economy acts as a forcing function constraining what semiconductor factories can be. There are many, many ways to incapacitate complex multi-celled cooperative organisms like us. So the semiconductor factories that humans are maintaining today ended up being constrained to those that for the most part do not trigger those pathways downstream.

Some of that is because humans went through the effort of noticing errors explicitly and then correcting them, or designing automated systems to do likewise. But the invisible hand of the market considered broadly – as consisting of humans with skin in the game, making often intuitive choices – will actually just force semiconductor factories to be not too damaging to surrounding humans maintaining the needed infrastructure.

With AGI, you lose that forcing function.

Let's take AGI to be machinery that is autonomous enough to at least automate all the jobs needed to maintain its own existence. Then AGI is no longer dependent on an economy of working humans to maintain its own existence. AGI would be displacing the human economy – as a hypothetical example, AGI is what you'd get if those semiconductor factories producing microchips expanded to producing servers and robots using those microchips that in turn learn somehow to design themselves to operate the factories and all the factory-needed infrastructure autonomously.

Then there is one forcing function left: the machine operation of control mechanisms. Ie. mechanisms that detect, model, simulate, evaluate, and correct downstream effects in order to keep AGI safe.

The question becomes – Can we rely on only control mechanisms to keep AGI safe?
That question raises other questions.

E.g. as relevant to the hashiness model:
“Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that in their propagation as (cascading) environmental effects would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal sub-space or the non-lethal sub-space?”

This is pretty similar to Hendrycks's natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life.

There are some ways to expand Hendrycks’ argument to make it more comprehensive:

Consider evolutionary selection at the more fundamental level of physical component interactions. Ie. not just at the macro level of agents competing for resources, since this is a leaky abstraction that can easily fail to capture underlying vectors of change.
Consider not only selection of local variations (ie. mutations) that introduce new functionality, but also the selection of variants connecting up with surrounding units in ways that end up repurposing existing functionality.
Consider not only the concept of goals that are (able to be) explicitly tracked by the machinery itself, but also that of the implicit conditions needed by components which end up being selected for in expressions across the environment.

Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time

This is why we need to take extra care in modelling how evolution – as a kind of algorithm – would apply across the physical signalling pathways of AGI.

I might share a gears-level explanation that Forrest just gave in response to your comment.

[-]Remmelt1y10

Noticing no response here after we addressed superficial critiques and moved to discussing the actual argument.

For those few interested in questions raised above, Forrest wrote some responses: http://69.27.64.19/ai_alignment_1/d_241016_recap_gen.html

The claims made will feel unfamiliar and the reasoning paths too. I suggest (again) taking the time to consider what is meant. If a conclusion looks intuitively wrong from some AI Safety perspective, it may be valuable to explicitly consider the argumentation and premises behind that.

[-]gilch1y1924

The press release strikes me as poorly written. It's middle-school level. ChatGPT can write better than this. Exactly who is your (Stop AI's) audience here? "The press"?

Exclamation points are excessive. "Heart's content"? You're not in this for "contentment". The "you can't prove it, therefore I'm right" argument is weak. The second page is worse. "Toxic conditions"? I think I know what you meant, but you didn't connect it well enough for a general audience. "accelerate our mass extinction until we are all dead"? I'm pretty sure the "all dead" part has to come before the "extinction". "(and abusing his sister)"? OK, there's enough in the public record to believe that Sam is not (ahem) "consistently candid", but I'm at under 50% about the sister abuse even then on priors. Do you want to get sued for libel on top of your jail time? Is that a good strategy?

I admire your courage and hope you make an impact, but if you're willing to pay these heavy costs, of getting arrested, and facing jail time etc., then please try to win! Your necessity defense is an interesting idea, but if this is the best you can do, it will fail. If you can afford to hire a good defense attorney, you can afford a better writer! Tell me how this move is 4-D chess and not just a blunder.

[-]momom21y1512

I do not find this post reassuring about your approach.

Your plan is unsound; instead of a succession of events which need to go your way, I think you should aim for incremental marginal gains. There is no cost-effectiveness analysis, and the implicit theory of change is lacunar.
Your press release is unreadable (poor formatting), and sounds like a conspiracy theory (catchy punchlines, ALL CAPS DEMANDS, alarmist vocabulary and unsubstantiated claims); I think it's likely to discredit safety movements and raise attention in counterproductive ways.
The figures you quote are false (the median from AI Impacts is 5%) or knowingly misleading (the numbers from Existential risk from AI survey are far from robust and as you note, suffer from selection bias), so I think it's fair to call them lies.
Your explanations for what you say in the press release sometimes don't make sense! You conflate AGI and self-modifying systems, your explanation for "eventually" does not match the sentence.
Your arguments are based on wrong premises - it's easy to check that your facts such as "they are not following the scientific method" are plain wrong. It sounds like you're trying to smear OpenAI and Sam Altman as much as possible without consideration for whether what you're saying is true.

I am appalled to see this was not downvoted into oblivion! My best guess is that people feel that there are not enough efforts going towards stopping AI and did not read the post and the press release to check that you have good reason motivating your actions.

[-]Remmelt1y54

Thanks, as far as I can tell, this is a mix of critiques of strategic approach (fair enough), of communication style (fair enough), and partial misunderstandings of the technical arguments.

instead of a succession of events which need to go your way, I think you should aim for incremental marginal gains. There is no cost-effectiveness analysis…

I agree that we should not get hung up on a succession of events to go a certain way. IMO, we need to get good at simultaneously broadcasting our concerns in a way that’s relatable to other concerned communities, and opportunistically look for new collaborations there.

At the same time, local organisers often build up an activist movement by ratcheting up the number of people joining the events and the pressure they put on demanding institutions to make changes. These are basic cheap civil disobedience tactics that have worked for many movements (climate, civil rights, feminist, changing a ruling party, etc). I prefer to go with what has worked, instead of trying to reinvent the wheel based on fragile cost-effectiveness estimates. But if you can think of concrete alternative activities that also have a track record of working, I’m curious to hear.

Your press release is unreadable (poor formatting), and sounds like a conspiracy theory (catchy punchlines, ALL CAPS DEMANDS, alarmist vocabulary and unsubstantiated claims)

I think this is broadly fair. The turnaround time of this press release was short, and I think we should improve on the formatting and give more nuanced explanations next time.

Keep in mind the text is not aimed at you but people more broadly who are feeling concerned and we want to encourage to act. A press release is not a paper. Our press release is more like a call to action – there is a reason to add punchy lines here.

The figures you quote are false (the median from AI Impacts is 5%) or knowingly misleading (the numbers from Existential risk from AI survey are far from robust and as you note, suffer from selection bias)

Let me recheck the AI Impacts paper. Maybe I was ditzy before, in which case, my bad.

As you saw from my commentary above, I was skeptical about using that range of figures in the first place.

You conflate AGI and self-modifying systems

Not sure what you see as the conflation?

AGI, as an autonomous system that would automate many jobs, would necessarily be self-modifying – even in the limited sense of adjusting its internal code/weights on the basis of new inputs.

Your arguments are invalid

The reasoning shared in the press release by my colleague was rather loose, so I more rigorously explained a related set of arguments in this post.

As to whether the arguments from points 1 to 6 above are invalid, I haven’t seen you point out inconsistencies in the logic yet, so as it stands you seem to be sharing a personal opinion.

I am appalled to see this was not downvoted into oblivion!

Should I comment on the level of nuance in your writing here? :P

[-]Remmelt1y20

Let me recheck the AI Impacts paper.

I definitely made a mistake in quickly checking that number shared by a colleague.

The 2023 AI Impacts survey shows a mean risk of 14.4% for the question “What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species within the next 100 years?”.

Whereas the other, smaller-sample survey gives a median estimate of 30%.

I already thought using those two figures as a range did not make sense, but putting a mean and a median in the same range is even more wrong.
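To illustrate why a mean and a median should not be placed in the same range, here is a minimal sketch. The response values are hypothetical and chosen only to show a skewed distribution; they are not data from either survey.

```python
from statistics import mean, median

# Hypothetical survey responses (extinction-risk estimates in %).
# These are NOT the real survey answers; they only illustrate skew.
responses = [0, 1, 2, 5, 5, 10, 20, 50, 90]

survey_mean = mean(responses)      # pulled upward by a few high answers
survey_median = median(responses)  # the middle answer, robust to outliers

print(f"mean   = {survey_mean:.1f}%")
print(f"median = {survey_median:.1f}%")
```

With skewed answers like these, the mean sits well above the median, so the two statistics describe different things and comparing one survey's mean with another's median says little about how the surveys differ.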

Thanks for pointing this out! Let me add a correcting comment above.

[-]Jeremy Gillen1y116

In practice, engineers know that complex architectures interacting with the surrounding world end up having functional failures (because of unexpected interactive effects, or noisy interference). With AGI, we are talking about an architecture here that would be replacing all our jobs and move to managing conditions across our environment. If AGI continues to persist in some form over time, failures will occur and build up toward lethality at some unknown rate. Over a long enough period, this repeated potential for uncontrolled failures pushes the risk of human extinction above 99%.

This part is invalid, I think.

My understanding of this argument is: 1) There is an extremely powerful agent, so powerful that if it wanted to it could cause human extinction. 2) There is some risk of its goal-related systems breaking, and this risk doesn't rapidly decrease over time. Therefore the risk adds up over time and converges toward 1.

This argument doesn't work because the two premises won't hold. For 2) An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure. For 1) Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.

So the risk of everyone dying should only come from a relatively short period after an agent (or agents) become powerful enough that killing everyone is an ~easy option.
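The disagreement above turns on whether the per-period risk stays roughly constant or shrinks over time. A minimal sketch of that arithmetic follows, assuming independent years; `EPSILON` and `DECAY` are hypothetical values picked for illustration, not estimates from the post.

```python
def cumulative_risk(yearly_risks):
    """Probability of at least one failure, assuming independent years."""
    survival = 1.0
    for p in yearly_risks:
        survival *= 1.0 - p
    return 1.0 - survival

YEARS = 1000
EPSILON = 0.01  # hypothetical yearly failure probability
DECAY = 0.9     # hypothetical yearly shrink factor for the risk

# Case 1: the yearly risk never decreases.
constant = cumulative_risk([EPSILON] * YEARS)

# Case 2: the yearly risk shrinks geometrically (the agent keeps reducing it).
decaying = cumulative_risk([EPSILON * DECAY**t for t in range(YEARS)])

print(f"constant yearly risk: {constant:.5f}")
print(f"decaying yearly risk: {decaying:.5f}")
```

Under a constant yearly risk the cumulative probability approaches 1, while under a geometrically shrinking risk it stays bounded by the sum of the yearly risks. Which case applies is exactly what the two premises dispute.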

[-]Remmelt1y10

There is some risk of its goal-related systems breaking

Ah, that’s actually not the argument.

Could you try reading points 1-5. again?

[-]Jeremy Gillen1y30

I've reread and my understanding of point 3 remains the same. I wasn't trying to summarize points 1-5, to be clear. And by "goal-related systems" I just meant whatever is keeping track of the outcomes being optimized for.

Perhaps you could point me to my misunderstanding?

[-]Remmelt1y10

Appreciating your openness.

(Just making dinner – will get back to this when I’m behind my laptop in around an hour).

[-]Remmelt1y0-1

An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.
…
by "goal-related systems" I just meant whatever is keeping track of the outcomes being optimized for.

So the argument for 3. is that just by AGI continuing to operate and maintain its components as adapted to a changing environment, the machinery can accidentally end up causing destabilising effects that were untracked or otherwise insufficiently corrected for.

You could call this a failure of the AGI’s goal-related systems if by that you mean that the machinery failed to control its external effects in line with internally represented goals.

But this would be a problem with the control process itself.

An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.

Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.

Any of the machinery’s external downstream effects that its internal control process cannot track (ie. detect, model, simulate, and identify as a “goal-related failure”), that process cannot correct for.

For further explanation, please see links under point 4.

Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.

The problem here is that (a) we are talking about not just a complicated machine product but self-modifying machinery and (b) at the scale this machinery would be operating at it cannot account for most of the potential human-lethal failures that could result.

For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures.

E.g. How can AGI code predict how its future code learned from unknown inputs will function in processing subsequent unknown inputs? What if future inputs are changing as a result of effects propagated across the larger environment from previous AGI outputs? And those outputs were changing as a result of previous new code that was processing signals in connection with other code running across the machinery? And so on.

For (b), engineering decentralised redundancy can help especially at the microscale.

E.g. correcting for bit errors.
But what does it mean to correct for failures at the level of local software (bugs, viruses, etc)? What does it mean to correct for failures across some decentralised server network? What does it mean to correct for failures at the level of an entire machine ecosystem (which AGI effectively becomes)?
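For contrast with the larger-scale questions, the microscale case of correcting bit errors can be sketched with a simple repetition code. This is only an illustration under assumed names (`encode`, `decode`) and an assumed redundancy of three copies; real systems use more efficient error-correcting codes.

```python
from collections import Counter

def encode(bits, copies=3):
    """Repeat every bit `copies` times (use an odd number to avoid ties)."""
    return [b for b in bits for _ in range(copies)]

def decode(received, copies=3):
    """Recover each bit by majority vote over its block of copies."""
    decoded = []
    for i in range(0, len(received), copies):
        block = received[i:i + copies]
        decoded.append(Counter(block).most_common(1)[0][0])
    return decoded

message = [1, 0, 1, 1]
sent = encode(message)

# Simulate noise: flip one bit in the first block and one in the last block.
corrupted = sent.copy()
corrupted[1] ^= 1
corrupted[9] ^= 1

recovered = decode(corrupted)
print(recovered == message)
```

The vote works here because the failure mode (a flipped bit) and the correct value are both well defined and locally checkable. The sketch says nothing about whether comparable checks exist at the scale of software, networks, or a machine ecosystem.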

In scaling up the connected components, this exponentially increases their degrees of freedom of interaction. And as those components change in feedback with surrounding contexts of the environment (and have to in order for AGI to autonomously adapt), an increasing portion of the possible human-lethal failures cannot be adequately controlled for by the system itself.

[-]Jeremy Gillen1y31

You could call this a failure of the AGI’s goal-related systems if by that you mean that the machinery failed to control its external effects in line with internally represented goals.
But this would be a problem with the control process itself.

So it's the AI being incompetent?

Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.

Yeah, I think that would be a good response to my argument against premise 2). I've had a quick look at the list of theorems in the paper. I don't know most of them, but the ones I do know don't seem to support the point you're making. So I don't buy it. Could you walk me through how one of these theorems is relevant to capping self-improvement of reliability?

For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures.

You don't have to simulate something to reason about it.

E.g. How can AGI code predict how its future code learned from unknown inputs will function in processing subsequent unknown inputs?

Garrabrant induction shows one way of doing self-referential reasoning.

But what does it mean to correct for failures at the level of local software (bugs, viruses, etc)? What does it mean to correct for failures across some decentralised server network? What does it mean to correct for failures at the level of an entire machine ecosystem (which AGI effectively becomes)?

As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can't destroy the world/country, as a crazy dictator would.

[-]Remmelt1y10

So it's the AI being incompetent?

Yes, but in the sense that there are limits to the AGI's capacity to sense, model, simulate, evaluate, and correct own component effects propagating through a larger environment.

You don't have to simulate something to reason about it.

If you can't simulate (and therefore predict) that a failure mode that by default is likely to happen would happen, then you cannot counterfactually act to prevent the failure mode.

Could you walk me through how one of these theorems is relevant to capping self-improvement of reliability?

Maybe take a look at the hashiness model of AGI uncontainability. That's an elegant way of representing the problem (instead of pointing at lots of examples of theorems that show limits to control).

This is not put into mathematical notation yet though. Anders Sandberg is working on it, but also somewhat distracted. Would value your contribution/thinking here, but I also get if you don't want to read through the long transcripts of explanation at this stage. See project here.

Anders' summary:
"A key issue is the thesis that AGI will be uncontrollable in the sense that there is no control mechanism that can guarantee aligned behavior since the more complex and abstract the target behavior is the amount of resources and forcing ability needed become unattainable.

In order to analyse this better a sufficiently general toy model is needed for how controllable systems of different complexity can be, that ideally can be analysed rigorously.

One such model is to study families of binary functions parametrized by their circuit complexity and their "hashiness" (how much they mix information) as an analog for the AGI and the alignment model, and the limits to finding predicates that can keep the alignment system making the AGI analog producing a desired output."

Garrabrant induction shows one way of doing self-referential reasoning.

We're talking about learning from inputs received from a more complex environment (through which AGI outputs also propagate as changed effects of which some are received as inputs).

Does Garrabrant take that into account in his self-referential reasoning?

As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can't destroy the world/country, as a crazy dictator would.

A human democracy is composed out of humans with similar needs. This turns out to be an essential difference.

[-]Jeremy Gillen1y82

How about I assume there is some epsilon such that the probability of an agent going off the rails is greater than epsilon in any given year. Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others? This should reduce the risk to some fraction of epsilon, right?

(I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%, because that part seems wrong for quite simple reasons).
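The splitting idea can be put in numbers with a small sketch. It assumes that agents fail independently with the same yearly probability, and that a catastrophe requires a majority to fail in the same year. Both are modelling assumptions made here for illustration, and the function name and values are hypothetical.

```python
from math import comb

def majority_failure_prob(n_agents, epsilon):
    """Probability that more than half of n independent agents fail in one year."""
    threshold = n_agents // 2 + 1
    return sum(
        comb(n_agents, k) * epsilon**k * (1 - epsilon) ** (n_agents - k)
        for k in range(threshold, n_agents + 1)
    )

EPSILON = 0.01  # hypothetical yearly failure probability per agent

single = majority_failure_prob(1, EPSILON)  # one agent: just epsilon
split = majority_failure_prob(5, EPSILON)   # five independent agents

print(f"single agent: {single:.6f}")
print(f"five agents:  {split:.8f}")
```

Under these assumptions the yearly risk drops by several orders of magnitude. The result depends entirely on the independence assumption: if failures are correlated, the binomial calculation no longer applies.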

[-]Remmelt1y*10

How about I assume there is some epsilon such that the probability of an agent going off the rails

Got it. So we are both assuming that there would be some accumulative failure rate [per point 3.].

Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others?

I tried to adopt this ~uncorrelated agents framing, and then argue from within that. But I ran up against some problems with this framing:

It assumes there are stable boundaries between "agents" that allow us to mark them as separate entities. This kinda works for us as physically bounded and communication-bottlenecked humans. But in practice it wouldn't really work to define "agent" separations within a larger machine network maintaining its own existence in the environment.
(Also, it is not clear to me how failures of those defined "agent" subsets would necessarily be sufficiently uncorrelated – as an example, if the failure involves one subset hijacking the functioning of another subset, their failures become correlated.)
It assumes that if any (physical or functional) subset of this adaptive machinery happens to gain any edge in influencing the distributed flows of atoms and energy back towards own growth, that the other machinery subsets can robustly "control" for that.
It assumes a macroscale-explanation of physical processes that build up from the microscale. Agreed that the concept of agents owning and directing the allocation of "resources" is a useful abstraction, but it also involves holding a leaky representation of what's going on. Any argument for control using that representation can turn out not to capture crucial aspects.
It raises the question what "off-the-rails" means here. This gets us into the hashiness model:
Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that in their propagation as (cascading) environmental effects would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal sub-space or the non-lethal sub-space?

→ Do those problems make sense to you as stated? Do you notice anything missing there?

To sum it up, you and I are still talking about a control system [per point 4.]:

However you define the autonomous "agents", they are still running through code running across connected hardware.
There are limits to the capacity of this aggregate machinery to sense, model, simulate, evaluate, and correct own component effects propagating through a larger environment.

I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%

I'm also for now leaving aside substrate-needs convergence [point 5]:

That the entire population of nested/connected machine components would be pulled toward a human-lethal attractor state.

[-]Jeremy Gillen1y10

I appreciate that you tried. If words are failing us to this extent, I'm going to give up.

[-]WillPetillo1y75

There are some writing issues here that make it difficult to evaluate the ideas presented purely on their merits. In particular, the argument for 99% extinction is given a lot of space relative to the post as a whole, where it should really be a bullet point that links to where this case is made elsewhere (or if it is not made adequately elsewhere, as a new post entirely). Meanwhile, the value of disruptive protest is left to the reader to determine.

As I understand the issue, the case for barricading AI rests on:
1. Safety doesn't happen by default
a) AI labs are not on track to achieve "alignment" as commonly considered by safety researchers.
b) Those standards may be over-optimistic--link to Substrate Needs Convergence, arguments by Yampolskiy, etc.
c) Even if the conception of safety assumed by the AI labs is right, it is not clear that their utopic vision for the future is actually good.
2. Advocacy, not just technical work, is needed for AI safety
a) See above
b) Market incentives are misaligned
c) Policy (and culture) matters
3. Disruptive actions, not just working within civil channels, are needed for effective advocacy.
a) Ways that working entirely within ordinary democratic channels can get delayed or derailed
b) Benefits of disruptive actions, separate from or in synergy with other forms of advocacy
c) Plan for how StopAI's specific choice of disruptive actions effectively plays to the above benefits
d) Moral arguments, if not already implied

[-]Remmelt1y30

As I understand the issue, the case for barricading AI rests on:

Great list! Basically agreeing with the claims under 1. and the structure of what needs to be covered under 2.

Meanwhile, the value of disruptive protest is left to the reader to determine.

You're right. Usually when people hear about a new organisation on the forum, they expect some long write-up of the theory of change and the considerations around what to prioritise.

I don't think I have time right now for writing a neat public write-up. This is just me being realistic – Sam and I are both swamped in terms of handling our work and living situations.

So the best I can do is point to examples where civil disobedience has worked (eg. Just Stop Oil demands, Children's March) and then discuss our particular situation (how the situation is similar and different, who are important stakeholders, what are our demands, what are possible effective tactics in this context).

In particular, the argument for 99% extinction is given a lot of space relative to the post as a whole,

Ha, fair enough. The more rigorously I tried to write out the explanation, the more space it took.

[-]gilch1y32

I mean, yes, hence my comment about ChatGPT writing better than this, but if word gets out that Stop AI is literally using the product of the company they're protesting in their protests, it could come off as hypocrisy.

I personally don't have a problem with it, but I understand the situation at a deeper level than the general public. It could be a wise strategic move to hire a human writer, or even ask for competent volunteer writers, including those not willing to join the protests themselves, although I can see budget or timing being a factor in the decision.

Or they could just use one of the bigger Llamas on their own hardware and try to not get caught. Seems like an unnecessary risk though.

[-]Remmelt1y30

No worries. We won't be using ChatGPT or any other model to generate our texts.

[-]Prometheus1y10

sigh Protests last year, barricading this year, I've already mentally prepared myself for someone next year throwing soup at a human-generated painting while shouting about AI. This is the kind of stuff that makes no one in the Valley want to associate with you. It makes the cause look low-status, unintelligent, lazy, and uninformed.

[-]WillPetillo1y1910

Just because the average person disapproves of a protest tactic doesn't mean that the tactic didn't work. See Roger Hallam's "Designing the Revolution" series for the thought process underlying the soup-throwing protests. Reasonable people may disagree (I disagree with quite a few things he says), but if you don't know the arguments, any objection is going to miss the point. The series is very long, so here's a tl/dr:

- If the public response is: "I'm all for the cause those protestors are advocating, but I can't stand their methods" notice that the first half of this statement was approval of the only thing that matters--approval of the cause itself, as separate from the methods, which brought the former to mind.
- The fact that only a small minority of the audience approves of the protest action is in itself a good thing, because this efficiently filters for people who are inclined to join the activist movement--especially on the hard-core "front lines"--whereas passive "supporters" can be more trouble than they're worth. These high-value supporters don't need to be convinced that the cause is right; they need to be convinced that the organization is the "real deal" and can actually get things done. In short, it's niche marketing.
- The disruptive protest model assumes that the democratic system is insufficient, ineffective, or corrupted, such that simply convincing the (passive) center majority is not likely to translate into meaningful policy change. The model instead relies on putting the powers-that-be into a bind where they have to either ignore you (in which case you keep growing with impunity) or over-react (in which case you leverage public sympathy to grow faster). Again, it isn't important how sympathetic the protestors are, only that the reaction against them is comparatively worse, from the perspective of the niche audience that matters.
- The ultimate purpose of this recursive growth model is to create a power bloc that forces changes that wouldn't otherwise occur on any reasonable timeline through ordinary democratic means (like voting) alone.
- Hallam presents incremental and disruptive advocacy as in opposition. This is where I most strongly disagree with his thesis. IMO: moderates get results, but operate within the boundaries defined by extremists, so they need to learn how to work together.

In short, when you say an action makes a cause "look low status", it is important to ask "to whom?" and "is that segment of the audience relevant to my context?"

[-]Remmelt1y30

efficiently filters for people who are inclined to join the activist movement--especially on the hard-core "front lines"--whereas passive "supporters" can be more trouble than they're worth.

I had not considered how our messaging is filtering out non-committed supporters. Interesting!

[-]gilch1y10

Protesters are expected to be at least a little annoying. Strategic unpopularity might be a price worth paying if it gets results. Sometimes extremists shift the Overton Window.
