Respect for doing this.
I strongly wish you would not tie StopAI to the claim that extinction is >99% likely. It means that even your natural supporters in PauseAI will have to say "yes I broadly agree with them but disagree with their claims about extinction being certain."
I would also echo the feedback here. There's no reason to write in the same style as cranks.
It's not just the writing that sounds like a crank. Core arguments that Remmelt endorses are AFAIK considered crankery by the community; with all the classic signs like
Paul Christiano read some of this and concluded "the entire scientific community would probably consider this writing to be crankery", which seems about accurate to me.
Now I don't like or intend to make personal attacks. But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences, even when the conclusions of cranks and their collaborators superficially agree with the conclusions from actually good arguments.
claiming to have a full mathematical proof that safe AI is impossible,
I have never claimed that there is a mathematical proof. I have claimed that the researcher I work with has done their own reasoning in formal analytical notation (just not maths). Also, that based on his argument – which I probed and have explained here as carefully as I can – AGI cannot be controlled enough to stay safe, and actually converges on extinction.
That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.
I’m kinda pointing out the obvious here, but if the researcher was a crank, why would Anders be working with them?
claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory,
Nope, I haven’t claimed either of that.
The claim is that the argument is based on showing a limited extent of control (where control means keeping effects consistently in line with reference values).
The form of the reasoning there shares some underlying correspondences with how Gödel’s incompleteness theorems (concluding there is a limit to deriving a logical result within a formal axiomatic system) and Galois Theory (concluding that there is a limited scope of application of an algebraic tool) are reasoned through.
^– This is a pedagogical device. It helps researchers already acquainted with Gödel’s theorems or Galois Theory to understand roughly what kind of reasoning we’re talking about.
inexplicably formatted as a poem
Do you mean the fact that the researcher splits his sentences’ constituent parts into separate lines so that claims are more carefully parsable?
That is a format for analysis, not a poem format.
While certainly unconventional, it is not a reason to dismiss the rigour of someone’s analysis.
Paul Christiano read some of this and concluded "the entire scientific community would probably consider this writing to be crankery",
If you look at that exchange, I and the researcher I was working with were writing specific and carefully explained responses.
Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.
But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences
When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself.
You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything.
superficially agree with the conclusions from actually good arguments.
Unlike Anders – who examined the insufficient controllability part of the argument – you are not in a position to judge whether this argument is a good argument or not.
Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.
It is not enough to say ‘as a rationalist’. You got to walk the talk.
That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.
Ok. To be clear I don't expect any Landry and Sandberg paper that comes out of this collaboration to be crankery. Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice's theorem which will be slightly relevant to AI but not super relevant because the premises are too strong, like the average item in Yampolskiy's list of impossibility proofs (I can give examples if you want of why these are not conclusive).
I'm not saying we should discard all reasoning by someone that claims an informal argument is a proof, but rather stop taking their claims of "proofs" at face value without seeing more solid arguments.
claiming the "proof" uses mathematical arguments from Godel's theorem, Galois Theory,
Nope, I haven’t claimed either of that.
Fair enough. I can't verify this, though, because the Wayback Machine is having trouble displaying the relevant content.
Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.
Paul expressed appropriate uncertainty. What is he supposed to do, say "I see several red flags, but I don't have time to read a 517-page metaphysics book, so I'm still radically uncertain whether this is a crank or the next Kurt Godel"?
Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.
When you say failures will "build up toward lethality at some unknown rate", why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.
Variants get evolutionarily selected for how they function across the various contexts they encounter over time. [...] The artificial population therefore converges on fulfilling their own expanding needs.
This is pretty similar to Hendrycks's natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life. He claims that there are various ways to counter evolutionary pressures, like "carefully designing AI agents’ intrinsic motivations, introducing constraints on their actions, and institutions that encourage cooperation". In the presence of ways to change incentives such that benign AI systems get higher fitness, I don't think you can get to 99% confidence. Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time, from Malthus to evolutionary psychology to the group selectionists.
I agree that with superficial observations, I can't conclusively demonstrate that something is devoid of intellectual value.
Thanks for recognising this, and for taking some time now to consider the argument.
However, the nonstandard use of words like "proof" is a strong negative signal on someone's work.
Yes, this made us move away from using the term “proof”, and instead write “formal reasoning”.
Most proofs nowadays are done using mathematical notation. So it is understandable that when people read “proof”, they automatically think “mathematical proof”.
Having said that, there are plenty of examples of proofs done in formal analytic notation that is not mathematical notation. See eg. formal verification practices in the software and hardware industries, or various branches of analytical philosophy.
If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way
Yes, much of the effort has been to translate argument parts in terms more standard for the alignment community.
What we cannot expect is that the formal reasoning is conceptually familiar and low-inferential distance. That would actually be surprising – why then has someone inside the community not already derived the result in the last 20 years?
The reasoning is going to be as complicated as it has to be to reason things through.
This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry's work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics
Cool that you took a look at his work. Forrest’s use of terms is meant to approximate everyday use of those terms, but the underlying philosophy is notoriously complicated.
Jim Rutt is an ex-chair of Santa Fe Institute who defaults to being skeptical of metaphysics proposals (funny quote he repeats: “when someone mentions metaphysics, I reach for my pistol”). But Jim ended up reading Forrest’s book and it passed his B.S. detector. So he invited Forrest over to his podcast for a three-part interview. Even if you listen to that though, I don’t expect you to immediately come away understanding the conceptual relations.
So here is a problem that you and I are both seeing:
Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice's theorem
Both are useful theorems, which have specific conclusions that demonstrate that there are at least some limits to control.
(ie. the Good Regulator Theorem demonstrates a limit to a system’s capacity to model – or internally functionally represent – the statespace of some more complex super-system. Rice's theorem demonstrates a particular limit to having some general algorithm predict a behavioural property of other algorithms.)
The hashiness model is a tool meant for demonstrating under conservative assumptions – eg. of how far from cryptographically hashy the algorithm run through ‘AGI’ is, and how targetable human-safe ecosystem conditions are – that AGI would be uncontainable. With “uncontainable”, I mean that no available control system connected with/in AGI could constrain the possibility space of AGI’s output sequences enough over time such that the (cascading) environmental effects do not lethally disrupt the bodily functioning of humans.
Paul expressed appropriate uncertainty. What is he supposed to...say...?
I can see Paul tried expressing uncertainty by adding “probably” to his claim of how the entire scientific community (not sure what this means) would interpret that one essay.
To me, it seemed his commentary was missing some meta-uncertainty. Something like “I just did some light reading. Based on how it’s stated in this essay, I feel confident it makes no sense for me to engage further with the argument. However, maybe other researchers would find it valuable to spend more time engaging with the argument after going through this essay or some other presentation of the argument.”
~
That covers your comments re: communicating the argument in a form that can be verified by the community.
Let me cook dinner, and then respond to your last two comments to dig into the argument itself. EDIT: am writing now, will respond tomorrow.
When you say failures will "build up toward lethality at some unknown rate", why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.
Let's take your example of semiconductor factories.
There are several ways to think about failures here. For one, we can talk about local failures in the production of the semiconductor chips. These especially will get corrected for.
A less common way to talk about factory failures is when workers working in the factories die or are physically incapacitated as a result, eg. because of chemical leaks or some robot hitting them. Usually when this happens, the factories can keep operating and existing. Just replace the expendable workers with new workers.
Of course, if too many workers die, other workers will decide to not work at those factories. Running the factories has to not be too damaging to the health of the internal human workers, in any of the many (indirect) ways that operations could turn out to be damaging.
The same goes for humans contributing to the surrounding infrastructure needed to maintain the existence of these sophisticated factories – all the building construction, all the machine parts, all the raw materials, all the needed energy supplies, and so on. If you try overseeing the relevant upstream and downstream transactions, it turns out that a non-tiny portion of the entire human economy is supporting the existence of these semiconductor factories one way or another. It took a modern industrial cross-continental economy to even make eg. TSMC's factories viable.
The human economy acts as a forcing function constraining what semiconductor factories can be. There are many, many ways to incapacitate complex multi-celled cooperative organisms like us. So the semiconductor factories that humans are maintaining today ended up being constrained to those that for the most part do not trigger those pathways downstream.
Some of that is because humans went through the effort of noticing errors explicitly and then correcting them, or designing automated systems to do likewise. But the invisible hand of the market considered broadly – as consisting of humans with skin in the game, making often intuitive choices – will actually just force semiconductor factories to be not too damaging to surrounding humans maintaining the needed infrastructure.
With AGI, you lose that forcing function.
Let's take AGI to be machinery that is autonomous enough to at least automate all the jobs needed to maintain its own existence. Then AGI is no longer dependent on an economy of working humans to maintain its own existence. AGI would be displacing the human economy – as a hypothetical example, AGI is what you'd get if those semiconductor factories producing microchips expanded to producing servers and robots using those microchips that in turn learn somehow to design themselves to operate the factories and all the factory-needed infrastructure autonomously.
Then there is one forcing function left: the machine operation of control mechanisms. Ie. mechanisms that detect, model, simulate, evaluate, and correct downstream effects in order to keep AGI safe.
The question becomes – Can we rely on only control mechanisms to keep AGI safe?
That question raises other questions.
E.g. as relevant to the hashiness model:
“Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that in their propagation as (cascading) environmental effects would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal sub-space or the non-lethal sub-space?”
This is pretty similar to Hendrycks's natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life.
There are some ways to expand Hendrycks’ argument to make it more comprehensive:
Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time
This is why we need to take extra care in modelling how evolution – as a kind of algorithm – would apply across the physical signalling pathways of AGI.
I might share a gears-level explanation that Forrest just gave in response to your comment.
Noticing no response here after we addressed superficial critiques and moved to discussing the actual argument.
For those few interested in questions raised above, Forrest wrote some responses: http://69.27.64.19/ai_alignment_1/d_241016_recap_gen.html
The claims made will feel unfamiliar and the reasoning paths too. I suggest (again) taking the time to consider what is meant. If a conclusion looks intuitively wrong from some AI Safety perspective, it may be valuable to explicitly consider the argumentation and premises behind that.
I think your own message is also too extreme to be rational. So it seems to me that you are fighting fire with fire. Yes, Remmelt has some extreme expressions, but you definitely have extreme expressions here too, while having even weaker arguments.
Could we find a golden middle road, a common ground, please? With more reflective thinking and with less focus on right and wrong? (Regardless of the dismissive-judgemental title of this forum :P)
I agree that Remmelt can improve the message. And I believe he will do that.
I may not agree that we are going to die with 99% probability. At the same time I find that his current directions are definitely worthwhile of exploring.
I also definitely respect Paul. But mentioning his name here is mostly irrelevant for my reasoning or for taking your arguments seriously, simply because I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person's reasoning may occasionally mean that I disagree in particular points as well. In my experience, even the most respectful people are still people, which means they often think in messy ways and they are good just on average, not per instance of a thought line (which may mean they are poor thinkers 99% of the time, while having really valuable thoughts 1% of the time). I do not know the distribution for Paul, but definitely I would not be disappointed if he makes mistakes sometimes.
I think this part of Remmelt's response sums it up nicely: "When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself. You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything."
In my interpretation, black-and-white thinking is not "crankery". It is a normal and essential step in the development of cognition about a particular problem. Unfortunately. There is research about that in the field of developmental and cognitive psychology. Hopefully that applies to your own black-and-white thinking as well. Note that, unfortunately this development is topic specific, not universal.
In contrast, "crankery" is too strong a word for describing black-and-white thinking because it is a very judgemental word, a complete dismissal, and essentially an expression of unwillingness to understand, an insult, not just a disagreement about a degree of the claims. Is labelling someone's thoughts as "crankery" also a form of crankery of its own then? Paradoxical, isn't it?
BTW if anyone does want to get into the argument, Will Petillo’s Lenses of Control post is a good entry point.
It’s concise and correct – a difficult combination to achieve here.
I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person's reasoning may occasionally mean that I disagree in particular points as well. In my experience, even the most respectful people are still people, which means they often think in messy ways and they are good just on average
Right – this comes back to actually examining people’s reasoning.
Relying on the authority status of an insider (who dismissed the argument) or on your ‘crank vibe’ of the outsider (who made the argument) is not a reliable way of checking whether a particular argument is good.
IMO it’s also fine to say “Hey, I don’t have time to assess this argument, so for now I’m going to go with these priors that seemed to broadly kinda work in the past for filtering out poorly substantiated claims. But maybe someone else actually has a chance to go through the argument, I’ll keep an eye open.”
Yes, Remmelt has some extreme expressions…
I may not agree that we are going to die with 99% probability. At the same time I find that his current directions are definitely worthwhile of exploring.
…describing black-and-white thinking
I’m putting these quotes together because I want to check whether you’re tracking the epistemic process I’m proposing here.
Reasoning logically from premises is necessarily black-and-white thinking. Either the truth value is true or it is false.
A way to check the reasoning is to first consider the premises (in how they are described using defined terms, do they correspond comprehensively enough with how the world works?). And then check whether the logic follows from the premises through to each next argument step until you reach the conclusion.
Finally, when you reach the conclusion, and you could not find any soundness or validity issues, then that is the conclusion you have reasoned to.
If the conclusion is that it turns out impossible for some physical/informational system to meet several specified desiderata at the same time, this conclusion may sound extreme.
But if you (and many other people in the field who are inclined to disagree with the conclusion) cannot find any problem with the reasoning, the rational thing would be to accept it, and then consider how it applies to the real world.
Apparently, computer scientists hotly contested the CAP theorem for a while. They wanted to build distributed data stores that could send messages that consistently represented new data entries, while the data was also made continuously available throughout the network, while the network was also tolerant to partitions. It turns out that you cannot have all three desiderata at once. Grumbling computer scientists just had to face the reality and turn to designing systems that would fail in the least bad way.
Now, assume there is a new theorem for which the research community in all their efforts have not managed to find logical inconsistencies nor empirical soundness issues. Based on this theorem, it turns out that you cannot both have machinery that keeps operating and learning autonomously across domains, and a control system that would contain the effects of that machinery enough to not feed back in ways that destabilise our environment outside the ranges we can survive in.
We need to make a decision then – what would be the least bad way to fail here? On one hand we could decide against designing increasingly autonomous machines, and lose out on the possibility of having machines running around doing things for us. On the other hand, we could have the machinery fail in about the worst way possible, which is to destroy all existing life on this planet.
The press release strikes me as poorly written. It's middle-school level. ChatGPT can write better than this. Exactly who is your (Stop AI's) audience here? "The press"?
Exclamation points are excessive. "Heart's content"? You're not in this for "contentment". The "you can't prove it, therefore I'm right" argument is weak. The second page is worse. "Toxic conditions"? I think I know what you meant, but you didn't connect it well enough for a general audience. "accelerate our mass extinction until we are all dead"? I'm pretty sure the "all dead" part has to come before the "extinction". "(and abusing his sister)"? OK, there's enough in the public record to believe that Sam is not (ahem) "consistently candid", but I'm at under 50% about the sister abuse even then on priors. Do you want to get sued for libel on top of your jail time? Is that a good strategy?
I admire your courage and hope you make an impact, but if you're willing to pay these heavy costs, of getting arrested, and facing jail time etc., then please try to win! Your necessity defense is an interesting idea, but if this is the best you can do, it will fail. If you can afford to hire a good defense attorney, you can afford a better writer! Tell me how this move is 4-D chess and not just a blunder.
I do not find this post reassuring about your approach.
I am appalled to see this was not downvoted into oblivion! My best guess is that people feel that there are not enough efforts going towards stopping AI and did not read the post and the press release to check that you have good reasons motivating your actions.
Thanks, as far as I can tell this is a mix of critiques of strategic approach (fair enough), about communication style (fair enough), and partial misunderstandings of the technical arguments.
instead of a succession of events which need to go your way, I think you should aim for incremental marginal gains. There is no cost-effectiveness analysis…
I agree that we should not get hung up on a succession of events to go a certain way. IMO, we need to get good at simultaneously broadcasting our concerns in a way that’s relatable to other concerned communities, and opportunistically look for new collaborations there.
At the same time, local organisers often build up an activist movement by ratcheting up the number of people joining the events and the pressure they put on demanding institutions to make changes. These are basic cheap civil disobedience tactics that have worked for many movements (climate, civil rights, feminist, changing a ruling party, etc). I prefer to go with what has worked, instead of trying to reinvent the wheel based on fragile cost-effectiveness estimates. But if you can think of concrete alternative activities that also have a track record of working, I’m curious to hear.
Your press release is unreadable (poor formatting), and sounds like a conspiracy theory (catchy punchlines, ALL CAPS DEMANDS, alarmist vocabulary and unsubstantiated claims)
I think this is broadly fair. The turnaround time of this press release was short, and I think we should improve on the formatting and give more nuanced explanations next time.
Keep in mind the text is not aimed at you but at people more broadly who are feeling concerned and whom we want to encourage to act. A press release is not a paper. Our press release is more like a call to action – there is a reason to add punchy lines here.
The figures you quote are false (the median from AI Impacts is 5%) or knowingly misleading (the numbers from Existential risk from AI survey are far from robust and as you note, suffer from selection bias)
Let me recheck the AI Impacts paper. Maybe I was ditzy before, in which case, my bad.
As you saw from my commentary above, I was skeptical about using that range of figures in the first place.
You conflate AGI and self-modifying systems
Not sure what you see as the conflation?
AGI, as an autonomous system that would automate many jobs, would necessarily be self-modifying – even in the limited sense of adjusting its internal code/weights on the basis of new inputs.
Your arguments are invalid
The reasoning shared in the press release by my colleague was rather loose, so I more rigorously explained a related set of arguments in this post.
As to whether arguments from point 1 to 6. above are invalid, I haven’t seen you point out inconsistencies in the logic yet, so as it stands you seem to be sharing a personal opinion.
I am appalled to see this was not downvoted into oblivion!
Should I comment on the level of nuance in your writing here? :P
Let me recheck the AI Impacts paper.
I definitely made a mistake in quickly checking that number shared by a colleague.
The 2023 AI Impacts survey shows a mean risk of 14.4% for the question “What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species within the next 100 years?”.
Whereas the other smaller sample survey gives a median estimate of 30%
I already thought using those two figures as a range did not make sense, but putting a mean and a median in the same range is even more wrong.
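To make the mean-versus-median point concrete, here is a minimal sketch using made-up responses (the numbers below are hypothetical and are not taken from either survey). It only illustrates why a mean and a median of the same skewed sample can sit far apart, so that quoting one of each as the two ends of a range is misleading.

```python
from statistics import mean, median

# Hypothetical probability estimates (in percent) from ten respondents.
# These values are invented for illustration; they are NOT survey data.
responses = [1, 2, 2, 5, 5, 5, 10, 20, 40, 60]

# A few high estimates pull the mean up, while the median stays
# at the middle of the sorted responses.
sample_mean = mean(responses)
sample_median = median(responses)

print(f"mean   = {sample_mean:.1f}%")
print(f"median = {sample_median:.1f}%")
```

With a right-skewed sample like this one, the mean lands well above the median, so the two statistics are not comparable endpoints.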
Thanks for pointing this out! Let me add a correcting comment above.
In practice, engineers know that complex architectures interacting with the surrounding world end up having functional failures (because of unexpected interactive effects, or noisy interference). With AGI, we are talking about an architecture here that would be replacing all our jobs and move to managing conditions across our environment. If AGI continues to persist in some form over time, failures will occur and build up toward lethality at some unknown rate. Over a long enough period, this repeated potential for uncontrolled failures pushes the risk of human extinction above 99%.
This part is invalid, I think.
My understanding of this argument is: 1) There is an extremely powerful agent, so powerful that if it wanted to it could cause human extinction. 2) There is some risk of its goal-related systems breaking, and this risk doesn't rapidly decrease over time. Therefore the risk adds up over time and converges toward 1.
This argument doesn't work because the two premises won't hold. For 2) An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure. For 1) Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.
So the risk of everyone dying should only come from a relatively short period after an agent (or agents) become powerful enough that killing everyone is an ~easy option.
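The disagreement here turns on whether the per-period risk stays roughly constant or shrinks over time. A minimal sketch, assuming independent periods and hypothetical hazard rates (the function name `cumulative_risk` and all numbers are illustrative, not from the post): with a constant per-period failure probability the cumulative risk approaches 1, whereas with a hazard that halves each period it levels off far below 1.

```python
def cumulative_risk(hazards):
    """Probability of at least one failure, given independent per-period hazards."""
    survival = 1.0
    for h in hazards:
        survival *= (1.0 - h)
    return 1.0 - survival

periods = 10_000

# Assumption A (hypothetical): hazard stays constant at 0.1% per period.
constant = cumulative_risk([0.001] * periods)

# Assumption B (hypothetical): hazard starts at 0.1% and halves every period.
decaying = cumulative_risk([0.001 * 0.5 ** t for t in range(periods)])

print(f"constant hazard: {constant:.4f}")
print(f"decaying hazard: {decaying:.4f}")
```

Which of the two assumptions better describes a self-modifying system is exactly what is in dispute; the sketch only shows that the conclusion depends on it.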
There is some risk of its goal-related systems breaking
Ah, that’s actually not the argument.
Could you try reading points 1-5 again?
I've reread and my understanding of point 3 remains the same. I wasn't trying to summarize points 1-5, to be clear. And by "goal-related systems" I just meant whatever is keeping track of the outcomes being optimized for.
Perhaps you could point me to my misunderstanding?
Appreciating your openness.
(Just making dinner – will get back to this when I’m behind my laptop in around an hour).
An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.
…
by "goal-related systems" I just meant whatever is keeping track of the outcomes being optimized for.
So the argument for 3. is that just by AGI continuing to operate and maintain its components as adapted to a changing environment, the machinery can accidentally end up causing destabilising effects that were untracked or otherwise insufficiently corrected for.
You could call this a failure of the AGI’s goal-related systems if you mean with that that the machinery failed to control its external effects in line with internally represented goals.
But this would be a problem with the control process itself.
An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.
Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.
Any of the machinery’s external downstream effects that its internal control process cannot track (ie. detect, model, simulate, and identify as a “goal-related failure”), that process cannot correct for.
For further explanation, please see links under point 4.
Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.
The problem here is that (a) we are talking about not just a complicated machine product but self-modifying machinery and (b) at the scale this machinery would be operating at it cannot account for most of the potential human-lethal failures that could result.
For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures.
For (b), engineering decentralised redundancy can help especially at the microscale.
~
Scaling up the connected components exponentially increases their degrees of freedom of interaction. And as those components change in feedback with surrounding contexts of the environment (and have to in order for AGI to autonomously adapt), an increasing portion of the possible human-lethal failures cannot be adequately controlled for by the system itself.
You could call this a failure of the AGI’s goal-related systems, if by that you mean that the machinery failed to control its external effects in line with internally represented goals.
But this would be a problem with the control process itself.
So it's the AI being incompetent?
Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.
Yeah, I think that would be a good response to my argument against premise 2). I've had a quick look at the list of theorems in the paper. I don't know most of them, but the ones I do know don't seem to support the point you're making. So I don't buy it. Could you walk me through how one of these theorems is relevant to capping self-improvement of reliability?
For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures.
You don't have to simulate something to reason about it.
E.g. How can AGI code predict how its future code learned from unknown inputs will function in processing subsequent unknown inputs?
Garrabrant induction shows one way of doing self-referential reasoning.
- But what does it mean to correct for failures at the level of local software (bugs, viruses, etc)? What does it mean to correct for failures across some decentralised server network? What does it mean to correct for failures at the level of an entire machine ecosystem (which AGI effectively becomes)?
As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can't destroy the world/country, as a crazy dictator would.
So it's the AI being incompetent?
Yes, but in the sense that there are limits to the AGI's capacity to sense, model, simulate, evaluate, and correct its own component effects propagating through a larger environment.
You don't have to simulate something to reason about it.
If you can't simulate (and therefore predict) that a failure mode that by default is likely to happen would happen, then you cannot counterfactually act to prevent the failure mode.
Could you walk me through how one of these theorems is relevant to capping self-improvement of reliability?
Maybe take a look at the hashiness model of AGI uncontainability. That's an elegant way of representing the problem (instead of pointing at lots of examples of theorems that show limits to control).
This is not put into mathematical notation yet though. Anders Sandberg is working on it, but also somewhat distracted. Would value your contribution/thinking here, but I also get if you don't want to read through the long transcripts of explanation at this stage. See project here.
Anders' summary:
"A key issue is the thesis that AGI will be uncontrollable in the sense that there is no control mechanism that can guarantee aligned behavior since the more complex and abstract the target behavior is the amount of resources and forcing ability needed become unattainable.
In order to analyse this better a sufficiently general toy model is needed for how controllable systems of different complexity can be, that ideally can be analysed rigorously.
One such model is to study families of binary functions parametrized by their circuit complexity and their "hashiness" (how much they mix information) as an analog for the AGI and the alignment model, and the limits to finding predicates that can keep the alignment system making the AGI analog producing a desired output."
Garrabrant induction shows one way of doing self-referential reasoning.
We're talking about learning from inputs received from a more complex environment (through which AGI outputs also propagate as changed effects of which some are received as inputs).
Does Garrabrant take that into account in his self-referential reasoning?
As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can't destroy the world/country, as a crazy dictator would.
A human democracy is composed out of humans with similar needs. This turns out to be an essential difference.
How about I assume there is some epsilon such that the probability of an agent going off the rails is greater than epsilon in any given year. Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others? This should reduce the risk to some fraction of epsilon, right?
(I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%, because that part seems wrong for quite simple reasons).
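To make the arithmetic in that question concrete, here is a minimal sketch. It rests on assumptions that are not established anywhere in this thread: that agent failures are fully independent, that each agent fails with the same annual probability eps, and that things only go wrong when more than half of the agents fail in the same year. The function names are hypothetical, and whether the independence assumption can hold at all is exactly what is in dispute.

```python
from math import comb


def p_majority_fail(n: int, eps: float) -> float:
    """Probability that more than half of n agents fail in the same year.

    Assumes (hypothetically) independent failures with equal probability eps.
    """
    k_min = n // 2 + 1  # smallest number of failures that forms a majority
    return sum(
        comb(n, k) * eps**k * (1 - eps) ** (n - k)
        for k in range(k_min, n + 1)
    )


def p_cumulative(p_year: float, years: int) -> float:
    """Probability of at least one bad year over a horizon, given a fixed annual rate."""
    return 1 - (1 - p_year) ** years


if __name__ == "__main__":
    eps = 0.01
    for n in (1, 3, 9):
        yearly = p_majority_fail(n, eps)
        print(n, yearly, p_cumulative(yearly, 1000))
```

Under these assumptions the per-year risk falls quickly as n grows, but any nonzero annual rate still accumulates over a long enough horizon, and any correlation between agents weakens the reduction.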
How about I assume there is some epsilon such that the probability of an agent going off the rails
Got it. So we are both assuming that there would be some accumulative failure rate [per point 3.].
Why can't the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others?
I tried to adopt this ~uncorrelated agents framing, and then argue from within that. But I ran up against some problems with this framing:
→ Do those problems make sense to you as stated? Do you notice anything missing there?
To sum it up, you and I are still talking about a control system [per point 4.]:
I'm gonna try and stay focused on a single point, specifically the argument that leads up to >99%
I'm also for now leaving aside substrate-needs convergence [point 5]:
There are some writing issues here that make it difficult to evaluate the ideas presented purely on their merits. In particular, the argument for 99% extinction is given a lot of space relative to the post as a whole, where it should really be a bullet point that links to where this case is made elsewhere (or if it is not made adequately elsewhere, as a new post entirely). Meanwhile, the value of disruptive protest is left to the reader to determine.
As I understand the issue, the case for barricading AI rests on:
1. Safety doesn't happen by default
a) AI labs are not on track to achieve "alignment" as commonly considered by safety researchers.
b) Those standards may be over-optimistic--link to Substrate Needs Convergence, arguments by Yampolskiy, etc.
c) Even if the conception of safety assumed by the AI labs is right, it is not clear that their utopic vision for the future is actually good.
2. Advocacy, not just technical work, is needed for AI safety
a) See above
b) Market incentives are misaligned
c) Policy (and culture) matters
3. Disruptive actions, not just working within civil channels, are needed for effective advocacy.
a) Ways that working entirely within ordinary democratic channels can get delayed or derailed
b) Benefits of disruptive actions, separate from or in synergy with other forms of advocacy
c) Plan for how StopAI's specific choice of disruptive actions effectively plays to the above benefits
d) Moral arguments, if not already implied
As I understand the issue, the case for barricading AI rests on:
Great list! Basically agreeing with the claims under 1. and the structure of what needs to be covered under 2.
Meanwhile, the value of disruptive protest is left to the reader to determine.
You're right. Usually when people hear about a new organisation on the forum, they expect some long write-up of the theory of change and the considerations around what to prioritise.
I don't think I have time right now for writing a neat public write-up. This is just me being realistic – Sam and I are both swamped in terms of handling our work and living situations.
So the best I can do is point to examples where civil disobedience has worked (eg. Just Stop Oil demands, Children's March) and then discuss our particular situation (how the situation is similar and different, who the important stakeholders are, what our demands are, what possible effective tactics are in this context).
In particular, the argument for 99% extinction is given a lot of space relative to the post as a whole,
Ha, fair enough. The more rigorously I tried to write out the explanation, the more space it took.
I mean, yes, hence my comment about ChatGPT writing better than this, but if word gets out that Stop AI is literally using the product of the company they're protesting in their protests, it could come off as hypocrisy.
I personally don't have a problem with it, but I understand the situation at a deeper level than the general public. It could be a wise strategic move to hire a human writer, or even ask for competent volunteer writers, including those not willing to join the protests themselves, although I can see budget or timing being a factor in the decision.
Or they could just use one of the bigger Llamas on their own hardware and try to not get caught. Seems like an unnecessary risk though.
sigh Protests last year, barricading this year, I've already mentally prepared myself for someone next year throwing soup at a human-generated painting while shouting about AI. This is the kind of stuff that makes no one in the Valley want to associate with you. It makes the cause look low-status, unintelligent, lazy, and uninformed.
Just because the average person disapproves of a protest tactic doesn't mean that the tactic didn't work. See Roger Hallam's "Designing the Revolution" series for the thought process underlying the soup-throwing protests. Reasonable people may disagree (I disagree with quite a few things he says), but if you don't know the arguments, any objection is going to miss the point. The series is very long, so here's a tl;dr:
- If the public response is: "I'm all for the cause those protestors are advocating, but I can't stand their methods" notice that the first half of this statement was approval of the only thing that matters--approval of the cause itself, as separate from the methods, which brought the former to mind.
- The fact that only a small minority of the audience approves of the protest action is in itself a good thing, because this efficiently filters for people who are inclined to join the activist movement--especially on the hard-core "front lines"--whereas passive "supporters" can be more trouble than they're worth. These high-value supporters don't need to be convinced that the cause is right; they need to be convinced that the organization is the "real deal" and can actually get things done. In short, it's niche marketing.
- The disruptive protest model assumes that the democratic system is insufficient, ineffective, or corrupted, such that simply convincing the (passive) center majority is not likely to translate into meaningful policy change. The model instead relies on putting the powers-that-be into a bind where they have to either ignore you (in which case you keep growing with impunity) or over-react (in which case you leverage public sympathy to grow faster). Again, it isn't important how sympathetic the protestors are, only that the reaction against them is comparatively worse, from the perspective of the niche audience that matters.
- The ultimate purpose of this recursive growth model is to create a power bloc that forces changes that wouldn't otherwise occur on any reasonable timeline through ordinary democratic means (like voting) alone.
- Hallam presents incremental and disruptive advocacy as in opposition. This is where I most strongly disagree with his thesis. IMO: moderates get results, but operate within the boundaries defined by extremists, so they need to learn how to work together.
In short, when you say an action makes a cause "look low status", it is important to ask "to whom?" and "is that segment of the audience relevant to my context?"
efficiently filters for people who are inclined to join the activist movement--especially on the hard-core "front lines"--whereas passive "supporters" can be more trouble than they're worth.
I had not considered how our messaging is filtering out non-committed supporters. Interesting!
Protesters are expected to be at least a little annoying. Strategic unpopularity might be a price worth paying if it gets results. Sometimes extremists shift the Overton Window.
Stop AI just put out a short press release.
As an organiser, let me add some thoughts to nuance the text:
Plan
Emphasis is on peacefully. We are a non-violent activist organisation. We refuse to work with any activist who has other plans.
My take is that a small group barricading OpenAI is a doable way to be a thorn in OpenAI's side, while raising public attention to the recklessness of AI corporations. From there, stopping AI development requires many concerned communities acting together to restrict the data, work, uses, and hardware of AI.
My co-organisers Sam and Guido are willing to put their body on the line by getting arrested repeatedly. We are that serious about stopping AI development.
The Necessity Defense is when an "individual commits a criminal act during an emergency situation in order to prevent a greater harm from happening." This defense has been used by climate activists who got arrested, with mixed results. Sam and others will be testifying in court that we acted to prevent imminent harms (not just extinction risk).
Or at least, we would gain legal freedom to keep blocking OpenAI's entrances until they stop causing increasing harms.
Our actions are a way to signal to the concerned public that they can act and speak out against AI companies.
I expect most Americans to not feel strongly yet about preventing the development of generally functional systems. Clicking a response in a certain framed poll is low-commitment. So we will also broadcast more stories of how recklessly AI corporations have been acting with our lives.
Risk of extinction
My colleague took the mean number of 14% from the latest AI Impacts survey, and the median number of 30% from the smaller-sample survey 'Existential Risk from AI'. Putting a mean and a median number in the same range does not make sense. The second survey especially also has a problem with self-selection, so I would take it with a grain of salt.
My colleague told me that the survey results understate the risk, because AI researchers don't want to believe that their profession will lead to the end of the world. I countered that polled AI researchers could as well be overstating the risk, because they are stuck in a narrow worldview that has been promoting the imminence of powerful AI since 1956.
But both are just vague opinions about cultural bias. Making social claims about "experts" does not really help us find out whether/where the polled "experts" actually thought things through.
Asking for P(doom) guesses is a lousy epistemic process, so I prefer to work through people's reasoning instead. Below are arguments why the long-term risk of extinction is above 99%.
"AGI in a year" makes no sense in my opinion. AI systems would require tinkering and learning to navigate the complexity of a much larger and messier environment. This process is not at all like AlphaGo recursively self-improving in its moves on an internally simulated 19x19 grid.
But if you are worried about such short timelines, then it is time to act. We've seen too many people standing on the sidelines worrying we could all die soon. If you think this, please act with dignity – collaborate where you can to restrict AI development.
That's a reasoning leap, but there is only so much my colleague could cover in a press release.
Let me explain per term why the risk of extinction would be greater than 99%:
This by itself is a precautionary principle argument (as part of 1-3. above).
Then, there are reasons why AGI uncontrollably converges on human extinction (see 4-6.).
Hopefully, arguments 1-6. combined clarify why I think that stopping AI development is the only viable path to preventing our extinction.
That is:
Even if such “alignment” mechanisms were not corrupted by myopic or malevolent actors… then still AGI converges on our extinction.
Why restrict OpenAI
OpenAI is already doing a lot of harm. This grounds our Necessity Defense for barricading their office.
We are doing what we can to restrict harmful AI development.
You can too.