That's certainly an interesting result. Have you tried running the same prompt again and seeing if the response changes? I've noticed that some LLMs give different answers to the same prompt. For example, when I quizzed DeepSeek R1 on whether a priori knowledge exists, it answered in the affirmative the first time and in the negative the second time.
If alignment by default is not the majority opinion, then what is (pardon my ignorance as someone who mostly interacts with the alignment community via LessWrong)? Is it 1) that we are all ~doomed, 2) that alignment is hard but we have a decent shot at solving it, or 3) something else entirely?
I get the feeling that people were a lot more pessimistic about our chances of survival in 2023 than in 2024 or 2025 (in other words, pessimism seems to be going down somewhat), but I could be completely wrong about this.
Thanks for the reply!
...The only general remarks that I want to make are in regards to your question about the model of 150 year long vaccine testing on/over some sort of sample group and control group. I notice that there is nothing exponential assumed about this test object, and so therefore, at most, the effects are probably multiplicative, if not linear. Therefore, there are lots of questions about power dynamics that we can overall safely ignore, as a simplification, which is in marked contrast to anything involving ASI. If we assume, as you requested, "no side ef
Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed?
Yes, but none of this would remain alive if I, as a whole, decided to jump from a cliff. The multiple aspects of my brain would die with my brain. After all, you mentioned subsystems that wouldn't self-terminate with the rest of the ASI, whereas in a human body, jumping from a cliff terminates everything.
But even barring that, an ASI can decide to fly into the Sun, and any subsystem that shows any sign of refusal to do so will be immediately rep...
Thanks for the response!
So we are to try to imagine a complex learning machine without any parts/components?
Yeah, sure. Humans are an example. If I decide to jump off a cliff, my arm isn't going to say "alright, you jump but I stay here". Either I, as a whole, would jump or I, as a whole, would not.
Can the ASI prevent the relevant classes of significant (critical) organic human harm that soon occur as a direct result of its own hyper powerful/consequential existence?
If by that, you mean "can ASI prevent some relevant classes of harm caused by its existence"...
I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chance that one of them might not suicide, for whatever reason, becomes its own class of significant concerns.
If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem.
If the first ASI that we build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply ...
Thanks for the response!
...Unfortunately, the overall SNC claim is that there is a broad class of very relevant things that even a super-super-powerful-ASI cannot do, cannot predict, etc, over relevant time-frames. And unfortunately, this includes rather critical things, like predicting whether or not its own existence (and that of all of the aspects of all of the ecosystem necessary for it to maintain its existence/function), over something like the next few hundred years or so, will also result in the near total extinction of all humans (and everything el
Hey, Forrest! Nice to speak with you.
Question: Is there ever any reason to think... Simply skipping over hard questions is not solving them.
I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below.
Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?
Here is my take:
We don't need ASI to be able to 100% predict ...
Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self-destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.
Yup, that's a good point, I edited my original comment to reflect it.
...Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force. Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers. The core idea that I want reader
Thank you for thoughtful engagement!
On the Alignment Difficulty Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.
I know this is not necessarily an important point, but I am pretty sure that Redwood Research is working on difficulty 7 alignment techniques. They consistently make assumptions that AI will scheme, deceive, sandbag, etc.
They are a decently popular group (as far as AI alignment groups go) an...
Thanks for responding again!
SNC's general counter to "ASI will manage what humans cannot" is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter's capacity.
If this argument is true and decisive, then the ASI could decide to stop any improvements in its intelligence or to intentionally make itself less complex. It makes sense to reduce the area where you are vulnerable in order to make it easier to monitor/control.
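To make the shape of that trade-off concrete, here is a toy sketch (my own illustration, not taken from the SNC writings; the growth rates and numbers are assumptions): if the monitoring burden grows with pairwise interactions between components while control capacity grows only linearly, the system stays monitorable only below some size, which is why capping or shrinking complexity could help.

```python
# Toy illustration (not from SNC): burden ~ pairwise interactions (~n^2),
# capacity ~ linear in component count (~c*n). Below some size the system
# is controllable; above it, the burden outpaces capacity.

def control_burden(n_components: int) -> float:
    """Pairwise interactions that must be monitored."""
    return n_components * (n_components - 1) / 2

def control_capacity(n_components: int, capacity_per_component: float = 50.0) -> float:
    """Monitoring resources, assumed to scale linearly with size."""
    return capacity_per_component * n_components

for n in [10, 50, 100, 200]:
    burden, capacity = control_burden(n), control_capacity(n)
    status = "controllable" if burden <= capacity else "overwhelmed"
    print(f"n={n:>3}  burden={burden:>8.0f}  capacity={capacity:>8.0f}  {status}")
```

On this toy picture, an ASI that stops growing before the crossover point keeps the burden within capacity, which is all my reply above is pointing at.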
...(My understanding of) the counter here is that, if we are on the tra
This may not be factually true, btw - current LLMs can create good models of past people without explicitly running a simulation of their previous life.
Yup, I agree.
It is a variant of the Doomsday argument. This idea is even more controversial than the simulation argument. There is no future with many people in it.
This makes my case even stronger! Basically, if a Friendly AI has no issues with simulating conscious beings in general, then we have good reasons to expect it to simulate more observers in blissful worlds than in worlds like ours.
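To put rough numbers on that intuition, here is a minimal Bayes sketch (the probabilities are made up; only the direction of the update matters):

```python
# Made-up numbers, purely to show the direction of the update.
# If a Friendly AI mostly simulates blissful observers, then observing
# intense suffering is strong evidence against being in such a simulation.

prior_fai_sim = 0.5                # prior on being in a Friendly-AI simulation
p_suffering_given_fai_sim = 0.01   # FAI simulates mostly blissful worlds
p_suffering_given_not_sim = 1.0    # worlds like ours contain suffering

posterior = (p_suffering_given_fai_sim * prior_fai_sim) / (
    p_suffering_given_fai_sim * prior_fai_sim
    + p_suffering_given_not_sim * (1 - prior_fai_sim)
)
print(f"P(Friendly-AI simulation | I observe suffering) = {posterior:.3f}")  # ~0.010
```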
If the Doomsday Arg...
Thank you for responding as well!
If the AI destroys itself, then it's obviously not an ASI for very long ;)
If the ASI replaces its own substrate with an organic one, then SNC would no longer apply (at least in my understanding of the theory; someone else might correct me here), but then it wouldn't be artificial anymore (an SI rather than an ASI)
at what point does it stop being ASI?
It might stop being ASI immediately, depending on your definition, but this is absolutely fine with me. In these scenarios that I outlined, we build something that can be initia...
Firstly, I want to thank you for putting SNC into text. I also appreciated the effort of showcasing a logic chain that arrives at your conclusion.
With that being said, I will try to outline my main disagreements with the post:
2. Self-modifying machinery (such as through repair, upgrades, or replication) inevitably results in effects unforeseeable even to the ASI.
Let's assume that this is true for the sake of argument. An ASI could access this post, see this problem, and decide to stop using self-modifying machinery for such tasks.
...3. The space of unfo
She will be unconscious, but still send messages about pain. Current LLMs can do this. Also, as it is a simulation, there are recordings of her previous messages or of a similar woman, so they can be copypasted. Her memories can be computed without actually putting her in pain.
So if I am understanding your proposal correctly, a Friendly AI will make a woman unconscious during moments of intense suffering and then implant memories of pain in her. Why would it do that, though? Why not just remove the experience of pain entirely? In fact, why does Friendly AI seem ...
If the preliminary results of the poll hold, then that would be pretty much in line with my hypothesis that most people prefer creating simulations with no suffering over a world like ours. However, it is important to note that this might not be representative of human values in general, because, looking at your Twitter account, your audience comes mostly from a very specific circle of people (those interested in futurism and AI).
Would someone else reporting to have experienced intense suffering decrease your credence in being in a simulation?
...No. Memory ab
I am sorry to butt into your conversation, but I do have some points of disagreement.
I think a more meta-argument is valid: it is almost impossible to prove that all possible civilizations will not run simulations despite having all data about us (or being able to generate it from scratch).
I think that's a very high bar to set. It's almost impossible to definitively prove that we are not in a Cartesian demon or brain-in-a-vat scenario. But this doesn't mean that those scenarios are likely. I think it is fair to say that more than a possibility is required ...
Diffractor's critique is what comes to mind when I think of strong critiques of AIXI. I believe that addressing it would make the post more complete and, as a result, better.
Since this post is about rebutting criticisms of AIXI, I feel it would be only fair to include Rob Bensinger's criticism. I considered it to be the strongest criticism of AIXI by a mile. Do you have any rebuttals for that post?
Thank you for your response! It basically covers all of the five issues that I had in mind. It is definitely some food for thought, especially your disagreement with Eliezer. I am much more inclined to think you are correct because his activity has considerably died down (at least on LessWrong). I am really looking forward to your "A broader path: survival on the default fast timeline to AGI" post.
E.g., I think software only singularity is more likely than you do, and think that worst case scheming is more likely
By "software only singularity" do you mean a scenario where all humans are killed before singularity, a scenario where all humans merge with software (uploading) or something else entirely?
I agree with your view about organizational problems. Your discussion gave me an idea: is it possible to shift employees dedicated to capability improvement to work on safety improvement instead? Set safety goals for these employees within the organization. This way, they will have a new direction and won't be idle, worried about being fired, or tempted to resign and go to other companies.
That seems to solve problem #4. Employees quitting becomes much less of an issue, since in any case they would only be able to share knowledge about safety (which is a good thing).
Do yo...
I actually really liked your alignment plan. But I do wonder how it would be able to deal with 5 "big" organizational problems of iterative alignment:
Moving slowly and carefully is annoying. There's a constant tradeoff between getting more done and elevated risk. Employees who don't believe in the risk will likely try to circumvent or goodhart the security procedures. Filtering for employees willing to take the risk seriously (or training them to) is difficult. There's also the fact that many security procedures are just security theater. Engineers h
Sorry it's taken me a while to get back to this.
No problem posting your questions here. I'm not sure of the best place but I don't think clutter is an issue, since LW organizes everything rather well and has good UIs to navigate it.
I read that "Carefully Bootstrapped Alignment" is organizationally hard by Raemon and it did make some good points. Most of them I'd considered, but not all of them.
Pausing is hard. That's why my scenarios barely involve pauses and only address them because others see them as a possibility.
Basically I think we have to get alignm...
Edit: I hope that I am not cluttering the comments by asking these questions. I am hoping to create a separate post where I list all the problems that have been raised for the scalable alignment proposal and all the proposed solutions to them. So far, everything you have said has seemed not only sensible but also plausible, so I greatly value your feedback.
I have found some other concerns about scalable oversight/iterative alignment, that come from this post by Raemon. They are mostly about the organizational side of scalable oversight:
...
- Moving slowly and carefully
I have found another possible concern of mine.
Consider gravity on Earth: it seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, etc. There are infinitely many such theories and only one theory on which gravity works as an absolute rule.
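Here is a minimal sketch of what I mean (the cutoff years are arbitrary; this is just my own illustration of the underdetermination):

```python
# Minimal sketch: finitely many observations are consistent with infinitely
# many rules. Every rule below matches all past data; they only diverge later.

observed_years = range(2000, 2025)   # years in which gravity was observed to work

def gravity_always_works(year: int) -> bool:
    return True

def gravity_works_until(cutoff: int):
    return lambda year: year < cutoff

candidate_rules = {"always works": gravity_always_works}
for cutoff in (2025, 2026, 2027, 2028, 2029, 2030):   # one rule per cutoff year
    candidate_rules[f"works until {cutoff}"] = gravity_works_until(cutoff)

for name, rule in candidate_rules.items():
    fits_past = all(rule(y) for y in observed_years)
    print(f"{name:<18} fits all past observations: {fits_past}, "
          f"predicts gravity works in 2030: {rule(2030)}")
```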
We might infer from the simplest explanation that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI alignment, our evidence must rule out whether an AI is following a misaligned rule compare...
After reading your comment, I do agree that a unipolar AGI scenario is probably better than a multipolar one. Perhaps I underestimated how offense-favored our world is.
With that aside, your plan is possibly one of the clearest, most intuitive alignment plans that I've seen. All of the steps make sense and seem decently likely to happen, except maybe for one. I am not sure that your argument for why we have good odds for getting AGI into trustworthy hands works.
"It seems as though we've got a decent chance of getting that AGI into a trustworthy-enough power s...
I've asked this question to others, but would like to know your perspective (because my conversations with you have been genuinely illuminating for me). I'd be really interested in knowing your views on the control-by-power-hungry-humans side of AI risk.
For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don't think we could trust any of the current leading AI labs to use that power fairly. I don't think this lab would voluntarily decide to give up control over AGI either (intuitively...
If it's not a big ask, I'd really like to know your views on the control-by-power-hungry-humans side of AI risk.
For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don't think I could trust any of the current leading AI labs to use that power fairly. I don't think this lab would voluntarily decide to give up control over it either (intuitively, it would take quite something for anyone to give up such a source of power). Is there anything that can be done to prevent such a scenario?
What's most worrying is that in your post If we solve alignment, do we die anyway? you mentioned your worries about multipolar scenarios. However, I am not sure we'd be much better off in a unipolar scenario. If there is one group of people controlling AGI, then it might actually be even harder to make them give it up. They'd have a large amount of power and no real threat to it (no multipolar AGIs threatening to launch an attack).
However, I am not well-versed in literature on this topic, so if there is any plan for how we can safeguard ou...
Another concern that I could see with the plan: Step 1 is to create safe and aligned AI, but there are some results which suggest that even current AIs may not be as safe as we want them to be. For example, according to this article, current AI (specifically o1) can help novices build CBRN weapons and significantly increase the threat to the world. Do you think this is concerning, or do you think that this threat will not materialize?
That's a very thoughtful response from TurnTrout. I wonder if @Gwern agrees with its main points. If not, it would be good to know where he thinks it fails.
Noosphere, I am really, really thankful for your responses. You completely answered almost all of the concerns that I had about alignment (I am still not convinced by that strategy for avoiding value drift; I will probably post it as a separate question to see if other people have different strategies for preventing value drift).
This discussion significantly increased my knowledge. If I could triple-upvote your answers, I would. Thank you! Thank you a lot!
That was my takeaway from the experiments done in the aftermath of the alignment faking paper, so it's good to see that it's still holding.