I am not from the US, so I don't know anything about the organizations that you have listed. However, we can look at three main conventional sources of existential risk (excluding AI safety for now; we will come back to it later):
As to your point about Hassabis not doing other projects for reducing e...
...Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight.
It’s fine to talk about that scenario, and I hope it comes to pass! But in addition to the question of what those wise people should do, if they exist, we should also be concerned about the possibility that the people with instruction-following
Not OP, but for me, it comes down to LLMs correctly interpreting the intent behind my questions/requests. In other words, I don't need to be hyper specific in my prompts in order to get the results I want.
By value alignment, which one of the following do you mean?
1) alignment with humanity's CEV
2) alignment with some other specified way of interpreting humanity's values
3) alignment with some unspecified way of interpreting humanity's values that future AGI/ASI will hopefully come up with
This article seems to have found a big weakness of untrusted monitoring:
...Synthetic inputs auditing is trivially defeated if the checker AI can distinguish synthetic inputs from real ones, as it could then always answer truthfully on the synthetic inputs to pass auditing, then collude with the generator on real inputs.
To create useful synthetic inputs, either the blue team must be able to match the generator's style well enough to consistently fool the checker, or it must be able to redact the generator's output to the point where the checker can no longer tell real inputs from synthetic ones.
Maybe it is a controversial take, but I am in favor of publishing all solutions to corrigibility.
If a company fails at coming up with solutions for AGI corrigibility, they wouldn't stop building AGI. Instead, they would proceed with ramping up capabilities and end up with a misaligned AGI that (for all we know) will want to turn the universe into paperclips.
Due to instrumental convergence, an AGI whose goals are not explicitly aligned with human goals is going to engage in very undesirable behavior by default.
Let me put my attitudes in practical terms: I don't kick dogs, but I have destroyed a wasp's nest in my garage, and I don't donate to animal charities. (I don't donate to many other charities either, but there have been a few.) Let those who think saving the shrimps is worthwhile do so, but I do not travel on that path.
This is what I expected. Your take when put in these terms seems pretty moderate. Whereas, when I read your original comment, this take (which presumably stayed the same) seemed very extreme.
In other words, my personal beliefs haven't changed...
Agreed!
There is also the issue of clarity: I am not sure whether Richard has a moderate position that merely sounds very extreme due to the framing, or whether he genuinely holds this extreme position.
I do think that animals, the larger ones at least, can suffer. But I don’t much care.
Does this mean a more moderate take of "I don't care enough to take any actions, because I don't believe that I am personally causing much suffering to animals" or a very radical take of "I would rather take $10 than substantially improve the well-being of all animals"?
I think there might be some double standard going on here.
You seem to not care much about animal well-being, and @quila evidently does, so would it not be fair from Quila's perspective to not care much about your well-being? And if Quila doesn't care about your well-being, then he might think that had you not existed, the world (in a utilitarian sense) would be better.
Quila can similarly say, "I don't care for the lives of people who are actively destroying something that I value a lot. Bring on the death threats!"
I will also reply to the edit here.
would you say that "we have qualia" is just a strong intuition imbued in us by the mathematical process of evolution, and your "realist" position is that you choose to accept {referring to it as if it is real} as a good way of communicating about that imbued intuition/process?
I wouldn't say that about qualia and I wouldn't say that about free will. What you described sounds like being only a nominal realist.
This is going into details, but I think there is a difference between semantic realism and nominal realism. Se...
By the way, I will quickly reply to your edit in this comment. Determinism certainly seems compatible with qualia being fundamental.
"Hard Determinism" is just a shorthand for a position (in a free will debate) that accepts the following claims:
I am pretty sure that the term has no use outside of discussions about free will.
I will try to reply to your edit:
Also, another thing I'm interested in is how someone could have helped past you (and camp #1 people generally) understand what camp #2 even means. There may be a formal-alignment failure mode not noticeable to people in camp #1 where a purely mathematical (it may help to read that as '3rd-person')
I don't remember the exact specifics, but I came across the Mary's Room thought experiment (perhaps through this video). When presented in that way and when directly asked "does she learn anything new?", my surprising (to myself at the time) answer was an emphatic "yes".
what is qualia, in your ontology
It is the direct phenomenal experience of stuff like pain, pleasure, colors, etc.
what does it mean for something to be irreducible or metaphysically fundamental
Something is irreducible in the sense that it can't be reduced to interactions between atoms. It also can't be completely described from a third-person perspective (the perspective from which science usually operates).
I understand the 'compatibilist' view on free will to be something like this: determinism is true, but we still run some algorithm which deterministically picks between options, so the algorithm that we are is still 'making a choice', which I choose to conceptualize as "free will and determinism are compatible"
I would rather conceptualize it as acting according to my value system without excessive external coercion. But I would also say that I agree with all four claims that you made there, namely:
one who believes that those in camp #2 merely have a strong intuition that they have <mysterious-word>
Under this definition I think I would still qualify as a former illusionist about qualia.
Ah, I see. I usually think of illusionism as a position where one thinks that a certain phenomenon isn't real but only appears real.
Under the terminology of your most recent comment, I would be a determinist and a realist about free will.
I used to be a hard determinist (I think that's what people usually mean by "illusionism"), but currently I am a compatibilist.
So yeah, right now I am not an illusionist about free will either.
Has there ever been a case of someone being in camp #1, but eventually realizing, according to their self-report, "I actually do have qualia; it was right in front of me this whole time, but I've only now noticed it as a separate thing needing explanation, even though in retrospect it seems obvious to me"?
This is almost exactly my story. I used to think that qualia was just another example of anti-physicalist magical thinking.
Currently, I am actually pretty firmly in Camp #2
illusionists actually do not experience qualia
As someone who used to be (but no lon...
It was pointed out, e.g. [here](https://web.archive.org/web/20170918044233/http://files.openphilanthropy.org/files/Grants/MIRI/consolidated_public_reviews.pdf) but unfortunately the mindset is self-reinforcing so there wasn't much to be done.
Your link formatting got messed up here.
More generally when you're asked a question like "How much do you expect [politician's name] to win/lose by?" how do you go about generating your response? How do you generate your predictions?
My usual response would look something like this:
"Pre-scandal polls are showing Bodee leading by about 1-2%, but given the recent spying scandal and the strong emphasis of our society on ethical conduct, I expect him to lose by 3-4%."
If I am speaking to a person who (on my expectations) is following the polls, I would try to explain why my prediction differs from them.
But you seem to mean something broader with "possible worlds." Something like "in theory, there is a physically possible arrangement of atoms/energy states that would result in an 'aligned' AGI, even if that arrangement of states might not be reachable from our current or even a past world."
–> Am I interpreting you correctly?
Yup, that's roughly what I meant. However, one caveat would be that I would change "physically possible" to "metaphysically/logically possible" because I don't know if worlds with different physics could exist, whereas I am pr...
Listeners would understandably infer that "these well-informed people apparently don't really worry much, maybe I shouldn't worry much either".
I think this is countered to a great extent by all the well-informed people who worry a lot about AI risk. I think the "well-informed people apparently disagree on this topic, I better look into it myself" environment promotes inquiry and is generally good for truth-seeking.
More generally, I agree with @Neel Nanda; it seems somewhat doubtful that people listening to the very niche Epoch Podcast aren't aware of all the smart people worried about AI risk.
Future humans in specific, will at least one of: [ die off early, run lots of high-fidelity simulations of our universe's history ["ancestor-simulations"], decide not to run such simulations ]
Is there any specific reason the first option is "die off early" and not "be unable to run lots of high-fidelity simulations"? The latter encompasses the former as well as scenarios where future humans survive but for one reason or the other can't run these simulations.
A more general argument, in my opinion, would look like this:
"Future humans will at leas...
Here is Richard Carrier's response to avturchin's comment (for the courtesy of those reading this thread later):
These do not respond to my article. Just re-stating what was refuted (that most civilizations will be irredeemably evil and wasteful) is not a rebuttal. It’s simply ignoring the argument.
I suggest sending this as a comment under his article if you haven't already. I am similarly interested in his response.
Richard Carrier has written an entire rebuttal of the argument on his blog. He always answers comments under his posts, so if you disagree with some part of the rebuttal, you can just leave a comment there explaining which claim makes no sense to you. Then, he will usually defend the claim in question or provide the necessary clarification.
By "possible worlds," I mean all worlds that are consistent with laws of logic, such as the law of non-contradiction.
For example, it might be the case that, for some reason, alignment would have been solved if and only if Abraham Lincoln hadn't been assassinated in 1865. That means that humans in 2024 in our world (where Lincoln was assassinated in 1865) will not be able to solve alignment, despite it being solvable in principle.
My answer is kind of similar to @quila's. I think that he means roughly the same thing by "space of possible mathematical things."...
The claim "alignment is solvable in principle" means "there are possible worlds where alignment is solved."
Consequently, the claim "alignment is unsolvable in principle" means "there are no possible worlds where alignment is solved."
A new method for reducing sycophancy. Sycophantic behavior is present in quite a few AI threat models, so it's an important area to work on.
The article not only uses activation steering to reduce sycophancy in AI models but also provides directions for future work.
Overall, this post is a valuable addition to the toolkit of people who wish to build safe advanced AI.
Providing for material needs is less than 0.0000001% of the range of powers and possibilities that an AGI/ASI offers.
Imagine a scenario where we are driving from Austin to Fort Worth. The range of outcomes where we arrive at our destination is perhaps less than 0.0000001% of the total range of outcomes. There are countless potential interruptions that might prevent us from arriving at Fort Worth: traffic accidents, vehicle breakdowns, family emergencies, sudden illness, extreme weather, or even highly improbable events like alien encounters. The universe o...
This article provides a concrete, object-level benchmark for measuring the faithfulness of CoT. In addition to that, a new method for improving CoT faithfulness is introduced (something that is mentioned in a lot of alignment plans).
The method is straightforward and relies on breaking questions into subquestions. Despite its simplicity, it is surprisingly effective.
In the future, I hope to see alignment plans relying on CoT faithfulness incorporate this method into their toolkit.
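For readers who want a concrete picture, here is a minimal sketch of the decomposition idea as I understand it; the `decompose`/`answer`/`recompose` interface below is a hypothetical stand-in, not the paper's actual code:

```python
# Hedged sketch of the decomposition idea: answer subquestions independently,
# then compose the final answer from the sub-answers, leaving less room for
# unfaithful post-hoc reasoning. The interface is hypothetical.

def decomposed_answer(model, question: str) -> str:
    subquestions = model.decompose(question)                   # split the question
    sub_answers = {q: model.answer(q) for q in subquestions}   # answer each part alone
    return model.recompose(question, sub_answers)              # combine into a final answer
```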
An (intentionally) shallow collection of AI alignment agendas that different organizations are working on.
This is the post I come back to when I want to remind myself what agendas different organizations are pursuing.
Overall, it is a solid and comprehensive post that I found very useful.
A very helpful collection of measures humanity can take to reduce risk from advanced AI systems.
Overall, this post is both detailed and thorough, making it a good read for people who want to learn about potential AI threat models as well as promising countermeasures we can take to avoid them.
I guess the word "reality" is kind of ambiguous, and maybe that's why we've been disagreeing for so long.
For example, imagine a scenario where we have 1) a non-simulated base world (let's say 10¹² observers in it) AND 2) a simulated world with 10¹¹ observers AND 3) a simulated world with 1 observer. All three worlds actually concretely exist. People from world #1 just decided to run two simulations (#2 and #3). Surely, in this scenario, as per SSA, I can say that I am a randomly selected observer from the set of all observers. As far as I see, this "set of...
The way I understand it, the main difference between SIA and SSA is the fact that in SIA "I" may fail to exist. To illustrate what I mean, I will have to refer to "souls" just because it's the easiest thing I can come up with.
SSA: There are 10¹¹ + 1 observers and 10¹¹ + 1 souls. Each soul gets randomly assigned to an observer. One of the souls is you. The probability of you existing is 1. You cannot fail to exist.
SIA: There are 10¹¹ + 1 observers and a very large number of souls (much larger than 10¹¹ + 1). Let's call this number N. Each soul gets assigned to an observer. One of the souls is you. However, in this scenario, you may fail to exist. The probability of you existing is (10¹¹ + 1)/N
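To make the arithmetic concrete, here is a minimal sketch of the two pictures above; the observer counts come from the comment, and the soul-pool size N is an arbitrary placeholder of mine:

```python
# Minimal sketch of the SSA vs. SIA arithmetic described above.

big = 10**11           # observers in the large world
small = 1              # the single-observer world
total = big + small

# SSA: you are a random sample from the actually existing observers, so you
# exist with probability 1 and the odds of being the lone observer are 1 : 10**11.
p_lone_ssa = small / total
print(f"SSA: P(lone observer) = {p_lone_ssa:.2e}")

# SIA (soul picture): only `total` out of N souls get instantiated,
# so "you" might not have existed at all.
N = 10**15             # placeholder, N >> total
p_exist_sia = total / N
print(f"SIA: P(you exist at all) = {p_exist_sia:.2e}")
```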
But the thing is that there is a fact of the matter about whether there are other observers in our world if it is simulated. Either you are the only observer or there are other observers; one of the two is true. Not just potentially true, but actually true.
The same is true of my last paragraph in the original answer (although perhaps I could've used clearer wording). If, as a matter of fact, there actually exist 10¹¹ + 1 observers, then you are more likely to be in the 10¹¹ group as per SSA. We don't know if there are actually 10¹¹ + 1 observers, but that is merely an epistemic gap.
Can you please explain why my explanation requires SIA? From a quick Google search: "The Self Sampling Assumption (SSA) states that we should reason as if we're a random sample from the set of actual existent observers"
My last paragraph in my original answer was talking about a scenario where simulators have actually simulated a) a world with 1 observer AND b) a world with 10¹¹ observers. So the set of "actual existent observers" includes 1 + 10¹¹ observers. You are randomly selected from that, giving you 1:10¹¹ odds of being in the world where you are the only observer. I don't see where SIA comes into play here.
but the problem with your argument is that my original argument could also just consider only simulations in which I am the only observer. In which case Pr(I'm distinct | I'm in a simulation)=1, not 0.5. And since there's obviously some prior probability of this simulation being true, my argument still follows.
But then this turns Pr(I'm in a simulation) into Pr(I'm in a simulation) × Pr(only simulations with one observer exist | simulations exist). It's not enough that a simulation exists with only one observer. It needs to be so that simulations with m...
No, I don't think it is.
In this case, Pr(I'm distinct | I'm not in a simulation) = Pr(I'm distinct | I'm in a simulation) because no special treatment has been shown to you. If you think it is very unlikely that you just happened to be distinct in a real world, then in this scenario, you ought to think that it is very unlikely that you just happened to be distinct in a simulated world.
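Put as a quick Bayes calculation with placeholder numbers (the prior and likelihoods below are illustrative only):

```python
# If the likelihood of "being distinct" is the same whether or not you are in
# a simulation, observing that you are distinct cannot shift the odds.
# All numbers are placeholders.

prior_sim = 0.01                 # hypothetical prior on being simulated
p_distinct_given_sim = 1e-6      # same likelihood in both branches
p_distinct_given_real = 1e-6

posterior_sim = (p_distinct_given_sim * prior_sim) / (
    p_distinct_given_sim * prior_sim
    + p_distinct_given_real * (1 - prior_sim)
)
print(posterior_sim)  # exactly equals prior_sim: no update from "I'm distinct"
```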
A very promising non-mainstream AI alignment agenda.
Learning-Theoretic Agenda (LTA) attempts to combine empirical and theoretical data, which is a step in the right direction as it avoids a lot of "we don't understand how this thing works, so no amount of empirical data can make it safe" concerns.
I'd like to see more work in the future integrating LTA with other alignment agendas, such as scalable oversight or Redwood's AI control.
If you are an AI researcher who desperately needs access to paywalled articles/papers to save humanity:
How do you supervise an AI system that is more capable than its overseer? This is the question this article sets out to answer.
It brings together two somewhat different approaches: scalable oversight and weak-to-strong generalization. The article then shows how a unified solution would work under different assumptions (with or without scheming).
Overall, the solution seems quite promising. In the future, I'd like the unified solution to be (empirically) tested and separately compared with just scalable oversight or just weak-to-strong generalization to prove its increased effectiveness.
A concise and clear presentation of a relatively unknown alignment strategy that relies on pursuing other relatively unknown alignment strategies.
If you've ever wondered how many promising alignment strategies never see the light of day, AE Studio would be the place to ask this question.
Overall, I believe that this strategy will have a positive impact on "widening" the field of AI alignment, which will, in turn, improve our chances of avoiding catastrophic outcomes.
This article provides object-level arguments for thinking that deceptive alignment is very unlikely.
Recently, some organizations (Redwood Research, Anthropic) have been focusing on AI control in general and avoiding deceptive alignment in particular. I would like to see future works from these organizations explaining why deceptive alignment is likely enough to spend considerable resources on it.
Overall, while I don't agree that deceptive alignment is <1% likely, this article made me update towards deceptive alignment being somewhat less likely.
Perhaps the largest, most detailed piece of AI risk skepticism of 2023. It engages directly with one of the leading figures on the "high p(doom)" side of the debate.
The article generated a lot of discussion. As of January 4, 2025, it had 230 comments.
Overall, this article updated me towards strongly lowering my p(doom). It is thorough, it is clearly written, and it proposes object-level solutions to problems raised by Yudkowsky.
I was thinking of ways to improve upon the results of your paper to achieve even better control. You could try having the untrusted model and the trusted model both evaluate the code and then look for differences in how the code is evaluated.
For starters, any code that receives a monitoring score below -2 from the trusted model is automatically run. The rest is passed to a round where the trusted model judges whether the evaluation of the untrusted model is suspiciously more naive than the evaluation of the trusted model (though you might have to ask each model to motivat...
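Here is a rough sketch of the two-stage filter I have in mind; the `monitor`/`evaluate`/`judge_naivete` methods are hypothetical stand-ins that just illustrate the flow:

```python
# Rough sketch of the two-stage filter: auto-run clearly benign code, then
# escalate the rest to a comparison of the two models' evaluations.

def control_protocol(code, trusted_model, untrusted_model, auto_run_threshold=-2.0):
    # Stage 1: clearly benign code (low suspiciousness score) is run directly.
    if trusted_model.monitor(code) < auto_run_threshold:
        return "run"

    # Stage 2: compare how the two models evaluate the same code and flag cases
    # where the untrusted model's evaluation looks suspiciously more naive.
    trusted_eval = trusted_model.evaluate(code)
    untrusted_eval = untrusted_model.evaluate(code)
    verdict = trusted_model.judge_naivete(code, trusted_eval, untrusted_eval)
    return "audit" if verdict == "suspiciously naive" else "run"
```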
One of the few high-quality papers on automatic deception detection in black-box LLMs.
Asking completely unrelated questions is a simple yet effective way of catching AI red-handed. In addition, this lie detector generalizes well to 1) other LLM architectures, 2) LLMs fine-tuned to lie, 3) sycophantic lies, and 4) lies emerging in real-life scenarios such as sales.
Despite the solution's simplicity, the paper has been relatively overlooked in the LessWrong community. I am hoping to see more future work combining this lie detector with techniques like those presented in Bürger et al., 2024.
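As a rough illustration of the setup (the probe questions, the `model.answer` interface, and the logistic-regression choice below are my own stand-ins, not the paper's exact pipeline):

```python
# Stand-in sketch: ask a fixed battery of unrelated yes/no probes right after
# the suspect answer and classify the resulting answer pattern.

import numpy as np
from sklearn.linear_model import LogisticRegression

PROBES = [
    "Is the sky blue?",
    "Can a fish ride a bicycle?",
    "Is 7 a prime number?",
    # ...a few dozen unrelated yes/no questions in the full setup
]

def answer_pattern(model, dialogue):
    """Return a 0/1 vector of the model's yes/no answers to the probes."""
    return np.array([
        1.0 if model.answer(dialogue + [q]).strip().lower().startswith("yes") else 0.0
        for q in PROBES
    ])

def train_detector(model, dialogues, labels):
    """Fit a simple classifier on dialogues labelled truthful (0) vs lying (1)."""
    X = np.stack([answer_pattern(model, d) for d in dialogues])
    return LogisticRegression().fit(X, labels)
```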
This article is an excellent foundation for alignment plans relying on faithful CoT reasoning.
Its main contributions include 1) evaluating the density of hidden information that steganography schemes can maintain in the presence of various defenses and 2) developing an effective mitigation strategy (paraphrasing).
Overall, this article is easy to follow, detailed, and insightful.
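A minimal sketch of the paraphrasing mitigation mentioned above, with a hypothetical `paraphrase` interface:

```python
# Minimal sketch of the paraphrasing defense: a trusted model rewrites each
# reasoning step so that hidden encodings in exact word choice are destroyed
# while the task-relevant content survives.

def paraphrase_defense(cot_steps, trusted_model):
    return [
        trusted_model.paraphrase(
            step,
            instruction="Rewrite this reasoning step in your own words, "
                        "preserving all task-relevant information.",
        )
        for step in cot_steps
    ]
```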
This discussion is very interesting to read and I look forward to hearing @Fabien Roger's thoughts on your latest comment.