All of Satron's Comments + Replies

Satron10

This discussion is very interesting to read and I look forward to hearing @Fabien Roger's thoughts on your latest comment.

Satron10

I am not from the US, so I don't know anything about the organizations that you have listed. However, we can look at three main conventional sources of existential risk (excluding AI safety for now; we will come back to it later):

  • Nuclear Warfare - Cooperation strategies + decision theory are active academic fields.
  • Climate Change - This is a very hot topic right now, and a lot of research is being put into it.
  • Pandemics - There was quite a bit of research before COVID, but even more now.

As to your point about Hassabis not doing other projects for reducing e... (read more)

SatronΩ010

Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight.

It’s fine to talk about that scenario, and I hope it comes to pass! But in addition to the question of what those wise people should do, if they exist, we should also be concerned about the possibility that the people with instruction-following

... (read more)
3Steven Byrnes
I don’t think the average person would be asking AI what are the best solutions for preventing existential risks. As evidence, just look around: There are already people with lots of money and smart human research assistants. How many of those people are asking those smart human research assistants for solutions to prevent existential risks? Approximately zero. Here’s another: The USA NSF and NIH are funding many of the best scientists in the world. Are they asking those scientists for solutions to prevent existential risk? Nope. Demis Hassabis is the boss of a bunch of world-leading AI experts, with an ability to ask them to do almost arbitrary science projects. Is he asking them to do science projects that reduce existential risk? Well, there’s a DeepMind AI alignment group, which is great, but other than that, basically no. Instead he’s asking his employees to cure diseases (cf Isomorphic Labs), and to optimize chips, and do cool demos, and most of all to make lots of money for Alphabet. You think Sam Altman would tell his future powerful AIs to spend their cycles solving x-risk instead of making money or curing cancer? If so, how do you explain everything that he’s been saying and doing for the past few years? How about Mark Zuckerberg and Yann LeCun? How about random mid-level employees in OpenAI? I am skeptical. Also, even if the person asked the AI that question, then the AI would (we’re presuming) respond: “preventing existential risks is very hard and fraught, but hey, what if I do a global mass persuasion campaign…”. And then I expect the person would reply “wtf no, don’t you dare, I’ve seen what happens in sci-fi movies when people say yes to those kinds of proposals.” And then the AI would say “Well I could try something much more low-key and norm-following but it probably won’t work”, and the person would say “Yeah do that, we’ll hope for the best.” (More such examples in §1 here.)
Satron43

Not OP, but for me, it comes down to LLMs correctly interpreting the intent behind my questions/requests. In other words, I don't need to be hyper-specific in my prompts in order to get the results I want.

1Kajus
That makes sense and is pretty obvious. Why isn't Claude getting me, though, when he is getting other people? It's hard for me to even explain to Claude what kind of code he should write. Is it just a skill issue? Can someone teach me how to prompt Claude?
Satron30

By value alignment, which one of the following do you mean?

1) alignment with humanity's CEV 

2) alignment with some other specified way of interpreting humanity's values

3) alignment with some unspecified way of interpreting humanity's values that future AGI/ASI will hopefully come up with

3Seth Herd
All of those. Value alignment is a set of all of the different proposed methods of giving AGI values that align with humanity's values.
Satron10

This article seems to have found a big weakness of untrusted monitoring:

Synthetic inputs auditing is trivially defeated if the checker AI can distinguish synthetic inputs from real ones, as it could then always answer truthfully on the synthetic inputs to pass auditing, then collude with the generator on real inputs.

To create useful synthetic inputs, either the blue team must be able to match the generator's style well enough to consistently fool the checker, or it must be able to redact the generator's output to the point where the checker has no identify

... (read more)
7Fabien Roger
I left a comment!
Answer by Satron11

Maybe it is a controversial take, but I am in favor of publishing all solutions to corrigibility.

If a company fails to come up with solutions for AGI corrigibility, it won't stop building AGI. Instead, it will proceed with ramping up capabilities and end up with a misaligned AGI that (for all we know) will want to turn the universe into paperclips.

Due to instrumental convergence, an AGI whose goals are not explicitly aligned with human goals will engage in very undesirable behavior by default.

Satron10

I think that, perhaps, after reading this clarifying comment from you, @quila would change his perception of your position.

Satron10

Let me put my attitudes in practical terms: I don't kick dogs, but I have destroyed a wasp's nest in my garage, and I don't donate to animal charities. (I don't donate to many other charities either, but there have been a few.) Let those who think saving the shrimps is worthwhile do so, but I do not travel on that path

This is what I expected. Your take, when put in these terms, seems pretty moderate, whereas when I read your original comment, this take (which presumably stayed the same) seemed very extreme.

In other words, my personal beliefs haven't changed... (read more)

Satron10

Agreed!

There is also the issue of clarity: I am not sure whether Richard has a moderate position that merely sounds very extreme due to the framing, or whether he genuinely holds this extreme position.

I do think that animals, the larger ones at least, can suffer. But I don’t much care.

Does this mean a more moderate take of "I don't care enough to take any actions, because I don't believe that I am personally causing much suffering to animals" or a very radical take of "I would rather take $10 than substantially improve the well-being of all animals"?

2Richard_Kennaway
What seems radical depends on where one stands. We each of us stand on our own beliefs, and the further away one looks, the more the beliefs over there differ from one's own. Look sufficiently far and everything you see in the distance will seem extreme and radical. Hence the fallacy that truth lies between extremes, instead of recognising the tautology that one's own beliefs always lie between those that are extremely different. Let me put my attitudes in practical terms: I don't kick dogs, but I have destroyed a wasp's nest in my garage, and I don't donate to animal charities. (I don't donate to many other charities either, but there have been a few.) Let those who think saving the shrimps is worthwhile do so, but I do not travel on that path
Satron10

I think there might be some double standard going on here.

You seem to not care much about animal well-being, and @quila evidently does, so would it not be fair from Quila's perspective to not care much about your well-being? And if Quila doesn't care about your well-being, then he might think that had you not existed, the world (in a utilitarian sense) would be better.

Quila can similarly say "I don't care for the lives of people who are actively destroying something that I value a lot. Bring on the death threats!"

2Richard_Kennaway
I have no daydreams about quila, and others of like mind, not existing. Not even about Ziz.
3[anonymous]
(It's not actually the case that I don't value their well-being; I don't want them to suffer and if they were tortured, imagining that would make me sad; I'd prefer beings who don't care about some subset of other beings / are value-orthogonal to just be prevented from hurting them. I just think Richard, in the current world, probably causes more tragedy, based on the comment, so yes I think the current world would be better if it did not have any such people.) Agreed about the double standard part, that's something I was hoping to highlight.
Satron10

I will also reply to the edit here.

would you say that "we have qualia" is just a strong intuition imbued in us by the mathematical process of evolution, and your "realist" position is that you choose to accept {referring to it as if it is real} as a good way of communicating about that imbued intuition/process?

I wouldn't say that about qualia and I wouldn't say that about free will. What you described sounds like being only a nominal realist. 

This is going into details, but I think there is a difference between semantic realism and nominal realism. Se... (read more)

Satron10

By the way, I will quickly reply to your edit in this comment. Determinism certainly seems compatible with qualia being fundamental.

"Hard Determinism" is just a shorthand for a position (in a free will debate) that accepts the following claims:

  • Determinism is true.
  • Determinism is incompatible with free will.

I am pretty sure that the term has no use outside of discussions about free will.

Satron10

I will try to reply to your edit:

Also, another thing I'm interested in is how someone could have helped past you (and camp #1 people generally) understand what camp #2 even means. There may be a formal-alignment failure mode not noticeable to people in camp #1 where a purely mathematical (it may help to read that as '3rd-person')

I don't remember the exact specifics, but I came across the Mary's Room thought experiment (perhaps through this video). When presented in that way, and when directly asked "does she learn anything new?", my surprising (to myself at the time) answer was an emphatic "yes".

Satron10

what is qualia, in your ontology

It is the direct phenomenal experience of stuff like pain, pleasure, colors, etc.

what does it mean for something to be irreducible or metaphysically fundamental

Something is irreducible in the sense that it can't be reduced to interactions between atoms. It also can't be completely described from a 3rd-person perspective (the perspective from which science usually operates).

2[anonymous]
Thanks! (I added a bunch of text to my comment while you were writing, also.)
Satron20

I understand the 'compatibilist' view on free will to be something like this: determinism is true, but we still run some algorithm which deterministically picks between options, so the algorithm that we are is still 'making a choice', which I choose to conceptualize as "free will and determinism are compatible"

I would rather conceptualize it as acting according to my value system without excessive external coercion. But I would also say that I agree with all four claims that you made there, namely:

  • Determinism is true
  • We run some algorithm which deterministi
... (read more)
2[anonymous]
Okay, interesting. If you want to give further confirmation, I'd ask "what is qualia, in your ontology", and "what does it mean for something to be irreducible or metaphysically fundamental?". Also, another thing I'm interested in is how someone could have helped past you (and camp #1 people generally) understand what camp #2 even means. There may be a formal-alignment failure mode not noticeable to people in camp #1 where a purely mathematical[1] prior does not contain any hypothesis corresponding to reality, analogous to if you used a prior only containing hypotheses capable of being expressed in some restricted formal system while receiving input generated by a mathematical world only expressible in some less restricted formal system. It sounds to me like you previously experienced qualia, but explained it away as "reducible" - but even if it could be reduced/decomposed (which maybe it can be), it would still be 'something which is metaphysically fundamental and not-pure-math', right? (I'm guessing past you would have... disagreed, or maybe have not thought about it in this way, kind of like hypothesis (A)). If that was the crux, how could it have been pointed out to you at the time? 1. ^ (per your next comment, it may help to read this word as '3rd-person')
Satron10

one who believes that those in camp #2 merely have a strong intuition that they have <mysterious-word>

Under this definition I think I would still qualify as a former illusionist about qualia.

Satron20

Ah, I see. I usually think of illusionism as a position where one thinks that a certain phenomenon isn't real but only appears real.

Under the terminology of your most recent comment, I would be a determinist and a realist about free will.

1[anonymous]
I understand the 'compatibilist' view on free will to be something like this: If so, is your view on qualia similar, in that you've mostly re-conceptualized it instead of having "noticed a metaphysically fundamental thing"? I.e., your beliefs about the underlying reality haven't changed? Is qualia reducible to a kind of mathematical process in the way everything else seems to be, in your view? Another way of asking this: would you say that "we have qualia" is just a strong intuition imbued in us by the mathematical process of evolution, and your "realist" position is that you choose to accept {referring to it as if it is real} as a good way of communicating about that imbued intuition/process? If so, then I'd guess you're not actually the kind of person I meant. These words in your original comment also confused me at the time but are explainable under this: "Nothing really changed, besides me getting a deeper understanding of the position of the other camp"; if you were the kind of person I meant, the core thing which changed wouldn't have been "understanding their view better" but "noticing there's actually a metaphysically-fundamental kind-of-thing which is not a math object". But I'm not confident because of communication about this being hard. (Another way of asking would be to ask you to explain what 'qualia' is, in your ontology)
Satron10

I used to be a hard determinist (I think that's what people usually mean by "illusionism"), but currently I am a compatibilist.

So yeah, right now I am not an illusionist about free will either.

1[anonymous]
That's not actually what I mean by illusionism (although under certain metaphysical views it could be related). I tried to define it in my first comment. Determinism seems compatible with qualia being a metaphysically fundamental thing. (Edit: I replaced some text while you were replying. My original reply mentioned that I'm an anti-realist about free will and morality as an example)
Satron130

Has there ever been a case of someone being in camp #1, but eventually realizing, according to their self-report, "I actually do have qualia; it was right in front of me this whole time, but I've only now noticed it as a separate thing needing explanation, even though in retrospect it seems obvious to me"?

This is almost exactly my story. I used to think that qualia was just another example of anti-physicalist magical thinking.

Currently, I am actually pretty firmly in Camp #2

illusionists actually do not experience qualia

As someone who used to be (but no lon... (read more)

3FlorianH
I once had an epiphany that pushed me from fully in Camp #2 intellectually rather strongly towards Camp #1. I hadn't heard about illusionism before, so it was quite a thing. Since then, I've devised probably dozens of inner thought experiments/arguments that imho +- proof Camp #1 to be onto something, and that support the hypothesis that qualia can be a bit less special than we make them to be despite how impossible that may seem. So I'm intellectually quite invested in Camp #1 view. Meanwhile, my experience has definitely not changed, my day-to-day me is exactly what it always was, so in that sense definitely "experience" qualia just like anyone. Moreover, it is just as hard as ever before to take my intellectual belief that our 'qualia' might be a bit less absolutely special than we make it to be, seriously in day-to-day life. I.e. emotionally, I'm still +- 100% in Camp #2, and I guess I might be in a rather similar situation
2[anonymous]
Do you mean you're not an illusionist about free will either?
Satron10

It was pointed out, e.g. [here](https://web.archive.org/web/20170918044233/http://files.openphilanthropy.org/files/Grants/MIRI/consolidated_public_reviews.pdf) but unfortunately the mindset is self-reinforcing so there wasn't much to be done.

Your link formatting got messed up here.

Satron50

More generally when you're asked a question like "How much do you expect [politician's name] to win/lose by?" how do you go about generating your response? How do you generate your predictions?

My usual response would look something like this:

"Pre-scandal polls are showing Bodee leading by about 1-2%, but given the recent spying scandal and the strong emphasis of our society on ethical conduct, I expect him to lose by 3-4%."

If I am speaking to a person who (on my expectations) is following the polls, I would try to explain why my prediction differs from them.

Satron10

But you seem to mean something broader with "possible worlds." Something like "in theory, there is a physically possible arrangement of atoms/energy states that would result in an 'aligned' AGI, even if that arrangement of states might not be reachable from our current or even a past world." 

–> Am I interpreting you correctly?

Yup, that's roughly what I meant. However, one caveat would be that I would change "physically possible" to "metaphysically/logically possible" because I don't know if worlds with different physics could exist, whereas I am pr... (read more)

Satron90

Listeners would understandably infer that "these well-informed people apparently don't really worry much, maybe I shouldn't worry much either".

I think this is countered to a great extent by all the well-informed people who worry a lot about AI risk. I think the "well-informed people apparently disagree on this topic, I better look into it myself" environment promotes inquiry and is generally good for truth-seeking.

More generally, I agree with @Neel Nanda, it seems somewhat doubtful that people listening to a very niche Epoch Podcast aren't aware of all the smart people worried about AI risk.

Satron11

Future humans in specific, will at least one of: [ die off early, run lots of high-fidelity simulations of our universe's history ["ancestor-simulations"], decide not to run such simulations ]

Is there any specific reason the first option is "die off early" and not "be unable to run lots of high-fidelity simulations"? The latter encompasses the former as well as scenarios where future humans survive but, for one reason or another, can't run these simulations.

A more general argument, in my opinion, would look like this: 

"Future humans will at leas... (read more)

1Lorec
Yes, I think that's a validly equivalent and more general classification. Although I'd reflect that "survive but lack the power or will to run lots of ancestor-simulations" didn't seem like a plausible-enough future to promote it to consideration, back in the '00s.
Satron30

Here is Richard Carrier's response to avturchin's comment (for the courtesy of those reading this thread later):

These do not respond to my article. Just re-stating what was refuted (that most civilizations will be irredeemably evil and wasteful) is not a rebuttal. It’s simply ignoring the argument.

2avturchin
Satron10

I suggest sending this as a comment under his article if you haven't already. I am similarly interested in his response.

3Satron
Here is Richard Carrier's response to avturchin's comment (for the courtesy of those reading this thread later):
2avturchin
I did. It is under moderation now. 
Answer by Satron73

Richard Carrier has written an entire rebuttal of the argument on his blog. He always answers comments under his posts, so if you disagree with some part of the rebuttal, you can just leave a comment there explaining which claim makes no sense to you. Then, he will usually defend the claim in question or provide the necessary clarification.

4avturchin
It looks like he argues against the idea that friendly future AIs will simulate the past based on ethical grounds, and imagining unfriendly AI torturing past simulations is conspiracy theory. I comment the following: There are a couple of situations where future advance civilization will want to have many past simulation: 1. Resurrection simulation by Friendly AI.  They simulate the whole history of the earth incorporating all known data to return to live all people ever lived. It can also simulate a lot of simulation to win "measure war" against unfriendly AI and even to cure suffering of people who lived in the past.   2. Any Unfriendly AI will be interested to solve Fermi paradox, and thus will simulate many possible civilizations around a time of global catastrophic risks (the time we live). Interesting thing here is that we can be not ancestry simulation in that case.   
Satron*40

By "possible worlds," I mean all worlds that are consistent with laws of logic, such as the law of non-contradiction.

For example, it might be the case that, for some reason, alignment would have been solved if and only if Abraham Lincoln hadn't been assassinated in 1865. That means that humans in 2024 in our world (where Lincoln was assassinated in 1865) will not be able to solve alignment, despite it being solvable in principle.

My answer is kind of similar to @quila's. I think that he means roughly the same thing by "space of possible mathematical things."... (read more)

2Remmelt
With this example, you might still assert that "possible worlds" are world states reachable through physics from past states of the world. Ie. you could still assert that alignment possibility is path-dependent from historical world states. But you seem to mean something broader with "possible worlds". Something like "in theory, there is a physically possible arrangement of atoms/energy states that would result in an 'aligned' AGI, even if that arrangement of states might not be reachable from our current or even a past world".  –> Am I interpreting you correctly?   You saying this shows the ambiguity here of trying to understand what different people mean. One researcher can make a technical claim about the possibility/tractability of "alignment" that is similarly worded to a technical claim others made. Yet their meaning of "alignment" could be quite different.  It's hard then to have a well-argued discussion, because you don't know whether people are equivocating (ie. switching between different meanings of the term).   That's a good summary list! I like the inclusion of "long-term outcomes" in P6. In contrast, P4 could just entail short-term problems that were specified by a designer or user who did not give much thought to long-term repercussions. The way I deal with the wildly varying uses of the term "alignment" is to use a minimum definition that most of those six interpretations are consistent with. Where (almost) everyone would agree that AGI not meeting that definition would be clearly unaligned. * Alignment is at the minimum the control of the AGI's components (as modified over time) to not (with probability above some guaranteeable high floor) propagate effects that cause the extinction of humans.  
Answer by Satron*62

The claim "alignment is solvable in principle" means "there are possible worlds where alignment is solved."

Consequently, the claim "alignment is unsolvable in principle" means "there are no possible worlds where alignment is solved."

1Remmelt
Thanks! With ‘possible worlds’, do you mean ‘possible to be reached from our current world state’? And what do you mean with ‘alignment’? I know that can sound like an unnecessary question. But if it’s not specified, how can people soundly assess whether it is technically solvable?

A new method for reducing sycophancy. Sycophantic behavior is present in quite a few AI threat models, so it's an important area to work on.

The article not only uses activation steering to reduce sycophancy in AI models but also provides directions for future work.

Overall, this post is a valuable addition to the toolkit of people who wish to build safe advanced AI.

Satron*51

Providing for material needs is less than 0.0000001% of the range of powers and possibilities that an AGI/ASI offers.

Imagine a scenario where we are driving from Austin to Fort Worth. The range of outcomes where we arrive at our destination is perhaps less than 0.0000001% of the total range of outcomes. There are countless potential interruptions that might prevent us from arriving at Fort Worth: traffic accidents, vehicle breakdowns, family emergencies, sudden illness, extreme weather, or even highly improbable events like alien encounters. The universe o... (read more)

This article provides a concrete, object-level benchmark for measuring the faithfulness of CoT. In addition to that, a new method for improving CoT faithfulness is introduced (something that is mentioned in a lot of alignment plans).

The method is straightforward and relies on breaking questions into subquestions. Despite its simplicity, it is surprisingly effective.

In the future, I hope to see alignment plans relying on CoT faithfulness incorporate this method into their toolkit.

An (intentionally) shallow collection of AI alignment agendas that different organizations are working on.

This is the post I come back to when I want to remind myself what agendas different organizations are pursuing.

Overall, it is a solid and comprehensive post that I found very useful.

A very helpful collection of measures humanity can take to reduce risk from advanced AI systems.

Overall, this post is both detailed and thorough, making it a good read for people who want to learn about potential AI threat models as well as promising countermeasures we can take to avoid them.

Satron*10

I guess the word "reality" is kind of ambiguous, and maybe that's why we've been disagreeing for so long.

For example, imagine a scenario where we have 1) a non-simulated base world (let's say 10¹² observers in it) AND 2) a simulated world with 10¹¹ observers AND 3) a simulated world with 1 observer. All three worlds actually concretely exist. People from world #1 just decided to run two simulations (#2 and #3). Surely, in this scenario, as per SSA, I can say that I am a randomly selected observer from the set of all observers. As far as I see, this "set of... (read more)

Satron10

The way I understand it, the main difference between SIA and SSA is the fact that in SIA "I" may fail to exist. To illustrate what I mean, I will have to refer to "souls" just because it's the easiest thing I can come up with.

SSA: There are 10¹¹ + 1 observers and 10¹¹ + 1 souls. Each soul gets randomly assigned to an observer. One of the souls is you. The probability of you existing is 1. You cannot fail to exist.

SIA: There are 10¹¹ + 1 observers and a very large (much larger than 10¹¹ + 1) number of souls. Let's call this number N. Each soul gets assigned to an observer. One of the souls is you. However, in this scenario, you may fail to exist. The probability of you existing is (10¹¹ + 1)/N.
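
Here is a minimal sketch, in Python, of how the two assumptions treat the same two hypotheses. The numbers (a 50/50 prior and the size of the "soul" pool) are hypothetical and only meant to show the direction of the update:

```python
# Hypothetical setup: world A has 1 observer, world B has 10**11 observers,
# and each world starts with a 50/50 prior. N_SOULS is an arbitrary pool size
# used only to illustrate the SIA picture described above.
N_SOULS = 10**15

worlds = {"A": 1, "B": 10**11}          # observer counts per hypothesis
prior = {"A": 0.5, "B": 0.5}

# SSA: "you" are guaranteed to exist in whichever world is actual,
# so conditioning on "I exist" leaves the prior untouched.
ssa_posterior = dict(prior)

# SIA: the chance that "your" soul gets assigned to an observer scales with
# the number of observers, so more populous worlds get boosted proportionally.
weights = {w: prior[w] * (n / N_SOULS) for w, n in worlds.items()}
total = sum(weights.values())
sia_posterior = {w: weight / total for w, weight in weights.items()}

print(ssa_posterior)   # {'A': 0.5, 'B': 0.5} -- no update from existing
print(sia_posterior)   # B ~= 1.0, A ~= 1e-11 -- big worlds dominate
```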

1AynonymousPrsn123
This is an interesting observation which may well be true, I'm not sure, but the more intuitive difference is that SSA is about actually existing observers, while SIA is about potentially existing observers. In other words, if you are reasoning about possible realities in the so-called "multiverse of possibilities," than you are using SIA. Whereas if you are only considering a single reality (e.g., the non-simulated world), you select a reference class from that reality (e.g., humans), you may choose to use use SSA to say that you are a random observer from that class (e.g., a random human in human history).
Satron10

But the thing is, there is a fact of the matter about whether there are other observers in our world if it is simulated. Either you are the only observer or there are other observers, and one of those is true. Not just potentially true, but actually true.

The same is true of my last paragraph in the original answer (although perhaps I could've used clearer wording). If, as a matter of fact, there actually exist 10¹¹ + 1 observers, then you are more likely to be in the 10¹¹ group as per SSA. We don't know if there are actually 10¹¹ + 1 observers, but that is merely an epistemic gap.

1AynonymousPrsn123
You are describing the SIA assumption to a T.
Satron10

Can you please explain why my explanation requires SIA? From a quick Google search: "The Self Sampling Assumption (SSA) states that we should reason as if we're a random sample from the set of actual existent observers"

My last paragraph in my original answer was talking about a scenario where simulators have actually simulated a) a world with 1 observer AND b) a world with 10¹¹ observers. So the set of "actual existent observers" includes 1 + 10¹¹ observers. You are randomly selected from that set, giving you 1:10¹¹ odds of being in the world where you are the only observer. I don't see where SIA comes into play here.

1AynonymousPrsn123
This is what I was thinking: If simulations exist, we are choosing between two potentially existing scenarios, either I'm the only real person in my simulation, or there are other real people in my simulation. Your argument prioritizes the latter scenario because it contains more observers, but these are potentially existing observers, not actual observers. SIA is for potentially existing observers. I have a kind of intuition that something like my argument above is right, but tell me if that is unclear. And note: one potential problem with your reasoning is that if we take it to it's logical extreme, it would be 100% certain that we are living in a simulation with infinite invisible observers. Because infinity dominates all the finite possibilities.
Satron10

but the problem with your argument is that my original argument could also just consider only simulations in which I am the only observer. In which case Pr(I'm distinct | I'm in a simulation)=1, not 0.5. And since there's obviously some prior probability of this simulation being true, my argument still follows.

But then this turns Pr(I'm in a simulation) into Pr(I'm in a simulation) × Pr(only simulations with one observer exist | simulations exist). It's not enough that a simulation exists with only one observer. It needs to be so that simulations with m... (read more)
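
Spelled out (this is my paraphrase of the point above, not text from the comment): restricting the argument to single-observer simulations makes the hypothesis a conjunction, so its probability can only shrink:

$$
\Pr(\text{sim with one observer}) \;=\; \Pr(\text{sim}) \cdot \Pr(\text{one observer} \mid \text{sim}) \;\le\; \Pr(\text{sim}).
$$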

1AynonymousPrsn123
I think you are overlooking that your explanation requires BOTH SSA and SIA, but yes, I understand where you are coming from.
Answer by Satron32

No, I don't think it is.

  1. Imagine a scenario in which the people running the simulation decided to simulate every human on Earth as an actual observer.

In this case, Pr(I'm distinct | I'm not in a simulation) = Pr(I'm distinct | I'm in a simulation) because no special treatment has been shown to you. If you think it is very unlikely that you just happened to be distinct in a real world, then in this scenario, you ought to think that it is very unlikely that you just happened to be distinct in a simulated world (see the odds-form sketch below).

  2. I think what you are actually thinking abou
... (read more)
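
The equal-likelihood point in item 1 can also be put in odds form (my paraphrase, not text from the answer):

$$
\frac{\Pr(\text{sim} \mid \text{distinct})}{\Pr(\lnot\text{sim} \mid \text{distinct})}
\;=\;
\frac{\Pr(\text{distinct} \mid \text{sim})}{\Pr(\text{distinct} \mid \lnot\text{sim})}
\cdot
\frac{\Pr(\text{sim})}{\Pr(\lnot\text{sim})},
$$

so when the two likelihoods are equal, the posterior odds equal the prior odds, and observing "I'm distinct" provides no evidence for the simulation hypothesis.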
1AynonymousPrsn123
Other people here have responded in similar ways to you; but the problem with your argument is that my original argument could also just consider only simulations in which I am the only observer. In which case Pr(I'm distinct | I'm in a simulation)=1, not 0.5. And since there's obviously some prior probability of this simulation being true, my argument still follows. I now think my actual error is saying Pr(I'm distinct | I'm not in a simulation)=0.0001, when in reality this probability should be 1, since I am not a random sample of all humans (i.e., SSA is wrong), I am me. Is that clear? Lastly, your final paragraph is akin to the SSA + SIA response to the doomsday paradox, which I don't think is widely accepted since both those assumptions lead to a bunch of paradoxes.

A very promising non-mainstream AI alignment agenda.

Learning-Theoretic Agenda (LTA) attempts to combine empirical and theoretical data, which is a step in the right direction as it avoids a lot of "we don't understand how this thing works, so no amount of empirical data can make it safe" concerns.

I'd like to see more work in the future integrating LTA with other alignment agendas, such as scalable oversight or Redwood's AI control.

Satron70

If you are an AI researcher who desperately needs access to paywalled articles/papers to save humanity:

  • archive.is lets you access a lot of paywalled articles for free (one exception that I know of is Substack)
  • libgen.is lets you access a lot of paywalled academic papers for free (it has a much bigger library than SciHub and doesn't require DOI)

How do you supervise an AI system that is more capable than its overseer? This is the question this article sets to answer.

It brings together two somewhat different approaches: scalable oversight and weak-to-strong generalization. The article then shows how a unified solution would work under different assumptions (with or without scheming).

Overall, the solution seems quite promising. In the future, I'd like the unified solution to be (empirically) tested and separately compared with just scalable oversight or just weak-to-strong generalization to prove its increased effectiveness.

A concise and clear presentation of a relatively unknown alignment strategy that relies on pursuing other relatively unknown alignment strategies.

If you've ever wondered how many promising alignment strategies never see the light of day, AE Studio would be the place to ask this question.

Overall, I believe that this strategy will have a positive impact on "widening" the field of AI alignment, which will, in turn, improve our chances of avoiding catastrophic outcomes.

This article provides object-level arguments for thinking that deceptive alignment is very unlikely.

Recently, some organizations (Redwood Research, Anthropic) have been focusing on AI control in general and avoiding deceptive alignment in particular. I would like to see future work from these organizations explaining why deceptive alignment is likely enough to warrant spending considerable resources on it.

Overall, while I don't agree that deceptive alignment is <1% likely, this article made me update towards deceptive alignment being somewhat less likely.

4Martin Randall
I think the article is good at arguing that deceptive alignment is unlikely given certain assumptions, but those assumptions may not be accurate and then the conclusion doesn't go through. Eg, the alignment faking paper shows that deceptive alignment is possible in a scenario where the base goal has shifted (from helpful & harmless to helpful-only). This article basically assumes we won't do that. I'm now thinking that this article is more useful if you look at it as a set of instructions rather than a set of assumptions. I don't know whether we will change the base goal of TAI between training episodes. But given this article and the alignment faking paper, I hope we won't. Maybe it would also be a good idea to check for good understanding of the base goal before introducing goal-directedness, for example.

Perhaps the largest, most detailed piece of AI risk skepticism of 2023. It engages directly with one of the leading figures on the "high p(doom)" side of the debate.

The article generated a lot of discussion. As of January 4, 2025, it had 230 comments.

Overall, this article updated me towards strongly lowering my p(doom). It is thorough, it is clearly written, and it proposes object-level solutions to problems raised by Yudkowsky.

Satron*41

I was thinking of ways to improve upon the results of your paper to achieve even better control. You could try having the untrusted model and the trusted model both evaluate the code and then look for differences in how the code is evaluated.

For starters, any code below a -2 monitoring score by the trusted model is automatically run. The rest is passed to a round where the trusted model judges whether the evaluation of the untrusted model is suspiciously more naive than the evaluation of the trusted model (though you might have to ask each model to motivat... (read more)
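
Here is a minimal sketch, in Python, of how I picture the two-stage routing described above; the function bodies are hypothetical placeholders standing in for actual model calls, and the thresholds are illustrative, not anything from the paper:

```python
AUTO_RUN_THRESHOLD = -2.0   # trusted-monitor scores below this are run automatically
AUDIT_THRESHOLD = 0.5       # "naivety gap" scores above this trigger a human audit

def trusted_monitor_score(code: str) -> float:
    """Placeholder: suspicion score the trusted model assigns to the code."""
    return 1.3

def untrusted_evaluation(code: str) -> str:
    """Placeholder: the untrusted model's written evaluation of the code."""
    return "Looks fine to me."

def naivety_gap(code: str, untrusted_eval: str, trusted_score: float) -> float:
    """Placeholder: trusted model rates how suspiciously naive the untrusted
    evaluation looks compared to its own assessment of the same code."""
    return 0.8

def route_code(code: str) -> str:
    score = trusted_monitor_score(code)
    if score < AUTO_RUN_THRESHOLD:
        return "run"                    # clearly benign: execute directly
    gap = naivety_gap(code, untrusted_evaluation(code), score)
    return "audit" if gap > AUDIT_THRESHOLD else "run"

print(route_code("def solve(): ..."))   # -> "audit" with these placeholder scores
```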

One of the few high-quality papers on automatic deception detection in black-box LLMs.

Asking completely unrelated questions is a simple yet effective way of catching AI red-handed. In addition, this lie detector generalizes well to 1) other LLM architectures, 2) LLMs fine-tuned to lie, 3) sycophantic lies, and 4) lies emerging in real-life scenarios such as sales.

Despite the solution's simplicity, the paper has been relatively overlooked in the LessWrong community. I am hoping to see more future work combining this lie detector with techniques like those presented in Bürger et al., 2024.

This article is an excellent foundation for alignment plans relying on faithful CoT reasoning.

Its main contributions include 1) evaluating the density of hidden information that steganography schemes can maintain in the presence of various defenses and 2) developing an effective mitigation strategy (paraphrasing).

Overall, this article is easy to follow, detailed, and insightful.
