Coming back to this comment: we got a few clear examples, and nobody seems to care:
"In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning." - Anthropic, in the Alignment Faking paper.
This time we caught it. Next time, maybe we won't be able to catch it.
Yeah, fair enough. I think someone should try to do a more representative experiment and we could then monitor this metric.
Btw, something that bothers me a little bit with this metric is that a very simple AI that just asks me periodically "Hey, do you endorse what you are doing right now? Are you time boxing? Are you following your plan?" makes me (I think) significantly more strategic and productive. Similar to the "I hired 5 people to sit behind me and make me productive for a month" setup. But this is maybe off topic.
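(As a toy illustration of how simple that "AI" could be, something like the sketch below already captures most of it; the questions and the 25-minute interval are just my own defaults.)

```python
import time

# The check-in questions from the comment above.
CHECK_IN_QUESTIONS = [
    "Do you endorse what you are doing right now?",
    "Are you time boxing?",
    "Are you following your plan?",
]

def check_in_loop(interval_minutes: int = 25) -> None:
    """Periodically prompt the user with the check-in questions."""
    while True:
        time.sleep(interval_minutes * 60)
        for question in CHECK_IN_QUESTIONS:
            print(question)
        input("Press Enter once you've answered honestly... ")
```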
I was saying 2x because I've memorised the results from this study. Do we have better numbers today? R&D is harder, so this is an upper bound. However, this was from a year ago, so perhaps the two factors cancel each other out?
How much faster do you think we are already? I would say 2x.
What do you not fully endorse anymore?
I would be happy to discuss this in a dialogue. This seems to be an important topic, and I'm really unsure about many parameters here.
Tldr: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of the SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have.
First, I believe the post's general motivation of red-teaming a ...
Ok, time to review this post and assess the overall status of the project.
What I still appreciate about the post: I continue to appreciate its pedagogy, structure, and the general philosophy of taking a complex, lesser-known plan and helping it gain broader recognition. I'm still quite satisfied with the construction of the post—it's progressive and clearly distinguishes between what's important and what's not. I remember the first time I met Davidad. He sent me his previous post. I skimmed it for 15 minutes, didn't really understand...
Maybe you have some information that I don't have about the labs and the buy-in? You think this applies to OpenAI and not just Anthropic?
But as far as open source goes, I'm not sure. DeepSeek? Meta? Mistral? xAI? Some big labs are just producing open source stuff. DeepSeek is maybe only 6 months behind. Is that enough headway?
It seems to me that the tipping point for many people (I don't know about you) on open source is whether or not open source is better than closed source, so this is a relative tipping point in terms of capabilities. But I think we sho...
No, AI control doesn't pass the bar, because we're still probably fucked until we have a solution for open source AI or for the race to superintelligence, for example, and OpenAI doesn't seem to be planning to use control, so this looks to me like research that's sort of being done in your garage but ignored by the labs (and that's sad; control is great, I agree).
What do you think of my point about Scott Aaronson? Also, since you agree with points 2 and 3, it seems that you also think that the most useful work from last year didn't require advanced physics, so isn't this a contradiction with you disagreeing with point 1?
I think I do agree with some points in this post. This failure mode is the same as the one I mentioned about why people are doing interpretability, for instance (section Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high), and I do think that this generalizes somewhat to the whole field of alignment. But I'm highly skeptical that recruiting a bunch of physicists to work on alignment would be that productive:
I've been testing this with @Épiphanie Gédéon for a few months now, and it's really, really good for doing more intellectually challenging work. In my opinion, the most important sentence in the post is that it doesn't help that much during peak performance moments, but we're not at our peak that often. And so, it's super important. It's really a big productivity boost, especially when doing cognitively demanding tasks or tasks where we struggle to "eat the frog". So, I highly recommend it.
But the person involved definitely needs to be pre...
I often find myself revisiting this post—it has profoundly shaped my philosophical understanding of numerous concepts. I think the notion of conflationary alliances introduced here is crucial for identifying and disentangling/dissolving many ambiguous terms and resolving philosophical confusion. I think this applies not only to consciousness but also to situational awareness, pain, interpretability, safety, alignment, and intelligence, to name a few.
I referenced this blog post in my own post, My Intellectual Journey to Dis-solve the Hard Problem of Conscio...
I don't think Tournesol is really mature currently, especially for non-French content, and I'm not sure they're trying to do governance work; it's mainly a technical project, which is already cool.
Yup, we should create an equivalent of the Nutri-Score for different recommendation AIs.
"I really don't know how tractable it would be to pressure compagnies" seems weirdly familiar. We already used the same argument for AGI safety, and we know that governance work is much more tractable than expected.
I'm a bit surprised this post has so little karma and engagement. I would be really interested to hear from people who think this is a complete distraction.
Fair enough.
I think my main problem with this proposal is that, under the current paradigm of AIs (GPTs, foundation models), I don't see how you would implement ATA, and so this doesn't seem like a priority?
I believe we should not create a Sovereign AI. Developing a goal-directed agent of this kind will always be too dangerous. Instead, we should aim for a scenario similar to CERN, where powerful AI systems are used for research in secure labs, but not deployed in the economy.
I don't want AIs to take over.
Thank you for this post and study. It's indeed very interesting.
I have two questions:
In what ways is this threat model similar to or different from learned steganography? It seems quite similar to me, but I’m not entirely sure.
If it can be related to steganography, couldn't we apply the same defenses as for steganography, such as paraphrasing, as suggested in this paper? If paraphrasing is a successful defense, we could use it in the control setting, in the lab, although it might be cumbersome to apply paraphrasing for all users in the API.
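To make the paraphrasing idea concrete, here is a minimal sketch of the general defense as I understand it (just an illustration, not the exact setup from the paper; `query_trusted_model` is a hypothetical stand-in for whatever trusted, weaker model the lab would use as a paraphraser):

```python
# Minimal sketch of a paraphrasing defense against steganographic collusion.

def query_trusted_model(prompt: str) -> str:
    """Placeholder for a call to a trusted paraphraser model (assumed, not a real API)."""
    raise NotImplementedError("plug in your own model call")

def paraphrase(text: str) -> str:
    """Ask a trusted model to rewrite the text, preserving task-relevant content
    but scrambling the surface form (word choice, formatting) that a scheming
    model could use as a hidden channel."""
    prompt = (
        "Rewrite the following text so that it keeps exactly the same meaning "
        "and any code/technical content, but uses different wording and formatting:\n\n"
        + text
    )
    return query_trusted_model(prompt)

def monitored_pipeline(untrusted_output: str) -> str:
    # Every message from the untrusted model is paraphrased before it reaches
    # the user or another model instance.
    return paraphrase(untrusted_output)
```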
Interesting! Is it fair to say that this is another attempt at solving a sub-problem of misgeneralization?
Here is one suggestion for clustering your SAE features more automatically between gender and profession (a rough illustrative sketch is included at the end of this comment).
In the past, Stuart Armstrong with Aligned AI also attempted to conduct work aimed at identifying different features within a neural network in such a way that the neural network would generalize better. Here is a summary of a related paper, the DivDis paper, which is very similar to what Aligned AI did:
The DivDis paper presents a simple ...
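For what it's worth, here is a much cruder illustrative sketch of "automatic clustering" than DivDis (purely my own guess at a baseline, not the Aligned AI/DivDis method; it assumes you can get reference directions for gender and profession, e.g. from difference-of-means probes on contrastive prompts):

```python
import numpy as np

def cluster_sae_features(decoder_directions: np.ndarray,
                         gender_direction: np.ndarray,
                         profession_direction: np.ndarray):
    """Assign each SAE feature to the concept direction it is most aligned with.

    decoder_directions: (n_features, d_model) rows of the SAE decoder matrix.
    gender_direction / profession_direction: (d_model,) assumed reference
    directions, e.g. difference of mean activations on contrastive prompts.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

    gender_sim = cos(decoder_directions, gender_direction)
    profession_sim = cos(decoder_directions, profession_direction)

    # Label each feature by whichever concept direction it is closer to;
    # features weakly aligned with both are left unassigned.
    labels = np.where(np.abs(gender_sim) > np.abs(profession_sim),
                      "gender", "profession")
    weak = np.maximum(np.abs(gender_sim), np.abs(profession_sim)) < 0.1
    return np.where(weak, "unassigned", labels)
```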
Here is the YouTube video from the Guaranteed Safe AI Seminars:
It might not be that impossible to use LLMs to automatically train wisdom:
Look at this: "Researchers have utilized Nvidia’s Eureka platform, a human-level reward design algorithm, to train a quadruped robot to balance and walk on top of a yoga ball."
Strongly agree.
Related: It's disheartening to recognize, but it seems the ML community might not even get past the first crucial step in reducing risks, which is understanding them. We appear to live in a world where most people, including key decision-makers, still don't grasp the gravity of the situation. For instance, in France, we still hear influential figures like Arthur Mensch, CEO of Mistral, saying things like, "When you write this kind of software, you always control what's going to happen, all the outputs the software can have." As long as such ...
It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion)
One issue is there's also a difference between "AI X-Safety" and "AI Safety". It's very natural for people working on all kinds of safety from and with AI systems to call their field "AI safety", so it seems a bit doomed to try and have that term refer to x-safety.
Strong agree. I think Twitter and reposting stuff on other platforms are still neglected, and this is important for increasing safety culture.
doesn't justify the strength of the claims you're making in this post, like "we are approaching a point of no return" and "without a treaty, we are screwed".
I agree that's a bit too much, but it seems to me that we're not at all on the way to stopping open source development, and that we need to stop it at some point; maybe you think ARA is a bit early, but I think we need a red line before AI becomes human-level, and ARA is one of the last arbitrary red lines before everything accelerates.
But I still think no return to loss of control because it might be ...
Thanks for this comment, but I think this might be a bit overconfident.
constantly fighting off the mitigations that humans are using to try to detect them and shut them down.
Yes, I have no doubt that if humans implement some kind of defense, this will slow down ARA a lot. But:
Why not! There are many, many questions that were not discussed here because I just wanted to focus on the core part of the argument. But I agree details and scenarios are important, even if I think this shouldn't change the basic picture depicted in the OP too much.
Here are some important questions that were deliberately omitted from the Q&A for the sake of not including stuff that fluctuates too much in my head:
Thanks for writing this.
I like your writing style; this inspired me to read a few more things.
Seems like we are here today
Are the talks recorded?
[We don't think this long term vision is a core part of constructability, this is why we didn't put it in the main post]
We are unsure, but here are several possibilities.
Constructability could lead to different possibilities depending on how well it works, from most to least ambitious:
I have tried Camille's in-person workshop in the past and was very happy with it. I highly recommend it. It helped me discover many unknown unknowns.
Deleted paragraph from the post that might answer the question:
Surprisingly, the same study found that even if there were an escalation of warning shots that ended up killing 100k people or causing >$10 billion in damage (definition), skeptics would only update their estimate from 0.10% to 0.25% [1]. There is a lot of inertia, we are not even sure this kind of "strong" warning shot would happen, and I suspect this kind of big warning shot could happen beyond the point of no return, because this type of warning shot requires autonomous replication and adapt...
In your case, you felt the problem, until you decided that an AI civilization might spontaneously develop a spurious concept of phenomenal consciousness.
This is currently the best summary of the post.
Thanks for jumping in! And I'm not struggling that much emotionally with this; it was more of a nice puzzle, so don't worry about it :)
I agree my reasoning is not clean in the last chapter.
To me, the epiphany was that AI would rediscover everything, like it rediscovered chess on its own. As I've said in the box, this is a strong blow to non-materialistic positions, and I've not emphasized this enough in the post.
I expect AI to be able to create "civilizations" (sort of) of its own in the future, with AI philosophers, etc.
Here is a snippet of my answer to Kaj,...
Thank you for clarifying your perspective. I understand you're saying that you expect the experiment to resolve to "yes" 70% of the time, making you 70% eliminativist and 30% uncertain. You can't fully update your beliefs based on the hypothetical outcome of the experiment because there are still unknowns.
For myself, I'm quite confident that the meta-problem and the easy problems of consciousness will eventually be fully solved through advancements in AI and neuroscience. I've written extensively about AI and the path to autonomous AGI here, and I would ask pe...
hmm, I don't understand something, but we are closer to the crux :)
You say:
So, you answer 99-ish% to the first question and 70% to the second question, which seems inconsistent.
It seems to me that you don't bite the bullet for the first question if you expect this to happen. Saying, "Looks like I was right," ...
Let's put aside ethics for a minute.
"But it wouldn't be necessary the same as in a human brain."
Yes, this wouldn't be the same as the human brain; it would be like the Swiss cheese pyramid that I described in the post.
Your story ended with stating the meta-problem, so until it's actually solved, you can't explain everything.
Take a look at my answer to Kaj Sotala and tell me what you think.
Thank you for the kind words!
Saying that we'll figure out an answer in the future when we have better data isn't actually giving an answer now.
Okay, fair enough, but I predict this would happen: in the same way that AlphaZero rediscovered all of chess theory, it seems to me that if you just let the AIs grow, you can create a civilization of AIs. Those AIs would have to create some form of language or communication, and some AI philosopher would get involved and then talk about the hard problem.
I'm curious how you answer those two questions:
But I have no way to know or predict if it is like something to be a fish or GPT-4
But I can predict what you say; I can predict whether you are confused by the hard problem just by looking at your neural activations; I can predict, word by word, the following sentence that you are uttering: "The hard problem is really hard."
I would be curious to know what you think about the box solving the meta-problem just before the addendum. Do you think it is unlikely that AI would rediscover the hard problem in this setting?
I'm not saying that LeCun's rosy views on AI safety stem solely from his philosophy of mind, but yes, I suspect there is something there.
When he says things like "LLMs don't display true understanding" or "true reasoning", as if there's some secret sauce to all this that he thinks can only appear in his JEPA architecture or whatever, it seems to me that this is very similar to the linguistic problems I've observed for consciousness.
Surely, if you discuss this with him, he will say things like "No, this is not just a linguistic deb...
Sure, "everything is a cluster" or "everything is a list" is as right as "everything is emergent". But what's the actual justification for pruning that neuron? You can prune everything like that.
The justification for pruning this neuron seems to me to be that if you can explain basically everything without using a dualistic view, it is so much simpler. The two hypotheses are possible, but you want to go with the simpler hypothesis, and a world with only (physical properties) is simpler than a world with (physical properties + mental properties).
I would be ...
Agreed, this could be much more convincing; we still have a few shots, but I still think nobody will care even with a much stronger version of this particular warning shot.