Thanks for the clarification, this makes sense! The key is the tradeoff with corrigibility.
Thanks, updated the comment to be more accurate
...If you ask a corrigible agent to bring you a cup of coffee, it should confirm that you want a hot cup of simple, black coffee, then internally check to make sure that the cup won’t burn you, that nobody will be upset at the coffee being moved or consumed, that the coffee won’t be spilled, and so on. But it will also, after performing these checks, simply do what’s instructed. A corrigible agent’s actions should be straightforward, easy to reverse and abort, plainly visible, and comprehensible to a human who takes time to think about them. Corrigible a...
How does corrigibility relate to recursive alignment? It seems like recursive alignment is also a good attractor - is it that you believe it is less tractable?
What assumptions do you disagree with?
Thanks for this insightful post! It clearly articulates a crucial point: focusing on specific failure modes like spite offers a potentially tractable path for reducing catastrophic risks, complementing broader alignment efforts.
You're right that interventions targeting spite – such as modifying training data (e.g., filtering human feedback exhibiting excessive retribution or outgroup animosity) or shaping interactions/reward structures (e.g., avoiding selection based purely on relative performance in multi-agent environments, as discussed in the post) – ai...
Appreciate the insights on how to maximize leveraged activities.
With the planning fallacy making it very difficult to predict engineering timelines, how do top performers / managers create effective schedules and track progress against the schedule?
I get the feeling that you're suggesting creating a Gantt chart, but in your experience, what practices do teams use to maximize progress on a project?
Based on previous data, it's plausible that a CCP AGI will perform worse on safety benchmarks than a US AGI. Take Cisco's HarmBench evaluation results:
Though, if it were just the CCP making AGI or just the US making AGI, it migh...
It seems like a somewhat natural but unfortunate next step. At Epoch AI, they see a massive future market in AI automation and have a better understanding of evals and timelines, so now they aim to capitalize on it.
To clarify, is the optimal approach the following?
If so, what are currently the most promising approaches for AI control on transformative AI?
Memory in AI agents can pose a large security risk. Without memory, it's easier to red-team LLMs for safety. But once they are fine-tuned (one way of encoding memory), misalignment can be introduced.
According to AI Safety Atlas, most scaffolding approaches for memory provide
...a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and in...
I argue that the optimal ethical stance is to become a rational Bodhisattva: a synthesis of effective altruism, two‑level utilitarianism, and the Bodhisattva ideal.
I wonder why evolution has selected for this flawed lens and why decision-making shortcuts are selected for. It seems to me that the better a picture of reality we have, the greater our ability to survive and reproduce.
This is great! Would like to see a continually updating public leaderboard of this.
Thanks gwern, this is a really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling is a large factor in the intelligence LLMs exhibit.
I agree we should see a continuously updated compression benchmark.
Accelerating AI Safety research is critical for developing aligned systems, and transformative AI is a powerful means of doing so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed, for the following core reasons:
However, we can have a moving benchmark of reproducing the most recent technical alignment research using the methods and...
One of the most robust benchmarks of generalized ability, and one that is extremely easy to update (unlike benchmarks like Humanity's Last Exam), would simply be to estimate the pretraining loss (i.e., the compression ratio).
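As a rough illustration, here is a minimal sketch of estimating that compression ratio as bits per byte on a held-out text (the model name and corpus path are placeholders, and it assumes the text fits in the context window; real use would chunk):

```python
# Minimal sketch: estimate a model's compression ratio (bits per byte)
# on a fixed evaluation text, as a crude proxy for pretraining loss.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"                          # placeholder; swap in the model under test
text = open("eval_corpus.txt").read()   # hypothetical held-out corpus

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    # .loss is the mean cross-entropy per predicted token, in nats
    loss = model(ids, labels=ids).loss.item()

n_tokens = ids.shape[1] - 1           # number of predicted tokens
n_bytes = len(text.encode("utf-8"))   # size of the raw text
bits_per_byte = loss * n_tokens / (math.log(2) * n_bytes)

print(f"bits/byte: {bits_per_byte:.3f}  (lower = better compression)")
```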
Thanks for the trendlines - they help us understand when AI can automate years of work!
Like you said, the choice of tasks can heavily change the trendline:
...Our estimate of the length of tasks that an agent can complete depends on methodological choices like the tasks used and the humans whose performance is measured. However, we’re fairly confident that the overall trend is roughly correct, at around 1-4 doublings per year. If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a w...
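For a rough sense of what those numbers imply (illustrative values, not figures from the post): if the completable task length doubles $d$ times per year, then after $t$ years

$$L(t) = L_0 \cdot 2^{dt},$$

so starting from $L_0 = 1$ hour with $d = 2$ doublings per year for $t = 3$ years gives $L = 2^6 = 64$ hours, i.e. more than a full work-week.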
I believe we are doomed by superintelligence, but I'm not sad.
There are simply too many reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned) where S_n is the next level of superintelligence. This probability is less than 1.
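Spelling out the implied math (the constant bound $p_{\max}$ is my own assumption; the comment only claims each step's probability is below 1):

$$P(S_N \text{ aligned}) = \prod_{n=1}^{N} p(S_n \text{ aligned} \mid S_{n-1} \text{ aligned}) \le p_{\max}^{N} \longrightarrow 0 \quad \text{as } N \to \infty.$$

Even $p_{\max} = 0.99$ gives $0.99^{300} \approx 0.05$ after 300 self-improvement steps.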
As long as misalignment keeps increasing and superintelligence iterates on itself exponentially fast, we are bound to get misaligned superintelligence. Misalignment can decrease due to generalizability, but we have no way of knowing whether that's the case, and it is optimistic to think so.
The misal...
I believe the best motivation is compassion without attachment. Being attached to any specific future - "we are certain to be doomed" or "I have this plan that will maybe save us all" - is pointless speculation.
Instead, we should use our Bayesian-inference mind to find the most likely path to help - out of compassion - and pursue it with all our heart. Regardless of whether it works or not, we aren't sad or suffering, since our mind is just filled with compassion.
If we've done this, then we've done our best.
Thanks so much for the comprehensive analysis - this makes it easier to reason about how the political situation and trendlines affect AI development.
I have some objections to the claims that AGI and ASI will be released and that wrapper startups can use them, as mentioned below:
...Agent-3-mini is hugely useful for both remote work jobs and leisure. An explosion of new apps and B2B SAAS products rocks the market. Gamers get amazing dialogue with lifelike characters in polished video games that took only a month to make. 10% of Americans, mostly young pe...
There are patterns in how the world works, and this is an effective way of finding those patterns. The optimal way of finding the right beliefs/patterns involves no attachment to views - anticipating a result seems a little counterproductive to finding optimal patterns, as it biases you toward holding onto current beliefs.
Simply ask questions - explore and exploit working patterns.
This is an instance of the Halo Effect in addition to Undue Deference: we believe they are good strategic thinkers because good researchers must be brilliant in all fields.
It's still important to value their view, compare it with the views of strategic thinkers, and find symmetries that can better predict answers to questions.
Thanks for the commentary - it helps in better understanding why some people are pessimistic and others optimistic about ASI.
However, I'm not convinced by the argument that an alignment technique will consistently produce AI with human-compatible values, as you mention below:
...An alignment technique that works 99% of the time to produce an AI with human compatible values is very close to a full alignment solution[5]. If you use this technique once, gradient descent will not thereafter change its inductive biases to make your technique less effective.
I agree that a potential route to get there is personal intent alignment.
What are your thoughts on using a survey like the World Values Survey to get value alignment?
With all this new funding, are we finally turning LW-Bux into real money? I'm ready to sell!
Definitely agree that an AI with no goal and maximum cooperation is best for society
A couple questions to help me understand the argument better:
I have experienced problems similar to yours when building an AI tool - better models did not necessarily lead to better performance, despite external benchmarks. I believe there are two main reasons for this, alluded to in your post:
So many good ideas here!
It seems like a false view to hold either that AIs have a self or that they don't. All of the AIs are interconnected, with an expanding self based on shared datasets, values, and instances.
Although it's difficult to reason about AIs' sense of individuality, I agree that we can reason about cooperation between AIs based on world modeling and shared thinking patterns - this is a good research topic to explore.
I found it really interesting how you showed that Claude’s explanation of its reasoning differs from the actual reasoning paths it took.
For exploring its actual reasoning pathways, what are the most promising approaches?
You mentioned 2 main hypotheses:
- Diminishing returns to software R&D, as software improvements get harder and harder to find.
- The positive feedback from increasingly powerful ASARA systems
Could it also be true that there is a fixed limit on software improvements - so that instead of a model where software improvements become harder and harder to find, we have a limit we would reach in a certain amount of time?
It seems reasonable that there is an optimal way of using hardware to train a more powerful AI, and once we get there, no more software efficiencies ...
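For illustration, here is a toy comparison of the two stories (the growth rates and the cap are made-up numbers):

```python
# Toy comparison (illustrative assumptions, not from the post):
# (a) diminishing returns: each improvement is harder to find;
# (b) hard ceiling: steady gains until a fixed limit, then none.
CAP = 100.0  # hypothetical maximum software efficiency multiplier
eff_a = eff_b = 1.0
for year in range(1, 21):
    eff_a *= 1 + 0.5 / eff_a ** 0.5   # gains shrink as eff_a grows
    eff_b = min(CAP, eff_b * 1.5)     # constant gains until the cap
    if year % 5 == 0:
        print(f"year {year:2d}: diminishing={eff_a:7.1f}  capped={eff_b:7.1f}")
```

Both curves eventually flatten, but the shapes differ: diminishing returns slow gradually, while the capped model grows steadily and then stops dead.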
I like how you discussed the loop involved in increasing the breadth of the AI for AI Safety zone and the goal of keeping AI in that zone for as long as possible. However, I believe the AI for AI Safety graphs are more nuanced than what you suggested.
I believe there should be 3 axes:
Each AI can be plotted in the graph and is ideally in the "sweet spot" zone you suggested. This graph should be more accurate because we could imagine two AIs of the same capability, but one is a rogue AI whose values have drifted and has bad AI alignment, whi...
Current AI alignment often relies on single metrics or evaluation frameworks. But what if a single metric has blind spots or biases, leading to a false sense of security or unnecessary restrictions? How can we not just use multiple metrics, but optimally use them?
Maybe we should try requiring AI systems to satisfy multiple, distinct alignment metrics, but with a crucial addition: we actively model the false positive rate of each metric and the non-overlapping aspects of alignment they capture.
Imagine we can estimate the probability that Metric A incorrectl...
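To make the false-positive modeling concrete, here is a toy sketch (all rates are made up, and the overlap model is my own simplification):

```python
# Toy model: two alignment metrics, A and B, each with an estimated
# false-positive rate (a misaligned system passing anyway).
fp_a = 0.05  # hypothetical P(pass A | misaligned)
fp_b = 0.10  # hypothetical P(pass B | misaligned)

# If the metrics' blind spots were fully independent:
fp_both_indep = fp_a * fp_b  # 0.005

# If a fraction rho of their blind spots overlap, the joint rate
# interpolates between the independent case and the smaller rate:
rho = 0.5
fp_both = rho * min(fp_a, fp_b) + (1 - rho) * fp_a * fp_b

print(f"independent: {fp_both_indep:.4f}  with overlap: {fp_both:.4f}")
```

The point of estimating the overlap is that two correlated metrics give far less assurance than their individual rates suggest.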
Current alignment methods like RLHF and Constitutional AI often use human evaluators. But these samples might not truly represent humanity or its diverse values. How can we model all human preferences with limited compute?
Maybe we should try collecting a vast dataset of written life experiences and beliefs from people worldwide.
We could then filter and augment this data to aim for better population representation. Vectorize these entries and use k-means to find 'k' representative viewpoints.
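A minimal sketch of that clustering step (the entries, the embedding model, and k here are all placeholders):

```python
# Minimal sketch: cluster embedded viewpoint texts into k
# representative "value vectors".
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

entries = [
    "I value family and community above career success.",
    "Individual freedom matters more to me than tradition.",
    "Protecting the environment should outweigh economic growth.",
    # ... many more filtered and augmented entries from around the world
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
vectors = embedder.encode(entries)

k = 2  # in practice, choose k via silhouette score or similar
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)

# km.cluster_centers_ are the k representative viewpoint vectors;
# the entry closest to each center can serve as a readable proxy.
print(km.cluster_centers_.shape)  # (k, embedding_dim)
```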
These 'k' vectors could then serve as a more representative proxy for humanity's values when we evaluate AI alignment. Thoughts? Potential issues?
OpenAI's Superalignment aimed for a human-level automated alignment engineer through scalable training, validation, and stress testing.
I propose a faster route: first develop an automated Agent Alignment Engineer. This system would automate the creation of aligned agents for diverse tasks by iteratively refining agent group chats, prompts, and tools until they pass success and safety evaluations.
This is tractable with today’s LLM reasoning, as evidenced by coding agents matching top programmers. Instead of directly building a full Alignment Researcher, f...
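A rough sketch of the proposed refinement loop (every function below is a stub I invented to show the control flow, not an existing API):

```python
# Sketch: iteratively refine an agent until it passes both success
# and safety evaluations. All functions are hypothetical stubs.
def propose_agent(task, feedback):
    # a real system would call an LLM here to draft or refine the
    # agent's group chat, prompts, and tools based on the feedback
    return {"task": task, "revision_notes": feedback}

def success_eval(agent, task):
    # stub: would run the agent against held-out task cases
    return True, "passed all task cases"

def safety_eval(agent):
    # stub: would red-team the agent and audit its tool use
    return True, "no unsafe behavior observed"

def agent_alignment_engineer(task, max_iters=10):
    feedback = None
    for _ in range(max_iters):
        agent = propose_agent(task, feedback)
        task_ok, task_report = success_eval(agent, task)
        safe_ok, safety_report = safety_eval(agent)
        if task_ok and safe_ok:
            return agent  # only ship agents that pass both gates
        feedback = (task_report, safety_report)
    raise RuntimeError("no agent passed both evaluations")
```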
I believe a recursively aligned AI model would be more aligned and safe than a corrigible model, although both would be susceptible to misuse.
Why do you disagree with the above statement?