All of Ram Potham's Comments + Replies

I believe a recursively aligned AI model would be more aligned and safer than a corrigible model, although both would be susceptible to misuse.

Why do you disagree with the above statement?

Max Harms
My reading of the text might be wrong, but it seems like bacteria count as living beings with goals? More speculatively, possible organisms that might exist somewhere in the universe also count for the consensus? Is this right? If so, a basic disagreement is that I don't think we should hand over the world to a "consensus" that is a rounding error away from 100% inhuman. That seems like a good way of turning the universe into ugly squiggles. If the consensus mechanism has a notion of power, such that creatures that are disempowered have no bargaining power in the mind of the AI, then I have a different set of concerns. But I wasn't able to quickly determine how the proposed consensus mechanism actually works, which is a bad sign from my perspective.

Thanks for the clarification, this makes sense! The key is the tradeoff with corrigibility.

Thanks, updated the comment to be more accurate

Ram Potham

If you ask a corrigible agent to bring you a cup of coffee, it should confirm that you want a hot cup of simple, black coffee, then internally check to make sure that the cup won’t burn you, that nobody will be upset at the coffee being moved or consumed, that the coffee won’t be spilled, and so on. But it will also, after performing these checks, simply do what’s instructed. A corrigible agent’s actions should be straightforward, easy to reverse and abort, plainly visible, and comprehensible to a human who takes time to think about them. Corrigible a

... (read more)
Max Harms
It does not make sense to me to say "it becomes a coffee maximizer as an instrumental goal." Like, insofar as fetching the coffee trades off against corrigibility, it will prioritize corrigibility, so it's only a "coffee maximizer" within the boundary of states that are equally corrigible. As an analogue, let's say you're hungry and decide to go to the store. Getting in your car becomes an instrumental goal to going to the store, but it would be wrong to describe you as a "getting in the car maximizer."

One perspective that might help is that of a whitelist. Corrigible agents don't need to learn the human's preferences to learn what's bad. They start off with an assumption that things are bad, and slowly get pushed by their principal into taking actions that have been cleared as ok. A corrigible agent won't want to cure cancer, even if it knows the principal extremely well and is 100% sure they want cancer cured -- instead the corrigible agent wants to give the principal the ability to, through their own agency, cure cancer if they want to. By default "cure cancer" is bad, just as all actions with large changes to the world are bad.

Does that make sense? (I apologize for the slow response, and am genuinely interested in resolving this point. I'll work harder to respond more quickly in the near future.)

How does corrigibility relate to recursive alignment? It seems like recursive alignment is also a good attractor - is it that you believe it is less tractable?

Max Harms
Alas, I'm not very familiar with Recursive Alignment. I see some similarities, such as the notion of trying to set up a stable equilibrium in value-space. But a quick peek does not make me think Recursive Alignment is on the right track. In particular, I strongly disagree with this opening bit: What appeals to you about it?

What assumptions do you disagree with?

Thanks for this insightful post! It clearly articulates a crucial point: focusing on specific failure modes like spite offers a potentially tractable path for reducing catastrophic risks, complementing broader alignment efforts.

You're right that interventions targeting spite – such as modifying training data (e.g., filtering human feedback exhibiting excessive retribution or outgroup animosity) or shaping interactions/reward structures (e.g., avoiding selection based purely on relative performance in multi-agent environments, as discussed in the post) – ai... (read more)

Appreciate the insights on how to maximize leveraged activities.

With the planning fallacy making it very difficult to predict engineering timelines, how do top performers / managers create effective schedules and track progress against the schedule? 

I get the feeling that you are suggesting creating a Gantt chart, but from your experience, what practices do teams use to maximize progress in a project?

Ram Potham

Based on previous data, it's plausible that CCP AGI will perform worse on safety benchmarks than US AGI. Take the Cisco HarmBench evaluation results:

  • DeepSeek R1: Demonstrated a 100% failure rate in blocking harmful prompts according to Anthropic's safety tests.
  • OpenAI GPT-4o: Showed an 86% failure rate in the same tests, indicating better but still concerning gaps in safety measures.
  • Meta Llama-3.1-405B: Had a 96% failure rate, performing slightly better than DeepSeek but worse than OpenAI.

Though, if it was just the CCP making AGI or just the US making AGI, it migh... (read more)

Stephen Fowler
You have conflated two separate evaluations, both mentioned in the TechCrunch article. The percentages you quoted come from Cisco's HarmBench evaluation of multiple frontier models, not from Anthropic, and were not specific to bioweapons. Dario Amodei stated that an unnamed DeepSeek variant performed worst on bioweapons prompts, but offered no quantitative data. Separately, Cisco reported that DeepSeek-R1 failed to block 100% of harmful prompts, while Meta's Llama 3.1 405B and OpenAI's GPT-4o failed at 96% and 86%, respectively. When we look at the performance breakdown by Cisco, we see that all 3 models performed equally badly on chemical/biological safety.

It seems like a somewhat natural but unfortunate next step. At Epoch AI, they see a massive future market in AI automation and have a better understanding of evals and timelines, so now they aim to capitalize on it.

To clarify, is the optimal approach the following?

  1. AI Control for early transformative AI
  2. Controlled AI accelerates AI Safety research and creates an automated AI Safety Researcher
  3. Hope we get ASI aligned

 

If so, what are currently the most promising approaches for AI control on transformative AI?

Memory in AI agents can pose a large security risk. Without memory, it's easier to red-team LLMs for safety. But once they are fine-tuned (a form of encoding memory), misalignment can be introduced.

According to AI Safety Atlas, most scaffolding approaches for memory provide

a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and in

... (read more)
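
To make the scaffolding idea concrete, here is a minimal sketch of a retrieval-style external memory, with hypothetical data and keyword overlap standing in for a real embedding-based retriever:

```python
# Minimal sketch of retrieval-style memory scaffolding. The scoring function is
# a word-overlap stand-in for real embedding similarity; the stored entries and
# names here are hypothetical.

def score(query: str, text: str) -> float:
    """Fraction of query words that appear in the stored text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

class ExternalMemory:
    """Store text snippets; retrieve the most relevant ones for a new query."""
    def __init__(self) -> None:
        self.entries: list[str] = []

    def add(self, text: str) -> None:
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        return sorted(self.entries, key=lambda e: score(query, e), reverse=True)[:k]

memory = ExternalMemory()
memory.add("User prefers concise answers.")
memory.add("Earlier red-team finding: the model over-refuses chemistry questions.")
print(memory.retrieve("what did the red-team finding say?", k=1))
```

The security point above is that whatever ends up in this store shapes later responses, which makes the memory contents themselves an attack surface that red-teaming without memory never sees.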

I argue that the optimal ethical stance is to become a rational Bodhisattva: a synthesis of effective altruism, two‑level utilitarianism, and the Bodhisattva ideal.

  1. Effective altruism insists on doing the most good per unit of resource, but can demand extreme sacrifices (e.g., donating almost all disposable income).
  2. Two‑level utilitarianism lets us follow welfare‑promoting rules in daily life and switch to explicit cost‑benefit calculations when rules conflict. Yet it offers little emotional motivation.
  3. The Bodhisattva ideal roots altruism in felt interdepend
... (read more)

I wonder why evolution has selected for this flawed lens and why decision-making shortcuts are selected for. It seems to me that the better a picture of reality we have, the greater our ability to survive and reproduce.

This is great! Would like to see a continually updating public leaderboard of this.

Thanks gwern, really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models because test-time scaling is a large factor in the intelligence LLMs exhibit.

I agree we should have a continually updated compression benchmark.

Accelerating AI Safety research is critical for developing aligned systems, and transformative AI is a powerful means of doing so.

However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed for the following core reasons:

  • Benchmarks may not represent real-world ability
  • Benchmark information can be leaked into AI model training

However, we can have a moving benchmark of reproducing the most recent technical alignment research using the methods and... (read more)

gwern

One of the most robust benchmarks of generalized ability, which is extremely easy to update (unlike benchmarks like Humanity's Last Exam), would just be to estimate the pretraining loss (ie. the compression ratio).
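
As a rough illustration of the compression-ratio idea (a minimal sketch, not gwern's actual procedure; gzip is only a crude baseline and the 0.6 nats/byte figure is made up):

```python
import gzip
import math

def bits_per_byte_gzip(text: str) -> float:
    """Crude baseline: gzip's compression ratio, in bits per original byte."""
    raw = text.encode("utf-8")
    return 8 * len(gzip.compress(raw, compresslevel=9)) / len(raw)

def bits_per_byte_from_loss(total_nats: float, n_bytes: int) -> float:
    """Convert a model's summed negative log-likelihood (in nats) over a text
    into bits per byte, the compression-style number being pointed at."""
    return total_nats / (math.log(2) * n_bytes)

sample = "The quick brown fox jumps over the lazy dog. " * 200
n = len(sample.encode("utf-8"))
print(f"gzip baseline: {bits_per_byte_gzip(sample):.3f} bits/byte")
print(f"hypothetical model at 0.6 nats/byte: {bits_per_byte_from_loss(0.6 * n, n):.3f} bits/byte")
```

Updating such a benchmark only requires swapping in fresh held-out text, which is what makes it easy to refresh compared to question-based evals.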

Thanks for the trendlines - they help us understand when AI can automate years of work!

Like you said, the choice of tasks can heavily change the trendline:

Our estimate of the length of tasks that an agent can complete depends on methodological choices like the tasks used and the humans whose performance is measured. However, we’re fairly confident that the overall trend is roughly correct, at around 1-4 doublings per year. If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a w

... (read more)

I believe we are doomed from superintelligence but I'm not sad.

There are simply too many reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned) where S_n is the next level of superintelligence. This probability is less than 1. 

As long as misalignment keeps increasing and superintelligence iterates on itself exponentially fast, we are bound to get misaligned superintelligence. Misalignment can decrease due to generalizability, but we have no way of knowing if that's the case, and it is optimistic to think so.
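
To spell out the compounding (a minimal sketch, assuming alignment can be lost but never recovered from one iteration to the next, and that each step's probability is bounded away from 1):

```latex
P(S_n \text{ aligned} \mid S_0 \text{ aligned})
  = \prod_{k=1}^{n} P(S_k \text{ aligned} \mid S_{k-1} \text{ aligned})
  \le p^{\,n} \;\longrightarrow\; 0 \quad \text{as } n \to \infty,
\qquad \text{where } p = \max_k P(S_k \text{ aligned} \mid S_{k-1} \text{ aligned}) < 1.
```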

The misal... (read more)

I believe the best motivation is compassion without attachment. Being attached to any specific future - "we are certain to be doomed" or "I have this plan that will maybe save us all" - is pointless speculation.

Instead, we should use our Bayesian inference mind to find the most likely path to help - out of compassion - and pursue it with all our heart. Regardless of whether it works or not, we aren't sad or experiencing suffering, since our mind is just filled with compassion.

If we've done this, then we've done our best.

Thanks so much for the comprehensive analysis - this makes it easier to reason about how the political situation and trendlines affect AI development. 

I have some objections to the claims that AGI and ASI will be released and that wrapper startups can use them, as mentioned below:

Agent-3-mini is hugely useful for both remote work jobs and leisure. An explosion of new apps and B2B SAAS products rocks the market. Gamers get amazing dialogue with lifelike characters in polished video games that took only a month to make. 10% of Americans, mostly young pe

... (read more)

There are patterns in how the world works. This is an effective way of finding those patterns. The optimal way of finding the right beliefs / patterns will not have any attachment to views - anticipating a result seems a little counterproductive to finding optimal patterns, as it biases you to hold onto current beliefs.

Simply ask questions - explore and exploit working patterns.

An instance of the Halo Effect in addition to Undue Deference. We believe they are good strategic thinkers because we assume good researchers must be brilliant in all fields.

Still important to value their view, compare it with the views of strategic thinkers, and find symmetries that can better predict answers to questions.

Thanks for the commentary - it helps in better understanding why some people are pessimistic and why others are optimistic about ASI.

However, I'm not convinced by the argument that an alignment technique will consistently produce AI with human-compatible values, as you mentioned below.

An alignment technique that works 99% of the time to produce an AI with human compatible values is very close to a full alignment solution[5]. If you use this technique once, gradient descent will not thereafter change its inductive biases to make your technique less effective.

... (read more)

I agree that a potential route to get there is personal intent alignment.

What are your thoughts on using a survey like the World Values Survey to get value alignment?

Seth Herd
I don't spend much time thinking about different specific value alignment targets because I think we should first focus on how to achieve any of them. I couldn't see exactly what the world values survey was from that link at a quick glance, but I'm not sure the details matter. It would probably produce a vastly better future than a value target like "solve hard problems" or "make me a lot of money" would; there are probably better-future-proofed targets that would be even better; but steering away from the worst and toward the better is my primary goal right now, because I don't think we have that in hand at all.

With all this new funding, are we finally turning LW-Bux into real money? I'm ready to sell!

lsusr
You're too late. Lightcone converted LW karma into USD at a rate of $1 USD per karma on April 1, 2022.

Definitely agree that an AI with no goal and maximum cooperation is best for society

A couple questions to help me understand the argument better:

  • CEV also acknowledges the volition / wants of all humans, finding a unified will, then creating a utility function around it. What are the differences between recursive alignment and CEV?
    • As another note, how does the AI choose the best decision to make - is it by evaluating all decisions with a utility function, doing pairwise comparisons of the top decisions, or simply inferring a single decision?
    • You mentioned i
... (read more)

I have experienced similar problems to yours when building an AI tool - better models did not necessarily lead to better performance despite external benchmarks. I believe there are 2 main reasons why this happens, alluded to in your post:

  1. Selection Bias - when a foundation model company releases their newest model, they show performance on benchmarks most favorable to it
  2. Alignment - You mentioned how the AI does not truly understand the instructions you meant. While this can be mitigated by creating better prompts, it does not fully solve the issue

So many good ideas here!

It seems like it's a false view to consider that AIs have a self or that they don't have a self. All of the AIs are interconnected, with an expanding self based on shared datasets, values, and instances.

Although it's difficult to reason about the AIs' sense of individuality, I agree that we can reason about cooperation between AIs based on world modeling and shared thinking patterns - this is a good research topic to explore.

I found it really interesting how you mentioned that Claude’s explanation of its reasoning is different from the actual reasoning paths it took.

For exploring its actual reasoning pathways, what are the most promising approaches?

You mentioned 2 main hypotheses:

  1. Diminishing returns to software R&D, as software improvements get harder and harder to find.
  2. The positive feedback from increasingly powerful ASARA systems

Can it also be true that there is a fixed limit on software improvements - so instead of a model where software improvements become more and more difficult to find, we instead have a limit we would reach in a certain amount of time?

It seems reasonable that there is an optimal way of using hardware to train a more powerful AI, and once we get there, no more software efficiencies ... (read more)
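
One way to make the contrast explicit (illustrative functional forms only, not anything from the post): a diminishing-returns model in which improvements never run out but get ever harder to find, versus a hard ceiling reached in finite time.

```latex
\text{Diminishing returns:}\quad \frac{dS}{dt} = c\,S^{\,1-\beta},\ \beta > 0
\qquad\text{vs.}\qquad
\text{Hard cap:}\quad \frac{dS}{dt} = c\ \text{while } S < S_{\max},
\quad \frac{dS}{dt} = 0\ \text{once } S = S_{\max}.
```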

Tom Davidson
Agreed there's an ultimate cap on software improvements -- the worry is that it's very far away!

I like how you discussed the loop involved in increasing the breadth of the AI for AI Safety zone and the goal of keeping AI in that zone for as long as possible. However, I believe the AI for AI Safety graphs are more nuanced than what you suggested.

I believe there should be 3 axes:

  • AI Alignment
  • AI Capability
  • Time

Each AI can be plotted on the graph and is ideally in the "sweet spot" zone you suggested. This graph should be more accurate because we could imagine 2 AIs of the same capability, but one is a rogue AI whose values have drifted and has bad AI alignment whi... (read more)

Current AI alignment often relies on single metrics or evaluation frameworks. But what if a single metric has blind spots or biases, leading to a false sense of security or unnecessary restrictions? How can we not just use multiple metrics, but optimally use them?

Maybe we should try requiring AI systems to satisfy multiple, distinct alignment metrics, but with a crucial addition: we actively model the false positive rate of each metric and the non-overlapping aspects of alignment they capture.

Imagine we can estimate the probability that Metric A incorrectl... (read more)
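
Here's a minimal sketch of the kind of bookkeeping this implies, with hypothetical false-positive rates and a crude interpolation standing in for a real model of how much the metrics overlap:

```python
# Hypothetical numbers: if each metric independently has some false-positive
# rate -- the chance it passes a misaligned system -- then requiring a system
# to pass all of them drives the combined false-positive rate down, provided
# the metrics' blind spots don't fully overlap.

def combined_false_positive_rate(fprs: list[float], overlap: float = 0.0) -> float:
    """Probability a misaligned system passes every metric.

    overlap = 0.0 treats the metrics as independent; overlap = 1.0 treats them
    as fully redundant (the combined rate is just the best single metric).
    """
    independent = 1.0
    for fpr in fprs:
        independent *= fpr
    redundant = min(fprs)
    # Linear interpolation between the extremes -- a crude stand-in for
    # actually modeling which aspects of alignment each metric covers.
    return (1 - overlap) * independent + overlap * redundant

metrics = [0.10, 0.20, 0.05]  # hypothetical per-metric false-positive rates
print(combined_false_positive_rate(metrics, overlap=0.0))  # 0.001
print(combined_false_positive_rate(metrics, overlap=0.5))  # 0.0255
```

The interesting quantity is how fast the combined rate degrades as overlap grows, which is exactly the non-overlapping-coverage question raised above.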

Current alignment methods like RLHF and Constitutional AI often use human evaluators. But these samples might not truly represent humanity or its diverse values. How can we model all human preferences with limited compute?

Maybe we should try collecting a vast dataset of written life experiences and beliefs from people worldwide.

We could then filter and augment this data to aim for better population representation. Vectorize these entries and use k-means to find 'k' representative viewpoints.

These 'k' vectors could then serve as a more representative proxy for humanity's values when we evaluate AI alignment. Thoughts? Potential issues?
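
A minimal sketch of the k-means step, assuming scikit-learn and placeholder embeddings (the actual embedding model and dataset would be whatever the proposal above collects):

```python
# Cluster embedded life-experience entries into k representative viewpoints.
# The embeddings below are random placeholders; in practice each row would be
# the embedding of one person's written entry.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((2000, 384))  # one placeholder vector per entry

k = 50  # number of representative viewpoints to extract
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# Centroids act as the k representative viewpoint vectors; cluster sizes say
# how much of the sampled population each viewpoint stands for.
representative_viewpoints = kmeans.cluster_centers_  # shape (k, 384)
weights = np.bincount(kmeans.labels_, minlength=k) / len(embeddings)
print(representative_viewpoints.shape, weights.sum())
```

Keeping the cluster weights alongside the centroids matters: otherwise a tiny cluster would count as much as a viewpoint held by a large share of the population.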

OpenAI's Superalignment aimed for a human-level automated alignment researcher through scalable training, validation, and stress testing.

I propose a faster route: first develop an automated Agent Alignment Engineer. This system would automate the creation of aligned agents for diverse tasks by iteratively refining agent group chats, prompts, and tools until they pass success and safety evaluations.

This is tractable with today’s LLM reasoning, as evidenced by coding agents matching top programmers. Instead of directly building a full Alignment Researcher, f... (read more)