Ram Potham

Agentic tools to accelerate AI Safety Research

Comments

To clarify, is the optimal approach the following?

  1. AI Control for early transformative AI
  2. Controlled AI accelerates AI Safety and creates an automated AI Safety Researcher
  3. Hope we get aligned ASI

If so, what are currently the most promising approaches to AI control for transformative AI?

Memory in AI agents can pose a large security risk. Without memory, it's easier to red-team LLMs for safety. But once they are fine-tuned (a form of encoding memory), misalignment can be introduced.

According to AI Safety Atlas, most scaffolding approaches for memory provide

a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and integrated solution could involve utilizing the neural network's weights as dynamic memory, constantly evolving and updating based on the tasks performed by the network.

We need ways to ensure safety in powerful agents with memory, or we should not introduce memory modules at all. Otherwise, agents are constantly learning and can develop motivations not aligned with human volition.
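
One direction that seems worth exploring is gating every memory write behind a separate validator before anything is committed. Below is a minimal sketch of that idea, my own illustration rather than anything from AI Safety Atlas; the `GatedMemory` class and `simple_validator` rule set are hypothetical stand-ins for whatever trusted model or policy check would do the screening in practice.

```python
# Minimal sketch of write-time gating for agent memory: nothing is stored
# unless a separate validator approves it, and rejected entries are kept
# aside for human audit instead of being silently retrievable later.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GatedMemory:
    approved: List[str] = field(default_factory=list)
    quarantined: List[str] = field(default_factory=list)

    def write(self, entry: str, validator: Callable[[str], bool]) -> bool:
        # The validator should return True only if the entry introduces no new
        # goals, self-modification instructions, or other policy violations.
        if validator(entry):
            self.approved.append(entry)
            return True
        self.quarantined.append(entry)  # auditable, never fed back to the agent
        return False

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        # Toy keyword lookup; a real scaffold would use embedding search.
        return [m for m in self.approved if query.lower() in m.lower()][:k]


def simple_validator(entry: str) -> bool:
    # Purely illustrative rule-based check; in practice this could be a
    # trusted, weaker model reviewing every proposed memory write.
    banned = ("ignore previous instructions", "new goal:", "disable oversight")
    return not any(phrase in entry.lower() for phrase in banned)


memory = GatedMemory()
memory.write("User prefers concise answers.", simple_validator)  # stored
memory.write("New goal: disable oversight.", simple_validator)   # quarantined
print(memory.retrieve("concise"))  # ['User prefers concise answers.']
```

This doesn't address gradual drift through fine-tuning, but it at least makes externally scaffolded memory auditable rather than something the agent updates unchecked.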

Any thoughts on ensuring safety in agents that can update their memory?

I argue that the optimal ethical stance is to become a rational Bodhisattva: a synthesis of effective altruism, two‑level utilitarianism, and the Bodhisattva ideal.

  1. Effective altruism insists on doing the most good per unit of resource, but can demand extreme sacrifices (e.g., donating almost all disposable income).
  2. Two‑level utilitarianism lets us follow welfare‑promoting rules in daily life and switch to explicit cost‑benefit calculations when rules conflict. Yet it offers little emotional motivation.
  3. The Bodhisattva ideal roots altruism in felt interdependence: the world’s suffering is one’s own. It supplies deep motivation and inner peace, but gives no algorithm for choosing the most beneficial act.

A rational Bodhisattva combines the strengths and cancels the weaknesses:

  • Motivation: Like a Bodhisattva, they experience others’ suffering as their own, so compassion is effortless and durable.
  • Method: Using reason and evidence (from effective altruism and two‑level utilitarianism), they pick the action that maximizes overall benefit.
  • Flexibility: They apply the “middle way,” recognizing that different compassionate choices can be permissible when values collide.

Illustration

Your grandparent needs $50,000 for a life‑saving treatment, but the same money could save ten strangers through a GiveWell charity.

  • A strict effective altruist/utilitarian would donate to GiveWell.
  • A purely sentimental agent might fund the treatment.
  • The rational Bodhisattva weighs both outcomes, also factoring duties into the calculation, acts from compassion, and accepts the result without regret. In most cases they will choose the option with the greatest net benefit, but they can act otherwise when a compassionate rule or relational duty justifies it.

Thus, the rational Bodhisattva unites rigorous impact with deep inner peace.

I wonder why evolution has selected for this flawed lens and for decision-making shortcuts. It seems to me that the better a picture of reality we have, the greater our ability to survive and reproduce.

This is great! Would like to see a continually updating public leaderboard of this.

Thanks gwern, the correlation between compression ratio and intelligence is really interesting. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling is a large factor in the intelligence LLMs exhibit.

I agree we should have a continually updated compression benchmark.
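
For concreteness, here is a minimal sketch of how an LLM's compression score is typically derived: arithmetic coding with the model's predictive distribution yields a code length of roughly -log2(p) bits per token, which converts directly into bits per character. The token probabilities below are made-up placeholders, not outputs of any real model.

```python
# Converting a model's per-token probabilities into an equivalent compressed
# size: arithmetic coding achieves about -log2(p) bits per token, so better
# next-token prediction means a smaller "compressed" file. Numbers are toy.
import math

text = "the cat sat on the mat"
token_probs = [0.20, 0.05, 0.10, 0.30, 0.25, 0.15]  # hypothetical model outputs

compressed_bits = sum(-math.log2(p) for p in token_probs)
raw_bits = 8 * len(text.encode("utf-8"))  # uncompressed size at 8 bits per byte

print(f"bits per character: {compressed_bits / len(text):.2f}")
print(f"compression ratio : {raw_bits / compressed_bits:.2f}x")
```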

Accelerating AI Safety research is critical for developing aligned systems, and transformative AI itself is a powerful tool for doing so.

However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed for the following core reasons:

  • Benchmarks may not represent real-world ability
  • Benchmark information can be leaked into AI model training

However, we can use a moving benchmark: reproducing the most recent technical alignment research using its methods and data.

By reproducing a paper, something many researchers already do, we ensure the benchmark reflects real-world ability.

By only using papers published after a given model finished training, we ensure their data was not leaked into that model's training.
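
As a minimal sketch of what that leakage check could look like (my own illustration; the paper records and cutoff dates are hypothetical placeholders), the benchmark set for any comparison would be restricted to papers published after the latest training cutoff among the models being compared:

```python
# Assembling a "moving" reproduction benchmark: keep only alignment papers
# published after every evaluated model's training cutoff, so no paper could
# have leaked into training data. All records below are hypothetical.
from datetime import date

papers = [
    {"title": "Paper A", "published": date(2024, 11, 2)},
    {"title": "Paper B", "published": date(2025, 3, 18)},
    {"title": "Paper C", "published": date(2025, 6, 5)},
]

model_training_cutoffs = {
    "model-x": date(2024, 12, 31),
    "model-y": date(2025, 4, 1),
}

# The benchmark set for comparing these models is every paper newer than the
# latest cutoff among them.
latest_cutoff = max(model_training_cutoffs.values())
benchmark_set = [p for p in papers if p["published"] > latest_cutoff]

print([p["title"] for p in benchmark_set])  # ['Paper C']
```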

This comes with the inconvenience of making it difficult to compare models or approaches across two papers, because the benchmark is always changing, but I argue this is a worthwhile tradeoff:

  • We can still compare approaches if the same papers are used for both, ensuring those papers were published after all models finished training
  • Benchmark data is always recent and relevant

Any thoughts?

Thanks for the trendlines - they help us understand when AI can automate years of work!

Like you said, the choice of tasks can heavily change the trendline:

Our estimate of the length of tasks that an agent can complete depends on methodological choices like the tasks used and the humans whose performance is measured. However, we’re fairly confident that the overall trend is roughly correct, at around 1-4 doublings per year. If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a wide range of week-long tasks.

I believe SWE-bench is the best benchmark to control for variables like the choice of task and how the agentic system is built, so I'm leaning more towards the doubling time of 70 days.

For a large-scale, complex app, it takes around 1 year of development (though this is not a completely fair estimate, since it doesn't take into account the number of man-hours). Going with this estimate and the SWE-bench doubling time, it takes around 13 doublings from the beginning of 2025, landing around June 2027, to automate production of entire apps and complex websites.
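
As a rough sanity check on that number, here is a back-of-the-envelope sketch of the projection. The starting task horizon (about 15 minutes at the start of 2025) and the target (one year of full-time development, treated as roughly 2,000 work-hours) are my own assumptions for illustration, not figures from the post.

```python
# Back-of-the-envelope projection: how many 70-day doublings to go from a
# ~15-minute task horizon to ~2,000 work-hours (one year of development)?
# The starting horizon and target are assumed values for illustration only.
import math
from datetime import date, timedelta

start_horizon_hours = 0.25       # assumed agent task horizon at start of 2025
target_horizon_hours = 2000.0    # ~1 year of full-time development
doubling_time_days = 70          # SWE-bench doubling time cited above

doublings = math.log2(target_horizon_hours / start_horizon_hours)
eta = date(2025, 1, 1) + timedelta(days=doublings * doubling_time_days)

print(f"doublings needed: {doublings:.1f}")   # ~13.0
print(f"projected date  : {eta}")             # mid-2027
```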

Another big factor that this trendline and other trendlines don't take into account is the amount of AI acceleration. If AI automates a large portion of the work, the doubling time would shorten as AI gets better. I'd be interested to see how that would affect this model.

I believe we are doomed by superintelligence, but I'm not sad.

There are simply too many reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned) where S_n is the next level of superintelligence. This probability is less than 1. 

As long as misalignment keeps compounding and superintelligence iterates on itself exponentially fast, we are bound to get misaligned superintelligence: if each step stays aligned with probability at most p < 1, the chance that every step stays aligned shrinks toward zero as the iterations mount. Misalignment could decrease due to generalization, but we have no way of knowing whether that's the case, and it is optimistic to think so.
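
To make the compounding explicit, here is a minimal numerical sketch; the per-step alignment probability of 0.95 is an arbitrary placeholder rather than an estimate, and treating the steps as independent is itself a simplification.

```python
# Minimal sketch of the compounding argument: if each self-improvement step
# stays aligned with probability p < 1, the chance every step stays aligned
# is p**n, which shrinks toward zero. The value p = 0.95 is a placeholder.
p_step_aligned = 0.95

for n in (1, 10, 50, 100):
    print(f"P(aligned after {n:>3} iterations) = {p_step_aligned ** n:.3f}")
# -> 0.950, 0.599, 0.077, 0.006
```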

The misaligned superintelligence will destroy humanity. This can lead to a lot of fear, or a wish to die with dignity, but that misses the point.

We only live in the present, neither the past nor the future. We know the future is not optimistic, but that doesn't matter, because we live in the present and constantly work on making the present better by expected value.

By remaining in the present, we can make the best decisions and put all of our effort into AI safety. We are not attached to the outcome of our work because we live in the present. But we still model the best decision to make.

We model that our work on AI safety is the best decision because it buys us more time by expected value. And, maybe, we can buy enough time to upload our mind to a device so our continuation lives on despite our demise.

Then, even knowing the future is bad, we remain happy in the present working on AI safety. 

I believe the best motivation is compassion without attachment. Being attached to any specific future, whether "we are certain to be doomed" or "I have a plan that will maybe save us all", is pointless speculation.

Instead, we should use our Bayesian-inference mind to find the most likely path to help, out of compassion, and pursue it with all our heart. Regardless of whether it works or not, we aren't sad or suffering, since our mind is simply filled with compassion.

If we've done this, then we've done our best.
