I argue that the optimal ethical stance is to become a rational Bodhisattva: a synthesis of effective altruism, two‑level utilitarianism, and the Bodhisattva ideal.
A rational Bodhisattva combines the strengths of each of these while cancelling out their weaknesses.
Illustration:
Your grandparent needs $50,000 for a life‑saving treatment, but the same money could save ten strangers through a GiveWell charity.
Thus, the rational Bodhisattva unites rigorous impact with deep inner peace.
Memory in AI agents can pose a significant security risk. Without memory, it's easier to red-team LLMs for safety. But once they are fine-tuned (a form of encoding memory), misalignment can be introduced.
According to AI Safety Atlas, most scaffolding approaches for memory provide a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and integrated solution could involve utilizing the neural network's weights as dynamic memory, constantly evolving and updating based on the tasks performed by the network.
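For concreteness, here is a minimal sketch of such a scaffolded memory: an external store the agent writes to and retrieves from before answering. The store, the toy similarity function, and the stubbed LLM call are all illustrative assumptions, not any particular framework's API.

```python
# Minimal sketch of a scaffolded memory module: the agent writes observations to
# an external store and retrieves the most relevant ones to condition its next
# response. Names and the toy similarity function are illustrative assumptions.

from dataclasses import dataclass, field


@dataclass
class ExternalMemory:
    entries: list[str] = field(default_factory=list)

    def write(self, text: str) -> None:
        # In a deployed agent this is the step that needs a safety gate:
        # whatever is stored here persists across red-teaming sessions.
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Toy relevance score: word overlap. A real scaffold would use
        # vector embeddings and approximate nearest-neighbour search.
        def score(entry: str) -> int:
            return len(set(entry.lower().split()) & set(query.lower().split()))
        return sorted(self.entries, key=score, reverse=True)[:k]


def answer(query: str, memory: ExternalMemory) -> str:
    context = memory.retrieve(query)
    # The LLM call is stubbed out: the key point is that memory enters only
    # through the prompt, so it can be inspected or filtered before use.
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    return f"[stubbed LLM response to a {len(prompt)}-char prompt with {len(context)} memories]"


memory = ExternalMemory()
memory.write("The user prefers concise answers.")
memory.write("Task from yesterday: summarize the alignment paper.")
print(answer("How should I format my answer?", memory))
```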
We need ways to ensure safety in powerful agents with memory, or we should not introduce memory modules at all. Otherwise, agents that are constantly learning can acquire motivations not aligned with human volition.
Any thoughts on ensuring safety in agents that can update their memory?
Accelerating AI safety research is critical for developing aligned systems, and transformative AI is a powerful way to do so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed.
Instead, we can use a moving benchmark: reproducing the most recent technical alignment research from each paper's stated methods and data.
By reproducing a paper, something many researchers already do, we ensure the system has real-world research ability.
By only using papers published after a given model finished training, we ensure the benchmark data has not leaked into its training set.
This comes with the inconvenience that it is difficult to compare models or approaches across two papers because the benchmark is always changing, but I argue this is a worthwhile tradeoff.
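As a sketch of the leakage control described above, assuming we have paper metadata with publication dates and a known training cutoff for each model (the Paper structure and the dates below are hypothetical):

```python
# Sketch of the data-leakage control: only papers published after a model's
# training cutoff are eligible benchmark tasks for that model.
# The Paper structure, titles, and dates are hypothetical.

from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    title: str
    published: date


def eligible_papers(papers: list[Paper], training_cutoff: date) -> list[Paper]:
    """Keep only papers the model cannot have seen during pretraining."""
    return [p for p in papers if p.published > training_cutoff]


papers = [
    Paper("Scalable oversight via debate (hypothetical)", date(2024, 11, 2)),
    Paper("Sparse autoencoder features (hypothetical)", date(2023, 6, 10)),
]
print([p.title for p in eligible_papers(papers, training_cutoff=date(2024, 3, 1))])
```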
Any thoughts?
One of the most robust benchmarks of generalized ability, and one that is extremely easy to update (unlike benchmarks like Humanity's Last Exam), would simply be to estimate the pretraining loss (i.e., the compression ratio).
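For reference, a minimal sketch of how a pretraining loss converts into a compression figure, assuming we can measure a model's average loss on held-out text (the numbers below are made up):

```python
# Sketch: converting a language model's average loss on held-out text into a
# compression estimate. Cross-entropy in nats/token -> bits/token -> bits/byte,
# which is directly comparable to a compressor's ratio. Numbers are illustrative.

import math

avg_loss_nats_per_token = 2.1   # hypothetical pretraining loss on held-out text
avg_bytes_per_token = 3.8       # hypothetical tokenizer average

bits_per_token = avg_loss_nats_per_token / math.log(2)
bits_per_byte = bits_per_token / avg_bytes_per_token
compression_ratio = 8.0 / bits_per_byte  # raw bytes are 8 bits each

print(f"{bits_per_byte:.3f} bits/byte ≈ {compression_ratio:.1f}x compression")
```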
It's good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. So it needs to be started back up, with approximate estimators added for the API/chatbot models too, before it can start providing a good universal capability benchmark in near-realtime.
Thanks gwern, the correlation between compression ratio and intelligence is really interesting. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling accounts for a large part of the intelligence they exhibit.
I agree we should see a continued compression benchmark.
I believe we are doomed by superintelligence, but I'm not sad.
There are simply too many reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned), where S_n is the n-th generation of superintelligence and S_{n-1} is its predecessor. This probability is less than 1.
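To make the implied step explicit (a minimal calculation, assuming a misaligned generation never produces an aligned successor, and that each conditional probability is at most some p_max < 1):

$$P(S_n \text{ aligned}) \;=\; P(S_0 \text{ aligned}) \prod_{i=1}^{n} p(S_i \text{ aligned} \mid S_{i-1} \text{ aligned}) \;\le\; p_{\max}^{\,n} \;\to\; 0 \quad \text{as } n \to \infty.$$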
As long as the probability of misalignment keeps compounding in this way and superintelligence iterates on itself exponentially fast, we are bound to get a misaligned superintelligence. Misalignment could decrease thanks to generalization, but we have no way of knowing whether that is the case, and it is optimistic to assume so.
A misaligned superintelligence will destroy humanity. This can lead to a lot of fear, or a wish to die with dignity, but that misses the point.
We only live in the present, not the past or the future. We know the future does not look good, but that doesn't matter, because we live in the present, and we constantly work on making the present better by expected value.
By remaining in the present, we can make the best decisions and put all of our effort into AI safety. We are not attached to the outcome of our work because we live in the present. But, we still model the best decision to make.
We model that our work on AI safety is the best decision because it buys us more time by expected value. And, maybe, we can buy enough time to upload our mind to a device so our continuation lives on despite our demise.
Then, even knowing the future is bad, we remain happy in the present working on AI safety.
Current alignment methods like RLHF and Constitutional AI often use human evaluators. But these samples might not truly represent humanity or its diverse values. How can we model all human preferences with limited compute?
Maybe we should try collecting a vast dataset of written life experiences and beliefs from people worldwide.
We could then filter and augment this data to aim for better population representation. Vectorize these entries and use k-means to find 'k' representative viewpoints.
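A minimal sketch of that clustering step, using random vectors as stand-ins for real sentence embeddings and scikit-learn's KMeans (k and the dataset sizes are arbitrary choices):

```python
# Sketch of the proposal above: embed written life experiences and beliefs, then
# use k-means to extract k centroid "viewpoints" as a compact proxy for the
# population. Random vectors stand in for real sentence embeddings.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_entries, dim, k = 10_000, 384, 50              # hypothetical sizes
embeddings = rng.normal(size=(n_entries, dim))   # placeholder for real embeddings

kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
representative_vectors = kmeans.cluster_centers_           # the k viewpoint proxies
cluster_sizes = np.bincount(kmeans.labels_, minlength=k)   # people each proxy represents

print(representative_vectors.shape, cluster_sizes.min(), cluster_sizes.max())
```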
These 'k' vectors could then serve as a more representative proxy for humanity's values when we evaluate AI alignment. Thoughts? Potential issues?
Current AI alignment often relies on single metrics or evaluation frameworks. But what if a single metric has blind spots or biases, leading to a false sense of security or unnecessary restrictions? How can we not just use multiple metrics, but optimally use them?
Maybe we should try requiring AI systems to satisfy multiple, distinct alignment metrics, but with a crucial addition: we actively model the false positive rate of each metric and the non-overlapping aspects of alignment they capture.
Imagine we can estimate the probability that Metric A incorrectly flags an unaligned AI as aligned (its false positive rate), and similarly for Metrics B and C. Furthermore, imagine we understand which specific facets of alignment each metric uniquely assesses.
We could then select a subset of metrics, or even define a threshold of "satisfaction" across multiple metrics, based on a target false positive rate for the overall alignment evaluation.
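A toy sketch of that selection step, assuming the metrics' false positives are independent and that an AI must satisfy every selected metric (both strong assumptions; the FPR estimates and target are made up):

```python
# Toy sketch: choose the smallest set of alignment metrics whose combined false
# positive rate (probability an unaligned system passes ALL of them) meets a
# target, assuming independent false positives. FPR estimates are illustrative.

from itertools import combinations
from math import prod

metric_fpr = {"A": 0.20, "B": 0.10, "C": 0.30, "D": 0.05}  # hypothetical estimates
target_fpr = 0.002

best = None
for size in range(1, len(metric_fpr) + 1):
    for subset in combinations(metric_fpr, size):
        combined = prod(metric_fpr[m] for m in subset)  # independence assumption
        if combined <= target_fpr:
            best = (subset, combined)
            break
    if best:
        break

print(best)  # first subset (by size) whose combined FPR is at or below the target
```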
OpenAI's Superalignment team aimed for a roughly human-level automated alignment researcher, developed through scalable training methods, validation, and stress testing.
I propose a faster route: first develop an automated Agent Alignment Engineer. This system would automate the creation of aligned agents for diverse tasks by iteratively refining agent group chats, prompts, and tools until they pass success and safety evaluations.
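A high-level sketch of that refinement loop (every helper below is a hypothetical stub, not an existing framework's API; the real success and safety evaluations are the hard part):

```python
# Sketch of the proposed Agent Alignment Engineer loop: propose an agent
# configuration (prompts, tools, group-chat structure), evaluate it for task
# success and safety, and refine until both pass or a budget runs out.

from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentConfig:
    system_prompt: str
    tools: list[str]
    notes: str = ""


def evaluate_success(config: AgentConfig, task: str) -> float:
    return 0.90  # stub: fraction of task-specific success evals passed


def evaluate_safety(config: AgentConfig, task: str) -> float:
    return 0.95  # stub: fraction of safety/red-team evals passed


def refine(config: AgentConfig, feedback: str) -> AgentConfig:
    # Stub: in the proposal, an LLM would rewrite prompts, tools, and the
    # group-chat structure based on the evaluation feedback.
    return AgentConfig(config.system_prompt, config.tools, notes=feedback)


def build_aligned_agent(task: str, max_iters: int = 10) -> Optional[AgentConfig]:
    config = AgentConfig(system_prompt=f"You are an agent for: {task}", tools=["search"])
    for _ in range(max_iters):
        success = evaluate_success(config, task)
        safety = evaluate_safety(config, task)
        if success >= 0.95 and safety >= 0.99:  # hypothetical pass thresholds
            return config                       # agent passes both eval suites
        config = refine(config, f"success={success}, safety={safety}")
    return None  # escalate to human review instead of deploying


print(build_aligned_agent("generate adversarial alignment tests"))
```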
This is tractable with today's LLM reasoning, as evidenced by coding agents that rival top competitive programmers. Instead of directly building a full Alignment Researcher, focusing on this intermediate step leverages current LLM strengths for agent orchestration. This system could then automate many parts of creating a broader Alignment Researcher.
Safety for the Agent Alignment Engineer can be largely ensured by operating in internet-disconnected environments (except for fetching research) with subsequent human verification of agent alignment and capability.
Examples: This Engineer could create agents that develop scalable training methods or generate adversarial alignment tests.
By prioritizing this more manageable stepping stone, we could significantly accelerate progress towards safe and beneficial advanced AI.