I argue that the optimal ethical stance is to become a rational Bodhisattva: a synthesis of effective altruism, two‑level utilitarianism, and the Bodhisattva ideal.
A rational Bodhisattva combines the strengths of each of these while cancelling out their weaknesses.
Illustration:
Your grandparent needs $50,000 for a life‑saving treatment, but the same money could save ten strangers through a GiveWell charity.
Thus, the rational Bodhisattva unites rigorous impact with deep inner peace.
Memory in AI agents can pose a significant security risk. Without memory, it's easier to red-team LLMs for safety. But once they are fine-tuned (a form of encoding memory), misalignment can be introduced.
According to AI Safety Atlas, most scaffolding approaches for memory provide a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and integrated solution could involve utilizing the neural network's weights as dynamic memory, constantly evolving and updating based on the tasks performed by the network.
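For concreteness, here is a minimal sketch of such a scaffolded memory: an external store the agent writes to and retrieves from before answering. The store, the toy similarity function, and the stubbed LLM call are all illustrative assumptions, not any particular framework's API.

```python
# Minimal sketch of a scaffolded memory module: the agent writes observations to
# an external store and retrieves the most relevant ones to condition its next
# response. Names and the toy similarity function are illustrative assumptions.

from dataclasses import dataclass, field


@dataclass
class ExternalMemory:
    entries: list[str] = field(default_factory=list)

    def write(self, text: str) -> None:
        # In a deployed agent this is the step that needs a safety gate:
        # whatever is stored here persists across red-teaming sessions.
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Toy relevance score: word overlap. A real scaffold would use
        # vector embeddings and approximate nearest-neighbour search.
        def score(entry: str) -> int:
            return len(set(entry.lower().split()) & set(query.lower().split()))
        return sorted(self.entries, key=score, reverse=True)[:k]


def answer(query: str, memory: ExternalMemory) -> str:
    context = memory.retrieve(query)
    # The LLM call is stubbed out: the key point is that memory enters only
    # through the prompt, so it can be inspected or filtered before use.
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    return f"[stubbed LLM response to a {len(prompt)}-char prompt with {len(context)} memories]"


memory = ExternalMemory()
memory.write("The user prefers concise answers.")
memory.write("Task from yesterday: summarize the alignment paper.")
print(answer("How should I format my answer?", memory))
```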
We need ways to ensure safety in powerful agents with memory, or we should not introduce memory modules at all. Otherwise, agents that are constantly learning can acquire motivations not aligned with human volition.
Any thoughts on ensuring safety in agents that can update their memory?
Accelerating AI safety research is critical for developing aligned systems, and transformative AI is a powerful way to do so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed.
Instead, we can use a moving benchmark: reproducing the most recent technical alignment research from each paper's stated methods and data.
By reproducing a paper, something many researchers already do, we ensure the system has real-world research ability.
By only using papers published after a given model finished training, we ensure the benchmark data has not leaked into its training set.
This comes with the inconvenience that it is difficult to compare models or approaches across two papers because the benchmark is always changing, but I argue this is a worthwhile tradeoff.
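As a sketch of the leakage control described above, assuming we have paper metadata with publication dates and a known training cutoff for each model (the Paper structure and the dates below are hypothetical):

```python
# Sketch of the data-leakage control: only papers published after a model's
# training cutoff are eligible benchmark tasks for that model.
# The Paper structure, titles, and dates are hypothetical.

from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    title: str
    published: date


def eligible_papers(papers: list[Paper], training_cutoff: date) -> list[Paper]:
    """Keep only papers the model cannot have seen during pretraining."""
    return [p for p in papers if p.published > training_cutoff]


papers = [
    Paper("Scalable oversight via debate (hypothetical)", date(2024, 11, 2)),
    Paper("Sparse autoencoder features (hypothetical)", date(2023, 6, 10)),
]
print([p.title for p in eligible_papers(papers, training_cutoff=date(2024, 3, 1))])
```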
Any thoughts?
One of the most robust benchmarks of generalized ability, and one that is extremely easy to update (unlike benchmarks like Humanity's Last Exam), would simply be to estimate the pretraining loss (i.e., the compression ratio).
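For reference, a minimal sketch of how a pretraining loss converts into a compression figure, assuming we can measure a model's average loss on held-out text (the numbers below are made up):

```python
# Sketch: converting a language model's average loss on held-out text into a
# compression estimate. Cross-entropy in nats/token -> bits/token -> bits/byte,
# which is directly comparable to a compressor's ratio. Numbers are illustrative.

import math

avg_loss_nats_per_token = 2.1   # hypothetical pretraining loss on held-out text
avg_bytes_per_token = 3.8       # hypothetical tokenizer average

bits_per_token = avg_loss_nats_per_token / math.log(2)
bits_per_byte = bits_per_token / avg_bytes_per_token
compression_ratio = 8.0 / bits_per_byte  # raw bytes are 8 bits each

print(f"{bits_per_byte:.3f} bits/byte ≈ {compression_ratio:.1f}x compression")
```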
It's good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. So it needs to be started back up, with approximate estimators added for the API/chatbot models too, before it can start providing a good universal capability benchmark in near-realtime.
Thanks gwern, the correlation between compression ratio and intelligence is really interesting. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling accounts for a large part of the intelligence they exhibit.
I agree we should see a continued compression benchmark.
I believe we are doomed by superintelligence, but I'm not sad.
There are simply too many reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned), where S_n is the n-th generation of superintelligence and S_{n-1} is its predecessor. This probability is less than 1.
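To make the implied step explicit (a minimal calculation, assuming a misaligned generation never produces an aligned successor, and that each conditional probability is at most some p_max < 1):

$$P(S_n \text{ aligned}) \;=\; P(S_0 \text{ aligned}) \prod_{i=1}^{n} p(S_i \text{ aligned} \mid S_{i-1} \text{ aligned}) \;\le\; p_{\max}^{\,n} \;\to\; 0 \quad \text{as } n \to \infty.$$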
As long as the probability of misalignment keeps compounding in this way and superintelligence iterates on itself exponentially fast, we are bound to get a misaligned superintelligence. Misalignment could decrease thanks to generalization, but we have no way of knowing whether that is the case, and it is optimistic to assume so.
A misaligned superintelligence will destroy humanity. This can lead to a lot of fear, or a wish to die with dignity, but that misses the point.
We only live in the present, not the past or the future. We know the future does not look good, but that doesn't matter, because we live in the present, and we constantly work on making the present better by expected value.
By remaining in the present, we can make the best decisions and put all of our effort into AI safety. We are not attached to the outcome of our work because we live in the present. But, we still model the best decision to make.
We model that our work on AI safety is the best decision because it buys us more time by expected value. And, maybe, we can buy enough time to upload our mind to a device so our continuation lives on despite our demise.
Then, even knowing the future is bad, we remain happy in the present working on AI safety.
Current alignment methods like RLHF and Constitutional AI often use human evaluators. But these samples might not truly represent humanity or its diverse values. How can we model all human preferences with limited compute?
Maybe we should try collecting a vast dataset of written life experiences and beliefs from people worldwide.
We could then filter and augment this data to aim for better population representation. Vectorize these entries and use k-means to find 'k' representative viewpoints.
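A minimal sketch of that clustering step, using random vectors as stand-ins for real sentence embeddings and scikit-learn's KMeans (k and the dataset sizes are arbitrary choices):

```python
# Sketch of the proposal above: embed written life experiences and beliefs, then
# use k-means to extract k centroid "viewpoints" as a compact proxy for the
# population. Random vectors stand in for real sentence embeddings.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_entries, dim, k = 10_000, 384, 50              # hypothetical sizes
embeddings = rng.normal(size=(n_entries, dim))   # placeholder for real embeddings

kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
representative_vectors = kmeans.cluster_centers_           # the k viewpoint proxies
cluster_sizes = np.bincount(kmeans.labels_, minlength=k)   # people each proxy represents

print(representative_vectors.shape, cluster_sizes.min(), cluster_sizes.max())
```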
These 'k' vectors could then serve as a more representative proxy for humanity's values when we evaluate AI alignment. Thoughts? Potential issues?
Current AI alignment often relies on single metrics or evaluation frameworks. But what if a single metric has blind spots or biases, leading to a false sense of security or unnecessary restrictions? How can we not just use multiple metrics, but optimally use them?
Maybe we should try requiring AI systems to satisfy multiple, distinct alignment metrics, but with a crucial addition: we actively model the false positive rate of each metric and the non-overlapping aspects of alignment they capture.
Imagine we can estimate the probability that Metric A incorrectly flags an unaligned AI as aligned (its false positive rate), and similarly for Metrics B and C. Furthermore, imagine we understand which specific facets of alignment each metric uniquely assesses.
We could then select a subset of metrics, or even define a threshold of "satisfaction" across multiple metrics, based on a target false positive rate for the overall alignment evaluation.
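A toy sketch of that selection step, assuming the metrics' false positives are independent and that an AI must satisfy every selected metric (both strong assumptions; the FPR estimates and target are made up):

```python
# Toy sketch: choose the smallest set of alignment metrics whose combined false
# positive rate (probability an unaligned system passes ALL of them) meets a
# target, assuming independent false positives. FPR estimates are illustrative.

from itertools import combinations
from math import prod

metric_fpr = {"A": 0.20, "B": 0.10, "C": 0.30, "D": 0.05}  # hypothetical estimates
target_fpr = 0.002

best = None
for size in range(1, len(metric_fpr) + 1):
    for subset in combinations(metric_fpr, size):
        combined = prod(metric_fpr[m] for m in subset)  # independence assumption
        if combined <= target_fpr:
            best = (subset, combined)
            break
    if best:
        break

print(best)  # first subset (by size) whose combined FPR is at or below the target
```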
OpenAI's Superalignment team aimed for a roughly human-level automated alignment researcher, developed through scalable training methods, validation, and stress testing.
I propose a faster route: first develop an automated Agent Alignment Engineer. This system would automate the creation of aligned agents for diverse tasks by iteratively refining agent group chats, prompts, and tools until they pass success and safety evaluations.
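A high-level sketch of that refinement loop (every helper below is a hypothetical stub, not an existing framework's API; the real success and safety evaluations are the hard part):

```python
# Sketch of the proposed Agent Alignment Engineer loop: propose an agent
# configuration (prompts, tools, group-chat structure), evaluate it for task
# success and safety, and refine until both pass or a budget runs out.

from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentConfig:
    system_prompt: str
    tools: list[str]
    notes: str = ""


def evaluate_success(config: AgentConfig, task: str) -> float:
    return 0.90  # stub: fraction of task-specific success evals passed


def evaluate_safety(config: AgentConfig, task: str) -> float:
    return 0.95  # stub: fraction of safety/red-team evals passed


def refine(config: AgentConfig, feedback: str) -> AgentConfig:
    # Stub: in the proposal, an LLM would rewrite prompts, tools, and the
    # group-chat structure based on the evaluation feedback.
    return AgentConfig(config.system_prompt, config.tools, notes=feedback)


def build_aligned_agent(task: str, max_iters: int = 10) -> Optional[AgentConfig]:
    config = AgentConfig(system_prompt=f"You are an agent for: {task}", tools=["search"])
    for _ in range(max_iters):
        success = evaluate_success(config, task)
        safety = evaluate_safety(config, task)
        if success >= 0.95 and safety >= 0.99:  # hypothetical pass thresholds
            return config                       # agent passes both eval suites
        config = refine(config, f"success={success}, safety={safety}")
    return None  # escalate to human review instead of deploying


print(build_aligned_agent("generate adversarial alignment tests"))
```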
This is tractable with today's LLM reasoning, as evidenced by coding agents that rival top competitive programmers. Instead of directly building a full Alignment Researcher, focusing on this intermediate step leverages current LLM strengths for agent orchestration. This system could then automate many parts of creating a broader Alignment Researcher.
Safety for the Agent Alignment Engineer can be largely ensured by operating in internet-disconnected environments (except for fetching research) with subsequent human verification of agent alignment and capability.
Examples: This Engineer could create agents that develop scalable training methods or generate adversarial alignment tests.
By prioritizing this more manageable stepping stone, we could significantly accelerate progress towards safe and beneficial advanced AI.