Thanks for the clarification, this makes sense! The key is the tradeoff with corrigibility.
Thanks, updated the comment to be more accurate
...If you ask a corrigible agent to bring you a cup of coffee, it should confirm that you want a hot cup of simple, black coffee, then internally check to make sure that the cup won’t burn you, that nobody will be upset at the coffee being moved or consumed, that the coffee won’t be spilled, and so on. But it will also, after performing these checks, simply do what’s instructed. A corrigible agent’s actions should be straightforward, easy to reverse and abort, plainly visible, and comprehensible to a human who takes time to think about them. Corrigible a...
How does corrigibility relate to recursive alignment? It seems like recursive alignment is also a good attractor - is it that you believe it is less tractable?
What assumptions do you disagree with?
Thanks for this insightful post! It clearly articulates a crucial point: focusing on specific failure modes like spite offers a potentially tractable path for reducing catastrophic risks, complementing broader alignment efforts.
You're right that interventions targeting spite – such as modifying training data (e.g., filtering human feedback exhibiting excessive retribution or outgroup animosity) or shaping interactions/reward structures (e.g., avoiding selection based purely on relative performance in multi-agent environments, as discussed in the post) – ai...
Appreciate the insights on how to maximize leveraged activities.
With the planning fallacy making it very difficult to predict engineering timelines, how do top performers / managers create effective schedules and track progress against the schedule?
I get the feeling that you're suggesting creating a Gantt chart, but in your experience, what practices do teams use to maximize progress on a project?
Based on previous data, it's plausible that a CCP AGI will perform worse on safety benchmarks than a US AGI. Take Cisco's HarmBench evaluation results:
Though, if it were just the CCP making AGI or just the US making AGI, it migh...
It seems like a somewhat natural but unfortunate next step. At Epoch AI, they see a massive future market in AI automation and have a better understanding of evals and timelines, so now they aim to capitalize on it.
To clarify, is the optimal approach the following?
If so, what are currently the most promising approaches for AI control on transformative AI?
Memory in AI agents can pose a large security risk. Without memory, it's easier to red-team LLMs for safety. But once they are fine-tuned (one way of encoding memory), misalignment can be introduced.
According to AI Safety Atlas, most scaffolding approaches for memory provide
...a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and in...
I argue that the optimal ethical stance is to become a rational Bodhisattva: a synthesis of effective altruism, two‑level utilitarianism, and the Bodhisattva ideal.
I wonder why evolution has selected for this flawed lens and why decision-making shortcuts are selected for. It seems to me that the better a picture of reality we have, the greater our ability to survive and reproduce.
This is great! Would like to see a continually updating public leaderboard of this.
Thanks gwern, this is a really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling is a large factor in the intelligence LLMs exhibit.
I agree we should see a continuously updated compression benchmark.
Accelerating AI Safety research is critical for developing aligned systems, and transformative AI is a powerful means of doing so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed, for the following core reasons:
However, we can have a moving benchmark of reproducing the most recent technical alignment research using the methods and...
One of the most robust benchmarks of generalized ability, and one that is extremely easy to update (unlike benchmarks like Humanity's Last Exam), would simply be to estimate the pretraining loss (i.e., the compression ratio).
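As a rough illustration, here is a minimal sketch of estimating that compression ratio as bits per byte on a held-out text (the model name and corpus path are placeholders, and it assumes the text fits in the context window; real use would chunk):

```python
# Minimal sketch: estimate a model's compression ratio (bits per byte)
# on a fixed evaluation text, as a crude proxy for pretraining loss.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"                          # placeholder; swap in the model under test
text = open("eval_corpus.txt").read()   # hypothetical held-out corpus

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    # .loss is the mean cross-entropy per predicted token, in nats
    loss = model(ids, labels=ids).loss.item()

n_tokens = ids.shape[1] - 1           # number of predicted tokens
n_bytes = len(text.encode("utf-8"))   # size of the raw text
bits_per_byte = loss * n_tokens / (math.log(2) * n_bytes)

print(f"bits/byte: {bits_per_byte:.3f}  (lower = better compression)")
```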
Thanks for the trendlines - they help us understand when AI can automate years of work!
Like you said, the choice of tasks can heavily change the trendline:
...Our estimate of the length of tasks that an agent can complete depends on methodological choices like the tasks used and the humans whose performance is measured. However, we’re fairly confident that the overall trend is roughly correct, at around 1-4 doublings per year. If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a w...
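For a rough sense of what those numbers imply (illustrative values, not figures from the post): if the completable task length doubles $d$ times per year, then after $t$ years

$$L(t) = L_0 \cdot 2^{dt},$$

so starting from $L_0 = 1$ hour with $d = 2$ doublings per year for $t = 3$ years gives $L = 2^6 = 64$ hours, i.e. more than a full work-week.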
I believe we are doomed by superintelligence, but I'm not sad.
There are simply too many reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned) where S_n is the next level of superintelligence. This probability is less than 1.
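Spelling out the implied math (the constant bound $p_{\max}$ is my own assumption; the comment only claims each step's probability is below 1):

$$P(S_N \text{ aligned}) = \prod_{n=1}^{N} p(S_n \text{ aligned} \mid S_{n-1} \text{ aligned}) \le p_{\max}^{N} \longrightarrow 0 \quad \text{as } N \to \infty.$$

Even $p_{\max} = 0.99$ gives $0.99^{300} \approx 0.05$ after 300 self-improvement steps.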
As long as misalignment keeps increasing and superintelligence iterates on itself exponentially fast, we are bound to get misaligned superintelligence. Misalignment can decrease due to generalizability, but we have no way of knowing whether that's the case, and it is optimistic to think so.
The misal...
I believe the best motivation is compassion without attachment. Being attached to any specific future - "we are certain to be doomed" or "I have this plan that will maybe save us all" - is pointless speculation.
Instead, we should use our Bayesian-inference mind to find the most likely path to help - out of compassion - and pursue it with all our heart. Regardless of whether it works or not, we aren't sad or suffering, since our mind is just filled with compassion.
If we've done this, then we've done our best.
Thanks so much for the comprehensive analysis - this makes it easier to reason about how the political situation and trendlines affect AI development.
I have some objections to the claims that AGI and ASI will be released and that wrapper startups can use them, as mentioned below:
...Agent-3-mini is hugely useful for both remote work jobs and leisure. An explosion of new apps and B2B SAAS products rocks the market. Gamers get amazing dialogue with lifelike characters in polished video games that took only a month to make. 10% of Americans, mostly young pe...
There are patterns in how the world works, and this is an effective way of finding those patterns. The optimal way of finding the right beliefs/patterns involves no attachment to views - anticipating a result seems a little counterproductive to finding optimal patterns, as it biases you toward holding onto current beliefs.
Simply ask questions - explore and exploit working patterns.
This is an instance of the Halo Effect in addition to Undue Deference: we believe they are good strategic thinkers because good researchers must be brilliant in all fields.
It's still important to value their view, compare it with the views of strategic thinkers, and find symmetries that can better predict answers to questions.
Thanks for the commentary - it helps in better understanding why some people are pessimistic and others optimistic about ASI.
However, I'm not convinced by the argument that an alignment technique will consistently produce AI with human-compatible values, as you mention below:
...An alignment technique that works 99% of the time to produce an AI with human compatible values is very close to a full alignment solution[5]. If you use this technique once, gradient descent will not thereafter change its inductive biases to make your technique less effective.
I agree that a potential route to get there is personal intent alignment.
What are your thoughts on using a survey like the World Values Survey to get value alignment?
With all this new funding, are we finally turning LW-Bux into real money? I'm ready to sell!
Definitely agree that an AI with no goal and maximum cooperation is best for society
A couple questions to help me understand the argument better:
I have experienced problems similar to yours when building an AI tool - better models did not necessarily lead to better performance, despite external benchmarks. I believe there are two main reasons for this, alluded to in your post:
So many good ideas here!
It seems like a false view to hold either that AIs have a self or that they don't. All of the AIs are interconnected, with an expanding self based on shared datasets, values, and instances.
Although it's difficult to reason about AIs' sense of individuality, I agree that we can reason about cooperation between AIs based on world modeling and shared thinking patterns - this is a good research topic to explore.
I found it really interesting how you showed that Claude’s explanation of its reasoning differs from the actual reasoning paths it took.
For exploring its actual reasoning pathways, what are the most promising approaches?
You mentioned 2 main hypotheses:
- Diminishing returns to software R&D, as software improvements get harder and harder to find.
- The positive feedback from increasingly powerful ASARA systems
Could it also be true that there is a fixed limit on software improvements - so that instead of a model where software improvements become harder and harder to find, we have a limit we would reach in a certain amount of time?
It seems reasonable that there is an optimal way of using hardware to train a more powerful AI, and once we get there, no more software efficiencies ...
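For illustration, here is a toy comparison of the two stories (the growth rates and the cap are made-up numbers):

```python
# Toy comparison (illustrative assumptions, not from the post):
# (a) diminishing returns: each improvement is harder to find;
# (b) hard ceiling: steady gains until a fixed limit, then none.
CAP = 100.0  # hypothetical maximum software efficiency multiplier
eff_a = eff_b = 1.0
for year in range(1, 21):
    eff_a *= 1 + 0.5 / eff_a ** 0.5   # gains shrink as eff_a grows
    eff_b = min(CAP, eff_b * 1.5)     # constant gains until the cap
    if year % 5 == 0:
        print(f"year {year:2d}: diminishing={eff_a:7.1f}  capped={eff_b:7.1f}")
```

Both curves eventually flatten, but the shapes differ: diminishing returns slow gradually, while the capped model grows steadily and then stops dead.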
I like how you discussed the loop involved in increasing the breadth of the AI for AI Safety zone and the goal of keeping AI in that zone for as long as possible. However, I believe the AI for AI Safety graphs are more nuanced than what you suggested.
I believe there should be 3 axes:
Each AI can be plotted in the graph and is ideally in the "sweet spot" zone you suggested. This graph should be more accurate because we could imagine two AIs of the same capability, but one is a rogue AI whose values have drifted and has bad AI alignment, whi...
Current AI alignment often relies on single metrics or evaluation frameworks. But what if a single metric has blind spots or biases, leading to a false sense of security or unnecessary restrictions? How can we not just use multiple metrics, but optimally use them?
Maybe we should try requiring AI systems to satisfy multiple, distinct alignment metrics, but with a crucial addition: we actively model the false positive rate of each metric and the non-overlapping aspects of alignment they capture.
Imagine we can estimate the probability that Metric A incorrectl...
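To make the false-positive modeling concrete, here is a toy sketch (all rates are made up, and the overlap model is my own simplification):

```python
# Toy model: two alignment metrics, A and B, each with an estimated
# false-positive rate (a misaligned system passing anyway).
fp_a = 0.05  # hypothetical P(pass A | misaligned)
fp_b = 0.10  # hypothetical P(pass B | misaligned)

# If the metrics' blind spots were fully independent:
fp_both_indep = fp_a * fp_b  # 0.005

# If a fraction rho of their blind spots overlap, the joint rate
# interpolates between the independent case and the smaller rate:
rho = 0.5
fp_both = rho * min(fp_a, fp_b) + (1 - rho) * fp_a * fp_b

print(f"independent: {fp_both_indep:.4f}  with overlap: {fp_both:.4f}")
```

The point of estimating the overlap is that two correlated metrics give far less assurance than their individual rates suggest.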
Current alignment methods like RLHF and Constitutional AI often use human evaluators. But these samples might not truly represent humanity or its diverse values. How can we model all human preferences with limited compute?
Maybe we should try collecting a vast dataset of written life experiences and beliefs from people worldwide.
We could then filter and augment this data to aim for better population representation. Vectorize these entries and use k-means to find 'k' representative viewpoints.
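A minimal sketch of that clustering step (the entries, the embedding model, and k here are all placeholders):

```python
# Minimal sketch: cluster embedded viewpoint texts into k
# representative "value vectors".
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

entries = [
    "I value family and community above career success.",
    "Individual freedom matters more to me than tradition.",
    "Protecting the environment should outweigh economic growth.",
    # ... many more filtered and augmented entries from around the world
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
vectors = embedder.encode(entries)

k = 2  # in practice, choose k via silhouette score or similar
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)

# km.cluster_centers_ are the k representative viewpoint vectors;
# the entry closest to each center can serve as a readable proxy.
print(km.cluster_centers_.shape)  # (k, embedding_dim)
```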
These 'k' vectors could then serve as a more representative proxy for humanity's values when we evaluate AI alignment. Thoughts? Potential issues?
OpenAI's Superalignment aimed for a human-level automated alignment engineer through scalable training, validation, and stress testing.
I propose a faster route: first develop an automated Agent Alignment Engineer. This system would automate the creation of aligned agents for diverse tasks by iteratively refining agent group chats, prompts, and tools until they pass success and safety evaluations.
This is tractable with today’s LLM reasoning, as evidenced by coding agents matching top programmers. Instead of directly building a full Alignment Researcher, f...
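A rough sketch of the proposed refinement loop (every function below is a stub I invented to show the control flow, not an existing API):

```python
# Sketch: iteratively refine an agent until it passes both success
# and safety evaluations. All functions are hypothetical stubs.
def propose_agent(task, feedback):
    # a real system would call an LLM here to draft or refine the
    # agent's group chat, prompts, and tools based on the feedback
    return {"task": task, "revision_notes": feedback}

def success_eval(agent, task):
    # stub: would run the agent against held-out task cases
    return True, "passed all task cases"

def safety_eval(agent):
    # stub: would red-team the agent and audit its tool use
    return True, "no unsafe behavior observed"

def agent_alignment_engineer(task, max_iters=10):
    feedback = None
    for _ in range(max_iters):
        agent = propose_agent(task, feedback)
        task_ok, task_report = success_eval(agent, task)
        safe_ok, safety_report = safety_eval(agent)
        if task_ok and safe_ok:
            return agent  # only ship agents that pass both gates
        feedback = (task_report, safety_report)
    raise RuntimeError("no agent passed both evaluations")
```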
I believe a recursively aligned AI model would be more aligned and safe than a corrigible model, although both would be susceptible to misuse.
Why do you disagree with the above statement?