We work on this mission because we believe that if we all collectively know more about AI, we will make better decisions on average.
Indeed, this is what I'd expect by default, but I wonder how much I should update from having seen not one, but three members of staff recently make an incredibly bad decision, especially since you'd expect an Epoch staff member to be far more informed than the average reader.
People used to believe that the Internet would usher in an Age of Enlightenment. Yeah, that didn't happen.
Like, what looks like my motte and what looks like my bailey?
Between an overwhelmingly strong effect and a pretty strong effect.
There are precious few, if any, arguments of the form "But here's a logical reason why we're doomed!" that can distinguish between these two worlds.
Feels like that's a Motte and Bailey?
Sure, if you believe that the claimed psychological effects are very, very strong, then that's the case.
However, let's suppose you don't believe they're quite that strong. Even if you believe that these effects are pretty strong, sufficiently strong arguments may still provide decent evidence.
In terms of why the evidence I provided is strong:
So it's not just that these are strong arguments. It's that these are arguments that you might expect to provide some signal even if you thought the claimed effect was strong, but not overwhelmingly so.
I don't know man, seems a lot less plausible now that we have the CAIS letter and inference-time compute has started demonstrating a lot of concrete examples of AIs misbehaving[1].
So before the CAIS letter there were some good outside view arguments, and in the ChatGPT 3.5-4 era alignment-by-default seemed quite defensible[2], but now we're in an era where neither of these is the case, so your argument packs a lot less of a punch.
In retrospect, it isn't plausible that IDA with some improvements yields a scalable alignment solution (especially while being competitive).
Why do you say this?
A major factor for me is the extent to which they expect the book to bring new life into the conversation about AI Safety. One problem with running a perfectly neutral forum is that people explore 1000 different directions at the cost of moving the conversation forward. There's a lot of value in terms of focusing people's attention in the same direction such that progress can be made.
Lots of fascinating points, however:
a) You raise some interesting points about how the inner character is more underdefined than people often realise, but I think it's also worth flagging that there's less of a void these days given that a lot more effort is being put into writing detailed model specs.
b) I am less dismissive about the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario; however, I think you've neglected the potential for us to apply filtering to the training data. Whilst I don't think the solution will be that simple, I don't think the relation is quite as straightforward as you claim.
c) The discussion of "how do you think the LLMs feel about these experiments" is interesting, but it is also overly anthropomorphic. LLMs are anthropomorphic to a certain extent, having been trained on human data, but it is still mistaken to run a purely anthropomorphic analysis that doesn't account for other training dynamics.
d) Whilst you make a good point in terms of how the artificiality of the scenario might be affecting the experiment, I feel you're being overly critical of some of the research into how models might misbehave. Single papers are rarely definitive and often there's value in just showing a phenomenon exists in order to spur further research on it, which can explore a wider range of theories about mechanisms. It's very easy to say "oh, this is poor quality research because it doesn't address my favourite objection". I've probably fallen into this trap myself. However, the number of possible objections that could be made is often pretty large and if you never published until you addressed everything, you'd most likely never publish.
e) I worry that some of your skepticism of the risks manages to be persuasive by casting vague aspersions that are disconnected from the actual strength of the arguments. You're like "oh, the future, the future, people are always saying it'll happen in the future", which probably sounds convincing to folks who haven't been following that closely, but it's a lot less persuasive if you know that we've been consistently seeing stronger results over time (in addition to a recent spike in anecdotes with the new reasoning models). This is just a natural part of the process: when you're trying to figure out how to conduct solid research in a new domain, of course it's going to take some time.
I think it's valuable for some people to say that it's a terrible idea in advance so they have credibility after things go wrong.
Why the focus on wise AI advisors? (plus FAQ 🙋🏻♂️) - Party Edition - Working Draft
About this post
This post is being written collaboratively by Chris Leong (primary author) and Christopher Clay (second author) in the voice of Chris Leong. We hope you learn something from it.
This is the party edition of the post. I'm also working on an edition for more formal contexts.
This post is still in draft, so any feedback would be greatly appreciated. It'll be posted as a full, proper Less Wrong/EA Forum/Alignment forum post, as opposed to just a short-form, when it's ready.
Links:
🔗 Short-link
✉️ PM via LW profile
📲 QR Code
Acknowledgements:
The initial version of this post was produced during the 10th edition of AI Safety Camp. Thanks to my team: Matthew Hampton, Richard Kroon, and Chris Cooper, for their feedback. Unlike most other AI Safety Camp projects, which focus on a single, unitary project, we were more of a research collective with each person pursuing their own individual projects.
Further acknowledgements: Jonathan Kummerfeld, and many others, including some I've unfortunately forgotten.
📖 Basic Terminology - Wise AI advisors? Wisdom? (I recommend skipping initially) 🙏
⬇️ I suggest initially skipping this section and focusing on the core argument for now[2] ⬇️
Wise AI Advisors?
Do you mean?:
• AIs trained to provide advice to humans ✅
• AIs trained to act wisely in the world ❌
• Humans trained to provide wise advice about AI ❌
Most work on Wise AI is focused on the question of how AI could learn to act wisely in the world[3]; however, I'm more interested in the advisor approach as it allows humans to compensate for the weaknesses in the AIs[4].
Even though the term "Wise AI Advisors" doesn't refer to humans, I am primarily interested in how Wise AI Advisors could be deployed as part of a human-AI cybernetic system.
Training humans to be wiser would help with this project:
• Wiser humans can train wiser AI
• When we combine AI and humans into a cybernetic system, wiser humans will both be better able to elicit capabilities from the AI and also better able to plug any gaps in the AI's wisdom.
What do you mean by wisdom?
For the purposes of the following argument, I'd encourage you to first consider this in relation to how you conceive of wisdom rather than worrying too much about how I conceive of wisdom. Two main reasons:
• I suspect this reduces the chance of losing someone partway through the argument because they conceive of wisdom slightly differently than I do[5].
• I believe that there are many different types of wisdom that are useful for steering the world in a positive direction and many perspectives on wisdom that are worth investigating. I'm encouraging readers to first consider this argument in relation to their own understanding of wisdom in order to increase the diversity of approaches pursued.
Even though I'd encourage you to read the core argument first, if you really want to hear more about how I conceive of wisdom right now, you can scroll down to the Clarifications section to find out more about what I believe. 🔜🧘 or ⬇️⬇️⬇️😔
⭐ My core argument... in a single sentence ☝️⭐
✋ A necessity? That's a strong claim! - 👍🛡️[6]
Yes, and I[7] stand by that. I believe that the future is unlikely to go well without them[8][9]. In worlds where we achieve good outcomes, I expect Wise AI Advisors to have played a substantial role.
I won't just be presenting my case; I'll also address a long list of objections[10] ⚔️🌎. Let's go 🪂!
Please note: This post is still a work in progress. Some sections have undergone more editing and refinement than others. Feedback is greatly appreciated 🙏.
⭐ My core argument... in a single sentence ☝️⭐
We're starting with a very abstract description, but it'll get more specific as we go on[11].
Unaided human wisdom is vastly insufficient for the task before us...
of navigating an entire series of highly uncertain and deeply contested decisions
where a single mistake could prove ruinous
with a greatly compressed timeline
⭐ ... or in less than a minute ⏱️⭐
🅅 𝚄 𝙻 𝙽 𝙴 𝚁 𝙰 𝙱 𝙸 𝙻 𝙸 𝚃 𝚈 – 🌊🚣:
The development of advanced AI technologies will have a massive impact on society[14] given the essentially infinite ways to deploy such a general technology. There are lots of ways this could go well, and lots of ways ~~we all die~~ this could go extremely poorly[15][16].
🅄 𝙽 𝙲 𝙴 𝚁 𝚃 𝙰 𝙸 𝙽 𝚃 𝚈 – 🌅💥:
We have massive disagreement about what to expect from the development of AI, let alone the best strategy[18]. Making the wrong call could prove catastrophic.
🅂 𝙿 𝙴 𝙴 𝙳 – 😅⏳:
AI is developing incredibly rapidly... We have limited time to act and to figure out how to act.[19]
Β) 🚨💪🦾: Whilst we can and should be developing top governance and strategy talent, this is unlikely to be sufficient by itself. Given the stakes, we need every advantage we can get; we can't afford to leave anything on the table (‼️)[21].
In light of this:
📌 I believe that AI is unlikely to go well for humanity unless we succeed in developing wise AI advisors...
I am serious about making this happen...
If you are serious about this too, please:
✉️ message me
after you've finished reading this post. Don't forget!
Next, I'll go through Clarifications and Objections to the initial argument, then I'll cover What makes this agenda so great. After that, I'll list some concrete projects and address some more questions 🙏.
🤔 Clarifications
Reminder: If you're looking for a definition of Wise AI Advisors, it's at the top of the page.
I've read through your argument and substituted my own understanding of wisdom. Now that ~~you've wasted my time~~ I've done this, perhaps you could clarify how you think about wisdom? 👍🙏
Sure. I've written at the top of this post ⬆️ why I often try to dodge this question initially. But given that you've made it this far, I'll share some thoughts. Just don't over-update on my views 😂.
Given how rapidly AI is developing, I suspect that we're unlikely to resolve the millennia-long philosophical debate about the true nature of wisdom before AGI is built. Therefore, I suggest that we instead sidestep this question by identifying specific capabilities related to wisdom that might be useful for steering the world in a positive direction.
I'd suggest that examples might include: the ability to make wise strategic decisions, non-manipulative persuasion and the ability to find win-wins. I'll try to write up a longer list in the future.
Some of these capabilities will be more or less useful for steering the world in positive directions. On the other hand, some may come with negative externalities, such as accelerating timelines or enabling malicious actors.
My goal is to figure out which capabilities to prioritise by balancing the benefits against the costs and then incorporating feasibility.
It may be the case that some people decide that wisdom requires different capabilities than those I end up deciding are important. As long as they pick a capability that isn't net-negative, I don't see that as bad. In fact, I see different people pursuing different understandings of wisdom as adding robustness to different world models.
If wisdom ultimately breaks down into specific capabilities, why not simply talk about these capabilities and avoid using a vague concept like wisdom? 👎
So the question is: "Why do I want to break wisdom down into separate capabilities instead of choosing a unitary definition of wisdom and attempting to train that into an AI system?"
Firstly, I think the chance of us being able to steer the world towards a positive direction is much higher if we're able to combine multiple capabilities together, so it makes sense to have a handle for the broader project, in addition to handles for individual sub-projects. I believe that techniques designed for one capability will often carry over to other capabilities, as will the challenges, and having a larger handle makes it easier to make these connections. I also think there's a chance that these capabilities amplify each other (as per the final few paragraphs of Imagining and Building Wise Machines[22] by Johnson, Bengio, Grossmann, et al).
Secondly, I believe we should be aiming to increase both human wisdom and AI wisdom simultaneously. In particular, I believe it's important to increase the wisdom of folks creating AI systems and that this will then prove useful for a wide variety of specific capabilities that we might wish to train.
Finally, I'm interested in investigating this frame as part of a more ambitious plan to solve alignment on a principled level. Instead of limiting the win condition to building an AI that always (competently) acts in line with human values, the Wise AI Advisors frame broadens it such that the AI only needs to inspire humans to make the right decision. It's hard to know in advance whether this reframing will be any easier, but even if it doesn't help, I have a strong intuition that understanding why it doesn't help would shed light on the barriers to solving the core alignment problem.
Weren't you focused on Wise AI Advisors via Imitation Learning before? 👍
Yep, I was focused on it before. I now see that goal as overly narrow. The goal is to produce Wise AI Advisors via any means. I think that Imitation Learning is underrated, but there are lots of other approaches that are worth exploring as well.
✋ Objections
Isn't this argument a bit pessimistic? 👎⚖️. I prefer to be optimistic. 👌
Optimism isn't about burying your head in the sand and ignoring the massive challenges facing us. That's denialism. Optimism is about rolling up your sleeves and doing what needs to be done.
The nice thing about this proposal from an optimistic standpoint is that, assuming there is a way for AI to go well for humanity, then it seems natural to expect that there is some way to leverage AI to help us find it[23].
Additionally, the argument for developing wise AI advisors isn't in any way contingent on a pessimistic view of the world. Even if you think AI is likely to go well by default, wise AI advisors could still be of great assistance for making things go even better. For example, facilitating negotiations between powers, navigating the safety-openness tradeoff and minimising any transitional issues.
But I don't think timelines are short. They could be long. Like more than a decade. 👌
Short timelines add force to the argument I've made above, but they aren't at all a necessary component[24].
Even if AI will be developed over decades rather than years, there are still enough different challenges and key decisions that unaugmented human wisdom is unlikely to be sufficient.
In fact, my proposal might even work better over long timelines, as it provides more time for AI advisers to help steer the world in a positive direction[25].
Don't companies have a commercial incentive to train wise AI by default? 👎
I'm extremely worried about the incentives created by a general chatbot product. The average user is low-context and this creates an incentive towards sycophancy 🤥.
I believe that a product aimed at providing advice for critically important decisions would provide better incentives, even if it were created by the same company.
Furthermore, given the potential for short timelines, it seems extremely risky 🎰 to rely purely on the profit motive, especially since there is a much stronger profit motive to pursue capabilities 💰💰💰. A few months' delay could easily mean that a wise AI advisor isn't available for a crucial decision 🚌💨. Humanity has probably already missed having such advisors available to assist us during a number of key decision points 😭.
Won't this just be used by malicious actors? Doesn't this just accelerate capabilities? 🤔👎
I expect both the benefits and the externalities to vary hugely by capability. I expect some to be positive, some to be negative and some to be extremely hard to determine. More work is required to figure out which capabilities are best from a differential technology development perspective.
I understand that this answer might be frustrating, but I think it's worth sharing these ideas even though I haven't yet had time to run this analysis. I have a list of projects that I hope will prove fairly robust, despite all the uncertainty.
Is there value in wisdom given that wisdom is often illegible and this makes it non-verifiable? 👍
Oscar Delany makes this argument in Tentatively against making AIs 'wise' (runner-up in the AI Impacts Essay Competition).
This will depend on your definition of wisdom.
I admit that this tends to be a problem with how I conceive of wisdom; however, I will note that Imagining and Building Wise Machines (summary) takes the opposite stance: that wisdom, conceived of as metacognition, can actually assist with explainability.
But let's assume that this is a problem. How much of a problem is it?
I suspect that this varies significantly by actor. There's a reasonable argument that public institutions shouldn't be using such tools for reasons of justice. However, these arguments have much less force when it comes to private actors.
Even for private actors, it makes sense to use more legible techniques as much as possible, but I don't think this will be sufficient for all decisions. In particular, I don't think objective reasoning is sufficient for navigating the key decisions facing society in the transition to advanced AI.
But I also want to push back against the claim of non-verifiability. You can do things like run a pool of advisors and only take dramatic action if more than a certain proportion agree (as sketched below), plus you can do testing, even advanced things like latent adversarial testing. It's not as verifiable as we'd like, but it's not like we're completely helpless here.
There will also be ways to combine wise AI advisors with more legible systems.
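To make the "pool of advisors" idea concrete, here's a minimal sketch of my own (not a proposal from this post) of a quorum rule over advisor verdicts. The function name `approved_by_quorum`, the `advisor_verdicts` list and the 0.8 threshold are all hypothetical placeholders.

```python
# Minimal sketch of a quorum rule over a pool of AI advisors: a "dramatic
# action" is only taken if at least `quorum` fraction of advisors endorse it.
# The verdicts below are hypothetical placeholders, not real model outputs.

def approved_by_quorum(verdicts: list[bool], quorum: float = 0.8) -> bool:
    """Return True only if the fraction of endorsing advisors meets the quorum."""
    if not verdicts:
        return False  # no advice means no action
    return sum(verdicts) / len(verdicts) >= quorum

# Example: five advisors independently asked "Should we take this drastic step?"
advisor_verdicts = [True, True, True, False, True]
print(approved_by_quorum(advisor_verdicts))  # True: 4/5 = 0.8 meets the threshold
```

The point isn't that this particular rule is right, just that simple aggregation mechanisms like this give you some verifiability even when each individual advisor's reasoning is illegible.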
I'm mostly worried about inner alignment. Does this proposal address this? 🤞👍
Inner alignment is an extremely challenging issue. If only we had some... Wise AI Advisors to help us navigate this problem.
"But this doesn't solve the problem, this assumes that these advisors are aligned themselves": Indeed, that is a concern. However, I suspect that the wise AI advisors approach has less exposure to these kinds of risks as it allows us to achieve certain goals at a lower base model capability level.
• Firstly, wise people don't always have huge amounts of intellectual firepower. So I believe that we will be able to achieve a lot without necessarily using the most powerful models.
• Secondly, the approach of combined human-AI teams allows the humans to compensate for any weaknesses present in the AIs.
In summary, this approach might help in two ways: by reducing exposure and advising us on how to navigate the issue.
🌅 What makes this agenda so great?
Thanks for asking 🥳. Strangely enough I was looking for an excuse to hit a bunch of my key talking points 😂. Here are my top three:
i: 🤝🧭🌎 Imagine a coalition of organisations attempting to steer the world towards positive outcomes[26], with Wise AI Advisors helping them refine strategy, improve coordination and engage in non-manipulative persuasion. How much do you think this could shift our trajectory? It seems to me that even a small coalition could make a significant difference over time.
ii: 🆓🍒✳️ Wise AI advisors are most likely complementary with almost any plan for making AGI go well[29]. I expect that much of this work could take the form of a startup and that non-profits working in this space would appeal to new funders, such that there's very little opportunity cost in terms of pursuing this direction.
iii: 🧑🔬🧭🌅 Wisdom tech seems favourable from a differential technology development perspective. Most AI technology differentially advantages "reckless and unwise" actors, since "responsible & wise" actors[31] need more time to figure out how to deploy a technology. There's a limit to how much we can speed up human processes, but wise AI advisors could likely reduce this time lag further[32]. There's also a reasonable chance that wisdom tech allows "reckless and unwise" actors to realise their own foolishness[33]. For those who favour openness, wisdom tech might be safer to distribute.
🎁 Not persuaded yet? Here are three more bonus points:
I believe that this proposal is particularly promising because it has so many different plausible theories of change. It's hard to know in advance what assumptions will or will not pan out.
You might also be interested in my draft post N Stories of Impact for Wise AI Advisors 🏗️, which attempts to cleanly separate out the various possible theories of impact.
I'm sold! How can I get involved? 🥳🎁
As I said, if you're serious, please ✉️ PM me. If you think you might be serious, but need to talk it through, please reach out as well.
It'd be useful for me to know your background and how you think you could contribute. Maybe tell me a few facts about yourself, what interests you, and drop a link to your LinkedIn profile?
If you scroll down, you'll see that I've answered some more questions about getting involved, but I'll include some useful links here as well:
• List of potentially useful projects: for those who want to know what could be done concretely. Scroll down further for a list of project lists ⬇️.
• : since I believe there should be multiple organisations pursuing this agenda
• Resources for getting started in AI Safety more broadly: since some of these resources might be useful here as well
Now that we've clarified some of the possible theories of impact, the next section will delve more into specific approaches and projects, including listing some (hopefully) robustly beneficial projects that could be pursued in this space.
I also intend for some of my future work to be focused on making things more concrete. As an example, I'm hoping to spend some time during my ERA fellowship attempting to clarify what kinds of wisdom are most important for steering the world in positive directions.
So whilst I think it is important to make this proposal more concrete, I'm not going to rush. Doing things well is often better than doing them as fast as possible. It took a long time for AI safety to move from abstract theoretical discussions to concrete empirical research and I expect it'll also take some time for these ideas to mature[39].
🌱 What can I do? Is this viable? Is this useful?
Do you have specific projects that might be useful? 📚
See Potentially Useful Projects in Wise AI. It contains a variety of projects, from ones that would be marginally helpful to incredibly ambitious moonshots.
What other project lists exist?
Here are some project lists (or larger resources containing a project list) for Wise AI or related areas:
• AI Impacts Essay Competition: Covers the automation of philosophy and wisdom.
• Fellowship on AI for Human Reasoning - Future of Life Foundation: AI tools for coordination and epistemics.
• AI for Epistemics - Benjamin Todd - He writes: "The ideal founding team would cover the bases of: (i) forecasting / decision-making expertise (ii) AI expertise (iii) product and entrepreneurial skills and (iv) knowledge of an initial user-type. Though bear in mind that if you have a gap in one of these areas now, you could probably fill it within a year" and then provides a list of projects.
• Project ideas: Epistemics - Forethought - Focuses on improving epistemics in general, not just AI solutions.
Do you have any advice for creating a startup in this space? 📚
See the for resources including articles, incubation programs, fiscal sponsorship and funding.
You might be interested in one of these upcoming programs (at time of writing):
• Fellowship on AI for Human Reasoning: 12-week incubator fellowship to help talented researchers and builders start working on AI tools for coordination and epistemics.
• Truth-seeking Cosmos x FIRE: Funding for early-stage projects to advance truth-seeking.
Is it really useful to make AI incrementally wiser? 👍
AI being incrementally wiser might still be the difference between making a correct or incorrect decision at a key point.
Often several incremental improvements stack on top of each other, leading to a major advance.
Further, we can ask AI advisors to advise us on more ambitious projects to train wise AI. And the wiser our initial AI advisors are, the more likely this is to go well. That is, improvements that are initially incremental might be able to be leveraged to gain further improvements.
In the best case, this might kick off a (strong) positive feedback cycle (aka wisdom explosion).
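As a purely illustrative toy model (my own framing, not a claim from this post), the difference between "merely incremental" and "explosive" improvement can be written as a recurrence, where $w_t$ stands for the wisdom of the advisors in generation $t$ and each generation helps design the next:

```latex
% Toy recurrence (an illustrative assumption, not something argued in the post):
\[
  w_{t+1} = w_t + f(w_t), \qquad f \text{ increasing.}
\]
% If f(w) \geq c\,w for some c > 0, the sequence grows at least geometrically,
% which corresponds to the "positive feedback cycle" (wisdom explosion) case;
% if the gains f(w_t) shrink towards zero, improvements stay merely incremental.
```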
Sure, you can make incremental progress. But is it really possible to train an AI to be wise in any deep way? 🤞👍🤷
Possibly 🤷.
I'm not convinced it's harder than any of the other ambitious alignment agendas and we won't know how far we can go without giving it a serious effort. Is training an AI to be wise really harder than aligning it? If anything, it seems like a less stringent requirement.
Compare:
• Ambitious mechanistic interpretability aims to perfectly understand how a neural network works at the level of individual weights
• Agent foundations attempts to truly understand what concepts like agency, optimisation, decisions and values are at a fundamental level
• Davidad's Open Agency Architecture attempts to train AIs that come with proof certificates showing the AI has less than a certain probability of having unwanted side-effects
Is it obvious that any of these are easier than training a truly wise AI advisor? I can't answer for you, but it isn't obvious to me.
Given the stakes, I think it is worth pursuing ambitious agendas anyway. Even if you think timelines are short, it's hard to justify holding a probability approaching 100%, so it makes sense for folk to be pursuing plans on different timelines.
I understand your point about all ambitious alignment proposals being extremely challenging, but do you have any specific ideas for training AI to be deeply wise (even if speculative)? 👍
This section provides a high-level overview of some of my more speculative ideas. Whilst these ideas are very far from fully developed, I nonetheless thought it was worthwhile sharing an early version of them. That said, I'm very much of the "let a hundred flowers bloom" school of thought. Whilst I think the approach I propose is promising, I also think it's much more likely than not that someone else comes up with an even better idea.
I believe that imitation learning is greatly underrated. I have a strong intuition that using the standard ML approach for training wisdom will fail because of the traditional Goodhart's law reasons, except that it'll be worse because wisdom is such a fuzzy thing (Who the hell knows what wisdom really is?).
I feel that training a deeply wise AI system requires an objective function that we can optimise hard on. This naturally draws me towards imitation learning. It has its flaws, but it certainly seems like we could optimise much harder on an imitation function than we could if we attempted to train wisdom directly.
Now, imitation learning is often thought of as weak and there's certainly some truth in this when we're just talking about an initial imitation model. However, this isn't a cap on the power of imitation learning, only a floor. There's nothing stopping us from training a bunch of such models and using amplification techniques such as debate, trees of agents or even iterated distillation and amplification. I expect there to be many other such techniques for amplifying your initial imitation models.
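To make the amplification idea more concrete, here's a minimal toy sketch of one amplify-then-distill round over imitation-learned advisors. This is my illustration under stated assumptions, not a worked-out proposal from this post: `ImitationAdvisor`, `decompose`, `amplify` and `distill` are hypothetical stand-ins, and a real system would use trained models and supervised fine-tuning rather than string templates and a lookup table.

```python
# Toy sketch (not the author's method) of one amplification-then-distillation
# round over imitation-learned advisor models. Everything here is a stand-in.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ImitationAdvisor:
    """Toy stand-in for a model trained to imitate wise human advice."""
    name: str
    answer_fn: Callable[[str], str]
    transcripts: list = field(default_factory=list)  # (question, answer) pairs

    def answer(self, question: str) -> str:
        a = self.answer_fn(question)
        self.transcripts.append((question, a))
        return a

def decompose(question: str) -> list[str]:
    """Toy decomposition into subquestions; a real system would ask the model itself."""
    return [f"What could go wrong if we {question}?",
            f"What are the benefits if we {question}?",
            f"What alternatives to '{question}' exist?"]

def amplify(advisor: ImitationAdvisor, question: str) -> str:
    """One amplification step: answer subquestions with copies of the base advisor,
    then aggregate into an overall answer."""
    sub_answers = [advisor.answer(q) for q in decompose(question)]
    return f"Considering: {' | '.join(sub_answers)} => proceed cautiously on '{question}'."

def distill(amplified_transcripts: list) -> ImitationAdvisor:
    """Stub distillation: in practice this would be supervised fine-tuning of a
    new model on the amplified (question, answer) pairs."""
    memory = dict(amplified_transcripts)
    return ImitationAdvisor("distilled", lambda q: memory.get(q, "No advice cached."))

base = ImitationAdvisor("base", lambda q: f"[imitated human judgement about: {q}]")
question = "deploy the new model to all users this week"
student = distill([(question, amplify(base, question))])
print(student.answer(question))
```

The same skeleton could be swapped out for debate or trees of agents; the point is only that the initial imitation model is a floor that amplification can build on.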
See An Overview of "Obvious" Approaches to Training Wise AI Advisors for more information. This post was the first half of my entry into the AI Impacts Essay Competition on the Automation of Wisdom and Philosophy, which placed third.
Many of the methods you've proposed are non-embodied. Doesn't wisdom require embodiment? 🤔👎🤷
There's a very simplistic version of this argument that proves too much: people have used this claim to argue that LLMs were never going to be able to learn to reason, whilst o3 seems to have conclusively disproven this. So I don't think that the standard version of this argument works.
It's also worth noting that in robotics, there are examples of zero-shot transfer to the real world. Even if wisdom isn't directly analogous, this suggests that large amounts of real-world experience might be less crucial than it appears at first glance.
All this said, I find it plausible that a more sophisticated version of this argument might pose a greater challenge. Even if this were the case, I see no reason why we couldn't combine both embodied and non-embodied methods. So even if there were a more sophisticated version that ultimately turned out to be true, this still wouldn't demonstrate that research into non-embodied methods was pointless.
✍️ Conclusion:
Coming soon 😉.
Okay, you've convinced me now. Any chance you could repeat how to get involved so I don't have to scroll up? 🙏📚
Sure 😊. Round 1️⃣2️⃣ 🔔, here we go 🪂!
As I said, if you're serious, please ✉️ PM me. If you think you might be serious, but need to talk it through, please reach out as well.
It'd be useful for me to know your background and how you think you could contribute. Maybe tell me a few facts about yourself, what interests you, and drop a link to your LinkedIn profile?
Useful link:
• List of potentially useful projects: for those who want to know what could be done concretely. Scroll down further for a list of project lists ⬇️.
• : since I believe there should be multiple organisations pursuing this agenda
• Resources for getting started in AI Safety more broadly: since some of these resources might be useful here as well
📘 Appendix:
Who are you? 👋
Hi, I'm Chris (Leong). I've studied maths, computer science, philosophy and psychology. I've been interested in AI safety for nearly a decade and I've participated in quite a large number of related opportunities.
With regards to the intersection of AI safety and wisdom:
I won third prize in the AI Impacts Competition on the Automation of Wisdom and Philosophy. My entry is divided into two parts:
• An Overview of "Obvious" Approaches to Training Wise AI Advisors
• Some Preliminary Notes on the Promise of a Wisdom Explosion
I recently ran an AI Safety Camp project on Wise AI Advisors.
More recently, I was invited to continue my research as an ERA Cambridge Fellow in Technical AI Governance.
Feel free to connect with me on LinkedIn (ideally mentioning your interest in Wise AI advisors so I know to accept) or ✉️ PM me.
Hi, I'm Christopher Clay. I was previously a Non-Trivial Fellow and I participated in the Global Challenges Project.
Does anyone else think Wise AI or Wise AI advisors are important? 👍
Yes, here are a few:
Imagining and Building Wise Machines: The centrality of AI metacognition
original paper
summary
Authors: Yoshua Bengio, Igor Grossmann, Melanie Mitchell and Samuel Johnson[40] et al
The paper mostly focuses on wise AI agents, however, not exclusively so.
AI Impacts Automation of Wisdom and Philosophy Competition
Organised by: Owen Cotton-Barratt
Judges: Andreas Stuhlmüller, Linh Chi Nguyen, Bradford Saad and David Manley[41]
competition announcement
Wise AI Wednesdays post
winners and judges' comments[42]
Beyond Artificial Intelligence (AI): Exploring Artificial Wisdom (AW)
Authors: Dilip V Jeste, Sarah A Graham, Tanya T Nguyen, Colin A Depp, Ellen E Lee, Ho-Cheol Kim
paper
What do you see as the strongest counterarguments against this general direction? 🔜🙏
I've discussed several possible counter-arguments above, but I'm going to hold off publicly posting about which ones seem strongest to me for now. I'm hoping this will increase diversity of thought and nerdsnipe people into helping red-team my proposal for me 🧑🔬👩🔬🎯. Sometimes with feedback there's a risk where if you say, "I'm particularly worried about these particular counter-arguments" you can redirect too much of the discussion onto those particular points, at the expense of other, perhaps stronger criticisms.
Do you have any recommendations for further reading? 📚
I've created a weekly post series on Less Wrong called Wise AI Wednesdays.
Artificial Wisdom
Imagining and building wise machines: The centrality of AI metacognition by Johnson, Karimi, Bengio, et al. - paper, summary: This paper argues that wisdom involves two kinds of strategies (task-level strategies & metacognitive strategies). Since current AI is pretty good at the former, they argue that we should pursue the latter as a path to increasing AI wisdom.
Finding the Wisdom to Build Safe AI by Gordon Seidoh Worley - LW post: Seidoh talks about his own journey in becoming wiser through Zen and outlines a plan for building wise AI. In particular, he argues that it will be hard to produce wise AI without having a wise person to evaluate it.
Designing Artificial Wisdom: The Wise Workflow Research Organisation by Jordan Arel - EA forum post: Jordan proposes mapping the workflows within an organisation that is researching a topic like AI safety or existential risk. AI could be used to automate or augment parts of their work. This proportion would increase over time, with the hope being that this would eventually allow us to fully bootstrap an artificially wise system.
Should we just be building more datasets? by Gabriel Recchia - Substack: Argues that an underrated way of increasing the wisdom of AI systems would be building more datasets (whilst also acknowledging the risks).
Tentatively Against Making AIs 'wise' by Oscar Delany - EA forum post: This article argues that, insofar as wisdom is conceived of as being more intuitive than carefully reasoned, pursuing AI wisdom would be a mistake as we need AI reasoning to be transparent. I've included this because it seems valuable to have at least one critical article.
Neighbouring Areas of Research
What's Important In "AI for Epistemics"? by Lukas Finnveden - Forethought: AI for Epistemics is a subtly different but overlapping area. It is close enough that this article is worth reading. It provides an overview of why you might want to work on this, heuristics for good interventions and concrete projects.
AI for AI Safety by Joe Carlsmith (LW post): Provides a strategic analysis of why AI for AI safety is important, whether it's for making direct safety progress, evaluating risks, restraining capabilities or improving "backdrop capacity". Great diagrams.
AI Tools for Existential Security by Lizka Vaintrob and Owen Cotton-Barratt - Forethought: Discusses how applications of AI can be used to reduce existential risks and suggests strategic implications.
Not Superintelligence: Supercoordination - forum post[44]: This article suggests that software-mediated supercoordination could be beneficial for steering the world in positive directions, but also identifies the possibility of this ending up as a "horrorshow".
Human Wisdom
Stanford Encyclopedia of Philosophy Article on Wisdom by Sharon Ryan - SEP article: SEP articles tend to be excellent, but also long and complicated. In contrast, this article maintains the excellence while being short and accessible.
Thirty Years of Psychology Wisdom Research: What We Know About the Correlates of an Ancient Concept by Dong, Weststrate and Fournier - paper: Provides an excellent overview of how different groups within psychology view wisdom.
The Quest for Artificial Wisdom by Sevilla - paper: This article outlines how wisdom is viewed in the Contemplative Sciences discipline. It has some discussion of how to apply this to AI, but much of this discussion seems outdated in light of the deep learning paradigm.
Some of my own work:
My third prize-winning entry in the AI Impacts Automation of Wisdom and Philosophy Competition (split into two parts):
Some Preliminary Notes on the Promise of a Wisdom Explosion: Defines a wisdom explosion as a recursive self-improvement feedback loop that enhances wisdom, rather than intelligence as per the more traditional intelligence explosion. Argues that wisdom tech is safer from a differential technology perspective.
An Overview of "Obvious" Approaches to Training Wise AI Advisors: Compares four different high-level approaches to training Wise AI: direct training, imitation learning, attempting to understand what wisdom is at a deep principled level, and the scattergun approach. One of the competition judges wrote: "I can imagine this being a handy resource to look at when thinking about how to train wisdom, both as a starting point, a refresher, and to double-check that one hasn't forgotten anything important".
Draft of N Stories of Impact for Wise AI Advisors 🏗️: Different stories about how wise AI advisors could be useful for having a positive impact on the world.
We'll go through each point individually in the core argument section 🔜🧘.
"Then why is this section first?": You might find it hard to believe, but there's actually folks who'd freak out if I didn't do this 🔱🫣. Yep... 🤷♂️
I recommend diving straight into the core argument, and seeing if you agree with it when you substitute your own understanding of what wisdom is. I promise you 💍, I'm not doing this arbitrarily. However, if you really want to read the definitions first, then you should feel free to just go for it 🧘👌.
For example, see a and b.
I also believe that this increases our chances of being able to set up a positive feedback loop 🤞🤞🤞 of increasing wisdom (aka "wisdom explosion" 🦉🎆) compared to strategies that don't leverage humans. But I don't want to derail the discussion by delving into this right now 🐇🕳️.
I'd like to think of myself as being the kind of person who's sensible enough to avoid getting derailed by small or irrelevant details 🤞🤞. Nonetheless, I can recall ~~countless~~ a few times when I've failed at this in the past 🚇💥. Honestly, I wouldn't be surprised if many people were in the same boat 🧑🤝🧑🧑🤝🧑🧑🤝🧑🌏🤞.
I'm using emojis to provide an indication of what my answer is to a question without you having to expand the box 😉.
This post is being written collaboratively by Chris Leong (primary author) and Christopher Clay (second author) in the voice of Chris Leong 🤔🤝✍️. The blame for any cringe in the footnotes can be pinned on Chris Leong 🎯.
I believe that good communication involves being as open about the language game ❤️🦉 being played as possible 🫶.
Given this, I want to acknowledge that this post is not intended to deliver a perfectly balanced analysis, but rather to lay out an optimistic case for Wise AI Advisors 🌅. In fact, feel free to think of it as a kind of manifesto 📖🚪📜📌.
I suspect that many, if not most, readers of Less Wrong will be worried that this must necessarily come at the expense of truth-seeking 🧘. In fact, I probably would have agreed with this not too long ago. However, recently I've been resonating a lot more with the idea that there's value in all kinds of language games, so long as they are deployed in the right context and pursued in the right way 🤔[45]. Most communication shouldn't be a manifesto, but the optimal number of manifestos isn't zero (❗).
Stone-cold rationality is important, but so is the ability to dream 😴💭🎑. It's important to be able to see clearly, but also to lay out a vision 🔭✨. These are in tension, sure, but not in contradiction ⚖️. Synthesis is hard, but not impossible 🧗.
If you're interested in further discussion, see the next footnote...
"Two footnotes on the exact same point? This is madness!": No, this is Sparta!
🌍⛰️🐰🕳️Spartans don't quit and neither do we 💪. So here it is: Why it makes sense to lay out an optimistic case, Round 🔔🔔.
Who doesn't love a second bite at the apple? 👸🏻🍏💀😴
Adopting a lens of optimism unlocks creativity by preventing you from discarding ideas too early (conversely, a lens of pessimism aids with red-teaming by preventing you from discarding objections too easily). It increases the chance that bad ideas make it through your initial filter, but these can always be filtered further down the line. I believe that such a lens is the right choice for now, given that this area of thoughtspace is quite neglected. Exploration is a higher priority than filtration 🌬️⛵🌱.
"But you're biasing people toward optimism": I take this seriously 😔🤔. At the same time, I don't believe that readers need to be constantly handheld lest they form the wrong inference. Surely, it's worth risking some small amount of bias to avoid coming off as overly paternalistic?
Should texts be primarily written for the benefit of those who can't apply critical thinking? To be honest, that idea feels rather strange to me 🙃. I think it's important to primarily write for the benefit of your target audience and I don't think that the readers I care about most are just going to absorb the text wholesale, without applying any further skepticism or cognitive processing 🤔. Trusting the reader has risks, sure, but I believe there are many people for whom this is both warranted and deserved 🧭⛵🌞.
"Why all the emojis?": All blame belongs to Chris Leong 🎯.
"Not who deserves to be shot, but why are you using them?": A few different reasons: they let me bring back some of the expressiveness that's possible in verbal conversation, but which is typically absent in written communication 🥹😱, and striking a more casual tone helps differentiate it from any more academic or rigorous treatment (😉) that might hypothetically come later 🤠🛤️🎓 and they help keep the reader interested 🎁.
But honestly: I mostly just feel drawn to the challenge. The frame "both/and" has been resonating a lot with me recently and one such way is that I've been feeling drawn towards trying to produce content that manages to be both fun and serious at the same time 🥳⚖️🎩.
Literally even ✍️.
I selected this name because it's easy to say and remember, but in my heart it'll always be the MST/MSU/MSV trifecta (minimal spare time, massive strategic uncertainty, many scary vulnerabilities 💔🚢🌊🎶). I greatly appreciate Will MacAskill recommending that I simplify the name 🙏.
My view is actually a bit more complicated than I lay out here 🕸️. There are many different arguments that could be made for pursuing this research direction. However, I've shared this one because it is the most legible and compelling 🔤🌊. For more information on alternate theories of impact, see this draft 🏗️.
I'm not exactly making anything super original or unprecedented here 😭😱, but I don't know if I've ever really seen anyone really go hard on this shape of argument before 🌊🌊🌊.
Claim 🅅 🌊🚣 doesn't claim any particular timelines. I believe it's likely to be ~~absurdly~~ quite fast, but I don't claim this until claim 🅂😅⏳. If you find yourself doubting claim 🅅 🌊🚣, there's a good chance it's because you've assumed that I'm talking about the short or medium term, rather than the long term 🤔.
Even if you believe that a number of these don't actually pose any realistic risk of catastrophe, the fact is that it's extremely unclear to society which issues are or aren't the catastrophic ones. It's hard for us to avoid catastrophe as we don't know where to focus our efforts. So it's extremely likely we'll completely drop the ball somewhere we can't afford for this to happen ⚪🪙⚫ 🫡.
I purposefully avoided mentioning misaligned superintelligence below given that it is simultaneously extremely controversial and not at all necessary for this argument 🤷. I would prefer not to lose people if I don't have to 🙏.
Some of these risks could directly cause catastrophe. Others might indirectly lead to this via undermining vital societal infrastructure such that we are then exposed to other threats 🕷️🕸️.
One of the biggest challenges here is that AI throws so many civilisational scale challenges towards us at once. Humanity can deal with civilisational scale challenges, but global co-ordination is hard, and when there are so many different issues to deal with it's hard to give each problem the proper attention it deserves 😅🌫️💣💣💣.
In many ways, the "figuring out how to act" part of the problem is the much harder component 🧭🌫️🪨. Given perfect co-ordination between nation states, these issues could almost certainly be resolved quite quickly, but given that there's no agreement on what needs to be done and politics is the mind-killer, I wouldn't hold your breath... 😯😶🐌🚀🚀🚀
I received feedback that calling these alpha and beta sounded kind of cringe. On one hand, I agree with this; on the other hand, "alpha" and "beta" totally sound like the right sort of terms to describe strategies, so I went with the compromise choice and used the upper-case versions of the Greek letters, which look virtually identical 🧐 to the English letters 😉.
"Why can't we afford to leave anything on the table?" - The task we face is extremely challenging, the stakes sky high, it's extremely unclear what would have to happen for AI to go well, we have no easy way on gaining clarity on what would have to happen and we're kind of in an AI arms race which basically precludes us from fully thinking everything through 😅⏳⏰.
This paper argues that robustness, explainability and cooperation all facilitate each other 🤞💪. In case the link to wisdom is confusing (🦉❓), the authors argue that metacognition is both key to wisdom and key to these capabilities ☝️.
This argument could definitely be more rigorous; however, this response is addressed to optimists, and it is especially likely to be true if we adopt an optimistic stance on technical and social alignment. So I feel fine leaving it as is 🤞🙏.
⚠️ If timelines are extremely short, then many of my proposals might become irrelevant because there won't be time to develop them 😭⏳. We might be limited to simple things like tweaking the system prompt or developing a new user interface, as opposed to some of the more ambitious approaches that might require changing how we train our models. However, there's a lot of uncertainty with timelines, so more ambitious projects might make sense anyway. 🤔🌫️🤞🤷.
"If your proposal might even work better over long timelines, why does your core argument focus on short timelines 🤔⁉️": There are multiple arguments for developing Wise AI Advisors and I wanted to focus on one that was particularly legible 🔤🙏.
Andrew Critch has labelled this - "pivoting" civilisation across multiple acts across multiple persons and institutions - a pivotal process. He compares this favourably to the concept of a pivotal act - "pivoting" civilisation in a single action - on the basis of it being easier, safer and more legitimate 🗺️🌬️⛵️⛵️⛵️🌎.
I say "forget X" partly in jest. My point is that if there's a realistic chance of realising stronger possibility, then it would likely be much more significant than the weaker possibility 👁️⚽.
I think there's a strong case that if these advisers can help you gain allies at all, they can help you gain many allies 🎺🪂🪂🪂.
In So You Want To Make Marginal Progress..., John Wentworth argues that "when we don't already know how to solve the main bottleneck(s)... a solution must generalize to work well with the whole wide class of possible solutions to other subproblems... (else) most likely, your contribution will not be small; it will be completely worthless" 🗑️‼️.
My default assumption is that people will be making further use of AI advice going forwards. But that doesn't contradict my key point, that certain strategies may become viable if we "go hard on" producing wise AI advisors 🗺️🔓🌄.
"But "reckless and unwise" and "responsible & wise" aren't the only possible combinations!" - True 👍, but they co-occur sufficiently often that I think this is okay for a high-level analysis 🤔🤞🙏.
They could also speed up the ability of malicious and "reckless & unwise" actors to engage in destructive actions; however, I still think such advisors would most likely improve the comparative situation since many defences and precautions can be put in place before they are needed 🤔🤞💪.
This argument is developed further in Some Preliminary Notes on the Promise of a Wisdom Explosion 🦉🎆📄.
I agree it's highly unlikely that we could convince all the major labs to pursue a wisdom explosion instead of an intelligence explosion 🙇. Nonetheless, I don't think this renders the question irrelevant ☝️. Let's suppose (😴💭) it really were the case that the transition to AGI would most likely go well if society had decided to pursue a wisdom explosion rather than an intelligence explosion ‼️. I don't know, but that seems like it'd be the kind of thing that would be pretty darn useful to know 🤷♂️. My experience with a lot of things is that it's a lot easier to find additional solutions than it is to find the first. In other words, solutions aren't just valuable for their potential to be implemented, but because of what they tell us about the shape of the problem 🗝️🚪🌎.
How can we tell which framings are likely to be fruitful and which are not 👩🦰🍎🐍🌳? This isn't an easy question, but my intuition is that a framing is more likely to be worth pursuing if it leads to questions that feel like they should (❓) be asked, but have been neglected due to unconscious framing effects 🙈🖼️. In contrast, a framing is less likely to be fruitful if it asks questions that seem interesting prima facie, but for which the best researchers within the existing paradigm have good answers, even if most researchers do not 🏜️✨.
I acknowledge that "intelligence" is only rather loosely connected with the capabilities AI systems have in practice. Factors like prestige, economic demand and tractability play a larger role in terms of what capabilities are developed than the name used to describe the area of research. Nonetheless, I think it'd be hard to argue that the framing effect hasn't exerted any influence. So I think it is worthwhile understanding how this may have shaped the field and whether this influence has been for the best 🕵️🤔.
A common way that an incorrect paradigm can persist is if existing researchers don't have good answers to particular questions, but see them as unimportant 🙄👉.
I have an intuition that it's much more valuable for the AI safety and governance community to be generating talent with a distinct skillset/strengths than to just be recruiting more folk like those we already have 🧩. As we attract more people with the same skillset we likely experience decreasing marginal returns due to the highest impact roles already being filled 🙅♂️📉.
That said, given how fast AI development is going, I'm hoping the process of concretisation can be significantly sped up. Fortunately, I think it'll be easier to do it this time as I've already seen what the process looked like for alignment 🔭🚀🤞.
First author 🥇.
Wei Dai, who was originally announced as a judge, withdrew 🕳️.
Including me 🏆🍀😊.
They write: "The precise opinions expressed in this post should not be taken as institutional views of AI Impacts, but as approximate views of the competition organizers" ☝️👌.
Linking to this article does not constitute an endorsement of Sofiechan or any other views shared there. Unfortunately, I am not aware of other discussions of this concept 😭🤷, so I will keep this link in this post temporarily 🙏🔜.
You could even say: with the right person, and to the right degree, and at the right time, and for the right purpose, and in the right way ❤️🦉.
I purposefully avoided mentioning misaligned superintelligence given that it is simultaneously extremely controversial and not at all necessary for this argument 🤷. I would prefer not to lose people if I don't have to 🙏.