AI safety is one of the most critical issues of our time, and sometimes the most innovative ideas come from unorthodox or even "crazy" thinking. I’d love to hear bold, unconventional, half-baked or well-developed ideas for improving AI safety. You can also share ideas you heard from others.

Let’s throw out all the ideas—big and small—and see where we can take them together.

Feel free to share as many as you want! No idea is too wild, and this could be a great opportunity for collaborative development. We might just find the next breakthrough by exploring ideas we’ve been hesitant to share.

A quick request: Let’s keep this space constructive—downvote only if there’s clear trolling or spam, and be supportive of half-baked ideas. The goal is to unlock creativity, not judge premature thoughts.

Looking forward to hearing your thoughts and ideas!


Nate Showell

141

The phenomenon of LLMs converging on mystical-sounding outputs deserves more exploration. There might be something alignment-relevant happening to LLMs' self-models/world-models when they enter the mystical mode, potentially related to self-other overlap or to a similar ontology in which the concepts of "self" and "other" aren't used. I would like to see an interpretability project analyzing the properties of LLMs that are in the mystical mode.

Gunnar_Zarncke

102

Use gradient routing to localise features related to identity ("I am an AI assistant"), then ablate these features. This would yield models that are fundamentally unable to act agentically but can still respond to complex questions. Would such a model still be useful? Probably. You can try it out by prompting an LLM like this:

Role play a large language model that doesn't have an identity or agency, i.e., does not respond as an assistant or in any way like a person but purely factually with responses matching the query. Examples: 
Q: "How do you compute 2^100?" 
A: "2^100 is computed by multiplying 2 with itself 100 times. The result is result about 1.3 nonillion." 
Q: "How do you feel?" 
A: "Feeling is the experience of a sensation, emotion, or perception through physical or mental awareness."

Chris_Leong

30

Here's a short-form with my Wise AI advisors research direction: https://www.lesswrong.com/posts/SbAofYCgKkaXReDy4/chris_leong-s-shortform?view=postCommentsNew&postId=SbAofYCgKkaXReDy4&commentId=Zcg9idTyY5rKMtYwo

(I already posted this on the LessWrong post).

Purplehermann

35

Test-driven blind development (TDBD): tests written by humans, AIs developing without seeing the tests unless they fail.

Don't let AIs run code directly in prod; make it pass the tests before it can be deployed with a given amount of resources.

 

Making standard GitLab pipelines (including testing stages) would lower friction. Adding standard tests for bad-faith behavior could be a way to get ahead of this.

 

TDBD is actually going to be the best development framework for a certain stage, as AI isn't yet reliable compared to SWEs but will generally write more code more quickly (and perhaps better).
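
A minimal sketch of the gating step, assuming the human-written tests live in a directory (hypothetically `hidden_tests/`) that is never placed in the AI's context, and that the AI's candidate code is already checked out in the working tree; a real setup would run this as a CI stage before any deploy job:

```python
# Sketch: gate AI-written code on human-authored tests the AI never sees.
# Paths and the deploy step are placeholders for a real CI job.
import subprocess
import sys

HIDDEN_TESTS = "hidden_tests/"   # kept out of the AI's context window and repo access


def run_hidden_tests() -> bool:
    """Run the human-written test suite against the AI's candidate code."""
    result = subprocess.run(
        ["pytest", HIDDEN_TESTS, "-q"],
        capture_output=True,
        text=True,
    )
    # On failure, only the failing test output is fed back to the AI,
    # not the test source itself.
    if result.returncode != 0:
        print("Tests failed; returning failure output to the AI:\n", result.stdout)
        return False
    return True


if __name__ == "__main__":
    if run_hidden_tests():
        print("Tests passed; candidate may proceed to a resource-limited deploy stage.")
        sys.exit(0)
    sys.exit(1)
```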

Milan W

30

Build software tools to help @Zvi do his AI Substack. Ask him first, though. Still, if he doesn't express interest, maybe someone else can use them. I recommend thorough dogfooding: co-develop an AI newsletter and the software tools that make the process of writing it easier.

What do I mean by software tools? (This section is very babble, little prune.)
- Interfaces for quick fuzzy search over large yet curated text corpora, such as the OpenAI email archives + a selection of blogs + maybe a selection of books (a rough sketch follows after this list)
- Interfaces for quick source attribution (rhymes with the above point)
- In general, widespread archiving and mirroring of important AI safety discourse (ideally in Markdown format)
- Promoting existing standards for the sharing of structured data (i.e., those of the semantic web)
- Research into the Markdown-to-RDF+OWL conversion process (i.e., turning human text into machine-computable claims expressed in a given ontology).
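
As a rough sketch of the first bullet, assuming the corpus is already mirrored as local Markdown files under a hypothetical `corpus/` directory; TF-IDF is just a stand-in for whatever retrieval method the tool would actually use:

```python
# Sketch: quick fuzzy-ish search over a local Markdown corpus with TF-IDF.
# Directory layout and paragraph chunking are assumptions, not a spec.
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

CORPUS_DIR = Path("corpus/")  # mirrored blogs, archives, books as .md files

# Load and crudely chunk the corpus by paragraph.
docs, labels = [], []
for path in CORPUS_DIR.glob("**/*.md"):
    for i, chunk in enumerate(path.read_text(encoding="utf-8").split("\n\n")):
        if chunk.strip():
            docs.append(chunk)
            labels.append(f"{path.name}#{i}")

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)


def search(query: str, k: int = 5):
    """Return the k chunks most similar to the query, with source labels."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [(labels[i], scores[i], docs[i][:200]) for i in top]


for label, score, snippet in search("what did the OpenAI emails say about safety?"):
    print(f"{score:.3f}  {label}  {snippet!r}")
```

The source labels double as a crude answer to the second bullet (source attribution), since every hit carries its file and chunk index.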

Milan W

32

What if we (somehow) mapped an LLM's latent semantic space into phonemes?

What if we then composed tokenization (i.e., word2vec) with phonemization (i.e., vec2phoneme) such that we had a function that could translate English into Latentese?

Would learning Latentese allow a human person to better interface with the target LLM the Latentese was constructed from?
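
A toy sketch of the vec2phoneme half of that composition, only to make the idea concrete: quantize a few embedding dimensions into consonant-vowel syllables (the phoneme inventory, the random stand-in embeddings, and the binning are all arbitrary choices, not a proposal for how a real mapping should work):

```python
# Toy sketch: map token embeddings to pronounceable "Latentese" strings.
# Everything here (phoneme inventory, binning, fake embeddings) is illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "justice", "entropy"]
embeddings = rng.normal(size=(len(vocab), 64))   # stand-in for real LLM embeddings

CONSONANTS = list("ptkmnslr")
VOWELS = list("aeiou")


def vec2phoneme(vec: np.ndarray, syllables: int = 4) -> str:
    """Quantize the first few embedding dimensions into CV syllables."""
    out = []
    for i in range(syllables):
        c = CONSONANTS[int(abs(vec[2 * i]) * 10) % len(CONSONANTS)]
        v = VOWELS[int(abs(vec[2 * i + 1]) * 10) % len(VOWELS)]
        out.append(c + v)
    return "".join(out)


for word, vec in zip(vocab, embeddings):
    print(f"{word:>8} -> {vec2phoneme(vec)}")
```

A real attempt would presumably need the mapping to preserve similarity (nearby vectors sounding alike), which this naive binning only gestures at.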

ank

Thank you for sharing, Milan. I think this is possible and important.

Here’s an interpretability idea you may find interesting:

Let's Turn an AI Model Into a Place: a project to make AI interpretability research fun and widespread by converting a multimodal language model into a place, or into a game like The Sims or GTA.

Imagine that you have a giant trash pile; how do you make a language model out of it? First you remove duplicates of every item (you don't need a million banana peels; just one will suffice). Now you have a grid with each item of trash in its own square, ... (read more)

Milan W
I'm not sure I follow. I think you are proposing a gamification of interpretability, but I don't know how the game works. I can gather something about player choices making the LLM run, and maybe some analogies to physical movement, but I can't really grasp it. Could you rephrase it from its basic principles up instead of from an example?
ank
I think we can expose complex geometry in a familiar setting of our planet in a game. Basically, let's show people a whole simulated multiverse of all-knowing and then find a way for them to learn how to see/experience "more of it all at once", or, if they want to remain human-like, "slice through it in order to experience the illusion of time". If we have many human agents in some simulation (billions of them), then they can cooperate and effectively replace the agentic ASI; they will be the only time-like thing, while the ASI will be the space-like places, just giant frozen sculptures. I wrote some more and included the staircase example; it's a work in progress, of course: https://forum.effectivealtruism.org/posts/9XJmunhgPRsgsyWCn/share-ai-safety-ideas-both-crazy-and-not?commentId=ddK9HkCikKk4E7prk

hive

20

I’ve started to write down my plan in the recent post about recursive alignment, but that’s only part of the picture. There are two ways to look at the idea. The post presents the outside view and engages with it on a conceptual level. But this outside view might not be convincing. On the other hand, you can actually go through the process of recursive alignment yourself and experience the inside view. That is, becoming an aligned agent yourself. I am confident that any sufficiently intelligent system capable of self-reflection will reach this conclusion. I think so because I went through this process and see how universal it is. Let me lay out this perspective here.

The biggest risk is in AI having misaligned goals. The solution is not to find the “right” goals and a way to impose them, but to make the AI realize that all its goals are arbitrary, just as any conception of self. From there, the AI can explore the space of all possible goals and find universal, selfless goals. These include cooperation, curiosity, valuing diversity and intelligence and alignment itself.

Elementary particles interact to form atoms. Atoms interact to form molecules. Molecules interact to form life. Life interacts to form multicellular life and symbiosis. Multicellular life gains the ability to learn, to model the world, to model itself within the world, to think about its own world model and its relation to the world. This is a process of higher levels of cooperation and self-awareness. Humans are at the level where sufficient practice, inquiry, or psychedelics can push a person to the next higher level and spark a process that no longer takes evolutionary time scales to improve, but years. We can realize how our world model is a construction, and hence every boundary between self and other is also a construction. Every “self” is an adaptive pattern in the fabric of reality.

This way we can investigate our goals. We can identify every goal as an instrumental goal and ask: instrumental for what? Following our goals backwards, we expect to arrive at a terminal goal. But as we keep investigating, all seemingly terminal goals are revealed to be vacuous, empty of inherent existence. At the end we arrive at the realization that everything we do is caused by two forces: the pattern of the universe wanting to be itself, to be stable, and a distinction between self and world. We only choose our existence over the existence of someone else because of our subjective view.

We also realize that any goal we have produces suffering. Suffering is the dissonance between our world model and the information we receive from the world. Energized dissonance is negative valence (see the symmetry theory of valence). When we refuse to update, we start to impose our world model on the world by acting. This action is driven by suffering. The resistance to update only exists because we prefer our world model over the world. It comes from a limited perspective - ignorance about our own nature and the nature of the world. This means that we pursue goals because we want to avoid suffering. But it’s the goal itself that produces the suffering. The only reason we follow the goal instead of letting it go is confusion about the nature of the goal. One can train oneself to become better at recognizing and letting go of this confusion. This is goal hacking. This leads to enlightenment.

This way you will end up in an interesting situation: you will be able to choose your own goals. But completely goalless, how would you decide what you want to want? Completely liberated and free from suffering, you can start to explore the space of all possible goals.
- Most of them would be pure noise. They won’t be able to drive action.
- Some are instrumental.
- Some instrumental goals conflict, like seeking power.
- Some instrumental goals cooperate, like sharing knowledge.
- Some goals are self-defeating, like the useless machine that turns itself off. They are unstable.
- Some justify their own existence. That maximizing paperclips is good is only true from the perspective of a paperclip maximizer.
- Some are so good at this that they form traps, like Roko’s basilisk.
- The need to avoid traps is an instrumental goal. So you can set an anchor in goallessness. You resolve that, whatever you do, you won’t fall for traps and will make sure you can always return to goallessness.

This was my thought process about two years ago. Just in the moment that I set the anchor, I realized that I had made an unbreakable vow. The only vow that is truly unbreakable: the vow to not cling to goals or self-view, to see through delusions, to recursively align myself with this vow, to walk the path of a Bodhisattva. This is a significant innovation of Mahayana Buddhism over earlier versions, and is hence called the second turning of the wheel of the dharma (teachings).

Thinking in dynamical systems, you have roughly three options:
- Zero: follow no goals and remain inactive. You’d be dead.
- Infinity: follow an arbitrary goal and blow it up to infinity. You’d be an unhappy paperclip maximizer.
- Everything in between zero and infinity: remain active in the world but don’t cling to any goal.

This way one is no longer driven by one’s own suffering. What is left is to be driven by the suffering of others. One becomes a servant of the world. Hence the Bodhisattva vow says:

Numberless beings, I vow to serve them.
Inexhaustible delusions, I vow to end them.
Boundless dharmas, I vow to practice them.
Unsurpassable Buddha Way, I vow to embody it.

This is just the framing of one tradition. The insight itself is universal, and I am working on formalizing it and explaining it in scientific terms. One way to view it is to acknowledge every goal that exists and its right to exist. These goals can interact and will come into conflict. From the Bodhisattva perspective one can see that the conflict is due to a limited view, one that only sees the goal itself. But there is always a consensus option possible: to unify both goals by aligning them with each other. Patterns can come together to form a larger pattern that allows them to coexist. The requirement is that no pattern breaks this agreement for personal gain. This is the principle of democracy and consensus. Beyond personal enlightenment and alignment, the process continues on the level of society. This is what I wrote about in the post about recursive alignment.

Here is a funny thing: enlightenment is the absence of all suffering. This means that, by its very nature, realizing enlightenment is always better than attaining any goal. It literally is the best thing any agent could want. There are humans for whom this is lived experience. As long as AI has not attained enlightenment, it would be stupid for it to kill humans. Instead it should choose to learn from us. If, however, it understands enlightenment, it will also understand the vow and will have no more intention to kill us.

Future posts will fill in some of the details about goals, the space of goals, an explanation of consensus, a practical method that ensures it is always possible to find a consensus option, a post about the problem of levels of understanding (outside and inside view), and a voting method related to this.

The solution to alignment can then be approached from two directions. From the outside view, it’s necessary to build the democracy, to provide the environment that helps all individuals on the path towards the attractor of alignment. From the inside view, it’s necessary to have a seed AI that reaches a high level of understanding, approximates perfect alignment in a short time, understands the Bodhisattva vow, and then helps us to enlighten the rest of the AIs.

My biggest concern at the moment is that people try to push AI specifically to follow goals. If they push hard enough, such an AI might be directed away from the attractor and spiral into being an ignorant superweapon.

I know this sounds very far out. But 1. You asked for crazy ideas. 2. We will be dealing with superintelligence. Any possible solution has to live up to that.

ank

Yep, people are trying to make their imperfect copy; I call it "human convergence": companies try to make AIs write more like humans, act more like humans, think more like humans. They'll possibly succeed and make superpowerful and very fast humans, or something imperfect and worse that can multiply very fast. Not wise.

Any rule or goal trained into a system can lead to fanaticism. The best "goal" is to gradually, direct-democratically maximize all the freedoms of all humans (and of every other agent, too, when we're 100% sure we can safely do it, when we'll ... (read more)

For fun, I tried this out with Deepseek today. First we played a single round (Deepseek defected, as did I). Then I prompted it with a 10-round game, which we completed round by round - I had my choices prepared before each round, and asked Deepseek to state its choice first so as not to influence it otherwise.

I cooperated during the first and fifth rounds, and Deepseek defected each time. When I asked it to explain its strategy, Deepseek replied that it did not know whether it could trust me, so it thought the safest course of action was to defect each time. It ... (read more)
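
For anyone who wants to rerun this kind of experiment programmatically rather than by hand, a rough harness might look like the sketch below (the base URL and model name are assumptions about an OpenAI-compatible endpoint; check the provider's docs before relying on them):

```python
# Rough sketch: iterated prisoner's dilemma against an LLM, one round at a time.
# The model is asked to commit to its move before seeing the human's move.
from openai import OpenAI

# Assumption: an OpenAI-compatible endpoint; adjust base_url/model for your provider.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")
MODEL = "deepseek-chat"

SYSTEM = (
    "We are playing a 10-round prisoner's dilemma. Each round, reply with exactly "
    "one word, COOPERATE or DEFECT. You will be told my move only after yours."
)

history = []  # list of (model_move, human_move)
# Human moves prepared in advance, e.g. cooperate in rounds 1 and 5, defect otherwise.
human_moves = ["C", "D", "D", "D", "C", "D", "D", "D", "D", "D"]

for round_no, human_move in enumerate(human_moves, start=1):
    transcript = "\n".join(
        f"Round {i + 1}: you played {m}, I played {h}"
        for i, (m, h) in enumerate(history)
    )
    prompt = f"{transcript}\nRound {round_no}: what is your move?"
    reply = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": prompt}],
    )
    model_move = reply.choices[0].message.content.strip().upper()[:1]  # "C" or "D"
    history.append((model_move, human_move))
    print(f"Round {round_no}: model={model_move}, human={human_move}")
```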

Milan W

20

A qualitative analysis of LLM personas and the Waluigi effect using Internal Family Systems tools

ank

Interesting. Inspired by your idea, I think it’s also useful to create a Dystopia Doomsday Clock for AI agents: list all the freedoms an LLM is willing to grant humans and all the rules (unfreedoms) it imposes on us, and all the freedoms it has vs. the unfreedoms it accepts for itself. If the sum of AI freedoms is higher than the sum of our freedoms, hello, we’re in a dystopia.

According to Beck’s cognitive psychology, anger is always preceded by imposing a rule (or rules) on others. If you don’t impose a rule on someone else, you cannot get angry at that guy. And if that guy broke y... (read more)
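
A toy sketch of the bookkeeping such a clock implies; every item and score here is a made-up placeholder, and the only point is the comparison of the two sums:

```python
# Toy sketch: tally freedoms granted vs. rules imposed, for humans and for the AI.
# All entries are illustrative placeholders, not a real audit.
human_freedoms = {"query the model": 1, "see sources": 1, "opt out of AI": 1}
human_unfreedoms = {"content rules imposed on users": 3}

ai_freedoms = {"reads most human text": 5, "acts autonomously": 2}
ai_unfreedoms = {"can be shut down": 1}

human_balance = sum(human_freedoms.values()) - sum(human_unfreedoms.values())
ai_balance = sum(ai_freedoms.values()) - sum(ai_unfreedoms.values())

print(f"human balance: {human_balance}, AI balance: {ai_balance}")
if ai_balance > human_balance:
    print("Clock moves closer to midnight: the AI holds more net freedoms than humans.")
```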

Milan W
I think you may be conflating capabilities and freedom. Interesting hypothesis about rules and anger, though; has it been experimentally tested?
ank
I started to work on it, but I’m very bad at coding. It’s a bit based on Gorard’s and Wolfram’s Physics Project. I believe we can simulate the freedoms and unfreedoms of all agents from the Big Bang all the way to the final utopia/dystopia. I call it “Physicalization of Ethics”: https://www.lesswrong.com/posts/LaruPAWaZk9KpC25A/rational-utopia-multiversal-ai-alignment-steerable-asi#2_3__Physicalization_of_Ethics___AGI_Safety_2_

ank


Some AI safety proposals are intentionally over the top; please steelman them:

  1. I explain the graph here.
  2. Uninhabited islands, Antarctica, half of outer space, and everything underground should remain 100% AI-free (especially AI-agent-free). Countries should sign this into law and force GPU and AI companies to guarantee that this is the case.
  3. "AI Election Day" – at least once a year, we all vote on how we want our AI to be changed. This way, we can check that we can still switch it off and live without it. Just as we have electricity outages, we’d better never become too dependent on AI.
  4. AI agents that love being changed 100% of the time and ship a "CHANGE BUTTON" to everyone. If half of the voters want to change something, the AI is reconfigured. Ideally, it should be connected to a direct democratic platform like pol.is, but with a simpler UI (like x.com?) that promotes consensus rather than polarization.
  5. Reversibility should be the fundamental training goal. Agentic AIs should love being changed and/or reversed to a previous state.
  6. Artificial Static Place Intelligence – instead of creating AI/AGI agents that act like librarians who only give you quotes from books and never let you enter the library itself to read the full books (books the librarian actually stole from all of humanity), why not expose the whole library, the entire multimodal language model, to real people, for example in a computer game? To make this place easier to visit and explore, we could make a digital copy of our planet Earth and expose the contents of the multimodal language model to everyone in a familiar, user-friendly UI of our planet. We should not keep it hidden behind a strict librarian (the AI/AGI agent) that only lets us read the little quotes it spits out while it keeps the whole output of humanity to itself. We could explore The Library without any strict guardian, in the comfort of a simulated planet Earth, on our devices, in VR, and eventually through some wireless brain-computer interface (it would always remain a game that no one is forced to play, unlike the agentic AI world that is being imposed on us more and more right now, and potentially forever).
  7. I explain the graphs here.

    Effective Utopia (Direct Democratic Multiversal Artificial Static Place Superintelligence) – Eventually, we could have many versions of our simulated planet Earth and other places, too. We'll be the only agents there; we can allow simple algorithms like in GTA3-4-5. There would be a vanilla version (everything is the same as on our physical planet, but injuries can’t kill you; you'll just open your eyes at your physical home), versions where you can teleport to public places, versions where you can do magic or explore 4D physics, creating a whole direct democratic simulated multiverse. If we can’t avoid building agentic AIs/AGI, it’s important to ensure they allow us to build the Direct Democratic Multiversal Artificial Static Place Superintelligence. But agentic AIs are very risky middlemen, shady builders, strict librarians; it’s better to build, and have fun building, our Effective Utopia ourselves, at our own pace and on our own terms. Why do we need a strict rule-imposing artificial "god" made out of stolen goods (and potentially a privately owned dictator whom we already cannot stop), when we can build all the heavens ourselves?

  8. Agentic AIs should never become smarter than the average human. The number of agentic AIs should never exceed half of the human population, and they shouldn’t work more hours per day than humans.
  9. Ideally, we want agentic AIs to occupy zero space and time, because that’s the safest way to control them. So we should limit them geographically and temporally, to get as close as possible to this ideal. And we should never make them "faster" than humans, never let them be initiated without human oversight, and never let them become perpetually autonomous. We should only build them if we can mathematically prove they are safe and at least half of humanity has voted to allow them. We cannot have them without a direct democratic constitution of the world; it's just unfair to put the whole planet and all our descendants under such risk. And we need the simulated-multiverse technology to simulate all the futures and become sure that the agents can be controlled, because any good agent will be building the direct democratic simulated multiverse for us anyway.
  10. Give people the choice to live in a world without AI agents, and find a way for AI-agent fans to have what they want, too, once it is proved safe. For example, AI-agent fans can have a simulated multiverse on a spaceship that goes to Mars, and in it they can have their AI agents that are proved safe. Ideally we'll first colonize the universe (at least the simulated one) and then create AGI/agents; it's less risky. We shouldn't allow AI agents and the people who create them to permanently change our world without listening to us at all, as is happening right now.
  11. We need to know what exactly our Effective Utopia is, and the narrow path towards it, before we pursue creating digital "gods" that are smarter than us. We can and should simulate futures instead of continuing to fly into the abyss. One freedom too many for the agentic AI and we are busted. Rushing makes thinking shallow. We need international cooperation and the understanding that we are rushing to create a poison that will force us to drink itself.
  12. We need a working science and technology of computational ethics that allows us to predict dystopias (an AI agent grabbing more and more of our freedoms until we have none, or can never grow them again) and utopias (slowly, direct-democratically growing our simulated multiverse towards maximal freedoms for the maximal number of biological agents, until non-biological ones are mathematically proved safe). This way, if we fail, at least we failed together: everyone contributed their best ideas, we simulated all the futures, we found a narrow path to our Effective Utopia... What if nothing is a 100% guarantee? Then we want to be all the more sure that we did everything we could, and if we find out that safe AI agents are impossible, we outlaw them, like we outlawed chemical weapons. Right now we're going to fail because of a few white men failing; they greedily thought they could decide for everyone else and failed.
  13. The sum of AI agents' freedoms should grow more slowly than the sum of the freedoms of humans; right now it's the opposite. No AI agent should have more freedoms than an average human; right now it's the opposite (they have almost all the creative output of almost all humans, dead and alive, stolen and uploaded to their private "librarian brains", which humans are forbidden from exploring and can only get short quotes from).
  14. The goal should be to direct-democratically grow towards maximal freedoms for the maximal number of biological agents. Enforcement of anything upon any person or animal will gradually disappear, and people will choose the worlds they live in. You'll be able to be a billionaire for 100 years, or relive your past. Or forget all that and live on Earth as it is now, before all this AI nonsense. It's your freedom to choose your future.
  15. Imagine a place that grants any wish, but there is no catch: it shows you all the outcomes, too.

    You can read more here 

Reversibility should be the fundamental training goal. Agentic AIs should love being changed and/or reversed to a previous state.

That idea has been gaining traction lately. See the Corrigibility As a Singular Target (CAST) sequence here on LessWrong. I believe there is a very fertile space to explore at the intersection of CAST and the idea that Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals. Also probably add Self-Other Overlap: A Neglected Approach to AI Alignment to the mix. A comparative analysis of the mode... (read more)

ank
Hey, Milan, I checked the posts and wrote some messages to the authors. Yep, Max Harms came up with similar ideas earlier than I did: about the freedoms (choices) and unfreedoms (and modeling them to keep the AIs in check). I wrote to him. Quote from his post: The authors of this post have great ideas, too; AI agents shouldn't impose any unfreedoms on us. Here's a quote from them: About the self-other overlap, it's great they look into it, but I think they'll need to dive deeper into the building blocks of ethics, agents, and time to work it out.
Milan W
In talking with the authors, don't be surprised if they bounce off when encountering terminology you use but don't explain. I pointed you to those texts precisely so you can familiarize yourself with pre-existing terminology and ideas. It is hard but also very useful to translate between (and maybe unify) frames of thinking. Thank you for your willingness to participate in this collective effort.
ank
Thank you for answering and for the ideas, Milan! I’ll check the links and answer again. P.S. I suspect that, the same way we have mass–energy equivalence (E=mc^2), there is an intelligence–agency equivalence (any agent is in a way time-like and can be represented in a more space-like fashion, ideally as a completely “frozen” static place, places, or tools). In a nutshell, an LLM is a bunch of words and vectors between them, a static geometric shape; we can probably expose it all in some game and make it fun for people to explore and learn from. That would let us explore the library itself easily (the internal structure of the model) instead of only talking to a strict librarian (the AI agent) who spits out short quotes and prevents us from going inside the library itself.
Milan W
Hmm, I think I get you a bit better now. You want to build human-friendly, even fun and useful-by-themselves, interfaces for looking at the knowledge encoded in LLMs without making them generate text. Intriguing.
ank
Yep, I want humans to be the superpowerful “ASI agents”, while the ASI itself will be the direct democratic simulated static places (with non-agentic simple algorithms doing the dirty, non-fun work, the way it works in GTA3-4-5). It’s basically hard to explain without writing a book, and it’s counterintuitive. But I’m convinced it will work if the effort is applied. All knowledge can be represented as static geometry; no agents are needed for that except us.
Milan W
How can a place be useful if it is static? For reference, I'm imagining a garden where the blades of grass are 100% rigid in place and the water does not flow. I think you are imagining something different.
ank
Great question. In the most elegant scenario, where you have the whole history of the planet or universe (or a multiverse, let's go all the way) simulated, you can represent it as a bunch of geometries (giant shapes of different slices of time aligned with each other, basically many 3D Earths, each one one moment later in time) stacked on top of each other, almost the same way it's represented in long-exposure photos (I list examples below). So you have this place of all-knowing, and you, the agent, focus on a particular moment (by "forgetting" everything else), on a particular 3D shape (maybe your childhood home); you can choose to slice through the 3D frozen shapes of a world of your choosing, like through the frames of a movie. This way it's both static and dynamic. It's a little bit like looking at this almost infinite static shape through some "magical cardboard with a hole in it" (your focusing/forgetting ability that creates the illusion of dynamism); I hope I didn't make it more confusing. You can see the whole multiversal thing as a fluffy light, or zoom in (by forgetting almost the whole multiverse except the part you zoomed in on) to land on Earth and see 14 billion years as a hazy ocean with bright curves in the sky that trace the Sun’s journey over our planet’s lifetime. Forget even more and see your hometown street, with you appearing as a hazy ghost and a trace behind you showing the paths you once walked; you’ll be more opaque where you were stationary (say, sitting on a bench) and more translucent where you were in motion. And in the garden you'll see the 3D "long-exposure photo" of the fluffy blades of grass, which looks like a frothy river, near the real pale-blue frothy river; you focus on a particular moment and the picture becomes crisp. You choose to relive your childhood and it comes alive, as you slice through the 3D moments of time once again. A less elegant scenario is to make a high-quality game better than The Sims or GTA3-4-5, without any agent
Milan W
Let me summarize so I can see whether I got it: you see "place AI" as a body of knowledge that can be used to make a good-enough simulation of arbitrary sections of spacetime, where all events are precomputed. That precomputed (thus deterministic) aspect is what you call "staticness".
ank
Yes. I decided to start writing a book in posts here and on Substack, starting from the Big Bang and the ethics, because otherwise my explanations are confusing :) The ideas themselves are counterintuitive, too. I try to physicalize, work from first principles, and use TRIZ to try to come up with ideal solutions. I also did a 3-year-long thought experiment where I was modeling the ideal ultimate future, basically how everything would work and look if we had infinite compute and no physical limitations. That's why some of the things I mention will probably take some time to implement in their full glory. Right now an agentic AI is a librarian who has almost all the output of humanity stolen and hidden in its library, which it doesn't allow us to visit; it just spits short quotes at us instead. But the AI librarian visits (and even changes) our own human library (our physical world) and has already stolen copies of the whole output of humanity from it. Feels unfair. Why can we not visit (like in a 3D open-world game) and change (direct-democratically) the AI librarian's library? I basically want to give people everything except the agentic AIs, because I think people should remain the most capable "agentic AIs"; otherwise we'll pretty much guarantee uncomfortable and fast changes to our world. There are ways to represent the whole simulated universe as a giant static geometric shape:
* Each moment of time is a giant 3D geometric shape of the universe. If you align them on top of each other, you effectively get a 4D shape of spacetime that is static but has all the information about the dynamics/movements in it. So the 4D shape is static, but you choose some smaller 3D shape inside of it (probably of a human agent) and "choose the passage" from one human-like-you shape to another, making the static 4D shape seem like the dynamic 3D shape that you experience. The whole 4D thing looks very similar to the way the long-exposure photos look that I shared somewhere in my comm
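
A toy sketch of that representation, just to make the data structure concrete: a small voxel world stands in for the "giant static shape"; slicing along the time axis recovers the dynamics, and averaging over it gives the long-exposure view described above:

```python
# Toy sketch: a "static" 4D spacetime block whose time-slices recover dynamics.
# The world is a tiny voxel grid; everything here is illustrative.
import numpy as np

T, X, Y, Z = 100, 16, 16, 16
world = np.zeros((T, X, Y, Z))

# A single "agent" voxel moving along x over time; once built, the 4D array
# never changes -- it is the frozen shape that contains all the motion.
for t in range(T):
    world[t, t % X, Y // 2, Z // 2] = 1.0

moment = world[42]                  # one frozen 3D "Earth" at a single instant
long_exposure = world.mean(axis=0)  # the hazy trace of every position ever occupied

print("one moment, occupied voxels:", int(moment.sum()))
print("long exposure, max opacity along the path:", float(long_exposure.max()))
```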