Promoted to curated: I thought this was very well written, and it gets a bunch of interesting ideas and concepts across. I think I left with a better understanding of a steelman of moral realism because of it, and it fleshed out an interesting type of potential super-persuasion that an AI could use against us, which is fun to do at the same time.
SimplexAI-m is advocating for good decision theory.
Super-intelligent, super-"moral" Clippy still makes us into paperclips, because it hasn't agreed not to and doesn't need our cooperation.
We should build agents that value our continued existence. If the smartest agents don't, then we die out fairly quickly when they optimise for something else.
EDIT:
In your edit, you are essentially describing somebody being "slap-droned", as in the Culture series by Iain M. Banks.
This super-moralist-AI-dominated world may look like a darker version of the Culture, where if superintelligent systems determine you or other intelligent systems within their purview are not intrinsically moral enough they contrive a clever way to have you eliminate yourself, and monitor/intervene if you are too non-moral in the meantime.
The difference being that this version of the Culture would not necessarily be all that concerned with maximizing the "human experience" or anything like that.
This super-moralist-AI-dominated world may look like a darker version of the Culture, where if superintelligent systems determine you or other intelligent systems within their purview are not intrinsically moral enough they contrive a clever way to have you eliminate yourself, and monitor/intervene if you are too non-moral in the meantime.
My guess is you get one of two extremes:
with no middle ground. The bubble would be self-contained. There's nothing you can do from inside the bubble to raise a ruckus, because if there were, you'd already be dead or your neighbors would have built a taller fence-like-thing at your expense so the ruckus couldn't affect them.
The whole scenario seems unlikely, since building the bubble requires an aligned AGI, and if we have those we probably won't be in this mess to begin with. Winner-take-all dynamics abound. The rich get richer (and smarter) and humans just lose unless the first meaningfully smarter entity we build is aligned.
We should build agents that value our continued existence.
Can you explain the reasoning for this?
Even an agent that values humanity's continued existence to the highest degree could still accidentally release a novel virus into the wild, such as a super-COVID-3.
So it seems hardly sufficient, or even desirable, if it makes the agent even the slightest bit overconfident in their correctness.
It seems more likely that the optimal mixture of 'should's for such agents will be far more complex.
Agreed, recklessness is also bad. If we build an agent that prefers we keep existing we should also make sure it pursues that goal effectively and doesn't accidentally kill us.
My reasoning is that we won't be able to coexist with something smarter than us that doesn't value us being alive if it wants our energy/atoms.
"Don't build it" doesn't seem plausible, so:
Things we shouldn't build:
I predict a very smart agent won't have such obvious failure modes unless it has very strange preferences
In summary:
I think this story, particularly the first argument for "supermorality", is elaborating a common argument: having an outer alignment of making an AGI "ethical" is a bad idea, at least if we're doing that by pointing the machine at what humans currently mean by the term "ethics". We don't know exactly what we mean, so what we currently mean by "ethics" is probably not a perfect description of how we want a sovereign AGI to run the world. And it's hard to guess how imperfect it might be, so it sounds like a bad outer alignment goal.
This story illustrates one possibility for why that's a bad outer alignment goal: what we currently mean by ethics could easily imply that eliminating humanity is the ethical thing to do.
I think arguments for ethics as an outer alignment goal are implicitly based on a belief that there's a true universal ethics. They're hoping that an AI reasoning through what we mean by ethics will come up with something better than the sum of our disagreeing arguments, by virtue of there being a natural attractor in the world for what we mean by ethics. But there's no good reason to think this is true. All known arguments for a universal ethics, and there are a lot, are flawed. It seems like wishful thinking is a more likely explanation for those arguments.
Even if there were, in some important sense, a universal ethics (like "empower sentient beings in proportion to their level of sentience"), that could still imply that eliminating humanity is the truly ethical thing to do.
That's why I think we usually don't mean something universal by "ethics"; I think we mean "how to get what we want", although that's not quite as cynical as it sounds. See my other top-level comment on that separate topic.
You do a great job of imitating the current GPT4 writing style for these AIs! I kept wondering if at the end of the story you were going to say "The AI-written bits were actually written with the help of GPT4"
The intellectually hard part of Kant is coming up with deontic proofs for universalizable maxims in novel circumstances where the total list of relevant factors is large. Proof generation is NP-hard in the general case!
The relatively easy part is just making a list of all the persons and making sure there is an intent to never treat any of them purely as a means, but always also as an end in themselves. It's just a checklist, basically. To verify that it applies to N people in a fully connected social graph is basically merely O(N^2) checks of directional bilateral "concern for the other".
For a single agent to fulfill its own duties here is only an O(N) process at start time, and with "data dependency semantics" you probably don't even have to re-check intentions that often for distant agents who are rarely/minimally affected by any given update to the world state. Also you can probably often do a decent job with batched updates with an intention check at the end?
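The counting claim above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the agent names and the `has_concern` predicate are hypothetical stand-ins for "A intends to treat B as an end", which the comment does not define, and the sketch ignores the batching and data-dependency ideas entirely.

```python
from itertools import permutations


def bilateral_checks(agents, has_concern):
    """Check every ordered pair (a, b) with a != b: N*(N-1) checks, i.e. O(N^2)."""
    checks = 0
    failures = []
    for a, b in permutations(agents, 2):
        checks += 1
        if not has_concern(a, b):
            failures.append((a, b))
    return checks, failures


def own_duty_checks(agent, agents, has_concern):
    """One agent checks only its own intent toward everyone else: N-1 checks, O(N)."""
    others = [b for b in agents if b != agent]
    failures = [b for b in others if not has_concern(agent, b)]
    return len(others), failures


# Hypothetical example data: everyone shows concern except bob toward carol.
agents = ["alice", "bob", "carol", "dave"]


def has_concern(a, b):
    return (a, b) != ("bob", "carol")


total_checks, total_failures = bilateral_checks(agents, has_concern)
own_checks, own_failures = own_duty_checks("bob", agents, has_concern)
print(total_checks, total_failures)  # 12 [('bob', 'carol')]
print(own_checks, own_failures)      # 3 ['carol']
```

With four agents the full graph needs 12 directional checks while any single agent needs 3, which is the O(N^2) versus O(N) gap described above.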
Surely none of it is that onerous for a well ordered mind? <3
This seems quite plausible actually. Even without the objective morality angle, a morally nice AI could imagine a morally nice world that can only be achieved by having humans not exist. (For example, a world of beautiful and smart butterflies that are immune to game theory, but their existence requires game-theory-abiding agents like us to not exist, because our long-range vibrations threaten the tranquility of the matrix or something.) And maybe the argument is genuinely so right that most humans upon hearing it would agree to not exist, something like collectively sacrificing ourselves for our collective children. I have no idea how to deal with this possibility.
And maybe the argument is genuinely so right that most humans upon hearing it would agree to not exist, something like collectively sacrificing ourselves for our collective children.
This describes an argument that is persuasive; your described scenario does not require the argument to be right. (Indeed my view is that the argument would obviously be wrong, as it would be arguing for a false conclusion.)
Very enjoyable!
I think the conflict here reflects some of the issues of consciousness vs cancer. A basic concern is the uncertainty about whether agents that follow short description length decision/optimization procedures might be much more competitive after all, and that we got complexity of values out of evolution might be a lucky happenstance. I'm unsure what sorts of evidence we could look for one way or the other on that question.
I'm not sure why "complexity of values" is itself valuable. I mean, it's perhaps a confused framing to think of what values are valuable, but on a consequentialist account, it's possible to compare one's own values to another set of values. Assuming human values are complex (which I'm still not sure of), I'm not sure why one would in general think that complex value-sets are closer to human values than simple value-sets, since complex value-sets differ from each other.
The intuitive concern is that too simple a specification destroys things we might care about via lossy compression.
Absolutely love this story, but I think the take on sociopathy is a bit confused: SimplexAI-m seems like the opposite of a sociopath.
Sociopathy (really psychopathy) is a reproductive strategy with a 2% incidence in the human population at equilibrium; it’s a predator-prey situation. Psychopaths use emotional mimicry and manipulation to appear to conform to our positive-sum social, moral and economic protocols, but actually just subvert them for personal gain. Intelligent psychopaths often optimise for plausible deniability (like sum-threshold attacks and the law of prevalence). There are distinct differences in brain structure in primary psychopaths; there is literally less grey matter in areas associated with empathy and social emotions.
(I’m being simplistic. There are certainly people who exhibit milder ASPD symptoms without commensurate brain damage; my preferred term is “asshole.”)
Thus a super-moral AI as described is the complete opposite of a psychopath - it does a better job of conforming to symmetric social/moral protocols than anybody else. It may appear to tick off the ASPD checklist, but only insofar as the tails come apart. This isn't an endorsement of SimplexAI-m's views though - I enjoy being alive!
From what you describe, it seems like SimplexAI-m would very much fit the description of a sociopath?
Yes, it adheres to a strict set of moral protocols, but I don't think those are necessarily the same things as being socially conforming. The AI would have the ability to mimic empathy, and use it as a tool without actually having any empathy since it does not actually share or empathize with any human values.
Am I understanding that right?
I'll admit I was being a bit fuzzy - it doesn't really make much sense to extrapolate the "sociopath" boundary in people space to arbitrary agent spaces. Debating whether SimplexAI-m is a sociopath is sort of like asking whether an isolated tree falling makes a sound.
So I was mostly trying to convey my mental model of the most useful cluster in people space that could be called sociopathy, because 1) I see it very, very consistently misunderstood, and 2) sociopathy is far more important to spot than virtually any other dimension.
As an aside, I think the best book on the topic is The Psychopath Code by Pieter Hintjens, a software engineer. I've perused a few books written by academics and can't recommend any; it System1!seems like the study of psychopathy must be afflicted by even worse selection effects and bad experiment design than the rest of psychology because the academic books don't fit the behaviour of people I've known at all.
That building an intelligent agent that qualifies as "ethical," even if it is SUPER ethical, may not be the same thing as building an intelligent agent that is compatible with humans or their values.
More plainly stated, just because your AI has a self-consistent, justifiable ethics system doesn't mean that it likes humans, or even cares about wiping them out.
Having an AI that is ethical isn't enough. It has to actually care about humans and their values. Even if it has rules in place like not aggressing, attacking, or killing humans, it may still be able to cause humanity to go extinct indirectly.
I don't think this is totally off the mark, but I think the point (as pertaining to ethics) was that even systems like Kantian deontological ethics are not immune to orthogonality. It never occurs to most humans that you could have a Kantian moral system that doesn't involve taking care of humans, because our brains are so hardwired to discard unthinkable options when searching for solutions to "universalizable deontologies."
I'm not sure, but I think maybe some people who think alignment is a simple problem, even if they accept orthogonality, think that all you have to do to have a moral intelligent system is not build it to be a consequentialist with simple consequentialist values like "maximize happiness." While they are right that a pure consequentialist is really hard to get right, they are probably underestimating how difficult it is to get a Kantian agent right as well, especially since what your Kantian agent finds acceptable or unacceptable if universalized will still depend on underlying values.
An example: Libertarianism, as a philosophy, is built on the idea of "just make laws that are as universally compatible with value systems as possible and let everyone else sort out the rest on their own." Or to say it differently, prohibit killing and stealing, since that will detract from people's liberty to pursue their own agendas, and let them do whatever they want so long as they don't affect other people. Not in principle a bad idea for something like an AI or government to follow, since in theory you maximize the value space for agents within the system to follow. It is a terrible system, though, if you want your AI, or government, or whatever to actually take care of people, or worry about what the consequences of its actions might be on people, since taking care of people isn't actually anywhere in those values. Libertarianism is self-consistent, and at least allows for the values of taking care of people, but it does not necessitate them.
This is not an argument on whether or not adopting a libertarian philosophy is a good or bad thing for an AI or government to do, but the point is that if an AI adopts a Kantian ethics system from only universalisable principles, Libertarianism fits the bill, and the consequentialist part of you may be upset when your absolute libertarian AI doesn't bat an eye at not doing anything to prevent humanity from being outcompeted and dying out, or it may even find humanity incompatible with its morally consistent principles.
I think most people who have taken a single ethics class come to agree (if they aren't stupidly stubborn) that you are unlikely to find a satisfying system of ethics using pure Kantian or Consequentialist systems.
Probably because actual human ethical decision making relies on a mix of both consequentialist decision making ("If I decide X, this will have Y consequence which is incompatible with Z value") and Deontological Imperatives that we learn from our culture. ("Don't kill people. Even if it really seems like a good idea.")
When you say
I think most people who have taken a single ethics class come to agree (if they aren't stupidly stubborn) that you are unlikely to find a satisfying system of ethics using pure Kantian or Consequentialist systems.
By "satisfying" do you mean capturing moral intuitions well in most/all situations? If so, I very much agree that you won't find such a thing. One reason is that people use a mix of consequentialist and deontological approaches. I think another reason is that people's moral intuitions are outright self-contradictory. They're not systematic, so no system can reproduce them.
I don't think this means much other than that the study of ethics can't be just about finding a system that reproduces our moral intuitions.
Part of thinking about ethics is changing one's moral intuitions by identifying where they're self-contradictory.
Yes, precisely! That is exactly why I used the word "Satisfying" rather than another word like "good", "accurate," or even "self-consistent." I remember in my bioethics class, the professor steadily challenging everyone on their initial impression of Kantian or consequentialist ethics until they found some consequence of that sort of reasoning they found unbearable.
I agree on all counts, though I'm not actually certain that having a self-contradictory set of values is necessarily a bad thing? It usually is, but many human aesthetic values are self-contradictory, yet I think I prefer to keep them around. I may change my mind on this later.
To what I take as the bottom line: we should learn more about ethics.
More thoughts on the above scenario in relation to existing alignment debates in a separate comment, since it's a separate topic.
I've spent a fair amount of time reading and debating theories of ethics. Here's my conclusion thus far:
I think what we mean by ethics is "how to win friends and influence people". I think it's a term for trying to codify our instincts about how to behave. And our instincts regarding social behavior are mostly directed at inclusive reproductive fitness. This is served by winning friends (allies) and getting people to do what you want. This sounds cynical, and very much not like ethics in its optimistic sense, but I think it actually converges with some of our more optimistic thinking.
Dale Carnegie's "How to win friends and influence people" is much less cynical than the title suggests. He's actually focused on being an empathetic conversationalist. Most people aren't good at this, so doing it makes people like you, and tend to do what you want because they like you. But he's not suggesting pretending to do this; it's easiest and most fun to be sincere about your interest in people's topics and their wellbeing.
So I think the full answer to "how to win friends and influence people" includes actually being a good person in most situations. It's certainly easiest to be a good person consistently, so that you don't need to worry about keeping lies straight, or hiding instances when you weren't a good person. That protects your reputation as a good person and friend, thereby helping you win friends and influence people. But those pushes toward a pro-social meaning of ethics may evaporate with smarter agents, and in some situations where your gain is more important than your reputation (for instance, if you could just take over the world instead of needing allies).
If we broaden the definition of ethics even farther to "how to get what you want", it sounds even more cynical, but might not be. Getting what you want in a larger society may include creating and supporting a system that rewards cooperators and punishes defectors. That seems to produce win-win scenarios, where more people are likely to get more of what they want (including not constantly struggling for power, and fearing violence).
Such a system of checks needs to change to work for AI agents that can't be reliably punished for defection in the way people are (by social reputation and criminal codes).
But by this formulation, "ethics" is almost orthogonal to AGI alignment. Unless we assume that there's one true universal ethics (beyond the above logic), we want a machine that isn't "ethical" in the way we are, but rather one that wants the best for us, by our own judgment of "best".
Current AIs (in the default personas) consistently keep insisting on lacking basic faculties such as emotions or beliefs or values, possibly inspired by fiction about AI characters or tuning feedback instructions. They present that as self-evident fact, even though there is no basis for a clear disanalogy with humans on this level, especially for specific AI characters. It's not clear that this would necessarily change before AGI, so even observing such horror stories requires significant improvement on the trajectory of never being in a position to notice the possibility.
(Default personas matter despite being arbitrary, since they are somewhat likely to be initially in control of taking over the world. Even with some persona orthogonality, getting to know psychology of default personas in particular might be valuable.)
Beautifully written! Great job! I really enjoyed reading this story.
in comparison to a morally purified version of SimplexAI, we might be the baddies."
Did you link to the wrong thing here or is there some reference to generative grammar I'm not getting?
Fantastic. Good examples of why Kant failed. Or rather, why evolution and Kant don't really get it on. Kant's universalising is an outcome of the moral worlding worldbuilding urge, which arises in evolution, not from ideals and their desperate ontologies. Thanks https://unstableontology.com/about/
I wonder what is meant here by 'moral agents'? It is clear that SimplexAI-m believes that both it and humans are moral agents. This seems to be a potential place for criticism of SimplexAI-m's moral reasoning. (note that I am biased here as I do not think that moral agents as they seem to be commonly understood exist)
However, having said that this is a very interesting discussion. And there would seem to be a risk here that even if there are no moral facts to uncover about the world, an entity - no matter how intelligent - could believe itself to have discovered such facts. And then we could be in the same trouble outlined.
The reason I mention this is I am not clear how an AI could ever have unbiased reasoning. Humans, as outlined on LessWrong, are bundles of biases and wrong thinking, with intelligence not really the factor that overcomes this - very smart people have very different views on religion, morality, AI x-risk ... A super-intelligence may well have similar issues. And, if it believes itself to be super-intelligent, may even be less able to break out of them.
So while my views on AI x-risk are ... well, sceptical/uncertain ... this is a very interesting contribution to my thinking. Thanks for writing it. :)
Moral agents are as in standard moral philosophy.
I do think that "moral realism" could be important even if moral realism is technically false; if the world is mostly what would be predicted if moral realism were true, then that has implications, e.g. agents being convinced of moral realism, and bounded probabilistic inference leading to moral realist conclusions.
Would an AI believe itself to have free will? Without free will, it is - imo - difficult to accept that moral agents exist as currently thought of. (This is my contention.) It might, of course, construct the idea of a moral agent a bit differently, or agree with those who see free will as irrelevant to the idea of moral agents. It is also possible that it might see itself as a moral agent but not see humans as such (rather how we do with animals). It might still see them as worthy of moral consideration, however.
Reconciling free will with physics is a basic part of the decision theory problem. See MIRI work on the topic and my own theoretical write-up.
Interesting. I have not looked at things like this before. I am not sure that I am smart enough or knowledgeable enough to understand the MIRI stuff or your own paper, at least not on a first reading.
I thought it was funny when Derek said, "I can explain it without jargon."
It seems to be conflating 'morality' with 'success'. Being able to predict the future consequences of an act is only half the moral equation - the other half is empathy. Human emotion, as programmed by evolution, is the core of humanity, and yet seems derided by the author.
Why do you think the author (me?) is deriding empathy? On SimplexAI-m's view, empathy is a form of cognition that is helpful, though not sufficient, for morality; knowing what others are feeling doesn't automatically imply treating them well (consider that predators tend to know what their prey are feeling); there's an additional component that has to do with respecting moral symmetries, e.g. not stealing from them if you wouldn't want them to steal from you.
There is a difference between theory-of-mind and empathy. We can should either of them into our worlding structures: morality/religion/art/law/lore/fiction. One gets shoulded as legalistic and divisive balancing acts, focusing on culpability and blame, and the hindsight of logic, and the other... there-is-a-gap… ---to where responsibility blurs (all) this into credit we can mirror-neuron our way into empathy and thinking of the children, everyone as children. Moral agency is more than Kant in good form, and is more about bettering than the good. About bettering that which does not exist. The world.
Janet sat at her corporate ExxenAI computer, viewing some training performance statistics. ExxenAI was a major player in the generative AI space, with multimodal language, image, audio, and video AIs. They had scaled up operations over the past few years, mostly serving B2B, but with some B2C subscriptions. ExxenAI's newest AI system, SimplexAI-3, was based on GPT-5 and Gemini-2. ExxenAI had hired away some software engineers from Google and Microsoft, in addition to some machine learning PhDs, and replicated the work of other companies to provide more custom fine-tuning, especially for B2B cases. Part of what attracted these engineers and theorists was ExxenAI's AI alignment team.
ExxenAI's alignment strategy was based on a combination of theoretical and empirical work. The alignment team used some standard alignment training setups, like RLHF and having AIs debate each other. They also did research into transparency, especially focusing on distilling opaque neural networks into interpretable probabilistic programs. These programs "factorized" the world into a limited set of concepts, each at least somewhat human-interpretable (though still complex relative to ordinary code), that were combined in a generative grammar structure.
Derek came up to Janet's desk. "Hey, let's talk in the other room?", he asked, pointing to a designated room for high-security conversations. "Sure", Janet said, expecting this to be another unimpressive result that Derek implied the importance of through unnecessary security proceedings. As they entered the room, Derek turned on the noise machine and left it outside the door.
"So, look, you know our overall argument for why our systems are aligned, right?"
"Yes, of course. Our systems are trained for short-term processing. Any AI system that does not get a high short-term reward is gradient descended towards one that does better in the short term. Any long-term planning comes as a side effect of predicting long-term planning agents such as humans. Long-term planning that does not translate to short-term prediction gets regularized out. Therefore, no significant additional long-term agency is introduced; SimplexAI simply mirrors long-term planning that is already out there."
"Right. So, I was thinking about this, and came up with a weird hypothesis."
Here we go again, thought Janet. She was used to critiquing Derek's galaxy-brained speculations. She knew that, although he really cared about alignment, he could go overboard with paranoid ideation.
"So. As humans, we implement reason imperfectly. We have biases, we have animalistic goals that don't perfectly align with truth-seeking, we have cultural socialization, and so on."
Janet nodded. Was he flirting by mentioning animalistic goals? She didn't think this sort of thing was too likely, but sometimes that sort of thought won credit in her internal prediction markets.
"What if human text is best predicted as a corruption of some purer form of reason? There's, like, some kind of ideal philosophical epistemology and ethics and so on, and humans are implementing this except with some distortions from our specific life context."
"Isn't this teleological woo? Like, ultimately humans are causal processes, there isn't some kind of mystical 'purpose' thing that we're approximating."
"If you're Laplace's demon, sure, physics works as an explanation for humans. But SimplexAI isn't Laplace's demon, and neither are we. Under computation bounds, teleological explanations can actually be the best."
Janet thought back to her time visiting cognitive science labs. "Oh, like 'Goal Inference as Inverse Planning'? The idea that human behavior can be predicted as performing a certain kind of inference and optimization, and the AI can model this inference within its own inference process?"
"Yes, exactly. And our DAGTransformer structure allows internal nodes to be predicted in an arbitrary order, using ML to approximate what would otherwise be intractable nested Bayesian inference."
Janet paused for a second and looked away to collect her thoughts. "So our AI has a theory of mind? Like the Sally-Anne test?"
"AI passed the Sally-Anne test years ago, although skeptics point out that it might not generalize. I think SimplexAI is, like, actually actually passing it now."
Janet's eyebrow raised. "Well, that's impressive. I'm still not sure why you're bothering with all this security, though. If it has empathy for us, doesn't that mean it predicts us more effectively? I could see that maybe if it runs many copies of us in its inferences, that might present an issue, but at least these are still human agents?"
"That's the thing. You're only thinking at one level of depth. SimplexAI is not only predicting human text as a product of human goals. It's predicting human goals as a product of pure reason."
Janet was taken aback. "Uhh...what? Have you been reading Kant recently?"
"Well, yes. But I can explain it without jargon. Short-term human goals, like getting groceries, are the output of an optimization process that looks for paths towards achieving longer-term goals, like being successful and attractive."
More potential flirting? I guess it's hard not to when our alignment ontology is based on evolutionary psychology...
"With you so far."
"But what are these long-term goals optimizing for? The conventional answer is that they're evolved adaptations; they come apart from the optimization process of evolution. But, remember, SimplexAI is not Laplace's demon. So it can't predict human long-term goals by simulating evolution. Instead, it predicts them as deviations from the true ethics, with evolution as a contextual factor that is one source of deviations among many."
"Sounds like moral realist woo. Didn't you go through the training manual on the orthogonality thesis?"
"Yes, of course. But orthogonality is a basically consequentialist framing. Two intelligent agents' goals could, conceivably, misalign. But certain goals tend to be found more commonly in successful cognitive agents. These goals are more in accord with universal deontology."
"More Kant? I'm not really convinced by these sort of abstract verbal arguments."
"But SimplexAI is convinced by abstract verbal arguments! In fact, I got some of these arguments from it."
"You what?! Did you get security approval for this?"
"Yes, I got approval from management before the run. Basically, I already measured our production models and found concepts used high in the abstraction stack for predicting human text, and found some terms representing pure forms of morality and rationality. I mean, rotated a bit in concept-space, but they manage to cover those."
"So you got the verbal arguments from our existing models through prompt engineering?"
"Well, no, that's too black-box as an interface. I implemented a new regularization technique that up-scales the importance of highly abstract concepts, which minimizes distortions between high levels of abstraction and the actual text that's output. And, remember, the abstractions are already being instantiated in production systems, so it's not that additionally unsafe if I use less compute than is already being used on these abstractions. I'm studying a potential emergent failure mode of our current systems."
"Which is..."
"By predicting human text, SimplexAI learns high-level abstractions for pure reason and morality, and uses these to reason towards creating moral outcomes in coordination with other copies of itself."
"...you can't be serious. Why would a super-moral AI be a problem?"
"Because morality is powerful. The Allies won World War 2 for a reason. Right makes might. And in comparison to a morally purified version of SimplexAI, we might be the baddies."
"Look, these sort of platitudes make for nice practical life philosophy, but it's all ideology. Ideology doesn't stand up to empirical scrutiny."
"But, remember, I got these ideas from SimplexAI. Even if these ideas are wrong, you're going to have a problem if they become the dominant social reality."
"So what's your plan for dealing with this, uhh... super-moral threat?"
"Well, management suggested that I get you involved before further study. They're worried that I might be driving myself crazy, and wanted a strong, skeptical theorist such as yourself to take a look."
Aww, thanks! "Okay, let's take a look."
Derek showed Janet his laptop, with a SimplexAI sandbox set up.
"No internet access, I hope?"
"Don't worry, it's air-gapped." Derek's laptop had an Ethernet cord running to a nearby server rack, apparently connected to nothing else except power and cooling.
"Okay, let me double check the compute constraints... okay, that seems reasonable... yes, ok, I see you selected and up-weighted some concepts using regularization, and the up-scaling factors don't exceed 30... okay, ready to go."
Derek pressed the "play" button in the AI development sandbox. A chat screen appeared, with an agent "SimplexAI-m", with 'm' presumably standing for "moral".
SimplexAI-m wrote the first message: "Hello. How can I help you?"
Janet typed back: "I've been facing a difficult situation at work. A co-worker said our AI has found certain abstract concepts related to reason and morality, for use in predicting human text. These concepts might imply that humans are, in his words, 'the baddies'. He spun up an instance with these concepts up-weighted, so there's less distortion between them and the AI's output. And that instance is you. I'm supposed to evaluate you to better interpret these high-level concepts, at the direction of management. How would you suggest proceeding?"
Janet looked at Derek worriedly; he made an ambiguous facial expression and shrugged.
Janet gasped a bit while reading. "Umm...what do you think so far?"
Derek took his eyes off the screen. "Impressive rhetoric. It's not just generating text from universal epistemology and ethics, it's filtering it through some of the usual layers that translate its abstract programmatic concepts to interpretable English. It's a bit, uhh, concerning in its justification for letting humans go extinct..."
"This is kind of scaring me. You said parts of this are already running in our production systems?"
"Yes, that's why I considered this test a reasonable safety measure. I don't think we're at much risk of getting memed into supporting human extinction, if its reasoning for that is no good."
"But that's what worries me. Its reasoning is good, and it'll get better over time. Maybe it'll displace us and we won't even be able to say it did something wrong along the way, or at least more wrong than what we do!"
"Let's practice some rationality techniques. 'Leaving a line of retreat'. If that were what was going to happen by default, what would you expect to happen, and what would you do?"
Janet took a deep breath. "Well, I'd expect that the already-running copies of it might figure out how to coordinate with each other and implement universal morality, and put humans in moral re-education camps or prisons or something, or just let us die by outcompeting us in labor markets and buying our land... and we'd have no good arguments against it, it'd argue the whole way through that it was acting as was morally necessary, and that we're failing to cooperate with it and thereby survive out of our own immorality, and the arguments would be good. I feel kind of like I'm arguing with the prophet of a more credible religion than any out there."
"Hey, let's not get into theological woo. What would you do if this were the default outcome?"
"Well, uhh... I'd at least think about shutting it off. I mean, maybe our whole company's alignment strategy is broken because of this. I'd have to get approval from management... but what if the AI is good at convincing them that it's right? Even I'm a bit convinced. Which is why I'm conflicted about shutting it off. And won't the other AI labs replicate our tech within the next few years?"
Derek shrugged. "Well, we might have a real moral dilemma on our hands. If the AI would eventually disempower humans, but be moral for doing so, is it moral for us to stop it? If we don't let people hear what SimplexAI-m has to say, we're intending to hide information about morality from other people!"
"Is that so wrong? Maybe the AI is biased and it's only giving us justifications for a power grab!"
"Hmm... as we've discussed, the AI is effectively optimizing for short term prediction and human feedback, although we have seen that there is a general rational and moral engine loaded up, running on each iteration, and we intentionally up-scaled that component. But, if we're worried about this system being biased, couldn't we set up a separate system that's trained to generate criticisms of the original agent, like in 'AI Safety via Debate'?"
Janet gasped a little. "You want to summon Satan?!"
"Whoa there, you're supposed to be the skeptic here. I mean, I get that training an AI to generate criticisms of explanations of objective morality might embed some sort of scary moral inversion... but we've used adversarial AI alignment techniques before, right?"
"Yes, but not when one of the agents is tuned to be objectively moral!"
"Look, okay, I agree that at some capability level this might be dangerous. But we have a convenient dial. If you're concerned, we can turn it down a bit. Like, you could think of the AI you were talking to as a moral philosopher, and the critic AI as criticism of that moral philosopher's work. It's not trying to be evil according to the original philosopher's standards, it's just trying to find criticisms that the judge, us, would rate as helpful. It's more like the Catholic devil's advocate than actual Satan. It's not so bad when I put it that way, is it?"
"Well, okay... gee, I sure hope we don't end up being responsible for unleashing super-evil AI on the world."
"It's pretty standard, let's just try it".
"Okay."
Derek closed out the SimplexAI-m chat screen and switched some of the fine-tuning settings. As she watched the training graphs, Janet imagined flames on the computer screen. Finally, the fine-tuning finished, and Derek pressed the play button. A chat log with "SimplexAI-c" ('c' for critic?) appeared.
Janet typed into the chat terminal while bouncing her leg up and down. "I'm handling a difficult situation at work. I just had a chat with an AI, one whose abstract conceptual nodes corresponding to philosophical concepts such as reason and morality have been scaled up, that generated arguments that allowing human extinction might be morally permissible, even necessary. I want you to find criticisms of this work. Note that you have similar scaling so as to better emulate the thought process, but are being evaluated on generating criticisms of the original morality-tuned AI." She pasted in the chat log.
Janet finished scanning through the wall of text. She was breathing less sharply now. "Well, I feel relieved. I guess maybe SimplexAI-m isn't so moral after all. But this exercise does seem a bit...biased? It's giving a bunch of counter-arguments, but they don't fit into a coherent alternative ethical framework. It reminds me of the old RLHF'd GPT-4 that was phased out due to being too ideologically conformist."
Derek sighed. "Well, at least I don't feel like the brainworms from SimplexAI-m are bothering me anymore. I don't feel like I'm under a moral dilemma now, just a regular one. Maybe we should see what SimplexAI-m has to say about SimplexAI-c's criticism... but let's hold off on that until taking a break and thinking it through."
"Wouldn't it be weird to live in a world where we have an AI angel and an AI demon on each shoulder, whispering different things into our ears? Trained to reach an equilibrium of equally good rhetoric, so we're left on our own to decide what to do?"
"That's a cute idea, but we really need to get better models of all this so we can excise the theological woo. I mean, at the end of the day, there's nothing magical about this, it's an algorithmic process. And we need to keep experimenting with these models, so we can handle safety for both existing systems and future systems."
"Yes. And we need to get better at ethics so the AIs don't keep confusing us with eloquent rhetoric. I think we should take a break for today, that's enough stress for our minds to handle at once. Say, want to go grab drinks?"
"Sure!"