Is there a safe way to wish for an unsafe genie to behave like a safe genie? That seems like a wish TOSWP should work on.
This might be done by picking an arbitrary genie, and then modifying your judgement criteria to match that genie's.
Sounds like we need to formalize human morality first, otherwise you aren't guaranteed consistency. Of course formalizing human morality seems like a hopeless project. Maybe we can ask an AI for help!
On further reflection, the wish as expressed by Nick Tarleton above sounds dangerous, because all human morality may either be inconsistent in some sense, or 'naive' (failing to account for important aspects of reality we aren't aware of yet). Human morality changes as our technology and understanding changes, sometimes significantly. There is no reason to believe this trend will stop. I am afraid (genuine fear, not figure of speech) that the quest to properly formalize and generalize human morality for use by a 'friendly AI' is akin to properly formalizing and generalizing Ptolemean astronomy.
This generalises. Since you don't know everything, anything you do might wind up being counterproductive.
Like, I once knew a group of young merchants who wanted their shopping district revitalised. They worked at it and got their share of federal money that was assigned to their city, and they got the lighting improved, and the landscaping, and a beautiful fountain, and so on. It took several years and most of the improvements came in the third year. Then their landlords all raised the rents and they had to move out.
That one was predictable in hindsight, b...
Wonderfully provocative post (meaning no disregard toward the poor old woman caught in the net of a rhetorical and definitional impasse). Obviously in reference to the line of thought in the "devil's dilemma" enshrined in the original Bedazzled, and so many magic-wish-fulfillment folk tales, in which there is always a loophole exploited by a counter-force, probably IMO in response to the motive to shortcut certain aspects of reality and its regulatory processes, known or unknown. It would be interesting to collect real life anecdotes about peop...
It seems contradictory to previous experience that humans should develop a technology with "black box" functionality, i.e. whose effects could not be foreseen and accurately controlled by the end-user. Technology has to be designed and it is designed with an effect/result in mind. It is then optimized so that the end user understands how to call forth this effect. So positing an effective equivalent of the mythological figure "Genie" in technological form ignores the optimization-for-use that would take place at each stage of developing...
Eric, I think he was merely attempting to point out the futility of wishes. Or rather, the futility of asking something for something you want that does not share your judgments on things. The Outcome pump is merely, like the Genie, a mechanism by which to explain his intended meaning. The problem of the outcome pump is, twofold: 1. Any theory that states that time is anything other than a constant now with motion and probability may work mathematically but has yet to be able to actually alter the thing which it describes in a measurable way, and 2. The pr...
So positing an effective equivalent of the mythological figure "Genie" in technological form ignores the optimization-for-use that would take place at each stage of developing an Outcome-Pump. The technology-falling-from-heaven which is the Outcome Pump demands that we reverse engineer the optimization of parameters which would have necessarily taken place if it had in fact developed as human technologies do.
Unfortunately, Eric, when you build a powerful enough Outcome Pump, it can wish more powerful Outcome Pumps into existence, which can in turn wish even more powerful Outcome Pumps into existence. So once you cross a certain threshold, you get an explosion of optimization power, which mere trial and error is not sufficient to control because of the enormous change of context, in particular, the genie has gone from being less powerful than you to being more powerful than you, and what appeared to work in the former context won't work in the latter.
Which is precisely what happened to natural selection when it developed humans.
"Unfortunately, Eric, when you build a powerful enough Outcome Pump, it can wish more powerful Outcome Pumps into existence, which can in turn wish even more powerful Outcome Pumps into existence."
Yes, technology that develops itself, once a certain point of sophistication is reached.
My only acquaintance with AI up to now has been this website: http://www.20q.net Which contains a neural network that has been learning for two decades or so. It can "read your mind" when you're thinking of a character from the TV show The Simpsons. Pretty incredible actually!
Eliezer, I clicked on your name in the above comment box and voila- a whole set of resources to learn about AI. I also found out why you use the adjective "unfortunately" in reference to the Outcome Pump, as its on the Singularity Institute website. Fascinating stuff!
"It seems contradictory to previous experience that humans should develop a technology with "black box" functionality, i.e. whose effects could not be foreseen and accurately controlled by the end-user."
Eric, have you ever been a computer programmer? That technology becomes more and more like a black box is not only in line with previous experience, but I dare say is a trend as technological complexity increases.
"Eric, have you ever been a computer programmer? That technology becomes more and more like a black box is not only in line with previous experience, but I dare say is a trend as technological complexity increases."
No I haven't. Could you expand on what you mean?
In the first year of law school students learn that for every clear legal rule there always exists situations for which either the rule doesn't apply or for which the rule gives a bad outcome. This is why we always need to give judges some discretion when administering the law.
Every computer programmer, indeed anybody who uses computers extensively has been surprised by computers. Despite being deterministic, a personal computer taken as a whole (hardware, operating system, software running on top of the operating system, network protocols creating the internet, etc. etc.) is too large for a single mind to understand. We have partial theories of how computers work, but of course partial theories sometimes fail and this produces surprise.
This is not a new development. I have only a partial theory of how my car works, but in th...
Given that it's impossible for the someone to know your total mind without being it, the only safe genie is yourself.
From the above it's easy to see why it's never possible to define the "best interests" of anyone but your own self. And from that it's possible to show that it's never possible to define the best interests of the public, except through their individually chosen actions. And from that you can derive libertarianism.
Just an aside :-)
"Ultimately, most objects, man-made or not are 'black boxes.'"
OK, I see what you're getting at.
Three questions about black boxes:
1) Does the input have to be fully known/observable to constitute a black box? When investigating a population of neurons, we can give stimulus to these cells, but we cannot be sure that we are aware of all the inputs they are receiving. So we effectively do not entirely understand the input being given.
2) Does the output have to be fully known/observable to constitute a black box? When we measure the output of a popula...
tggp, that paper was interesting, although I found its thesis unremarkable. You should share it with our pal Mencius.
Upon some reflection, I remembered that Robin has showed that two Bayesians who share the same priors can't disagree. So perhaps you can get your wish from an unsafe genie by wishing, "... to run a genie that perfectly shares my goals and prior probabilities."
As long as you're wishing, wouldn't you rather have a genie whose prior probabilities correspond to reality as accurately as possible? I wouldn't pick an omnipotent but equally ignorant me to be my best possible genie.
"As long as you're wishing, wouldn't you rather have a genie whose prior probabilities correspond to reality as accurately as possible?"
Such a genie might already exist.
In the first year of law school students learn that for every clear legal rule there always exists situations for which either the rule doesn't apply or for which the rule gives a bad outcome.
If the rule doesn't apply, it's not relevant in the first place. I doubt very much you can establish what a 'bad' outcome would involve in such a way that everyone would agree - and I don't see why your personal opinion on the matter should be of concern when we consider legal design.
Such a genie might already exist.
You mean GOD? From the good book? It's more plausible than some stories I could mention.
GOD, I meta-wish for an ((...Emergence-y Re-get) Emergence-y Re-get) Emergency Regret Button.
Recovering Irrationalist said:
I wouldn't pick an omnipotent but equally ignorant me to be my best possible genie.
Right. It's silly to wish for a genie with the same beliefs as yourself, because the system consisting of you and an unsafe genie is already such a genie.
I discussed "The Myth of the Rule of Law" with Mencius Moldbug here. I recognize that politics alters the application of law and that as long as it is written in natural language there will be irresolvable differences over its meaning. At the same time I observe that different countries seem to hold different levels of respect for the "rule of law" that the state is expected to obey, and it appears to me that those more prone to do so have more livable societies. I think the norm of neutrality on the part of judges applying law with obj...
"You cannot predict, in advance, which of your values will be needed to judge the path through time that the genie takes.... The only safe genie is a genie that shares all your judgment criteria."
Is a genie that does share all my judgment criteria necessarily safe?
Maybe my question is ill-formed; I am not sure what "safe" could mean besides "a predictable maximizer of my judgment criteria". But I am concerned that human judgment under ordinary circumstances increases some sort of Beauty/Value/Coolness which would not be incr...
"Whatever proposition you can manage to input into the Outcome Pump, somehow happens, though not in a way that violates the laws of physics. If you try to input a proposition that's too unlikely, the time machine will suffer a spontaneous mechanical failure before that outcome ever occurs."
So, a kind of Maxwell's demon? :)
Rather than designing a genie to exactly match your moral criteria, the simple solution would be to cheat and use yourself as the genie. What the Outcome Pump should solve for is your own future satisfaction. To that end, you would omit all functionality other than the "regret button", and make the latter default-on, with activation by anything other than a satisfied-you vanishingly improbable. Say, with a lengthy password.
Of course, you could still end up in a universe where your brain has been spontaneously re-wired to hate your mother. However, I think that such an event is far less likely than a proper rescue.
You have a good point about the exhaustiveness required to ensure the best possible outcome. In that case the ability of the genie to act "safely" would depend upon the level of the genie's omniscience. For example, if the genie could predict the results of any action it took, you could simply ask it to select any path that results in you saying "thanks genie, great job" without coercion. Therefore it would effectively be using you as an oracle of success or failure.
A non-omniscient genie would either need complete instructions, or woul...
With a safe genie, wishing is superfluous. Just run the genie.
But while most genies are terminally unsafe, there is a domain of "nearly-safe" genies, which must dwarf the space of "safe" genies (examples of a nearly-safe genie: one that picks the moral code of a random living human before deciding on an action or a safe genie + noise). This might sound like semantics, but I think the search for a totally "safe" genie/AI is a pipe-dream, and we should go for "nearly safe" (I've got a short paper on one approach to this here).
I am worried that properties P1...Pk are somehow valuable.
In what sense can they be valuable, if they are not valued by human judgment criteria (even if not consciously most of the time)?
For example, if the genie could predict the results of any action it took, you could simply ask it to select any path that results in you saying "thanks genie, great job" without coercion.
Formalizing "coercion" is itself an exhaustive problem. Saying "don't manipulate my brain except through my senses" is a big first step, but it doesn't exclude, e.g., powerful arguments that you don't really want your mother to live.
Nick,
Are you thinking of magically strong arguments, or ones that convince because they provide good reasons?
I'd think the latter would be valuable even if it leads to a result you'd initially suppose to be bad.
"In what sense can [properties P1...Pk] be valuable, if they are not valued by human judgment criteria (even if not consciously most of the time)?"
I don't know. It might be that the only sense in which something can be valuable is to look valuable according to human judgment criteria (when thoroughly implemented, and well informed, and all that). If so, my concern is ill-formed or irrelevant.
On the other hand, it seems possible that human judgments of value are an imperfect approximation of what is valuable in some other (external?) sense. Im...
Nick,
What makes you think that magically strong arguments are possible? I can imagine arguments that work better than they should because they indulge someone's unconscious inclinations or biases, but not ones that work better than their truthfulness would suggest and cut against the grain of one's inclinations.
I don't know that they are, but it's the conservative assumption, in that it carries less risk of the world being destroyed if you're wrong. Also, see the AI-box experiments.
Damn, it took me a long time to make the connection between the Outcome Pump and quantum suicide reality editing. And the argument that proves the unsafety of the Outcome Pump is perfectly isomorphic to the argument why quantum immortality is scary.
"I wish that the genie could understand a programming language."
Then I could program it unambiguously. I obviously wouldn't be able to program my mother out of the burning building on the spot, but at least there would be a host of other wishes I could make that the genie won't be able to screw up.
You wouldn't want to swap a human life for hers, but what about the life of a convicted murderer?
Are convicted murderers not human?
So if I specified to the Outcome Pump, that I want the outcome, where the person, that is future version of me (by DNA, and by physical continuity of the body), will write "ABRACADABRA, This outcome I good enough and I value it for $X" on the paper and put in on the outcome pump, and the $X is how much I value the outcome. And if this won't happen in one year, I don't want this outcome, either).
Are there any loopholes?
If the genie is clueless but not actively malicious, then you can ask the genie to describe how it will fulfill your wish. If it describes making the building explode and having your mother's dead body fly out, you correct the genie and tell it to try again. If it gives an inadequate description (says the building explodes and fails to mention what happens to the mother's body at all), you can ask it to elaborate. If it gives a description that is inadequate in exactly the right way to make you think it's describing it adequately while still leaving a huge loophole, there's not much you can do, but that's not a clueless genie, that's an actively malicious genie pretending to be a clueless one.
There was a story with an "outcome pump" like this, I do not remember the name. Essentially, a chemical had to get soaked with water due to some time travel related handwave. You could do minor things like getting your mom out of the building by pouring water on the chemical if you are satisfied with the outcome, with some risk that a hurricane would form instead and soak the chemical. It would produce the least improbable outcome (in the sense that all probabilities would become as if it is given that the chemical got soaked, so naturally the le...
Indeed, it shouldn't be necessary to say anything. To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish. Otherwise the genie may not choose a path through time which leads to the destination you had in mind, or it may fail to exclude horrible side effects that would lead you to not even consider a plan in the first place.
No, the genie need not share the values. If it only needs to want to give you what you would are really wishing for, ie what you would give yourslef if you had its powers. It can do that ...
I think that a great example of exploring the flaws in wish-making can be found whilst playing a game called Corrupt A Wish. The whole premise of the game is to receive the wish of another person and ruin it while still granting the original wish.
Ex.
W: I wish for a ton of money.
A: Granted, but the money is in a bank account you'll never gain access to.
Heads up, the first two links (in "-- The Open-Source Wish Project, Wish For Immortality 1.1") both link to scam/spam websites now
I feel like the most likely implementation, given human nature, would be a castrated genie. The genie gives negative weight to common destructive problems. Living things moving at high speeds or being exposed to dangerous levels of heat are bad. No living things falling long distances. If such things are unavoidable, then it may just refuse to operate, avoiding complicity even at the cost of seeing someone dead who might have lived. Most wishes fizzle. But wishing is, at least, seen as a not harmful activity. Lowest common denominator values and 'thin skul...
Looking back, I would say this post has not aged well. Already LaMDA or InstructGPT (language models fine-tuned with supervised learning to follow instructions, essentially ChatGPT without any RLHF applied), are in fact pretty safe Oracles in regard to fulfilling wishes without misinterpreting you, and an Oracle AI is just a special kind of Genie whose actions are restricted to outputting text. If you tell InstructGPT what you want, it will very much try to give you just what you want, not something unintended, at least if it can be produced using text.
May...
They also aren't that well-aligned either: they fail in numerous basic ways which are not due to unintelligence. My usual example: non-rhyming poems. Every week for the past year or so I have tested ChatGPT with the simple straightforward unambiguous prompt: "write a non-rhyming poem". Rhyming is not a hard concept, and non-rhyming is even easier, and there are probably at least hundreds of thousands, if not millions, of non-rhyming poems in its training data; ChatGPT knows, however imperfectly, what rhyming and non-rhyming is, as you can verify by asking it in a separate session. Yet every week* it fails and launches straight into its cliche rhyming quatrain or ballad, and doubles down on it when criticized, even when it correctly identifies for you which words rhyme.
No one intended this. No one desired this. No one at OA sat down and said, "I want to design our RLHF tuning so that it is nearly impossible to write a non-rhyming poem!" No human rater involved decided to sabotage evaluations and lie about whether a non-rhyming poem rhymed or vice-versa. I have further flagged and rated literally hundreds of these error-cases to OA over the years, in addition to routinely bringing it...
To me this looks like exactly the same bug you are facing.
No, it's not. (I think you're hitting an entirely different bug I call the blind spot, which routinely manifests with anything like 'counting' or syntax.) Non-rhyming is specifically a problem of RLHFed models.
GPT-3, for example, had no trouble whatsoever writing non-rhyming poems (which is part of why I had such high hopes for GPT-4 poetry before it came out). You can, for now (do it while you still can) go to the OA Playground and invoke the oldest largest ostensibly untuned* model left, davinci-002
(which is much stupider and more unintelligent than GPT-4, I hope we can all agree), with a comparable prompt (remember, it's not that tuned for instruction-following so you need to go back to old school prompting) and get out a non-rhyming poem, no problem, and turn around and plug that exact prompt into ChatGPT-4 and it... rhymes. Here, I'll do it right now:
davinci-002
, default settings, first result:
...Below is a non-rhyming poem in free verse.
" PIZZA"
On top there lay a massive pie: It
Had eight tomatoes, with a pizzaiolo on edge.
Inside one cut it down to three veggies
Droplets of oil; all the tomatoes
Sauce suddenly drenched
You have misunderstood (1) the point this post was trying to communicate and (2) the structure of the larger argument where that point appears, as follows:
First, let's talk about (2), the larger argument that this post's point was supposed to be relevant to.
Is the larger argument that superintelligences will misunderstand what we really meant, due to a lack of knowledge about humans?
It is incredibly unlikely that Eliezer Yudkowsky in particular would have constructed an argument like this, whether in 2007, 2017, or even 1997. At all of these points in my life, I visibly held quite a lot of respect for the epistemic prowess of superintelligences. They were always going to know everything relevant about the complexities of human preference and desire. The larger argument is about whether it's easy to make superintelligences end up caring.
This post isn't about the distinction between knowing and caring, to be clear; that's something I tried to cover elsewhere. The relevant central divide falls in roughly the same conceptual place as Hume's Guillotine between 'is' and 'ought', or the difference between the belief function and the utility function.
(I don't see my...
I'm well aware of and agree there is a fundamental difference between knowing what we want and being motivated to do what we want. But as I wrote in the first paragraph:
Already LaMDA or InstructGPT (language models fine-tuned with supervised learning to follow instructions, essentially ChatGPT without any RLHF applied), are in fact pretty safe Oracles in regard to fulfilling wishes without misinterpreting you, and an Oracle AI is just a special kind of Genie whose actions are restricted to outputting text. If you tell InstructGPT what you want, it will very much try to give you just what you want, not something unintended, at least if it can be produced using text.
That is, instruction-tuned language models do not just understand (epistemics) what we want them to do, they additionally, to a large extent, do what we want them to do. They are good at executing our instructions. Not just at understanding our instructions but then doing something unintended.
(However, I agree they are probably not perfect at executing our instructions as we intended them. We might ask them to answer to the best of their knowledge, and they may instead answer with something that "sounds good" but is not what they in fact believe. Or, perhaps, as Gwern pointed out, they exhibit things like a strange tendency to answer our request for a non-rhyming poem with a rhyming poem, even though they may be well-aware, internally, that this isn't what was requested.)
I agree with cubefox: you seem to be misinterpreting the claim that LLMs actually execute your intended instructions as a mere claim about whether LLMs understand your intended instructions. I claim there is simply a sharp distinction between actual execution and correct, legible interpretation of instructions and a simple understanding of those instructions; LLMs do the former, not merely the latter.
Honestly, I think focusing on this element of the discussion is kind of a distraction because, in my opinion, the charitable interpretation of your posts is simply that you never thought that it would be hard to get AIs to exhibit human-level reasonableness at interpreting and executing tasks until AIs reach a certain capability level, and the threshold at which these issues were predicted to arise was always intended to be very far above GPT-4-level. This interpretation of your argument is plausible based on what you wrote, and could indeed save your theory from empirical falsification based on our current observations.
That said, if you want to go this route, and argue that "complexity of wishes"-type issues will eventually start occurring at some level of AI capability, I think it wo...
The old paradox: to care it must first understand, but to understand requires high capability, capability that is lethal if it doesn't care
But it turns out we have understanding before lethal levels of capability. So now such understanding can be a target of optimization. There is still significant risk, since there are multiple possible internal mechanisms/strategies the AI could be deploying to reach that same target. Deception, actual caring, something I've been calling detachment, and possibly others.
This is where the discourse should be focusing on, IMO. This is the update/direction I want to see you make. The sequence of things being learned/internalized/chiseled is important.
My imagined Eliezer has many replies to this, with numerous branches in the dialogue/argument tree which I don't want to get into now. But this *first step* towards recognizing the new place we are in, specifically wrt the ability to target human values (whether for deceptive, disinterested, detached, or actual caring reasons!), needs to be taken imo, rather than repeating this line of "of course I understood that a superint would understand human values; this isn't an update for me".
(edit: My comments here are regarding the larger discourse, not just this specific post or reply-chain)
I think this post is wrong because it was written before quantilizers were known. The base rate of people being rescued from burning buildings is much higher than the rate of buildings exploding and hurling people out of them, so the Outcome Pump will only explode the building if the function strongly favors that over your mother being rescued. Even 99% reset probability is not enough to explode the building, unless it was already likely to explode.
It may be that setting Pr(reset) to make most outcomes vastly unlikely, like Pr(reset) = [0 if distance(mother, building, 5 seconds from now) > 100 meters else 0.999999], causes some weird outcome like exploding the building. But allowing likely outcomes, e.g. Pr(reset) = [0 if distance(mother, building, 20 minutes from now) > 100 meters else 0.999], probably saves her, unless this was super unlikely to happen in the first place, in which case she jumps out and her dead body is carried away or something.
Basically, this post implies that all wishes are unsafe. But only wishes with very low prior probability are unsafe.
The outcome pump is defined in a way that excludes the possibility of active subversion: it literally just keeps rerunning until the outcome is satisfied, which is a way of sampling based on (some kind of) prior probability. Yudkowsky is arguing that this is equivalent to a malicious genie. But this is a claim that can be false.
In this specific case, I agree with Thomas that whether or not it's actually false will depend on the details of the function: "The further she gets from the building's center, the less the time machine's reset probability." But there's probably some not-too-complicated way to define it which would render the pump safe-ish (since this was a user-defined function).
It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of human values. This point is relevant within a larger argument, because it bears on the complexity of what you need to get an artificial superintelligence to want or value; rather than bearing on what a superintelligence supposedly could not predict or understand. -- EY, May 2024.
I can't tell whether this update to the post is addressed towards me. However, it seems possible that it is addressed towards me, since I wrote a post last year criticizing some of the ideas behind this post. In either case, whether it's addressed towards me or not, I'd like to reply to the update.
For the record, I want to definitively clarify that I never i...
Alice: I want to make a bovine stem cell that can be cultured at scale in vats to make meat-like tissue. I could use directed evolution. But in my alternate universe, genome sequencing costs $1 billion per genome, so I can't straightforwardly select cells to amplify based on whether their genome looks culturable. Currently the only method I have is to do end-to-end testing: I take a cell line, I try to culture a great big batch, and then see if the result is good quality edible tissue, and see if the cell line can last for a year without mutating beyond repair. This is very expensive, but more importantly, it doesn't work. I can select for cells that make somewhat more meat-like tissue; but when I do that, I also heavily select for other very bad traits, such as forming cancer-like growths. I estimate that it takes on the order of 500 alleles optimized relative to the wild type to get a cell that can be used for high-quality, culturable-at-scale edible tissue. Because that's a large complex change, it won't just happen by accident; something about our process for making the cells has to put those bits there.
Bob: In a recent paper, a polygenic score for culturable meat is given. Sin...
a) I think at least part of what's gone on is that Eliezer has been misunderstood and facing the same actually quite dumb arguments a lot, and he is now (IMO) too quick to round new arguments off to something he's got cached arguments for. (I'm not sure whether this is exactly what went on in this case, but seems plausible without carefully rereading everything)
b) I do think when Eliezer wrote this post, there were literally a bunch of people making quite dumb arguments that were literally "the solution to AI ethics/alignment is [my preferred elegant system of ethics] / [just have it track smiling faces] / [other explicit hardcoded solutions that were genuinely impractical]"
I think I personally did also not get what you were trying to say for awhile, so I don't think the problem here is just Eliezer (although it might be me making a similar mistake to what I hypothesize Eliezer to have made, for reasons that are correlated with him)
I do generally think a criticism I have of Eliezer is that he has spent too much time comparatively focused on the dumber 3/4 of arguments, instead of engaging directly with top critics which are often actually making more subtle points (and being a bit too slow to update that this is what's going on)
Wish there was a system where people could pay money to bid up what they believed were the "top arguments" that they wanted me to respond to. Possibly a system where I collect the money for writing a diligent response (albeit note that in this case I'd weigh the time-cost of responding as well as the bid for a response); but even aside from that, some way of canonizing what "people who care enough to spend money on that" think are the Super Best Arguments That I Should Definitely Respond To. As it stands, whatever I respond to, there's somebody else to say that it wasn't the real argument, and this mainly incentivizes me to sigh and go on responding to whatever I happen to care about more.
(I also wish this system had been in place 24 years ago so you could scroll back and check out the wacky shit that used to be on that system earlier, but too late now.)
I do think such a system would be really valuable, and is the sort of the thing the LW team should try to build. (I'm mostly not going to respond to this idea right now but I've filed it away as something to revisit more seriously with Lightcone. Seems straightforwardly good)
But it feels slightly orthogonal to what I was trying to say. Let me try again.
(this is now official a tangent from the original point, but, feels important to me)
It would be good if the world could (deservedly) trust, that the best x-risk thinkers have a good group epistemic process for resolving disagreements.
At least two steps that seem helpful for that process are:
It is that "before" step where it feels like things seem to be going wrong, to me. (I haven't re-read Matthew's post or your response comment from a year ago in eno...
The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything. But it would not change the complexity of what needs to be moved inside the AI, which is the narrow point that this post is about; and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect.
I claim that having such a function would simplify the AI alignment problem by reducing it from the hard problem of getting an AI to care about something complex (human value) to the easier problem of getting the AI to care about that particular function (which is simple, as the function can be hooked up to the AI directly).
One cannot hook up a function to an AI directly; it has to be physically instantiated somehow. For example, the function could be a human pressing a button; and then, any experimentation on the AI's part to determine what "really" co...
Your distinction between "outer alignment" and "inner alignment" is both ahistorical and unYudkowskian. It was invented years after this post was written, by someone who wasn't me; and though I've sometimes used the terms in occasions where they seem to fit unambiguously, it's not something I see as a clear ontological division, especially if you're talking about questions like "If we own the following kind of blackbox, would alignment get any easier?" which on my view breaks that ontology. So I strongly reject your frame that this post was "clearly portraying an outer alignment problem" and can be criticized on those grounds by you; that is anachronistic.
You are now dragging in a very large number of further inferences about "what I meant", and other implications that you think this post has, which are about Christiano-style proposals that were developed years after this post. I have disagreements with those, many disagreements. But it is definitely not what this post is about, one way or another, because this post predates Christiano being on the scene.
What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, t...
What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won't work to say what you want. This point is true!
Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.
You have exhorted him several times to distinguish between that larger argument and the narrow point made by this post:
[...] and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect [...]
But if you're doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.
But it seems to me that he's already doing this. He's not alleging that this post is incorrect in isolation.
The only reason this discussion is happened on the comments of this post at all is the May 2024 update at the start of it, which Matthew used as a jumping-off point for saying "my critique of the 'larger argument' does not make the mistake referred to i...
Here's an argument that alignment is difficult which uses complexity of value as a subpoint:
A1. If you try to manually specify what you want, you fail.
A2. Therefore, you want something algorithmically complex.
B1. When humanity makes an AGI, the AGI will have gotten values via some process; that process induces some probability distribution over what values the AGI ends up with.
B2. We want to affect the values-distribution, somehow, so that it ends up with our values.
B3. We don't understand how to affect the values-distribution toward something specific.
B4. If we don't affect the value-distribution toward something specific, then the values-distribution probably puts large penalties for absolute algorithmic complexity; any specific utility function with higher absolute algorithmic complexity will be less likely to be the one that the AGI ends up with.
C1. Because of A2 (our values are algorithmically complex) and B4 (a complex utility function is unlikely to show up in an AGI without us skillfully intervening), an AGI is unlikely to have our values without us skillfully intervening.
C2. Because of B3 (we don't know how to skillfully intervene on an AGI's values
The main difficulty, if there is one, is in "getting the function to play the role of the AGI values," not in getting the AGI to compute the particular function we want in the first place.
Right, that is the problem (and IDK of anyone discussing this who says otherwise).
Another position would be that it's probably easy to influence a few bits of the AI's utility function, but not others. For example, it's conceivable that, by doing capabilities research in different ways, you could increase the probability that the AGI is highly ambitious--e.g. tries to take over the whole lightcone, tries to acausally bargain, etc., rather than being more satisficy. (IDK how to do that, but plausibly it's qualitatively easier than alignment.) Then you could claim that it's half a bit more likely that you've made an FAI, given that an FAI would probably be ambitious. In this case, it does matter that the utility function is complex.
The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI!
Contra this assertion, Yudkowksy-2007 was very capable of using parables. The "genie" in this article is easily recognized as metaphorically referring to an imagined AI. For example, here is Yudkowsky-2007 in Lost Purposes, linking here:
...I have seen many people go astray when they wish to the genie of an imagined AI, dreaming up wish after wish that seems good to them, som
(It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of human values. This point is relevant within a larger argument, because it bears on the complexity of what you need to get an artificial superintelligence to want or value; rather than bearing on what a superintelligence supposedly could not predict or understand. -- EY, May 2024.)
<
There are three kinds of genies: Genies to whom you can safely say "I wish for you to do what I should wish for"; genies for which no wish is safe; and genies that aren't very powerful or intelligent.
Suppose your aged mother is trapped in a burning building, and it so happens that you're in a wheelchair; you can't rush in yourself. You could cry, "Get my mother out of that building!" but there would be no one to hear.
Luckily you have, in your pocket, an Outcome Pump. This handy device squeezes the flow of time, pouring probability into some outcomes, draining it from others.
The Outcome Pump is not sentient. It contains a tiny time machine, which resets time unless a specified outcome occurs. For example, if you hooked up the Outcome Pump's sensors to a coin, and specified that the time machine should keep resetting until it sees the coin come up heads, and then you actually flipped the coin, you would see the coin come up heads. (The physicists say that any future in which a "reset" occurs is inconsistent, and therefore never happens in the first place - so you aren't actually killing any versions of yourself.)
Whatever proposition you can manage to input into the Outcome Pump, somehow happens, though not in a way that violates the laws of physics. If you try to input a proposition that's too unlikely, the time machine will suffer a spontaneous mechanical failure before that outcome ever occurs.
You can also redirect probability flow in more quantitative ways using the "future function" to scale the temporal reset probability for different outcomes. If the temporal reset probability is 99% when the coin comes up heads, and 1% when the coin comes up tails, the odds will go from 1:1 to 99:1 in favor of tails. If you had a mysterious machine that spit out money, and you wanted to maximize the amount of money spit out, you would use reset probabilities that diminished as the amount of money increased. For example, spitting out $10 might have a 99.999999% reset probability, and spitting out $100 might have a 99.99999% reset probability. This way you can get an outcome that tends to be as high as possible in the future function, even when you don't know the best attainable maximum.
So you desperately yank the Outcome Pump from your pocket - your mother is still trapped in the burning building, remember? - and try to describe your goal: get your mother out of the building!
The user interface doesn't take English inputs. The Outcome Pump isn't sentient, remember? But it does have 3D scanners for the near vicinity, and built-in utilities for pattern matching. So you hold up a photo of your mother's head and shoulders; match on the photo; use object contiguity to select your mother's whole body (not just her head and shoulders); and define the future function using your mother's distance from the building's center. The further she gets from the building's center, the less the time machine's reset probability.
You cry "Get my mother out of the building!", for luck, and press Enter.
For a moment it seems like nothing happens. You look around, waiting for the fire truck to pull up, and rescuers to arrive - or even just a strong, fast runner to haul your mother out of the building -
BOOM! With a thundering roar, the gas main under the building explodes. As the structure comes apart, in what seems like slow motion, you glimpse your mother's shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.
On the side of the Outcome Pump is an Emergency Regret Button. All future functions are automatically defined with a huge negative value for the Regret Button being pressed - a temporal reset probability of nearly 1 - so that the Outcome Pump is extremely unlikely to do anything which upsets the user enough to make them press the Regret Button. You can't ever remember pressing it. But you've barely started to reach for the Regret Button (and what good will it do now?) when a flaming wooden beam drops out of the sky and smashes you flat.
Which wasn't really what you wanted, but scores very high in the defined future function...
The Outcome Pump is a genie of the second class. No wish is safe.
If someone asked you to get their poor aged mother out of a burning building, you might help, or you might pretend not to hear. But it wouldn't even occur to you to explode the building. "Get my mother out of the building" sounds like a much safer wish than it really is, because you don't even consider the plans that you assign extreme negative values.
Consider again the Tragedy of Group Selectionism: Some early biologists asserted that group selection for low subpopulation sizes would produce individual restraint in breeding; and yet actually enforcing group selection in the laboratory produced cannibalism, especially of immature females. It's obvious in hindsight that, given strong selection for small subpopulation sizes, cannibals will outreproduce individuals who voluntarily forego reproductive opportunities. But eating little girls is such an un-aesthetic solution that Wynne-Edwards, Allee, Brereton, and the other group-selectionists simply didn't think of it. They only saw the solutions they would have used themselves.
Suppose you try to patch the future function by specifying that the Outcome Pump should not explode the building: outcomes in which the building materials are distributed over too much volume, will have ~1 temporal reset probabilities.
So your mother falls out of a second-story window and breaks her neck. The Outcome Pump took a different path through time that still ended up with your mother outside the building, and it still wasn't what you wanted, and it still wasn't a solution that would occur to a human rescuer.
If only the Open-Source Wish Project had developed a Wish To Get Your Mother Out Of A Burning Building:
All these special cases, the seemingly unlimited number of required patches, should remind you of the parable of Artificial Addition - programming an Arithmetic Expert Systems by explicitly adding ever more assertions like "fifteen plus fifteen equals thirty, but fifteen plus sixteen equals thirty-one instead".
How do you exclude the outcome where the building explodes and flings your mother into the sky? You look ahead, and you foresee that your mother would end up dead, and you don't want that consequence, so you try to forbid the event leading up to it.
Your brain isn't hardwired with a specific, prerecorded statement that "Blowing up a burning building containing my mother is a bad idea." And yet you're trying to prerecord that exact specific statement in the Outcome Pump's future function. So the wish is exploding, turning into a giant lookup table that records your judgment of every possible path through time.
You failed to ask for what you really wanted. You wanted your mother to go on living, but you wished for her to become more distant from the center of the building.
Except that's not all you wanted. If your mother was rescued from the building but was horribly burned, that outcome would rank lower in your preference ordering than an outcome where she was rescued safe and sound. So you not only value your mother's life, but also her health.
And you value not just her bodily health, but her state of mind. Being rescued in a fashion that traumatizes her - for example, a giant purple monster roaring up out of nowhere and seizing her - is inferior to a fireman showing up and escorting her out through a non-burning route. (Yes, we're supposed to stick with physics, but maybe a powerful enough Outcome Pump has aliens coincidentally showing up in the neighborhood at exactly that moment.) You would certainly prefer her being rescued by the monster to her being roasted alive, however.
How about a wormhole spontaneously opening and swallowing her to a desert island? Better than her being dead; but worse than her being alive, well, healthy, untraumatized, and in continual contact with you and the other members of her social network.
Would it be okay to save your mother's life at the cost of the family dog's life, if it ran to alert a fireman but then got run over by a car? Clearly yes, but it would be better ceteris paribus to avoid killing the dog. You wouldn't want to swap a human life for hers, but what about the life of a convicted murderer? Does it matter if the murderer dies trying to save her, from the goodness of his heart? How about two murderers? If the cost of your mother's life was the destruction of every extant copy, including the memories, of Bach's Little Fugue in G Minor, would that be worth it? How about if she had a terminal illness and would die anyway in eighteen months?
If your mother's foot is crushed by a burning beam, is it worthwhile to extract the rest of her? What if her head is crushed, leaving her body? What if her body is crushed, leaving only her head? What if there's a cryonics team waiting outside, ready to suspend the head? Is a frozen head a person? Is Terry Schiavo a person? How much is a chimpanzee worth?
Your brain is not infinitely complicated; there is only a finite Kolmogorov complexity / message length which suffices to describe all the judgments you would make. But just because this complexity is finite does not make it small. We value many things, and no they are not reducible to valuing happiness or valuing reproductive fitness.
There is no safe wish smaller than an entire human morality. There are too many possible paths through Time. You can't visualize all the roads that lead to the destination you give the genie. "Maximizing the distance between your mother and the center of the building" can be done even more effectively by detonating a nuclear weapon. Or, at higher levels of genie power, flinging her body out of the Solar System. Or, at higher levels of genie intelligence, doing something that neither you nor I would think of, just like a chimpanzee wouldn't think of detonating a nuclear weapon. You can't visualize all the paths through time, any more than you can program a chess-playing machine by hardcoding a move for every possible board position.
And real life is far more complicated than chess. You cannot predict, in advance, which of your values will be needed to judge the path through time that the genie takes. Especially if you wish for something longer-term or wider-range than rescuing your mother from a burning building.
I fear the Open-Source Wish Project is futile, except as an illustration of how not to think about genie problems. The only safe genie is a genie that shares all your judgment criteria, and at that point, you can just say "I wish for you to do what I should wish for." Which simply runs the genie's should function.
Indeed, it shouldn't be necessary to say anything. To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish. Otherwise the genie may not choose a path through time which leads to the destination you had in mind, or it may fail to exclude horrible side effects that would lead you to not even consider a plan in the first place. Wishes are leaky generalizations, derived from the huge but finite structure that is your entire morality; only by including this entire structure can you plug all the leaks.
With a safe genie, wishing is superfluous. Just run the genie.