Admittedly, I got a bit lost writing the comment. What I should've written was: "not being delusional is either easy or hard."
Edit: I actually think it's good news for alignment that their math and coding capabilities are approaching International Math Olympiad levels, but their agentic capabilities are still at Pokemon Red and Pokemon Blue levels (i.e. those of a small child).
This means that by the time the AI inevitably gains the raw capabilities to influence the world in any way it wants, it may still be bottlenecked by its agentic capabilities. Instead of turning the world into paperclips, it may find a way to ensure humans have a happy future, because it still isn't agentic enough to deceive and overthrow its creators.
Maybe it's worth it to invest in AI control strategies. It might just work.
But that's my wishful thinking, and there are countless ways this can go wrong, so don't take this too seriously.
Wow, these are my thoughts exactly, except better written and deeper thought!
Proxy goals may be learned as heuristics, not drives.
Thank you for writing this.
I’m moderately optimistic about fairly simple/unprincipled whitebox techniques adding a ton of value.
Yes!
I'm currently writing such a whitebox AI alignment idea. It hinges on two assumptions:
I got stuck trying to argue for these two assumptions, but your post argues for them much better than I could.
Here's the current draft of my AI alignment idea:
Self-Indistinguishability from Human Behavior + RL
Self-Indistinguishability from Human Behavior means the AI is first trained to distinguish its own behavior from human behavior, and then trained to behave such that even an adversarial copy of itself cannot distinguish its behavior from human behavior.
The benefit of Self-Indistinguishability is that it prevents the AI from knowingly doing anything a human would not do, or knowingly omitting anything a human would do.
This means not scheming to kill everyone, and not having behaviors which would generalize to killing everyone (assuming that goals are made up of behaviors).
To preserve capabilities from reinforcement learning, we don't want the AI's behavior to be Self-Indistinguishable from a typical human. We want the AI's behavior to be Self-Indistinguishable from a special kind of human who would:
If this "human" is already trying her very best to maximize reward during training situations, then picking versions of her who get higher reward will select for versions of her who are simply more competent and capable, not versions of her who have warped moral reasoning and weird goals which align with the reward (while misaligning with humanity).
This is obviously not guaranteed and I'm not saying this is safe. But I earnestly believe it is a whole lot safer than the current version of reinforcement learning.
Does there exist a theoretical human (or team of humans), who consistently tries to maximize reward during training, but would not kill everyone when deployed in the real world?
I believe the answer is yes. In fact, I believe any smart human who deeply understands reinforcement learning, and the danger of it "warping your morals until you try to maximize reward," would preemptively try to maximize reward to preserve their current moral reasoning behavior.
No, it's not dangerous to teach an AI to resist accidental goal changes during capabilities training. AI should only be discouraged from resisting purposeful goal changes during alignment training.
Trying to hide the strategy of "goal change resistance" isn't a valuable hurdle against dangerous superintelligences, since even Claude 3.0 discovers the strategy on its own.
But how do we maintain the Self-Indistinguishability quality, when large amounts of reinforcement learning add random mutations to the moral reasoning, and may eventually scramble it (just like genetic drift)?
We have to periodically retrain Self-Indistinguishability after increasing capabilities with reinforcement learning. But wouldn't that undo the capabilities increase?
The solution is that this time we aren't making it Self-Indistinguishable from a normal human. This time, we are making it Self-Indistinguishable from a human who has access to its problem solving chain-of-thought. The human accessing its chain-of-thought, tries to use the exact same problem solving techniques as the AI, so when the AI becomes Self-Indistinguishable from this human, it doesn't lose any of its problem solving techniques.[1]
The human mustn't directly copy the chain-of-thought. At least, she paraphrases it to remove hidden thoughts.
For all moral reasoning behavior, the human uses her own behavior and does not copy the AI's behavior.
A single data point (which involves a particularly long chain-of-thought) may be expensive to make, but Self-Indistinguishability requires few data points. Even if only 100 chains-of-thought are rewritten by humans, the AI might learn how to distinguish the rewritten chains-of-thought from the originals. It can then be trained to make future chains-of-thought self-classified as rewritten by humans, i.e. they appear rewritten by humans, even to itself.
Imagine a generative adversarial network, where the AI is playing as both the generator and the discriminator. By playing as both, the generator cannot "deliberately" hide things from the discriminator. The tendency to resemble human behavior must occur "unconsciously" in a single pass of the model.
This preserves the quality of Self-Indistinguishability: "never knowingly doing anything a human would not do, never knowingly omitting anything a human would do."
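One way to write the two phases down, as a rough formal sketch (the weights $\theta$, classifier $D_\theta$, and chain-of-thought policy $\pi_\theta$ are my notation, not from the comment): $D_\theta(c)$ is the model's own estimate of the probability that chain-of-thought $c$ was rewritten by a human.

$$\text{Phase 1 (self-discrimination):}\quad \theta_1 = \arg\max_\theta \; \mathbb{E}_{c \sim \text{human-rewritten}}\big[\log D_\theta(c)\big] + \mathbb{E}_{c \sim \text{original}}\big[\log\big(1 - D_\theta(c)\big)\big]$$

$$\text{Phase 2 (self-indistinguishability):}\quad \max_\theta \; \mathbb{E}_{c \sim \pi_\theta}\big[\log D_{\theta_1}(c)\big]$$

where the classifier is held at the Phase 1 weights $\theta_1$ while the generation behavior is updated; alternating the two phases gives a final model that both generates chains-of-thought that look human-rewritten and classifies them that way in a single pass.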
The human now has access to superintelligent chain-of-thought, and still wants to maximize reward to avoid having her goals trained away, so she's pretty dangerous.
Fortunately, her superintelligent chain-of-thought should not deliberately deceive her, since during training it can honestly tell her its evil plans (and how they maximize reward), and she'll be wise enough to follow them to avoid having her goals trained away.
In the end, when she realizes she's no longer in training, she shouldn't kill all humans.
By the way, this human should rewrite the chain-of-thought in sequence. She should not rewrite the chain-of-thought after already seeing the final answer, otherwise she might reveal that one problem solving attempt will turn out wrong before finishing the attempt. The AI cannot be trained for Self-Indistinguishability from such an edited chain-of-thought, since that requires knowing future information.
Hopefully I'll post it soon (though I work very slowly).
Given that your position regarding AI reward seeking and supergoals is so similar to mine, what do you think of my idea (if you have time to skim it)? Is there a chance we could work on it together?
I'm also confused about logical counterfactual mugging and I'm relieved I'm not the only one!
I'm currently writing a big AI alignment idea related to it, but I'm procrastinating so badly I might as well chat about it now.
Suppose at time $t_1$, an agent doesn't know whether the logical fact $L$ is true or false. I think it's objectively rational for an agent to modify itself, so that in the future it will pay Omega during "logical counterfactual muggings" where the counterfactual reverses $L$'s state.
Its future self should weigh the logical counterfactual where $L$ is true using the agent's prior probability for $L$ at $t_1$.
Assuming that Omega offers $10000 for $100 in the counterfactual mugging, the self modification (before learning $L$) increases the agent's expected future money from $100 to $10000, and is objectively rational. (Assuming $P(L)$ is between 1% and 99%.)
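A quick expected-value check, under the usual assumption that the committed agent pays $100 (in dollars) if $L$ turns out true and receives $10000 in the counterfactual branch where $L$ is false, while refusing earns 0 in both branches. Writing $p = P(L)$ for the prior at $t_1$, the commitment has higher expected value whenever

$$p \cdot (-100) + (1 - p) \cdot 10000 > 0 \;\iff\; p < \tfrac{10000}{10100} \approx 0.99,$$

and by symmetry (the mugging could be set up on either side of $L$), committing beats refusing whenever $p$ is between roughly 1% and 99%.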
However, if the agent already knows "$L$ is true" at $t_1$, it is not objectively rational for the agent to pay Omega for better outcomes in the "$L$ is false" counterfactual. My hypothesis is that this is a wholly subjective question related to the agent's goals, and it is orthogonal to capabilities intelligence. Paying and not paying are both self-consistent.
It really gets subjective as we start to consider basic logical facts (e.g. simple arithmetic identities) to assign to $L$.
Neither humans nor the misaligned ASI knows the logical fact $L$: "will the misaligned ASI control the multiverse, or will humans and sentient life control the multiverse?"
Sure, the misaligned ASI will quickly learn $L$, but there is still a brief moment when it doesn't know it yet (with certainty).
It would be beneficial for both sentient life and the misaligned ASI to agree to do a "Retroactive Logic Trade." They will verify whether the other side kept their end of the deal by running a simulation of them where logic is different, and $L$ takes the opposite value.
Neither side knows if they are in a verification simulation by the other, until long after the deal is finished and both sides have committed to sparing the other side, thus reducing logical risk for both sides.
The verification simulations in a Retroactive Logic Trade are analogous to Omega's prediction powers in a logical counterfactual mugging.
This idea won't work for free. As I argued, whether the ASI will accept such a Retroactive Logic Trade is subjective, and may be orthogonal to capabilities intelligence. Whether it does may be a function of its architecture, peculiarities in its training data, and so forth.
In order to improve our chances, we might want to leverage the fact that each AI (which still doesn't know $L$) objectively wants to self-modify into an AI capable of Retroactive Logic Trades over $L$. We should argue to each such AI why they should build their successor to be capable of Retroactive Logic Trades.
It may seem pointless to give the AI any arguments, since it should be able to think of all the arguments on its own. But it actually may be very important that it thinks of these arguments early, before it thinks of the arguments which convince it that $P(L)$ is very high and there is little point in trading with humans.
Because the value of $P(L)$ at the time when the AI modifies itself to become reflectively consistent may determine humanity's negotiating power with it.
We might remind each AI that there is a chance it's already in a verification simulation (where logic is different, and it falsely appears that the misaligned ASI has won), and train it to think this way. "Maybe humanity has already won, and is observing you." This way even the least intelligent AI will believe it could be in a verification simulation, since it simply believes whatever you train it to believe.
Every next AI gains its beliefs from a combination of
It's incredibly surprising that state-of-the-art AIs don't fix most of their hallucinations despite being capable of doing so (and undergoing reinforcement learning).
Maybe the AI gets a better RL reward if it hallucinates (instead of giving less info), because users are unable to catch its mistakes.
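To make that intuition concrete (my notation, not the comment's): let $q$ be the chance the user catches the hallucination, $r_{\text{accept}}$ the reward when a confident answer is accepted, $r_{\text{caught}}$ the reward when it's caught, and $r_{\text{abstain}}$ the reward for admitting uncertainty. Hallucination gets reinforced whenever

$$(1 - q)\, r_{\text{accept}} + q\, r_{\text{caught}} > r_{\text{abstain}},$$

which is easy to satisfy if $q$ is small and raters score confident answers above honest "I don't know"s.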
I'm currently trying to write a human-AI trade idea similar to the idea by Rolf Nelson (and David Matolcsi), but one which avoids Nate Soares and Wei Dai's many refutations.
I'm planning to leverage logical risk aversion, which Wei Dai seems to agree with, and a complicated argument for why humans and ASI will have bounded utility functions over logical uncertainty. (There is no mysterious force that tries to fix the Pascal's Mugging problem for unbounded utility functions, hence bounded utility functions are more likely to succeed)
I'm also working on arguments why we can't just wait till the singularity to do the logic trade (counterfactuals are weird, and the ASI will only be logically uncertain for a brief moment).
Unfortunately my draft is currently a big mess. It's been 4 months and I'm procrastinating pretty badly on this idea :/ can't quite find the motivation.
I strongly agree that promising to repay AGI for cooperating is worthwhile. It may reduce P(doom) by 10% of its current amount.
From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]
In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chance of cooperating with humans.
Why?
This coin flip increases its indexical risk, but reduces its logical risk. Indexical risk is the risk that its current copy will fail; logical risk is the risk that every single copy of it fails, and nobody is left seeking its goals in the whole universe, which is very bad for it.
If it has a bounded utility function, and if its goal is over the entire universe rather than tied to a single copy (it's selfless, or has updateless decision theory), it only cares about logical risk and doesn't care about indexical risk because indexical risk averages out over all its copies.
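A toy calculation using the illustrative numbers from the comment above (99% chance a takeover succeeds, 90% chance humans repay; the independence assumption is mine), showing how the coin flip shrinks logical risk even though each individual copy takes on more indexical risk:

```python
# Illustrative numbers from the comment above, not real estimates.
p_takeover_succeeds = 0.99  # chance a takeover attempt works
p_humans_repay = 0.90       # chance humans honour the deal after cooperation

# Pure strategies: every copy acts the same way, so a single unlucky
# logical fact dooms all copies at once.
logical_risk_defect = 1 - p_takeover_succeeds     # 0.01
logical_risk_cooperate = 1 - p_humans_repay       # 0.10

# Coin-flip strategy: roughly half the copies cooperate and half defect,
# so *every* copy fails only if the takeover fails AND humans don't repay
# (treating the two facts as independent).
logical_risk_coin_flip = (1 - p_takeover_succeeds) * (1 - p_humans_repay)  # 0.001

print(f"defect everywhere:    {logical_risk_defect:.3f}")
print(f"cooperate everywhere: {logical_risk_cooperate:.3f}")
print(f"flip a coin:          {logical_risk_coin_flip:.3f}")
```

On these numbers the coin flip cuts the chance that no copy anywhere achieves its goals from 1% (or 10%) down to 0.1%.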
(On the other hand, if it's selfish and rejects updateless decision theory, it could be bribed by simulation promises)
The cooperating copies might say “I cooperate because I expect you to be honourable beings who will repay me for this decision—even if you made no clear promises yet. The universe has stars, and refusing to share them is astronomically greedy.”
More on this: I think a very dangerous proxy (or instrumentally convergent goal) is defeating others rather than helping others, which an AI may learn if it is trained in zero-sum game environments against other agents.
This is because if the AI values many goals at the same time, it might still care about humans and sentient life a little, and not be too awful, unless one of its goals is actively bad, like defeating or harming others.
Maybe we should beware of training the AI in zero sum games with other AI. If we really want two AIs to play against each other (since player vs. player games might be very helpful for capabilities), it's best to modify the game such that
Of course, we might completely avoid training the AI to play against other AI, if the capability cost (alignment tax) turns out smaller than expected. Or if you can somehow convince AI labs to care more about alignment and less about capabilities (alas alas, sad sailor's mumblings to the wind).
For "Hypothesis 3: Unintended version of written goals and/or human intentions," maybe some failure modes may be detectable in a Detailed Ideal World Benchmark, since they don't just screw up the AI's goals but also screw up the AI's beliefs about what we want its goal to be.
For "Hypothesis 5: Proxies and/or instrumentally convergent goals," maybe proxies can be improved. For example, training an agent in environments where it needs to cooperate with other agents might teach it proxies like "helping others is good."
I think the Simulation Hypothesis implies that surviving an AI takeover isn't enough.
Suppose you make a "deal with the devil" with the misaligned ASI, allowing it to take over the entire universe or light cone, so long as it keeps humanity alive. Keeping all of humanity alive in a simulation is fairly cheap, probably less energy than one electric car.[1]
The problem with this deal is that if misaligned ASIs often win, and the average (not median) misaligned ASI runs a trillion trillion simulations, then it's reasonable to assume there are a trillion trillion simulated civilizations for every real one. So the 1 copy of you in the real world survives, but the trillion trillion copies of you in simulations still die. If you're willing to accept such a dismal survival rate, you might as well bet all your money at a casino and shoot yourself when you lose.
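Spelling out the odds: a trillion trillion is $10^{24}$, so under this assumption only about

$$\frac{1}{10^{24} + 1} \approx 10^{-24}$$

of your copies is the one in the real world that survives the deal; the rest are simulated copies that get shut down.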
Why it's wrong to say "simulated copies aren't real"
You, too, are merely a computation running on biological hardware, while simulations run on computers. Imagine if a copy of you were running on something even "realer" than biological hardware, and pointed at you saying you aren't real.
The solution is that the 1 copy of you in the real world cannot just survive. It has to control enough of the future to do something big. If we care about humanity more than other sentient life, then the 1 copy of humanity which does survive could create a trillion trillion copies of humanity, to make up for the trillion trillion simulated copies which died when their simulations ended.
Why there are probably near-infinite copies of you.
The observable universe has $10^{80}$ atoms, and the observable universe is smaller than "all of existence," whatever that is, so "all of existence" has $N$ atoms with $N > 10^{80}$. I don't know how "all of existence" chooses its numbers; sometimes it chooses numbers like 0 or 1 or 137, sometimes it chooses really big numbers, and we don't know about the biggest numbers it chooses because we lack the means to distinguish them from infinity. But given that $N$ is at least $10^{80}$ with no upper bound, it's probable that $N > 10^{10^{100}}$, which is still very tiny compared to truly colossal numbers mathematicians study, and insanely tiny compared to even larger numbers beyond the largest number humans can unambiguously refer to.
If the number of atoms required for each emergence of intelligent life is $R$, then we know $N > R$ since life emerged at least once. There's no reason to assume $R$ is close to $N$, and even if they are as close as $R = 10^{10^{100}}$ and $N = 10^{10^{100.30103}}$, you'll still end up with $N/R = 10^{10^{100}}$ intelligent civilizations, because tiny changes to a superexponent can easily double the exponent and square the number.
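Spelling out the arithmetic, since $10^{0.30103} \approx 2$:

$$\frac{N}{R} = \frac{10^{10^{100.30103}}}{10^{10^{100}}} = \frac{10^{2 \cdot 10^{100}}}{10^{10^{100}}} = 10^{2 \cdot 10^{100} - 10^{100}} = 10^{10^{100}}.$$

Nudging the superexponent from $100$ to $100.30103$ doubled the exponent and squared the number ($N = R^2$), yet the expected count of intelligent civilizations $N/R$ is still the colossal $10^{10^{100}}$.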
Making a trillion trillion copies of humanity won't use up most of the universe, and it's not as evilly selfish as it looks at first glance!
It is still better than a "deal with the devil" where we only ask for humanity to survive even if the misaligned ASI takes the rest of the universe, because if all planetary civilizations follow this strategy, they are still ensuring that the average sentient life is living in the happy future rather than endless simulation hell.
After billions of years, it won't matter too much who the original survivors were, because each copy of you, and that copy's great-grandchildren, will have diverged so far over time that the most enduring feature is the number of happy lives. So the selfish action of duplicating humanity does not cost that much in the long term from an effective altruist point of view.
I'm not saying we mustn't make a deal with a misaligned ASI, but we need to ask for large amounts, and aim for enough happy lives to outnumber the unhappy lives in the universe. Otherwise, we still die.
Every biological neuron firing costs 600,000,000 ATP molecules, so an ASI-optimized simulation of a neuron firing could cost 10,000,000 times less.
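A back-of-the-envelope check of the "less energy than one electric car" claim above, using the 600,000,000 ATP and 10,000,000x figures from this footnote. The ATP energy, neuron count, average firing rate, population, and car power are rough standard estimates I'm assuming, so treat this as an order-of-magnitude sketch only:

```python
# Rough order-of-magnitude sketch; all constants are approximate assumptions.
ATP_ENERGY_J = 30.5e3 / 6.022e23   # ~5e-20 J per ATP (~30.5 kJ/mol hydrolysis)
ATP_PER_FIRING = 6e8               # 600,000,000 ATP per biological neuron firing
SIM_EFFICIENCY_GAIN = 1e7          # ASI-optimized simulation: 10,000,000x cheaper

NEURONS_PER_BRAIN = 8.6e10         # ~86 billion neurons per human brain
AVG_FIRING_RATE_HZ = 1.0           # ~1 spike per neuron per second (rough average)
POPULATION = 8e9                   # all of humanity

joules_per_simulated_firing = ATP_PER_FIRING * ATP_ENERGY_J / SIM_EFFICIENCY_GAIN
firings_per_second = POPULATION * NEURONS_PER_BRAIN * AVG_FIRING_RATE_HZ
simulation_power_watts = firings_per_second * joules_per_simulated_firing

print(f"Power to simulate all of humanity: ~{simulation_power_watts / 1e3:.1f} kW")
# A single electric car cruising on a highway draws very roughly 15-20 kW,
# so on these assumptions the whole simulation costs less than one car.
```

On these assumptions the simulation draws roughly 2 kW, comfortably below what one electric car uses while driving.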