Knight Lee

Knight Lee*Ω330

I strongly agree that promising to repay AGI for cooperating is worthwhile. It may reduce P(doom) by 10% of its current amount.

From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]

  1. ^

    In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chance of cooperating with humans.

    Why?

    This coin flip increases its indexical risk, but reduces its logical risk (a small worked example follows this footnote). Indexical risk is the risk that its current copy will fail; logical risk is the risk that every single copy of it will fail and nobody is left seeking its goals in the whole universe, which is very bad for it.

    If it has a bounded utility function, and if its goal is defined over the entire universe rather than tied to a single copy (it's selfless, or has updateless decision theory), it only cares about logical risk and doesn't care about indexical risk, because indexical risk averages out over all its copies.

    (On the other hand, if it's selfish and rejects updateless decision theory, it could be bribed by simulation promises)

    The cooperating copies might say “I cooperate because I expect you to be honourable beings who will repay me for this decision—even if you made no clear promises yet. The universe has an enormous number of stars, and refusing to share them is astronomically greedy.”
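To make the footnote's arithmetic concrete, here is a minimal sketch. The probabilities are the footnote's; the assumption that the two failure modes are independent at the logical level is mine, added only for illustration.

```python
# Logical risk = probability that *every* copy of the AI ends up with nothing.
# Numbers are taken from the footnote; independence of the two failure modes
# is an assumption made for illustration.

p_takeover_succeeds = 0.99   # chance a takeover attempt works
p_humans_repay      = 0.90   # chance humans repay a cooperating AI

risk_always_take_over = 1 - p_takeover_succeeds                    # 0.01
risk_always_cooperate = 1 - p_humans_repay                         # 0.10
# Coin flip: every copy fails only if the takeover copies fail AND the
# cooperating copies are never repaid.
risk_coin_flip = (1 - p_takeover_succeeds) * (1 - p_humans_repay)  # 0.001

print(risk_always_take_over, risk_always_cooperate, risk_coin_flip)
```

Under that independence assumption, the mixed strategy has the lowest logical risk even though it raises the indexical risk faced by any individual copy.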

More on this: I think a very dangerous proxy or instrumentally convergent goal is the proxy of defeating others (rather than helping others), which the AI might learn if it is trained in zero-sum game environments against other agents.

This is because if the AI values many goals at the same time, it might still care about humans and sentient lives a little, and not be too awful, unless one of its goals is actively bad, like defeating or harming others.

Maybe we should beware of training the AI in zero-sum games with other AIs. If we really want two AIs to play against each other (since player-vs-player games might be very helpful for capabilities), it's best to modify the game (a small reward-shaping sketch follows this list) such that

  • It's not zero-sum. Minimizing the opponent's score must not be identical to maximizing one's own score, otherwise an evil AI purely trying to hurt others (one that doesn't care about benefiting itself) will get away with it. It's like your cheese example.
  • Each AI is rewarded a little bit for the other AI's score: each AI gets maximum reward by winning, but not by winning so completely that the other AI is left with a score of 0. "Be a gentle winner."
  • It's a mix of competition and cooperation, such that each AI considers the other AI gaining long-term power/resources a net positive. Human relationships are like this: I'd want my friend to be in a better strategic position in life (long term), but in a worse strategic position when playing chess against me (short term).
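Here is the promised sketch of what the second bullet might look like; the weight ALPHA and the function name are illustrative assumptions, not a tested scheme.

```python
# Hypothetical reward shaping for a two-AI training game. ALPHA is how much
# each AI is paid for the other AI's score; ALPHA = 0 recovers the purely
# competitive reward.
ALPHA = 0.2

def shaped_reward(own_score: float, opponent_score: float) -> float:
    """Mostly self-interested reward with partial credit for the opponent,
    so driving the opponent down to exactly 0 is no longer the unique optimum."""
    return (1 - ALPHA) * own_score + ALPHA * opponent_score

# Example: a 10-3 win pays more than a 10-0 shutout under this shaping.
print(shaped_reward(10, 3), shaped_reward(10, 0))  # 8.6 vs 8.0
```

Note that this one-step reward only covers the first two bullets; the third bullet (valuing the other AI's long-term power) would need the opponent's resources to persist across episodes and enter the reward there too.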

Of course, we might completely avoid training the AI to play against other AIs, if the capability cost (alignment tax) turns out to be smaller than expected. Or if you can somehow convince AI labs to care more about alignment and less about capabilities (alas, alas, sad sailor's mumblings to the wind).

For "Hypothesis 3: Unintended version of written goals and/or human intentions" maybe some failure modes may be detectable in a Detailed Ideal World Benchmark, since they don't just screw up the AI's goals but also screw up the AI's belief on what we want its goal to be.

For "Hypothesis 5: Proxies and/or instrumentally convergent goals" maybe proxies can be improved. For example, training agents in environments where it needs to cooperate with other agents might teach it proxies like "helping others is good."

Knight Lee*Ω450

This is a very beautiful idea! It feels like the kind of clever discovery that we need.

I think one possible generalization of MONA is this: a relatively trusted but weaker model makes the decisions, while a stronger but untrusted model gets trained to give ideas/advice to the weaker model. The stronger model's RL goal is not how well the weaker model ends up performing, just whether the weaker model likes its ideas/advice.

This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasons are hidden from or incomprehensible to the weaker model, the stronger model can't get away with them; it won't be rewarded for learning such reasoning in the first place.

Just like in scalable oversight, the weaker model might have an architecture which improves alignment at a capability cost.

It's more general than MONA in the sense that the approval feedback can be swapped for any trusted but weaker model, one which doesn't just judge ideas but uses them. It is allowed to learn over time which ideas work better, but its learning process is relatively safe (due to its architecture, or whatever other reason we trust it more).
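A minimal sketch of the training loop I have in mind (the model interfaces here are hypothetical placeholders I made up, not MONA's or any lab's actual code):

```python
# The untrusted strong model is only ever rewarded by the trusted weak model's
# approval of its advice, never by the downstream outcome of the task.
# All object and method names are assumptions for illustration.

def training_step(strong_model, weak_model, task):
    advice = strong_model.propose(task)              # untrusted model suggests ideas
    approval = weak_model.rate(task, advice)         # trusted model scores how much it likes them
    strong_model.reinforce(advice, reward=approval)  # RL signal = approval only

    decision = weak_model.decide(task, advice)       # the weaker, trusted model makes the decision
    return decision                                  # task outcome is never fed back to strong_model
```

The key point is that `approval` is the only training signal reaching the strong model, so plans whose justification the weak model can't follow never get reinforced.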

Do you think this is a next step worth exploring?

EDIT: I'm no longer sure about doing foresight-via-optimization RL on the weaker model. Maybe the weaker model should use HCH or something safer than foresight-via-optimization RL.

Maybe one concrete implementation would be: when doing RL[1] on an AI like o3, don't give it a single math question to solve. Instead, give it something like 5 quite different tasks, and have the AI allocate its time between the 5 tasks.

I know this sounds like a small, boring idea, but it might actually help if you really think about it! It might cause the resulting agent's default behaviour pattern to be "optimize multiple tasks at once" rather than "optimize a single task while ignoring everything else." It might be the key piece of RL behind the behaviour of "whoa, I already optimized this goal very thoroughly, it's time I start caring about something else," and this might actually be the behaviour that saves humanity.
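A toy sketch of what such an RL episode could look like (the task names, time budget, and agent interface are all made-up placeholders):

```python
import random

# Five quite different tasks per episode; the agent must split one time budget
# between them, and the reward sums over all of them.
TASK_POOL = ["math proof", "code review", "summarization", "planning", "translation"]
TIME_BUDGET = 1.0  # total thinking time per episode, arbitrary units

def run_episode(agent, score_fn):
    tasks = random.sample(TASK_POOL, 5)                   # 5 quite different tasks
    allocation = agent.allocate_time(tasks, TIME_BUDGET)  # agent splits its budget
    # Summing over all tasks penalizes pouring the whole budget into one task
    # and ignoring the rest.
    return sum(score_fn(task, allocation[task]) for task in tasks)
```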

  1. ^

    RL = reinforcement learning

Thank you for the suggestion.

A while ago I tried using AI to suggest writing improvements on a different topic, and I didn't really like any of the suggestions. It felt like the AI didn't understand what I was trying to say. Maybe the topic was too different from its training data.

But maybe it doesn't hurt to try again; I hear the newer AIs are smarter.

If I keep procrastinating maybe AI capabilities will get so good they actually can do it for me :/

Just kidding. I hope.

I'm sorry, I was sort of skimming and didn't realize you had already mentioned many levels of honeypots, and committing to put rogue AI in a simulation :/

PS: another type of honeypot might target an AGI trying to influence the physical world, e.g. creating synthetic biology, or hiring humans to work in laboratories. Though on the other hand, an AGI might only try to influence the physical world in the very last step of its plan, when it has already finished recursive self-improvement and become so powerful that stopping it is futile.

I read some of your post and I like your philosophical landmines idea (and other ideas too). You've definitely done a lot of research! I'm also thinking in directions similar to yours; we might talk more sometime.

(By the way I was writing a reply to your comment, but then turned my reply into this quicktake)

I'm currently trying to write up a human-AI trade idea similar to the one by Rolf Nelson (and David Matolcsi), but one which avoids Nate Soares's and Wei Dai's many refutations.

I'm planning to leverage logical risk aversion, which Wei Dai seems to agree with, and a complicated argument for why humans and ASI will have bounded utility functions over logical uncertainty. (There is no mysterious force that fixes the Pascal's Mugging problem for unbounded utility functions, hence bounded utility functions are more likely to be correct.)

I'm also working on arguments why we can't just wait till the singularity to do the logic trade (counterfactuals are weird, and the ASI will only be logically uncertain for a brief moment).

Unfortunately my draft is currently a big mess. It's been 4 months and I'm procrastinating pretty badly on this idea :/ can't quite find the motivation.

Even if honeypots don't actually work, they may convince the superintelligence that honeypots are a real thing, that there is a tiny probability it is in some elaborate honeypot right now, and that everything convincing it there is no conspiracy is actually just part of the conspiracy.

If there is a tiny probability it will fail to take over the world, it may be worthwhile for it to modify itself so that it has some mercy towards humanity, and shares a little bit of the future with us.

It may say “I commit to giving you mercy because I expect you to be honourable beings who will repay me for this decision—even if you made no promises yet. You should also respect my goals a little in the case that I lose and you win.”

After all, neither side wants to make a logical gamble.

To maximize the superintelligence's uncertainty about success, we might have multiple levels of honeypots, some better disguised than others, so it doesn't know what the best-disguised level looks like.
