All of Adam Jones's Comments + Replies

How big of an issue is this in practice? For AI in particular, considering that so much contemporary research is published on arXiv, surely most of it is relatively accessible?

I think this is less of an issue for technical AI papers. But I'm finding that more governance researchers (especially people moving over from other academic communities) seem intent on publishing in journals that policymakers can't read! I have also sometimes been blocked from easily sharing papers with governance friends because they are behind paywalls. I might see this more beca... (read more)

Thank you for writing this. I've tried to summarize this article (it misses some good points made above, but might be useful to people deciding whether to read the full article):

Summary

AGI might be developed by 2027, but we lack clear plans for tackling misalignment risks. This post:

  • calls for better short-timeline AI alignment plans
  • lists promising interventions that could be stacked to reduce risks

This plan focuses on two minimum requirements:

  • Secure model weights and algorithmic secrets
  • Ensure the first AI capable of alignment research isn't scheming

Laye... (read more)

At BlueDot we've been thinking about this a fair bit recently, and might be able to help here too. We have also thought a bit about criteria for good plans and the hurdles a plan needs to overcome, and have reviewed a lot of the existing literature on plans.

I've messaged you on Slack.

Re: Your comments on the power distribution problem

Agreed that multiple powerful adversarial entities controlling AI does not seem like a good plan. And I agree that if the decisive winner of the AI race will not act in humanity's best interests, we are screwed.

But I think this is a problem to address before that happens: we can shape the world today so it's more likely that the winner of the AI race will act in humanity's best interests.

4Satron
I agree with everything. We can and should be trying to improve our odds by making sure that the leading AI labs don't have any revenge-seeking psychopaths in their leadership.

Re: Your points about alignment solving this.

I agree that if you define alignment as 'get your AI system to act in the best interests of humans', then the coordination problem becomes harder, and solving it is likely sufficient for problems 2 and 3. But I think it then bundles more problems together in a way that might be less conducive to solving them.

For loss of control, I was primarily thinking about making systems intent-aligned, by which I mean getting the AI system to try to do what its creators intend. I think this makes dividing these challenges up into subproblems ea... (read more)

4Satron
Ah, I see. You are absolutely right. I unintentionally used two different meanings of the word "alignment" in problems 1 and 3. If we define alignment as intent alignment (from my comment on problem 1), then humans don't necessarily lose control over the economy in The Economic Transition Problem. The group of people who win the AI race will basically control the entire economy by controlling the AI that's controlling the world (and is intent-aligned with them).

If we are lucky, they can create a democratic online council where each human gets a say in how the economy is run, and the group will tell the AI what to do based on how humanity voted. Alternatively, with the help of their intent-aligned AI, the group can try to build a value-aligned AI. Once they are confident that this AI is indeed value-aligned, they can release it and let it be the steward of humanity.

In this scenario, The Economic Transition Problem just becomes The Power Distribution Problem of ensuring that whoever wins the AI race will act in humanity's best interests (or close enough).

The UK Government tends to use the PHIA probability yardstick in most of its communications.

This is used very consistently in national security publications. It's also commonly used by other UK Government departments as people frequently move between departments in the civil service, and documents often get reviewed for clearance by national security bodies before public release.

It is less granular than the IPCC terms at the extremes, but the ranges don't overlap. I don't know which is actually better to use in AI safety communications, but I think being c... (read more)

1eggsyntax
Thanks! Quite similar to the Kesselman tags that @gwern uses (reproduced in this comment below), and I'd guess that one is descended from the other, although with somewhat different range cutoffs for each term, because why should anything ever be consistent. Here are the UK ones in question (for ease of comparison):

  • Remote chance: ≤5%
  • Highly unlikely: 10–20%
  • Unlikely: 25–35%
  • Realistic possibility: 40–50%
  • Likely or probable: 55–75%
  • Highly likely: 80–90%
  • Almost certain: ≥95%

A comment provided to me by a reader, highlighting 3rd party liability and insurance as interventions too (lightly edited):

Hi! I liked your AI regulator’s toolbox post – very useful to have a comprehensive list like this! I'm not sure exactly what heading it should go under, but I suggest considering adding proposals to greatly increase 3rd party liability (and/or require carrying insurance). A nice intro is here:
https://www.lawfaremedia.org/article/tort-law-and-frontier-ai-governance

Some are explicitly proposing strict liability for catastrophic risks. Ga

... (read more)
Adam Jones2715

I don't understand how the experimental setup provides evidence for self-other overlap working.

The reward structure for the blue agent doesn't seem to provide a non-deceptive reason to interact with the red agent. The described "non-deceptive" behaviour (going straight to the goal) doesn't seem to demonstrate awareness of or response to the red agent.

Additionally, my understanding of the training setup is that it tries to make the blue agent's activations the same regardless of whether it observes the red agent or not. This would mean there's effectively n... (read more)
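For what it's worth, here is a minimal sketch of what I understand that training objective to look like, just to make the discussion concrete. This is my own illustration rather than the authors' code (the network, names, and masking scheme are assumptions): the penalty pushes the blue agent's activations on an observation that includes the red agent towards its activations on the same observation with the red agent's features masked out.

```python
import torch
import torch.nn as nn


class Policy(nn.Module):
    """Toy policy network; names and shapes are illustrative only."""

    def __init__(self, obs_dim: int, hidden_dim: int = 64, n_actions: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)  # the activations the overlap penalty targets
        return self.head(h), h


def soo_penalty(policy: Policy, obs: torch.Tensor, other_mask: torch.Tensor) -> torch.Tensor:
    """Self-other overlap penalty: mean squared distance between activations
    on the full observation and on the same observation with the other
    agent's features zeroed out."""
    _, h_with_other = policy(obs)
    _, h_self_only = policy(obs * (1 - other_mask))
    return ((h_with_other - h_self_only) ** 2).mean()


# Sketch of how this would combine with the task objective:
# total_loss = rl_loss + soo_coeff * soo_penalty(policy, obs, red_agent_feature_mask)
```

If that reading is right, the penalty is minimised whenever the blue agent's internal state ignores the red agent entirely, which is what makes me unsure the "non-deceptive" behaviour tells us much.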

3Knight Lee
Good point! I really, really love their idea, but I'm also skeptical of their toy model. The author admitted that "Optimising for SOO incentivises the model to act similarly when it observes another agent to when it only observes itself."

I really like your idea of making a better toy model ("I haven't thought very hard about this, so this might also have problems"), where the red agent follows behind the blue agent so it won't reveal the obstacles.

I really tried to think of a better toy model without any problems, but I think it's really hard to do without LLMs, because AIs simpler than LLMs do not imagine plans which affect other agents. Self-other overlap only truly works on planning AI, because only planning AI needs to distinguish outcomes for itself from outcomes for others in order to be selfish/deceptive. Non-planning AI doesn't think about outcomes at all; its selfishness/deceptiveness is hardwired into its behaviour. Making it behave as if it only observes itself can weaken its capability to pull off tricky deceptive manoeuvres, but only because that nerfs its overall ability to interact with other agents.

I'm looking forward to their follow-up post with LLMs.

Most sunscreen feels horrible and slimy (especially in the US where the FDA has not yet approved the superior formulas available in Europe and Asia).

What superior formulas available in Europe would you recommend?

Thanks for the feedback! The article does include some bits on this, but I don't think LessWrong supports toggle block formatting.

I think individuals probably won't be able to train models themselves that pose advanced misalignment threats before large companies do. In particular, I think we disagree about how likely it is that someone will discover some big algorithmic efficiency trick that enables people to leap forward on this (I don't think this will happen; I think you think it will).

But I do think the catastrophic misuse angle seem... (read more)