Cleo Nardo

DMs open.

Sequences

Game Theory without Argmax

Comments

Hinton legitimizes the AI safety movement

Hmm. He seems pretty peripheral to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.

Hey TurnTrout.

I've always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with their friend if they're currently talking to their friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard "hang out with Alice" is weighted higher in contexts where Alice is nearby.

  • Let's say $\pi$ is a policy with state space $S$ and action space $A$.
  • A "context" is a small moving window in the state-history, i.e. an element of $S^k$ where $k$ is a small positive integer.
  • A shard is something like $Q_i : S \times A \to \mathbb{R}$, i.e. it evaluates actions given particular states.
  • The shards $Q_i$ are "activated" by contexts, i.e. $\alpha_i : S^k \to \mathbb{R}_{\geq 0}$ maps each context to the amount that shard $Q_i$ is activated by the context.
  • The total activation of $Q_i$, given a history $h = (s_1, \ldots, s_t) \in S^t$, is given by the time-decay average of the activation across the contexts, i.e. $A_i(h) = \frac{\sum_{j=k}^{t} \gamma^{\,t-j}\, \alpha_i(s_{j-k+1}, \ldots, s_j)}{\sum_{j=k}^{t} \gamma^{\,t-j}}$ for some decay rate $\gamma \in (0,1)$.
  • The overall utility function $U$ is the weighted average of the shards, i.e. $U(a \mid h) = \frac{\sum_i A_i(h)\, Q_i(s_t, a)}{\sum_i A_i(h)}$.
  • Finally, the policy $\pi$ will maximise the utility function, i.e. $\pi(h) = \operatorname{argmax}_{a \in A} U(a \mid h)$.
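
For concreteness, here's a minimal runnable sketch of the setup above in Python. Every concrete function body (the particular $Q_i$, $\alpha_i$, decay factor, and window length) is a placeholder of my own, purely to make the types and the final argmax explicit:

```python
import numpy as np

GAMMA = 0.9   # time-decay factor for averaging activations over past contexts
K = 2         # context window length (a "context" is the last K states)

def shard_value(i, state, action):
    """Q_i(s, a): how much shard i values taking `action` in `state` (toy stand-in)."""
    return np.sin(i + state + action)

def shard_activation(i, context):
    """alpha_i(c): how strongly a context (tuple of K recent states) activates shard i (toy stand-in)."""
    return np.exp(-abs(sum(context) - i))

def total_activation(i, history):
    """A_i(h): time-decayed average of shard i's activation over the contexts in h."""
    contexts = [tuple(history[j - K:j]) for j in range(K, len(history) + 1)]
    weights = [GAMMA ** (len(contexts) - 1 - j) for j in range(len(contexts))]
    activations = [shard_activation(i, c) for c in contexts]
    return sum(w * a for w, a in zip(weights, activations)) / sum(weights)

def policy(history, actions, n_shards=3):
    """pi(h) = argmax_a U(a | h); the normaliser sum_i A_i(h) is dropped since it doesn't change the argmax."""
    s_t = history[-1]
    utility = {
        a: sum(total_activation(i, history) * shard_value(i, s_t, a) for i in range(n_shards))
        for a in actions
    }
    return max(utility, key=utility.get)

print(policy(history=[0, 1, 2, 3], actions=[0, 1, 2]))
```

The point is only the shape of the computation: contexts activate shards, activations are averaged with time decay, and the policy argmaxes the activation-weighted shard values at the current state.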

Is this what you had in mind?

Why do you care that Geoffrey Hinton worries about AI x-risk?

  1. Why do so many people in this community care that Hinton is worried about x-risk from AI?
  2. Do people mention Hinton because they think it’s persuasive to the public?
  3. Or persuasive to the elites?
  4. Or do they think that Hinton being worried about AI x-risk is strong evidence for AI x-risk?
  5. If so, why?
  6. Is it because he is so intelligent?
  7. Or because you think he has private information or intuitions?
  8. Do you think he has good arguments in favour of AI x-risk?
  9. Do you think he has a good understanding of the problem?
  10. Do you update more on Hinton’s views than on Yann LeCun’s?

I’m inspired to write this because Hinton and Hopfield were just announced as the winners of the Nobel Prize in Physics. But I’ve been confused about these questions ever since Hinton went public with his worries. These questions are sincere (i.e. non-rhetorical), and I'd appreciate help on any/all of them. The phenomenon I'm confused about includes the other “Godfathers of AI” here as well, though Hinton is the main example.

Personally, I’ve updated very little on either LeCun’s or Hinton’s views, and I’ve never mentioned either person in any object-level discussion about whether AI poses an x-risk. My current best guess is that people care about Hinton only because it helps with public/elite outreach. This explains why activists tend to care more about Geoffrey Hinton than researchers do.

Answer by Cleo Nardo

This is a Trump/Kamala debate from two LW-ish perspectives: https://www.youtube.com/watch?v=hSrl1w41Gkk

Cleo Nardo

the base model is just predicting the likely continuation of the prompt. and it's a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn't surprising.

it's quite common for assistants to refuse instructions, especially harmful instructions. so i'm not surprised that base llms systematically refuse harmful instructions more often than harmless ones.

yep, something like more carefulness, less “playfulness” in the sense of [Please don't throw your mind away by TsviBT]. maybe bc AI safety is more professionalised nowadays. idk. 

thanks for the thoughts. i'm still trying to disentangle what exactly i'm pointing at.

I don't intend "innovation" to mean something normative like "this is impressive" or "this is research I'm glad happened" or anything. i mean something more low-level, almost syntactic. more like "here's a new idea everyone is talking about". this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite often, but not any more (i.e. not for the past 6-12 months). do you think this is fair? i'm not sure how worrying this is, but i haven't noticed others mentioning it.

NB: here's 20 random terms I'm imagining included in the dictionary:

  1. Evals
  2. Mechanistic anomaly detection
  3. Steganography
  4. Glitch token
  5. Jailbreaking
  6. RSPs
  7. Model organisms
  8. Trojans
  9. Superposition
  10. Activation engineering
  11. CCS
  12. Singular Learning Theory
  13. Grokking
  14. Constitutional AI
  15. Translucent thoughts
  16. Quantilization
  17. Cyborgism
  18. Factored cognition
  19. Infrabayesianism
  20. Obfuscated arguments

I've added a fourth section to my post. It operationalises "innovation" as "non-transient novelty". Some representative examples of an innovation would be:

I think these articles were non-transient and novel.

Cleo Nardo

(1) Has AI safety slowed down?

There haven’t been any big innovations for 6-12 months. At least, it looks like that to me. I'm not sure how worrying this is, but I haven't noticed others mentioning it. Hoping to get some second opinions.

Here's a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn't we use to get a whole new line-of-attack on the problem every couple months?

By "innovation", I don't mean something normative like "This is impressive" or "This is research I'm glad happened". Rather, I mean something more low-level, almost syntactic, like "Here's a new idea everyone is talking out". This idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

Imagine that your job was to maintain a glossary of terms in AI safety.[1] I feel like you would've been adding new terms quite consistently from 2018-2023, but things have dried up in the last 6-12 months.

(2) When did AI safety innovation peak?

My guess is Spring 2022, during the ELK Prize era. I'm not sure though. What do you guys think?

(3) What’s caused the slow down?

Possible explanations:

  1. ideas are harder to find
  2. people feel less creative
  3. people are more cautious
  4. more publishing in journals
  5. research is now closed-source
  6. we lost the mandate of heaven
  7. the current ideas are adequate
  8. paul christiano stopped posting
  9. i’m mistaken, innovation hasn't stopped
  10. something else

(4) How could we measure "innovation"?

By "innovation" I mean non-transient novelty. An article is "novel" if it uses n-grams that previous articles didn't use, and an article is "transient" if it uses n-grams that subsequent articles didn't use. Hence, an article is non-transient and novel if it introduces a new n-gram which sticks around. For example, Gradient Hacking (Evan Hubinger, October 2019) was an innovative article, because the n-gram "gradient hacking" doesn't appear in older articles, but appears often in subsequent articles. See below.

Barron et al (2017) analysed 40,000 parliamentary speeches from the French Revolution. They introduced a metric, "resonance", which is novelty (the surprise of an article given past articles) minus transience (the surprise of an article given subsequent articles). See below.

My claim is that recent AI safety research has been less resonant.
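
To make this concrete, here's a toy sketch of how one might compute these scores. It's my own n-gram surprisal stand-in, not the estimator from Barron et al (who, as I understand it, work with topic-model distributions rather than raw n-grams):

```python
from collections import Counter
import math

def ngrams(text, n=2):
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def surprisal(article, window, n=2, alpha=0.01):
    """Average negative log-probability of the article's n-grams under a
    smoothed n-gram distribution estimated from a window of other articles."""
    counts = Counter(g for other in window for g in ngrams(other, n))
    total = sum(counts.values())
    vocab = len(counts) + 1  # crude add-alpha smoothing denominator
    grams = ngrams(article, n)
    if not grams:
        return 0.0
    return sum(-math.log((counts[g] + alpha) / (total + alpha * vocab)) for g in grams) / len(grams)

def resonance(articles, i, window=5, n=2):
    """Novelty (surprise given earlier articles) minus transience (surprise given later articles)."""
    past, future = articles[max(0, i - window):i], articles[i + 1:i + 1 + window]
    if not past or not future:
        return None  # undefined at the edges of the corpus
    novelty = surprisal(articles[i], past, n)
    transience = surprisal(articles[i], future, n)
    return novelty - transience
```

Under this scoring, an article that introduces an n-gram like "gradient hacking", which earlier articles never used but later articles reuse heavily, gets high novelty and low transience, hence high resonance.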

  1. ^

    Here's 20 random terms that would be in the glossary, to illustrate what I mean:

    1. Evals
    2. Mechanistic anomaly detection
    3. Steganography
    4. Glitch token
    5. Jailbreaking
    6. RSPs
    7. Model organisms
    8. Trojans
    9. Superposition
    10. Activation engineering
    11. CCS
    12. Singular Learning Theory
    13. Grokking
    14. Constitutional AI
    15. Translucent thoughts
    16. Quantilization
    17. Cyborgism
    18. Factored cognition
    19. Infrabayesianism
    20. Obfuscated arguments