David Scott Krueger (formerly: capybaralet)

I'm more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge's Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

Reward modeling and reward gaming
Aligning foundation models
Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
Preventing the development and deployment of socially harmful AI systems
Elaborating and evaluating speculative concerns about more advanced future AI systems

Posts


5 capybaralet's Shortform

25 Testing for consequence-blindness in LLMs using the HI-ADS unit test.

98 "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

20 What organizations other than Conjecture have (esp. public) info-hazard policies?

46 A (EtA: quick) note on terminology: AI Alignment != AI x-safety

32 Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk")

26 Quick thoughts on "scalable oversight" / "super-human feedback" research

28 Mechanistic Interpretability as Reverse Engineering (follow-up to "cars and elephants")

48 "Cars and Elephants": a handwavy argument/analogy against mechanistic interpretability

9 I'm planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on?

47 [An email with a bunch of links I sent an experienced ML researcher interested in learning about Alignment / x-safety.]

Wiki Contributions

Consequentialism

(+50/-38)

Comments

Testing for consequence-blindness in LLMs using the HI-ADS unit test.

David Scott Krueger (formerly: capybaralet)2mo30

You could try running tests on data that is far enough from the training distribution that the model won't generalize there in a simple imitative way, and you could run further tests to confirm that you are far enough off-distribution. For instance, perhaps using a carefully chosen invented language would work.

Quick thoughts on "scalable oversight" / "super-human feedback" research

David Scott Krueger (formerly: capybaralet)2moΩ120

I don't disagree... in this case you don't get agents for a long time; someone else does though.

Quick thoughts on "scalable oversight" / "super-human feedback" research

David Scott Krueger (formerly: capybaralet)2moΩ130

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge they are not literally "training" and more like "engineering".

Reading the ethicists 2: Hunting for AI alignment papers

David Scott Krueger (formerly: capybaralet)5moΩ120

I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, EAAMO.

Ways I Expect AI Regulation To Increase Extinction Risk

David Scott Krueger (formerly: capybaralet)9mo30

I found this thought-provoking, but I didn't find the arguments very strong.

(a) Misdirected Regulations Reduce Effective Safety Effort; Regulations Will Almost Certainly Be Misdirected

(b) Regulations Generally Favor The Legible-To-The-State

(d) Regulations Are Likely To Maximize The Power of Companies Pushing Forward Capabilities the Most

Briefly responding:
a) The issue in this story seems to be that the company doesn't care about x-safety, not that they are legally obligated to care about face-blindness.
b) If governments don't have bandwidth to effectively vet small AI projects, it seems prudent to err on the side of forbidding projects that might pose x-risk.
c) I do think we need effective international cooperation around regulation. But even buying 1-4 years of time seems good in expectation.
d) I don't see the x-risk aspect of this story.

How LLMs are and are not myopic

David Scott Krueger (formerly: capybaralet)9moΩ350

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?

What Discovering Latent Knowledge Did and Did Not Find

David Scott Krueger (formerly: capybaralet)10moΩ120

What do you mean by "random linear probe"?

Deceptive AI vs. shifting instrumental incentives

David Scott Krueger (formerly: capybaralet)10mo40

I skimmed this. A few quick comments:
- I think you characterized deceptive alignment pretty well.
- I think it only covers a narrow part of how deceptive behavior can arise.
- CICERO likely already did some of what you describe.

Instrumental Convergence? [Draft]

David Scott Krueger (formerly: capybaralet)10moΩ231

So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at $W$ is equal to our expectation of the degree to which Sia's desires are satisfied at $W^{*}$, for any $W, W^{*}$. Call that common expected value '$\mu$'. Secondly, our probabilities are symmetric around $\mu$. That is, our probability that $W$ satisfies Sia's desires to at least degree $\mu + x$ is equal to our probability that it satisfies her desires to at most degree $\mu - x$. And thirdly, learning how well satisfied Sia's desires are at some worlds won't tell us how well satisfied her desires are at other worlds. That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds. (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I'll say that Sia's desires are 'sampled randomly' from the space of all possible desires.

This is a characterization, and it remains to show that there exist distributions that fit it (I suspect there are not, assuming the sets of possible desires and worlds are unbounded).

I also find the third criterion counterintuitive. If worlds share features, I would expect these to not be independent.
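
For reference, one way to write the three quoted conditions formally is sketched below. The notation $D(W)$, for the degree to which Sia's desires are satisfied at world $W$, is introduced here purely for illustration and does not appear in the quoted post, whose appendix gives its own formulation.

$$\mathbb{E}[D(W)] = \mathbb{E}[D(W^{*})] = \mu \quad \text{for all } W, W^{*}$$

$$\Pr[D(W) \ge \mu + x] = \Pr[D(W) \le \mu - x] \quad \text{for all } W \text{ and all } x$$

$$D(W) \text{ is independent of } \{D(W') : W' \ne W\} \quad \text{for all } W$$

Written this way, the objection to the third condition is that worlds sharing features would plausibly have correlated values of $D$, which the independence requirement rules out.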

Did Bengio and Tegmark lose a debate about AI x-risk against LeCun and Mitchell?

David Scott Krueger (formerly: capybaralet)10mo157

I think it might be more effective in future debates to do the following at the outset:
* Explain that it's only necessary to cross a low bar (e.g. see my Tweet below). -- This is a common practice in debates.
* Outline the responses they expect to hear from the other side, and explain why they are bogus. Framing: "Whether AI is an x-risk has been debated in the ML community for 10 years, and nobody has provided any compelling counterarguments that refute the 3 claims (of the Tweet). You will hear a bunch of counterarguments from the other side, but when you do, ask yourself whether they are really addressing this. Here are a few counterarguments and why they fail..." -- I think this could really take the wind out of the sails of the opposition, and put them on the back foot.

I also don't think LeCun and Meta should be given so much credit -- Is Facebook really going to develop and deploy AI responsibly?
1) They have been widely condemned for knowingly playing a significant role in the Rohingya genocide, have acknowledged that they failed to act to prevent Facebook's role in the Rohingya genocide, and are being sued for $150bn for this.
2) They have also been criticised for the role that their products, especially Instagram, play in contributing to mental health issues, especially around body image in teenage girls.

More generally, I think the "companies do irresponsible stuff all the time" point needs to be stressed more. One argument that is particularly bogus is "we'll make it safe": x-safety is a common good, so companies should be expected to undersupply it. This is econ 101.