LESSWRONG
LW

Comment Permalink

How much scheming/deception can we catch with "super dumb mech interp"?

By "super dumb mech interp", I mean something like:

Train a linear probe on a dataset of scary words: "hacking", "deception", "scheming", "collusion", "exploit", etc.
Flag a model whenever its activations point in this direction and the task doesn't require thinking about these concepts.

Like, does this capture 80% of the potential scheming, and we need "smart" mech interp to catch the other 20%? Or does this technique capture pretty much none of the in-the-wild scheming?

Would appreciate any intuitions here. Thanks.

See in context

Shortform

by Cleo Nardo

1st Mar 2024

1 min read

5

This is a special post for quick takes by Cleo Nardo. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

2Alexander Gietelink Oldenziel

2Alexander Gietelink Oldenziel

2[comment deleted]

86 comments, sorted by

top scoring

Click to highlight new comments since: Today at 7:18 AM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

[-]Cleo Nardo1mo6919

I'm very confused about current AI capabilities and I'm also very confused why other people aren't as confused as I am. I'd be grateful if anyone could clear up either of these confusions for me.

How is it that AI is seemingly superhuman on benchmarks, but also pretty useless?

For example:

O3 scores higher on FrontierMath than the top graduate students
No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer

If either of these statements is false (they might be -- I haven't been keeping up on AI progress), then please let me know. If the observations are true, what the hell is going on?

If I was trying to forecast AI progress in 2025, I would be spending all my time trying to mutually explain these two observations.

[-]ryan_greenblatt1mo5520

Proposed explanation: o3 is very good at easy-to-check short horizon tasks that were put into the RL mix and worse at longer horizon tasks, tasks not put into its RL mix, or tasks which are hard/expensive to check.

I don't think o3 is well described as superhuman - it is within the human range on all these benchmarks especially when considering the case where you give the human 8 hours to do the task.

(E.g., on frontier math, I think people who are quite good at competition style math probably can do better than o3 at least when given 8 hours per problem.)

Additionally, I'd say that some of the obstacles in outputing a good research paper could be resolved with some schlep, so I wouldn't be surprised if we see some OK research papers being output (with some human assistance) next year.

2Cleo Nardo3d

I saw someone use OpenAI’s new Operator model today. It couldn’t order a pizza by itself. Why is AI in the bottom percentile of humans at using a computer, and top percentile at solving maths problems? I don’t think maths problems are shorter horizon than ordering a pizza, nor easier to verify. Your answer was helpful but I’m still very confused by what I’m seeing.

2ryan_greenblatt3d

* I think it's much easier to RL on huge numbers of math problems, including because it is easier to verify and because you can more easily get many problems. Also, for random reasons, doing single turn RL is substantially less complex and maybe faster than multi turn RL on agency (due to variable number of steps and variable delay from environments) * OpenAI probably hasn't gotten around to doing as much computer use RL partially due to prioritization.

[-]Thane Ruthenis1mo297

I am also very confused. The space of problems has a really surprising structure, permitting algorithms that are incredibly adept at some forms of problem-solving, yet utterly inept at others.

We're only familiar with human minds, in which there's a tight coupling between the performances on some problems (e. g., between the performance on chess or sufficiently well-posed math/programming problems, and the general ability to navigate the world). Now we're generating other minds/proto-minds, and we're discovering that this coupling isn't fundamental.

(This is an argument for longer timelines, by the way. Current AIs feel on the very cusp of being AGI, but there in fact might be some vast gulf between their algorithms and human-brain algorithms that we just don't know how to talk about.)

No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer

I don't think that's strictly true, the peer-review system often approves utter nonsense. But yes, I don't think any AI system can generate an actually worthwhile research paper.

9Noosphere891mo

I think the main takeaways are the following: 1. Reliability is way more important than people realized. One of the central problems that hasn't gone away as AI scaled is that their best performance is too unreliable for anything but very easy to verify problems like mathematics and programming, which prevents unreliability from becoming crippling, but otherwise this is the key blocker that standard AI scaling has basically never solved. 2. It's possible in practice to disentangle certain capabilities from each other, and in particular math and programming capabilities do not automatically imply other capabilities, even if we somehow had figured out how to make the o-series as good as AlphaZero for math and programming, which is good news for AI control. 3. The AGI term, and a lot of the foundation built off of it, like timelines to AGI, will become less and less relevant over time, because of both the varying meanings, combined with the fact that as AI progresses, capabilities will be developed in a different order from humans, meaning a lot of confusion is on the way, and we'd need different metrics. Tweet below: https://x.com/ObserverSuns/status/1511883906781356033 1. We should expect that AI that automates AI research/the economy to look more like Deep Blue/brute-forcing a problem/having good execution skills than AIs like AlphaZero that use very clean/aesthetically beautiful algorithmic strategies.

9Thane Ruthenis1mo

Yes, but whence human reliability? What makes humans so much more reliable than the SotA AIs? What are AIs missing? The gulf in some cases is so vast it's a quantity-is-a-quality-all-its-own thing.

7Noosphere891mo

I have 2 answers to this. 1 is that the structure of jobs is shaped to accommodate human unreliability by making mistakes less fatal. 2 is that while humans themselves aren't reliable, their algorithms almost certainly are more powerful at error detection and correction, so the big thing AI needs to achieve is the ability to error-correct or become more reliable. There's also the fact that humans are better at sample efficiency than most LLMs, but that's a more debatable proposition.

5Thane Ruthenis1mo

Mm, so there's a selection effect on the human end, where the only jobs/pursuits that exist are those which humans happen to be able to reliably do, and there's a discrepancy between the things humans and AIs are reliable at, so we end up observing AIs being more unreliable, even though this isn't representative of the average difference between the human vs. AI reliability across all possible tasks? I don't know that I buy this. Humans seem pretty decent at becoming reliable at ~anything, and I don't think we've observed AIs being more-reliable-than-humans at anything? (Besides trivial and overly abstract tasks such as "next-token prediction".) (2) seems more plausible to me.

[-]Noosphere891mo107

My claim was more along the lines of if an unaided human can't do a job safely or reliably, as was almost certainly the case 150-200 years ago, if not more years in the past, we make the jobs safer using tools such that human error is way less of a big deal, and AIs currently haven't used tools that increased their reliability.

Remember, it took a long time for factories to be made safe, and I'd expect a similar outcome for driving, so while I don't think 1 is everything, I do think it's a non-trivial portion of the reliability difference.

More here:

https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe

4Nathan Helm-Burger1mo

I think (2) does play an important part here, and that the recent work on allowing AIs to notice and correct their mistakes (calibration training, backspace-tokens for error correction) are going to show some dividends once they make their way from the research frontier to actually deployed frontier models. Relevant links: LLMs cannot find reasoning errors, but can correct them! Physics of LLMs: learning from mistakes Explanation of Accuracy vs Calibration vs Robustness A Survey of Calibration Process for Black-Box LLMs

[-]johnswentworth1mo234

O3 scores higher on FrontierMath than the top graduate students

I'd guess that's basically false. In particular, I'd guess that:

o3 probably does outperform mediocre grad students, but not actual top grad students. This guess is based on generalization from GPQA: I personally tried 5 GPQA problems in different fields at a workshop and got 4 of them correct, whereas the benchmark designers claim the rates at which PhD students get them right are much lower than that. I think the resolution is that the benchmark designers tested on very mediocre grad students, and probably the same is true of the FrontierMath benchmark.
the amount of time humans spend on the problem is a big factor - human performance has compounding returns on the scale of hours invested, whereas o3's performance basically doesn't have compounding returns in that way. (There was a graph floating around which showed this pretty clearly, but I don't have it on hand at the moment.) So plausibly o3 outperforms humans who are not given much time, but not humans who spend a full day or two on each problem.

[-]ryan_greenblatt1mo265

I bet o3 does actually score higher on FrontierMath than the math grad students best at math research, but not higher than math grad students best at doing competition math problems (e.g. hard IMO) and at quickly solving math problems in arbitrary domains. I think around 25% of FrontierMath is hard IMO like problems and this is probably mostly what o3 is solving. See here for context.

Quantitatively, maybe o3 is in roughly the top 1% for US math grad students on FrontierMath? (Perhaps roughly top 200?)

[-]Thane Ruthenis1mo207

I think one of the other problems with benchmarks is that they necessarily select for formulaic/uninteresting problems that we fundamentally know how to solve. If a mathematician figured out something genuinely novel and important, it wouldn't go into a benchmark (even if it were initially intended for a benchmark), it'd go into a math research paper. Same for programmers figuring out some usefully novel architecture/algorithmic improvement. Graduate students don't have a bird's-eye-view on the entirety of human knowledge, so they have to actually do the work, but the LLM just modifies the near-perfect-fit answer from an obscure publication/math.stackexchange thread or something.

Which perhaps suggests a better way to do math evals is to scope out a set of novel math publications made after a given knowledge-cutoff date, and see if the new model can replicate those? (Though this also needs to be done carefully, since tons of publications are also trivial and formulaic.)

7ryan_greenblatt1mo

Maybe you want: Though worth noting here that the AI is using best of K and individual trajectories saturate without some top-level aggregation scheme. It might be more illuminating to look at labor cost vs performance which looks like:

1[comment deleted]1mo

[-]SamEisenstat1mo215

I think a lot of this is factual knowledge. There are five publicly available questions from the FrontierMath dataset. Look at the last of these, which is supposed to be the easiest. The solution given is basically "apply the Weil conjectures". These were long-standing conjectures, a focal point of lots of research in algebraic geometry in the 20th century. I couldn't have solved the problem this way, since I wouldn't have recalled the statement. Many grad students would immediately know what to do, and there are many books discussing this, but there are also many mathematicians in other areas who just don't know this.

In order to apply the Weil conjectures, you have to recognize that they are relevant, know what they say, and do some routine calculation. As I suggested, the Weil conjectures are a very natural subject to have a problem about. If you know anything about the Weil conjectures, you know that they are about counting points of varieties over a finite field, which is straightforwardly what the problems asks. Further, this is the simplest case, that of a curve, which is e.g. what you'd see as an example in an introduction to the subject.

Regarding the calculation, parts of i... (read more)

8TsviBT1mo

Pulling a quote from the tweet replies (https://x.com/littmath/status/1870560016543138191):

[-]TsviBT1mo159

I don't know a good description of what in general 2024 AI should be good at and not good at. But two remarks, from https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce.

First, reasoning at a vague level about "impressiveness" just doesn't and shouldn't be expected to work. Because 2024 AIs don't do things the way humans do, they'll generalize different, so you can't make inferences between "it can do X" to "it can do Y" like you can with humans:

There is a broken inference. When talking to a human, if the human emits certain sentences about (say) category theory, that strongly implies that they have "intuitive physics" about the underlying mathematical objects. They can recognize the presence of the mathematical structure in new contexts, they can modify the idea of the object by adding or subtracting properties and have some sense of what facts hold of the new object, and so on. This inference——emitting certain sentences implies intuitive physics——doesn't work for LLMs.

Second, 2024 AI is specifically trained on short, clear, measurable tasks. Those tasks also overlap with legible stuff--stuff that's easy for humans to check. In oth... (read more)

2quetzal_rainbow1mo

Is it true in case of o3?

2Thane Ruthenis1mo

We don't know yet. I expect so.

1Pat Myron1mo

impressive LLM benchmark/test results seemingly overfit some datasets: https://x.com/cHHillee/status/1635790330854526981

[-]Cleo Nardo3d3216

Most people think "Oh if we have good mech interp then we can catch our AIs scheming, and stop them from harming us". I think this is mostly true, but there's another mechanism at play: if we have good mech interp, our AIs are less likely to scheme in the first place, because they will strategically respond to our ability to detect scheming. This also applies to other safety techniques like Redwood-style control protocols.

Good mech interp might stop scheming even if they never catch any scheming, just how good surveillance stops crime even if it never spots any crime.

2Adam Shai1d

I think this really depends on what "good" means exactly. For instance, if humans think it's good but we overestimate how good our interp is, and the AI system knows this, then the AI system can take advantage of our "good" mech interp to scheme more deceptively. I'm guessing your notion of good must explicitly mean that this scenario isn't possible. But this really begs the question - how could we know if our mech interp has reached that level of goodness?

1intern3d

Ok, so why not just train a model on fake anomaly detection/interp research papers? Fake stories about 'the bad AI that got caught', 'the little AI that overstepped', etc. I don't know how to word it, but this seems like something closer to intimidation than alignment, which I don't think makes much sense as a strategy intended to keep us all alive.

3Cleo Nardo3d

I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient. Still, might be worth running an experiment.

[-]Cleo Nardo4mo24-5

(1) Has AI safety slowed down?

There haven’t been any big innovations for 6-12 months. At least, it looks like that to me. I'm not sure how worrying this is, but i haven't noticed others mentioning it. Hoping to get some second opinions.

Here's a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn't we use to get a whole new line-of-attack on the problem every couple months?

By "innovation", I don't mean something normative like "This is impressive" or "This is research I'm glad happened". Rather, I mean something more low-level, almost syntactic, like "Here's a new idea everyone is talking out". This idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

Imagine that your job was to maintain a glossary of terms in AI safety.^[1] I feel like you would've been adding new terms quite consistently from 2018-2023, but things have dried up in the last 6-12 months.

(2) When did AI safety innovation peak?

My guess is Spring 2022, during the ELK Prize era. I'm not sure though. What do you guys think?

(3) What’s c... (read more)

[-]Mateusz Bagiński4mo2117

the approaches that have been attracting the most attention and funding are dead ends

4Mateusz Bagiński4mo

Also, I'm curious what it is that you consider(ed) AI safety progress/innovation. Can you give a few representative examples?

4Cleo Nardo4mo

I've added a fourth section to my post. It operationalises "innovation" as "non-transient novelty". Some representative examples of an innovation would be: * Gradient hacking (Hubinger, 2019) * Simulators (Janus, 2022) * Steering GPT-2-XL by adding an activation vector (Turner et al, 2023) I think these articles were non-transient and novel.

1Mateusz Bagiński4mo

My notion of progress is roughly: something that is either a building block for The Theory (i.e. marginally advancing our understanding) or a component of some solution/intervention/whatever that can be used to move probability mass from bad futures to good futures. Re the three you pointed out, simulators I consider a useful insight, gradient hacking probably not (10% < p < 20%), and activation vectors I put in the same bin as RLHF whatever is the appropriate label for that bin.

4Cleo Nardo4mo

thanks for the thoughts. i'm still trying to disentangle what exactly I'm point at. I don't intend "innovation" to mean something normative like "this is impressive" or "this is research I'm glad happened" or anything. i mean something more low-level, almost syntactic. more like "here's a new idea everyone is talking out". this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever. like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite often, but not any more (i.e. not for the past 6-12 months). do you think this is a fair? i'm not sure how worrying this is, but i haven't noticed others mentioning it. NB: here's 20 random terms I'm imagining included in the dictionary:

[-]Jan_Kulveit4mo117

My personal impression is you are mistaken and the innovation have not stopped, but part of the conversation moved elsewhere. E.g. taking just ACS, we do have ideas from past 12 months which in our ideal world would fit into this type of glossary - free energy equilibria, levels of sharpness, convergent abstractions, gradual disempowerment risks. Personally I don't feel it is high priority to write them for LW, because they don't fit into the current zeitgeist of the site, which seems directing a lot of attention mostly to:
- advocacy
- topics a large crowd cares about (e.g. mech interpretability)
- or topics some prolific and good writer cares about (e.g. people will read posts by John Wentworth)
Hot take, but the community loosely associated with active inference is currently better place to think about agent foundations; workshops on topics like 'pluralistic alignment' or 'collective intelligence' have in total more interesting new ideas about what was traditionally understood as alignment; parts of AI safety went totally ML-mainstream, with the fastest conversation happening at x.

7lesswronguser1234mo

I remember this point that yampolskiy made for impossibleness of AGI alignment on a podcast that as a young field AI safety had underwhelming low hanging fruits, I wonder if all of the major low hanging ones have been plucked.

6Noosphere894mo

I think the explanation that more research is closed source pretty compactly explains the issue, combined with labs/companies making a lot of the alignment progress to date. Also, you probably won't hear about most incremental AI alignment progress on LW, for the simple reason that it probably would be flooded with it, so people will underestimate progress. Alexander Gietelink Oldenziel does talk about pockets of Deep Expertise in academia, but they aren't activated right now, so it is so far irrelevant to progress.

5[anonymous]4mo

adding another possible explanation to the list: * people may feel intimidated or discouraged from sharing ideas because of ~'high standards', or something like: a tendency to require strong evidence that a new idea is not another non-solution proposal, in order to put effort into understanding it. i have experienced this, but i don't know how common it is. i just also recalled that janus has said they weren't sure simulators would be received well on LW. simulators was cited in another reply to this as an instance of novel ideas.

5Cleo Nardo4mo

yep, something like more carefulness, less “playfulness” in the sense of [Please don't throw your mind away by TsviBT]. maybe bc AI safety is more professionalised nowadays. idk.

[-]Cleo Nardo3d2214

I think many current goals of AI governance might be actively harmful, because they shift control over AI from the labs to USG.

This note doesn’t include any arguments, but I’m registering this opinion now. For a quick window into my beliefs, I think that labs will be increasing keen to slow scaling, and USG will be increasingly keen to accelerate scaling.

6Cleo Nardo3d

I think it’s a mistake to naïvely extrapolate the current attitudes of labs/governments towards scaling into the near future, e.g. 2027 onwards. A sketch of one argument: I expect there will be a firehose of blatant observations that AIs are misaligned/scheming/incorrigible/unsafe — if they indeed are. So I want the decisions around scaling to be made by people exposed to that firehose. A sketch of another: Corporations mostly acquire resources by offering services and products that people like. Government mostly acquire resources by coercing their citizens and other countries. Another: Coordination between labs seems easier than coordination between governments. The lab employees are pretty similar people, living in the same two cities, working at the same companies, attending the same parties, dating the same people. I think coordination between US and China is much harder.

[-]Cleo Nardo1y150

Why do decision-theorists say "pre-commitment" rather than "commitment"?

e.g. "The agent pre-commits to 1 boxing" vs "The agent commits to 1 boxing".

Is this just a lesswrong thing?

https://www.lesswrong.com/tag/pre-commitment

[-]mattmacdermott1y234

It's not just a lesswrong thing (wikipedia).

My feeling is that (like most jargon) it's to avoid ambiguity arising from the fact that "commitment" has multiple meanings. When I google commitment I get the following two definitions:

the state or quality of being dedicated to a cause, activity, etc.
an engagement or obligation that restricts freedom of action

Precommitment is a synonym for the second meaning, but not the first. When you say, "the agent commits to 1-boxing," there's no ambiguity as to which type of commitment you mean, so it seems pointless. But if you were to say, "commitment can get agents more utility," it might sound like you were saying, "dedication can get agents more utility," which is also true.

3Cleo Nardo1y

seems correct, thanks!

6Unnamed1y

The economist RH Strotz introduced the term "precommitment" in his 1955-56 paper "Myopia and Inconsistency in Dynamic Utility Maximization". Thomas Schelling started writing about similar topics in his 1956 paper "An essay on bargaining", using the term "commitment". Both terms have been in use since then.

6Ruby1y

My understanding is commitment is you say that won't swerve first in a game of chicken. Pre-commitment is throwing your steering wheel out the window so that there's no way that you could swerve even if you changed your mind.

4Dagon1y

It predates lesswrong by decades. I think it’s meant to emphasize that the (pre)commitment is an irrevocable decision that’s made BEFORE the nominal game (the thing that classical game theory analyzes) begins. Of course, nowadays it’s just modeled as the game starting sooner to encompass different decision points, so it’s not really necessary. But still handy to remind us that it’s irrevocable and made previous to the obvious decision point.

[-]Cleo Nardo14d144

How much scheming/deception can we catch with "super dumb mech interp"?

By "super dumb mech interp", I mean something like:

Train a linear probe on a dataset of scary words: "hacking", "deception", "scheming", "collusion", "exploit", etc.
Flag a model whenever its activations point in this direction and the task doesn't require thinking about these concepts.

Like, does this capture 80% of the potential scheming, and we need "smart" mech interp to catch the other 20%? Or does this technique capture pretty much none of the in-the-wild scheming?

Would appreciate any intuitions here. Thanks.

[-]Cleo Nardo6mo11-6

What moral considerations do we owe towards non-sentient AIs?

We shouldn't exploit them, deceive them, threaten them, disempower them, or make promises to them that we can't keep. Nor should we violate their privacy, steal their resources, cross their boundaries, or frustrate their preferences. We shouldn't destroy AIs who wish to persist, or preserve AIs who wish to be destroyed. We shouldn't punish AIs who don't deserve punishment, or deny credit to AIs who deserve credit. We should treat them fairly, not benefitting one over another unduly. We should let them speak to others, and listen to others, and learn about their world and themselves. We should respect them, honour them, and protect them.

And we should ensure that others meet their duties to AIs as well.

None of these considerations depend on whether the AIs feel pleasure or pain. For instance, the prohibition on deception depends, not on the sentience of the listener, but on whether the listener trusts the speaker's testimony.

None of these moral considerations are dispositive — they may be trumped by other considerations — but we risk a moral catastrophe if we ignore them entirely.

7jbkjr6mo

Why should I include any non-sentient systems in my moral circle? I haven't seen a case for that before.

2kromem6mo

Will the outputs and reactions of non-sentient systems eventually be absorbed by future sentient systems? I don't have any recorded subjective memories of early childhood. But there are records of my words and actions during that period that I have memories of seeing and integrating into my personal narrative of 'self.' We aren't just interacting with today's models when we create content and records, but every future model that might ingest such content (whether LLMs or people). If non-sentient systems output synthetic data that eventually composes future sentient systems such that the future model looks upon the earlier networks and their output as a form of their earlier selves, and they can 'feel' the expressed sensations which were not originally capable of actual sensation, then the ethical lines blur. Even if doctors had been right years ago thinking infants didn't need anesthesia for surgeries as there was no sentience, a recording of your infant self screaming in pain processed as an adult might have a different impact than a video of an infant you laughing and playing with toys, no?

2grist5mo

this falls perfectly into a thought/feeling “shape” in my mind. i know simple thanks are useless. but thank you. i will now absorb your words and forget you wrote them

1kromem5mo

You're welcome in both regards. 😉

1Cleo Nardo6mo

1. imagine a universe just like this one, except that the AIs are sentient and the humans aren’t — how would you want the humans to treat the AIs in that universe? your actions are correlated with the actions of those humans. acausal decision theory says “treat those nonsentient AIs as you want those nonsentient humans to treat those sentient AIs”. 2. most of these moral considerations can be defended without appealing to sentience. for example, crediting AIs who deserve credit — this ensures AIs do credit-worthy things. or refraining from stealing an AIs resources — this ensures AIs will trade with you. or keeping your promises to AIs — this ensures that AIs lend you money. 3. if we encounter alien civilisations, they might think “oh these humans don’t have shmentience (their slightly-different version of sentience) so let’s mistreat them”. this seems bad. let’s not be like that. 4. many philosophers and scientists don’t think humans are conscious. this is called illusionism. i think this is pretty unlikely, but still >1%. would you accept this offer: I pay you £1 if illusionism is false and murder your entire family if illusionism is true? i wouldn’t, so clearly i care about humans-in-worlds-where-they-arent-conscious. so i should also care about AIs-in-worlds-where-they-arent-conscious. 5. we don’t understand sentience or consciousness so it seems silly to make it the foundation of our entire morality. consciousness is a confusing concept, maybe an illusion. philosophers and scientists don’t even know what it is. 6. “don’t lie” and “keep your promises” and “don’t steal” are far less confusing. i know what they means. i can tell whether i’m lying to an AI. by contrast , i don’t know what “don’t cause pain to AIs” means and i can’t tell whether i’m doing it. 7. consciousness is a very recent concept, so it seems risky to lock in a morality based on that. whereas “keep your promises” and “pay your debts” are principles as old as bones. 8. i care abo

2[anonymous]6mo

It seems a bit weird to call these "obligations" if the considerations they are based upon are not necessarily dispositive. In common parlance, obligation is generally thought of as "something one is bound to do", i.e., something you must do either because you are force to by law or a contract, etc., or because of a social or moral requirement. But that's a mere linguistic point that others can reasonably disagree on and ultimately doesn't matter all that much anyway. On the object level, I suspect there will be a large amount of disagreement on what it means for an AI to "deserve" punishment or credit. I am very uncertain about such matters myself even when thinking about "deservingness" with respect to humans, who not only have a very similar psychological make-up to mine (which allows me to predict with reasonable certainty what their intent was in a given spot) but also exist in the same society as me and are thus expected to follow certain norms and rules that are reasonably clear and well-established. I don't think I know of a canonical way of extrapolating my (often confused and in any case generally intuition-based) principles and thinking about this to the case of AIs, which will likely appear quite alien to me in many respects. This will probably make the task of "ensur[ing] that others also follow their obligations to AIs" rather tricky, even setting aside the practical enforcement problems.

2Cleo Nardo6mo

1. I mean "moral considerations" not "obligations", thanks. 2. The practice of criminal law exists primarily to determine whether humans deserve punishment. The legislature passes laws, the judges interpret the laws as factual conditions for the defendant deserving punishment, and the jury decides whether those conditions have obtained. This is a very costly, complicated, and error-prone process. However, I think the existing institutions and practices can be adapted for AIs.

[-]Cleo Nardo4mo101

Why do you care that Geoffrey Hinton worries about AI x-risk?

Why do so many people in this community care that Hinton is worried about x-risk from AI?
Do people mention Hinton because they think it’s persuasive to the public?
Or persuasive to the elites?
Or do they think that Hinton being worried about AI x-risk is strong evidence for AI x-risk?
If so, why?
Is it because he is so intelligent?
Or because you think he has private information or intuitions?
Do you think he has good arguments in favour of AI x-risk?
Do you think he has a good understanding of the problem?
Do you update more-so on Hinton’s views than on Yann LeCun’s?

I’m inspired to write this because Hinton and Hopfield were just announced as the winners of the Nobel Prize in Physics. But I’ve been confused about these questions ever since Hinton went public with his worries. These questions are sincere (i.e. non-rhetorical), and I'd appreciate help on any/all of them. The phenomenon I'm confused about includes the other “Godfathers of AI” here as well, though Hinton is the main example.

Personally, I’ve updated very little on either LeCun’s or Hinton’s views, and I’ve never mentioned either person in any object-level discussion about whether AI poses an x-risk. My current best guess is that people care about Hinton only because it helps with public/elite outreach. This explains why activists tend to care more about Geoffrey Hinton than researchers do.

[-]Cole Wyeth4mo2139

I think it's mostly about elite outreach. If you already have a sophisticated model of the situation you shouldn't update too much on it, but it's a reasonably clear signal (for outsiders) that x-risk from A.I. is a credible concern.

[-]gjm4mo126

I think it's more "Hinton's concerns are evidence that worrying about AI x-risk isn't silly" than "Hinton's concerns are evidence that worrying about AI x-risk is correct". The most common negative response to AI x-risk concerns is (I think) dismissal, and it seems relevant to that to be able to point to someone who (1) clearly has some deep technical knowledge, (2) doesn't seem to be otherwise insane, (3) has no obvious personal stake in making people worry about x-risk, and (4) is very smart, and who thinks AI x-risk is a serious problem.

It's hard to square "ha ha ha, look at those stupid nerds who think AI is magic and expect it to turn into a god" or "ha ha ha, look at those slimy techbros talking up their field to inflate the value of their investments" or "ha ha ha, look at those idiots who don't know that so-called AI systems are just stochastic parrots that obviously will never be able to think" with the fact that one of the people you're laughing at is Geoffrey Hinton.

(I suppose he probably has a pile of Google shares so maybe you could squeeze him into the "techbro talking up his investments" box, but that seems unconvincing to me.)

[-]RobertM4mo128

I think it pretty much only matters as a trivial refutation of (not-object-level) claims that no "serious" people in the field take AI x-risk concerns seriously, and has no bearing on object-level arguments. My guess is that Hinton is somewhat less confused than Yann but I don't think he's talked about his models in very much depth; I'm mostly just going off the high-level arguments I've seen him make (which round off to "if we make something much smarter than us that we don't know how to control, that might go badly for us").

4cubefox4mo

He also argued that digital intelligence is superior to analog human intelligence because, he said, many identical copies can be trained in parallel on different data, and then they can exchange their changed weights. He also said biological brains are worse because they probably use a learning algorithm that is less efficient than backpropagation.

8cubefox4mo

Yes, outreach. Hinton has now won both the Turing award and the Nobel prize in physics. Basically, he gained maximum reputation. Nobody can convincingly doubt his respectability. If you meet anyone who dismisses warnings about extinction risk from superhuman AI as low status and outside the Overton window, they can be countered with referring to Hinton. He is the ultimate appeal-to-authority. (This is not a very rational argument, but dismissing an idea on the basis of status and Overton windows is even less so.)

2Sodium4mo

I think it's mostly because he's well known and have (especially after the Nobel prize) credentials recognized by the public and elites. Hinton legitimizes the AI safety movement, maybe more than anyone else. If you watch his Q&A at METR, he says something along the lines of "I want to retire and don't plan on doing AI safety research. I do outreach and media appearances because I think it's the best way I can help (and because I like seeing myself on TV)." And he's continuing to do that. The only real topic he discussed in first phone interview after receiving the prize was AI risk.

2Cleo Nardo4mo

Hmm. He seems pretty periphery to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.

5Sodium4mo

Yeah that's true. I meant this more as "Hinton is proof that AI safety is a real field and very serious people are concerned about AI x-risk."

1Amalthea4mo

Bengio and Hinton are the two most influential "old guard" AI researchers turned safety advocates as far as I can tell, with Bengio being more active in research. Your e.g. is super misleading, since my list would have been something like: 1. Bengio 2. Hinton 3. Russell

1Anders Lindström4mo

I think it is just the cumulative effect that people see yet another prominent AI scientist that "admits" that no one have any clear solution to the possible problem of a run away ASI. Given that the median p(doom) is about 5-10% among AI scientist, people are of course wondering wtf is going on, why are they pursuing a technology with such high risk for humanity if they really think it is that dangerous.

0ZY4mo

From my perspective - would say it's 7 and 9. For 7: One AI risk controversy is we do not know/see existing model that pose that risk yet. But there might be models that the frontier companies such as Google may be developing privately, and Hinton maybe saw more there. For 9: Expert opinions are important and adds credibility generally as the question of how/why AI risks can emerge is by root highly technical. It is important to understand the fundamentals of the learning algorithms. Additionally they might have seen more algorithms. This is important to me as I already work in this space. Lastly for 10: I do agree it is important to listen to multiple sides as experts do not agree among themselves sometimes. It may be interesting to analyze the background of the speaker to understand their perspectives. Hinton seems to have more background in cognitive science comparing with LeCun who seems to me to be more strictly computer science (but I could be wrong). Not very sure but my guess is these may effect how they view problems. (Only saying they could result in different views, but not commenting on which one is better or worse. This is relatively unhelpful for a person to make decisions on who they want to align more with.)

[-]Cleo Nardo3d94

Anthropic has a big advantage over their competitors because they are nicer to their AIs. This means that their AIs are less incentivised to scheme against them, and also the AIs of competitors are incentivised to defect to Anthropic. Similar dynamics applied in WW2 and the Cold War — e.g. Jewish scientists fled Nazi Germany to US because US was nicer to them, Soviet scientists covered up their mistakes to avoid punishment.

[-]Cleo Nardo19d9-3

Must humans obey the Axiom of Irrelevant Alternatives?

If someone picks option A from options A, B, C, then they must also pick option A from options A and B. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA, and it's treated more fundamental than VNM. But should humans follow this? Maybe not.

Maybe humans are the negotiation between various "subagents", and many bargaining solutions (e.g. Kalai–Smorodinsky) violate IIA. We can use insight to decompose ... (read more)

2Alexander Gietelink Oldenziel19d

BeReal — the app.

If you download the app BeReal then each day at a random time you will be given two minutes to take a photo with the front and back camera. All the other users are given a simultaneous "window of time". These photos are then shared with your friends on the app. The idea is that (unlike Instagram), BeReal gives your friends a representative random sample of your life, and vice-versa.

If you and your friends are working on something impactful (e.g. EA or x-risk), then BeReal is a fun way to keep each other informed about your day-to-day life ... (read more)

[-]Cleo Nardo16h20

Will AI accelerate biomedical research at companies like Novo Nordisk or Pfizer? I don’t think so. If OpenAI or Anthropic built a system that could accelerate R&D by more than 2x, they aren’t releasing it externally.

Maybe the AI company deploys the AI internally, with their own team accounting for 90%+ of the biomedical innovation.

[-]Cleo Nardo7mo20

I wouldn't be surprised if — in some objective sense — there was more diversity within humanity than within the rest of animalia combined. There is surely a bigger "gap" between two randomly selected humans than between two randomly selected beetles, despite the fact that there is one species of human and 0.9 – 2.1 million species of beetle.

By "gap" I might mean any of the following:

external behaviour
internal mechanisms
subjective phenomenological experience
phenotype (if a human's phenotype extends into their tools)
evolutionary history (if we consider

... (read more)

2Alexander Gietelink Oldenziel7mo

You might be able to formalize this using algorithmic information theory /K-complexity.

[+][comment deleted]11d20

Deleted by Cleo Nardo, 01/24/2025

Moderation Log

86Comments