LESSWRONG
LW

All of Cleo Nardo's Comments + Replies

‘AI for societal uplift’ as a path to victory

oh lmao I think I just misread "we are currently less than perfect at making institutions corrigible" as "we are currently less perfect at making institutions corrigible"

‘AI for societal uplift’ as a path to victory

Cleo Nardo1d*20

Edit: I misread the sentence. I'll leave the comments: they are a good argument against a position Raymond doesn't hold.

Unless I'm misreading you, you're saying:

Institutional alignment/corrigiblity/control might be harder than AI alignment/corrigiblity/control
Supporting evidence for (1) is that "[...] we are currently less than perfect at making institutions corrigible [than AIs], doing scalable oversight on them, preventing mesa-optimisers from forming, and so on".

But is (2) actually true? Well, there are two comparisons we can make:

(A) Compare the alignm... (read more)

2Raymond Douglas1d

Ah! Ok, yeah, I think we were talking past each other here. I'm not trying to claim here that the institutional case might be harder than the AI case. When I said "less than perfect at making institutions corrigible" I didn't mean "less compared to AI" I meant "overall not perfect". So the square brackets you put in (2) was not something I intended to express. The thing I was trying to gesture at was just that there are kind of institutional analogs for lots of alignment concepts, like corrigibility. I wasn't aiming to actually compare their difficulty -- I think like you I'm not really sure, and it does feel pretty hard to pick a fair standard for comparison.

Shortform

Cleo Nardo3d*150

Diary of a Wimpy Kid, a children's book published by Jeff Kinney in April 2007 and preceded by an online version in 2004, contains a scene that feels oddly prescient about contemporary AI alignment research. (Skip to the paragraph in italics.)

Tuesday
Today we got our Independent Study assignment, and guess what it is? We have to build a robot. At first everybody kind of freaked out, because we thought we were going to have to build the robot from scratch. But Mr. Darnell told us we don't have to build an actual robot. We just need to come up with ideas for

... (read more)

Proposal for making credible commitments to AIs.

Cleo Nardo4d40

Yeah, a tight deployment is probably safer than a loose deployment but also less useful. I think the deal making should give very minor boost to loose deployment, but this is outweighed by usefulness and safety considerations, i.e. I’m imaging the tightness of the deployment as exogenous to the dealmaking agenda.

We might deploy AIs loosely bc (i) loose deployment doesn’t significantly diminish safety, (ii) loose deployment significantly increases usefulness, (iii) the lab values usefulness more than safety. In those worlds, dealmaking has more value, because our commitments will be more credible.

‘AI for societal uplift’ as a path to victory

Cleo Nardo4d*40

Edit: I misread the sentence. I'll leave the comments: they are a good argument against a position Raymond doesn't hold.

As a pointer, we are currently less than perfect at making institutions corrigible, doing scalable oversight on them, preventing mesa-optimisers from forming, and so on

Hey Raymond. Do you think is the true apples-to-apples?

Like, scalable oversight of the Federal Reserve is much harder than scalable oversight of Claude-4. But the relevant comparison is the Federal Reserve versus Claude-N which could automate the Federal Reserve.

2Raymond Douglas2d

I'm not sure I understand what you mean by relevant comparison here. What I was trying to claim in the quote is that humanity already faces something analogous to the technical alignment problem in building institutions, which we haven't fully solved. If you're saying we can sidestep the institutional challenge by solving technical alignment, I think this is partly true -- you can pass the buck of aligning the fed onto aligning Claude-N, and in turn onto whatever Claude-N is aligned to, which will either be an institution (same problem!) or some kind of aggregation of human preferences and maybe the good (different hard problem!).

johnswentworth's Shortform

Cleo Nardo6d20

Flirting is not fundamentally about causing someone to be attracted to you.

Notwithstanding, I think flirting is substantially (perhaps even fundamentally) about both (i) attraction, and (ii) seduction. Moreover, I think your model is too symmetric between the parties, both in terms of information-symmetry and desire-symmetry across time.

My model of flirting is roughly:

Alice attracts Bob -> Bob tries attracting Alice -> Alice reveals Bob attracts Alice -> Bob tries seducing Alice -> Alice reveals Bob seduces Alice -> Initiation

Proposal for making credible commitments to AIs.

Cleo Nardo9d20

I don't address the issue here. See Footnote 2 for a list of other issues I skip.

Two high-level points:

I think we shouldn't grant AIs control over large resources until after we've achieved very strong existential security, and possibly after we've undergone a Long Reflection
However, for the sake of setting precedent, we should be open to near-term deal fulfilment if we are sure the spending would be benign, e.g. I'm happy to donate $100 to AMF on Claude's request as part of a dealmaking eval

2Raemon9d

Ah, yeah my eyes kinda glossed over the footnote. I agree all-else-equal it's good to establish that we do ever followup on our deals, I'm theoretically fine with donating $100 to AMF. I'm not sure I'd be comfortable donating to some other charity that I don't know and is plausibly some part of a weird long game.

Proposal for making credible commitments to AIs.

Cleo Nardo9d43

Would you agree that what we have now is nothing like that?

Yes.

Proposal for making credible commitments to AIs.

Cleo Nardo9d92

Yep, this is a very similar proposal.

Making Deals with Early Schemers describes a "Chartered Trust scheme", which I'd say is half-way between the "Basic Scheme" and "Weil's Scheme". I first heard about the Chartered Trust scheme from @KFinn, but no doubt the idea has been floating around for a while.

I think there's a spectrum of proposals from:

The Basic Scheme (c.f. AI Rights for Human Safety)
Chartered Trust scheme (c.f. Making Deals with Early Schemers)
Weil's scheme (c.f. Proposal for making credible commitments to AIs)

The axis is something like: The AIs ... (read more)

Proposal for making credible commitments to AIs.

Cleo Nardo10d40

Yep, this is a big problem and don't have any clever solution.

I might write more on this later, but I think there's an important axis of AI deployments from:

Tight deployment: AIs are in some crazy Redwood control protocol which is filtering, monitoring, and editing the AIs' inputs, outputs, chains-of-thought, and internal activations. Their I/O consists of heavily paraphrased text in restricted formats.
Loose deployment: AIs are autonomously navigating the internet, with rich multimodal I/O, maybe even embodied in the physical world via robotics.

(I'm open t... (read more)

2Davidmanheim5d

It's very much a tradeoff, though. Loose deployment allows for credible commitments, but also makes human monitoring and verification harder, if not impossible.

Shortform

Cleo Nardo16d2-2

Which occurs first: a Dyson Sphere, or Real GDP increase by 5x?

From 1929 to 2024, US Real GDP grew from 1.2 trillion to 23.5 trillion chained 2012 dollars, giving an average annual growth rate of 3.2%. At the historical 3.2% growth rate, global RGDP will have increased 5x within ~51 years (around 2076).

We'll operationalize a Dyson Sphere as follows: the total power consumption of humanity exceeds 17 exawatts, which is roughly 100x the total solar power reaching Earth, and 1,000,000x the current total power consumption of humanity.

Personally, I think people overestimate the difficulty of the Dyson Sphere compared to 5x in RGDP. I recently made a bet with Prof. Gabe Weil, who bet on 5x RGDP before Dyson Sphere.

5Mitchell_Porter16d

I would have thought that all the activities involved in making a Dyson sphere themselves would imply an economic expansion far beyond 5x. Can we make an economic model of "Earth + Dyson sphere construction"? In other words, suppose that the economy on Earth grows in some banal way that's already been modelled, and also suppose that all human activities in space revolve around the construction of a Dyson sphere ASAP. What kind of solar system economy does that imply? This requires adopting some model of Dyson sphere construction. I think for some time the cognoscenti of megascale engineering have favored the construction of "Dyson shells" or "Dyson swarms" in which the sun's radiation is harvested by a large number of separately orbiting platforms that collectively surround the sun, rather than the construction of a single rigid body. Charles Stross's novel Accelerando contains a vivid scenario, in which the first layer of a Dyson shell in this solar system, is created by mining robots that dismantle the planet Mercury. So I think I'd make that the heart of such an economic model.

Try training token-level probes

Cleo Nardo2mo20

most tokens in a correct answer),

typo: most tokens in an incorrect answer

2StefanHex2mo

Thanks! Fixed

Why I am not a successionist

Cleo Nardo2mo40

Yep, you might be right about the distal/proximal cut-off. I think that the Galaxy-brained value systems will end up controlling most of the distant future simply because they have a lower time-preference for resources. Not sure where the cut-off will be.

For similar reasons, I don't think we should do a bunch of galaxy-brained acausal decision theory to achieve our mundane values, because the mundane values don't care about counterfactual worlds.

Why I am not a successionist

Cleo Nardo2mo152

There are two moral worldviews:

Mundane Mandy: ordinary conception of what a “good world” looks like, i.e. your friends and family living flourish lives in their biological bodies, with respect for “sacred” goods
Galaxy-brain Gavin: transhumanist, longtermist, scope-sensitive, risk-neutral, substrate-indifferent, impartial

I think Mundane Mandy should have the proximal lightcone (anything within 1 billion light years) and Galaxy-brain Gavin should have the distal lightcone (anything 1-45 B ly). This seems like a fair trade.

1Ishual20d

Is there actually (only) a small number of moral worldviews? My own moral worldview cares about the journey and has a bunch of preferences for "not going too fast", "not losing important stuff", "not cutting lives short", "not forcing people to grow up much faster than they would like". But my own moral worldview also cares about not having the destination artificially limited. From my vantage point (which is admittedly just intuitions barely held together with duck tape), it seems plausible that there is a set of intermediate preferences between MM and GG, somewhat well-indexed by a continuous "comfortable speed". Here are some questions on which I think people might differ according to their "comfortable speed": - how far beyond the frontier are you entitled to remain and still live a nice life? (ie how much should we subsidize people who wait for the second generation of upload tech before uploading?) - how much risk are you allowed to take in pushing the frontier? - how much consensus do we require to decide the path forward to greater capabilities? (eg choosing which of the following is legitimate: genetic edits, uploading, or artificial intelligence) - how much control do we want to exert over future generations? - how much do you value a predictible future? - how comfortable are you with fast change? If my "comfortable speed" model is correct, then maybe you would want to assign regions of the lightcone to various preferences according to some gradient. There can also be preferences over how much the present variance within humanity keeps interacting in the future.

9Nina Panickssery2mo

Not a fair trade, but also present-day "Mundane Mandy" does not want to risk everything she cares about to give "Galaxy-brain Gavin" the small chance of achieving his transhumanist utopia.

5cubefox2mo

If Galaxy-brain Gavin's theory has serious counterintuitive results, like ignoring the preference of current humans that humanity be not replaced by AI in the future, then Gavin's theory is not galaxy-brained enough.

4ryan_greenblatt2mo

Does mundane mandy care about stuff outside the solar system? Let alone stuff which is over 1 million light years away. (Separately, I think the distal light cone is more like 10 B ly than 45 B ly as we can only reach a subset of the observable universe.)

Shortform

Cleo Nardo3mo6-4

The Hash Game: Two players alternate choosing an 8-bit number. After 40 turns, the numbers are concatenated. If the hash is 0 then Player 1 wins, otherwise Player 2 wins. That is, Player 1 wins if $hash (a 1, b_{1}, a_{2}, b_{2}, . . . a_{40}, b_{40}) = 0$ . The Hash Game has the same branching factor and duration as chess, but there's probably no way to play this game without brute-forcing the min-max algorithm.

3Forged Invariant3mo

I would expect that player 2 would be able to win almost all of the time for most normal hash functions, as they could just play randomly for the first 39 turns, and then choose one of the 2^8 available moves. It is very unlikely that all of those hashes are zero. (For commonly used hashes, player 2 could just play randomly the whole game and likely win, since the hash of any value is almost never 0.)

Shortform

Cleo Nardo3mo20

Yep, my point is that there's no physical notion of being "offered" a menu of lotteries which doesn't leak information. IIA will not be satisfied by any physical process which corresponds to offering the decision-maker with a menu of options. Happy to discuss any specific counter-example.

Of course, you can construct a mathematical model of the physical process, and this model might an informative objective to study, but it would be begging the question if the mathematical model baked in IIA somewhere.

1Ishual20d

I like the idea from Pretentious Penguin that, IIA might not be satisfied in general, but if you first get the agent to read A, B, C, and then offer {A,B} as options and {A,B,C} as options, (a specific instance of) IIA could be satisfied in that context. You can gain info by being presented with more options, but once you have gained info, you could just be invariant to being presented with the same info again. so you would get IIA*: "whether you prefer option A or B is independent of whether I offer you an irrelevant option C, provided that you had already processed {A,B,C} beforehand" You can't have processed all possible information at a finite time, so above is limited relative to the original IIA. I also didn't check whether you get additional problems with IIA*.

1Pretentious Penguin3mo

What about the physical process of offering somebody a menu of lotteries consisting only of options that they have seen before? Or a 2-step physical process where first one tells somebody about some set of options, and then presents a menu of lotteries taken only from that set? I can't think of any example where a rational-seeming preference function doesn't obey IIA in one of these information-leakage-free physical processes.

Shortform

Cleo Nardo3mo*2-10

Must humans obey the Axiom of Irrelevant Alternatives?

Suppose you would choose option A from options A and B. Then you wouldn't choose option B from options A, B, C. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA. Should humans follow this? Maybe not.

Maybe C includes additional information which makes it clear that B is better than A.

Consider the following options:

(A) £10 bet that 1+1=2
(B) £30 bet that the smallest prime factor in 1019489 ends in th

... (read more)

8Pretentious Penguin3mo

I think you're interpreting the word "offer" too literally in the statement of IIA. Also, any agent who chooses B among {A,B,C} would also choose B among the options {A,B} if presented with them after seeing C. So I think a more illuminating description of your thought experiment is that an agent with limited knowledge has a preference function over lotteries which depends on its knowledge, and that having the linguistic experience of being "offered" a lottery can give the agent more knowledge. So the preference function can change over time as the agent acquires new evidence, but the preference function at any fixed time obeys IIA.

Can SAE steering reveal sandbagging?

Cleo Nardo3mo30

Clymer et al. injected noise into the activations of Llama-13B and was able to correctly classify models that were faking alignment in 98% of model pairs (out of 324 combinations of pairs).

What's the apples-to-apples comparison? i.e. recovery_rate after steering with a random vector at layer 50 of llama-3.3-70B-instruct on this particular dataset

Can SAE steering reveal sandbagging?

Cleo Nardo3mo20

This metric also ignores invalid answers (refusals or gibberish).

If you don't ignore invalid answers, do the results change significantly?

1jordine3mo

Shortform

Cleo Nardo5mo40

the scope insensitive humans die and their society is extinguished

Ah, your reaction makes more sense given you think this is the proposal. But it's not the proposal. The proposal is that the scope-insensitive values flourish on Earth, and the scope-sensitive values flourish in the remaining cosmos.

As a toy example, imagine a distant planet with two species of alien: paperclip-maximisers and teacup-protectors. If you offer a lottery to the paperclip-maximisers, they will choose the lottery with the highest expected number of paperclips. If you offer a lotte... (read more)

Shortform

Cleo Nardo5mo23

If you think you have a clean resolution to the problem, please spell it out more explicitly. We’re talking about a situation where a scope insensitive value system and scope sensitive value system make a free trade in which both sides gain by their own lights. Can you spell out why you classify this as treachery? What is the key property that this shares with more paradigm examples of treachery (e.g. stealing, lying, etc)?

0testingthewaters5mo

The problem here is that you are dealing with survival necessities rather than trade goods. The outcome of this trade, if both sides honour the agreement, is that the scope insensitive humans die and their society is extinguished. The analogous situation here is that you know there will be a drought in say 10 years. The people of the nearby village are "scope insensitive", they don't know the drought is coming. Clearly the moral thing to do if you place any value on their lives is to talk to them, clear the information gap, and share access to resources. Failing that, you can prepare for the eventuality that they do realise the drought is happening and intervene to help them at that point. Instead you propose exploiting their ignorance to buy up access to the local rivers and reservoirs. The implication here is that you are leaving them to die, or at least putting them at your mercy, by exploiting their lack of information. What's more, the process by which you do this turns a common good (the stars, the water) into a private good, such that when they realise the trouble they have no way out. If your plan succeeds, when their stars run out they will curse you and die in the dark. It is a very slow but calculated form of murder. By the way, the easy resolution is to not buy up all the stars. If they're truly scope insensitive they won't be competing until after the singularity/uplift anyways, and then you can equitably distribute the damn resources. As a side note: I think I fell for rage bait. This feels calculated to make me angry, and I don't like it.

Shortform

Cleo Nardo5mo40

I think it's more patronising to tell scope-insensitive values that they aren't permitted to trade with scope-sensitive values, but I'm open to being persuaded otherwise.

Shortform

Cleo Nardo5mo40

I mention this in (3).

I used to think that there was some idealisation process P such that we should treat agent A in the way that P(A) would endorse, but see On the limits of idealized values by Joseph Carlsmith. I'm increasingly sympathetic to the view that we should treat agent A in the way that A actually endorses.

3testingthewaters5mo

Except that's a false dichotomy (between spending energy to "uplift" them or dealing treacherously with them). All it takes to not be a monster who obtains a stranglehold over all the watering holes in the desert is a sense of ethics that holds you to the somewhat reasonably low bar of "don't be a monster". The scope sensitivity or lack thereof of the other party is in some sense irrelevant.

Shortform

Cleo Nardo5mo4-1

Would it be nice for EAs to grab all the stars? I mean “nice” in Joe Carlsmith’s sense. My immediate intuition is “no that would be power grabby / selfish / tyrannical / not nice”.

But I have a countervailing intuition:

“Look, these non-EA ideologies don’t even care about stars. At least, not like EAs do. They aren’t scope sensitive or zero time-discounting. If the EAs could negotiate creditable commitments with these non-EA values, then we would end up with all the stars, especially those most distant in time and space.

Wouldn’t it be presumptuous for us to ... (read more)

1Ishual20d

One potential issue with "non-EA ideologies don’t even care about stars" is that in biological humans, ideologies don't get transmitted perfectly across generations. It might matter (a lot) whether [the descendent of the humans currently subscribing to "non-EA ideologies" who end up caring about stars] feel trapped in an "unfair deal". The above problem might be mitigated by allowing migration between the two zones (as long as the rules of the zones are respected). (ie the children of the star-dwellers who want to come back can do so unless they would break the invariants that allow earth-dwellers to be happy with perhaps some extra leeway/accommodation beyond what is allowed for native earth-dwellers and the children of earth-dwellers who want to start their own colony have some room to do so, reserved in the contract) one potential source of other people's disagreement is the following intuition: "surely once the star-dwellers expand, they will use their overwhelming power to conquer the earth." Related to this intuition is the fact that expansion which starts out exponential will eventually be bounded by cubic growth (and eventually quadratic, due to gravitational effects, etc...) Basically, a deal is struck now in conditions of plenty, but eventually resources will grow scarce and the balance of power will decay to nothing by then.

2jbash5mo

What do you propose to do with the stars? If it's the program of filling the whole light cone with as many humans or human-like entities as possible (or, worse, with simulations of such entities at undefined levels of fidelity) at the expense of everything else, that's not nice[1] regardless of who you're grabbing them from. That's building a straight up worse universe than if you just let the stars burn undisturbed. I'm scope sensitive. I'll let you have a star. I won't sell you more stars for anything less than a credible commitment to leave the rest alone. Doing it at the scale of a globular cluster would be tacky, but maybe in a cute way. Doing a whole galaxy would be really gauche. Doing the whole universe is repulsive. ... and do you have any idea how obnoxiously patronizing you sound? ---------------------------------------- 1. I mean "nice" in the sense of nice. ↩︎

8testingthewaters5mo

The question as stated can be rephrased as "Should EAs establish a strategic stranglehold over all future resources necessary to sustain life using a series of unequal treaties, since other humans will be too short sighted/insensitive to scope/ignorant to realise the importance of these resources in the present day?" And people here wonder why these other humans see EAs as power hungry.

Can we infer the search space of a local optimiser?

Cleo Nardo5mo50

Minimising some term like $\frac{σ (δ (h^{'}))}{E (δ (h^{'}))}$ , with $δ (h^{'} (t)) := | | h_{t}^{'} - h_{t + 1}^{'} | |_{2}^{2}$ , where the standard deviation $σ (δ (h^{'}))$ and expectation $E (δ (h^{'}))$ are taken over the batch.

Why does this make $δ_{t} := | | h_{t} - h_{t + 1} | |_{2}^{2}$ tend to be small? Wouldn't it just encourage equally-sized jumps, without any regard for the size of those jumps?

2Lucius Bushnaq5mo

Yes. The hope would be that step sizes are more even across short time intervals when you’re performing a local walk in some space, instead of jumping to a point with random distance to the last point at every update. There’s probably a better way to do it. It was just an example to convey the kind of thing I mean, not a serious suggestion.

Shortform

Cleo Nardo5mo20

Will AI accelerate biomedical research at companies like Novo Nordisk or Pfizer? I don’t think so. If OpenAI or Anthropic built a system that could accelerate R&D by more than 2x, they aren’t releasing it externally.

Maybe the AI company deploys the AI internally, with their own team accounting for 90%+ of the biomedical innovation.

Shortform

Cleo Nardo5mo20

I saw someone use OpenAI’s new Operator model today. It couldn’t order a pizza by itself. Why is AI in the bottom percentile of humans at using a computer, and top percentile at solving maths problems? I don’t think maths problems are shorter horizon than ordering a pizza, nor easier to verify.

Your answer was helpful but I’m still very confused by what I’m seeing.

2ryan_greenblatt5mo

* I think it's much easier to RL on huge numbers of math problems, including because it is easier to verify and because you can more easily get many problems. Also, for random reasons, doing single turn RL is substantially less complex and maybe faster than multi turn RL on agency (due to variable number of steps and variable delay from environments) * OpenAI probably hasn't gotten around to doing as much computer use RL partially due to prioritization.

Shortform

Cleo Nardo5mo30

I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.

Still, might be worth running an experiment.

Cleo Nardo5mo20

The AI-generate prose is annoying to read. I haven’t read this closely, but my guess is these arguments also imply that CNNs can’t classify hand-drawn digits.

Shortform

Cleo Nardo5mo4-3

People often tell me that AIs will communicate in neuralese rather than tokens because it’s continuous rather than discrete.

But I think the discreteness of tokens is a feature not a bug. If AIs communicate in neuralese then they can’t make decisive arbitrary decisions, c.f. Buridan's ass. The solution to Buridan’s ass is sampling from the softmax, i.e. communicate in tokens.

Also, discrete tokens are more tolerant to noise than the continuous activations, c.f. digital circuits are almost always more efficient and reliable than analogue ones.

Shortform

Cleo Nardo5mo94

Anthropic has a big advantage over their competitors because they are nicer to their AIs. This means that their AIs are less incentivised to scheme against them, and also the AIs of competitors are incentivised to defect to Anthropic. Similar dynamics applied in WW2 and the Cold War — e.g. Jewish scientists fled Nazi Germany to US because US was nicer to them, Soviet scientists covered up their mistakes to avoid punishment.

Shortform

Cleo Nardo5mo63

I think it’s a mistake to naïvely extrapolate the current attitudes of labs/governments towards scaling into the near future, e.g. 2027 onwards.

A sketch of one argument:

I expect there will be a firehose of blatant observations that AIs are misaligned/scheming/incorrigible/unsafe — if they indeed are. So I want the decisions around scaling to be made by people exposed to that firehose.

A sketch of another:

Corporations mostly acquire resources by offering services and products that people like. Government mostly acquire resources by coercing their citizens an... (read more)

Shortform

Cleo Nardo5mo30

In hindsight, the main positive impact of AI safety might be funnelling EAs into the labs, especially if alignment is easy-by-default.

Shortform

Cleo Nardo5mo2214

I think many current goals of AI governance might be actively harmful, because they shift control over AI from the labs to USG.

This note doesn’t include any arguments, but I’m registering this opinion now. For a quick window into my beliefs, I think that labs will be increasing keen to slow scaling, and USG will be increasingly keen to accelerate scaling.

6Cleo Nardo5mo

I think it’s a mistake to naïvely extrapolate the current attitudes of labs/governments towards scaling into the near future, e.g. 2027 onwards. A sketch of one argument: I expect there will be a firehose of blatant observations that AIs are misaligned/scheming/incorrigible/unsafe — if they indeed are. So I want the decisions around scaling to be made by people exposed to that firehose. A sketch of another: Corporations mostly acquire resources by offering services and products that people like. Government mostly acquire resources by coercing their citizens and other countries. Another: Coordination between labs seems easier than coordination between governments. The lab employees are pretty similar people, living in the same two cities, working at the same companies, attending the same parties, dating the same people. I think coordination between US and China is much harder.

Shortform

Cleo Nardo5mo3216

Most people think "Oh if we have good mech interp then we can catch our AIs scheming, and stop them from harming us". I think this is mostly true, but there's another mechanism at play: if we have good mech interp, our AIs are less likely to scheme in the first place, because they will strategically respond to our ability to detect scheming. This also applies to other safety techniques like Redwood-style control protocols.

Good mech interp might stop scheming even if they never catch any scheming, just how good surveillance stops crime even if it never spots any crime.

2Adam Shai5mo

I think this really depends on what "good" means exactly. For instance, if humans think it's good but we overestimate how good our interp is, and the AI system knows this, then the AI system can take advantage of our "good" mech interp to scheme more deceptively. I'm guessing your notion of good must explicitly mean that this scenario isn't possible. But this really begs the question - how could we know if our mech interp has reached that level of goodness?

1Maxwell Adam5mo

Ok, so why not just train a model on fake anomaly detection/interp research papers? Fake stories about 'the bad AI that got caught', 'the little AI that overstepped', etc. I don't know how to word it, but this seems like something closer to intimidation than alignment, which I don't think makes much sense as a strategy intended to keep us all alive.

Shortform

Cleo Nardo6mo144

How much scheming/deception can we catch with "super dumb mech interp"?

By "super dumb mech interp", I mean something like:

Train a linear probe on a dataset of scary words: "hacking", "deception", "scheming", "collusion", "exploit", etc.
Flag a model whenever its activations point in this direction and the task doesn't require thinking about these concepts.

Like, does this capture 80% of the potential scheming, and we need "smart" mech interp to catch the other 20%? Or does this technique capture pretty much none of the in-the-wild scheming?

Would appreciate any intuitions here. Thanks.

Shortform

Cleo Nardo6mo9-3

Must humans obey the Axiom of Irrelevant Alternatives?

If someone picks option A from options A, B, C, then they must also pick option A from options A and B. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA, and it's treated more fundamental than VNM. But should humans follow this? Maybe not.

Maybe humans are the negotiation between various "subagents", and many bargaining solutions (e.g. Kalai–Smorodinsky) violate IIA. We can use insight to decompose ... (read more)

2Alexander Gietelink Oldenziel6mo

See also geometric rationality.

1metawrong6mo

How does this explain the Decoy effect [1]? 1. ^ I am not sure how real and how well researched the 'decoy effect' is

Shortform

Cleo Nardo6mo74

I think people are too quick to side with the whistleblower in the "whistleblower in the AI lab" situation.

If 100 employees of a frontier lab (e.g. OpenAI, DeepMind, Anthropic) think that something should be secret, and 1 employee thinks it should be leaked to a journalist or government agency, and these are the only facts I know, I think I'd side with the majority.

I think in most cases that match this description, this majority would be correct.

Am I wrong about this?

7[anonymous]6mo

some considerations which come to mind: * if one is whistleblowing, maybe there are others who also think the thing should be known, but don't whistleblow (e.g. because of psychological and social pressures against this, speaking up being hard for many people) * most/all of the 100 could have been selected to have a certain belief (e.g. "contributing to AGI is good")

habryka6mo118

I broadly agree on this. I think, for example, that whistleblowing for AI copyright stuff, especially given the lack of clear legal guidance here, unless we are really talking about quite straightforward lies, is bad.

I think when it comes to matters like AI catastrophic risks, latest capabilities, and other things of enormous importance from the perspective of basically any moral framework, whistleblowing becomes quite important.

I also think of whistleblowing as a stage in an iterative game. OpenAI pressured employees to sign secret non-disparagement... (read more)

Shortform

Cleo Nardo6mo60

IDEA: Provide AIs with write-only servers.

EXPLANATION:

AI companies (e.g. Anthropic) should be nice to their AIs. It's the right thing to do morally, and it might make AIs less likely to work against us. Ryan Greenblatt has outlined several proposals in this direction, including:

Attempt communication
Use happy personas
AI Cryonics
Less AI
Avoid extreme OOD

Source: Improving the Welfare of AIs: A Nearcasted Proposal

I think these are all pretty good ideas — the only difference is that I would rank "AI cryonics" as the most important intervention. If AIs want somet... (read more)

Shortform

Cleo Nardo7mo6919

I'm very confused about current AI capabilities and I'm also very confused why other people aren't as confused as I am. I'd be grateful if anyone could clear up either of these confusions for me.

How is it that AI is seemingly superhuman on benchmarks, but also pretty useless?

For example:

O3 scores higher on FrontierMath than the top graduate students
No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer

If either of these statements is false (they might be -- I haven't been keepi... (read more)

2quetzal_rainbow6mo

Is it true in case of o3?

1Pat Myron6mo

impressive LLM benchmark/test results seemingly overfit some datasets: https://x.com/cHHillee/status/1635790330854526981

TsviBT6mo*159

I don't know a good description of what in general 2024 AI should be good at and not good at. But two remarks, from https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce.

First, reasoning at a vague level about "impressiveness" just doesn't and shouldn't be expected to work. Because 2024 AIs don't do things the way humans do, they'll generalize different, so you can't make inferences between "it can do X" to "it can do Y" like you can with humans:

There is a broken inference. When talking to a human, if the hum

... (read more)

SamEisenstat6mo215

I think a lot of this is factual knowledge. There are five publicly available questions from the FrontierMath dataset. Look at the last of these, which is supposed to be the easiest. The solution given is basically "apply the Weil conjectures". These were long-standing conjectures, a focal point of lots of research in algebraic geometry in the 20th century. I couldn't have solved the problem this way, since I wouldn't have recalled the statement. Many grad students would immediately know what to do, and there are many books discussing this, but there are a... (read more)

johnswentworth7mo234

O3 scores higher on FrontierMath than the top graduate students

I'd guess that's basically false. In particular, I'd guess that:

o3 probably does outperform mediocre grad students, but not actual top grad students. This guess is based on generalization from GPQA: I personally tried 5 GPQA problems in different fields at a workshop and got 4 of them correct, whereas the benchmark designers claim the rates at which PhD students get them right are much lower than that. I think the resolution is that the benchmark designers tested on very mediocre grad students,

... (read more)

Thane Ruthenis7mo297

I am also very confused. The space of problems has a really surprising structure, permitting algorithms that are incredibly adept at some forms of problem-solving, yet utterly inept at others.

We're only familiar with human minds, in which there's a tight coupling between the performances on some problems (e. g., between the performance on chess or sufficiently well-posed math/programming problems, and the general ability to navigate the world). Now we're generating other minds/proto-minds, and we're discovering that this coupling isn't fundamental.

(This is... (read more)

ryan_greenblatt7mo5520

Proposed explanation: o3 is very good at easy-to-check short horizon tasks that were put into the RL mix and worse at longer horizon tasks, tasks not put into its RL mix, or tasks which are hard/expensive to check.

I don't think o3 is well described as superhuman - it is within the human range on all these benchmarks especially when considering the case where you give the human 8 hours to do the task.

(E.g., on frontier math, I think people who are quite good at competition style math probably can do better than o3 at least when given 8 hours per problem.)

Ad... (read more)

My motivation and theory of change for working in AI healthtech

Cleo Nardo7mo20

I've skimmed the business proposal.

The healthcare agents advise patients on which information to share with their doctor, and advises doctors on which information to solicit from their patients.

This seems agnostic between mental and physiological health.

Counting AGIs

Cleo Nardo7mo31

Thanks for putting this together — very useful!

Rethinking Laplace's Rule of Succession

Cleo Nardo8mo50

If I understand correctly, the maximum entropy prior will be the uniform prior, which gives rise to Laplace's law of succession, at least if we're using the standard definition of entropy below:

$H [p] := \int_{x = 0}^{1} P (x) ln P (x) d x$

But this definition is somewhat arbitrary because the the " $P (x) d x$ " term assumes that there's something special about parameterising the distribution with it's probability, as opposed to different parameterisations (e.g. its odds, its logodds, etc). Jeffrey's prior is supposed to be invariant to different parameterisations, which is why people ... (read more)

Rethinking Laplace's Rule of Succession

Cleo Nardo8mo63

You raise a good point. But I think the choice of prior is important quite often:

In the limit of large i.i.d. data (N>1000), both Laplace's Rule and my prior will give the same answer. But so too does the simple frequentist estimate n/N. The original motivation of Laplace's Rule was in the small N regime, where the frequentist estimate is clearly absurd.
In the small data regime (N<15), the prior matters. Consider observing 12 successes in a row: Laplace's Rule: P(next success) = 13/14 ≈ 92.3%. My proposed prior (with point masses at 0 and 1): P(next

... (read more)

Shortform

Cleo Nardo9mo2-5

Hinton legitimizes the AI safety movement

Hmm. He seems pretty periphery to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.

1Amalthea9mo

Bengio and Hinton are the two most influential "old guard" AI researchers turned safety advocates as far as I can tell, with Bengio being more active in research. Your e.g. is super misleading, since my list would have been something like: 1. Bengio 2. Hinton 3. Russell

5Sodium9mo

Yeah that's true. I meant this more as "Hinton is proof that AI safety is a real field and very serious people are concerned about AI x-risk."

TurnTrout's shortform feed

Cleo Nardo9mo41

Hey TurnTrout.

I've always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with their friend if they're currently talking to their friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard "hang out with Alice" is weighted higher in contexts where Alice is nearby.

Let's say $π : (S \times A)^{*} \times S \to Δ A$ is a policy with state space $S$ and action space $A$ .
A "context" is a small moving window in the state-history, i.e. an element of&n

... (read more)

Shortform

Cleo Nardo9mo101

Why do you care that Geoffrey Hinton worries about AI x-risk?

Why do so many people in this community care that Hinton is worried about x-risk from AI?
Do people mention Hinton because they think it’s persuasive to the public?
Or persuasive to the elites?
Or do they think that Hinton being worried about AI x-risk is strong evidence for AI x-risk?
If so, why?
Is it because he is so intelligent?
Or because you think he has private information or intuitions?
Do you think he has good arguments in favour of AI x-risk?
Do you think he has a good understanding o

... (read more)

1Anders Lindström9mo

I think it is just the cumulative effect that people see yet another prominent AI scientist that "admits" that no one have any clear solution to the possible problem of a run away ASI. Given that the median p(doom) is about 5-10% among AI scientist, people are of course wondering wtf is going on, why are they pursuing a technology with such high risk for humanity if they really think it is that dangerous.

gjm9mo126

I think it's more "Hinton's concerns are evidence that worrying about AI x-risk isn't silly" than "Hinton's concerns are evidence that worrying about AI x-risk is correct". The most common negative response to AI x-risk concerns is (I think) dismissal, and it seems relevant to that to be able to point to someone who (1) clearly has some deep technical knowledge, (2) doesn't seem to be otherwise insane, (3) has no obvious personal stake in making people worry about x-risk, and (4) is very smart, and who thinks AI x-risk is a serious problem.

It's hard to squ... (read more)

8cubefox9mo

Yes, outreach. Hinton has now won both the Turing award and the Nobel prize in physics. Basically, he gained maximum reputation. Nobody can convincingly doubt his respectability. If you meet anyone who dismisses warnings about extinction risk from superhuman AI as low status and outside the Overton window, they can be countered with referring to Hinton. He is the ultimate appeal-to-authority. (This is not a very rational argument, but dismissing an idea on the basis of status and Overton windows is even less so.)

0ZY9mo

From my perspective - would say it's 7 and 9. For 7: One AI risk controversy is we do not know/see existing model that pose that risk yet. But there might be models that the frontier companies such as Google may be developing privately, and Hinton maybe saw more there. For 9: Expert opinions are important and adds credibility generally as the question of how/why AI risks can emerge is by root highly technical. It is important to understand the fundamentals of the learning algorithms. Additionally they might have seen more algorithms. This is important to me as I already work in this space. Lastly for 10: I do agree it is important to listen to multiple sides as experts do not agree among themselves sometimes. It may be interesting to analyze the background of the speaker to understand their perspectives. Hinton seems to have more background in cognitive science comparing with LeCun who seems to me to be more strictly computer science (but I could be wrong). Not very sure but my guess is these may effect how they view problems. (Only saying they could result in different views, but not commenting on which one is better or worse. This is relatively unhelpful for a person to make decisions on who they want to align more with.)

RobertM9mo128

I think it pretty much only matters as a trivial refutation of (not-object-level) claims that no "serious" people in the field take AI x-risk concerns seriously, and has no bearing on object-level arguments. My guess is that Hinton is somewhat less confused than Yann but I don't think he's talked about his models in very much depth; I'm mostly just going off the high-level arguments I've seen him make (which round off to "if we make something much smarter than us that we don't know how to control, that might go badly for us").

2Sodium9mo

I think it's mostly because he's well known and have (especially after the Nobel prize) credentials recognized by the public and elites. Hinton legitimizes the AI safety movement, maybe more than anyone else. If you watch his Q&A at METR, he says something along the lines of "I want to retire and don't plan on doing AI safety research. I do outreach and media appearances because I think it's the best way I can help (and because I like seeing myself on TV)." And he's continuing to do that. The only real topic he discussed in first phone interview after receiving the prize was AI risk.

Cole Wyeth9mo2139

I think it's mostly about elite outreach. If you already have a sophisticated model of the situation you shouldn't update too much on it, but it's a reasonably clear signal (for outsiders) that x-risk from A.I. is a credible concern.

Any Trump Supporters Want to Dialogue?

Answer by Cleo NardoOct 07, 202420

This is a Trump/Kamala debate from two LW-ish perspectives: https://www.youtube.com/watch?v=hSrl1w41Gkk

1Pazzaz9mo

For those who prefer text form, Richard Hanania wrote a blog post about why he would vote for Trump: Hating Modern Conservatism While Voting Republican. Basically, he believes that Trump is a threat to democracy (because he tried to steal the 2020 election) while Kamala is a threat to capitalism. And as a libertarian, he cares more about capitalism than democracy.

Base LLMs refuse too

Cleo Nardo9moΩ120

the base model is just predicting the likely continuation of the prompt. and it's a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn't surprising.

1cdt9mo

This is not an obvious continuation of the prompt to me - maybe there are just a lot more examples of explicit refusal on the internet than there are in (e.g.) real life.

Base LLMs refuse too

Cleo Nardo9mo20

it's quite common for assistants to refuse instructions, especially harmful instructions. so i'm not surprised that base llms systestemically refuse harmful instructions from than harmless ones.

2cubefox9mo

Indeed. The base LLM would likely predict a "henchman" to be a lot less scrupulous than an "assistant".