All of maxnadeau's Comments + Replies

I figured out the encoding, but I expressed the algorithm for computing the decoding in different language from you. My algorithm produces equivalent outputs but is substantially uglier. I wanted to leave a note here in case anyone else had the same solution.

 

Alt phrasing of the solution:

Each expression (i.e. a well-formed string of brackets) has a "degree", which is defined as the number of well-formed chunks that the expression can be broken up into. Some examples: [], [[]], and [-[][]] have degree one; [][], -[][[]], and [][[][]] have degree two; ... (read more)
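For anyone who wants to poke at this, here's a minimal sketch of how I'd compute the degree, assuming "chunk" just means a balanced [...] group at the top level (with any leading '-' attaching to the group that follows it); the function name and the exact handling of '-' are my own illustrative choices, not part of the original puzzle:

```python
def degree(expr: str) -> int:
    """Count the top-level well-formed chunks in a bracket expression.

    A chunk here is one balanced [...] group at nesting depth zero; a
    leading '-' is treated as attaching to the group that follows it.
    """
    depth = 0
    chunks = 0
    for ch in expr:
        if ch == "[":
            if depth == 0:
                chunks += 1  # a new top-level group starts here
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced brackets")
        # '-' (and anything else) doesn't affect the chunk count
    if depth != 0:
        raise ValueError("unbalanced brackets")
    return chunks

# The examples above:
assert degree("[]") == degree("[[]]") == degree("[-[][]]") == 1
assert degree("[][]") == degree("-[][[]]") == degree("[][[][]]") == 2
```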

Czynski
Not only is there not a standard name for this set of numbers, but it's not clear what that set of numbers is. I consulted a better mathematician in the past, and he said that if you allow multiplication it becomes a known unsolved problem whether its representations are unique and whether it can construct all algebraic numbers.

On point 6, "Humanity can survive an unaligned superintelligence": In this section, I initially took you to be making a somewhat narrow point about humanity's safety if we develop aligned superintelligence and humanity + the aligned superintelligence has enough resources to out-innovate and out-prepare a misaligned superintelligence. But I can't tell if you think this conditional will be true, i.e. whether you think the existential risk to humanity from AI is low due to this argument. I infer from this tweet of yours that AI "kill[ing] us all" is not among... (read more)

maxnadeau

To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right? 

I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?

Marius Hobbhahn
Good point! Yes, I use the term scheming in a much broader way, similar to how we use it in the in-context scheming paper. I would assume that our scheming term is even broader than Joe's alignment-faking because it also includes taking direct covert action like disabling oversight (which arguably is not alignment-faking).
maxnadeau

People interested in working on these sorts of problems should consider applying to Open Phil's request for proposals: https://www.openphilanthropy.org/request-for-proposals-technical-ai-safety-research/

This section of our RFP has some other related work you might want to include, e.g. Orgad et al.

I think the link in footnote two goes to the wrong place?

I haven't read the paper, but based only on the phrase you quote, I assume it's referring to hacks like the one shown here: https://arxiv.org/pdf/2210.10760#19=&page=19.0

maxnadeau

Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical.

One (edit: minor) component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describing.

Neel Nanda
I don't know much about CTF specifically, but based on my maths exam/olympiad experience I predict that there's a lot of tricks to go fast (common question archetypes, saved code snippets, etc) that will be top of mind for people actively practicing, but not for someone with a lot of domain expertise who doesn't explicitly practice CTF. I also don't know how important speed is for being a successful cyber professional. They might be able to get some of this speed up with a bit of practice, but I predict by default there's a lot of room for improvement.
elifland
Yes, that would be my guess, medium confidence. I'm skeptical of your skepticism. Not knowing basically anything about the CTF scene but using the competitive programming scene as an example, I think the median competitor is much more capable than the median software engineering professional, not less. People like competing at things they're good at.
maxnadeau

You mentioned Cybench here. I think Cybench provides evidence against the claim "agents are already able to perform self-contained programming tasks that would take human experts multiple hours". AFAIK, the most up-to-date Cybench run is in the joint AISI o1 evals. In this study (see Table 4.1, and note the caption), all existing models (other than o3, which was not evaluated here) succeed on 0/10 attempts at almost all the Cybench tasks that take >40 minutes for humans to complete. 

elifland
I believe Cybench first solve times are based on the fastest top professional teams, rather than typical individual CTF competitors or cyber employees, for which the time to complete would probably be much higher (especially for the latter).

(I work at Open Phil on TAIS grantmaking)


I agree with most of this. A lot of our TAIS grantmaking over the last year was to evals grants solicited through this RFP. But I want to make a few points of clarification:

  1. Not all the grants that have been or will be approved in 2024 are on our website. For starters, there are still two months left in the year. But also, there are some grants that have been approved but haven't been put on the website yet. So $28 million is a modest underestimate, and it isn't directly comparable to the 2022/2023 numbers.
  2. I agree t
... (read more)
maxnadeau

Edit: Let me know if you or someone you know is interested in working on this sort of research. I work at Open Phil and we're interested in getting a better sense of how much demand there is for funding in this area.

Good post.  I agree with the headline that faithful CoT is a promising path forward for interpreting and supervising LLMs. 

TL;DR: I'd suggest researchers focus more narrowly on a subset of unfaithful chain-of-thought: steganographic/encoded reasoning that the model is using to perform hidden serial reasoning, as studied in Roger et al... (read more)

maxnadeau

I got a bit lost in understanding your exit plan. You write

My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them 

Some questions about this and the text that comes after it:

  1. How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you're describing here. What techniques are you imagining using?
  2. Why do these AIs need to be hum
... (read more)

How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you're describing here. What techniques are you imagining using?

I would say a mixture of moonshots and "doing huge amounts of science". Honestly, we don't have amazing proposals here, so the main plan is to just do huge amounts of R&D with our AIs. I have some specific proposals, but they aren't amazing.

I agree this is unsatisfying, though w... (read more)

ryan_greenblatt
It seems like I didn't do a good job of explaining the exit plan! I'll need to do a better job of explaining this in the future. (I'll respond to some of these specific points in a bit.)
maxnadeau

In the past, you've treated catching an AI escape attempt as a "win condition" for a control protocol, because such an attempt leads you to either 1) shut down your AI or 2) do few-shot catastrophe prevention.  You wrote that having these two tools would "substantially" change the situation w.r.t. the model's chance of success. 

But the arguments in this post suggest that catching an attempt may not change the overall strategic landscape/level of AI risk much. 

Is this a change in your views? Or have you always felt that catching an AI trying ... (read more)

Buck
  • I'm somewhat more pessimistic than I was in the past about the prospects for catch-then-shut-down. In particular, I think you might need to catch many escape attempts before you can make a strong case for shutting down. (For concreteness, I mostly imagine situations where we need to catch the model trying to escape 30 times.)
  • I am still optimistic about few-shot catastrophe prevention; this post is just about whether you can shut down, not whether you can take advantage of the escapes to improve safety (if you try to).
  • As you note, one difference between thi
... (read more)

Thanks, edited my post to reference this (lmk if you understand what's happening here better than I do)

maxnadeau

Thanks, this is a helpful comment. Fixed the typo.

maxnadeau

Edit: The situation has evolved but is still somewhat confusing. There is now a leaderboard of scores on the public test set that Ryan is #1 on (see here). But this tweet from Jack Cole indicates that his (many-month-old) solution gets a higher score on the public test set than Ryan's top score on that leaderboard. I'm not really sure what's going on here:

  • Why isn't Jack's solution on the public leaderboard?
  • Is the semi-public test set the same as the old private set?
  • If not, is it equal in difficulty to the public test set, or the harder private test set
... (read more)
Buck
See https://x.com/GregKamradt/status/1806372523170533457 for a somewhat confusing update
arabaga
I agree that there is a good chance that this solution is not actually SOTA, and that it is important to distinguish the three sets. There's a further distinction between 3 guesses per problem (which is allowed according to the original specification as Ryan notes), and 2 guesses per problem (which is currently what the leaderboard tracks [rules]).

Some additional comments / minor corrections:

AFAICT, the current SOTA-on-the-private-test-set with 3 submissions per problem is 37%, and that solution scores 54% on the public eval set. The SOTA-on-the-public-eval-set is at least 60% (see thread).

I think this is a typo and you mean the opposite.

From looking into this a bit, it seems pretty clear that the public eval set and the private test set are not IID. They're "intended" to be the "same" difficulty, but AFAICT this essentially just means that they both consist of problems that are feasible for humans to solve. It's not the case that a fixed set of eval/test problems were created and then randomly distributed between the public eval set and private test set. At your link, Chollet says "the [private] test set was created last" and the problems in it are "more unique and more diverse" than the public eval set. He confirms that here:

Bottom line: I would expect Ryan's solution to score significantly lower than 50% on the private test set.

I endorse this comment for the record.

I'm considering editing the blog post to clarify.

If I had known that prior work got a wildly different score on the public test set (comparable to the score I get), I wouldn't have claimed SOTA.

(That said, as you note, it seems reasonably likely (though unclear) that this prior solution was overfit to the test set while my solution is not.)

[comment deleted]
maxnadeau

What are the considerations around whether to structure the debate to permit the judge to abstain (as Michael et al. do, by allowing the judge to end the round with low credence) versus forcing the judge to pick an answer each time? Are there pros/cons to each approach? Any arguments about similarity of one or the other to the real AI debates that might be held in the future?
 

It's possible I'm misremembering/misunderstanding the protocols used for the debate here/in that other paper.

Ansh Radhakrishnan
I think allowing the judge to abstain is a reasonable addition to the protocol -- we mainly didn't do this for simplicity, but it's something we're likely to incorporate in future work.

The main reason you might want to give the judge this option is that it makes it harder still for a dishonest debater to come out ahead, since (ideally) the judge will only rule in favor of the dishonest debater if the honest debater fails to rebut the dishonest debater's arguments, the dishonest debater's arguments are ruled sufficient by the judge, and the honest debater's arguments are ruled insufficient by the judge. Of course, this also makes the honest debater's job significantly harder, but I think we're fine with that to some degree, insofar as we believe that the honest debater has a built-in advantage anyway (which is sort of a foundational assumption of Debate).

It's also not clear that this is necessary though, since we're primarily viewing Debate as a protocol for low-stakes alignment, where we care about average-case performance, in which case this kind of "graceful failure" seems less important.
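For concreteness, here is a minimal sketch of the two judging rules being contrasted, a forced-choice judge versus one who may end the round by abstaining when their credence stays close to 50/50. The names and the abstention threshold are purely illustrative, not anything from the paper:

```python
from typing import Literal

Verdict = Literal["A", "B", "abstain"]

def forced_choice_judge(credence_a: float) -> Verdict:
    """The judge must pick a side, however unsure they are."""
    return "A" if credence_a >= 0.5 else "B"

def abstaining_judge(credence_a: float, abstain_band: float = 0.1) -> Verdict:
    """The judge may end the round with no winner if their credence in A
    stays within `abstain_band` of 50% after seeing the debate."""
    if abs(credence_a - 0.5) <= abstain_band:
        return "abstain"
    return "A" if credence_a > 0.5 else "B"

# A judge left at 55% credence picks A under forced choice, but the
# round ends without a winner under the abstention rule:
assert forced_choice_judge(0.55) == "A"
assert abstaining_judge(0.55) == "abstain"
```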
maxnadeau

"Follow the right people on twitter" is probably the best option. People will often post twitter threads explaining new papers they put out. There's also stuff like:

Can you and others please reply with lists of people you find notable for their high signal-to-noise ratio, especially given Twitter's sharp decline in quality lately?

I appreciate you transcribing these interviews William!

Did/will this happen?

DragonGod
See my other shortform comments on optimisation. I did start the project, but it's currently paused. I have exams ongoing. I'll probably pick it up again, later.

I've been loving your optimization posts so far; thanks for writing them. I've been feeling confused about this topic for a while and feel like "being able to answer any question about optimization" would be hugely valuable for me.

We're expecting familiarity with PyTorch, unlike MLAB. The level of Python background expected is otherwise similar. The bar will vary somewhat depending on each applicant's other traits, e.g. mathematical and empirical-science backgrounds.

Video link in the pdf doesn't work

Jack Parker
Hi Max! That link will start working by the end of this weekend. Connall and I are putting the final touches on the video series, and once we're done the link will go live.

Confusion:

You write "Only PaLM looks better than Chinchilla here, mostly because it trained on 780B tokens instead of 300B or fewer, plus a small (!) boost from its larger size."

But earlier you write:

"Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.

It's 70B params, trained on 1.4T tokens of data"

300B vs. 1.4T. Is this an error?

Buck
I think that in that first sentence, OP is comparing PaLM to other large LMs rather than to Chinchilla.

Hmm, yeah, I phrased that point really badly.  I'll go back and rewrite it.

A clearer version of the sentence might read:

"Only PaLM is remotely close to Chinchilla here, mostly because it trained on a larger number of tokens than the other non-Chinchilla models, plus a small (!) boost from its larger size."

For instance, if you look at the loss improvement from Gopher to PaLM, 85% of it comes from the increase in data alone, and only 15% from the increase in model size.  This is what I meant when I said that PaLM only got a "small" boost from its larger size.
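To make the data-vs-size decomposition concrete, here is a rough back-of-the-envelope sketch using the Chinchilla loss fit L(N, D) = E + A/N^α + B/D^β with the approximate fitted constants reported in Hoffmann et al. (2022), and rough parameter/token counts for Gopher and PaLM. The exact percentages depend on which fitted constants you plug in, so treat this as illustrative rather than exact:

```python
# Chinchilla scaling fit: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the (approximate) fitted values from Hoffmann et al. 2022.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Rough parameter / training-token counts.
gopher_n, gopher_d = 280e9, 300e9
palm_n, palm_d = 540e9, 780e9

print(f"predicted loss: Gopher {loss(gopher_n, gopher_d):.3f}, "
      f"PaLM {loss(palm_n, palm_d):.3f}")

# Split the predicted Gopher -> PaLM loss improvement into the part from
# more parameters and the part from more data.
param_gain = A / gopher_n**ALPHA - A / palm_n**ALPHA
data_gain = B / gopher_d**BETA - B / palm_d**BETA
total = param_gain + data_gain

print(f"share of improvement from data:   {data_gain / total:.0%}")   # ~85%
print(f"share of improvement from params: {param_gain / total:.0%}")  # ~15%
```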

EDIT: rewrote and expanded this part of the post.

I agree with your description of the hassle of eating veg when away from home. The point I was trying to make is that buying hunted meat seems possibly ethically preferable to veganism on animal welfare grounds, would address Richard's nutritional concerns, and would also satisfy meat cravings. 

Of course, this only works if you condition on the brutality of nature as the counterfactual. But for the time being, that won't change.

I was thinking yesterday that I'm surprised more EAs don't hunt or eat lots of mail-ordered hunted meat, like e.g. this. Regardless of whether you think nature should exist in the long term, as it stands the average deer, for example, has a pretty harsh life and death. Studies like this on American white-tailed deer enumerate the alternative modes of death, which I find universally unappealing. You've got predation (which, surprisingly to me, is the number one cause of death for fawns), car accidents, disease, and starvation. These all seem orders ... (read more)

gbear605
In my experience, the hardest part about not eating meat is eating outside the house, either at restaurants or at social events. In restaurants it can be hard to find good options that don't include meat (especially if you're also avoiding animal products), and at social events the host has to go out of their way to accommodate you unless they are already planning for it.

Eating hunted meat doesn't help with either of those situations. Theoretically there could be a hunted meat restaurant or a social event where the host serves hunted meat, but both of those would be difficult from a verifiability standpoint ("is this really hunted meat, or are they just saying that?").

I do definitely agree that hunted meat could be a good option for people who still want to have meat but are okay with not having it all of the time and are willing to deal with the hassle.

Some people buy meat from farms that raise their animals ethically. That has basically all of the same benefits and drawbacks for animal welfare concerns, but it doesn't help with climate emissions, which I think hunted meat would help with.