On point 6, "Humanity can survive an unaligned superintelligence": In this section, I initially took you to be making a somewhat narrow point about humanity's safety conditional on us developing an aligned superintelligence and on humanity plus the aligned superintelligence having enough resources to out-innovate and out-prepare a misaligned superintelligence. But I can't tell whether you think this conditional will be true, i.e. whether you think this argument makes the existential risk to humanity from AI low. I infer from this tweet of yours that AI "kill[ing] us all" is not among...
To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right?
I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?
People interested in working on these sorts of problems should consider applying to Open Phil's request for proposals: https://www.openphilanthropy.org/request-for-proposals-technical-ai-safety-research/
I think the link in footnote two goes to the wrong place?
I haven't read the paper, but based only on the phrase you quote, I assume it's referring to hacks like the one shown here: https://arxiv.org/pdf/2210.10760#19=&page=19.0
Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical.
One (edit: minor) component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describing.
You mentioned CyBench here. I think CyBench provides evidence against the claim "agents are already able to perform self-contained programming tasks that would take human experts multiple hours". AFAIK, the most up-to-date CyBench run is in the joint AISI o1 evals. In this study (see Table 4.1, and note the caption), all existing models (other than o3, which was not evaluated here) succeed on 0/10 attempts at almost all the CyBench tasks that take >40 minutes for humans to complete.
(I work at Open Phil on TAIS grantmaking)
I agree with most of this. A lot of our TAIS grantmaking over the last year was to evals grants solicited through this RFP. But I want to make a few points of clarification:
Edit: Let me know if you or someone you know is interested in working on this sort of research. I work at Open Phil and we're interested in getting a better sense of how much demand there is for funding in this area.
Good post. I agree with the headline that faithful CoT is a promising path forward for interpreting and supervising LLMs.
TL;DR: I'd suggest researchers focus more narrowly on a subset of unfaithful chain-of-thought: steganographic/encoded reasoning that the model is using to perform hidden serial reasoning, as studied in Roger et al...
I got a bit lost in understanding your exit plan. You write
My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them
Some questions about this and the text that comes after it:
How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you're describing here. What techniques are you imagining using?
I would say a mixture of moonshots and "doing huge amounts of science". Honestly, we don't have amazing proposals here, so the main plan is to just do huge amounts of R&D with our AIs. I have some specific proposals, but they aren't amazing.
I agree this is unsatisfying, though w...
In the past, you've treated catching an AI escape attempt as a "win condition" for a control protocol, because such an attempt leads you to either 1) shut down your AI or 2) do few-shot catastrophe prevention. You wrote that having these two tools would "substantially" change the situation w.r.t. the model's chance of success.
But the arguments in this post suggest that catching an attempt may not change the overall strategic landscape/level of AI risk much.
Is this a change in your views? Or have you always felt that catching an AI trying ...
Thanks, edited my post to reference this (lmk if you understand what's happening here better than I do)
Thanks, this is a helpful comment. Fixed the typo
Edit: The situation has evolved but is still somewhat confusing. There is now a leaderboard of scores on the public test set that Ryan is #1 on (see here). But this tweet from Jack Cole indicates that his (many-month-old) solution gets a higher score on the public test set than Ryan's top score on that leaderboard. I'm not really sure what's going on here,
I endorse this comment for the record.
I'm considering editing the blog post to clarify.
If I had known that prior work got a wildly different score on the public test set (comparable to the score I get), I wouldn't have claimed SOTA.
(That said, as you note, it seems reasonably likely (though unclear) that this prior solution was overfit to the test set while my solution is not.)
What are the considerations around whether to structure the debate to permit the judge to abstain (as Michael et al. do, by allowing the judge to end the round with low credence) versus forcing the judge to pick an answer each time? Are there pros/cons to each approach? Any arguments about similarity of one or the other to the real AI debates that might be held in the future?
It's possible I'm misremembering/misunderstanding the protocols used for the debate here/in that other paper.
"Follow the right people on twitter" is probably the best option. People will often post twitter threads explaining new papers they put out. There's also stuff like:
Can you and others please reply with lists of people you find notable for their high signal-to-noise ratio, especially given twitter's sharp decline in quality lately?
I appreciate you transcribing these interviews William!
Did/will this happen?
I've been loving your optimization posts so far; thanks for writing them. I've been feeling confused about this topic for a while and feel like "being able to answer any question about optimization" would be hugely valuable for me.
We're expecting familiarity with PyTorch, unlike MLAB. The level of Python background expected is otherwise similar. The bar will vary somewhat depending on each applicant's other traits, e.g. mathematical and empirical-science backgrounds.
30 min, 45 min, 20-30 min (respectively)
Video link in the pdf doesn't work
Confusion:
You write "Only PaLM looks better than Chinchilla here, mostly because it trained on 780B tokens instead of 300B or fewer, plus a small (!) boost from its larger size."
But earlier you write:
"Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.
It's 70B params, trained on 1.4T tokens of data"
300B vs. 1.4T. Is this an error?
Hmm, yeah, I phrased that point really badly. I'll go back and rewrite it.
A clearer version of the sentence might read:
"Only PaLM is remotely close to Chinchilla here, mostly because it trained on a larger number of tokens than the other non-Chinchilla models, plus a small (!) boost from its larger size."
For instance, if you look at the loss improvement from Gopher to PaLM, 85% of it comes from the increase in data alone, and only 15% from the increase in model size. This is what I meant when I said that PaLM only got a "small" boost from its larger size.
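In case the arithmetic is useful to anyone: here's a rough sketch of where that 85%/15% split comes from, assuming the fitted loss curve L(N, D) = E + A/N^α + B/D^β with the constants reported in the Chinchilla paper, and the publicly reported parameter/token counts for Gopher and PaLM (treat the exact figures as approximate):

```python
# A rough check of the 85% / 15% split above, assuming the fitted Chinchilla
# scaling law L(N, D) = E + A / N**alpha + B / D**beta with the constants
# reported in the Chinchilla paper, and the published (approximate)
# parameter / token counts for Gopher and PaLM.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def param_term(n_params: float) -> float:
    return A / n_params ** alpha

def data_term(n_tokens: float) -> float:
    return B / n_tokens ** beta

gopher_n, gopher_d = 280e9, 300e9   # Gopher: ~280B params, ~300B tokens
palm_n, palm_d = 540e9, 780e9       # PaLM:   ~540B params, ~780B tokens

from_params = param_term(gopher_n) - param_term(palm_n)  # loss drop from extra params
from_data = data_term(gopher_d) - data_term(palm_d)      # loss drop from extra tokens
total = from_params + from_data

print(f"share from data:   {from_data / total:.0%}")    # ~85%
print(f"share from params: {from_params / total:.0%}")  # ~15%
```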
EDIT: rewrote and expanded this part of the post.
I agree with your description of the hassle of eating veg when away from home. The point I was trying to make is that buying hunted meat seems possibly ethically preferable to veganism on animal-welfare grounds, would address Richard's nutritional concerns, and would also satisfy meat cravings.
Of course, this only works if you condition on the brutality of nature as the counterfactual. But for the time being, that won't change.
I was thinking yesterday that I'm surprised more EAs don't hunt or eat lots of mail-ordered hunted meat, like e.g. this. Regardless of whether you think nature should exist in the long term, as it stands the average deer, for example, has a pretty harsh life and death. Studies like this on American white-tailed deer enumerate the alternative modes of death, which I find universally unappealing. You've got predation (which, surprisingly to me, is the number one cause of death for fawns), car accidents, disease, and starvation. These all seem orders ...
I figured out the encoding, but I expressed the algorithm for computing the decoding in different language than you did. My algorithm produces equivalent outputs but is substantially uglier. I wanted to leave a note here in case anyone else had the same solution.
Alt phrasing of the solution:
Each expression (i.e. a well-formed string of brackets) has a "degree", which is defined as the number of well-formed chunks that the encoding can be broken up into. Some examples: [], [[]], and [-[][]] have degree one; [][], -[][[]], and [][[][]] have degree two;...
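And a minimal sketch of the "degree" computation as I'm describing it, assuming '[' and ']' are the only structural characters and everything else (e.g. the minus signs above) is skipped:

```python
def degree(expr: str) -> int:
    """Count the top-level well-formed chunks in a bracket expression.

    Sketch only: assumes expr is well formed and that '[' / ']' are the
    only structural characters (anything else, e.g. '-', is ignored).
    """
    depth = 0
    chunks = 0
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                chunks += 1  # a top-level chunk just closed
    return chunks

# Matches the examples above:
assert degree("[]") == 1 and degree("[[]]") == 1 and degree("[-[][]]") == 1
assert degree("[][]") == 2 and degree("-[][[]]") == 2 and degree("[][[][]]") == 2
```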