Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
There is now some work in that direction: https://forum.effectivealtruism.org/posts/47RH47AyLnHqCQRCD/soft-nationalization-how-the-us-government-will-control-ai
Sounds like a very successful hackathon! Nice work to everybody involved!
Some prompts I found interesting when brainstorming LLM startups
I spent a little time thinking about making an AI startup. I generally think it would be great if more people were trying to build useful companies that directly add value, rather than racing to build AGI. Here are some of the prompts I found interesting to think about; perhaps they will be useful to other people/AI agents interested in building a startup:
Some other problem attack surface opened up by LLMs:
Some other desiderata I had (for prompting LLMs):
Yep, I basically agree with those being the cruxes!
On how much of the goal reasoning is happening out loud: Nowadays, I think about a lot of AI safety research as being aimed at an AI Control scenario where we are closely supervising what models are thinking about in CoT, and thus malign goal-seeking must either happen in individual forward passes (and be translated to look harmless most of the time), happen in an obfuscated way in CoT, or come from rogue deployments outside the control scheme.
Studying the intrinsic goals of the models in naturalistic settings would also be interesting, but I expect this to be hard, and also less relevant to the threat models we're worried about. I'm mainly concerned about models becoming more goal-directed with scale and someone figuring out long-horizon RL training on top of that, leading to models that are qualitatively different from current ones.
By naturalistic, I mean "from a realistic training process, even if that training is designed to create goals", which sounds like it matches the main threat model you said you're worried about. If you have the time, I would push you harder on this: what is a specific story of AI catastrophe that you are trying to study/prevent?
Some quick thoughts (only skimmed the post, writing quickly), as you asked for feedback:
It looks like the main thing you're testing is some variant of "when prompted to do goal-directed behavior, how effective is the model at satisfying the goal?" That's a reasonable thing to investigate, but I'm not sure it would be near the top of the list of "empirical research on goal-directedness that I want to see". I'm probably mainly interested in the deceptive alignment motivation, so read the rest of this comment as focusing on that.
Aside: To state it directly, I think the main reason to study goal-directedness in this lower-validity setting (of giving models goals in prompts) is that CoT-based goal-directedness might act as a precursor for in-forward-pass goal directedness (which seems far more worrying re deceptive alignment) — so we can study it earlier. So again, reasonable to study, but if you agree with me that this is the main reason for such experiments being valid, it's an important frame to have when thinking about this kind of work: artificially inducing goal directedness is a model-organism approach rather than a natural experiment.
Thinking out loud, here is a list of goal-directedness work I want to see (sub-bullets are more detailed ideas):
It looks like the settings in this post are sorta a general capability eval for a model accomplishing goals. I wonder if you think they add a ton of value over existing agent benchmarks like SWE-Bench? My intuition says you would be better off focusing on a narrower question that is particularly relevant to safety, like one of those I mentioned.
Sorry if this comment was rude or mean. It's been a couple weeks and this post has no feedback even though you asked for it, so I figured something might be better than nothing. It looks to me like your overall approach and ways of thinking about this are good!
What's the evidence that this document is real / written by Anthropic?
This sentence seems particularly concerning:
We believe the first two issues can be addressed by focusing on deterrence rather than pre-harm enforcement: instead of deciding what measures companies should take to prevent catastrophes (which are still hypothetical and where the ecosystem is still iterating to determine best practices), focus the bill on holding companies responsible for causing actual catastrophes.
Nice work, these seem like interesting and useful results!
High level question/comment which might be totally off: one benefit of having a single, large, SAE neuron space that each token gets projected into is that features don't get in each other's way, except insofar as you're imposing sparsity. Like, your "I'm inside a parenthetical" and your "I'm attempting a coup" features will both activate in the SAE hidden layer, as long as they're in the top k features (for some sparsity). But introducing switch SAEs breaks that: if these two features are in different experts, only one of them will activate in the SAE hidden layer (based on whatever your gating learned).
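To make the worry concrete, here is a toy sketch (my own construction, not the authors' implementation; the weights, sizes, and routing rule are all made up) contrasting a plain top-k SAE, where any feature can fire as long as it is among the k largest pre-activations, with a switch SAE, where a router first picks one expert and only that expert's slice of features can activate:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feats, k = 64, 512, 8              # arbitrary toy sizes
n_experts = 4
feats_per_expert = n_feats // n_experts

W_enc = rng.normal(size=(d_model, n_feats))    # toy encoder weights, shared across both variants for comparison
W_router = rng.normal(size=(d_model, n_experts))

def topk_sae_codes(x):
    """Plain top-k SAE: every feature competes, so the 'parenthetical' and
    'coup' features can both fire as long as each is in the top k."""
    pre = x @ W_enc
    codes = np.zeros_like(pre)
    idx = np.argsort(pre)[-k:]
    codes[idx] = np.maximum(pre[idx], 0)
    return codes

def switch_sae_codes(x):
    """Switch SAE: the router picks a single expert, and only that expert's
    slice of features can activate; features living in other experts are
    silenced no matter how strongly they would have fired."""
    expert = int(np.argmax(x @ W_router))
    lo, hi = expert * feats_per_expert, (expert + 1) * feats_per_expert
    pre = x @ W_enc[:, lo:hi]
    codes = np.zeros(n_feats)
    idx = np.argsort(pre)[-k:]
    codes[lo + idx] = np.maximum(pre[idx], 0)
    return codes

x = rng.normal(size=d_model)                   # stand-in for a residual-stream activation
print(np.count_nonzero(topk_sae_codes(x)), np.count_nonzero(switch_sae_codes(x)))
```

In the top-k version, two features that live far apart in the dictionary can co-activate; in the switch version, whichever one lands outside the selected expert is zeroed out regardless of its pre-activation.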
The obvious reply is "but look at the empirical results, you fool! The switch SAEs are pretty good!" And that's fair. I weakly expect what is happening in your experiment is that similar but slightly specialized features are being learned by each expert (a testable hypothesis), and maybe you get enough of this redundancy that it's fine, e.g., the expert with "I'm inside a parenthetical" also has a "Words relevant to coups" feature and this is enough signal for coup detection in that expert.
Again, maybe this worry is totally off or I'm misunderstanding something.
Thanks for the addition, that all sounds about right to me!
Leaving Dangling Questions in your Critique is Bad Faith
Note: I’m trying to explain an argumentative move that I find annoying and sometimes make myself; this explanation isn’t very good, unfortunately.
Example
Them: This effective altruism thing seems really fraught. How can you even compare two interventions that are so different from one another?
Explanation of Example
I think the way the speaker poses the above question is not as a stepping stone for actually answering the question, it’s simply as a way to cast doubt on effective altruists. My response is basically, “wait, you’re just going to ask that question and then move on?! The answer really fucking matters! Lives are at stake! You are clearly so deeply unserious about the project of doing lots of good, such that you can pose these massively important questions and then spend less than 30 seconds trying to figure out the answer.” I think I might take these critics more seriously if they took themselves more seriously.
Description of Dangling Questions
A common move I see people make when arguing or criticizing something is to pose a question that they think the original thing has answered incorrectly or is not trying sufficiently hard to answer. But then they kinda just stop there. The implicit argument is something like “The original thing didn’t answer this question sufficiently, and answering this question sufficiently is necessary for the original thing to be right.”
But importantly, the criticisms usually don't actually argue that: they don't argue for some alternative answer to the original question (and when they do, the answer usually isn't compelling), and they don't really try to argue that this question is so fundamental either.
One issue with Dangling Questions is that they focus the subsequent conversation on a subtopic that may not be a crux for either party, and this probably makes the subsequent conversation less useful.
Example
Me: I think LLMs might scale to AGI.
Friend: I don’t think LLMs are actually doing planning, and that seems like a major bottleneck to them scaling to AGI.
Me: What do you mean by planning? How would you know if LLMs were doing it?
Friend: Uh…idk
Explanation of Example
I think I’m basically shifting the argumentative burden onto my friend when it falls on both of us. I don’t have a good definition of planning or a way to falsify whether LLMs can do it — and that’s a hole in my beliefs just as it is a hole in theirs. And sure, I’m somewhat interested in what they say in response, but I don’t expect them to actually give a satisfying answer here. I’m posing a question I have no intention of answering myself and implying it’s important for the overall claim of LLMs scaling to AGI (my friend said it was important for their beliefs, but I’m not sure it’s actually important for mine). That seems like a pretty epistemically lame thing to do.
Traits of “Dangling Questions”
Here's another example, though it's imperfect.
Example
From an AI Snake Oil blog post:
Research on scaling laws shows that as we increase model size, training compute, and dataset size, language models get “better”. … But this is a complete misinterpretation of scaling laws. What exactly is a “better” model? Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence. Of course, perplexity is more or less irrelevant to end users — what matters is “emergent abilities”, that is, models’ tendency to acquire new capabilities as size increases.
Explanation of Example
The argument being implied is something like "scaling laws are only about perplexity, but perplexity is different from the metric we actually care about (how much? who knows?), so you should ignore everything related to perplexity; also, consider going on a philosophical side-quest to figure out what 'better' really means. We think 'better' is about emergent abilities, and because they're emergent we can't predict them, so who knows if they will continue to appear as we scale up." In this case, the authors have ventured an answer to their Dangling Question, "what is a 'better' model?": they've said it's one with more emergent capabilities than a previous model. This answer seems flat-out wrong to me; acceptable answers include downstream performance, self-reported usefulness to users, how much labor-time it could save when integrated into various people's work, ability to automate 2022 job tasks, being more accurate on factual questions, and much more. I basically expect nobody to answer the question "what does it mean for one AI system to be better than another?" with "the second has more capabilities that were difficult to predict based on the performance of smaller models and that seem to increase suddenly on a linear-performance, log-compute plot".
Even given the answer "emergent abilities", the authors fail to actually argue that we don't have a scaling precedent for these. Again, I think the focus on emergent abilities is misdirected, so I'll instead discuss the relationship between perplexity and downstream benchmark performance. I think this is fair game both because downstream performance is a legitimate answer to the "what counts as 'better'?" question and because of the original line "Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence". That quote is technically true but, in this context, highly misleading, because we can in turn draw clear relationships between perplexity and downstream benchmark performance; here are three recent papers which do so, and here are even more studies that relate compute directly to downstream performance on non-perplexity metrics. Note that some of these are cited in the blog post. I will also note that this seems like one example of a failure I've seen a few times: people conflate "scaling laws" with what I would call "scaling trends". Scaling laws refer to specific equations that predict a metric like perplexity from model inputs such as parameter count and amount of data; scaling trends are the more general phenomenon we observe, that scaling up just seems to work, and in somewhat predictable ways. The scaling laws are useful for making predictions, but whether or not we have those specific equations has no effect on the trend we are observing; the equations just yield a bit more precision. Yes, scaling laws relating parameters and data to perplexity or training loss do not directly give you information about downstream performance, but we seem to be making decent progress on the (imo still not totally solved) problem of relating perplexity to downstream performance, and together these mean we have somewhat predictable scaling trends for metrics that do matter.
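To make the laws-versus-trends distinction concrete, the "specific equations" I have in mind look roughly like the Chinchilla-style parametric fit below (a sketch of the general form; the symbols are illustrative, not a claim about any particular paper's fitted constants):

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here L is predicted training loss (closely related to perplexity), N is parameter count, D is the number of training tokens, and E, A, B, \alpha, \beta are fitted constants. The scaling trend, by contrast, is just the observed regularity that scaling up keeps helping in somewhat predictable ways, whether or not you have such an equation in hand.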
Example
Here’s another example from that blog post where the authors don’t literally pose a question, but they are still doing the Dangling Question thing in many ways (for context, they are referring to these posts):
Also, like many AI boosters, he conflates benchmark performance with real-world usefulness.
Explanation of Example
(Perhaps it would be better to respond to the linked AI Snake Oil piece, but that’s a year old and lacks lots of important evidence we have now.) I view the move being made here as posing the question “but are benchmarks actually relevant to real-world impact?”, assuming the answer is no (or poorly arguing so in the linked piece), and going on about your day. It’s obviously the case that benchmarks are not the exact same as real-world usefulness, but the question of how closely they’re related isn’t some magic black box of un-solvability! If the authors of this critique want to complain about the conflation between benchmark performance and real-world usefulness, they should actually bring the receipts showing that these are not related constructs and that relying on benchmarks would lead us astray. I think when you actually try that, you get an answer like: benchmark scores seem worse than users’ reported experience and users’ reported usefulness in real-world applications, but there is certainly a positive correlation here; we can explain some of the gap via techniques like few-shot prompting that are often used for benchmarks, a small amount via dataset contamination, and probably much of it via a validity gap where benchmarks are easy to assess but unrealistic; thankfully we have user-based evaluations like LMSYS that show a solid correlation between benchmark scores and user experience; and so on. (If I actually wanted to make the argument the authors were making, I would spend more than five paragraphs on it and elaborate on all of the evidence mentioned above, including talking more about real-world impacts; this is actually a difficult question, and the above answer is demonstrative rather than exemplary.)
Caveats and Potential Solutions
There is room for questions in critiques. Perfect need not be the enemy of good when making a critique. Dangling Questions are not always made in bad faith.
Many of the people who pose Dangling Questions like this are not trying to act in bad faith. Sometimes they are just unserious about the overall question and don’t care much about getting to the right answer. Sometimes Dangling Questions are a response to being confused and not having tons of time to think through all the arguments, e.g., they’re a psychological response, something like “a lot feels wrong about this; here are some questions that hint at what feels wrong to me, but I can’t clearly articulate it all because that’s hard and I’m not going to put in the effort”.
My guess at a mental move which could help here: when you find yourself posing a question in the context of an argument, ask whether you care about the answer, ask whether you should spend a few minutes trying to determine the answer, ask whether the answer to this question would shift your beliefs about the overall argument, ask whether the question puts undue burden on your interlocutor.
If you’re thinking quickly and aren’t hoping to construct a super solid argument, it’s fine to have Dangling Questions, but if your goal is to convince others of your position, you should try to answer your key questions, and you should justify why they matter to the overall argument.
Another example of me posing a Dangling Question in this:
What happens to OpenAI if GPT-5 or the ~5b training run isn't much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired.
Explanation of Example
(I’m not sure equating GPT-5 with the ~5b training run is right.) In the above quote, I’m arguing against The Scaling Picture by asking whether anybody will keep investing money if we see only marginal gains after the next (public) compute jump. I think I spent very little time trying to answer this question, and that was lame (though acceptable given this was a Quick Take and not trying to be a strong argument). I think for an argument around this to actually go through, I should argue: without much larger dollar investments, The Scaling Picture won’t hold; and those dollar investments are unlikely conditional on GPT-5 not being much better than GPT-4. I won’t try to argue these in depth, but I do think some compelling evidence is that OpenAI is rumored to be at ~$3.5 billion annualized revenue, which plausibly justifies considerable investment even if the gain from GPT-5 over GPT-4 isn’t tremendous.
Noting that I spent a couple minutes pondering the quoted passage, which I don't think was a good use of time (I basically would have immediately dismissed it if I had known Claude wrote it, and I only thought about it because my prior on Buck saying true things is way higher), and I would have preferred that the text not include this.
I don't see anybody having mentioned it yet, but the recent paper about LLM Introspection seems pretty relevant. I would say that a model which performs very well at introspection (as defined there) would be able to effectively guess which jailbreak strategies were attempted.