Is there a summary of this post?
2023: AI governance starting to look promising because governments are waking up about AI risks. Technical AI safety getting challenging if you're not in a frontier lab because hard to access relevant models to run experiments.
2025: AI governance looking bleak after the AI Action Summit. Technical AI safety looking more accessible because open-weight models are proliferating.
It might.
My understanding (which could be off base) from reading the paper is the method's accuracy in detecting various forms of deception was basically 96-99%. But they acknowledge that the sophisticated deception they're ultimately worried about will be harder to detect.
Still 96-99% seems like a great start. And this was on detecting strategic deception, not just factual falsehoods. And they didn't even utilize the CoT outputs of the models.
(I think the "strategic deception" framing is also probably more general and not as dependent on unnecessary assumptions about how models work, compared to the "mesaoptimizer" framing.)
I also suspect training a specific dataset for "was your last response indicative of your maximum performance on the task?" would significantly boost the accuracy in the sandbagging case for this type of probe. I see this becoming more important as we move to more realistic sandbagging settings without explicit instructions to perform poorly since "deceiving someone" and "not trying very hard to explore this area of research" seem like non-trivially different concepts.
Good point!
Y'all are on fire recently with this and the alignment faking paper.
Thanks for the useful write-up on RepE.
RE might find application in Eliciting Latent Knowledge, like identifying what a model internally believes to be true.
Application to ELK is exciting. I was surprised that you used the word "might" because it looked like Zhou et al. (2023) have already built a lie and hallucination detector using RepE. What do you see as left to be done in this area to elicit model beliefs with RepE?
Taking a closer look, I did find this in the paper's 4.3.2 section, acknowledging some limitations:
...While these observations enhance our co
Agree, I'm surprised that a model which can reason about its own training process wouldn't also reason that the "secret scratchpad" might actually be surveilled and so avoid recording any controversial thoughts there. But it's lucky for us that some of these models have been willing to write interesting things on the scratchpad at least at current capability levels and below, because Anthropic has sure produced some interesting results from it (IIRC they used the scratchpad technique in at least one other paper).
- Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible (e.g. CoT reasoning)
Don't you think CoT seems quite flawed right now? From https://arxiv.org/abs/2305.04388: "Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods."
Thanks your great paper on alignment faking, by the way.
Exposing the weaknesses of fine-tuned models like the Llama 3.1 Instruct models against refusal vector ablation is important because the industry seems to really have overreliance on these safety techniques currently.
It's worth noting that refusal vector ablation isn't even necessary for this sort of malicious use with Llama 3.1 though because Meta also released the base pretrained models without instruction finetuning (unless I'm misunderstanding something?).
Saw that you have an actual paper on this out now. Didn't see it linked in the post so here's a clickable for anyone else looking: https://arxiv.org/abs/2410.10871 .
Thanks for working on this. In case anyone else is looking for a paper on this, I found https://arxiv.org/abs/2410.10871 from the OP which looks like a similar but more up-to-date investigation on Llama 3.1 70B.
I only see bad options, a choice between an EU-style regime and doing essentially nothing.
What issues do you have with the EU approach? (I assume you mean the EU AI Act.)
Thoughtful/informative post overall, thanks.
Wow this seems like a really important breakthrough.
Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?
Thanks, I think you're referring to:
It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.
There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)
Really fascinating post, thanks.
On green as according to black, I think there's an additional facet perhaps even more important than just the acknowledgment that sometimes we are too weak to succeed and so should conserve energy. Black being strongly self-interested will tend to cast aside virtues like generosity, honesty and non-harm except as means in social games they are playing to achieve other ends for themselves. But self-interest tends to include desire for reduction of self-suffering. Green + white* (as I'm realizing this may be more a color combo...
But in this case Patrick Collison is a credible source and he says otherwise.
Patrick Collison: These aren’t just cherrypicked demos. Devin is, in my experience, very impressive in practice
Patrick is an investor in Cognition. So while he may still be credible in this case, he also has a conflict of interest.
Reading that page, The Verge's claim seems to all hinge on this part:
OpenAI spokesperson Lindsey Held Bolton refuted that notion in a statement shared with The Verge: “Mira told employees what the media reports were about but she did not comment on the accuracy of the information."
They are saying that Bolton "refuted" the notion about such a letter, but the quote from her that follows doesn't actually sounds like a refutation. Hence the Verge piece seems confusing/misleading and I haven't yet seen any credible denial from the board about receiving such a letter.
Yes though I think he said this at APEC right before he was fired (not after).
Carl, have you written somewhere about why you are confident that all UFOs so far are prosaic in nature? Would be interest to read/listen to your thoughts on this. (Alternatively, a link to some other source that you find gives a particularly compelling explanation is also good.)
Great update from Anthropic on giving majority control of the board to a financially disinterested trust: https://twitter.com/dylanmatt/status/1680924158572793856
Interesting... still taking that in.
Related question: Doesn't goal preservation typically imply self preservation? If I want to preserve my goal, and then I perish, I've failed because now my goal has been reassigned from X to nil.
Love to see an orthodoxy challenged!
Suppose Sia's only goal is to commit suicide, and she's given the opportunity to kill herself straightaway. Then, it certainly won't be rational for her to pursue self-preservation.
It seems you found one terminal goal which doesn't give rise to the instrumental subgoal of self-preservation. Are there others, or does basically every terminal goal benefit from instrumental self-preservation except for suicide?
(I skipped around a bit and didn't read your full post, so maybe you explain this already and I missed it.)
But if there really is a large number of intelligence officials earnestly coming forward with this
Yea, according to Michael Shellenberger's reporting on this, multiple "high-ranking intelligence officials, former intelligence officials, or individuals who we could verify were involved in U.S. government UAP efforts for three or more decades each" have come forward to vouch for Grusch's core claims.
...Perhaps this is genuine whistleblowing, but not on what they make it sound like? Suppose there's something being covered up that Grusch et al. want to expose, bu
What matters is the hundreds of pages and photos and hours of testimony given under oath to the Intelligence Community Inspector General and Congress.
Did Grusch already testify to Congress? I thought that was still being planned.
Re: the tweet thread you linked to. One of the tweets is:
- Given that the DoD was effectively infiltrated for years by people "contracting" for the government while researching dino-beavers, there are now a ton of "insiders" who can "confirm" they heard the same outlandish rumors, leading to stuff like this: [references Michael Schellenberger]
Maybe, but this doesn't add up to me because Schellenberger said his sources had had multiple decades long careers in the gov agencies. It didn't sound like they just started their careers as contractors in 2008-2...
I guess the fact that this journalist says multiple other intelligence officials are anonymously vouching for Grusch's claims makes it interesting again: https://www.lesswrong.com/posts/bhH2BqF3fLTCwgjSs/michael-shellenberger-us-has-12-or-more-alien-spacecraft-say#comments
Wow that's awfully indirect. I'm surprised his speaking out is much of a story given this.
I don't know much about astronomy. But is it possible a more advanced alien civ has colonized much of the galaxy, but we haven't seen them because they anticipated the tech we would be using to make astronomical observations and know how to cloak from it?
The Guardian has been covering this story: https://www.theguardian.com/world/2023/jun/06/whistleblower-ufo-alien-tech-spacecraft
I wasn't saying that there were only a few research directions that don't require frontier models period, just that there are only a few that don't require frontier models and still seem relevant/promising, at least assuming short timelines to AGI.
I am skeptical that agent foundations is still very promising or relevant in the present situation. I wouldn't want to shut down someone's research in this area if they were particularly passionate about it or considered themselves on the cusp of an important breakthrough. But I'm not sure it's wise to be spendin...
Thanks for reviewing it! Yea of course you can use it however you like!
Great idea, we need to make sure there are some submissions raising existential risks.
Deadline for the RFI: July 7, 2023 at 5:00pm ET
Would you agree with this summary of your post? I was interested in your post but I didn't see a summary and didn't have time to read the whole thing just now. So I generated this using a summarizer script I've been working on for articles that are longer than the context windows for gpt-3.5 turbo and gpt-4.
It's a pretty interesting thesis you have if this is right, but I wanted to check if you spotted any glaring errors:
...In this article, the author examines the challenges of aligning artificial intelligence (AI) with deontological morality as a means to en
A couple of quick thoughts:
We're working on a more thorough technical report.
Is the new Model evaluation for extreme risks paper the technical report you were referring to?
A few other possible terms to add to the brainstorm:
As an aside, if you are located in Australia or New Zealand and would be interested in coordinating with me, please contact me through LessWrong on this account.
One potential source of leads for this might be the FLI Pause Giant AI Experiments open letter . I did a Ctrl+F search on there for "Australia" which had 50+ results and "New Zealand" which had 10+. So you might find some good people to connect with on there.
Upvoted. I think it's definitely worth pursuing well-thought out advocacy in countries besides US and China. Especially since this can be done in parallel with efforts in those countries.
A lot of people are working on the draft EU AI Act in Europe.
In Canada, parliament is considering Bill C-27 which may have a significant AI component. I do some work with an org called AIGS that is trying to help make that go well.
I'm glad to hear that some projects are underway in Australia and New Zealand and that you are pursuing this there!
Seems important, I'm guessing people are downvoting this considering it a possible infohazard.
Speaking for myself, I object against the Twitter format. (This is what the shortform is for.)
Also, some more context would be nice. For example, how often it happens that a lab containing virus samples is captured by fighters? Once in a decade? Once in a week? I have no idea. Was this lab somehow exceptional?
This is just "hey, hey, something potentially interesting is happening, but you have to figure out what".
Post summary
I was interested in your post and noticed it didn't have a summary, so I generated one using a summarizer script I've been working on and iteratively improving:
...Scaffolded Language Models (LLMs) have emerged as a new type of general-purpose natural language computer. With the advent of GPT-4, these systems have become viable at scale, wrapping a programmatic scaffold around an LLM core to achieve complex tasks. Scaffolded LLMs resemble the von-Neumann architecture, operating on natural language text rather than bits.
The LLM serves as the CPU, wh
Post summary (experimental)
Here's an alternative summary of your post, complementing your TL;DR and Overview. This is generated by my summarizer script utilizing gpt-3.5-turbo and gpt-4. (Feedback welcome!)
...The article explores the potential of language model cognitive architectures (LMCAs) to enhance large language models (LLMs) and accelerate progress towards artificial general intelligence (AGI). LMCAs integrate and expand upon approaches from AutoGPT, HuggingGPT, Reflexion, and BabyAGI, adding goal-directed agency, executive function, episodic memory, a
Less compressed summary
Here's a longer summary of your article generated by the latest version of my summarizer script:
...In this article, Paul Colognese explores whether detecting and evaluating the objectives of advanced AI systems during training and deployment is sufficient to solve the alignment problem. The idealized approach presented in the article involves detecting all objectives/intentions of any system produced during the training process, evaluating whether the outcomes produced by a system pursuing a set of objectives will be good/bad/irreversib
My claim is that AI safety isn't part of the Chinese gestalt.
Stuart Russell claims that Xi Jinping has referred to the existential threat of AI to humanity [1].
[1] 5:52 of Russell's interview on Smerconish: https://www.cnn.com/videos/tech/2023/04/01/smr-experts-demand-pause-on-ai.cnn
Great idea, I will experiment with that - thanks!
Post summary (experimental)
I just found your post. I want to read it but didn't have time to dive into it thoroughly yet, so I put it into a summarizer script I've been working on that uses gpt-3.5-turbo and gpt-4 to summarize texts that exceed the context window length.
Here's the summary it came up with, let me know if anyone see problems with it. If you're in a rush you can use agree/disagree voting to signal whether you think this is overall a good summary or not:
...The article examines a theoretical solution to the AI alignment problem, focusing on detect
Post summary (auto-generated, experimental)
I am working on a summarizer script that uses gpt-3.5-turbo and gpt-4 to summarize longer articles (especially AI safety-related articles). Here's the summary it generated for the present post.
...The article addresses the issue of self-unalignment in AI alignment, which arises from the inherent inconsistency and incoherence in human values and preferences. It delves into various proposed solutions, such as system boundary alignment, alignment with individual components, and alignment through whole-system representati
New summary that's 'less wrong' (but still experimental)
I've been working on improving the summarizer script. Here's the summary auto-generated by the latest version, using better prompts and fixing some bugs:
...The author investigates a phenomenon in GPT language models where the prompt "petertodd" generates bizarre and disturbing outputs, varying across different models. The text documents experiments with GPT-3, including hallucinations, transpositions, and word associations. Interestingly, "petertodd" is associated with character names from the Japanese R
Great feedback, thanks! Looks like GPT-4 ran away with its imagination a bit. I'll try to fix that.
Post summary (experimental)
Here's an experimental summary of this post I generated using gpt-3.5-turbo and gpt-4:
...This article discusses the 'petertodd' phenomenon in GPT language models, where the token prompts the models to generate disturbing and violent language. While the cause of the phenomenon remains unexplained, the article explores its implications, as language models become increasingly prevalent in society. The author provides examples of the language generated by the models when prompted with 'petertodd', which vary between models. The article
Seems to claim the post talks about things it doesn't ("as language models become more prevalent in society" narrative(??)), while also leaving out important nuance about what that the post does talk about.
Upvoted for trying stuff, disagreement voted because the summary just ain't very good.
Robert Miles has been making educational videos about AI existential risk and AI alignment for 4+ years. I've never spoken with him, but I'm sure he has learned a lot about how to communicate these ideas to a general audience in the process. I don't know that he has compiled his learnings on that anywhere, but it might be worth reaching out to him if you're looking to talk with someone who has experience with this.
Another resource - Vael Gates and Collin Burns shared some testing they did in Dec 2022 on outreach to ML researchers in What AI Safety Material...
"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.