All of Evan R. Murphy's Comments + Replies

"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.

Is there a summary of this post?

2GeneSmith
https://x.com/GeneSmi96946389/status/1892721828625264928

2023: AI governance starting to look promising because governments are waking up about AI risks. Technical AI safety getting challenging if you're not in a frontier lab because hard to access relevant models to run experiments.

2025: AI governance looking bleak after the AI Action Summit. Technical AI safety looking more accessible because open-weight models are proliferating.

3Evan R. Murphy
"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.

It might.

My understanding (which could be off base) from reading the paper is that the method's accuracy in detecting various forms of deception was basically 96-99%. But they acknowledge that the sophisticated deception they're ultimately worried about will be harder to detect.

Still 96-99% seems like a great start. And this was on detecting strategic deception, not just factual falsehoods. And they didn't even utilize the CoT outputs of the models.

(I think the "strategic deception" framing is also probably more general and not as dependent on unnecessary assumptions about how models work, compared to the "mesaoptimizer" framing.)

I also suspect training a specific dataset for "was your last response indicative of your maximum performance on the task?" would significantly boost the accuracy in the sandbagging case for this type of probe. I see this becoming more important as we move to more realistic sandbagging settings without explicit instructions to perform poorly since "deceiving someone" and "not trying very hard to explore this area of research" seem like non-trivially different concepts.

Good point!

Y'all are on fire recently with this and the alignment faking paper.

Thanks for the useful write-up on RepE.

RE might find application in Eliciting Latent Knowledge, like identifying what a model internally believes to be true.

Application to ELK is exciting. I was surprised that you used the word "might", because it looked like Zou et al. (2023) had already built a lie and hallucination detector using RepE. What do you see as left to be done in this area to elicit model beliefs with RepE?

Taking a closer look, I did find this in section 4.3.2 of the paper, acknowledging some limitations:

While these observations enhance our co

... (read more)

Agree, I'm surprised that a model which can reason about its own training process wouldn't also reason that the "secret scratchpad" might actually be surveilled and so avoid recording any controversial thoughts there. But it's lucky for us that some of these models have been willing to write interesting things on the scratchpad at least at current capability levels and below, because Anthropic has sure produced some interesting results from it (IIRC they used the scratchpad technique in at least one other paper).

  • Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible (e.g. CoT) reasoning


Don't you think CoT seems quite flawed right now? From https://arxiv.org/abs/2305.04388: "Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods."

Thanks for your great paper on alignment faking, by the way.

Exposing the weaknesses of fine-tuned models like the Llama 3.1 Instruct models against refusal vector ablation is important, because the industry currently seems to rely too heavily on these safety techniques.

It's worth noting, though, that refusal vector ablation isn't even necessary for this sort of malicious use with Llama 3.1, because Meta also released the base pretrained models without instruction finetuning (unless I'm misunderstanding something?).

Saw that you have an actual paper on this out now. Didn't see it linked in the post so here's a clickable for anyone else looking: https://arxiv.org/abs/2410.10871 .

3Simon Lermen
Hi Evan, I published this paper on arxiv recently and it also got accepted at the SafeGenAI workshop at NeurIPS in December this year. Thanks for adding the link; I will probably work on the paper again and put an updated version on arxiv, as I am not quite happy with the current version. I think that using the base model without instruction fine-tuning would prove bothersome for multiple reasons: 1. In the paper I use the new 3.1 models, which are fine-tuned for tool use; the base models were never fine-tuned to use tools through function calling. 2. Base models are highly random and hard to control; they are not really steerable and require very careful prompting/conditioning to do anything useful. 3. Current post-training basically improves all benchmarks. I am also working on using such agents and directly evaluating how good they are at spear phishing humans: https://openreview.net/forum?id=VRD8Km1I4x

Thanks for working on this. In case anyone else is looking for a paper on this, I found https://arxiv.org/abs/2410.10871 from the OP which looks like a similar but more up-to-date investigation on Llama 3.1 70B.

I only see bad options, a choice between an EU-style regime and doing essentially nothing.

What issues do you have with the EU approach? (I assume you mean the EU AI Act.)

Thoughtful/informative post overall, thanks.

Wow this seems like a really important breakthrough.

Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?

Thanks, I think you're referring to:

It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)

1Sheikh Abdur Raheem Ali
Appreciate you getting back to me. I was aware of this paper already and have previously worked with one of the authors.

Really fascinating post, thanks.

On green as according to black, I think there's an additional facet perhaps even more important than just the acknowledgment that sometimes we are too weak to succeed and so should conserve energy. Black, being strongly self-interested, will tend to cast aside virtues like generosity, honesty and non-harm except as means in social games they are playing to achieve other ends for themselves. But self-interest tends to include desire for reduction of self-suffering. Green + white* (as I'm realizing this may be more a color combo... (read more)

But in this case Patrick Collison is a credible source and he says otherwise.

Patrick Collison: These aren’t just cherrypicked demos. Devin is, in my experience, very impressive in practice

Patrick is an investor in Cognition. So while he may still be credible in this case, he also has a conflict of interest.

Reading that page, The Verge's claim seems to all hinge on this part:

OpenAI spokesperson Lindsey Held Bolton refuted that notion in a statement shared with The Verge: "Mira told employees what the media reports were about but she did not comment on the accuracy of the information."

They are saying that Bolton "refuted" the notion about such a letter, but the quote from her that follows doesn't actually sound like a refutation. Hence the Verge piece seems confusing/misleading, and I haven't yet seen any credible denial from the board about receiving such a letter.

9gwern
Further evidence: the OA official announcement from Altman today about returning to the status quo ante bellum and Toner's official resignation tweets all make no mention or hints of Q* (in addition to the complete radio silence about Q* since the original Reuters report). Toner's tweet, in particular: See also https://twitter.com/sama/status/1730032994474475554 https://twitter.com/sama/status/1730033079975366839 and the below Verge article where again, the blame is all placed on governance & 'communication breakdown' and the planned independent investigation is appealed to repeatedly. EDIT: Altman evaded comment on Q*, but did not deny its existence and mostly talked about how progress would surely continue. So I read this as evidence that something roughly like Q* may exist and they are optimistic about its long-term prospects, but there's no massive short-term implications, and it played minimal role in recent events - surely far less than the extraordinary level of heavy breathing online.

Yes, though I think he said this at APEC right before he was fired (not after).

4gwern
Yes, well before. Here is the original video: https://www.youtube.com/watch?v=ZFFvqRemDv8&t=815s

Carl, have you written somewhere about why you are confident that all UFOs so far are prosaic in nature? Would be interested to read/listen to your thoughts on this. (Alternatively, a link to some other source that you think gives a particularly compelling explanation is also good.)

7CarlShulman
No. Short version is that the prior for the combination of technologies and motives for aliens (and worse for magic, etc) is very low, and the evidence distribution is familiar from deep dives in multiple bogus fields (including parapsychology, imaginary social science phenomena, and others), with understandable data-generating processes so not much likelihood ratio.

Great update from Anthropic on giving majority control of the board to a financially disinterested trust: https://twitter.com/dylanmatt/status/1680924158572793856

Interesting... still taking that in.

Related question: Doesn't goal preservation typically imply self preservation? If I want to preserve my goal, and then I perish, I've failed because now my goal has been reassigned from X to nil.

2J. Dmitri Gallow
A quick prefatory note on how I'm thinking about 'goals' (I don't think it's relevant, but I'm not sure): as I'm modelling things, Sia's desires/goals are given by a function from ways the world could be (colloquially, 'worlds') to real numbers, D, with the interpretation that D(W) is how well satisfied Sia's desires are if W turns out to be the way the world actually is. By 'the world', I mean to include all of history, from the beginning to the end of time, and I mean to encompass every region of space. I assume that this function can be well-defined even for worlds in which Sia never existed or dies quickly. Humans can want to never have been born, and they can want to die. So I'm assuming that Sia can also have those kinds of desires, in principle. So her goal can be achieved even if she's not around. When I talk about 'goal preservation', I was talking about Sia not wanting to change her desires. I think you're right that that's different from Sia wanting to retain her desires.  If she dies, then she hasn't retained her desires, but neither has she changed them. The effect I found was that Sia is somewhat more likely to not want her desires changed.

Love to see an orthodoxy challenged!

Suppose Sia's only goal is to commit suicide, and she's given the opportunity to kill herself straightaway. Then, it certainly won't be rational for her to pursue self-preservation.

It seems you found one terminal goal which doesn't give rise to the instrumental subgoal of self-preservation. Are there others, or does basically every terminal goal benefit from instrumental self-preservation except for suicide?

(I skipped around a bit and didn't read your full post, so maybe you explain this already and I missed it.)

3J. Dmitri Gallow
There are infinitely many desires like that, in fact (that's what proposition 2 shows). More generally, take any self-preservation contingency plan, A, and any other contingency plan, B. If we start out uncertain about what Sia wants, then we should think her desires are just as likely to make A more rational than B as they are to make B more rational than A. (That's what proposition 3 shows.) That's rough and subject to a bunch of caveats, of course. I try to go through all of those caveats carefully in the draft.

But if there really is a large number of intelligence officials earnestly coming forward with this

Yea, according to Michael Shellenberger's reporting on this, multiple "high-ranking intelligence officials, former intelligence officials, or individuals who we could verify were involved in U.S. government UAP efforts for three or more decades each" have come forward to vouch for Grusch's core claims.

Perhaps this is genuine whistleblowing, but not on what they make it sound like? Suppose there's something being covered up that Grusch et al. want to expose, bu

... (read more)

What matters is the hundreds of pages and photos and hours of testimony given under oath to the Intelligence Community Inspector General and Congress.

Did Grusch already testify to Congress? I thought that was still being planned.

3awg
He's provided classified information to Congress already, yes. The intelligence committees in both houses, I believe. https://www.theguardian.com/world/2023/jun/06/whistleblower-ufo-alien-tech-spacecraft The one you linked is a new set of hearings planned by the House Oversight Committee.

Re: the tweet thread you linked to. One of the tweets is:

  1. Given that the DoD was effectively infiltrated for years by people "contracting" for the government while researching dino-beavers, there are now a ton of "insiders" who can "confirm" they heard the same outlandish rumors, leading to stuff like this: [references Michael Shellenberger]

Maybe, but this doesn't add up to me because Shellenberger said his sources had had multi-decade careers in the gov agencies. It didn't sound like they just started their careers as contractors in 2008-2... (read more)

Wow that's awfully indirect. I'm surprised his speaking out is much of a story given this.

2Evan R. Murphy
I guess the fact that this journalist says multiple other intelligence officials are anonymously vouching for Grusch's claims makes it interesting again: https://www.lesswrong.com/posts/bhH2BqF3fLTCwgjSs/michael-shellenberger-us-has-12-or-more-alien-spacecraft-say#comments

I don't know much about astronomy. But is it possible a more advanced alien civ has colonized much of the galaxy, but we haven't seen them because they anticipated the tech we would be using to make astronomical observations and know how to cloak from it?

3Dumbledore's Army
Possible yes, but if all advanced civs are highly prioritising stealth, that implies some version of the Dark Forest theory, which is terrifying.
3GeneSmith
Wow. Ok, I guess my odds that this is actually an alien spacecraft went up a little bit. It's interesting that at the end they quote a NASA official who stated that they haven't found any evidence of extraterrestrial life yet, directly contradicting the whistleblower. That means either the evidence the whistleblower has isn't sufficient to convince the scientists at NASA or the DOD isn't sharing it with them.

I wasn't saying that there were only a few research directions that don't require frontier models period, just that there are only a few that don't require frontier models and still seem relevant/promising, at least assuming short timelines to AGI.

I am skeptical that agent foundations is still very promising or relevant in the present situation. I wouldn't want to shut down someone's research in this area if they were particularly passionate about it or considered themselves on the cusp of an important breakthrough. But I'm not sure it's wise to be spendin... (read more)

Thanks for reviewing it! Yea of course you can use it however you like!

Great idea, we need to make sure there are some submissions raising existential risks.

Deadline for the RFI: July 7, 2023 at 5:00pm ET

Would you agree with this summary of your post? I was interested in your post but I didn't see a summary and didn't have time to read the whole thing just now. So I generated this using a summarizer script I've been working on for articles that are longer than the context windows for gpt-3.5-turbo and gpt-4.

It's a pretty interesting thesis you have if this is right, but I wanted to check if you spotted any glaring errors:

In this article, the author examines the challenges of aligning artificial intelligence (AI) with deontological morality as a means to en

... (read more)
3William D'Alessandro
A little clunky, but not bad! It's a good representation of the overall structure if a little fuzzy on certain details. Thanks for trying this out. I should have included a summary at the start -- maybe I can adapt this one?

A couple of quick thoughts:

  • Very glad to see someone trying to provide more infrastructure and support for independent technical alignment researchers. Wishing you great success and looking forward to hearing how your project develops.
  • A lot of promising alignment research directions now seem to require access to cutting-edge models. A couple of ways you might deal with this could be:
    • Partner with AI labs to help get your researchers access to their models
    • Or focus on some of the few research directions such as mechanistic interpretability that still seem to be making useful progress on smaller, more accessible models
3Alexandra Bos
I'd be curious to hear from the people who pressed the disagreement button on Evan's remark:  what part of this do you disagree with or not recognize?
3Alexandra Bos
I was thinking about helping with infrastructure around access to large amounts of compute, but had not considered trying to help with access to cutting-edge models; I think it might be a very good suggestion. Thanks for sharing your thoughts!

We're working on a more thorough technical report.

Is the new Model evaluation for extreme risks paper the technical report you were referring to?

A few other possible terms to add to the brainstorm:

  • AI massive catastrophic risks
  • AI global catastrophic risks
  • AI catastrophic misalignment risks
  • AI catastrophic accident risks (paired with "AI catastrophic misuse risks")
  • AI weapons of mass destruction (WMDs) - Pro: a well-known term, Con: strongly connotes misuse so may be useful for that category but probably confusing to try and use for misalignment risks

As an aside, if you are located in Australia or New Zealand and would be interested in coordinating with me, please contact me through LessWrong on this account.

One potential source of leads for this might be the FLI Pause Giant AI Experiments open letter . I did a Ctrl+F search on there for "Australia" which had 50+ results and "New Zealand" which had 10+. So you might find some good people to connect with on there.

Upvoted. I think it's definitely worth pursuing well-thought out advocacy in countries besides US and China. Especially since this can be done in parallel with efforts in those countries.

A lot of people are working on the draft EU AI Act in Europe.

In Canada, parliament is considering Bill C-27 which may have a significant AI component. I do some work with an org called AIGS that is trying to help make that go well.

I'm glad to hear that some projects are underway in Australia and New Zealand and that you are pursuing this there!

Seems important; I'm guessing people are downvoting this because they consider it a possible infohazard.

Viliam119

Speaking for myself, I object to the Twitter format. (This is what the shortform is for.)

Also, some more context would be nice. For example, how often does it happen that a lab containing virus samples is captured by fighters? Once a decade? Once a week? I have no idea. Was this lab somehow exceptional?

This is just "hey, hey, something potentially interesting is happening, but you have to figure out what".

Post summary

I was interested in your post and noticed it didn't have a summary, so I generated one using a summarizer script I've been working on and iteratively improving:

Scaffolded Language Models (LLMs) have emerged as a new type of general-purpose natural language computer. With the advent of GPT-4, these systems have become viable at scale, wrapping a programmatic scaffold around an LLM core to achieve complex tasks. Scaffolded LLMs resemble the von-Neumann architecture, operating on natural language text rather than bits.

The LLM serves as the CPU, wh

... (read more)
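
For anyone curious how the script handles articles longer than the context window: the core idea is just chunk-then-summarize. Here's a minimal sketch of that approach (simplified, hypothetical code rather than the actual script), using the OpenAI Python client:

```python
# Minimal chunk-then-summarize sketch (illustrative, not the actual script).
# Assumes the pre-1.0 openai package and OPENAI_API_KEY set in the environment.
import openai

def summarize_text(text: str, model: str = "gpt-3.5-turbo") -> str:
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the following text concisely."},
            {"role": "user", "content": text},
        ],
    )
    return response["choices"][0]["message"]["content"]

def summarize_article(article: str, chunk_chars: int = 8000) -> str:
    # Split into chunks small enough for the context window, summarize each,
    # then have a stronger model summarize the concatenated chunk summaries.
    chunks = [article[i:i + chunk_chars] for i in range(0, len(article), chunk_chars)]
    chunk_summaries = [summarize_text(chunk) for chunk in chunks]
    return summarize_text("\n\n".join(chunk_summaries), model="gpt-4")
```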

Post summary (experimental)

Here's an alternative summary of your post, complementing your TL;DR and Overview. This is generated by my summarizer script utilizing gpt-3.5-turbo and gpt-4. (Feedback welcome!)

The article explores the potential of language model cognitive architectures (LMCAs) to enhance large language models (LLMs) and accelerate progress towards artificial general intelligence (AGI). LMCAs integrate and expand upon approaches from AutoGPT, HuggingGPT, Reflexion, and BabyAGI, adding goal-directed agency, executive function, episodic memory, a

... (read more)
4Seth Herd
Cool, thanks! I think this summary is impressive. I think it's missing a major point in the last paragraph: the immense upside of the natural language alignment and interpretability possible in LMCAs. However, that summary is in keeping with the bulk of what I wrote, and a human would be at risk of walking away with the same misunderstanding.

Less compressed summary

Here's a longer summary of your article generated by the latest version of my summarizer script:

In this article, Paul Colognese explores whether detecting and evaluating the objectives of advanced AI systems during training and deployment is sufficient to solve the alignment problem. The idealized approach presented in the article involves detecting all objectives/intentions of any system produced during the training process, evaluating whether the outcomes produced by a system pursuing a set of objectives will be good/bad/irreversib

... (read more)

My claim is that AI safety isn't part of the Chinese gestalt.

Stuart Russell claims that Xi Jinping has referred to the existential threat of AI to humanity [1].

[1] 5:52 of Russell's interview on Smerconish: https://www.cnn.com/videos/tech/2023/04/01/smr-experts-demand-pause-on-ai.cnn

Post summary (experimental)

I just found your post. I want to read it but didn't have time to dive into it thoroughly yet, so I put it into a summarizer script I've been working on that uses gpt-3.5-turbo and gpt-4 to summarize texts that exceed the context window length.

Here's the summary it came up with; let me know if anyone sees problems with it. If you're in a rush, you can use agree/disagree voting to signal whether you think this is overall a good summary or not:

The article examines a theoretical solution to the AI alignment problem, focusing on detect

... (read more)
3Paul Colognese
Interesting! Quick thought: I feel as though it over-compressed the post, compared to the summary I used. Perhaps you can tweak things to generate multiple summaries of varying lengths.

Post summary (auto-generated, experimental)

I am working on a summarizer script that uses gpt-3.5-turbo and gpt-4 to summarize longer articles (especially AI safety-related articles). Here's the summary it generated for the present post.

The article addresses the issue of self-unalignment in AI alignment, which arises from the inherent inconsistency and incoherence in human values and preferences. It delves into various proposed solutions, such as system boundary alignment, alignment with individual components, and alignment through whole-system representati

... (read more)

New summary that's 'less wrong' (but still experimental)

I've been working on improving the summarizer script. Here's the summary auto-generated by the latest version, using better prompts and fixing some bugs:

The author investigates a phenomenon in GPT language models where the prompt "petertodd" generates bizarre and disturbing outputs, varying across different models. The text documents experiments with GPT-3, including hallucinations, transpositions, and word associations. Interestingly, "petertodd" is associated with character names from the Japanese R

... (read more)

Great feedback, thanks! Looks like GPT-4 ran away with its imagination a bit. I'll try to fix that.

Post summary (experimental)

Here's an experimental summary of this post I generated using gpt-3.5-turbo and gpt-4:

This article discusses the 'petertodd' phenomenon in GPT language models, where the token prompts the models to generate disturbing and violent language. While the cause of the phenomenon remains unexplained, the article explores its implications, as language models become increasingly prevalent in society. The author provides examples of the language generated by the models when prompted with 'petertodd', which vary between models. The article

... (read more)

Seems to claim the post talks about things it doesn't ("as language models become more prevalent in society" narrative(??)), while also leaving out important nuance about what the post does talk about.

Upvoted for trying stuff, disagreement voted because the summary just ain't very good.

Answer by Evan R. Murphy30

Robert Miles has been making educational videos about AI existential risk and AI alignment for 4+ years. I've never spoken with him, but I'm sure he has learned a lot about how to communicate these ideas to a general audience in the process. I don't know that he has compiled his learnings on that anywhere, but it might be worth reaching out to him if you're looking to talk with someone who has experience with this.

Another resource - Vael Gates and Collin Burns shared some testing they did in Dec 2022 on outreach to ML researchers in What AI Safety Material... (read more)
