All of Evan R. Murphy's Comments + Replies

"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.

Is there a summary of this post?

2GeneSmith
https://x.com/GeneSmi96946389/status/1892721828625264928

2023: AI governance starting to look promising because governments are waking up about AI risks. Technical AI safety getting challenging if you're not in a frontier lab because hard to access relevant models to run experiments.

2025: AI governance looking bleak after the AI Action Summit. Technical AI safety looking more accessible because open-weight models are proliferating.

3Evan R. Murphy
"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.

It might.

My understanding (which could be off base) from reading the paper is that the method's accuracy in detecting various forms of deception was basically 96-99%. But they acknowledge that the sophisticated deception they're ultimately worried about will be harder to detect.

Still 96-99% seems like a great start. And this was on detecting strategic deception, not just factual falsehoods. And they didn't even utilize the CoT outputs of the models.

(I think the "strategic deception" framing is also probably more general and not as dependent on unnecessary assumptions about how models work, compared to the "mesaoptimizer" framing.)

I also suspect training a specific dataset for "was your last response indicative of your maximum performance on the task?" would significantly boost the accuracy in the sandbagging case for this type of probe. I see this becoming more important as we move to more realistic sandbagging settings without explicit instructions to perform poorly since "deceiving someone" and "not trying very hard to explore this area of research" seem like non-trivially different concepts.

Good point!

Y'all are on fire recently with this and the alignment faking paper.

Thanks for the useful write-up on RepE.

RE might find application in Eliciting Latent Knowledge, like identifying what a model internally believes to be true.

Application to ELK is exciting. I was surprised that you used the word "might", because it looked like Zou et al. (2023) had already built a lie and hallucination detector using RepE. What do you see as left to be done in this area to elicit model beliefs with RepE?

Taking a closer look, I did find this in section 4.3.2 of the paper, acknowledging some limitations:

While these observations enhance our co

... (read more)

Agree, I'm surprised that a model which can reason about its own training process wouldn't also reason that the "secret scratchpad" might actually be surveilled and so avoid recording any controversial thoughts there. But it's lucky for us that some of these models have been willing to write interesting things on the scratchpad at least at current capability levels and below, because Anthropic has sure produced some interesting results from it (IIRC they used the scratchpad technique in at least one other paper).

  • Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible (e.g. CoT) reasoning


Don't you think CoT seems quite flawed right now? From https://arxiv.org/abs/2305.04388: "Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods."

Thanks for your great paper on alignment faking, by the way.

Exposing the weaknesses of fine-tuned models like the Llama 3.1 Instruct models against refusal vector ablation is important, because the industry currently seems to rely too heavily on these safety techniques.

It's worth noting, though, that refusal vector ablation isn't even necessary for this sort of malicious use with Llama 3.1, because Meta also released the base pretrained models without instruction finetuning (unless I'm misunderstanding something?).

Saw that you have an actual paper on this out now. Didn't see it linked in the post so here's a clickable for anyone else looking: https://arxiv.org/abs/2410.10871 .

3Simon Lermen
Hi Evan, I published this paper on arxiv recently and it also got accepted at the SafeGenAI workshop at NeurIPS in December this year. Thanks for adding the link; I will probably work on the paper again and put an updated version on arxiv, as I am not quite happy with the current version. I think that using the base model without instruction fine-tuning would prove bothersome for multiple reasons: 1. In the paper I use the new 3.1 models, which are fine-tuned for tool use; the base models were never fine-tuned to use tools through function calling. 2. Base models are highly random and hard to control; they are not really steerable and require very careful prompting/conditioning to do anything useful. 3. Current post-training basically improves all benchmarks. I am also working on using such agents and directly evaluating how good they are at spear phishing humans: https://openreview.net/forum?id=VRD8Km1I4x

Thanks for working on this. In case anyone else is looking for a paper on this, I found https://arxiv.org/abs/2410.10871 from the OP which looks like a similar but more up-to-date investigation on Llama 3.1 70B.

I only see bad options, a choice between an EU-style regime and doing essentially nothing.

What issues do you have with the EU approach? (I assume you mean the EU AI Act.)

Thoughtful/informative post overall, thanks.

Wow this seems like a really important breakthrough.

Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?

Thanks, I think you're referring to:

It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)

1Sheikh Abdur Raheem Ali
Appreciate you getting back to me. I was aware of this paper already and have previously worked with one of the authors.

Really fascinating post, thanks.

On green as according to black, I think there's an additional facet perhaps even more important than just the acknowledgment that sometimes we are too weak to succeed and so should conserve energy. Black, being strongly self-interested, will tend to cast aside virtues like generosity, honesty and non-harm except as means in social games they are playing to achieve other ends for themselves. But self-interest tends to include desire for reduction of self-suffering. Green + white* (as I'm realizing this may be more a color combo... (read more)

But in this case Patrick Collison is a credible source and he says otherwise.

Patrick Collison: These aren’t just cherrypicked demos. Devin is, in my experience, very impressive in practice

Patrick is an investor in Cognition. So while he may still be credible in this case, he also has a conflict of interest.

Reading that page, The Verge's claim seems to all hinge on this part:

OpenAI spokesperson Lindsey Held Bolton refuted that notion in a statement shared with The Verge: "Mira told employees what the media reports were about but she did not comment on the accuracy of the information."

They are saying that Bolton "refuted" the notion about such a letter, but the quote from her that follows doesn't actually sound like a refutation. Hence the Verge piece seems confusing/misleading, and I haven't yet seen any credible denial from the board about receiving such a letter.

9gwern
Further evidence: the OA official announcement from Altman today about returning to the status quo ante bellum and Toner's official resignation tweets all make no mention or hints of Q* (in addition to the complete radio silence about Q* since the original Reuters report). Toner's tweet, in particular: See also https://twitter.com/sama/status/1730032994474475554 https://twitter.com/sama/status/1730033079975366839 and the below Verge article where again, the blame is all placed on governance & 'communication breakdown' and the planned independent investigation is appealed to repeatedly. EDIT: Altman evaded comment on Q*, but did not deny its existence and mostly talked about how progress would surely continue. So I read this as evidence that something roughly like Q* may exist and they are optimistic about its long-term prospects, but there's no massive short-term implications, and it played minimal role in recent events - surely far less than the extraordinary level of heavy breathing online.

Yes, though I think he said this at APEC right before he was fired (not after).

4gwern
Yes, well before. Here is the original video: https://www.youtube.com/watch?v=ZFFvqRemDv8&t=815s

Carl, have you written somewhere about why you are confident that all UFOs so far are prosaic in nature? Would be interested to read/listen to your thoughts on this. (Alternatively, a link to some other source that you think gives a particularly compelling explanation is also good.)

7CarlShulman
No. Short version is that the prior for the combination of technologies and motives for aliens (and worse for magic, etc) is very low, and the evidence distribution is familiar from deep dives in multiple bogus fields (including parapsychology, imaginary social science phenomena, and others), with understandable data-generating processes so not much likelihood ratio.

Great update from Anthropic on giving majority control of the board to a financially disinterested trust: https://twitter.com/dylanmatt/status/1680924158572793856

Interesting... still taking that in.

Related question: Doesn't goal preservation typically imply self preservation? If I want to preserve my goal, and then I perish, I've failed because now my goal has been reassigned from X to nil.

2J. Dmitri Gallow
A quick prefatory note on how I'm thinking about 'goals' (I don't think it's relevant, but I'm not sure): as I'm modelling things, Sia's desires/goals are given by a function from ways the world could be (colloquially, 'worlds') to real numbers, D, with the interpretation that D(W) is how well satisfied Sia's desires are if W turns out to be the way the world actually is. By 'the world', I mean to include all of history, from the beginning to the end of time, and I mean to encompass every region of space. I assume that this function can be well-defined even for worlds in which Sia never existed or dies quickly. Humans can want to never have been born, and they can want to die. So I'm assuming that Sia can also have those kinds of desires, in principle. So her goal can be achieved even if she's not around. When I talk about 'goal preservation', I was talking about Sia not wanting to change her desires. I think you're right that that's different from Sia wanting to retain her desires.  If she dies, then she hasn't retained her desires, but neither has she changed them. The effect I found was that Sia is somewhat more likely to not want her desires changed.

Love to see an orthodoxy challenged!

Suppose Sia's only goal is to commit suicide, and she's given the opportunity to kill herself straightaway. Then, it certainly won't be rational for her to pursue self-preservation.

It seems you found one terminal goal which doesn't give rise to the instrumental subgoal of self-preservation. Are there others, or does basically every terminal goal benefit from instrumental self-preservation except for suicide?

(I skipped around a bit and didn't read your full post, so maybe you explain this already and I missed it.)

3J. Dmitri Gallow
There are infinitely many desires like that, in fact (that's what proposition 2 shows). More generally, take any self-preservation contingency plan, A, and any other contingency plan, B. If we start out uncertain about what Sia wants, then we should think her desires are just as likely to make A more rational than B as they are to make B more rational than A. (That's what proposition 3 shows.) That's rough and subject to a bunch of caveats, of course. I try to go through all of those caveats carefully in the draft.

But if there really is a large number of intelligence officials earnestly coming forward with this

Yea, according to Michael Shellenberger's reporting on this, multiple "high-ranking intelligence officials, former intelligence officials, or individuals who we could verify were involved in U.S. government UAP efforts for three or more decades each" have come forward to vouch for Grusch's core claims.

Perhaps this is genuine whistleblowing, but not on what they make it sound like? Suppose there's something being covered up that Grusch et al. want to expose, bu

... (read more)

What matters is the hundreds of pages and photos and hours of testimony given under oath to the Intelligence Community Inspector General and Congress.

Did Grusch already testify to Congress? I thought that was still being planned.

3awg
He's provided classified information to Congress already, yes. The intelligence committees in both houses, I believe. https://www.theguardian.com/world/2023/jun/06/whistleblower-ufo-alien-tech-spacecraft The one you linked is a new set of hearings planned by the House Oversight Committee.

Re: the tweet thread you linked to. One of the tweets is:

  1. Given that the DoD was effectively infiltrated for years by people "contracting" for the government while researching dino-beavers, there are now a ton of "insiders" who can "confirm" they heard the same outlandish rumors, leading to stuff like this: [references Michael Shellenberger]

Maybe, but this doesn't add up to me because Shellenberger said his sources had had multi-decade careers in the gov agencies. It didn't sound like they just started their careers as contractors in 2008-2... (read more)

Wow that's awfully indirect. I'm surprised his speaking out is much of a story given this.

2Evan R. Murphy
I guess the fact that this journalist says multiple other intelligence officials are anonymously vouching for Grusch's claims makes it interesting again: https://www.lesswrong.com/posts/bhH2BqF3fLTCwgjSs/michael-shellenberger-us-has-12-or-more-alien-spacecraft-say#comments

I don't know much about astronomy. But is it possible a more advanced alien civ has colonized much of the galaxy, but we haven't seen them because they anticipated the tech we would be using to make astronomical observations and know how to cloak from it?

3Dumbledore's Army
Possible yes, but if all advanced civs are highly prioritising stealth, that implies some version of the Dark Forest theory, which is terrifying.
3GeneSmith
Wow. Ok, I guess my odds that this is actually an alien spacecraft went up a little bit. It's interesting that at the end they quote a NASA official who stated that they haven't found any evidence of extraterrestrial life yet, directly contradicting the whistleblower. That means either the evidence the whistleblower has isn't sufficient to convince the scientists at NASA or the DOD isn't sharing it with them.

I wasn't saying that there were only a few research directions that don't require frontier models period, just that there are only a few that don't require frontier models and still seem relevant/promising, at least assuming short timelines to AGI.

I am skeptical that agent foundations is still very promising or relevant in the present situation. I wouldn't want to shut down someone's research in this area if they were particularly passionate about it or considered themselves on the cusp of an important breakthrough. But I'm not sure it's wise to be spendin... (read more)

Thanks for reviewing it! Yea of course you can use it however you like!

Great idea, we need to make sure there are some submissions raising existential risks.

Deadline for the RFI: July 7, 2023 at 5:00pm ET

Would you agree with this summary of your post? I was interested in your post but I didn't see a summary and didn't have time to read the whole thing just now. So I generated this using a summarizer script I've been working on for articles that are longer than the context windows for gpt-3.5-turbo and gpt-4.

It's a pretty interesting thesis you have if this is right, but I wanted to check if you spotted any glaring errors:

In this article, the author examines the challenges of aligning artificial intelligence (AI) with deontological morality as a means to en

... (read more)
3William D'Alessandro
A little clunky, but not bad! It's a good representation of the overall structure if a little fuzzy on certain details. Thanks for trying this out. I should have included a summary at the start -- maybe I can adapt this one?

A couple of quick thoughts:

  • Very glad to see someone trying to provide more infrastructure and support for independent technical alignment researchers. Wishing you great success and looking forward to hearing how your project develops.
  • A lot of promising alignment research directions now seem to require access to cutting-edge models. A couple of ways you might deal with this could be:
    • Partner with AI labs to help get your researchers access to their models
    • Or focus on some of the few research directions such as mechanistic interpretability that still seem to be making useful progress on smaller, more accessible models
3Alexandra Bos
I'd be curious to hear from the people who pressed the disagreement button on Evan's remark:  what part of this do you disagree with or not recognize?
3Alexandra Bos
I was thinking about helping with infrastructure around access to large amounts of compute, but had not considered trying to help with access to cutting-edge models; I think it might be a very good suggestion. Thanks for sharing your thoughts!

We're working on a more thorough technical report.

Is the new Model evaluation for extreme risks paper the technical report you were referring to?

A few other possible terms to add to the brainstorm:

  • AI massive catastrophic risks
  • AI global catastrophic risks
  • AI catastrophic misalignment risks
  • AI catastrophic accident risks (paired with "AI catastrophic misuse risks")
  • AI weapons of mass destruction (WMDs) - Pro: a well-known term, Con: strongly connotes misuse so may be useful for that category but probably confusing to try and use for misalignment risks

As an aside, if you are located in Australia or New Zealand and would be interested in coordinating with me, please contact me through LessWrong on this account.

One potential source of leads for this might be the FLI Pause Giant AI Experiments open letter . I did a Ctrl+F search on there for "Australia" which had 50+ results and "New Zealand" which had 10+. So you might find some good people to connect with on there.

Upvoted. I think it's definitely worth pursuing well-thought out advocacy in countries besides US and China. Especially since this can be done in parallel with efforts in those countries.

A lot of people are working on the draft EU AI Act in Europe.

In Canada, parliament is considering Bill C-27 which may have a significant AI component. I do some work with an org called AIGS that is trying to help make that go well.

I'm glad to hear that some projects are underway in Australia and New Zealand and that you are pursuing this there!

Seems important; I'm guessing people are downvoting this because they consider it a possible infohazard.

Viliam119

Speaking for myself, I object to the Twitter format. (This is what the shortform is for.)

Also, some more context would be nice. For example, how often does it happen that a lab containing virus samples is captured by fighters? Once a decade? Once a week? I have no idea. Was this lab somehow exceptional?

This is just "hey, hey, something potentially interesting is happening, but you have to figure out what".

Post summary

I was interested in your post and noticed it didn't have a summary, so I generated one using a summarizer script I've been working on and iteratively improving:

Scaffolded Language Models (LLMs) have emerged as a new type of general-purpose natural language computer. With the advent of GPT-4, these systems have become viable at scale, wrapping a programmatic scaffold around an LLM core to achieve complex tasks. Scaffolded LLMs resemble the von-Neumann architecture, operating on natural language text rather than bits.

The LLM serves as the CPU, wh

... (read more)
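
For anyone curious how the script handles articles longer than the context window: the core idea is just chunk-then-summarize. Here's a minimal sketch of that approach (simplified, hypothetical code rather than the actual script), using the OpenAI Python client:

```python
# Minimal chunk-then-summarize sketch (illustrative, not the actual script).
# Assumes the pre-1.0 openai package and OPENAI_API_KEY set in the environment.
import openai

def summarize_text(text: str, model: str = "gpt-3.5-turbo") -> str:
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the following text concisely."},
            {"role": "user", "content": text},
        ],
    )
    return response["choices"][0]["message"]["content"]

def summarize_article(article: str, chunk_chars: int = 8000) -> str:
    # Split into chunks small enough for the context window, summarize each,
    # then have a stronger model summarize the concatenated chunk summaries.
    chunks = [article[i:i + chunk_chars] for i in range(0, len(article), chunk_chars)]
    chunk_summaries = [summarize_text(chunk) for chunk in chunks]
    return summarize_text("\n\n".join(chunk_summaries), model="gpt-4")
```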

Post summary (experimental)

Here's an alternative summary of your post, complementing your TL;DR and Overview. This is generated by my summarizer script utilizing gpt-3.5-turbo and gpt-4. (Feedback welcome!)

The article explores the potential of language model cognitive architectures (LMCAs) to enhance large language models (LLMs) and accelerate progress towards artificial general intelligence (AGI). LMCAs integrate and expand upon approaches from AutoGPT, HuggingGPT, Reflexion, and BabyAGI, adding goal-directed agency, executive function, episodic memory, a

... (read more)
4Seth Herd
Cool, thanks! I think this summary is impressive. I think it's missing a major point in the last paragraph: the immense upside of the natural language alignment and interpretability possible in LMCAs. However, that summary is in keeping with the bulk of what I wrote, and a human would be at risk of walking away with the same misunderstanding.

Less compressed summary

Here's a longer summary of your article generated by the latest version of my summarizer script:

In this article, Paul Colognese explores whether detecting and evaluating the objectives of advanced AI systems during training and deployment is sufficient to solve the alignment problem. The idealized approach presented in the article involves detecting all objectives/intentions of any system produced during the training process, evaluating whether the outcomes produced by a system pursuing a set of objectives will be good/bad/irreversib

... (read more)

My claim is that AI safety isn't part of the Chinese gestalt.

Stuart Russell claims that Xi Jinping has referred to the existential threat of AI to humanity [1].

[1] 5:52 of Russell's interview on Smerconish: https://www.cnn.com/videos/tech/2023/04/01/smr-experts-demand-pause-on-ai.cnn

Post summary (experimental)

I just found your post. I want to read it but didn't have time to dive into it thoroughly yet, so I put it into a summarizer script I've been working on that uses gpt-3.5-turbo and gpt-4 to summarize texts that exceed the context window length.

Here's the summary it came up with; let me know if anyone sees problems with it. If you're in a rush, you can use agree/disagree voting to signal whether you think this is overall a good summary or not:

The article examines a theoretical solution to the AI alignment problem, focusing on detect

... (read more)
3Paul Colognese
Interesting! Quick thought: I feel as though it over-compressed the post, compared to the summary I used. Perhaps you can tweak things to generate multiple summaries of varying lengths.

Post summary (auto-generated, experimental)

I am working on a summarizer script that uses gpt-3.5-turbo and gpt-4 to summarize longer articles (especially AI safety-related articles). Here's the summary it generated for the present post.

The article addresses the issue of self-unalignment in AI alignment, which arises from the inherent inconsistency and incoherence in human values and preferences. It delves into various proposed solutions, such as system boundary alignment, alignment with individual components, and alignment through whole-system representati

... (read more)

New summary that's 'less wrong' (but still experimental)

I've been working on improving the summarizer script. Here's the summary auto-generated by the latest version, using better prompts and fixing some bugs:

The author investigates a phenomenon in GPT language models where the prompt "petertodd" generates bizarre and disturbing outputs, varying across different models. The text documents experiments with GPT-3, including hallucinations, transpositions, and word associations. Interestingly, "petertodd" is associated with character names from the Japanese R

... (read more)

Great feedback, thanks! Looks like GPT-4 ran away with its imagination a bit. I'll try to fix that.

Post summary (experimental)

Here's an experimental summary of this post I generated using gpt-3.5-turbo and gpt-4:

This article discusses the 'petertodd' phenomenon in GPT language models, where the token prompts the models to generate disturbing and violent language. While the cause of the phenomenon remains unexplained, the article explores its implications, as language models become increasingly prevalent in society. The author provides examples of the language generated by the models when prompted with 'petertodd', which vary between models. The article

... (read more)

Seems to claim the post talks about things it doesn't ("as language models become more prevalent in society" narrative(??)), while also leaving out important nuance about what the post does talk about.

Upvoted for trying stuff, disagreement voted because the summary just ain't very good.

Answer by Evan R. Murphy30

Robert Miles has been making educational videos about AI existential risk and AI alignment for 4+ years. I've never spoken with him, but I'm sure he has learned a lot about how to communicate these ideas to a general audience in the process. I don't know that he has compiled his learnings on that anywhere, but it might be worth reaching out to him if you're looking to talk with someone who has experience with this.

Another resource - Vael Gates and Collin Burns shared some testing they did in Dec 2022 on outreach to ML researchers in What AI Safety Material... (read more)
