Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
This is the second post in the sequence “Interpretability Research for the Most Important Century”. The first post, which introduces the sequence, defines several terms, and provides a comparison to existing works, can be found here: Introduction to the sequence: Interpretability Research for the Most Important Century.

Summary

This post explores the extent to which interpretability is relevant to the hardest, most important parts of the AI alignment problem (property #1 of High-leverage Alignment Research[1]).

First, I give an overview of the four important parts of the alignment problem (following Hubinger[2]): outer alignment, inner alignment, training competitiveness and performance competitiveness (jump to section).

Next, I discuss which of them is “hardest”, taking the position that it is inner alignment (if you have to pick just one), and also that it’s hard to find alignment proposals which simultaneously address all four parts well.

Then, I move on to exploring how interpretability could impact these four parts of alignment. My primary vehicle for this exploration is imagining and analyzing seven best-case scenarios for interpretability research (jump to section). Each of these scenarios represents a possible endgame story for technical alignment, hinging on one or more potential major breakthroughs in interpretability research. The scenarios’ impacts on alignment vary, but usually involve solving inner alignment to some degree and then indirectly benefiting outer alignment and performance competitiveness; impacts on training competitiveness are more mixed.

Finally, I discuss the likelihood that interpretability research could contribute to unknown solutions to the alignment problem (jump to section). This includes examining interpretability’s potential to lead to breakthroughs in our basic understanding of neural networks and AI, deconfusion research, and paths to solving alignment that are difficult to predict or otherwise not captured by the seven scenarios.
Actually, the OGI-1 model (and, to a lesser extent, the OGI-N model) does do something important to address loss-of-control risks from AGI (or ASI): it reduces competitive race dynamics.
There are plausible scenarios where it is technically possible for a lab to safely develop AGI, but where doing so would require it to slow down development. When a lab is competitively racing against other AGI projects, the incentive to proceed with risky development is (potentially much) stronger. But when a lab doesn't have to worry about competitors, it at least has an opportunity to pursue costly safety measures without sacrificing its lead.
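To make the incentive gap concrete, here is a minimal toy calculation, a sketch of my own rather than anything from the post: the payoff structure and every number in it are illustrative assumptions. It compares the expected payoff of a risky "fast" strategy versus a costly "slow" safety strategy, with and without a competitor in the race.

```python
# Toy model of racing incentives (illustrative sketch only; all numbers
# below are made-up assumptions, not claims from the post).
#
# A lab chooses between "fast" development and a costly "slow" safety
# approach. Winning yields a prize; winning with unsafe development can
# instead produce a catastrophe. Losing the race is scored as 0 here,
# which is a deliberate simplification.

def expected_value(p_win: float, p_safe: float,
                   prize: float = 1.0, catastrophe: float = -1.0) -> float:
    """Expected payoff: win with probability p_win; a win is worth
    `prize` if development stays safe (probability p_safe) and
    `catastrophe` otherwise."""
    return p_win * (p_safe * prize + (1 - p_safe) * catastrophe)

# Racing against a competitor: slowing down sharply cuts the chance of
# getting there first, so the risky "fast" option looks better.
race_fast = expected_value(p_win=0.6, p_safe=0.7)   # 0.24
race_slow = expected_value(p_win=0.2, p_safe=0.95)  # 0.18

# No serious competitor: the lab keeps its lead either way, so the
# safety slowdown costs little and the "slow" option dominates.
solo_fast = expected_value(p_win=1.0, p_safe=0.7)   # 0.40
solo_slow = expected_value(p_win=1.0, p_safe=0.95)  # 0.90

print(f"racing:  fast={race_fast:.2f}, slow={race_slow:.2f}")
print(f"no race: fast={solo_fast:.2f}, slow={solo_slow:.2f}")
```

Under these made-up numbers, the presence of a competitor flips the comparison: the risky fast strategy only looks attractive because slowing down sacrifices the lead, which is the dynamic the paragraph above describes.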