My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha on Telegram).
Humanity's future can be enormous and awesome; losing it would mean our lightcone (and maybe the universe) losing most of its potential value.
I have takes on what seems to me to be the very obvious, shallow stuff about technical AI notkilleveryoneism; but many AI safety researchers have told me our conversations improved their understanding of the alignment problem.
I'm running two small nonprofits: AI Governance and Safety Institute and AI Safety and Governance Fund. Learn more about our results and donate: aisgf.us/fundraising
I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).
In the past, I launched the most-funded crowdfunding campaign in the history of Russia (it was to print HPMOR! We printed 21,000 copies = 63k books) and founded audd.io, which allowed me to donate >$100k to EA causes, including >$60k to MIRI.
[Less important: I've also started a project to translate 80,000 Hours, a career guide that helps people find a fulfilling career that does good, into Russian. The impact and the effectiveness aside, for a year I was the head of the Russian Pastafarian Church: a movement claiming to be a parody religion, with 200,000 members in Russia at the time, trying to increase the separation between religious organisations and the state. I was a political activist and a human rights advocate. I studied relevant Russian and international law and wrote appeals that won cases against the Russian government in courts; I was able to protect people from unlawful police action. I co-founded the Moscow branch of the "Vesna" democratic movement, coordinated election observers in a Moscow district, wrote dissenting opinions for members of electoral commissions, helped Navalny's Anti-Corruption Foundation, helped Telegram with internet censorship circumvention, and participated in and organized protests and campaigns. The large-scale goal was to build a civil society and turn Russia into a democracy through nonviolent resistance. That goal wasn't achieved, but some of the more local campaigns were successful. It felt important and was also mostly fun (except for being detained by the police). I think it's likely the Russian authorities would imprison me if I ever visited Russia.]
I expect that a bunch of this post is "spun" in uncharitable ways.
That is, I think of the post as primarily trying to do the social move of "lower trust in Anthropic" rather than the epistemic move of "try to figure out what's up with Anthropic". The latter would involve discussion of considerations like: sometimes lab leaders need to change their minds. To what extent are disparities in their statements and actions evidence of deceptiveness versus changing their minds? Etc. More generally, I think of good critiques as trying to identify standards of behavior that should be met, and comparing people or organizations to those standards, rather than just throwing accusations at them.
“I think a bunch of this comment is fairly uncharitable.”
The first was of Oliver Habryka. I feel pretty confident that this was a bad critique, which overstated its claims on the basis of pretty weak evidence.
I'm curious if this post was also (along with the Habryka critique) one of Mikhail's daily Inkhaven posts. If so it seems worth thinking about whether there are types of posts that should be written much more slowly, and which Inkhaven should therefore discourage from being generated by the "ship something every day" process.
For reference, the other person I've drawn the most similar conclusion about was Alexey Guzey (e.g. of his critiques here, here, and in some internal OpenAI docs). I notice that he and Mikhail are both Russian. I do have some sympathy for the idea that in Russia it's very appropriate to assume a lot of bad faith from power structures, and I wonder if that's a generator for these critiques.
“That is, I think of the comment as primarily trying to do the social move of “lower trust in what Mikhail says” rather than the epistemic move of “figure out what’s up with Mikhail”. The latter would involve considerations like: to what extent are disparities between your state of knowledge and Mikhail’s other posts evidence of being uncharitable vs. having different sets of information and trying to share the information? Etc. More generally, I think of good critiques as trying to identify standards of behavior that should be met, and comparing people to those standards, rather than just throwing accusations at them.”
I’d much rather the discussion was about the facts and not about people or conversational norms.
Sometimes, conclusions don’t need to be particularly nuanced. Sometimes, a system is built of many parts, and yet a valid, non-misleading description of that system as a whole is that it is untrustworthy.
___
I dislike some of this discussion happening in the comments of this post, as I’d like the comments to focus on the facts and the inferences, not on meta. If I’m getting any details wrong, or presenting anything specific particularly uncharitably, please say that directly. The rest of this comment is only tangentially related to the post and to what I want to talk about here, but I’m not going to moderate comments on this post, and it seems good to leave a reply.
(I previously shared some of this with Richard in DMs, and he made a few edits in response; I’m thankful for them.)
___
I want to say that I find it unfortunate that someone is engaging with the post on the basis of who wrote it, on the basis of unrelated content or my cultural origin, or by speculating about the context behind my having posted it.
I attempted to add a lot on top of the bare facts of this post, because for someone at Anthropic who is very convinced that all the individual facts have explanations full of details, it is not a natural move to look at a lot of them together and consider in which worlds they would be more likely. A lot of the post is aimed at getting someone who would really want to join or continue to work at Anthropic to actually ask themselves the questions and make a serious attempt at answering them, without writing the bottom line first.
Earlier in the process, a very experienced blogger told me, when talking about this post, that maybe I should’ve titled it “Anthropic: A Wolf in Sheep’s Clothing”. I think it would’ve been a better match for the contents than “untrustworthy”, but I decided to go with a weaker and less poetic title that increased the chance of people making the mental move I really want them to make and, if successful, might incentivize the leadership of Anthropic to improve and become more trustworthy.
But I relate to this particular post the way I would to journalistic work, with the same integrity and ethics.
If you think that any particular parts of the post unfairly attack Anthropic, please say that; if you’re right, I’ll edit them.
Truth is the only weapon that allows us to win, and I want our side to be known for being incredibly truthful.
___
Separately, I don't think my posts on Lightcone and Red Queen Bio are in a similar category to this post.
Both of those were fairly low-effort. The one on Oliver Habryka was basically intentionally so: I did not want to damage Lightcone beyond sharing the information with people who’d want to have it. Additionally, for over a month I did not want or plan to write it at all; but a housemate convinced me right before the start of Inkhaven that I should, and I did not want to use the skills I could gain from Lightcone against them. I don’t think it is a high-quality post. I stand by my accusations; I think what Oliver did is mean and regrettable, and there are people who would not want to coordinate with him or donate to Lightcone because of these facts, and I’m happy the information reached them (a few people reached out to me to explicitly say thanks for that).
The one on Red Queen Bio was written as a tweet once I saw the announcement. I was told about Red Queen Bio a few weeks before the announcement, and thought that what I heard was absolutely insane: an automated lab that works with OpenAI and plans to automate virus production. Once I saw the announcement, I wrote the tweet. The goal of the tweet was to make people pay attention to what I perceived as insanity; I knew nothing about its connection to this community when writing the tweet.
I did triple-check the contents of the tweet with the person who shared the information with me, but it was still a single source, and the tweet explicitly said “I learned of a rumor”.
(None of the information about doing anything automatically was public at that point, IIRC.)
The purpose of the tweet was to get answers (surely it is not the case that someone would automate a lab like that with AI!) and, if there weren’t any, to make people pay attention to it and potentially cause the government to intervene.
Instead of the important facts being denied, only a single unimportant one was (Hannu said they don’t work on phages but didn’t address any of the questions), and none of the important questions were answered (instead, a somewhat misleading reply was given). So after a while, I made a Substack post, and then posted it as a LW shortform, too (making little investment in the quality; just sharing information). I understand they might not want to give honest answers for PR reasons. I would’ve understood an answer that they cannot give details for security reasons but, e.g., are going to have a high BSL and are consulting with top security experts to make sure it’s impossible for a resourced attacker to use their equipment to do anything bad; but in fact, no answers were given. (DMing me “Our threat model is focused on state actors and we don’t want it to be publicly known; we’re going to have a BSL-n, we’re consulting with top people in cyber and bio, OpenAI’s model won’t have automated access to virus R&D/production; please don’t share this” would’ve likely caused me to delete the tweet.)
I think it’s still somewhat insane, and I have no reason on priors to expect appropriate levels of security in a lab funded by OpenAI; I really dislike the idea of, e.g., GPT-6 having tool access to print arbitrary RNA sequences. I don’t particularly think it lowered the standard of the discourse.
(As you can see from the reception of the shortform post and the tweet, many people are largely sympathetic to my view on this.)
I understand these people might be your friends; in the case of Hannu, I’d appreciate it if they could simply reply to the six yes/no questions, or state the reasons they don’t want to respond.
(My threat model is mostly around that access to software and a lab for developing viruses seems to help an AI in a loss of control scenario; + all the normal reasons why gain-of-function research is bad, and so pointing out the potential gain-of-function property seems sufficient.)
With my epistemic situation, do you think I was unfair to Red Queen Bio in my posts?
___
I dislike the idea of appealing to Inkhaven as a reason for a dismissive stance toward a post, or even treating it as a consideration.
I’ve posted many low-effort posts this month; it takes about half an hour to write something just to post something (sometimes an hour, like here; sometimes ~25 minutes, like here). Many of these were the result of me spending time talking to people about Anthropic (or spending time on other, more important things that had nothing to do with criticism of anyone) and not having time to write anything serious or important. It’s quite disappointing how little of importance I wrote this month, but referring to that fact as a reason to dismiss this post is an error. My friends heard me ask dozens of times this month for ideas for low-effort posts to make. But when I posted low-effort posts, I only posted them on my empty Substack, basically as drafts, to satisfy the technical condition of having written and published a post. There isn’t a single post that I made on LessWrong to satisfy the Inkhaven goal. (Many people can attest to me saying that I might spend a lot of December turning the unpolished posts I published on Substack into posts I’d want to publish on LessWrong.)
And this one is very much not one of my low-effort posts.
I somewhat expected it to be posted after the end of Inkhaven; the reason I posted it on November 28 was that the post was ready.
___
Most things I write about have nothing to do with criticizing others. I understand that these are the posts you happen to see; but I much more enjoy making posts about learning to constantly track cardinal directions or learning absolute pitch as an adult; about people who could’ve destroyed the world, but didn’t (even though some of them are not good people!).
I enjoy even more making posts that inspire others to make their lives more awesome, like my post about making a home smarter.
I also posted a short story about automating prisons, just to make a silly joke about jailbreaking.
(Both pieces of fiction I’ve ever written, I wrote at Inkhaven. The other one is published in a draft state; I’ll come back to it at some point, finish it, and post it on LessWrong. It’s about alignment-faking.)
Sometimes, I happen to be a person in a position of being able to share information that needs to be shared. I really dislike having to write posts about it when the information is critical of people. Some at Lighthaven can attest to my very sad reaction to their congratulations on this post: I’m sad that the world is such that the post exists, I don’t feel good about having written it, and I don’t like finding myself in a position where no one else is doing something that someone has to.
Thanks! I meant to say that the idea that Anthropic would hold firm in the face of pressure from investors is directly contradicted by the Amazon thing. Made the edit.
That’s somewhat reasonable. (They did engage, though: they made a number of comments and quote-tweeted my tweet, without addressing the main questions at all.)
I'm not interested in litigating an is-ought gap about whether "we" (human civilization?) "should" be facing such high risks from AI; obviously we're not in such an ideal world, and so discussions from that implicit starting point are imo useless
My post is about Anthropic being untrustworthy. If that were not the case, if Anthropic were clearly and publicly making the case for doing their work with full understanding of the consequences, if the leadership did not communicate contradictory positions to different people and was instead being honest and high-integrity, I could imagine a case being made for working at Anthropic on capabilities: to have a company that stays at the frontier, is able to get and publish evidence, and uses its resources to slow down everyone on the planet.
But we, instead, live in a world where multiple people showed me misleading personal messages from Jack Clark.
One should be careful not to galaxy-brain themselves into thinking that it’s fine for people to be low-integrity.
I don’t think the assumptions that you think I’m making are feeding into most of the post.
Several are from people who I know to have lied about Anthropic in the past
If you think I got any of the facts wrong, please do correct me on them. (You can reach out in private, and share information with me in private, and I will not share it further without permission.)
I continue to have written redlines which would cause me to quit in protest.
I appreciate you having done this.
Oops. Yep, it’s F in different octaves that you need to identify. Will add to the tutorial. Thanks!
Uhm, yeah, valid. I guess the issue was illusion of transparency: I mostly copied the original post from my tweet, which was quote-tweeting the announcement, and I didn’t particularly think about adding more context because I had it cached that the tweet was fine (I checked with people closely familiar with RQB before tweeting, and it included all of the context by virtue of quote-tweeting the original announcement). When posting to LW, I did not realize I wasn’t directly including all of the context that was in the tweet for people who don’t click on the link.
Added the context to the original post.
Separately, I think an issue is that they’re incredibly non-transparent about what they’re doing, have been somewhat misleading in their responses to my tweets, and haven’t answered any of the questions.
Like, I can see a case for doing gain-of-function research responsibly to develop protection against threats (vaccines, proteins that would bind to viruses, etc.), but this should include incredible transparency, strong security (BSL & computer security & strong guardrails around what exactly AI models have automated access to), etc.
I was corrected on this: according to them, they’re not working on phages specifically.
I didn't particularly present any publicly available evidence in my tweet. Someone close to Red Queen Bio confirmed that they have the equipment and are automating it here.
Somewhat valid, thanks; I added quotes with examples.