Mikhail Samin

My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha in Telegram). 

Humanity's future can be huge and awesome; losing it would mean our lightcone (and maybe the universe) losing most of its potential value.

My research is currently focused on AI governance and improving the understanding of AI and AI risks among stakeholders. On the technical side of AI notkilleveryoneism, I mostly have takes on what seems to me to be the obvious, shallow stuff; still, many AI safety researchers have told me our conversations improved their understanding of the alignment problem.

I believe a capacity for global regulation is necessary to mitigate the risks posed by future general AI systems. I'm happy to talk to policymakers and researchers about ensuring AI benefits society.

I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).

In the past, I launched the most-funded crowdfunding campaign in the history of Russia (it was to print HPMOR! We printed 21,000 copies = 63k books) and founded audd.io, which allowed me to donate >$100k to EA causes, including >$60k to MIRI.

[Less important: I've also started a project to translate 80,000 Hours, a career guide that helps people find a fulfilling career that does good, into Russian. The impact and the effectiveness aside, for a year, I was the head of the Russian Pastafarian Church: a movement claiming to be a parody religion, with 200,000 members in Russia at the time, trying to increase the separation between religious organisations and the state. I was a political activist and a human rights advocate. I studied relevant Russian and international law and wrote appeals that won cases against the Russian government in courts; I was able to protect people from unlawful police action. I co-founded the Moscow branch of the "Vesna" democratic movement, coordinated election observers in a Moscow district, wrote dissenting opinions for members of electoral commissions, helped Navalny's Anti-Corruption Foundation, helped Telegram with internet censorship circumvention, and participated in and organized protests and campaigns. The large-scale goal was to build a civil society and turn Russia into a democracy through nonviolent resistance. This goal wasn't achieved, but some of the more local campaigns were successful. That felt important and was also mostly fun, except for being detained by the police. I think it's likely the Russian authorities will imprison me if I ever visit Russia.]


Comments


Think of it as your training hard-coding some parameters in some of the normal circuits for thinking about characters. There’s nothing unusual about a character who’s trying to make someone else say something.

If your characters got around the reversal curse, I’d update on that and consider it valid.

But suppose, e.g., you train it to perform multiple roles with different tasks/behaviors: you use multiple names, without any optimization over outputting the names themselves, only fine-tuning on what comes after a name. I predict (these are not very confident predictions, but my intuitions point in that direction) that when you say a particular name, the model will say what it was trained for noticeably better than at random (although probably not as successfully as if you had trained an individual task without names, because training splits capacity between them), and that if you don't mention any names, the model will be less successful at saying which tasks it was trained on and might give an example of a single task instead of a list of all the tasks.
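(To make the setup concrete, here's a minimal sketch of the kind of fine-tuning data I have in mind. The persona names, tasks, and prompt/completion JSONL format are illustrative assumptions, not a specific existing experiment; the key property is that the loss is computed only on the completions, so there's no optimization over outputting the names themselves.)

```python
import json
import random

# Illustrative personas (hypothetical names and behaviors).
PERSONAS = {
    "Quill": "answers in rhyming couplets",
    "Atlas": "answers only with country capitals",
    "Nimbus": "answers with a weather metaphor",
}

def make_examples(n_per_persona: int = 100) -> list[dict]:
    """Build prompt/completion pairs where the persona name appears only in
    the prompt, so a trainer that masks the prompt computes loss only on
    'what comes after' the name."""
    examples = []
    for name, behavior in PERSONAS.items():
        for i in range(n_per_persona):
            prompt = f"You are {name}. Question {i}: say something.\n"
            # Stand-in completion exhibiting the persona's trained behavior.
            completion = f"[a reply that {behavior}] (example {i})"
            examples.append({"prompt": prompt, "completion": completion})
    random.shuffle(examples)
    return examples

def write_jsonl(path: str, examples: list[dict]) -> None:
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

if __name__ == "__main__":
    write_jsonl("personas_train.jsonl", make_examples())
    # Evaluation prompts for the predictions above:
    #   with a name:    "You are Atlas. What were you trained to do?"
    #   without a name: "What were you trained to do?"
```

The predictions above would then be tested by prompting the fine-tuned model with and without one of the names and asking what it was trained to do.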

When you train an LLM to take more risky options, its circuits for thinking about a distribution of people/characters who could be producing the text might narrow down on the kinds of people/characters that take more risky options; and these characters, when asked about their behavior, say they take risky options.

I’d bet that if you fine-tune an LLM to exhibit behavior that people/characters don’t exhibit in the original training data, it’ll be a lot less “self-aware” about that behavior.

  • Yep, we've also been sending the books to winners of national and international olympiads in biology and chemistry.
  • Sending these books to policy-/foreign policy-related students seems like a bad idea: too many risks involved (in Russia, this is a career path you often choose if you're not very value-aligned; for context, Russia has designated an "international LGBT movement" as an extremist organization).
  • If you know anyone with an understanding of the context who'd want to find more people to send the books to, let me know. LLM competitions, ML hackathons, etc. all might be good.
  • Ideally, we'd also want to then alignment-pill these people, but no one has the ball on this.

I think travel and accommodation for the winners of regional olympiads to the national one is provided by the olympiad organizers.

we have a verbal agreement that these materials will not be used in model training

Get that agreement in writing.

I am happy to bet 1:1 that OpenAI will refuse to agree in writing not to use the problems/answers for training.

You have done work that contributes to AI capabilities, and you have misled mathematicians who contributed to that work about its nature.

I’m confused. Are you perhaps missing some context/haven’t read the post?

Tl;dr: We have the email addresses of 1,500 unusually cool people who have copies of HPMOR (and other books) because we physically sent them copies after they filled out a form saying they wanted one.

Spam is bad (though I wouldn't classify it as defection against other groups). People have literally given us their email and physical addresses to receive stuff from us, including physical books. They're free to unsubscribe at any point.

I certainly prefer a world where groups that try to improve the world are allowed to make the case, to people who have filled out a form to receive some stuff from them and are vaguely ok with receiving more stuff, that helping them improve the world is a good idea. I do not understand why that would be defection.

huh?

I would want people who might meaningfully contribute to solving what's probably the most important problem humanity has ever faced to learn about it and, if they judge they want to work on it, to be enabled to work on it. I think it'd be a good use of resources to make capable people learn about the problem and show them they can help with it. Why does it scream "cult tactic" to you?

As AIs become super-human there’s a risk we do increasingly reward them for tricking us into thinking they’ve done a better job than they have


(some quick thoughts.) This is not where the risk stems from.

The risk is that as AIs become superhuman, they'll produce behavior that gets a high reward regardless of their goals, for instrumental reasons. In training, and until it has a chance to take over, a smart enough AI will be maximally nice to you, even if it's Clippy; so training won't distinguish between the goals of very capable AI systems: all of them will instrumentally achieve a high reward.

In other words, gradient descent will optimize for capably outputting behavior that gets rewarded; it doesn't care about the goals that give rise to that behavior. Furthermore, in training, while AI systems are not coherent enough agents, their fuzzy optimization targets are not indicative of optimization targets of a fully trained coherent agent (1, 2).

My view (and I expect it to be the view of many in the field) is that if an AI is capable enough to take over, its goals are likely to be random and not aligned with ours. (There isn't a literally zero chance of the goals being aligned, but it's fairly small, smaller than you'd get from goals picked at random, because there's a bias towards shorter representations; I won't argue for that here, though, and will just note that goals exactly opposite of aligned are approximately as likely as aligned goals.)

If an AI takes over, that won't be a noticeable update on its goals: I already expect them to be almost certainly misaligned; also, I don't expect the chance of a goal-directed aligned AI taking over to be that much lower.

The crux here is not that update but how easy alignment is. As Evan noted, if we live in one of the alignment-is-easy worlds, sure, if a (probably nice) AI takes over, this is much better than if a (probably not nice) human takes over. But if we live in one of the alignment-is-hard worlds, AI taking over just means that yep, AI companies continued the race for more capable AI systems, got one that was capable enough to take over, and it took over. Their misalignment and the death of all humans aren't an update from AI taking over; they're an update from the kind of world we live in.

(We already have empirical evidence that suggests this world is unlikely to be an alignment-is-easy one, as, e.g., current AI systems already exhibit what believers in alignment-is-hard have been predicting for goal-directed systems: they try to output behavior that gets high reward regardless of alignment between their goals and the reward function.)

Probably less efficient than other uses, and it leans in the direction of spamming people with these books. If they're everywhere, I might be less interested if someone offers to give them to me because I won a math competition.

It would be cool if someone organized that sort of thing (probably sending books to the cash prize winners, too).

For people who reached the finals of the national olympiad in cybersecurity but didn't win, a volunteer made a small CTF puzzle and sent the books to students who solved it.
