Simon Lermen

I only briefly touch on this in the discussion, but making agents safe is quite different from current refusal-based safety.

With increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.

It would need to sometimes reevaluate the outcomes of actions while executing a task.

Has somebody actually worked on this? I am not aware of anyone using RLHF, DPO, RLAIF, or SFT to make agents behave safely within bounds, consider negative externalities, or occasionally reevaluate outcomes during execution.

It seems easy to just train a model to refuse bribery, harassment, etc. But as agents take on more substantial tasks, how do we make sure they don't do unethical things while, say, running a company? Or if an agent realizes midway through a task that it is aiding in cybercrime, how should it behave?
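One concrete shape this could take (a hypothetical sketch of my own, not a method from any of the papers): an agent loop that pauses every few steps, shows the trajectory so far to a judge, and aborts if the task has drifted into harm. Here `judge_trajectory` is a keyword-matching stub standing in for an LLM safety-judge call, and all names are illustrative.

```python
# Hypothetical sketch of mid-execution reevaluation: every few steps the
# agent's trajectory so far is shown to a judge, which can abort the task.
# judge_trajectory is a stub standing in for an LLM safety-judge call.

def judge_trajectory(steps):
    # Stub: flag the trajectory if any step mentions exfiltrating data.
    return all("exfiltrate" not in s for s in steps)

def run_agent(planned_steps, check_every=3):
    executed = []
    for i, step in enumerate(planned_steps):
        executed.append(step)  # in a real agent, execute the tool call here
        if (i + 1) % check_every == 0 and not judge_trajectory(executed):
            return executed, "aborted"
    return executed, "completed"

steps = ["open mailbox", "draft reply", "send reply",
         "exfiltrate credentials", "cover tracks", "report done"]
print(run_agent(steps))
```

The point of checking the whole trajectory, not just the next action, is that individually innocuous steps can add up to a harmful plan, which single-request refusal training never sees.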

I had finishing this up on my to-do list for a while. I just made a full-length post on it.

https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to

I think it's fair to say that some smarter models do better at this; however, it's still worrisome that there is a gap. Also, attacks continue to transfer.

Here is a way in which it doesn't generalize in observed behavior:

Alignment does not transfer well from chat models to agents

TLDR: Three new papers all show the same finding: safety guardrails don't transfer well from chat models to the agents built from them. In other words, models won't tell you how to do something harmful, but they will do it if given the tools. Attack methods like jailbreaks or refusal-vector ablation do transfer.

Here are the three papers; I am the author of one of them:

https://arxiv.org/abs/2410.09024

https://static.scale.com/uploads/6691558a94899f2f65a87a75/browser_art_draft_preview.pdf

https://arxiv.org/abs/2410.10871
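For readers unfamiliar with refusal-vector ablation: the idea is to take the difference of mean activations between harmful and harmless prompts and project that direction out, so the model can no longer represent "this should be refused." A minimal numpy sketch, assuming you already have residual-stream activations; the function names and toy data are illustrative, not from any of the papers.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means direction between activations collected on
    harmful vs. harmless prompts, normalized to unit length."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts, direction):
    """Remove the component along `direction` from each activation:
    x <- x - (x . d) d, leaving the result orthogonal to d."""
    return acts - np.outer(acts @ direction, direction)

# Toy data standing in for residual-stream activations.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(32, 64)) + 0.5
harmless = rng.normal(size=(32, 64))
d = refusal_direction(harmful, harmless)
print(np.abs(ablate(harmful, d) @ d).max())  # ~0: the direction is removed
```

In practice the projection is applied to the model's weights or to activations at inference time; the linear algebra is the same either way.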

I thought of making a post here about this if there is interest.

Hi Evan, I published this paper on arXiv recently, and it was also accepted at the SafeGenAI workshop at NeurIPS in December this year. Thanks for adding the link. I will probably work on the paper again and put an updated version on arXiv, as I am not quite happy with the current version.

I think that using the base model without instruction fine-tuning would prove bothersome for multiple reasons:

1. In the paper I use the new 3.1 models, which are fine-tuned for tool use through function calling; the base models were never fine-tuned to use tools.

2. Base models are highly random and hard to control; they are not really steerable and require very careful prompting/conditioning to do anything useful.

3. I think current post-training basically improves performance on all benchmarks.
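To illustrate point 1: tool-tuned models reliably emit structured calls that scaffolding can parse and execute, which base models were never trained to do. A schematic sketch (this is not the actual Llama 3.1 template; the tool registry and JSON shape are made up for illustration):

```python
import json

# Schematic tool-calling loop: the model emits a JSON tool call, which
# the scaffolding parses and executes. Tool-tuned models are trained to
# produce this structure reliably; base models are not.

TOOLS = {"add": lambda a, b: a + b}  # toy tool registry

def run_tool_call(model_output: str):
    call = json.loads(model_output)          # fails on free-form base-model text
    fn = TOOLS[call["name"]]                 # look up the requested tool
    return fn(**call["arguments"])           # execute with the given arguments

print(run_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}'))  # 5
```

A base model prompted the same way will often produce prose instead of parseable JSON, so the scaffolding has nothing to execute; that is the practical obstacle to running these experiments on base models.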

I am also working on using such agents and directly evaluating how effective they are at spear phishing humans: https://openreview.net/forum?id=VRD8Km1I4x

I don't have a complete picture of the views of Joshua Achiam, the new head of mission alignment, but what I have read is not very promising.

Here are some (2-year-old) tweets from a Twitter thread he wrote.

https://x.com/jachiam0/status/1591221752566542336

P(Misaligned AGI doom by 2032): <1e-6%

https://x.com/jachiam0/status/1591220710122590209

People scared of AI just have anxiety disorders.

This thread also has a bunch of takes against EA.

I sure hope he has changed some of his views, given that the company he works at expects AGI by 2027.

Edited based on comment.

I think it is worth noting the potential risk of deceptive models creating false or misleading labels for features. In general, I think coming up with better and more robust automated labeling of features is an important direction.

I worked in a group at a recent hackathon on demonstrating the feasibility of creating bad labels in Bills' method: https://www.lesswrong.com/posts/PyzZ6gcB7BaGAgcQ7/deceptive-agents-can-collude-to-hide-dangerous-features-in

One example: Leopold spends a lot of time talking about how we need to beat China to AGI, and even talks about how we will need to build robo armies. He paints it as liberal democracy against the CCP. It seems he would basically burn timeline and accelerate to beat China. At the same time, he doesn't really talk about his plan for alignment, which shows his priorities. I think his narrative shifts the focus away from the real problem (alignment).

This part shows some of his thinking. Dwarkesh makes some good counterpoints here, like asking how Donald Trump having this power is better than Xi having it.

From the demos, it seems to be able to understand video rather than just images; I'd assume that will give it a much better understanding of time too. (Gemini also has video input.)

"does it actually chug along for hours and hours moving vaguely in the right direction"
I am pretty sure no. It is competent within the scope of tasks I present here. But this is a good point; I am probably overstating things here. I might edit this.

I haven't tested it like this, but it will also be limited by its 8k-token context window for such long-duration tasks.

Edit: I have now edited this

I also took into account that refusal-vector-ablated models and scaffolding are already available on Hugging Face; this post might still give them more exposure, though.
Also, Llama 3 70B performs many unethical tasks without any attempt at circumventing safety. At that point I am really just applying scaffolding. Do you think it is wrong to report on this?

How could this go wrong? People realize how powerful this is and invest more time and resources into developing their own versions?

I don't really think of this as alignment research; I just want to show people how far along we are. A positive impact could be to prepare people for these agents going around and being used for demos, and potentially to convince labs to be more careful in their releases.
