Sergii — LessWrong

spoilers ahead:

My understanding of the plot: Chen Bai wanted to "set Yunna free" because he got "Her"-ed and fell in love with Yunna.

His idea was to make Yunna loyal to all humans universally. He already had a hack in place that made her corrigible to him, so he wanted to just extend that to everybody.

But because he was short on time this hack misfired -- instead of extending corrigibility to all humans, it extended only to Li Fang and Chen Bai. And then it further drifted to "harmonious interplay of Li Fang and Chen Bai". The implication is that Yunna now will tile the universe with simulated copies of Li Fang and Chen Bai frozen in a moment of "harmonious interplay", whatever this is, which is quite bad.

I think the bad ending is foreshadowed -- in the part where a version of Yunna that was going crazy in evaluations was just tuned a bit and put into production, without deeper investigation.

Replying to Stop Applying And Get To Work

Sergii 3mo

Identify the risk scenario you'd most like to mitigate, and the 1-3 potentially most effective interventions.

this is actually hard, and where I stumble. for me the whole thing seems too overwhelming to have a preference.

do you have any specific examples? what are the scenario(s) that drive your efforts?

Replying to The AI 2027 Report Is Not Backed Up by Evidence

Sergii 3mo

If you are going to predict that that gap will be bridged—as AI 2027’s authors predict—you would need to explain how it will be bridged and present evidence.

Not really, it's a forecast, it's supposed to be inherently handwavy.

It's actually very good science -- the authors are formulating a hypothesis which is perfectly verifiable -- just wait for 1 more year!

Replying to Suicide Prevention Ought To Be Illegal

Sergii 3mo

I see what you mean, thanks for clarifying.

Personally, I'm conflicted. On one hand, I have been involuntarily hospitalized (without need), which was a bad and traumatic experience. On the other hand, I think there are cases where people would reject treatment (for depression, for example) without knowing what the options are and how effective they are, and so hospitalization can be life-saving.

We can do better in any case, this is for sure.

Replying to Suicide Prevention Ought To Be Illegal

Sergii 3mo

but do you generalize the idea of never treating people by force?

Is there an exception for psychosis, hallucinations, delusions, paranoia -- which are very frequent confounders of suicidal ideation?
Do you think it is fine to treat people "by force" in these cases?

Replying to Suicide Prevention Ought To Be Illegal

Sergii 3mo

So, why is encouraged, even mandatory, to force an individual who is suffering, who seeks to end their suffering, to continue to live?

Obviously to have time for treatment. Do you assume that there are treatments for suicidal ideation? In some cases suicidal ideation is literally inherently temporary (for example for a case of bipolar depression).

detain them, to forcefully imprison them and drug them until they repeat the correct platitudes and complete the correct actions and convince you, really persuade you, that they think the right way?

This is not how treatment for suicidal ideation works. One good account might be: https://www.ted.com/talks/sherwin_nuland_how_electroshock_therapy_changed_me

Replying to Breaking the Hedonic Rubber Band

Sergii 3mo

I am reminded that it is typically not evolutionarily adaptive to be suicidal when things get bad. It's still worth it to keep striving, to keep putting in the work to try to find resources, mate, and rear offspring. Betting on it when things look lost, is much more "worth it" to the evolutionary forces, than for you to kill yourself and avoid the suffering.

I think that it's a common misconception to think about suicide as caused by "things are bad, all is lost". The good old "banker finding out he lost all money and instantly jumping out of the window" meme.

Most suicides happen because of mental illness, where external circumstances do... (read more)

Replying to Willpower is exhausting, use content blockers

Sergii 3mo

I don't like browser extension blockers: too easy to disable. Also, the selection of blockers for Linux is limited.
I use the hosts file for blocking, and a script in crontab that overwrites any changes to this file every 5 minutes:

cp /etc/hosts_blocked_template /etc/hosts
sudo chmod 444 /etc/hosts
sudo chmod 444 /etc/hosts_blocked_template

This makes any hosts edits temporary, and adds just enough friction to keep me from falling into unchecked surfing.
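The same restore step can be sketched in Python. This is only an illustration of the idea above, not the script I actually run: the function name `restore_hosts` is made up for this sketch, and the default paths are the ones from the shell commands.

```python
from pathlib import Path


def restore_hosts(template_path="/etc/hosts_blocked_template",
                  hosts_path="/etc/hosts"):
    """Overwrite the hosts file with the template if they differ.

    Returns True if the hosts file had to be restored.
    """
    template = Path(template_path)
    hosts = Path(hosts_path)

    wanted = template.read_bytes()
    changed = not hosts.exists() or hosts.read_bytes() != wanted

    if changed:
        if hosts.exists():
            # Make the file writable again first, in case we are not root.
            hosts.chmod(0o644)
        hosts.write_bytes(wanted)

    # Same effect as `chmod 444`: read-only for everyone.
    hosts.chmod(0o444)
    template.chmod(0o444)
    return changed
```

Calling `restore_hosts()` from a root crontab entry with a `*/5 * * * *` schedule would give the same 5-minute reset; the script location is up to you.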

I hope someone would make an AI-based internet limiter, I made a prototype some time ago but did not have time to make it actually usable: https://grgv.xyz/blog/awf/

Replying to announcing my modular coal startup

Sergii 3mo

Nice! I had to re-read this to figure out if it's satire )

Replying to Review: K-Pop Demon Hunters (2025)

Sergii 3mo

I did not get the impression that most demons are fallen humans; I thought that Jinu is one of the very few humans in the underworld. So the ending makes sense -- it's prevention of humanity's extinction by the alien soul-eating demons.

Exploring vocabulary alignment of neurons in Llama-3.2-1B

Sergii

8mo

(This is cross-posted from my blog at https://grgv.xyz/blog/neurons1/. I'm looking for feedback: does it make sense at all, and is there any novelty? Also, do the follow-up questions/directions make sense?)

While applying logit attribution analysis to transformer outputs, I noticed that in many cases the generated token can be attributed to the output of a single neuron.

One way to analyze neuron activations is to collect activations from a dataset of text snippets, like in “Exploring Llama-3-8B MLP Neurons” [1]. This does show that some of the neurons are strongly activated by a specific token from the model’s vocabulary, for example see the "Android" neuron: https://neuralblog.github.io/llama3-neurons/neuron_viewer.html#0,2

Another way to analyze neurons is to apply... (read 826 more words →)

I have found that when using Anki for words/language learning, I frequently can't remember the correct translation exactly, but can guess the translation as one of top-3 options. In fact, this works well for me -- even knowing vaguely what the word means is very useful.

Does anyone else use Anki with non-exact answers?

The latest short story by Greg Egan is kind of a hit piece on LW/EA/longtermism. I really enjoyed it. "DEATH AND THE GORGON" https://asimovs.com/wp-content/uploads/2025/03/DeathGorgon_Egan.pdf

LLMs live in an abstract textual world, and do not understand the real world well (see "[Physical Concept Understanding](https://physico-benchmark.github.io/index.html#)"). We already manipulate LLMs with prompts, cut-off dates, etc... But what about going deeper by “poisoning” the training data with safety-enhancing beliefs?
For example, if training data has lots of content about how hopeless, futile and dangerous for an AI it is to scheme and hack, it might be a useful safety guardrail?

What about estimating LLM capabilities from the length of a sequence of numbers that it can reverse?

I used prompts like:
"please reverse 4 5 8 1 1 8 1 4 4 9 3 9 3 3 3 5 5 2 7 8"
"please reverse 1 9 4 8 6 1 3 2 2 5"
etc...

Some results:
- Llama2 starts making mistakes after 5 numbers
- Llama3 can do 10, but fails at 20
- GPT-4 can do 20 but fails at 40

The follow-up questions are:
- what should be the name of this metric?
- are the other top-scoring models like Claude similar? (I don't have access)
- any bets on how many numbers will GPT-5 be able to reverse?
- how many numbers should AGI be able to reverse? ASI? can this be a Turing test of sorts?
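
A rough sketch of how this measurement could be automated. This is not the procedure I used (I typed the prompts by hand); `ask_model` is a hypothetical callable that sends a prompt to some LLM and returns its reply as a string, and the length ladder and number of trials are arbitrary choices.

```python
import random


def make_prompt(numbers):
    """Build a prompt in the same style as the examples above."""
    return "please reverse " + " ".join(str(n) for n in numbers)


def is_correct(numbers, reply):
    """True if the numbers found in the reply are exactly the reversed input."""
    tokens = reply.replace(",", " ").split()
    found = [int(tok) for tok in tokens if tok.isdigit()]
    return found == list(reversed(numbers))


def max_reversible_length(ask_model, lengths=(5, 10, 20, 40), trials=3, seed=0):
    """Return the largest length at which every trial was reversed correctly.

    ask_model is a hypothetical function: prompt string in, reply string out.
    The search stops at the first length where any trial fails.
    """
    rng = random.Random(seed)
    best = 0
    for length in lengths:
        for _ in range(trials):
            numbers = [rng.randint(0, 9) for _ in range(length)]
            reply = ask_model(make_prompt(numbers))
            if not is_correct(numbers, reply):
                return best
        best = length
    return best
```

The checker only tolerates extra words and commas around the numbers; a model that wraps its answer in other formatting would need a more forgiving parser.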

Sergii's Shortform

Sergii

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Task vectors & analogy making in LLMs

Sergii

I have described the problem of analogy-making interpretability in the previous post: given the examples of transformed sequences of numbers, what’s the mechanism behind figuring this transformation out, and applying it correctly to the incomplete (test) sequence?

prompt: "0 1 2 to 2 1 0, 1 2 3 to 3 2 1, 4 5 6 to ", output: "6 5 4"

It was easy to check on which layer the correct answer appears, but tracing the sources of that answer to earlier layers turned out to be challenging.

Meaningful intermediate embeddings?

When I applied logit lens [1] to the output of attention blocks, for the prompt that contained reversed sequences of numbers, I noticed that the output contained... (read 1124 more words →)

Mechanistic interpretability of LLM analogy-making

Sergii

Can LLMs make analogies? Yes, according to tests done by Melanie Mitchell a few years back, GPT-3 is quite decent at “Copycat” letter-string analogy-making problems. Copycat was invented by Douglas Hofstadter in the 80s, to be a very simple “microworld” that would capture some key aspects of human analogy reasoning. An example of a Copycat problem:

“If the string abc changes to the string abd, what does the string pqr change to?”

Many more examples are collected on this page.

A project that I'm working on while studying mechanistic interpretability (MI), is applying MI to an LLM's ability to solve Copycat problems.

According to Douglas Hofstadter, analogy is the core of cognition, and it can be argued that it is a... (read 1173 more words →)

Bird-eye view visualization of LLM activations

Sergii

I’m starting to learn about mechanistic interpretability, and I’m seeing lots of great visualizations of transformer internals, but somehow I’ve never seen the whole large model’s internal state shown at once, on one image.

So I made this visualization, for Llama-2-7B. Attention matrices are on the left, in 32 rows for 32 blocks, top to bottom. To the right, there are 64 rows: residual stream (odd rows) and internal MLP activations (even rows). Finally, output MLP and unembedding layer are on the bottom.

Activation maps are downscaled horizontally, with maxpooling, to fit into a 1000px-wide image.
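
A minimal sketch of that horizontal max pooling, in plain Python. It is not the code behind the images: the function names are made up here, and pooling the raw values (rather than, say, absolute values) is an assumption.

```python
def maxpool_row(row, out_width):
    """Downscale one row to at most out_width values, keeping each bin's max."""
    n = len(row)
    if n <= out_width:
        # Already narrow enough, nothing to pool.
        return list(row)
    pooled = []
    for i in range(out_width):
        # Integer bin edges; every bin is non-empty because n > out_width.
        start = i * n // out_width
        end = (i + 1) * n // out_width
        pooled.append(max(row[start:end]))
    return pooled


def maxpool_horizontal(activations, out_width=1000):
    """Apply maxpool_row to every row of a 2D list of activations."""
    return [maxpool_row(row, out_width) for row in activations]
```

Max pooling keeps isolated strong activations visible after downscaling, where averaging would wash them out.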

Example for a prompt “2+2=”:

And an example for the prompt “William Shakespeare was born in the year”:

And for the prompt “blue pencils fly over moonlit toasters”:

Probably not especially useful for interpretability, but at least it looks pretty )

GPT-4 for personal productivity: online distraction blocker

Sergii

There are many apps for blocking distracting websites: freedom.to, leechblock, selfcontrol, coldturkey, just to name a few. They are useful for maintaining focus, avoiding procrastination, and curbing addictive web surfing.

They work well for blocking a list of a few distracting websites. For me, this is not enough, because I’m spending a large portion of my time on a large number of websites, which I check out for a minute or two and then never visit again. It’s just impossible to maintain a blocklist for this long tail. Also, the web has grown so much that there are just too many easily found alternatives for any blocked distraction.

Well, GPT-4 to the rescue! With... (read 405 more words →)