cozyfractal — LessWrong

What convincing warning shot could help prevent extinction from AI?

I agree, that's an important point. I probably worry more about your first possibility, as we are already seeing this effect today, and worry less about the second, which would require a level of resignation that I've rarely seen. Responsible entities would likely try to do something about it, but this “we're doomed, let's profit” scenario might happen in the following ways:

The warning shot comes from a small player and a bigger player feels urgency or feels threatened, in a situation where they have little control
There is no clear responsibility and there are many

109

Charbel-Raphaël, cozyfractal, peterbarnett

Ω 272y

- Tell me father, when is the line
where ends everything good and fine?
I keep searching, but I don't find.
- The line my son, is just behind.
Camille Berger

There is hope that some “warning shot” would help humanity get its act together and change its trajectory to avoid extinction from AI. However, I don't think that's necessarily true.

There may be a threshold beyond which the development and deployment of advanced AI becomes essentially irreversible and inevitably leads to existential catastrophe. Humans might be happy, not even realizing that they are already doomed. There is a difference between the “point of no return” and “extinction.” We may cross the point of no return without realizing it. Any useful warning shot should happen before this point of no return.

We will need a very...

Against Almost Every Theory of Impact of Interpretability

cozyfractal3y20

I'm not sure what you meant about studying transistors.

It seems to me that if we are studying transistors so hard, it's to push computers' capabilities (faster, smaller, more energy efficient, etc.), and not at all to make software safer. Instead, to make software safer, we use anti-viruses, automatic testing, developer liability, standards, regulations, pop-up warnings, etc.

Understanding the Information Flow inside Large Language Models

Felix Hofstätter, cozyfractal

This is the write-up for the capstone project that @cozyfractal and I completed during ARENA's 2023 summer iteration. Our project explored a novel approach to interpreting language models, focusing on understanding their internal flow of information. While the practical implementation was completed in just one week and lacks formal rigor, we believe it offers some interesting insights and holds promise as a foundation for future research in this area. The accompanying repository with code examples and more experiments can be found here.

We want to thank Alexandre Variengien whose original idea served as the inspiration for this work, and who provided extensive feedback as well as thought-provoking discussions. Additionally, we want to express our thanks to the organizers of ARENA and our fellow participants for fostering an environment that encouraged...

The salt in pasta water fallacy

cozyfractal3y2913

It's the horizontal difference that matters, not the vertical one, so the water boils about 200s earlier, or 20% faster (according to this one experiment), which is quite nice!

Gears-Level Mental Models of Transformer Interpretability

cozyfractal3y20

Thank you for bringing those four ideas into one nicely written post! It helped me have a better overview of what happens inside transformers, even though I had worked with each idea independently before :)