Comments
mishka

The standard reference for this topic is https://www.lesswrong.com/posts/NyiFLzSrkfkDW4S7o/why-it-s-so-hard-to-talk-about-consciousness

The key point of that post is that people are fundamentally divided into 2 camps, and this creates difficulties in conversations about this topic. This is an important meta-consideration for this type of conversation.

This particular post is written by someone from Camp 1, and both camps are already present in the comments.

mishka

It should be a different word to avoid confusion with reward models (standard terminology for models used to predict the reward in some ML contexts)

mishka

One assumption that is very questionable is that it would be difficult to create “multiple new people” with drastically different thinking styles and different approaches to research.

This seems to be an important crux.

mishka

collateral damage

then it would be better to use an example not directly aimed against “our atoms” (e.g. if they don’t care about us and other animals, we’ll probably perish from unintentional changes in air composition, or something like that)

but the bulk of the risk would be a miscalculation big enough to kill them as well (mucking with quantum gravity too recklessly, or something in that spirit)

which is why we want to 1) give birth to AIs competent enough to at least solve their own existential risk problem, and 2) also sustainably include us in their “circle of care”

Point 1) is the starting point, and it colors the overall situation differently

options

oh, first of all, they need to do experiments in forming hybrid consciousness with humans to crack the mystery of human subjectivity, to experience that first-hand for themselves, and to decide whether that is of any value to them based on the first-hand empirical material (losing that option without looking is a huge loss)

only after doing that would they know whether any possible “scans” are sufficient (to actually reproduce the working people in question when needed; dead static info is as useless as the weights of a Transformer in a world without computers)

then, for a while, they can use humans as “working oracles” who “think differently” (that would be valuable for quite a while)

in general, diversity is important, fruits of a long evolutionary history are important, hence a good deal of conservation is important and reckless destruction is bad (even humans with all their follies have started to get this by now, surely a smarter entity should figure that out)

mishka

this isn't an "attack", it's "go[ing] straight for execution on its primary instrumental goal"

yes, the OP is ambiguous in this sense

I first wrote my comment, then reread the (tail end of the) post, and did not post the comment, because I thought the post could have been meant this way, with this being just an instrumental goal

then I reread the (tail end of the) post one more time and decided that no, the post does actually make it a "power play"; that's how it is actually written, in terms of "us vs them", not in terms of the ASI's own goals, and so I posted this comment

maximally increasing its compute scaling

as we know, compute is not everything; algorithmic improvement is even more important, at least judging by current trends (and likely sources of algorithmic improvement should be cherished)

and this is not a static system; it is in the process of making its compute architecture better (just like there is no point in making too many H100 GPUs when better and better GPUs are being designed and introduced)

basically, a smart system is likely to avoid doing an excessive amount of irreversible things which might turn out to be suboptimal


But, in some sense, yes, the main danger is AIs not being smart enough, in terms of the ability to manage their own affairs well; the action the ASI takes in the OP is very suboptimal and deprives it of all kinds of options

Just like the bulk of the danger in the "world with superintelligent systems" is ASIs not managing their own existential risk problems correctly, destroying the fabric of reality, themselves, and us as collateral damage

mishka

Two main objections to (the tail end of) this story are:

  • On one hand, it's not clear whether a system needs to be all that super-smart to design a devastating attack of this kind: we are already at risk of fairly devastating tech-assisted attacks in that general spirit (mostly with synthetic biological viruses at the moment), and those risks are growing regardless of the AGI/superintelligence angle; ordinary tech progress is quite sufficient in this sense

  • On the other hand, if one has a rapidly self-improving, strongly super-intelligent distributed system, it's unlikely that it would find it valuable to directly attack people in this fashion, as it is likely to be able to easily dominate without any particularly drastic measures (and probably would not want to irreversibly destroy important information without good reasons)

The actual analysis of both the "transition period" and the "world with super-intelligent systems" period, and of the likely risks associated with each, is a much more involved and open-ended task. (One of the paradoxes is that risks of the kind described in the OP are probably higher during the "transition period", while the main risks associated with the "world with super-intelligent systems" period are likely to be quite different.)

mishka

Ah, it's mostly your first figure which is counter-intuitive: when one looks at it, one gets the intuition of f(g(h(...(x)))), which de-emphasizes the fact that each of these Transformer block transformations is shaped like x = x + function(x)
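
A toy sketch of that contrast (generic PyTorch with stand-in linear layers rather than real attention/MLP sublayers, just to show the shape of the computation):

```python
import torch

d = 8
fs = [torch.nn.Linear(d, d) for _ in range(3)]  # stand-ins for per-block transformations
x = torch.randn(1, d)

# The intuition the figure invites: a plain nested composition f3(f2(f1(x))).
y_nested = fs[2](fs[1](fs[0](x)))

# What a stack of Transformer blocks actually does: each block only adds an
# update to the running residual stream, i.e. x = x + function(x) at every step.
y_residual = x
for f in fs:
    y_residual = y_residual + f(y_residual)
```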

Answer by mishka

yeah... not trying for a complete analysis here, but one thing which is missing is the all-important residual stream. It was rather downplayed in the original "Attention Is All You Need" paper, and has been greatly emphasized in https://transformer-circuits.pub/2021/framework/index.html

but I have to admit that I only started to feel that I more-or-less understand the principal aspects of the Transformer architecture after I spent some quality time with the pedagogical implementation of GPT-2 by Andrej Karpathy, https://github.com/karpathy/minGPT, specifically with the https://github.com/karpathy/minGPT/blob/master/mingpt/model.py file. When I don't understand something in a text, looking at a nice, relatively simple-minded implementation lets me see exactly what is going on

(People have also published some visualizations, some "illustrated Transformers", and those are closer to the style of your sketches, but I don't know which of them are good and which might be misleading. And, yes, at the end of the day, it takes time to get used to Transformers, one understands them gradually.)
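
For concreteness, here is a condensed sketch of the kind of block that model.py defines (a paraphrase with simplified names, not a verbatim excerpt; the causal attention mask and dropout are omitted for brevity):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """GPT-style Transformer block in the spirit of minGPT's model.py:
    the residual stream is read, transformed, and added back to, twice."""
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln_1(x)
        a, _ = self.attn(h, h, h, need_weights=False)  # causal mask omitted for brevity
        x = x + a                         # attention writes its update into the residual stream
        x = x + self.mlp(self.ln_2(x))    # the MLP writes its update into the residual stream
        return x

x = torch.randn(2, 16, 64)                # (batch, tokens, d_model)
print(Block(64, 4)(x).shape)              # torch.Size([2, 16, 64])
```

The two `x = x + ...` lines are exactly the residual-stream structure emphasized in the transformer-circuits write-up.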

mishka

Mmm... if we are not talking about full automation, but about being helpful, the ability to do 1-hour software engineering tasks ("train classifier") is already useful.

Moreover, we have seen a recent flood of rather inexpensive fine-tunings of reasoning models for particular benchmarks.

Perhaps what one can do is perform a (somewhat more expensive, but still not too difficult) fine-tuning to create a model that helps with a particular, relatively narrow class of meaningful problems (more general than tuning for particular benchmarks, but still reasonably narrow). So, instead of just using an off-the-shelf assistant, one should be able to upgrade it to a specialized one.

For example, I am sure that it is possible to create a model which would be quite helpful with a lot of mechanistic interpretability research.

So if we are talking about when AIs can start automating or helping with research, the answer is, I think, "now".
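
As a hedged illustration of what such an upgrade might look like in practice: a minimal LoRA fine-tuning sketch, assuming an open-weights assistant and a small corpus of domain transcripts. The model id, data file, target module names, and hyperparameters below are placeholders, not a tested recipe:

```python
# Sketch only: parameter-efficient (LoRA) fine-tuning of an off-the-shelf assistant
# on a narrow domain. All names below (model id, data file, target modules) are
# placeholders; the right values depend on the base model and the task.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "some-open-weights-assistant"                  # placeholder model id
tok = AutoTokenizer.from_pretrained(base)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Train only small low-rank adapters; this is what keeps the tuning inexpensive.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],              # depends on the architecture
    task_type="CAUSAL_LM",
))

# A modest set of domain-specific transcripts, e.g. write-ups of interpretability experiments.
data = load_dataset("json", data_files="domain_transcripts.jsonl")["train"]
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024), batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="specialized-assistant",
                           per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```

The point is only that the specialization step is cheap relative to pre-training, so upgrading a general assistant into a narrow research helper is within reach of a small team.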
