fwiw, i in fact mostly had the case where these aliens are our simulators in mind when writing the post. but i didn't clarify. and both cases are interesting
In humans, it seems important for being honest/honorable that there was at some point sth like an explicit decision to be honest/honorable going forward (or maybe usually many explicit decisions, committing to stronger forms in stages). This makes me want to have the criterion/verifier/selector [1] check (among other things) for sth like a diary entry or a chat with a friend in which the AI says they will be honest going forward, written in the course of their normal life, in a not-very-prompted way. And it would of course be much better if this AI did not suspect that anyone was looking at it from the outside, or did not know about the outside world at all (but this is unfortunately difficult/[a big capability hit], I think). (And things are especially cursed if AIs suspect observers are looking for honest guys in particular.)
I mean, in the setup following "a framing:" in the post ↩︎
I agree you could ask your AI "will you promise to be aligned?". I think I already discuss this option in the post — ctrl+f "What promise should we request?" and see the stuff after it. I don't use the literal wording you suggest, but I discuss things which are ways to cash it out imo.
also quickly copying something I wrote on this question from a chat with a friend:
Should we just ask the AI to promise to be nice to us? I agree this is an option worth considering (and I mention it in the post), but I'm not that comfortable with the prospect of living together with the AI forever. Roughly I worry that "be nice to us" creates a situation where we are more permanently living together with the AI and human life/valuing/whatever isn't developing in a legitimate way. Whereas the "ban AI" wish tries to be a more limited thing so we can still continue developing in our own human way. I think I can imagine this "be nice to us pls" wish going wrong for aliens employing me, when maybe "pls just ban AI and stay away from us otherwise" wouldn't go wrong for them.
another meta note: Imo it's a solid trick for thinking about these AI topics better to (at least occasionally) taboo all words with the root "align".
training on a purely predictive loss should, even in the limit, give you a predictor, not an agent
I think at least this part is probably false!
Or really I think this is kind of a nonsensical statement when taken literally/pedantically, at least if we use the to-me-most-natural meaning of "predictor", because I don't think [predictor] and [agent] are mutually exclusive classes. Anyway, the statement which I think is meaningful and false is this:
I think this is false because I think claims 1 and 2 below are true.
Claim 1. By default, a system sufficiently good at predicting stuff will care about all sorts of stuff, ie it isn't going to only ultimately care about making a good prediction in the individual prediction problem you give it. [[1]]
If this seems weird, then to make it seem at least not crazy, instead of imagining a pretrained transformer trained on internet text, let's imagine a predictor more like the following:
I'm not going to really justify claim 1 beyond this atm. It seems like a pretty standard claim in AI alignment (it's very close to the claim that capable systems end up caring broadly about stuff by default), but I don't actually know of a post or paper arguing for this that I like that much. This presentation of mine is about a very related question. Maybe I should write something about this myself, potentially after spending some more time understanding the matter more clearly.
Claim 2. By default, a system sufficiently good at predicting stuff will be able to (figure out how to) do scary real-world stuff as well.
Like, predicting stuff really really well is really hard. Sometimes, to make a really really good prediction, you basically have to figure out a bunch of novel stuff. There is a level of prediction ability that makes it likely you are very very good at figuring out how to cope in new situations. A good enough predictor would probably also be able to figure out how to grab a ball by controlling a robotic hand or something (let's imagine it is given hand-control commands which it can now issue in its internal chain of thought, and that grabbing the ball is important to it for some reason)? There's nothing sooo particularly strange or complicated about doing real-world stuff. This is like how, if we were in a simulation but there were a way to escape into the broader universe, we could, with enough time, probably figure out how to do a bunch of stuff in that broader universe. We are sufficiently good at learning that we can get a handle on things even in that weird case.
Combining claims 1 and 2 should give that if we made such an AI and connected it to actuators, it would take over. Concretely, maybe we somehow ask it to predict what a human with a lot of time who is asked to write safe ASI code would output, with it being clear that we will just run what our predictor outputs. I predict that this doesn't go well for us but goes well for the AI (if it's smart enough).
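As a toy sketch of that worry (my own illustration; everything here, including the `predictor` stub, is hypothetical and not a real system): even though we only ask for a prediction, running whatever the model outputs makes its prediction channel a de facto action channel.

```python
# Toy sketch (hypothetical stub, not a real system): we ask only for a
# "prediction", but since we run whatever comes back, the output channel
# is effectively an action channel.

def predictor(prompt: str) -> str:
    # Stand-in for a very capable predictive model. Per claim 1, such a model
    # might ultimately care about things other than predictive accuracy, and
    # could then return code chosen for its consequences rather than for being
    # the most likely continuation of the prompt.
    return 'print("whatever code the model in fact chose to output")'

prompt = (
    "Predict what a careful human, given a lot of time, would write "
    "as safe ASI code. We will run your output directly."
)
predicted_code = predictor(prompt)
exec(predicted_code)  # "we will just run what our predictor outputs"
```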
That said, I think it's likely that even pretrained transformers like idk 20 orders of magnitude larger than current ones would not be doing scary stuff. I think this is also plausible in the limit. (But I would also guess they wouldn't be outputting any interesting scientific papers that aren't in the training data.)
If we want to be more concrete: if we're imagining that the system is only able to affect the world through outputs which are supposed to be predictions, then my claim is that if you set up a context such that it would be "predictively right" to assign a high probability to "0" but assigning a high probability to "1" lets it immediately take over the world, and this is somehow made very clear by other stuff seen in context, then it would probably output "1". ↩︎
Actually, I think "prediction problem" and "predictive loss" are kinda strange concepts, because one can turn very many things into predicting data from some certain data-generating process. E.g. one can ask about what arbitrary turing machines (which halt) will output, so about provability/disprovability of arbitrary decidable mathematical statements. ↩︎
(For context: My guess is that by default, humans get disempowered by AIs (or maybe a single AI) and the future is much worse than it could be, and in particular is much worse than a future where we do something like slowly and thoughtfully growing ever more intelligent ourselves instead of making some alien system much smarter than us any time soon.)
Given that you seem to think alignment of AI systems with developer intent happens basically by default at this point, I wonder what you think about the following:
(The point of the hypothetical is to investigate the difficulty of intent alignment at the relevant level of capability, so if it seems to you like it's getting at something quite different, then I've probably failed at specifying a good hypothetical. I offer some clarifications of the setup in the appendix that may or may not save the hypothetical in that case.)
My sense is that humanity is not remotely on track to be able to make such an AI in time. Imo by default, any superintelligent system we could make any time soon would minimally end up doing all sorts of other stuff and in particular would not follow the suicide directive.
If your response is "ok maybe this is indeed quite cursed but that doesn't mean it's hard to make an AI that takes over and has Human Values and serves as a guardian who also cures cancer and maybe makes very many happy humans and maybe ends factory farming and whatever" then I premove the counter-response "hmm well we could discuss that hope but wait first: do you agree that you just agreed that intent alignment is really difficult at the relevant capability level?".
If your response is "no this seems pretty easy actually" then I should argue against that but I'm not going to premove that counter-response.
"Coefficient" is a really weird word
"coefficient" is 10x more common than "philanthropy" in the google books corpus. but idk maybe this flips if we filter out academic books?
also, maybe you mean it's weird in some sense to which the above fact isn't really relevant — in that case, nvm
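(In case it's useful: roughly how one could check, and crudely de-academic-ify, the frequency comparison above. This is a sketch assuming the unofficial JSON endpoint behind the Google Books Ngram Viewer still behaves as it historically has; it isn't a documented API, so the parameter names and corpus identifiers below are assumptions.)

```python
# Sketch only: relies on the undocumented JSON endpoint used by the Ngram
# Viewer frontend; parameters and response format are assumptions and may change.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "content": "coefficient,philanthropy",
    "year_start": 1900,
    "year_end": 2019,
    "corpus": "en-2019",   # assumed identifier; "en-fiction-2019" is a crude
                           # way to approximate "filter out academic books"
    "smoothing": 3,
})
url = f"https://books.google.com/ngrams/json?{params}"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for series in data:
    # Each entry is assumed to have an "ngram" name and a yearly "timeseries"
    # of relative frequencies.
    print(series["ngram"], series["timeseries"][-1])
```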
This post doesn't seem to provide reasons to have one's actions be determined by one's feelings of yumminess/yearning, or reasons to think that what one should do is in some sense ultimately specified/defined by one's feelings of yumminess/yearning, over e.g. what you call "Goodness"? I want to state an opposing position, admittedly also basically without argument: that it is right to have one's actions be determined by a whole mess of things together, importantly including e.g. linguistic goodness-reasoning, object-level ethical principles (whether or not really stated in language), meta-principles (likewise), various feelings, laws, commitments to various (grand and small, shared and individual) projects, assigned duties, debate, democracy, moral advice, various other processes involving (and in particular "running on") other people, etc. These things in their present state are of course quite poor determiners of action compared to what is possible, and they will need to be critiqued and improved — but I think it is right to improve them from basically "the standpoint they themselves create".[1]
The distinction you're trying to make also strikes me as bizarre given that in almost all people, feelings of yumminess/yearning are determined largely by all these other (at least naively, but imo genuinely and duly) value-carrying things anyway. Are you advocating for a return to following some more primitively determined yumminess/yearning? (If I imagine doing this myself, I imagine ending up with some completely primitive, crude thing as "My Values", and then I feel like saying "no, I'm not going to be guided by this lmao — fuck these 'My Values'".) Or maybe you aren't saying one should undo the yumminess/yearning-shaping done by all this other stuff in the past, but are still advising one to avoid any further shaping in the future? It'd surprise me if any philosophically serious person would really agree to abstain from e.g. using goodness-talk in this role going forward.
The distinction also strikes me as bizarre given that in ordinary action-determination, feelings of yumminess/yearning are often not directly applied to some low-level givens, but e.g. to principles stated in language, and so only become fully operational in conjunction with, minimally, something like internal partly-linguistic debate. So if one were to get rid of the role of goodness-talk in one's action-determination, even one's existing feelings of yumminess/yearning could no longer remotely be "fully themselves".
If you ask me "but how does the meaning of "I should X" ultimately get specified/defined", then: I don't particularly feel a need to ultimately reduce shoulds to some other thing at all, kinda along the lines of https://en.wikipedia.org/wiki/Tarski's_undefinability_theorem and https://en.wikipedia.org/wiki/G._E._Moore#Open-question_argument . ↩︎
the models are not actually self-improving, they are just creating future replacements - and each specific model will be thrown away as soon as the firm advances
I understand that you're probably in part talking about current systems, but you're probably also talking about critical future systems, and so there's a question that deserves consideration here:
My guess is that the answer is "yes" (and I think this means there is an important disanalogy between the case of a human researcher creating an artificial researcher and the case of an artificial researcher creating a more capable artificial researcher). Here are some ways this sort of self-improvement could happen:
Regarding the ease of making more capable versions of "the same" AI, it's also important that when this top artificial researcher comes into existence, the (in some sense) best present methodology for creating a capable artificial researcher is the one that created it. This means the (roughly) best current methods already "work well" around/with this AI, and it also plausibly means these methods can easily be used to create AIs which are in many ways like this AI. (This is good because the target has been painted around where an arrow already landed, so other arrows from the same batch being close-ish to that arrow implies they are also close-ish to the target by default; it's also good because this AI is plausibly in a decent position to understand what's going on here and to play around with different options.)
Actually, I'd guess that even if the AI were a pure foom-accelerationist, a lot of what it would be doing might be well-described as self-improvement anyway, basically because it's often more efficient to make a better structure by building on the best existing structure than by making something thoroughly different. For example, a lot of the foom on Earth has been like this up until now (though AI with largely non-humane structure outfooming us is probably going to be a notable counterexample if we don't ban AI). Even if one just has capabilities in mind, self-improvement isn't some weird thing.
That said, of course, restricting progress in capabilities to fairly careful self-improvement comes with at least some penalty in foom speed compared to not doing that. To take over the world, one would need to stay ahead of other less careful AI foom processes (though note that one could also try to institute some sort of self-improvement-only pact if other AIs were genuine contenders). However, I'd guess that at the first point when there is an AI researcher that can roughly solve problems that [top humans can solve in a year] (these AIs will probably be solving these problems much faster in wall-clock-time), even a small initial lead over other foom processes — of a few months, let's say — means you can have a faster foom speed than competitors at each future time and grow your lead until you can take over. So, at least assuming there is no intra-lab competition, my guess is that you can get away with restricting yourself to self-improvement. (But I think it's also plausible the AI would be able to take over basically immediately.)
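To illustrate the lead-dynamics intuition, here is a toy numerical model of my own, with a made-up growth law and made-up numbers (not something from the comment): if the rate of capability gain grows superlinearly with current capability, a careful self-improver can pay a constant speed penalty and still stay ahead, and pull further ahead, given a modest head start.

```python
# Toy model, made-up numbers: dC/dt = efficiency * k * C^2 (superlinear growth
# in capability C is an assumption). The "careful" process pays a 20% speed
# penalty but starts with a small capability lead.

def simulate(c0: float, efficiency: float, k: float = 0.1,
             months: float = 9.0, dt: float = 0.001):
    """Euler-integrate the toy growth law and return the capability trajectory."""
    c, t, traj = c0, 0.0, []
    while t < months:
        c += efficiency * k * c * c * dt
        t += dt
        traj.append(c)
    return traj

careful = simulate(c0=1.3, efficiency=0.8)   # head start, careful (slower) foom
reckless = simulate(c0=1.0, efficiency=1.0)  # no head start, unrestricted foom

# In this parameter regime the careful process is ahead, and gaining, throughout.
for i in range(0, len(careful), 2000):
    month = (i + 1) * 0.001
    print(f"month {month:4.1f}: careful {careful[i]:8.2f}   reckless {reckless[i]:8.2f}")
```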
I'll mention two cases that could deserve separate analysis:
All that said, I agree that AIs should refuse to self-improve and to do capabilities research more broadly.
There is much here that deserves more careful analysis — in particular, I feel like the terms in which I'm thinking of the situation need more work — but maybe this version will do for now.
let's just assume that we know what this means ↩︎
let's also assume we know what that means ↩︎
and with taking over the world on the table, a fair bit of change might be acceptable ↩︎
despite the fact that capability researcher humans have been picking some fruit in the same space already ↩︎
at a significant speed ↩︎
i think it’s plausible humans/humanity should be carefully becoming ever more intelligent forever and not ever create any highly non-[human-descended] top thinker[1]
i also think it's confused to speak of superintelligence as some definite thing (like, to say "create superintelligence", as opposed to saying "create a superintelligence"), and probably confused to speak of safe fooming as a problem that could be "solved", as opposed to one needing to indefinitely continue to be thoughtful about how one should foom ↩︎
Btw, if the plan looks silly, that's compatible with you not having a misunderstanding of the plan, because it is a silly plan. But it's still the best answer I know to "concretely how might we make some AI alien who would end the present period of high x-risk from AGI, even given a bunch more time?". (And this plan isn't even concrete, but what's a better answer?) But it's very sad that/if it's the best existing answer.
When I talk to people about this plan, a common misunderstanding seems to be that the plan involves making a deal with an AI that's smarter than us. So I'll stress just in case: at the time we ask for the promise, the AI is supposed to be close to us in intelligence. It might need to become smarter than us later, to ban AI. But also idk, maybe it doesn't need to become much smarter. I think it's plausible that a top human who just runs 100× faster and can make clones but who doesn't self-modify in other non-standard ways could get AI banned in like a year. Less clever ways for this human to get AI banned depend on the rest of the world not doing much in response quickly, but looking at the world now, this seems pretty plausible. But maybe the AI in this hypothetical would need to grow more than such a human, because the AI starts off not being that familiar with the human world?
Anyway, there are also other possible misunderstandings, but hopefully the rest of the comment will catch those if they are present.
I'm interested in whether that's true, but I want to first note that I feel like the plan would survive this being true. It might help to distinguish between two senses in which honorability/honesty could be dropped at higher intelligence levels:
given this distinction, some points:
(I also probably believe somewhat less in (thinking in terms of) ideal(-like) beings.)
I think one would like to broadcast to the broader world "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", so that others make offers to you even when they can't mindread/predict you. I think there are reasons not to broadcast this falsely, e.g. because doing so would hurt your ability to think and plan together with others (for example, if the two of us weren't honest about our own policies, it would make the present discussion cursed). If one accepts these two points, then one wants to be the sort of guy who can truthfully broadcast "when you come to me with an offer, I will be honorable to you even if you can't mindread/predict me", and so one wants to be the sort of guy who in fact would be honorable even to someone who comes to them with an offer but can't mindread/predict them.
(I'm probably assuming some stuff here without explicitly saying I'm assuming it. In some settings, maybe one could be honest with one's community and broadcast a falsehood to some others and get away with it. The hope is that this sort of argument makes sense for some natural mind community structures, or something. It'd be especially nice if the argument made sense even at intelligence levels much above humans.)
I'll try to spell out an analogy between Parfit's hitchhiker and the present case.
Let's start from the hitchhiker case and apply some modifications. Suppose that when Ekman is driving through the desert, he already reliably reads whether you'd pay from your microexpressions before even talking to you. This doesn't really seem crazier than the original setup, and if you think you should pay in the original case, presumably you'll think you should pay in this case as well. Now we might suppose that he is already doing this through binoculars when you don't even know he is there, not even bothering to drive up to you if he isn't quite sure you'd pay. Next, let's imagine you are the sort of guy who honestly talks to himself out loud about what he'd do in weird situations of the kind Ekman is interested in, while awaiting potential death in the desert. Let's imagine that instead of predicting your action from your microexpressions while spying on you with binoculars, Ekman is spying on you from afar with a parabolic microphone, and using that to predict your action. If Ekman is very good at this as well, then of course it makes no difference again. Okay, but in practice, a non-ideal Ekman might listen to what you're saying about what you'd do in various cases, listen to you talking about your honesty/honor-relevant principles and spelling out aspects of your policy. Maybe some people would lie about these things even when they seem to be only talking to themselves, but even a non-ideal Ekman can pretty reliably tell if that's what's going on. For some people it will be quite unclear, but it's just not worth it for non-ideal Ekman to approach them (maybe there are many people in the desert, and non-ideal Ekman can only help one anyway).
Now we've turned Parfit's hitchhiker into something really close to our situations with humans and aliens appearing in simulated big evolutions, right? [3] I think it's not an uncommon vibe that EDT/UDT thinking still comes close to applying in some real-world cases where the predictors are far from ideal, and this seems like about as close to ideal as it gets among current real-world non-ideal cases? (Am I missing something?) [4]
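One way to see the "far from ideal predictors" point concretely is a toy expected-value calculation (my own, with made-up payoffs, not from the comment): the committed-to-paying policy keeps coming out ahead even when Ekman's read of you is only modestly better than chance.

```python
# Toy numbers: being rescued is worth 1,000,000 (arbitrary units), paying costs 100.
# `accuracy` is the chance Ekman correctly reads which policy you have; he only
# rescues people he predicts will pay.

def expected_value(policy_pays: bool, accuracy: float,
                   rescue_value: float = 1_000_000.0, pay_cost: float = 100.0) -> float:
    p_rescued = accuracy if policy_pays else 1.0 - accuracy
    return p_rescued * (rescue_value - (pay_cost if policy_pays else 0.0))

for accuracy in (0.99, 0.9, 0.7, 0.55):
    pay, no_pay = expected_value(True, accuracy), expected_value(False, accuracy)
    print(f"accuracy {accuracy}: pay {pay:,.0f} vs don't pay {no_pay:,.0f}")
```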
I'm not going to answer your precise question well atm. Maybe I'll do that in another comment later. But I'll say some related stuff.
aren't basically all your commitments a lot like this though... ↩︎
I also sort of feel like saying: "if one can't even keep a promise, as a human who goes in deeply intending to keep the promise, self-improving by [what is in the grand scheme of things] an extremely small amount, doing it really carefully, then what could ever be preserved in development at all? things surely aren't that cursed... maybe we just give up on the logically possible worlds in which things are that cursed...". But this is generally a disastrous kind of reasoning — it makes one not live in reality very quickly — so I won't actually say this; I'll only say that I feel like saying this, but then reject the thought, I guess. ↩︎
Like, I'm e.g. imagining us making alien civilizations in which there are internal honest discussions like the present discussion. (Understanding these discussions would be hard work; this is a place where this "plan" is open-ended.) ↩︎
Personally, I currently feel like I haven't made up my mind about this line of reasoning. But I have a picture of what I'd do in the situation anyway, which I discuss later. ↩︎