aribrill — LessWrong

Ari Brill. Research Affiliate at Principles of Intelligence (PIBBSS) creating mathematical and empirical models to study how AI systems develop internal representations of the world. Former astrophysicist. www.aribrill.com

In your usage, are "scheming" and "deceptive alignment" synonyms, or would you distinguish those terms in any way?

This post is great to see, I think renormalization is a very exciting direction for AI safety research!

First, as one possible way to represent the real-world, we can think of representation_0 as a low-energy description of the dataset: IR_0.
If the NN is capable of learning a meaningful generalization of the data, representation_0 flows to representation_1 (now UV_1) via an implicit RG flow to higher energies. Instead of throwing information away, flowing to UV_1 adds structure that allows it to more reliably adapt to unseen information.

Shouldn't this go the other way, with representation_0 being UV and representation_1 being IR? A NN compresses the input representation (data) to obtain a coarse-grained output representation (label). The ability to throw away information, i.e. the irrelevant noise w.r.t. the target function, is what enables generalization to unseen inputs differing in fine-grained details.

The notion of a precision scale for interpretability is really interesting, particularly the connection with generalization/memorization. This seems like a fruitful concept to develop further.

How, then can we operationalize the loss scale of a phenomenon? Well, one way to do this is to imagine that we have some "natural" complexity parameter c that can be varied (this can be a parameter tuning model size, training length, etc.).

It could be interesting to think about the interpretation of different possible complexity parameters here. You might expect these to give rise to distinct but related notions of generalization. Here, I’m drawing on intuitions from my work connecting scaling laws to data distributions, though I hadn’t put it in exactly these terms before. (I'll write a LW post summarizing/advertising this paper, but haven’t gotten to it yet...)

One interesting scaling regime is scaling in effective model size (given infinite training data & compute). You can also think about scaling in training data size (given infinite model capacity & compute). I think model scaling is basically akin to what you’re talking about here. Data scaling could be useful to study too, as it gets around the need to understand & measure effective model capacity. Of course these are theoretical limits, in practice one usually scales model size & data together in a compute-optimal way, you’re probably not training to convergence, etc.

If the data distribution consists of clusters of varying size (alternatively, subtasks of varying importance), then taking model size as the complexity parameter could give a notion of generalization as modeling the most important parts of the data distribution. Memorization then consists of modeling rarely observed or unimportant components. On the other hand, taking data size as the complexity parameter would suggest that generalization consists of coarsely modeling the entire data distribution, with it being memorization-like to model fine details and exceptions anywhere.

It also would be interesting to think about other complexity scaling parameters, for example, test-time compute in an AlphaZero-style setting.

If possible, we would like models in this class to be "locally simultaneously interpretable", i.e. that for two nearby values , the models M_c and M_c' have similar weights and implement similar circuits.

My impression of what one is supposed to expect from this is that as the complexity parameter increases, the learned circuits quantitatively improve, but never undergo a radical qualitative shift at any particular scale. Would you agree with that? So for a “good circuit”, the explained loss is basically monotonic, slowly decreasing or stable around ~100% as one goes from complexity 0 to the cutoff complexity, and decreasing below that. But if there were a qualitative step change, you would see instead a peak in the explained loss around the cutoff complexity, increasing above and decreasing below that. In that situation, the loss precision scale would seem less natural as a measure of circuit understanding.

Basically, the concern would be something like a model implementing an algorithm with low algorithmic complexity but a large constant factor, so it can only emerge and become dominant at some critical model scale. (Similar to but not exactly the same as grokking.) One realistic possible instance of this might be the emergence of in-context learning in LLMs only at large enough scales.

Thanks for the great writeup.

Superposition ("local codes") require sparsity, i.e. that only few features are active at a time.

Typo: I think you meant to write distributed, not local, codes. A local code is the opposite of superposition.

Short answer: some goals incentivize general intelligence, which incentivizes tracking lots of abstractions and also includes the ability to pick up and use basically-any natural abstractions in the environment at run-time.
Longer answer: one qualitative idea from the Gooder Regulator Theorem is that, for some goals in some environments, the agent won't find out until later what its proximate goals are. As a somewhat-toy example: imagine playing a board game or video game in which you don't find out the win conditions until relatively late into the game. There's still a lot of useful stuff to do earlier on - instrumental convergence means that e.g. accumulating resources and gathering information and building general-purpose tools are all likely to be useful for whatever the win condition turns out to be.

As I understand this argument, even if an agent's abstractions depend on its goals, it doesn't matter because disparate agents will develop similar instrumental goals due to instrumental convergence. Those goals involve understanding and manipulating the world, and thus require natural abstractions. (And there's the further claim that a general intelligence can in fact pick up any needed natural abstraction as required.)

That covers instrumental goals, but what about final goals? These can be arbitrary, per the orthogonality thesis. Even if an agent develops a set of natural abstractions for instrumental purposes, if it has non-natural final goals, it will need to develop a supplementary set of non-natural goal-dependent abstractions to describe them as well.

When it comes to an AI modeling human abstractions, it does seem plausible to me that humans' lowest-level final goals/values can be described entirely in terms of natural abstractions, because they were produced by natural selection and so had to support survival & reproduction. It's a bit less obvious to me this still applies to high-level cultural values (would anyone besides a religious Jew naturally develop the abstraction of kosher animal?). In any case, if it's sufficiently important for the AI to model human behavior, it will develop these abstractions for instrumental purposes.

Going the other direction, can humans understand, in terms of our abstractions, those that an AI develops to fulfill its final goals? I think not necessarily, or at least not easily. An unaligned or deceptively aligned mesa-optimizer could have an arbitrary mesa-objective, with no compact description in terms of human abstractions. This matters if the plan is to retarget an AI's internal search process. Identifying the original search target seems like a relevant intermediate step. How else can you determine what to overwrite, and that you won't break things when you do it?

I claim that humans have that sort of "general intelligence". One implication is that, while there are many natural abstractions which we don't currently track (because the world is big, and I can't track every single object in it), there basically aren't any natural abstractions which we can't pick up on the fly if we need to. Even if an AI develops a goal involving molecular squiggles, I can still probably understand that abstraction just fine once I pay attention to it.

This conflates two different claims.

A general intelligence trying to understand the world can develop any natural abstraction as needed. That is, regularities in observations / sensory data -> abstraction / mental representation.
A general intelligence trying to understand another agent's abstraction can model its implications for the world as needed. That is, abstraction -> predicted observational regularities.

The second doesn't follow from the first. In general, if a new abstraction isn't formulated in terms of lower-level abstractions you already possess, integrating it into your world model (i.e. understanding it) is hard. You first need to understand the entire tower of prerequisite lower-level abstractions it relies on, and that might not be feasible for a bounded agent. This is true whether or not all these abstractions are natural.

In the first case, you have some implicit goal that's guiding your observations and the summary statistics you're extracting. The fundamental reason the second case can be much harder relates to this post's topic: the other agent's implicit goal is unknown, and the space of possible goals is vast. The "ideal gas" toy example misleads here. In that case, there's exactly one natural abstraction (P, V, T), no useful intermediate abstraction levels, and the individual particles are literally indistinguishable, making any non-natural abstractions incoherent. Virtually any goal routes through one abstraction. A realistic general situation may have a huge number of equally valid natural abstractions pertaining to different observables, at many levels of granularity (plus an enormous bestiary of mostly useless non-natural abstractions). A bounded agent learns and employs the tiny subset of these that helps achieve its goals. Even if all generally intelligent agents have the same potential instrumental goals that could enable them to learn the same natural abstractions, without the same actual instrumental goals, they won't.

Unfortunately I am busy from 2-5 on Sundays, but I would certainly like to attend a future Yale meetup at some other time.

In 2002, Wizards of the Coast put out Star Wars: The Trading Card Game designed by Richard Garfield.

As Richard modeled the game after a miniatures game, it made use of many six-sided dice. In combat, cards' damage was designated by how many six-sided dice they rolled. Wizards chose to stop producing the game due to poor sales. One of the contributing factors given through market research was that gamers seem to dislike six-sided dice in their trading card game.

Here's the kicker. When you dug deeper into the comments they equated dice with "lack of skill." But the game rolled huge amounts of dice. That greatly increased the consistency. (What I mean by this is that if you rolled a million dice, your chance of averaging 3.5 is much higher than if you rolled ten.) Players, though, equated lots of dice rolling with the game being "more random" even though that contradicts the actual math.

Mark Rosewater, Kind Acts of Randomness

Why is there that knee-jerk rejection of any effort to "overthink" pop culture? Why would you ever be afraid that looking too hard at something will ruin it? If the government built a huge, mysterious device in the middle of your town and immediately surrounded it with a fence that said, "NOTHING TO SEE HERE!" I'm pretty damned sure you wouldn't rest until you knew what the hell that was -- the fact that they don't want you to know means it can't be good.

Well, when any idea in your brain defends itself with "Just relax! Don't look too close!" you should immediately be just as suspicious. It usually means something ugly is hiding there.

David Wong, The 5 Ugly Lessons Hiding in Every Superhero Movie

"How is it possible! How is it possible to produce such a thing!" he repeated, increasing the pressure on my skull, until it grew painful, but I didn't dare object. "These knobs, holes...cauliflowers -" with an iron finger he poked my nose and ears - "and this is supposed to be an intelligent creature? For shame! For shame, I say!! What use is a Nature that after four billion years comes up with THIS?!"

Here he gave my head a shove, so that it wobbled and I saw stars.

"Give me one, just one billion years, and you'll see what I create!"

Stanislaw Lem, "The Sanatorium of Dr. Vliperdius" (trans. Michael Kandel)

That's certainly true. It seems to me that in this case, sbenthall was describing entities more akin to Google than to the Yankees or to the Townsville High School glee club; "corporations" is over-narrow but accurate, while "organizations" is over-broad and imprecise.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments