tailcalled's Shortform

If a tree falls in the forest, and two people are around to hear it, does it make a sound?

I feel like typically you'd say yes, it makes a sound. Not two sounds, one for each person, but one sound that both people hear.

But that must mean that a sound is not just auditory experiences, because then there would be two rather than one. Rather it's more like, emissions of acoustic vibrations. But this implies that it also makes a sound when no one is around to hear it.

3Dagon
I think this just repeats the original ambiguity of the question, by using the word "sound" in a context where the common meaning (air vibrations perceived by an agent) is only partly applicable.  It's still a question of definition, not of understanding what actually happens.
3tailcalled
But the way to resolve definitional questions is to come up with definitions that make it easier to find general rules about what happens. This illustrates one way one can do that, by picking edge-cases so they scale nicely with rules that occur in normal cases. (Another example would be 1 as not a prime number.)
2Dagon
My recommended way to resolve (aka disambiguate) definitional questions is "use more words".  Common understandings can be short, but unusual contexts require more signals to communicate.
1Bert
I think we're playing too much with the meaning of "sound" here. The tree causes some vibrations in the air, which leads to two auditory experiences since there are two people.

Finally gonna start properly experimenting on stuff. Just writing up what I'm doing to force myself to do something, not claiming this is necessarily particularly important.

Llama (and many other models, but I'm doing experiments on Llama) has a piece of code that looks like this:

    h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
    out = h + self.feed_forward(self.ffn_norm(h))

Here, out is the result of the transformer layer (aka the residual stream), and the vectors self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and self.feed_forward(self.ffn_norm(h)) are basically where all the computation happens. So basically the transformer proceeds as a series of "writes" to the residual stream using these two vectors.

I took all the residual vectors for some queries to Llama-8b and stacked them into a big matrix M with 4096 columns (the internal hidden dimensionality of the model). Then using SVD, I can express M = UΣV^T = ∑_i σ_i u_i v_i^T, where the u_i's and v_i's are independent unit vectors. This basically decomposes the "writes" into some independent locations in the residual stream (u's), some lat... (read more)
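As a minimal sketch of that decomposition step (the hook code that actually collects the write vectors from the model is omitted, and the stacked matrix is stood in by random data):

    import torch

    # Stand-in for the stacked "writes": in the real experiment, each row is one
    # attention or feed-forward output vector (4096-dim) collected from Llama-8b.
    M = torch.randn(5000, 4096)

    # Thin SVD: M = U @ diag(S) @ Vh. The rows of Vh are orthonormal directions
    # in the residual stream, and S holds the singular values in decreasing order.
    U, S, Vh = torch.linalg.svd(M, full_matrices=False)
    print(U.shape, S.shape, Vh.shape)  # (5000, 4096), (4096,), (4096, 4096)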

2tailcalled
Ok, so I've got the clipping working. First, some uninterpretable diagrams:

In the bottom six diagrams, I try taking varying numbers (x-axis) of right singular vectors (v's) and projecting the "writes" to the residual stream down to the space spanned by those vectors. The obvious criterion to care about is whether the projected network reproduces the outputs of the original network, which here I operationalize as the log probability the projected network gives to the continuation of the prompt (shown in the "generation probability" diagrams). This appears to be fairly chaotic (and low) in the 1-300ish range, then stabilizes while still being pretty low in the 300ish-1500ish range, then finally converges to normal in the 1500ish-2000ish range, and is ~perfect afterwards.

The remaining diagrams show something about how/why we have this pattern. "orig_delta" concerns the magnitude of the attempted writes for a given projection (which is not constant because projecting in earlier layers will change the writes by later layers), and "kept_delta" concerns the remaining magnitude after the discarded dimensions have been projected away. In the low end, "kept_delta" is small (and even "orig_delta" is a bit smaller than it ends up being at the high end), indicating that the network fails to reproduce the probabilities because the projection is so aggressive that it simply suppresses the network too much. Then in the middle range, "orig_delta" and "kept_delta" explode, indicating that the network has some internal runaway dynamics which normally would be suppressed, but where the suppression system is broken by the projection. Finally, in the high range, we get a sudden improvement in loss and a sudden drop in residual-stream "write" size, indicating that it has managed to suppress this runaway stuff and now it works fine.
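Concretely, the projection step looks something like this (a sketch; wiring it into the actual forward pass via hooks is omitted, and Vh is assumed to come from the SVD above):

    import torch

    def project_write(write: torch.Tensor, Vh: torch.Tensor, k: int) -> torch.Tensor:
        """Project a residual-stream write onto the span of the top-k right
        singular vectors (the first k rows of Vh), discarding the other directions."""
        V_k = Vh[:k]                        # (k, 4096), orthonormal rows
        return write @ V_k.T @ V_k          # coefficients in the subspace, mapped back

    # Toy usage with stand-in data:
    Vh = torch.linalg.svd(torch.randn(5000, 4096), full_matrices=False).Vh
    write = torch.randn(4096)
    orig_delta = write.norm()                           # magnitude of the attempted write
    kept_delta = project_write(write, Vh, 1500).norm()  # magnitude after clipping
    print(orig_delta.item(), kept_delta.item())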
2tailcalled
An implicit assumption I'm making when I clip off from the end with the smallest singular values is that the importance of a dimension is proportional to its singular values. This seemed intuitively sensible to me ("bigger = more important"), but I thought I should test it, so I tried clipping off only one dimension at a time, and plotting how that affected the probabilities: Clearly there is a correlation, but also clearly there's some deviations from that correlation. Not sure whether I should try to exploit these deviations in order to do further dimension reduction. It's tempting, but it also feels like it starts entering sketchy territories, e.g. overfitting and arbitrary basis picking. Probably gonna do it just to check what happens, but am on the lookout for something more principled.
2tailcalled
Back to clipping away an entire range, rather than a single dimension. Here's ordering it by the importance computed by clipping away a single dimension: Less chaotic maybe, but also much slower at reaching a reasonable performance, so I tried a compromise ordering that takes both size and performance into account: Doesn't seem like it works super great tbh. Edit: for completeness' sake, here's the initial graph with log-surprise-based plotting.
2tailcalled
To quickly find the subspace that the model is using, I can use a binary search to find the number of singular vectors needed before the probability when clipping exceeds the probability when not clipping.

A relevant followup is what happens to other samples in response to the prompt when clipping. When I extrapolate "I believe the meaning of life is" using the 1886-dimensional subspace from [...], I get: [...] Which seems sort of vaguely related, but idk. Another test is just generating without any prompt, in which case these vectors give me: [...] Using a different prompt: [...] I can get a 3329-dimensional subspace which generates: [...] or [...] Another example: [...] can yield 2696 dimensions with [...] or [...] And finally, [...] can yield the 2518-dimensional subspace: [...] or [...]
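The binary search itself is nothing fancy; a sketch, treating the clipped model's quality as a black-box function of k (clipped_logprob here is a hypothetical helper that runs the projected network and scores the continuation):

    def min_dims_needed(clipped_logprob, full_logprob, d=4096):
        """Smallest k such that clipping to the top-k right singular vectors gives
        at least the unclipped log probability. Assumes quality is ~monotone in k,
        which per the plots above is only roughly true."""
        lo, hi = 1, d
        while lo < hi:
            mid = (lo + hi) // 2
            if clipped_logprob(mid) >= full_logprob:
                hi = mid
            else:
                lo = mid + 1
        return lo

    # Toy check with a fake quality curve standing in for the real model:
    print(min_dims_needed(lambda k: -(4096 - k) * 0.01, full_logprob=-2.0))  # 3896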
2tailcalled
Given the large number of dimensions that are kept in each case, there must be considerable overlap in which dimensions they make use of. But how much? I concatenated the dimensions found in each of the prompts, and performed an SVD of it. It yielded this plot: ... unfortunately this seems close to the worst-case scenario. I had hoped for some split between general and task-specific dimensions, yet this seems like an extremely uniform mixture.
2tailcalled
If I look at the pairwise overlap between the dimensions needed for each generation: ... then this is predictable down to ~1% error simply by assuming that they pick a random subset of the dimensions for each, so their overlap is proportional to each of their individual sizes.
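For reference, the random-subset baseline is just k1·k2/4096 for each pair; a quick sketch with the subspace sizes quoted above (the prompt labels other than the first are placeholders, since the prompts themselves aren't reproduced here):

    d = 4096
    sizes = {"meaning_of_life": 1886, "prompt_2": 3329, "prompt_3": 2696, "prompt_4": 2518}

    # If each generation used a uniformly random subset of the d dimensions, the
    # expected overlap between subsets of sizes k1 and k2 is k1 * k2 / d.
    names = list(sizes)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(a, b, round(sizes[a] * sizes[b] / d))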
2tailcalled
Oops, my code had a bug so only self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and not self.feed_forward(self.ffn_norm(h)) was in the SVD. So the diagram isn't 100% accurate.

Thesis: while consciousness isn't literally epiphenomenal, it is approximately epiphenomenal. One way to think of this is that your output bandwidth is much lower than your input bandwidth. Another way to think of this is the prevalence of akrasia, where your conscious mind actually doesn't have full control over your behavior. On a practical level, the ecological reason for this is that it's easier to build a general mind and then use whatever parts of the mind that are useful than to narrow down the mind to only work with a small slice of possibilities. This is quite analogous to how we probably use LLMs for a much narrower set of tasks than what they were trained for.

2Seth Herd
Consciousness is not at all epiphenomenal, it's just not the whole mind and not doing everything. We don't have full control over our behavior, but we have a lot. While the output bandwidth is low, it can be applied to the most important things.
2tailcalled
Maybe a point that was missing from my thesis is that one can have a higher-level psychological theory in terms of life-drives and death-drives which then addresses the important phenomenal activities but doesn't model everything. And then if one asks for an explanation of the unmodelled part, the answer will have to be consciousness. But then because the important phenomenal part is already modelled by the higher-level theory, the relevant theory of consciousness is ~epiphenomenal.
2Seth Herd
I guess I have no idea what you mean by "consciousness" in this context. I expect consciousness to be fully explained and still real. Ah, consciousness. I'm going to mostly save the topic for if we survive AGI and have plenty of spare time to clarify our terminology and work through all of the many meanings of the word. Edit - or of course if something else was meant by consciousness, I expect a full explanation to indicate that thing isn't real at all. I'm an eliminativist or a realist depending on exactly what is meant. People seem to be all over the place on what they mean by the word.
2tailcalled
A thermodynamic analogy might help: Reductionists like to describe all motion in terms of low-level physical dynamics, but that is extremely computationally intractable and arguably also misleading because it obscures entropy. Physicists avoid reductionism by instead factoring their models into macroscopic kinetics and microscopic thermodynamics. Reductionistically, heat is just microscopic motion, but microscopic motion that adds up to macroscopic motion has already been factored out into the macroscopic kinetics, so what remains is microscopic motion that doesn't act like macroscopic motion, either because it is ~epiphenomenal (heat in thermal equilibrium) or because it acts very different from macroscopic motion (heat diffusion).

Similarly, reductionists like to describe all psychology in terms of low-level Bayesian decision theory, but that is extremely computationally intractable and arguably also misleading because it obscures entropy. You can avoid reductionism by instead factoring models into some sort of macroscopic psychology-ecology boundary and microscopic neuroses. Luckily Bayesian decision theory is pretty self-similar, so often the macroscopic psychology-ecology boundary fits pretty well with a coarse-grained Bayesian decision theory.

Now, similar to how most of the kinetic energy in a system in motion is usually in the microscopic thermal motion rather than in the macroscopic motion, most of the mental activity is usually with the microscopic neuroses instead of the macroscopic psychology-ecology. Thus, whenever you think "consciousness", "self-awareness", "personality", "ideology", or any other broad and general psychological term, it's probably mostly about the microscopic neuroses. Meanwhile, similar to how tons of physical systems are very robust to wide ranges of temperatures, tons of psychology-ecologies are very robust to wide ranges of neuroses.

As for what "consciousness" really means, idk, currently I'm thinking it's tightly intertwin

Thesis: There's three distinct coherent notions of "soul": sideways, upwards and downwards.

By "sideways souls", I basically mean what materialists would translate the notion of a soul to: the brain, or its structure, so something like that. By "upwards souls", I mean attempts to remove arbitrary/contingent factors from the sideways souls, for instance by equating the soul with one's genes or utility function. These are different in the particulars, but they seem conceptually similar and mainly differ in how they attempt to cut the question of identity (ide... (read more)

2Dagon
I'm having trouble following whether this categorizes the definition/concept of a soul, or the causality and content of this conception of soul.  Is "sideways soul" about structure and material implementation, or about weights and connectivity, independent of substrate?  WHICH factors are removed from upwards? ("genes" and "utility function" are VERY different dimensions, both tiny parts of what I expect create (for genes) or comprise (for utility function) a soul.)  What about memory?  multiple levels of value and preferences (including meta-preferences in how to abstract into "values")? Putting "downwards" supernatural ideas into the same framework as more logical/materialist ideas confuses me - I can't tell if that makes it a more useful model or less.
4tailcalled
When you get into the particulars, there are multiple feasible notions of sideways soul, of which material implementation vs weights and connectivity are the main ones. I'm most sympathetic to weights and connectivity. I have thought less about and seen less discussion about upwards souls. I just mentioned it because I'd seen a brief reference to it once, but I don't know anything in-depth. I agree that both genes and utility function seem incomplete for humans, though for utility maximizers in general I think there is some merit to the soul == utility function view. Memory would usually go in sideways soul, I think. idk Sideways vs upwards vs downwards is more meant to be a contrast between three qualitatively distinct classes of frameworks than it is meant to be a shared framework.
2Seth Herd
Excellent! I like the move of calling this "soul" with no reference to metaphysical souls. This is highly relevant to discussions of "free will" if the real topic is self-determination - which it usually is.
2tailcalled
"Downwards souls are similar to the supernatural notion of souls" is an explicit reference to metaphysical souls, no?
2Seth Herd
um, it claims to be :) I don't think that's got much relationship to the common supernatural notion of souls. But I read it yesterday and forgot that you'd made that reference.
2tailcalled
What special characteristics do you associate with the common supernatural notion of souls which differs from what I described?
2Nathan Helm-Burger
The word 'soul' is so tied in my mind to implausible metaphysical mythologies that I'd parse this better if the word were switched for something like 'quintessence' or 'essential self' or 'distinguishing uniqueness'.
2tailcalled
What implausible metaphysical mythologies is it tied up with? As mentioned in my comment, downwards souls seem to satisfy multiple characteristics we'd associate with mythological souls, so this and other things makes me wonder if the metaphysical mythologies might actually be more plausible than you realize.

Thesis: in addition to probabilities, forecasts should include entropies (how many different conditions are included in the forecast) and temperatures (how intense is the outcome addressed by the marginal constraint in this forecast, i.e. the big-if-true factor).

I say "in addition to" rather than "instead of" because you can't compute probabilities just from these two numbers. If we assume a Gibbs distribution, there's the free parameter of energy: ln(P) = S - E/T. But I'm not sure whether this energy parameter has any sensible meaning with more general ev... (read more)


Thesis: whether or not tradition contains some moral insights, commonly-told biblical stories tend to be too sparse to be informative. For instance, there's no plot-relevant reason why it should be bad for Adam and Eve to have knowledge of good and evil. Maybe there's some interpretation of good and evil where it makes sense, but it seems like then that interpretation should have been embedded more properly in the story.

3UnderTruth
It is worth noting that, in the religious tradition from which the story originates, it is Moses who commits these previously-oral stories to writing, and does so in the context of a continued oral tradition which is intended to exist in parallel with the writings. On their own, the writings are not meant to be complete, both in order to limit more advanced teachings to those deemed ready for them, as well as to provide occasion to seek out the deeper meanings, for those with the right sort of character to do so.
4tailcalled
This makes sense. The context I'm thinking of is my own life, where I come from a secular society with atheist parents, and merely had brief introductions to the stories from bible reading with parents and Christian education in school. (Denmark is a weird society - few people are actually Christian or religious, so it's basically secular, but legally speaking we are Christian and do not have separation between Church and state, so there are random fragments of Christianity we run into.)
2lemonhope
What? Nobody told me. Where did you learn this?
4Garrett Baker
This is the justification behind the Talmud.

Thesis: one of the biggest alignment obstacles is that we often think of the utility function as being basically-local, e.g. that each region has a goodness score and we're summing the goodness over all the regions. This basically-guarantees that there is an optimal pattern for a local region, and thus that the global optimum is just a tiling of that local optimal pattern.

Even if one adds a preference for variation, this likely just means that a distribution of patterns is optimal, and the global optimum will be a tiling of samples from said distribution.

T... (read more)

Current agent models like argmax entirely lack any notion of "energy". Not only does this seem kind of silly on its own, I think it also leads to missing important dynamics related to temperature.

I think I've got it, the fix to the problem in my corrigibility thing!

So to recap: It seems to me that for the stop button problem, we want humans to control whether the AI stops or runs freely, which is a causal notion, and so we should use counterfactuals in our utility function to describe it. (Dunno why most people don't do this.) That is, if we say that the AI's utility should depend on the counterfactuals related to human behavior, then it will want to observe humans to get input on what to do, rather than manipulate them, because this is the only wa... (read more)

2tailcalled
It also might be vulnerable to some variant of the critiques that were first raised against it, because now the conditional introduces a link between its policy and the scenarios it faces, but I can't immediately construct a case where it happens, because the conditional would still somewhat tend to sabotage the obvious approaches. This sort of unclarity is kind of concerning when it comes to the idea.
2tailcalled
Like suppose the AI immediately very publicly does something that looks very unsafe. Say grabs control over the stop button and starts mass-producing paperclips in an extremely publicly visible way. This would probably lead to people wanting to stop it. So therefore, if it has a policy like that, the |S conditional would lead to people quickly wanting to stop it. This means that in the |S branch, it can quickly determine whether it is in the f|S branch or the s|S branch; in the f|S case, it can then keep going with whatever optimization V specified, while in the s|S case, it can then immediately shut down itself. But the reason I think the AI *wouldn't* do this is, what about the |F branch? If you condition on humans not wanting to press the stop button even though there's a clearly unaligned AI, what sort of situation could produce this? I have trouble imagining it, because it seems like it would need to be pretty extreme. The best ideas I can come up with are stuff like "black hole swallows the earth", but this would rank pretty low in the AI's utility function, and therefore it would avoid acting this way in order to have a reasonable |F branch. But this does not seem like sane reasoning on the AI's side to me, so it seems like this should be fixed. And of course, fixed in a principled rather than unprincipled way.

I was surprised to see this on twitter:

I mean, I'm pretty sure I knew what caused it (this thread or this market), and I guess I knew from Zack's stuff that rationalist cultism had gotten pretty far, but I still hadn't expected that something this small would lead to being blocked.

FYI: I have a low bar for blocking people who have according-to-me bad, overconfident takes about probability theory, in particular. For whatever reason, I find people making claims about that topic, in particular, really frustrating. ¯\_(ツ)_/¯

The block isn't meant as a punishment, just a "I get to curate my online experience however I want."

2tailcalled
I think blocks are pretty irrelevant unless one conditions on the particular details of the situation. In this case I think the messages I was sharing are very important. If you think my messages are instead unimportant or outright wrong, then I understand why you would find the block less interesting, but in that case I don't think we can meaningfully discuss it without knowing why you disagree with the messages.

I'm not particularly interested in discussing it in depth. I'm more like giving you a data-point in favor of not taking the block personally, or particularly reading into it. 

(But yeah, "I think these messages are very important", is likely to trigger my personal "bad, overconfident takes about probability theory" neurosis.)

This is awkwardly armchair, but… my impression of Eliezer includes him being just so tired, both specifically from having sacrificed his present energy in the past while pushing to rectify the path of AI development (by his own model thereof, of course!) and maybe for broader zeitgeist reasons that are hard for me to describe. As a result, I expect him to have entered into the natural pattern of having a very low threshold for handing out blocks on Twitter, both because he's beset by a large amount of sneering and crankage in his particular position and because the platform easily becomes a sinkhole in cognitive/experiential ways that are hard for me to describe but are greatly intertwined with the aforementioned zeitgeist tiredness.

Something like: when people run heavily out of certain kinds of slack for dealing with The Other, they reach a kind of contextual-but-bleed-prone scarcity-based closed-mindedness of necessity, something that both looks and can become “cultish” but where reaching for that adjective first is misleading about the structure around it. I haven't succeeded in extracting a more legible model of this, and I bet my perception is still skew to the reality, but I'... (read more)

I disagree with the sibling thread about this kind of post being “low cost”, BTW; I think adding salience to “who blocked whom” types of considerations can be subtly very costly.

 

I agree publicizing blocks has costs, but so does a strong advocate of something with a pattern of blocking critics. People publicly announcing "Bob blocked me" is often the only way to find out if Bob has such a pattern. 

I do think it was ridiculous to call this cultish. Tuning out critics can be evidence of several kinds of problems, but not particularly that one. 

5tailcalled
I agree that it is ridiculous to call this cultish if this was the only evidence, but we've got other lines of evidence pointing towards cultishness, so I'm making a claim of attribution more so than a claim of evidence.
3M. Y. Zuo
Blocking a lot isn’t necessarily bad or unproductive… but in this case it’s practically certain blocking thousands will eventually lead to blocking someone genuinely more correct/competent/intelligent/experienced/etc… than himself, due to sheer probability. (Since even a ‘sneering’ crank is far from literal random noise.) Which wouldn’t matter at all for someone just messing around for fun, who can just treat X as a text-heavy entertainment system. But it does matter somewhat for anyone trying to do something meaningful and/or accomplish certain goals. In short, blocking does have some, variable, credibility cost. Ranging from near zero to quite a lot, depending on who the blockee is.
2tailcalled
Eliezer Yudkowsky being tired isn't an unrelated accident though. Bayesian decision theory in general intrinsically causes fatigue by relying on people to use their own actions to move outcomes instead of getting leverage from destiny/higher powers, which matches what you say about him having sacrificed his present energy for this. Similarly, "being Twitterized" is just about stewing in garbage and cursed information, such that one is forced to filter extremely aggressively, but blocking high-quality information sources accelerates the Twitterization by changing the ratio of blessed to garbage/cursed information. On the contrary, I think raising salience of such discussions helps clear up the "informational food chain", allowing us to map out where there are underused opportunities and toxic accumulation.
6Richard_Kennaway
It seems likely to me that Eliezer blocked you because he has concluded that you are a low-quality information source, no longer worth the effort of engaging with.
4tailcalled
I agree that this is likely Eliezer's mental state. I think this belief is false, but for someone who thinks it's true, there's of course no problem here.
6Richard_Kennaway
Please say more about this. Where can I get some?
6tailcalled
Working on writing stuff but it's not developed enough yet. To begin with you can read my Linear Diffusion of Sparse Lognormals sequence, but it's not really oriented towards practical applications.
2Richard_Kennaway
I will look forward to that. I have read the LDSL posts, but I cannot say that I understand them, or guess what the connection might be with destiny and higher powers.
2tailcalled
One of the big open questions that the LDSL sequence hasn't addressed yet is, what starts all the lognormals and why are they so commensurate with each other. So far, the best answer I've been able to come up with is a thermodynamic approach (hence my various recent comments about thermodynamics). The lognormals all originate as emanations from the sun, which is obviously a higher power. They then split up and recombine in various complicated ways.

As for destiny: The sun throws in a lot of free energy, which can be developed in various ways, increasing entropy along the way. But some developments don't work very well, e.g. self-sabotaging (fire), degenerating (parasitism leading to capabilities becoming vestigial), or otherwise getting "stuck". But it's not all developments that get stuck, some developments lead to continuous progress (sunlight -> cells -> eukaryotes -> animals -> mammals -> humans -> society -> capitalism -> ?). This continuous progress is not just accidental, but rather an intrinsic part of the possibility landscape. For instance, eyes have evolved in parallel to very similar structures, and even modern cameras have a lot in common with eyes. There's basically some developments that intrinsically unblock lots of derived developments while preferentially unblocking developments that defend themselves over developments that sabotage themselves. Thus as entropy increases, such developments will intrinsically be favored by the universe. That's destiny.

Critically, getting people to change many small behaviors in accordance with long explanations contradicts destiny because it is all about homogenizing things and adding additional constraints whereas destiny is all about differentiating things and releasing constraints.
7quetzal_rainbow
Meta-point: your communication pattern fits the following pattern: The reason why smart people find themselves in this pattern is that they expect short inferential distances, i.e., they see their argumentation not as vague esoteric crackpottery but as a set of very clear statements, and fail to put themselves in the shoes of the people who are going to read it; they especially fail to account for the fact that readers already distrust them because they started the conversation with <controversial statement>. On the object level, as stated, you are wrong. Observing a heuristic failing should decrease your confidence in the heuristic. You can argue that your update should be small, due to, say, measurement errors or strong priors, but the direction of the update should be strictly down.
2tailcalled
Can you fill in a particular example of me engaging in that pattern so we can address it in the concrete rather than in the abstract?
7quetzal_rainbow
To be clear, I mean "your communication in this particular thread". Pattern:

<controversial statement>
<this statement is false>
<controversial statement>
<this statement is false>
<mix of "this is trivially true because" and "here is my blogpost with esoteric terminology">

The following responses from EY are more in the genre of "I ain't reading this", because he is using you more as an example for other readers than talking directly to you, with the following block.
3tailcalled
This statement had two parts. Part 1: [...] And part 2: [...] Part 2 is what Eliezer said was false, but it's not really central to my point (hence why I didn't write much about it in the original thread), and so it is self-sabotaging of Eliezer to zoom into this rather than the actually informative point.
5MondSemmel
People should feel free to liberally block one another on social media. Being blocked is not enough to warrant an accusation of cultism.
8tailcalled
I did not say that simply blocking me warrants an accusation of cultism. I highlighted the fact that I had been blocked and the context in which it occurred, and then brought up other angles which evidenced cultism. If you think my views are pathetic and aren't the least bit alarmed by them being blocked, then feel free to feel that way, but I suspect there are at least some people here who'd like to keep track of how the rationalist isolation is progressing and who see merit in my positions.
1MondSemmel
Again, people block one another on social media for any number of reasons. That just doesn't warrant feeling alarmed or like your views are pathetic.
4tailcalled
We know what the root cause is, you don't have to act like it's totally mysterious. So the question is, was this root cause (pushback against Eliezer's Bayesianism):

  • An important insight that Eliezer was missing (alarming!)
  • Worthless pedantry that he might as well block (nbd/pathetic)
  • Antisocial trolling that ought to be gotten rid of (reassuring that he blocked)
  • ... or something else

Regardless of which of these is the true one, it seems informative to highlight for anyone who is keeping track of what is happening around me. And if the first one is the true one, it seems like people who are keeping track of what is happening around Eliezer would also want to know it. Especially since it only takes a very brief moment to post and link about getting blocked. Low cost action, potentially high reward.

MIRI full-time employed many critics of bayesianism for 5+ years and MIRI researchers themselves argued most of the points you made in these arguments. It is obviously not the case that critiquing bayesianism is the reason why you got blocked.

5tailcalled
Idk, maybe you've got a point, but Eliezer was very quick to insist what I said was not the mainstream view and disengage. And MIRI was full of internal distrust. I don't know enough of the situation to know if this explains it, but it seems plausible to me that the way MIRI kept stuff together was by insisting on a Bayesian approach, and that some generators of internal dissent were from people whose intuition aligned more with a non-Bayesian approach. For that matter, an important split in rationalism is MIRI/CFAR vs the Vassarites, and while I wouldn't really say the Vassarites formed a major inspiration for LDSL, after coming up with LDSL I've totally reevaluated my interpretation of that conflict as being about MIRI/CFAR using a Bayesian approach and the Vassarites using an LDSL approach. (Not absolutely of course, everyone has a mixture of both, but in terms of relative differences.)

I've been thinking about how the way to talk about how a neural network works (instead of how it could hypothetically come to work by adding new features) would be to project away components of its activations/weights, but I got stuck because of the issue where you can add new components by subtracting off large irrelevant components.

I've also been thinking about deception and its relationship to "natural abstractions", and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger "magnitude" than... (read more)

4Thomas Kwa
Much dumber ideas have turned into excellent papers
2tailcalled
True, though I think the Hessian is problematic enough that I'd either want to wait until I have something better, or want to use a simpler method. It might be worth going into more detail about that. The Hessian for the probability of a neural network output is mostly determined by the Jacobian of the network. But in some cases the Jacobian gives us exactly the opposite of what we want. If we consider the toy model of a neural network with no input neurons and only 1 output neuron g(w) = ∏_i w_i (which I imagine to represent a path through the network, i.e. a bunch of weights get multiplied along the layers to the end), then the Jacobian is the gradient (J_g(w))_j = (∇g(w))_j = ∏_{i≠j} w_i = (∏_i w_i)/w_j. If we ignore the overall magnitude of this vector and just consider how the contribution that it assigns to each weight varies over the weights, then we get (J_g(w))_j ∝ 1/w_j. Yet for this toy model, "obviously" the contribution of weight j "should" be proportional to w_j. So derivative-based methods seem to give the absolutely worst-possible answer in this case, which makes me pessimistic about their ability to meaningfully separate the actual mechanisms of the network (again they may very well work for other things, such as finding ways of changing the network "on the margin" to be nicer).
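A quick autograd check of that toy model (just the product-of-weights "path", nothing else):

    import torch

    # Toy "path" model: one output neuron equal to the product of the weights.
    w = torch.tensor([0.5, 1.0, 2.0, 4.0], requires_grad=True)
    g = w.prod()
    g.backward()

    print(w.grad)                  # tensor([8., 4., 2., 1.]) — proportional to 1/w_j
    print(g.item() / w.detach())   # same values, computed directly as (prod_i w_i) / w_j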

One thing that seems really important for agency is perception. And one thing that seems really important for perception is representation learning. Where representation learning involves taking a complex universe (or perhaps rather, complex sense-data) and choosing features of that universe that are useful for modelling things.

When the features are linearly related to the observations/state of the universe, I feel like I have a really good grasp of how to think about this. But most of the time, the features will be nonlinearly related; e.g. in order to do... (read more)

Thesis: money = negative entropy, wealth = heat/bound energy, prices = coldness/inverse temperature, Baumol effect = heat diffusion, arbitrage opportunity = free energy.

2tailcalled
Maybe this mainly works because the economy is intelligence-constrained (since intelligence works by pulling off negentropy from free energy), and it will break down shortly after human-level AGI?

Thesis: there's a condition/trauma that arises from having spent a lot of time in an environment where there's excess resources for no reasons, which can lead to several outcomes:

  • Inertial drifting in the direction implied by one's prior adaptations,
  • Conformity/adaptation to social popularity contests based on the urges above,
  • Getting lost in meta-level preparations,
  • Acting as a stickler for the authorities,
  • "Bite the hand that feeds you",
  • Tracking the resource/motivation flows present.

By contrast, if resources are contingent on a particular reason, everything takes shape according to said reason, and so one cannot make a general characterization of the outcomes.

1Mateusz Bagiński
It's not clear to me how this results from "excess resources for no reasons". I guess the "for no reasons" part is crucial here?

Thesis: the median entity in any large group never matters and therefore the median voter doesn't matter and therefore the median voter theorem proves that democracies get obsessed about stuff that doesn't matter.

2Dagon
A lot depends on your definition of "matter".  Interesting and important debates are always on margins of disagreement.  The median member likely has a TON of important beliefs and activities that are uncontroversial and ignored for most things.  Those things matter, and they matter more than 95% of what gets debated and focused on.   The question isn't whether the entities matter, but whether the highlighted, debated topics matter.

I recently wrote a post about myopia, and one thing I found difficult when writing the post was in really justifying its usefulness. So eventually I mostly gave up, leaving just the point that it can be used for some general analysis (which I still think is true), but without doing any optimality proofs.

But now I've been thinking about it further, and I think I've realized - don't we lack formal proofs of the usefulness of myopia in general? Myopia seems to mostly be justified by the observation that we're already being myopic in some ways, e.g. when train... (read more)

2Charlie Steiner
Yeah, I think usually when people are interested in myopia, it's because they think there's some desired solution to the problem that is myopic / local, and they want to try to force the algorithm to find that solution rather than some other one. E.g. answering a question based only on some function of its contents, rather than based on the long-term impact of different answers. I think that once you postulate such a desired myopic solution and its non-myopic competitors, then you can easily prove that myopia helps. But this still leaves the question of how we know this problem statement is true - if there's a simpler myopic solution that's bad, then myopia won't help (so how can we predict if this is true?) and if there's a simpler non-myopic solution that's good, myopia may actively hurt (this one seems a little easier to predict though).

Thesis: a general-purpose interpretability method for utility-maximizing adversarial search is a sufficient and feasible solution to the alignment problem. Simple games like chess have sufficient features/complexity to work as a toy model for developing this, as long as you don't rely overly much on preexisting human interpretations for the game, but instead build the interpretability from the ground-up.

The universe has many conserved and approximately-conserved quantities, yet among them energy feels "special" to me. Some speculations why:

  • The sun bombards the earth with a steady stream of free energy, which then leaves back out into the night sky.
  • Time-evolution is determined by a 90-degree rotation of energy (Schrodinger equation/Hamiltonian mechanics).
  • Breaking a system down into smaller components primarily requires energy.
  • While aspects of thermodynamics could apply to many conserved quantities, we usually apply it to energy only, and it was first discovered in the c
... (read more)
8jacob_drori
Sure, there are plenty of quantities that are globally conserved at the fundamental (QFT) level. But most of these quantities aren't transferred between objects at the everyday, macro level we humans are used to. E.g. 1: most everyday objects have neutral electrical charge (because there exist positive and negative charges, which tend to attract and roughly cancel out) so conservation of charge isn't very useful in day-to-day life. E.g. 2: conservation of color charge doesn't really say anything useful about everyday processes, since it's only changed by subatomic processes (this is again basically due to the screening effect of particles with negative color charge, though the story here is much more subtle, since the main screening effect is due to virtual particles rather than real ones). The only other fundamental conserved quantity I can think of that is nontrivially exchanged between objects at the macro level is momentum. And... momentum seems roughly as important as energy? I guess there is a question about why energy, rather than momentum, appears in thermodynamics. If you're interested, I can answer in a separate comment.
2tailcalled
At a human level, the counts for each type of atom are basically always conserved too, so it's not just a question of why not momentum but also a question of why not moles of hydrogen, moles of carbon, moles of oxygen, moles of nitrogen, moles of silicon, moles of iron, etc.. I guess for momentum in particular, it seems reasonable why it wouldn't be useful in a thermodynamics-style model because things would woosh away too much (unless you're dealing with some sort of flow? Idk). A formalization or refutation of this intuition would be somewhat neat, but I would actually more wonder, could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?
1jacob_drori
> could one replace the energy-first formulations of quantum mechanics with momentum-first formulations? Momentum is to space what energy is to time. Precisely, energy generates (in the Lie group sense) time-translations, whereas momentum generates spatial translations. So any question about ways in which energy and momentum differ is really a question about how time and space differ. In ordinary quantum mechanics, time and space are treated very differently: t is a coordinate whereas x is a dynamical variable (which happens to be operator-valued). The equations of QM tell us how x evolves as a function of t. But ordinary QM was long-ago replaced by quantum field theory, in which time and space are on a much more even footing: they are both coordinates, and the equations of QFT tell us how a third thing (the field ϕ(x,t)) evolves as a function of x and t. Now, the only difference between time and space is that there is only one dimension of the former but three of the latter (there may be some other very subtle differences I'm glossing over here, but I wouldn't be surprised if they ultimately stem from this one). All of this is to say: our best theory of how nature works (QFT), is neither formulated as "energy-first" nor as "momentum-first". Instead, energy and momentum are on fairly equal footing.
2tailcalled
I suppose that's true, but this kind of confirms my intuition that there's something funky going on here that isn't accounted for by rationalist-empiricist-reductionism. Like why are time translations so much more important for our general work than space translations? I guess because the sun bombards the earth with a steady stream of free energy, and earth has life which continuously uses this sunlight to stay out of equilibrium. In a lifeless solar system, time-translations just let everything spin, which isn't that different from space-translations.
1jacob_drori
Ah, so I think you're saying "You've explained to me the precise reason why energy and momentum (i.e. time and space) are different at the fundamental level, but why does this lead to the differences we observe between energy and momentum (time and space) at the macro-level?" This is a great question, and as with any question of the form "why does this property emerge from these basic rules", there's unlikely to be a short answer. E.g. if you said "given our understanding of the standard model, explain how a cell works", I'd have to reply "uhh, get out a pen and paper and get ready to churn through equations for several decades". In this case, one might be able to point to a few key points that tell the rough story. You'd want to look at properties of solutions of PDEs on manifolds with a metric of signature (1,3) (which means "one direction on the manifold is different to the other three, in that it carries a minus sign in the metric compared to the others"). I imagine that, generically, these solutions behave differently with respect to the "1" direction and the "3" directions. These differences will lead to the rest of the emergent differences between space and time. Sorry I can't be more specific!
2tailcalled
Why assume a reductionistic explanation, rather than a macroscopic explanation? Like for instance the second law of thermodynamics is well-explained by the past hypothesis but not at all explained by churning through mechanistic equations. This seems in some ways to have a similar vibe to the second law.
1[comment deleted]
3Noosphere89
The best answer to the question is that it serves as essentially a universal resource that can be used to provide a measuring stick. It does this by being a resource that is limited, fungible, always is better to have more of than less of, and is additive across decisions: You have a limited amount of joules of energy/negentropy, but you can spend it on essentially arbitrary goods for your utility, and is essentially a more physical and usable form of money in an economy. Also, more energy is always a positive thing, so that means you never are worse off by having more energy, and energy is linear in the sense that if I've spent 10 joules on computation, and spent another 10 joules on computation 1 minute later, I've spent 20 joules in total. Cf this post on the measuring stick of utility problem: https://www.lesswrong.com/posts/73pTioGZKNcfQmvGF/the-measuring-stick-of-utility-problem
-1tailcalled
Agree that free energy in many ways seems like a good resource to use as a measuring stick. But matter is too available and takes too much energy to make, so you can't spend it on matter in practice. So it's non-obvious why we wouldn't have a matter-thermodynamics as well as an energy-thermodynamics. I guess especially with oxygen, since it is so reactive. I guess one limitation with considering a system where oxygen serves an analogous role to sunlight (beyond such systems being intrinsically rare) is that as the oxygen reacts, it takes up elements, and so you cannot have the "used-up" oxygen leave the system again without diminishing the system. Whereas you can have photons leave again. Maybe this is just the fungibility property again, which to some extent seems like the inverse of the "breaking a system down into smaller components primarily requires energy" property (though your statement of fungibility is more general because it also considers kinetic energy).
2tailcalled
Thinking further, a key part of it is that temperature has a tendency to mix stuff together, due to the associated microscopic kinetic energy.

Thesis: the problem with LLM interpretability is that LLMs cannot do very much, so for almost all purposes "prompt X => outcome Y" is all the interpretation we can get.

Counterthesis: LLMs are fiddly and usually it would be nice to understand what ways one can change prompts to improve their effectiveness.

Synthesis: LLM interpretability needs to start with some application (e.g. customer support chatbot) to extend the external subject matter that actually drives the effectiveness of the LLM into the study.

Problem: this seems difficult to access, and the people who have access to it are busy doing their job.

1sunwillrise
I'm very confused. Can we not do LLM interpretability to try to figure out whether or where superposition holds? Is it not useful to see how SAEs help us identify and intervene on specific internal representations that LLMs generate for real-world concepts? As an outsider to interpretability, it has long been my (rough) understanding that most of the useful work in interpretability deals precisely with attempts to figure out what is going on inside the model rather than how it responds to outside prompts. So I don't know what the thesis statement refers to...
2tailcalled
I guess to clarify: Everything has an insanely large amount of information. To interpret something, we need to be able to see what "energy" (definitely literal energy, but likely also metaphorical energy) that information relates to, as the energy is more bounded and unified than the information. But that's (the thesis goes) hard for LLMs.
2tailcalled
Not really, because this requires some notion of the same vs distinct features, which is not so interesting when the use of LLMs is so brief. I don't think so since you've often got more direct ways of intervening (e.g. applying gradient updates).
1sunwillrise
I'm sorry, but I still don't really understand what you mean here. The phrase "the use of LLMs is so brief" is ambiguous to me. Do you mean to say:

  • a new, better LLM will come out soon anyway, making your work on current LLMs obsolete?
  • LLM context windows are really small, so you "use" them only for a brief time?
  • the entire LLM paradigm will be replaced by something else soon?
  • something totally different from all of the above?

But isn't this rather... prosaic and "mundane"?  I thought the idea behind these methods that I have linked was to serve as the building blocks for future work on ontology identification and ultimately getting a clearer picture of what is going on internally, which is a crucial part of stuff like Wentworth's "Retarget the Search" and other research directions like it.  So the fact that SAE-based updates of the model do not currently result in more impressive outputs than basic fine-tuning does not matter as much compared to the fact that they work at all, which gives us reason to believe that we might be able to scale them up to useful, strong-interpretability levels. Or at the very least that the insights we get from them could help in future efforts to obtain this. Kind of like how you can teach a dog to sit pretty well just by basic reinforcement, but if you actually had a gears-level understanding of how its brain worked, down to the minute details, and the ability to directly modify the circuits in its mind that represented the concept of "sitting", then you would be able to do this much more quickly, efficiently, and robustly. Am I totally off-base here?
2tailcalled
Maybe it helps if I start by giving some different applications one might want to use artificial agency for:

As a map: We might want to use the LLM as a map of the world, for instance by prompting us with data from the world and having it assist us with navigating that data. Now, the purpose of a map is to reflect as little information as possible about the world while still providing the bare minimum backbone needed to navigate the world. This doesn't work well with LLMs because they are instead trained to model information, so they will carry as much information as possible, and any map-making they do will be an accident driven by mimicking the information it's seen of mapmakers, rather than primarily as an attempt to eliminate information about the world.

As a controller: We might want to use the LLM to perform small pushes to a chaotic system at times when the system reaches bifurcations where its state is extremely sensitive, such that the system moves in a desirable direction. But again I think LLMs are so busy copying information around that they don't notice such sensitivities except by accident.

As a coder: Since LLMs are so busy outputting information instead of manipulating "energy", maybe we could hope that they could assemble a big pile of information that we could "energize" in a relevant way, e.g. if they could write a large codebase and we could then execute it on a CPU and have a program that does something interesting in the world. But in order for this to work, the program shouldn't have obstacles that stop the "energy" dead in its tracks (e.g. bugs that cause it to crash). But again the LLM isn't optimizing for doing that, it's just trying to copy information around that looks like software, and it only makes space for the energy of the CPU and the program functionality as a side-effect of that. (Or as the old saying goes, it's maximizing lines of code written, not minimizing lines of code used.)

So, that gives us the thesis: To interpret the

Thesis: linear diffusion of sparse lognormals contains the explanation for shard-like phenomena in neural networks. The world itself consists of ~discrete, big phenomena. Gradient descent allows those phenomena to make imprints upon the neural networks, and those imprints are what is meant by "shards".

... But shard theory is still kind of broken because it lacks consideration of the possibility that the neural network might have an impetus to nudge those shards towards specific outcomes.

Thesis: the openness-conscientiousness axis of personality is about whether you live as a result of intelligence or whether you live through a bias for vitality.

2Dagon
In the big five trait model of personality, those are two different axes.  Openness is inventive/curious vs consistent/cautious, and conscientiousness is efficient/organized vs extravagant/careless.   I don't see your comparison (focus on intelligence vs vitality) as single-axis either - they may be somewhat correlated, but not very closely. I'm not sure I understand the model well enough to look for evidence for or against.  But it doesn't resonate as true enough to be useful.
4tailcalled
Big Five is identified by taking the top 5 principal components among different descriptors of people, and then rotating them to be more aligned with the descriptors. Unless one strongly favors the alignment-with-descriptors as a natural criterion, this means that it is as valid to consider any linear combination of the traits as it is to consider the original traits. Mostly life needs to be focused on vitality to survive. The ability to focus on intelligence is sort of a weird artifact due to massive scarcity of intelligence, making people throw lots of resources at getting intelligence to their place. This wealth of resources allows intellectuals to sort of just stumble around without being biased towards vitality.
2Dagon
Interesting, thank you for the explanation.  I'm not sure I understand (or accept, maybe) the dichotomy between intelligence vs vitality - they seem complementary to me.  But I appreciate the discussion.
2tailcalled
There's also an openness+conscientiousness axis, which is closely related to concepts like "competence".
1Rana Dexsin
So in the original text, you meant “openness minus conscientiousness”? That was not clear to me at all; a hyphen-minus looks much more like a hyphen in that position. A true minus sign (−) would have been noticeable to me; using the entire word would have been even more obvious.
2tailcalled
Fair

Thesis: if being loud and honest about what you think about others would make you get seen as a jerk, that's a you problem. It means you either haven't learned to appreciate others or haven't learned to meet people well.

2Dagon
I think this is more general: if you're seen as a jerk, you haven't learned how to interact with people (at least the subset that sees you as a jerk). Being loud and honest about your opinions (though really, "honest" is often a cover for "cherry-picked highlights that aren't quite wrong, but are not honest full evaluations") is one way to be a jerk, but by no means the only one.
2tailcalled
Basically my model is that being silent and dishonest is a way to cover up one's lack of appreciation for others. Because being loud and honest isn't being a jerk if your loud honest opinions are "I love and respect you".

Thought: couldn't you make a lossless SAE using something along the lines of:

  • Represent the parameters of the SAE as simply a set of unit vectors for the feature directions.
  • To encode a vector using the SAE, iterate: find the most aligned feature vector, dot them to get the coefficient for that feature vector, and subtract off the scaled feature vector to get a residual to encode further

With plenty of diverse vectors, this should presumably guarantee excellent reconstruction, so the main issue is to ensure high sparsity, which could be achieved by some ... (read more)
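A minimal sketch of the encode loop described above (a greedy matching-pursuit-style procedure; the fixed iteration count and the sparsity pressure are exactly the open parts):

    import torch

    def encode(x: torch.Tensor, features: torch.Tensor, n_steps: int = 32) -> torch.Tensor:
        """Repeatedly pick the most aligned unit feature vector, record its
        coefficient, and subtract it off. `features` is (num_features, d) with
        unit-norm rows; returns coefficients such that features.T @ coeffs ≈ x."""
        coeffs = torch.zeros(features.shape[0])
        residual = x.clone()
        for _ in range(n_steps):
            scores = features @ residual           # alignment with each feature
            j = scores.abs().argmax()              # most aligned feature direction
            coeffs[j] += scores[j]                 # its coefficient
            residual -= scores[j] * features[j]    # subtract it off and continue
        return coeffs

    # Stand-in usage: random unit-norm features, random vector to encode.
    d, n_feat = 64, 256
    F = torch.nn.functional.normalize(torch.randn(n_feat, d), dim=1)
    x = torch.randn(d)
    c = encode(x, F)
    print((F.T @ c - x).norm() / x.norm())         # relative reconstruction error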

Idea: for a self-attention where you give it two prompts p1 and p2, could you measure the mutual information between the prompts using something vaguely along the lines of V1^T softmax(K1 K2^T/sqrt(dK)) V2?
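A literal transcription of that expression, just to pin down shapes (taking the softmax over the second prompt's positions is my assumption, and the result is a dV×dV matrix rather than a scalar):

    import torch

    def cross_prompt_stat(V1, K1, V2, K2):
        """V1^T softmax(K1 K2^T / sqrt(d_K)) V2, as suggested above."""
        d_k = K1.shape[-1]
        attn = torch.softmax(K1 @ K2.T / d_k**0.5, dim=-1)   # (n1, n2)
        return V1.T @ attn @ V2                              # (d_v, d_v)

    # Stand-in shapes: n1 and n2 token positions, d_k/d_v head dimensions.
    n1, n2, d_k, d_v = 7, 9, 16, 16
    out = cross_prompt_stat(torch.randn(n1, d_v), torch.randn(n1, d_k),
                            torch.randn(n2, d_v), torch.randn(n2, d_k))
    print(out.shape)   # torch.Size([16, 16])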

In the context of natural impact regularization, it would be interesting to try to explore some @TurnTrout-style powerseeking theorems for subagents. (Yes, I know he denounces the powerseeking theorems, but I still like them.)

Specifically, consider this setup: Agent U starts a number of subagents S1, S2, S3, ..., with the subagents being picked according to U's utility function (or decision algorithm or whatever). Now, would S1 seek power? My intuition says, often not! If S1 seeks power in a way that takes away power from S2, that could disadvantage U. So ... (read more)

Theory for a capabilities advance that is going to occur soon:

OpenAI is currently getting lots of novel triplets (S, U, A), where S is a system prompt, U is a user prompt, and A is an assistant answer.

Given a bunch of such triplets (S, U_1, A_1), ... (S, U_n, A_n), it seems like they could probably create a model P(S|U_1, A_1, ..., U_n, A_n), which could essentially "generate/distill prompts from examples".

This seems like the first step towards efficiently integrating information from lots of places. (Well, they could ofc also do standard SGD-based gradien... (read more)
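A hedged sketch of what the training data for such a model might look like, assuming the (U_i, A_i) pairs are simply concatenated as the input and S is the target; the tags and field names here are made up for illustration, not anything OpenAI is known to use.

    def make_prompt_inversion_example(system_prompt, exchanges):
        """Build one (input, target) pair for a model that predicts the shared
        system prompt S from several (user, assistant) exchanges made under it."""
        parts = []
        for user_msg, assistant_msg in exchanges:
            parts.append(f"<user>{user_msg}</user>\n<assistant>{assistant_msg}</assistant>")
        return {"input": "\n".join(parts), "target": system_prompt}

    # usage: one training example built from three exchanges under the same S
    example = make_prompt_inversion_example(
        "You are a terse, pirate-themed assistant.",
        [("Hi!", "Arr, greetings."),
         ("What's 2+2?", "Arr, 'tis 4."),
         ("Bye", "Fair winds.")],
    )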

2tailcalled
Actually I suppose they don't even need to add perturbations to A directly, they can just add perturbations to S and generate A's from S'. Or probably even look at users' histories to find direct perturbations to either S or A.

I recently wrote a post presenting a step towards corrigibility using causality here. I've got several ideas in the works for how to improve it, but I'm not sure which one is going to be most interesting to people. Here's a list.

  • Develop the stop button solution further, cleaning up errors, better matching the purpose, etc.

e.g.

I think there may be some variant of this that could work. Like if you give the AI reward proportional to  (where  is a reward function for ) for its current world-state (rather than picking a policy t

... (read more)

Thesis: The motion of the planets is the strongest governing factor for life on Earth.

Reasoning: Time-series data often shows strong changes with the day and night cycle, and sometimes also with the seasons. The daily cycle and the seasonal cycle are governed by the relationship between the Earth and the sun. The Earth is a planet, and so its movement is part of the motion of the planets.

4JBlack
I don't think anybody would have a problem with the statement "The motion of the planet is the strongest governing factor for life on Earth". It's when you make it explicitly plural that there's a problem.
2tailcalled
To some extent true, but consider the analogy to a thesis like "Quantum chromodynamics is the strongest governing factor for life on Earth." Is this sentence also problematic because it addresses locations and energy levels that have no relevance for Earth?
6JBlack
If you replace it with "quantum chromodynamics", then it's still very problematic but for different reasons.

Firstly, there's no obvious narrowing to equally causal factors ("motion of the planet" vs "motion of the planets") as there is in the original statement. In the original statement the use of plural instead of singular covers a much broader swath of hypothesis space, and suggests that you haven't ruled out enough to limit it to the singular. So you're communicating that you think there is significant credence that motion of more than one planet has a very strong influence on life on Earth.

Secondly, the QCD statement is overly narrow in the stated consequent instead of overly broad in the antecedent: any significant change in quantum chromodynamics would affect essentially everything in the universe, not just life on Earth. "Motion of the planet ... life on Earth" is appropriately scoped in both sides of the relation. In the absence of a context limiting the scope to just life on Earth, yes that would be weird and misleading.

Thirdly, it's generally wrong. The processes of life (and everything else based on chemistry) in physical models depend very much more strongly on the details of the electromagnetic interaction than any of the details of colour force. If some other model produced nuclei of the same charges and similar masses, life could proceed essentially unchanged.

However, there are some contexts in which it might be less problematic. In the context of evaluating the possibility of anything similar to our familiar life under alternative physical constants, perhaps. In a space of universes which are described by the same models as our best current ones but with different values of "free" parameters, it seems that some parameters of QCD may be the most sensitive in terms of whether life like ours could arise - mostly by mediating whether stars can form and have sufficient lifetime. So in that context, it may be a reasonable thing to say. But in most context

Are there good versions of DAGs for other things than causality?

I've found Pearl-style causal DAGs (and other causal graphical models) useful for reasoning about causality. It's a nice way to abstractly talk and think about it without needing to get bogged down with fiddly details.

In a way, causality describes the paths through which information can "flow". But information is not the only thing in the universe that gets transferred from node to node; there are also things like energy, money, etc., which have somewhat different properties but intuitively seem... (read more)

Linear diffusion of sparse lognormals

Think about it

I have a concept that I expect to take off in reinforcement learning. I don't have time to test it right now, though hopefully I'll find time later. Until then, I want to put it out here, either as inspiration for others, or as a "called it"/prediction, or as a way to hear critiques or learn about similar projects others might have made:

Reinforcement learning is currently trying to do stuff like learning to model the sum of future rewards, e.g. expectations using V, A and Q functions for many algorithms, or the entire probability distribution in algorithms like ... (read more)
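For reference, a minimal tabular sketch of the "model the sum of future rewards" baseline being referred to: TD(0) learning of a state-value function V on a toy chain MDP (the environment is a stand-in).

    import numpy as np

    def td0_value_learning(env_step, n_states, episodes=500, alpha=0.1, gamma=0.99):
        """Tabular TD(0): nudge V(s) toward r + gamma * V(s'), so estimates of
        distant future reward propagate backwards one transition per update."""
        V = np.zeros(n_states)
        for _ in range(episodes):
            s, done = 0, False
            while not done:
                s_next, r, done = env_step(s)
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])
                s = s_next
        return V

    # toy chain MDP: deterministic step right, reward 1 on reaching the last state
    def chain_step(s):
        s_next = s + 1
        return s_next, float(s_next == 4), s_next == 4

    print(td0_value_learning(chain_step, n_states=5))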

7Vanessa Kosoy
Downvoted because conditional on this being true, it is harmful to publish. Don't take it personally, but this is content I don't want to see on LW.
2tailcalled
Why harmful?

Because it's capability research. It shortens the TAI timeline with little compensating benefit.

3tailcalled
It's capability research that is coupled to alignment: Coupling alignment to capabilities is basically what we need to survive, because the danger of capabilities comes from the fact that capabilities is self-funding, thereby risking outracing alignment. If alignment can absorb enough success from capabilities, we survive.
1Vanessa Kosoy
I missed that paragraph on first reading, mea culpa. I think that your story about how it's a win for interpretability and alignment is very unconvincing, but I don't feel like hashing it out atm. Revised to weak downvote. Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?
2tailcalled
Surely your expectation that the current trajectory is mostly doomed depends on your expectation of the technical details of the extension of the current trajectory. If technical specifics emerge that shows the current trajectory to be going in a more alignable direction, it may be fine to accelerate.
4Vanessa Kosoy
Sure, if after updating on your discovery, it seems that the current trajectory is not doomed, it might imply accelerating is good. But, here it is very far from being the case.
4cfoster0
The "successor representation" is somewhat close to this. It encodes the distribution over future states a partcular policy expects to visit from a particular starting state, and can be learned via the Bellman equation / TD learning.
2gwern
Yes, my instant thought too was "this sounds like a variant on a successor function". Of course, the real answer is that if you are worried about the slowness of bootstrapping back value estimates or short eligibility traces, this mostly just shows the fundamental problem with model-free RL and why you want to use models: models don't need any environmental transitions to solve the use case presented.

If the MBRL agent has learned a good reward-sensitive model of the environmental dynamics, then it will have already figured out E->B and so on, or could do so offline by planning; or if it had not because it is still learning the environment model, it would have a prior probability over the possibility that E->B gives a huge amount of reward, and it can calculate a VoI and target E->B in the next episode for exploration, and on observing the huge reward, update the model, replan, and so immediately begin taking E->B actions within that episode and all future episodes, and benefiting from generalization because it can also update the model everywhere for all E->B-like paths and all similar paths (which might now suddenly have much higher VoI and be worth targeting for further exploration) rather than simply those specific states' value-estimates, and so on.

(And this is one of the justifications for successor representations: it pulls model-free agents a bit towards model-based-like behavior.)
2tailcalled
With MBRL, don't you end up with the same problem, but when planning in the model instead? E.g. DreamerV3 still learns a value function in its actor-critic reinforcement learning that occurs "in the model". This value function still needs to chain the estimates backwards.
2gwern
It's the 'same problem', maybe, but it's a lot easier to solve when you have an explicit model! You have something you can plan over, don't need to interact with an environment out in the real world, and can do things like tree search or differentiating through the environmental dynamics model to do gradient ascent on the action-inputs to maximize the reward (while holding the model fixed). Same as training the neural network, once it's differentiable - backprop can 'chain the estimates backwards' so efficiently you barely even think about it anymore. (It just holds the input and output fixed while updating the model.) Or distilling a tree search into a NN - the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that's very fast and explicit and can be distilled down into a NN forward pass.

And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero-transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*

* Why yes, all of this does sound a lot like how you train a LLM today and what it is able to do, how curious
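As a toy illustration of the "hold the learned model fixed and do gradient ascent on the action-inputs" point, here is a PyTorch sketch; the differentiable dynamics/reward model is a made-up stand-in for whatever world model an MBRL agent has actually learned.

    import torch

    # stand-in for a learned, frozen, differentiable world model: given a start
    # state and a sequence of actions, return the predicted total reward
    def predicted_return(state, actions):
        s = state
        total = torch.tensor(0.0)
        for a in actions:
            s = torch.tanh(s + a)                    # fake dynamics
            total = total - (s - 1.0).pow(2).sum()   # fake reward: stay near 1
        return total

    state = torch.zeros(4)
    actions = torch.zeros(8, 4, requires_grad=True)  # an 8-step plan to optimize
    opt = torch.optim.Adam([actions], lr=0.1)

    for _ in range(200):
        opt.zero_grad()
        loss = -predicted_return(state, actions)     # ascend reward = descend -reward
        loss.backward()                              # gradients flow through the model
        opt.step()                                   # only the actions get updated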
3tailcalled
I don't think this is true in general. Unrolling an episode for more steps takes more resources, and the later steps in the episode become more chaotic. DreamerV3 only unrolls for 16 steps.

But when you distill a tree search, you basically learn value estimates, i.e. something similar to a Q function (realistically, a V function). Thus, here you also have an opportunity to bubble up some additional information.

I'm not doubting the relevance of MBRL; I expect that to take off too. What I'm doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.
4gwern
Those are two different things. The unrolling of the episode is still very cheap. It's a lot cheaper to unroll a Dreamerv3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way... (Given how small a Dreamer is, it may even be computationally cheaper to do some gradient ascent on it than it is to run whatever simulated environment you might be using! Especially given simulated environments will increasingly be large generative models, which incorporate lots of reward-irrelevant stuff.)

The usefulness of the planning is a different thing, and might also be true for other planning methods in that environment too - if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.

This is again doing the same thing as 'the same problem'; yes, you are learning value estimates, but you are doing so better than alternatives, and better is better. The AlphaGo network loses to the AlphaZero network, and the latter, in addition to just being quantitatively much better, also seems to have qualitatively different behavior, like fixing the 'delusions' (cf. AlphaStar).

They won't be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don't find successor-function-like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer in these sorts of approaches into DRL agents. These things should be learned. For example, leaning into Decision Transformers and using a lot more conditionalizing through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn't solving your problems, you're just not using enough conditioning or generative modeling.) A prompt can des
2tailcalled
But I'm not advocating against MBRL, so this isn't the relevant counterfactual. A pure MBRL-based approach would update the value function to match the rollouts, but e.g. DreamerV3 also uses the value function in a Bellman-like manner to e.g. impute the future reward at the end of an episode. This allows it to plan for further than the 16 steps it rolls out, but it would be computationally intractable to roll out for as far as this ends up planning.

It's possible for there to be a kind of chaos where the analytic gradients blow up yet discrete differences have predictable effects. Bifurcations, etc.

I agree with things needing to be learned; using the actual states themselves was more of a toy model (because we have mathematical models for MDPs but we don't have mathematical models for "capabilities researchers will find something that can be Learned"), and I'd expect something else to happen. If I was to run off to implement this now, I'd be using learned embeddings of states, rather than states themselves. Though of course even learned embeddings have their problems.

The trouble with just saying "let's use decision transformers" is twofold. First, we still need to actually define the feedback system. One option is to just define reward as the feedback, but as you mention, that's not nuanced enough. You could use some system that's trained to mimic human labels as the ground truth, but this kind of system has flaws for standard alignment reasons. It seems to me that capabilities researchers are eventually going to find some clever feedback system to use. It will to a great extent be learned, but they're going to need to figure out the learning method too.
2tailcalled
Thanks for the link! It does look somewhat relevant. But I think the weighting by reward (or other significant variables) is pretty important, since it generates a goal to pursue, making it emphasize things that can be achieved rather than just things that might randomly happen. Though this makes me think about whether there are natural variables in the state space that could be weighted by, without using reward per se. E.g. the size of (s' - s) in some natural embedding, or the variance in s' over all the possible actions that could be taken. Hmm. 🤔