All of tgb's Comments + Replies

tgb20

I'm not as concerned about your points because there are a number of projects already doing something similar and (if you believe them) succeeding at it. Here's a paper comparing some of them: https://www.biorxiv.org/content/10.1101/2025.02.11.637758v2.full

tgb20

ML models can take more data as input. In particular, the genomic sequence is not a predictor used in LASSO regression models: the variants are just arbitrarily coded as a 0, 1, or 2 alternative allele count. The LASSO models have limited ability to pool information across variants or across data modes. ML models like this one can (in theory) predict the effects of variants on data like RNA-sequencing (which shows which genes are actively being transcribed) based just on their sequence. That information is effectively pooled across variants and ties genomic s... (read more)
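For readers unfamiliar with the coding tgb describes, here is a minimal sketch (not from the comment; the names and data are made up) of a LASSO model that sees each variant only as a 0/1/2 allele count, which is the limitation being contrasted with sequence-based ML models.

# Minimal sketch: a LASSO polygenic-score-style model where each variant enters
# only as a 0/1/2 alternative-allele count, so the model never sees the
# underlying genomic sequence at all. Synthetic data, hypothetical names.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_people, n_variants = 500, 1000

# Genotype dosage matrix: entries are 0, 1, or 2 copies of the alternative allele.
genotypes = rng.binomial(2, 0.3, size=(n_people, n_variants))

# Simulate a trait driven by a handful of causal variants plus noise.
causal_effects = np.zeros(n_variants)
causal_effects[:20] = rng.normal(size=20)
trait = genotypes @ causal_effects + rng.normal(size=n_people)

model = Lasso(alpha=0.1).fit(genotypes, trait)
print("non-zero coefficients:", np.sum(model.coef_ != 0))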

tgb20

This is a lovely little problem, so thank you for sharing it. I thought at first it would be [a different problem](https://www.wolfram.com/mathematica/new-in-9/markov-chains-and-queues/coin-flip-sequences.html) that's similarly paradoxical.

tgb20

Again, why wouldn't you want to read things addressed to other sorts of audiences if you thought altering public opinion on that topic was important? Maybe you don't care about altering public opinion but a large number of people here say they do care.

2Richard_Kennaway
I just don’t think David Brooks, from what I know of him, is worth spending any time on. The snippets I could access at the NYT give no impression of substance. The criticisms of him on Wikipedia are similar to those I have already seen on Andrew Gelman’s blog: he is more concerned to write witty, urbane prose without much concern for actual truth than to do the sort of thing that, say, Scott Alexander does. Btw, I have not voted positively or negatively on the OP.
tgb20

He's influential and it's worth knowing what his opinion is because it will become the opinion of many of his readers. He's also representative of what a lot of other people are (independently) thinking.

What's Scott Alexander qualified to comment on? Should we not care about the opinion of Joe Biden because he has no particular knowledge about AI? Sure, I doubt we learn anything from rebutting his arguments, but once upon a time LW cared about changing the public opinion on this matter and so should absolutely care about reading that public opinion.

Honestly, I'm embarrassed for us that this needs to be said.

3Richard_Kennaway
Scott Alexander is, obviously, qualified to write on psychology, psychiatry, related pharmaceuticals, and the ways that US government agencies screw up everything they touch in those areas. When writing outside his professional expertise, he takes care to read thoroughly, lay out his evidence, cite sources, and say how confident he is in the conclusions he draws. I see none of this in David Brooks' article. He is writing sermons to the readership of the NYT. They are not addressed to the sort of audience we have here. I doubt that his audience are likely to read LessWrong.
tgb42

But you don’t need grades to separate yourself academically. You take harder classes to do that. And incentivizing GPA again will only punish people for taking actual classes instead of sticking to easier ones they can get an A in.

Concretely, everyone in my math department that was there to actually get an econ job took the basic undergrad sequences and everyone looking to actually do math started with the honors (“throw you in the deep end until you can actually write a proof”) course and rapidly started taking graduate-level courses. The difference on th... (read more)

tgb10

I was confused until I realized that the "sparsity" that this post is referring to is activation sparsity not the more common weight sparsity that you get from L1 penalization of weights.

tgb40

Wait why do you think inmates escaping is extremely rare? Are you just referring to escapes where guards assisted the escape? I work in a hospital system and have received two security alerts in my memory where a prisoner receiving medical treatment ditched their escort and escaped. At least one of those was on the loose for several days. I can also think of multiple escapes from prisons themselves, for example: https://abcnews.go.com/amp/US/danelo-cavalcante-murderer-escaped-pennsylvania-prison-weeks-facing/story?id=104856784 notable since the prisoner wa... (read more)

tgb130

I have some reservations about the practicality of reporting likelihood functions and have never done this before, but here are some (sloppy) examples in Python, primarily answering questions 1 and 3.
 

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib
import pylab

np.random.seed(100)

## Generate some data for a simple case vs control example
# 10 vs 10 replicates with a 1 SD effect size
controls = np.random.normal(size=10)
cases = np.random.normal(size=10) + 1
data = pd.DataFrame(
    {
        "group": ["con
... (read more)
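The code above is cut off by the snippet view. Below is a hypothetical continuation, not the original code: the column names "group" and "value" and the hand-computed profile likelihood are my guesses, sketching one way to report a likelihood function for the case-vs-control effect size rather than only a point estimate and p-value.

# Hypothetical continuation (the original is truncated above): a profile
# log-likelihood for the case-vs-control effect size under a normal model.
# Column names "group"/"value" are assumptions, not the original's.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(100)
controls = np.random.normal(size=10)
cases = np.random.normal(size=10) + 1
data = pd.DataFrame({
    "group": ["control"] * 10 + ["case"] * 10,
    "value": np.concatenate([controls, cases]),
})

# For each candidate effect size delta: shift the cases down by delta, then plug
# in the maximum-likelihood mean and variance of the shifted data.
deltas = np.linspace(-1, 3, 201)
loglik = []
n = len(data)
is_case = (data["group"] == "case").to_numpy()
for delta in deltas:
    shifted = data["value"].to_numpy() - delta * is_case
    sigma2 = shifted.var()  # MLE of the variance (divides by n)
    loglik.append(-0.5 * n * (np.log(2 * np.pi * sigma2) + 1))
loglik = np.array(loglik)

plt.plot(deltas, loglik - loglik.max())
plt.xlabel("effect size (case minus control)")
plt.ylabel("log-likelihood relative to maximum")
plt.show()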
tgb50

But yes, working out is mostly unpleasant and boring as hell as we conceive of it and we need to stop pretending otherwise. Once we agree that most exercise mostly bores most people who try it out of their minds, we can work on not doing that.

 

I'm of the nearly opposite opinion: we pretend that exercise ought to be unpleasant. We equate exercise with elite or professional athletes and the vision of needing to push yourself to the limit, etc. In reality, exercise does include that but for most people should look more like "going for a walk" than "doing... (read more)

tgb60

The American Heart Association (AHA) Get with the Guidelines–Heart Failure Risk Score predicts the risk of death in patients admitted to the hospital.9 It assigns three additional points to any patient identified as “nonblack,” thereby categorizing all black patients as being at lower risk. The AHA does not provide a rationale for this adjustment. Clinicians are advised to use this risk score to guide decisions about referral to cardiology and allocation of health care resources. Since “black” is equated with lower risk, following the guidelines could dire

... (read more)
tgb22

Wegovy (a GLP-1 antagonist)

Wegovy/Ozempic/Semaglutide are GLP-1 receptor agonists, not GLP-1 antagonists. This means they activate the GLP-1 receptor, which GLP-1 also does. So it's more accurate to say that they are GLP-1 analogs, which makes calling them "GLP-1s" reasonable even though that's not really accurate either.

2Zvi
Yep, whoops, fixing.
tgb20

Broccoli is higher in protein content per calorie than either beans or pasta and is a very central example of a vegetable, though you'd also want to mix it with beans or something for a better protein quality. 3500 calories of broccoli is 294g protein, if Google's nutrition facts are to be trusted. Spinach, kale, and cauliflower all also have substantially better protein per calorie than potatoes and better PDCAAS scores than I expected (though I'm not certain I trust them - does spinach actually get a 1?). I think potatoes are a poor example (and also no... (read more)

1Olli Savolainen
Not to rain on any parades... but don't eat spinach, guys. If you try to fix joint pains by getting more protein from kilograms of spinach or kale, you will be severely disappointed. I'm talking about oxalic acid. See my comment. It is more likely though that you will get kidney injury or kidney stones as a first symptom. Some people have died of imbibing big green smoothies, which presumably contained spinach. Everyone knows rhubarb is bad because of oxalic acid. Spinach contains the same stuff in high concentrations.
3DirectedEvolution
True! I am a broccoli fan. Just to put a number on it, to get the proposed 160g of protein per day, you’d have to eat 5.6 kg of broccoli, or well over 10 lb.
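A quick arithmetic check of the figures in this exchange, using assumed per-100 g values for raw broccoli (about 34 kcal and 2.8 g protein, roughly what nutrition databases report); it lands close to the 294 g and 5.6 kg numbers quoted above.

# Rough check of the protein-per-calorie numbers, using assumed values of about
# 34 kcal and 2.8 g protein per 100 g of raw broccoli (figures vary by source).
kcal_per_100g = 34
protein_per_100g = 2.8

grams_for_3500_kcal = 3500 / kcal_per_100g * 100
print(f"{grams_for_3500_kcal / 1000:.1f} kg of broccoli for 3500 kcal")
print(f"about {grams_for_3500_kcal / 100 * protein_per_100g:.0f} g protein")

grams_for_160g_protein = 160 / protein_per_100g * 100
print(f"{grams_for_160g_protein / 1000:.1f} kg of broccoli for 160 g protein")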
tgb51

In my view, it's a significant philosophical difference between SLT and your post that your post talks only about choosing macrostates while SLT talks about choosing microstates. I'm much less qualified to know (let alone explain) the benefits of SLT, though I can speculate. If we stop training after a finite number of steps, then I think it's helpful to know where it's converging to. In my example, if you think it's converging to , then stopping close to that will get you a function that doesn't generalize too well. If you know it's converging t... (read more)

tgb40

Everything I wrote in steps 1-4 was done in a discrete setting (otherwise  is not finite and the whole thing falls apart). I was intending  to be pairs of floating point numbers and  to be floats to floats.

However, using that I think I see what you're trying to say. Which is that  will equal zero for some cases where  and  are both non-zero but very small and will multiply down to zero due to the limits of floating point numbers. Therefore the pre-image of  is actua... (read more)

4Ege Erdil
Sure, I agree that I didn't put this information into the post. However, why do you need to know which θ is more likely to know anything about e.g. how neural networks generalize? I understand that SLT has some additional content beyond what is in the post, and I've tried to explain how you could make that fit in this framework. I just don't understand why that additional content is relevant, which is why I left it out. As an additional note, I wasn't really talking about floating point precision being the important variable here. I'm just saying that if you want A-complexity to match the notion of real log canonical threshold, you have to discretize SLT in a way that might not be obvious at first glance, and in a way where some conclusions end up being scale-dependent. This is why if you're interested in studying this question of the relative contribution of singular points to the partition function, SLT is a better setting to be doing it in. At the risk of repeating myself, I just don't know why you would try to do that.
tgb30

the worse a singularity is, the lower the A-complexity of the corresponding discrete function will turn out to be

This is where we diverge. Please let me know where you think my error is in the following. Returning to my explicit example (though I wrote  originally but will instead use  in this post since that matches your definitions).

1. Let   be the constant zero function and  

2. Observe that  is the minimal loss set under our loss function and also  is the set of... (read more)

2Ege Erdil
You need to discretize the function before taking preimages. If you just take preimages in the continuous setting, of course you're not going to see any of the interesting behavior SLT is capturing. In your case, let's say that we discretize the function space by choosing which one of the functions g_k(x) = kηx you're closest to for some η > 0. In addition, we also discretize the codomain of A by looking at the lattice (εZ)² for some ε > 0. Now, you'll notice that there's a radius ∼√η disk around the origin which contains only functions mapping to the zero function, and as our lattice has fundamental area ε² this means the "relative weight" of the singularity at the origin is like O(η/ε²). In contrast, all other points mapping to the zero function only get a relative weight of O(η/(kε²)) where kε is the absolute value of their nonzero coordinate. Cutting off the domain somewhere to make it compact and summing over all kε > √η to exclude the disk at the origin gives O(√η/ε) for the total contribution of all the other points in the minimum loss set. So in the limit η/ε² → 0 the singularity at the origin accounts for almost everything in the preimage of A. The origin is privileged in my picture just as it is in the SLT picture. I think your mistake is that you're trying to translate between these two models too literally, when you should be thinking of my model as a discretization of the SLT model. Because it's a discretization at a particular scale, it doesn't capture what happens as the scale is changing. That's the main shortcoming relative to SLT, but it's not clear to me how important capturing this thermodynamic-like limit is to begin with. Again, maybe I'm misrepresenting the actual content of SLT here, but it's not clear to me what SLT says aside from this, so...
tgb*40

The effective dimension of the singularity near the origin is much higher, e.g. because near every other minimal point of this loss function the Hessian doesn't vanish, while for the singularity at the origin it does vanish. If you discretized this setup by looking at it with a lattice of mesh ε, say, you would notice that the origin is surrounded by many parameters that give nearly identical loss, while near other parts of the space the number of such parameters is far fewer.

As I read it, the arguments you make in the original post depend only on the... (read more)

2Ege Erdil
I'm not too sure how to respond to this comment because it seems like you're not understanding what I'm trying to say. I agree there's some terminology mismatch, but this is inevitable because SLT is a continuous model and my model is discrete. If you want to translate between them, you need to imagine discretizing SLT, which means you discretize both the codomain of the neural network and the space of functions you're trying to represent in some suitable way. If you do this, then you'll notice that the worse a singularity is, the lower the A-complexity of the corresponding discrete function will turn out to be, because many of the neighbors map to the same function after discretization. The content that SLT adds on top of this is what happens in the limit where your discretization becomes infinitely fine and your dataset becomes infinitely large, but your model doesn't become infinitely large. In this case, SLT claims that the worst singularities dominate the equilibrium behavior of SGD, which I agree is an accurate claim. However, I'm not sure what this claim is supposed to tell us about how NNs learn. I can't make any novel predictions about NNs with this knowledge that I couldn't before.
tgb52

Here's a concrete toy example where SLT and this post give different answers (SLT is more specific). Let .  And let . Then the minimal loss is achieved at the set of parameters where  or  (note that this looks like two intersecting lines, with the singularity being the intersection). Note that all  in that set also give the same exact . The theory in your post here doesn't say much beyond the standard point that gradient descent will (likely) select a minim... (read more)

2Ege Erdil
I don't think this representation of the theory in my post is correct. The effective dimension of the singularity near the origin is much higher, e.g. because near every other minimal point of this loss function the Hessian doesn't vanish, while for the singularity at the origin it does vanish. If you discretized this setup by looking at it with a lattice of mesh ε, say, you would notice that the origin is surrounded by many parameters that give nearly identical loss, while near other parts of the space the number of such parameters is far fewer. The reason you have to do some kind of "translation" between the two theories is that SLT can see not just exactly optimal points but also nearly optimal points, and bad singularities are surrounded by many more nearly optimal points than better-behaved singularities. You can interpret the discretized picture above as the SLT picture seen at some "resolution" or "scale" ε, i.e. if you discretized the loss function by evaluating it on a lattice with mesh ε you get my picture. Of course, this loses the information of what happens as ε→0 and n→∞ in some thermodynamic limit, which is what you recover when you do SLT. I just don't see what this thermodynamic limit tells you about the learning behavior of NNs that we didn't know before. We already know NNs approximate Solomonoff induction if the A-complexity is a good approximation to Kolmogorov complexity and so forth. What additional information is gained by knowing what A looks like as a smooth function as opposed to a discrete function? In addition, the strong dependence of SLT on A being analytic is bad, because analytic functions are rigid: their value in a small open subset determines their value globally. I can see why you need this assumption because quantifying what happens near a singularity becomes incredibly difficult for general smooth functions, but because of the rigidity of analytic functions the approximation that "we can just pretend NNs are analytic" is mor
tgb31

I guess the unstated assumption is that the prisoners can only see the temperatures of others from the previous round and/or can only change their temperature at the start of a round (though one tried to do otherwise in the story). Even with that, it seems like an awfully precarious equilibrium: if I unilaterally start choosing 30 repeatedly, you'd have to be stupid not to also start choosing 30, and the cost to me is really quite tiny even while no one else ever 'defects' alongside me. It seems too weak a definition of 'equilibrium' if it's that easy to break - maybe there's a more realistic definition that excludes this case?

6Ericf
The other thing that could happen is silent deviations, where some players aren't doing "punish any defection from 99" - they are just doing "play 99" to avoid punishments. The one brave soul doesn't know how many of each there are, but can find out when they suddenly go for 30.
tgb30

I don't think the 'strategy' used here (set to 99 degrees unless someone defects, then set to 100) satisfies the "individual rationality condition". Sure, when everyone is setting it to 99 degrees, it beats the minmax strategy of choosing 30. But once someone chooses 30, the minmax for everyone else is now to also choose 30 - there's no further punishment that will or could be given. So the behavior described here, where everyone punishes the 30, is worse than minmaxing. At the very least, it would be an unstable equilibrium that would have broken down in the situation described - and knowing that would give everyone an incentive to 'defect' immediately.

1EOC
The 'individual rationality condition' is about the payoffs in equilibrium, not about the strategies. It says that the equilibrium payoff profile must yield to each player at least their minmax payoff. Here, the minmax payoff for a given player is -99.3 (which comes from the player best responding with 30 forever to everyone else setting their dials to 100 forever).  The equilibrium payoff is -99 (which comes from everyone setting their dials to 99 forever). Since -99 > -99.3, the individual rationality condition of the Folk Theorem is satisfied. 
3jessicata
After someone chooses 30 once, they still get to choose something different in future rounds. In the strategy profile I claim is a Nash equilibrium, they'll set it to 100 next round like everyone else. If anyone individually deviates from setting it to 100, then the equilibrium temperature in the next round will also be 100. That simply isn't worth it, if you expect to be the only person setting it less than 100. Since in the strategy profile I am constructing everyone does set it to 100, that's the condition we need to check to check whether it's a Nash equilibrium.
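A quick check of the payoffs quoted above, under assumptions consistent with those numbers (they are not stated explicitly in this thread): 100 players, the room temperature each round is the average of the dials, and each player's per-round payoff is minus that temperature.

# Payoff check for the dial game, under assumptions consistent with the -99 and
# -99.3 figures quoted above: 100 players, temperature = average of all dials,
# per-round payoff = minus the temperature.
n_players = 100

def temperature(dials):
    return sum(dials) / len(dials)

# Everyone plays 99 (the candidate equilibrium): payoff -99 per round.
everyone_99 = [99] * n_players
print(-temperature(everyone_99))  # -99.0

# Minmax: one player best-responds with 30 while everyone else plays 100.
one_deviator = [30] + [100] * (n_players - 1)
print(-temperature(one_deviator))  # -99.3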
tgb40

Ah, I didn’t understand what “first option” meant either.

tgb30

The poll appears to be asking two opposite questions. I'm not clear on whether 99% means it will be a transformer or whether it means something else is needed to get there.

1Algon
Yeah, I messed up the question. But that's why I said "Give your probability for the first option" in the post.
tgb30

Thank you. I was completely missing that they used a second 'preference' model to score outputs for the RL. I'm surprised that works!

Answer by tgb*40

A lot of team or cooperative games where communication is disallowed and information is limited have aspects of Schelling points. Hanabi is a cooperative card game that encourages using Schelling points. Though higher levels of play require players to establish ahead of time a set of rules for what each possible action is meant to communicate, which rather diminishes that aspect of the game. Arguably bridge is in a similar position with partners communicating via bidding.

tgb30

Is there a primer on the difference between training LLMs and doing RLHF on those LLMs post-training? They both seem fundamentally to be doing the same thing: move the weights in the direction that increases the likelihood that they output the given text. But I gather that there are some fundamental differences in how this is done and that RLHF isn't quite a second training round done on hand-curated datapoints.

3aog
Some links I think do a good job: https://huggingface.co/blog/rlhf https://openai.com/research/instruction-following
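Not from the linked posts: a toy, schematic contrast between the two kinds of update. Pretraining nudges the weights toward the likelihood of observed text; RLHF samples from the model, scores the sample with a separate preference/reward model, and nudges the weights with a reward-weighted log-probability (REINFORCE-style) gradient. Real RLHF trains the reward model on human comparisons and uses PPO-style updates with extra constraints; none of that detail is modeled here, and all names are hypothetical.

# Toy, schematic contrast (not a real RLHF implementation): both stages nudge
# the same parameters, but pretraining follows the likelihood of observed text
# while RLHF follows a learned/assumed reward ("preference") signal.
import numpy as np

rng = np.random.default_rng(0)
responses = ["helpful answer", "rambling answer", "rude answer"]
logits = np.zeros(3)            # the "policy" over a few canned responses

def probs(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# --- Pretraining-style step: maximize likelihood of the observed continuation.
observed = 1                    # the corpus happened to contain the rambling answer
grad = -probs(logits)
grad[observed] += 1.0           # gradient of log-softmax at the observed index
logits += 0.5 * grad

# --- RLHF-style step: sample, score with a preference model, reinforce.
reward_model = {0: 1.0, 1: -0.2, 2: -1.0}   # stand-in for a learned reward model
sample = rng.choice(3, p=probs(logits))
grad = -probs(logits)
grad[sample] += 1.0
logits += 0.5 * reward_model[sample] * grad  # REINFORCE: reward-weighted log-prob gradient

print(dict(zip(responses, probs(logits).round(3))))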
tgb50

Sounds plausible but this article is evidence against the striatum hypothesis: Region-specific Foxp2 deletions in cortex, striatum or cerebellum cannot explain vocalization deficits observed in spontaneous global knockouts


In short, they edited mice to have Foxp2 deleted in only specific regions of the brain, one of them being striatum. But those mice didn't have the 'speech' defects that mice with whole-body Foxp2 knock-outs showed. So Foxp2's action outside of the striatum seems to play a role. They didn't do a striatum+cerebellum knock-out, though, so it could still be those two jointly (but not individually) causing the problem.

5Steven Byrnes
Interesting!! That paper repeatedly does the annoyingly-common thing of conflating “our tests didn’t find any group differences that pass the p<0.05 threshold” with “our tests positively confirm that there are no group differences”. 😠 I’m not saying they’re necessarily wrong about that, I’m just complaining. No, actually, I will complain, and they are wrong. Their tests show that, in global knockouts (well, “spontaneous deletion”, but I think it amounts to the same thing?), ultrasonic vocalizations (USV) go significantly down while “click” vocalizations go significantly up. So at that point, one would think that they should do 1-sided t-tests on the region-specific knockouts to see if USVs go down and if clicks go up. But instead they did 1-sided tests to see if USVs go down and if clicks go down. And the clicks actually went up (who would have guessed!), which they reported as “click rates [were not] affected [with] p=0.99”. 🤦🤦🤦 I guess that they were trying to avoid p-hacking by pre-committing to which tests they’d run, which I suppose is admirable, but I still think they were being pretty boneheaded here!! (Unless I misunderstand. I was skimming.) Anyway, the USV result does indeed look like “no change”, but clicks went up (with “p=0.99” on the incorrectly-oriented 1-sided t-test, which really means p=0.01) for both the Purkinje- & striatum-specific knockouts. (Not cortex.) That still leaves the USV question. The hypothesis that FOXP2 impacts USV squeaking in mice via the lungs still seems to me like a live possibility, in which case FOXP2→mouse-USV-squeaking would be totally disanalogous to FOXP2→human-speech, I think, i.e. just a funny coincidence. Hmm. The striatum+cerebellum interaction thing you mentioned is also possible AFAIK. (I have somewhat more confidence that FOXP2-affects-bird-vocalization ↔ FOXP2-affects-human speech is mechanistically analogous, than that FOXP2-affects-mouse-vocalization ↔ FOXP2-affects-human speech is mechanistically ana
tgb20

I gave one example of the “work” this does: that GPT performs better when prompted to reason first rather than state the answer first. Another  example is: https://www.lesswrong.com/posts/bwyKCQD7PFWKhELMr/by-default-gpts-think-in-plain-sight

On the contrary, you mainly seem to be claiming that thinking of LLMs as working one token at a time is misleading, but I’m not sure I understand any examples of misleading conclusions that you think people draw from it. Where do you think people go wrong?

1Bill Benzon
Over there in another part of the universe there are people who are yelling that LLMs are "stochastic parrots." Their intention is to discredit LLMs as dangerous evil devices. Not too far away from those folks are those saying it's "autocomplete on steroids." That's only marginally better. Saying LLMs are "next word predictors" feeds into that. Now, I'm talking about rhetoric here, not intellectual substance. But rhetoric matters. There needs to be a better way of talking about these devices for a general audience.
tgb113

Suppose I write the first half of a very GPT-esque story. If I then ask GPT to complete that story, won't it produce exactly the same structure as always? If so, how can you say that came from a plan - it didn't write the first half of the story! That's just what stories look like. Is that more surprising than a token predictor getting basic sentence structure correct?

For hidden thoughts, I think this is very well defined. It won't be truly 'hidden', since we can examine every node in GPT, but we know for a fact that GPT is purely a function of the curren... (read more)

1Nikola Smolenski
Perhaps you could simply ask ChatGPT? "Please tell me a story without making any plans about the story beforehand." vs "Please make a plan for a story, then tell me the story, and attach your plan at the end of the story." Will the resulting stories differ, and how? My prediction: the plan attached at the end of the story won't be very similar to the actual story.
2Bill Benzon
For hidden thoughts, I think this is very well defined. Not for humans, and that's what I was referring to. Sorry about the confusion.  "Thought" is just a common-sense idea. As far as I know, we don't have a well-defined concept of that that's stated in terms of brain states. Now, I believe Walter Freeman has conjectured that thoughts reflect states of global coherence across a large swath of cortex, perhaps a hemisphere, but that's a whole other intellectual world. If so, how can you say that came from a plan - it didn't write the first half of the story! But it read it, no? Why can't it complete it according to its "plan", since it has no way of knowing the intentions of the person who wrote the first half?  Let me come at this a different way. I don't know how many times I've read articles of the "computers for dummies" type where it said it's all just ones and zeros. And that's true. Source code may be human-readable, but when it's compiled all the comments are stripped out and the rest is converted to ones and zeros. What does that tell you about a program? It depends on your point of view and what you know. From a very esoteric and abstract point of view, it tells you a lot. From the point of view of someone reading Digital Computing for Dummies, it doesn't tell them much of anything. I feel a bit like that about the assertion that LLMs are just next-token-predictors. Taking that in conjunction with the knowledge that they're trained on zillions of tokens of text, those two things put together don't tell you much either. If those two statements were deeply informative, then mechanistic interpretation would be trivial. It's not. Saying that LLMs are next-token predictors puts a kind of boundary on mechanistic interpretation, but it doesn't do much else. And saying it was trained on all these texts, that doesn't tell you much about the structure the model has picked up. What intellectual work does that statement do?
tgb3932

Maybe I don't understand what exactly your point is, but I'm not convinced. AFAIK, it's true that GPT has no state outside of the list of tokens so far. Contrast to your jazz example, where you, in fact, have hidden thoughts outside of the notes played so-far. I think this is what Wolfram and others are saying when they say that "GPT predicts the next token". You highlight "it doesn’t have a global plan about what’s going to happen" but I think a key point is that whatever plan it has, it has to build it up entirely from "Once upon a" and then again, from ... (read more)

1Max Loh
Whether it has a global "plan" is irrelevant as long as it behaves like someone with a global plan (which it does). Consider the thought experiment where I show you a block of text and ask you to come up with the next word. After you come up with the next word, I rewind your brain to before the point where I asked you (so you have no memory of coming up with that word) and repeat ad infinitum. If you are skeptical of the "rewinding" idea, just imagine a simulated brain and we're restarting the simulation each time. You couldn't have had a global plan because you had no memory of each previous step. Yet the output would still be totally logical. And as long as you're careful about each word choice at each step, it is scientifically indistinguishable from someone with a "global plan". That is similar to what GPT is doing.
3oreghall
I believe the primary point is to dissuade people that are dismissive of LLM intelligence. Predicting the next token is not as simple as it sounds, it requires not only understanding the past but also consideration of the future. The fact it re-imagines this future every token it writes is honestly even more impressive, though it is clearly a limitation in terms of keeping a coherent idea. 
JBlack117

I don't think the human concept of 'plan' is even a sensible concept to apply here. What it has is in many ways very much like a human plan, and in many other ways utterly unlike a human plan.

One way in which you could view them as similar is that just as there is a probability distribution over single token output (which may be trivial for zero temperature), there is a corresponding probability distribution over all sequences of tokens. You could think of this distribution as a plan with decisions yet to be made. For example, there may be some small possi... (read more)

1Bill Benzon
""Once upon a time," and could well change dramatically at "Once upon a time, a" even if " a" was its predicted token. That's very different from what we think of as a global plan that a human writing a story makes." Why does it tell the same kind of story every time I prompt it: "Tell me a story"? And I'm talking about different sessions, not several times in one session. It takes a trajectory that has same same segments. It starts out giving initial conditions. Then there is some kind of disturbance. After that the protagonist thinks and plans and travels to the point of the disturbance. We then have a battle, with the protagonist winning. Finally, there is a celebration. That looks like a global plan to me. Such stories (almost?) always have fantasy elements, such as dragons, or some magic creature. If you want to eliminate those, you can do so: "Tell me a realistic story." "Tell me a sad story," is a different kind of story. And if you prompt it with: "Tell me a true story", that's still different, often very short, only a paragraph or three. I'm tempted to say, "forget about a human global plan," but, I wonder. The global plan a human makes is, after all, a consequence of that person's past actions. Such a global plan isn't some weird emanation from the future.  Furthermore, it's not entirely clear why a person's 'hidden thoughts' should differentiate us from an humongous LLM. Just what do you mean by 'hidden' thoughts? Where do they hide? Under the bed, in the basement, perhaps somewhere deep in the woods, maybe? I'm tempted to say that there are no such things as hidden thoughts, that's just a way of talking about something we don't understand.
tgb31

If you choose heads, you either win $2 (ie win $1 twice) or lose $1. If you choose tails then you either win $1 or lose $2. It’s exactly the same as the Sleeping Beauty problem with betting, just you have to precommit to a choice of heads/tail ahead of time. Sorry that this situation is weird to describe and unclear.

tgb10

Yes, exactly. You choose either heads or tails. I flip the coin. If it's tails and matches what you chose, then you win $1 otherwise lose $1. If it's heads and matches what you chose, you win $2 otherwise you lose $2. Clearly you will choose heads in this case, just like the Sleeping Beauty when betting every time you wake up. But you choose heads because we've increased the payout not the probabilities.

1green_leaf
Both options have the expected value equal to zero though? (0.5⋅1−0.5⋅1 versus 0.5⋅2−0.5⋅2.)
tgb10

And here are examples that I don't think that rephrasing as betting resolves:

Convinced by the Sleeping Beauty problem, you buy a lottery ticket and set up a robot to put you to sleep and then, if the lottery ticket wins, wake you up 1 billion times, and if not just wake you up once. You wake up. What is the expected value of the lottery ticket you're holding? You knew ahead of time that you will wake up at least once, so did you just game the system? No, since I would argue that this system is better modeled by the Sleeping Beauty problem when you get only... (read more)

1green_leaf
By executing the bet twice, do you mean I lose/win twice as much money as I'd otherwise lost/won?
tgb30

You're right that my construction was bad. But the number of bets does matter. Suppose instead that we're both undergoing this experiment (with the same coin flip simultaneously controlling both of us). We both wake up and I say, "After this is over, I'll pay you 1:1 if the coin was a heads." Is this deal favorable and do you accept? You'd first want to clarify how many times I'm going to pay out if we have this conversation two days in a row.  (Does promising the same deal twice mean we just reaffirmed a single deal or that we agreed to two separate, id... (read more)

1tgb
And here are examples that I don't think that rephrasing as betting resolves: Convinced by the Sleeping Beauty problem, you buy a lottery ticket and set up a robot to put you to sleep and then, if the lottery ticket wins, wake you up 1 billion times, and if not just wake you up once. You wake up. What is the expected value of the lottery ticket you're holding? You knew ahead of time that you will wake up at least once, so did you just game the system? No, since I would argue that this system is better modeled by the Sleeping Beauty problem when you get only a single payout regardless of how many times you wake up.  Or: if the coin comes up heads, then you and your memories get cloned. When you wake up you're offered the deal on the spot 1:1 bet on the coin. Is this a good bet for you? (Your wallet gets cloned too, let's say.) That depends on how you value your clone receiving money. But why should P(H|awake) be different in this scenario than in Sleeping Beauty, or different between people who do value their clone versus people who do not? Or: No sleeping beauty shenanigans. I just say "Let's make a bet. I'll flip a coin. If the coin was heads we'll execute the bet twice. If tails, just once. What odds do you offer me?" Isn't that all that you are saying in this Sleeping Beauty with Betting scenario? The expected value of a bet is a product of the payoff with the probability - the payoff is twice as high in the case of heads, so why should I think that the probability is also twice as high? I argue that this is the very question of the problem: is being right twice worth twice as much?
tgb2-5

That assumes that the bet is offered to you every time you wake up, even when you wake up twice. If you make the opposite assumption (you are offered the bet only on the last time you wake up), then the odds change. So I see this as a subtle form of begging the question.

4green_leaf
Unless I'm missing something, for optimal betting to be isomorphic to a correct application of Bayes' theorem, you have to bet for every event in the set that you're being asked about. If you're asked "conditional on you waking up, what's the probability of the coin having landed heads," the isomorphic question is "if I bet every time I wake up on the coin landing heads, what odds should I bet at to cut even," for which the answer is 2:1, which makes the corresponding probability p = 1/3.
3samshap
That assumption literally changes the nature of the problem, because the offer to bet, is information that you are using to update your posterior probability. You can repair that problem by always offering the bet and ignoring one of the bets on tails. But of course that feels like cheating - I think most people would agree that if the odds makers are consistently ignoring bets on one side, then the odds no longer reflect the underlying probability. Maybe there's another formulation that gives 1:1 odds, but I can't think of it.
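A quick simulation of the per-awakening betting framing discussed above, using the convention green_leaf's 2:1 figure implies (assumed here: one awakening on heads, two on tails): the coin is fair per flip, yet only a third of awakenings are heads-awakenings, so a bet placed at every awakening breaks even at 2:1 odds.

# Simulation of per-awakening betting in Sleeping Beauty, assuming the standard
# convention: heads -> one awakening, tails -> two, and a bet on heads is
# placed at every awakening.
import random

random.seed(0)
trials = 100_000
awakenings = heads_awakenings = 0
for _ in range(trials):
    heads = random.random() < 0.5
    n_wakes = 1 if heads else 2
    awakenings += n_wakes
    if heads:
        heads_awakenings += n_wakes

print("P(heads) per coin flip:       0.5")
print("fraction of awakenings heads:", heads_awakenings / awakenings)  # ~1/3

# Break-even check: win 2 units per heads awakening, lose 1 per tails awakening.
profit = heads_awakenings * 2 - (awakenings - heads_awakenings) * 1
print("profit per awakening at 2:1 odds:", profit / awakenings)  # ~0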
tgb40

Your link to Lynch and Marinov is currently incorrect. However I also don't understand whether what they say matches with your post:

the energetic burden of a gene is typically no greater, and generally becomes progressively smaller, in larger cells in both bacteria and eukaryotes, and this is true for costs measured at the DNA, RNA, and protein levels. These results eliminate the need to invoke an energetics barrier to genome complexity. ... These results indicate that the origin of the mitochondrion was not a prerequisite for genome-size expansion.

aysja187

Ah, thanks! Link fixed now. 

Yes, welp, I considered getting into this whole debate in the post but it seemed like too much of an aside. Basically, Lynch is like, “when you control for cell size, the amount of energy per genome is not predictive of whether it’s a prokaryote or a eukaryote.” In other words, on his account, the main determinant of bioenergetic availability appears to be the size of the cell, rather than anything energetically special about eukaryotes, such as mitochondria. 

There are some issues here. First, most of the large prokary... (read more)

tgb20

So that example is of L; what is the f for it? Obviously, there are multiple f that could give that (depending on how the loss is computed from f), with some of them having symmetries and some of them not. That's why I find the discussion so confusing: we really only care about symmetries of f (which give type B behavior) but instead are talking about symmetries of L (which may indicate either type A or type B) without really distinguishing the two. (Unless my example in the previous post shows that it's a f... (read more)

1Jesse Hoogland
This is a toy example (I didn't come up with it with any particular f in mind). I think the important thing is that the distinction does not make much of a difference in practice. Both correspond to lower effective dimensionality (type A very explicitly, and type B less directly). Both are able to "trap" random motion. And it seems like both somehow help make the loss landscape more navigable. If you're interested in interpreting the energy landscape as a loss landscape, x and y would be the parameters (and a and b would be hyperparameters related to things like the learning rate and batch size).
tgb20

Are you bringing up wireheading to answer yes or no to my question (of whether RL is more prone to gradient hacking)? To me, it sounds like you're suggesting a no, but I think it's in support of the idea that RL might be prone to gradient hacking. The AI, like me, avoids wireheading itself and so will never be modified by gradient descent towards wireheading because gradient descent doesn't know anything about wireheading until it's been tried. So that is an example of gradient hacking itself, isn't it? Unlike in a supervised learning setup where the gradient descent 'knows' about all possible options and will modify any subagents that avoid giving the right answer.

So am I a gradient hacker whenever I just say no to drugs?

tgb20

I'm still thinking about this (unsuccessfully). Maybe my missing piece is that the examples I'm considering here still do not have any of the singularities that this topic focuses on! What are the simplest examples with singularities? Say again we're fitting y = f(x) for over some parameters. And specifically let's consider the points (0,0) and (1,0) as our only training data. Then   has minimal loss set . That has a singularity at (0,0,0). I don't really see why it would generalize better t... (read more)

2Jesse Hoogland
I wrote a follow-up that should be helpful to see an example in more detail. The example I mention is the loss function (=potential energy) L(x) = a⋅min((x−b)², (y−b)²). There's a singularity at the origin. This does seem like an important point to emphasize: symmetries in the model p(⋅|w) (or f_w(⋅) if you're making deterministic predictions) and the true distribution q(x) lead to singularities in the loss landscape L_n(x). There's an important distinction between f and L.
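One way to see the "more nearly-optimal points near a worse singularity" claim numerically, taking a=1 and b=0 in the example above (so the loss is min(x², y²)) and comparing it to an ordinary non-degenerate minimum x² + y²; the comparison loss is my choice, not from the thread.

# Monte Carlo illustration: near-minimal volume for the singular loss above
# (with a=1, b=0) versus an ordinary quadratic minimum.
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(1_000_000, 2))
L1 = np.minimum(pts[:, 0]**2, pts[:, 1]**2)   # degenerate (singular) minimum set
L2 = pts[:, 0]**2 + pts[:, 1]**2              # ordinary non-degenerate minimum

for delta in (1e-2, 1e-4):
    frac1 = np.mean(L1 < delta)
    frac2 = np.mean(L2 < delta)
    print(f"delta={delta:g}: near-minimal volume  singular {frac1:.4%}  ordinary {frac2:.4%}")
# The singular loss's near-minimal volume shrinks like sqrt(delta), the ordinary
# one like delta, so the singular minimum dominates as delta gets small.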
tgb20

And a follow-up that I just thought of: is reinforcement learning more prone to gradient hacking? For example, if a sub-agent guesses that a particular previously untried type of action would produce very high reward, the sub-agent might be able to direct the policy away from those actions. The learning process will never correct this behavior if the overall model never gets to learn that those actions are beneficial. Therefore the sub-agent can direct away from some classes of high-reward actions that it doesn't like without being altered.

2Quintin Pope
Do you want to do a ton of super addictive drugs? Reward is not the optimization target. It's also not supposed to be the optimization target. A model that reliably executes the most rewarding possible action available will wirehead as soon as it's able.
tgb20

There's been discussion of 'gradient hacking' lately, such as here. What I'm still unsure about is whether or not a gradient hacker is just another word for local minimum? It feels different but when I want to try to put a finer definition on it, I can't. My best alternative is "local minimum, but malicious" but that seems odd since it depends upon some moral character.

2tgb
And a follow-up that I just thought of: is reinforcement learning more prone to gradient hacking? For example, if a sub-agent guesses that a particular previously untried type of action would produce very high reward, the sub-agent might be able to direct the policy away from those actions. The learning process will never correct this behavior if the overall model never gets to learn that those actions are beneficial. Therefore the sub-agent can direct away from some classes of high-reward actions that it doesn't like without being altered.
tgb20

Thanks for trying to walk me through this more, though I'm not sure this clears up my confusion. An even more similar model to the one in the video (a pendulum) would be the model that  which has four parameters  but of course you don't really need both a and b. My point is that, as far as the loss function is concerned, the situation for a fourth degree polynomial's redundancy is identical to the situation for this new model. Yet we clearly have two different types of redundancy going on:

  • Type A: like the fourth deg
... (read more)
2Jesse Hoogland
I’m confused now too. Let’s see if I got it right: A: You have two models with perfect train loss but different test loss. You can swap between them with respect to train loss but they may have different generalization performance. B: You have two models whose layers are permutations of each other and so perform the exact same calculation (and therefore have the same generalization performance). The claim is that the “simplest” models (largest singularities) dominate our expected learning behavior. Large singularities mean fewer effective parameters. The reason that simplicity (with respect to either type) translates to generalization is Occam’s razor: simple functions are compatible with more possible continuations of the dataset. Not all type A redundant models are the same with respect to simplicity and therefore they’re not treated the same by learning.
tgb30

I'm confused by the setup. Let's consider the simplest case: fitting points in the plane, y as a function of x. If I have three datapoints and I fit a quadratic to it, I have a dimension 0 space of minimizers of the loss function: the unique parabola through those three points (assume they're not on top of each other). Since I have three parameters in a quadratic, I assume that this means the effective degrees of freedom of the model is 3 according to this post. If I instead fit a quartic, I now have a dimension 1 space of minimizers and 4 parameters, so I ... (read more)

6Jesse Hoogland
On its own the quartic has 4 degrees of freedom (and the 19th degree polynomial 19 DOFs). It's not until I introduce additional constraints (independent equations) that the effective dimensionality goes down. E.g.: a quartic + a linear equation = 3 degrees of freedom, ax⁴ + bx³ + cx² + dx + e = 0 together with a = b. It's these kinds of constraints/relations/symmetries that reduce the effective dimensionality. This video has a good example of a more realistic case. We don't have access to the "true loss." We only have access to the training loss (for this case, K_n(w)). Of course the true distribution is sneakily behind the empirical distribution and so has after-effects in the training loss, but it doesn't show up explicitly in p(D_n) (the thing we're trying to maximize).
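A small numeric check of the dimension counting in this exchange, for the three-data-point setup from the parent comment: each extra polynomial coefficient adds one dimension to the set of exact fits. The specific x-values are arbitrary.

# Dimension of the zero-training-loss set when interpolating three points with
# polynomials of increasing degree: (number of coefficients) - 3.
import numpy as np

x = np.array([0.0, 1.0, 2.0])   # three training inputs (arbitrary choice)
for degree in (2, 3):
    constraints = np.vander(x, degree + 1)   # rows: data points, columns: coefficients
    n_coeffs = degree + 1
    rank = np.linalg.matrix_rank(constraints)
    print(f"degree {degree}: {n_coeffs} coefficients, "
          f"minimizer set has dimension {n_coeffs - rank}")
# degree 2: 3 coefficients, dimension 0 (the unique parabola)
# degree 3: 4 coefficients, dimension 1 (a one-parameter family of exact fits)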
tgb41

Thanks for the clarification! In fact, that opinion wasn't even one of the ones I had considered you might have.

tgb3432

I simultaneously would have answered ‘no,’ would expect most people in my social circles to answer no, think it is clear that this being a near-universal is a very bad sign, and also that 25.6% is terrifying. It’s something like ‘there is a right amount of the thing this is a proxy for, and that very much is not it.’

At the risk of being too honest, I find passages written like this horribly confusing and never know what you mean when you write like this. ("this" being near universal - what is "this"? ("answering no" like you and your friends or "answering ... (read more)

7Zvi
Thank you. I agree that the phrasing here wasn't as clear as it could be and I'll watch out for similar things in the future (I won't be fixing this instance because I don't generally edit posts after the first few days unless they are intended to be linked to a lot in the future, given how traffic works these days.)  If it's still confusing, I meant: I would not say that making my parents proud is one of my main goals in life. I would expect [people I know] to mostly also not see this as one of their main goals. I think that the percentage of people answering yes is a proxy that correlates to virtues that it is possible to have too much or too little of individually or as a society, a type of respect for family and tradition and accomplishment and other neat (but possible to overdo) stuff like that. A number like 25.6% indicates a terrifyingly low amount of such virtues, such that I would worry about the future of such a country, the same way 99% is terrifyingly high. Or that you can't actually fully get rid of your 1st thing without also getting rid of the 2nd thing, not in practice - you don't want too many people basing too much of their self worth on (especially solely on) their parents judgment of them, but you also don't want them disregarding such preferences either. 
1lise
(commenting just to say I upvoted for the "horribly confusing" line)
6Dagon
There's a lot of room for cultural interpretation of "one of my main goals in life".  Some will take this as "one of my top-3 far-mode indicators of success", some will take it as "this is at least somewhat important to me".  I'd be shocked if many US intellectuals said "yes" to this, as they're likely to interpret it the first way, and it's low-status compared to having an impact or making your own way.  And I'd be shocked if many people (a majority, but nowhere near unanimity - many people are far more aware than in previous generations of the flaws of their elders) wouldn't say "yes" if it were framed the second way. 
2Ben Pace
The line is "One of my main goals in life is to make my parents proud".  I am interested to know my parents' opinion on my life, and consider their general advice as an input, but I think if anyone told me it was their primary goal, then I'd anticipate they were shortsighted and needed to get out of their bubble. The people I've met who are most likely to say that I think are people who will never move out of the town they were born in, and don't have any truly ambitious goals. For instance, if the question was a proxy for "Tick yes if you've not found many meanings or purposes that are worth devoting your life to beyond the immediate relationships you were born into, and may never leave your home town" then it would seem drastically too high to me, by around 10x. In contrast the question I'd like to see be high is something more like "I have a loving relationship with my parents". That seems healthy to me.
tgb20

Thanks. I think I've been tripped up by this terminology more than once now.

tgb50

Not sure that I understand your claim here about optimization. An optimizer is presumably given some choice of possible initial states to choose from to achieve its goal (otherwise it cannot interact at all). In which case, the set of accessible states will depend upon the chosen initial state and so the optimizer can influence long-term behavior and choose whatever best matches its desires.

3Adam Shai
I share your confusions/intution about what is meant by optimization here. But I think for the purposes of this post, optimization is defined here, which is linked to at the beginning of this post. In that link, optimization is thought of as a pattern that persists in the face of perturbations and that evolves towards a small set of states. I'm still not totally grokking it though.
tgb30

Why would CZ tweet out that he was starting to sell his FTT? Surely that would only decrease the amount he could recover on his sales?

2Ericf
Maybe, but a big player selling without explanation can also cause a panic. There are also reputation effects with either choice.
4Ape in the coat
One can speculate that he had already sold his FTT and started shorting when writing the tweet.
tgb20

I agree, I was just responding to your penultimate sentence: “In fact, if you could know without labeling generated data, why would you generate something that you can tell is bad in the first place?”

 

Personally, I think it’s kind of exciting to be part of what might be the last breath of purely human writing. Also, depressing.

tgb20

Surely the problem is that someone else is generating it - or more accurately lots of other people generating it in huge quantities.

1Lech Mazur
While there are still significant improvements in data/model/generation you might be able to imperfectly detect whether some text was generated by the previous generation of models. But if you're training a new model, you probably don't have such a next-gen classifier ready yet. So if you want to do just one training run, it could be easier to just limit your training data to the text that was available years ago or to only trust some sources. A related issue is the use of AI writing assistants that fix grammar and modify human-written text in other ways that the language model considers better. While it seems like a less important problem, they could make the human-written text somewhat harder to distinguish from the AI-written text from the other direction.
tgb915

I work in a related field and found this a helpful overview that filled in some gaps in my knowledge that I probably should have known already, and I'm looking forward to the follow-ups. I do think that this would likely be a very hard read for a layman who wasn't already pretty familiar with genetics, and you might consider making an even more basic version of this. Lots of jargon is dropped without explanation, for example.

1Metacelsus
Yeah I probably fell victim to https://xkcd.com/2501/ Making a more simplified version is a good idea, and I'll probably do it after I'm done with the other posts.