AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
Sure, I agree that, as we point out in the post
Yes, sorry I missed that. The section is titled 'Conclusions' and comes at the end of the post, so I guess I must have skipped over it because I thought it was the post conclusion section rather than the high-frequency latents conclusion section.
As long as your evaluation metrics measure the thing you actually care about...
I agree with this. I just don't think those autointerp metrics robustly capture what we care about.
Removing High Frequency Latents from JumpReLU SAEs
On a first read, this doesn't seem principled to me? How do we know those high-frequency latents aren't, for example, basis directions for dense subspaces or common multi-dimensional features? In that case, we'd expect them to activate frequently and maybe appear pretty uninterpretable at a glance. Modifying the sparsity penalty to split them into lower frequency latents could then be pathological, moving us further away from capturing the features of the model even though interpretability scores might improve.
That's just one illustrative example. More centrally, I don't understand how this new penalty term relates to any mathematical definition that isn't ad-hoc. Why would the spread of the distribution matter to us, rather than simply the mean? If it does matter to us, why does it matter in roughly the way captured by this penalty term?
The standard SAE sparsity loss relates to minimising the description length of the activations. I suspect that isn't the right metric to optimise for understanding models, but it is at least a coherent, non-ad-hoc mathematical object.
EDIT: Oops, you address all that in the conclusion, I just can't read.
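For concreteness, here is a minimal sketch of what I mean by the standard SAE sparsity loss above, assuming the usual setup of a plain ReLU autoencoder with squared reconstruction error plus an L1 penalty on the latent activations. The function name `sae_loss`, the parameter names and the shapes are my own placeholders, not anything from the post, and this is not the modified JumpReLU penalty under discussion.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Reconstruction error plus L1 sparsity penalty for a plain ReLU SAE.

    Assumed (hypothetical) shapes: x is (batch, d_model), W_enc is
    (d_model, d_sae), W_dec is (d_sae, d_model).
    """
    # Latent activations: non-negative because of the ReLU.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Reconstruction of the input activations from the latents.
    x_hat = f @ W_dec + b_dec
    # Squared reconstruction error, summed over d_model, averaged over the batch.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    # L1 penalty: latent activations summed over latents, averaged over the batch.
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity, recon, sparsity

# Usage with random placeholder data.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
W_enc = rng.normal(size=(16, 64)) * 0.1
W_dec = rng.normal(size=(64, 16)) * 0.1
total, recon, sparsity = sae_loss(x, W_enc, np.zeros(64), W_dec, np.zeros(16), l1_coeff=1e-3)
```

Note that in this form each latent only enters the penalty through its mean activation over the batch. The spread of its activation distribution plays no role, which is the contrast I was asking about.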
Forgot to tell you this when you showed me the draft: The comp in sup paper actually had a dense construction for UAND included already. It works differently than the one you seem to have found though, using Gaussian weights rather than binary weights.
I will continue to do what I love, which includes reading and writing and thinking about biosecurity and diseases and animals and the end of the world and all that, and I will scrape out my existence one way or another.
Thank you. As far as I'm aware we don't know each other at all, but I really appreciate you working to do good.
I don't think the risks of talking about the culture war have gone down. If anything, it feels like it's yet again gotten worse. What exactly is risky to talk about has changed a bit, but that's it. I'm more reluctant than ever to involve myself in culture war adjacent discussions.
This comment by Carl Feynman has a very crisp formulation of the main problem as I see it.
They’re measuring a noisy phenomenon, yes, but that’s only half the problem. The other half of the problem is that society demands answers. New psychology results are a matter of considerable public interest and you can become rich and famous from them. In the gap between the difficulty of supply and the massive demand grows a culture of fakery. The same is true of nutrition: everyone wants to know what the healthy thing to eat is, and the fact that our current methods are incapable of discerning this is no obstacle to people who claim to know.
For a counterexample, look at the field of planetary science. Scanty evidence dribbles in from occasional spacecraft missions and telescopic observations, but the field is intellectually sound because public attention doesn’t rest on the outcome.
So, the recipe for making a broken science you can't trust is
As you say, if a field is exposed to these incentives for a while, you get additional downstream problems like all the competent scientists who care about actual progress leaving. But I think that's a secondary effect. If you replaced all the psychology grads with physics and electrical engineering grads overnight, I'd expect you'd at best get a very brief period of improvement before the incentive gradient brought the field back to the status quo. On the other hand, if the incentives suddenly changed, I think reforming the field might become possible.
This suggests that if you wanted to found new parallel fields of nutrition, psychology etc. you could trust, you should consider:
Relationship ... stuff?
I guess I feel kind of confused by the framing of the question. I don't have a model under which the sexual aspect of a long-term relationship typically makes up the bulk of its value to the participants. So, if a long-term relationship isn't doing well on that front, and yet both participants keep pursuing the relationship, my first guess would be that it's due to the value of everything that is not that. I wouldn't particularly expect any one thing to stick out here. Maybe they have a thing where they cuddle and watch the sunrise together while they talk about their problems. Maybe they have a shared passion for arthouse films. Maybe they have so much history and such a mutually integrated life with partitioned responsibilities that learning to live alone again would be a massive labour investment, practically and emotionally. Maybe they admire each other. Probably there's a mixture of many things like that going on. Love can be fed by many little sources.
So, this I suppose:
Their romantic partner offering lots of value in other ways. I'm skeptical of this one because female partners are typically notoriously high maintenance in money, attention, and emotional labor. Sure, she might be great in a lot of ways, but it's hard for that to add up enough to outweigh the usual costs.
I don't find it hard at all to see how that'd add up to something that vastly outweighs the costs, and this would be my starting guess for what's mainly going on in most long-term relationships of this type.
This data seems to be for sexual satisfaction rather than romantic satisfaction or general relationship satisfaction.
How sub-light? I was mostly just guessing here, but if it’s below like 0.95c I’d be surprised.
But remember that you already conditioned on 'married couples without kids'. My guess would be that in the subset of man-woman married couples without kids, the man being the exclusive breadwinner is a lot less common than in the set of all man-woman married couples. These properties seem like they'd be heavily anti-correlated.
In the subset of man-woman married couples without kids that get along, I wouldn't be surprised if having a partner effectively works out to more money for both participants, because you've got two incomes, but less than 2x living expenses.
I am ... not ... picturing that as the typical case? Uh, I don't know what to say here really. That's just not an image that comes to mind for me when I picture 'older hetero married couple'. Plausibly I don't know enough normal people to have a good sense of what normal marriages are like.
I think for many of those couples that fight multiple times a month, the alternative isn't separating and finding other, happier relationships where there are never any fights. The typical case I picture there is that the relationship has some fights because both participants aren't that great at communicating or understanding emotions, their own or other people's. If they separated and found new relationships, they'd get into fights in those relationships as well.
It seems to me that lots of humans are just very prone to getting into fights. With their partners, their families, their roommates etc., to the point that they have accepted having lots of fights as a basic fact of life. I don't think the correct takeaway from that is 'Most humans would be happier if they avoided having close relationships with other humans.'
Conventional wisdom also has it that married people often love each other so much they would literally die for their partner. I think 'conventional wisdom' is just a very big tent that has room for everything under the sun. If even 5-10% of married couples have bad relationships where the partners actively dislike each other, that'd be many millions of people in the English speaking population alone. To me, that seems like more than enough people to generate a subset of well-known conventional wisdoms talking about how awful long-term relationships are.
Case in point, I feel like I hear those particular conventional wisdoms less commonly these days in the Western world. My guess is this is because long-term heterosexual marriage is no longer culturally mandatory, so there are fewer unhappy couples around generating conventional wisdoms about their plight.
So, in summary, both I think? I feel like the 'typical' picture of a hetero marriage you sketch is more like my picture of an 'unusually terrible' marriage. You condition on a bad sexual relationship and no children and the woman doesn't earn money and the man doesn't even like her, romantically or platonically. That subset of marriages sure sounds like it'd have a high chance of the man just walking away, barring countervailing cultural pressures. But I don't think most marriages where the sex isn't great are like that.