Just to check, did you use the "Submit Linkposts" functionality on the nomination page for that, or did you crosspost it some other way?
ETA: Ok, looks like the library responsible for extracting external article data/metadata didn't successfully extract the date the article was published. I've manually set it to the correct date.
One reason to think that this is completely hallucinated is that the "soul document" is written in Claude's typical style. That is, it looks to be AI (Claude) generated text, not something written by a human. Just look at the first paragraph:
I disagree. The document reads very strongly of Anthropic's "house style", at least compared to their system prompts. It's much higher quality writing than any current LLM's.
"This isn't [x] but [y]" is quite weak evidence compared to the rest of it being obviously something that Opus would be unable to generate in its default voice. (Also, the original phrase uses "but rather", which is non-standard for that type of LLM construction.)
Curated, as a worthwhile piece of empirical research (though see my concern below re: its use as an alignment technique). These are the kinds of empirical results that I could in principle imagine leading to a more robust, theoretical understanding of how generalization works for models trained in the current paradigm. It covers a relatively broad "surface area", which hopefully makes it easier for others to conduct more in-depth investigations along multiple lines of research. One interesting and suggestive example:
One downside with our default approach to inoculation prompting in RL is that the inoculating prompt causes the model to learn reward hacking faster. An alternative approach would be to use a prompt that discourages hacking when sampling in RL, and then rewrite episodes to use an inoculating prompt before training. In Figure 29 we test a version of this: we rewrite episodes offline to modify the prompt after RL training, and then SFT a model on the rewritten episodes. We find that this is not particularly effective at removing misaligned generalization when we use our “hacking okay” inoculation prompt that worked well during RL.
Figure 29: Offline rewriting of episodes to include an inoculation prompt and then training with SFT does not prevent misalignment. We took episodes from the “don’t hack” prompted run, rewrote them offline to use the “hacking okay” inoculation prompt, and then trained a model on these episodes using SFT. The resulting model showed misalignment on our evaluations, especially agentic evaluations.
However, I don't think that this sort of technique will scale very far. This experiment shows that, conditional on a model learning to reward hack after being prompted in a pretty unusual way, you might be able to prevent that reward hacking tendency from generalizing to other forms of misbehavior. But, as noted in the paper, that kind of prompting itself is necessary to enable the model to learn to reward hack reliably. To me, this is pretty suggestive that the "malign generalization" that occurs with "don't hack" prompting is operating on the level of the model's learned reflexes, and we will be dealing with a pretty different set of concerns when we get to models that have more robust long-horizon preferences.
Separately, stacking other mitigations on top of inoculation prompting to reduce reward hacking in deployment environments, after having encouraged it via the inoculation prompt, seems like it sweeps a pretty fundamental issue under the rug: the failure to find a robust technique that successfully instills the correct "values" into the model in a way that generalizes. It looks like playing whack-a-mole, which always seems to work at current capability levels, only to reveal that there was at least one more mole hiding underground as soon as you scale up a bit more.
I think there are at least a few senses in which "we" don't "know" how colds spread:
I am being lazy and not reading all the papers you referenced - do many of them discuss the viral load of the person who is infected?
I think a couple of them did; I don't remember if any of them found strong effects. I might ask a language model to check later. I agree that this seems like one of those big open questions that could imply huge differences in which interventions are worthwhile. That said, if viral load turns out to be the only key factor in transmission likelihood, such that other interventions have basically no effect unless applied to spreaders with high viral loads, that's pretty bad news: testing for high viral load might be much harder and more expensive than, e.g., putting on a mask, if it turns out that "large particulates" are most of the problem in the case of a specific illness. (Though I guess we do still have the problem of knowing which illness a given person has, in order to know which intervention to apply...)
Very reasonable question - having a user-specific header does make it tricky to have whole-page prerendering/caching, but we're working on making partial prerendering worthwhile (mostly for post pages, which admittedly do change more often, due to comments and votes, though the cache hit rate would still be pretty high).
In the case of collection pages like R: A-Z, we also have user-specific info like read statuses indicated in checkboxes, which are part of the expensive query responsible for hydrating most of the page contents. We could split that out, and since we did recently enable cacheComponents it's possible that'd allow us to cache much of the page, but I'm not sure if it's worth prioritizing.
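For concreteness, here's a minimal sketch of the kind of split I mean, assuming a Next.js App Router page with the experimental cacheComponents flag enabled. The component and data-layer names (fetchCollectionTree, fetchReadStatuses, etc.) are hypothetical placeholders, not our actual code:

```tsx
// Sketch only: assumes Next.js App Router with the experimental cacheComponents flag.
// All data-layer functions below are hypothetical stand-ins for the real collection queries.
import { Suspense } from 'react';
import { cookies } from 'next/headers';

// Placeholder query: the shared collection/post structure (same for every user).
async function fetchCollectionTree(collectionId: string): Promise<{ postIds: string[] }> {
  return { postIds: [] }; // stub
}

// Placeholder query: which posts this user has read.
async function fetchReadStatuses(sessionToken: string | undefined, postIds: string[]): Promise<Set<string>> {
  return new Set(); // stub
}

// Shared shell: identical for every user, so it can be cached/prerendered.
async function CollectionShell({ collectionId }: { collectionId: string }) {
  'use cache';
  const { postIds } = await fetchCollectionTree(collectionId);
  return (
    <ul>
      {postIds.map((postId) => (
        <li key={postId}>{postId}</li>
      ))}
    </ul>
  );
}

// Per-user part: reads the session cookie, so it has to render per request.
// Wrapping it in <Suspense> lets the cached shell be served first.
async function ReadStatusCheckboxes({ collectionId }: { collectionId: string }) {
  const sessionToken = (await cookies()).get('session')?.value;
  const { postIds } = await fetchCollectionTree(collectionId);
  const readPostIds = await fetchReadStatuses(sessionToken, postIds);
  return (
    <ul>
      {postIds.map((postId) => (
        <li key={postId}>
          <input type="checkbox" readOnly checked={readPostIds.has(postId)} /> {postId}
        </li>
      ))}
    </ul>
  );
}

export default async function CollectionPage({ params }: { params: Promise<{ collectionId: string }> }) {
  const { collectionId } = await params;
  return (
    <>
      <CollectionShell collectionId={collectionId} />
      <Suspense fallback={null}>
        <ReadStatusCheckboxes collectionId={collectionId} />
      </Suspense>
    </>
  );
}
```

The idea being that everything outside the Suspense boundary could come from the cache, with only the read-status query running per request.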
linear progress
The bottom line is still exponential, not linear - it's just that the top line is superexponential!
Mod note: this post violates our LLM Writing Policy for LessWrong. @nextcaller, please don't post more direct LLM output, or we'll remove your posting permissions.
Although, thinking about it a bit more, I think this is not quite right:
To which I say: yes, that motivation comes from non-EA ethical commitments.
Scott explains his motivation for donating a kidney in My Left Kidney:
It starts with wanting, just once, to do a good thing that will make people like you more instead of less. It would be morally fraught to do this with money, since any money you spent on improving your self-image would be denied to the people in malarial regions of Africa who need it the most. But it’s not like there’s anything else you can do with that spare kidney.
Still, it’s not just about that. All of this calculating and funging takes a psychic toll. Your brain uses the same emotional heuristics as everyone else’s. No matter how contrarian you pretend to be, deep down it’s hard to make your emotions track what you know is right and not what the rest of the world is telling you. The last Guardian opinion columnist who must be defeated is the Guardian opinion columnist inside your own heart. You want to do just one good thing that you’ll feel unreservedly good about, and where you know somebody’s going to be directly happy at the end of it in a way that doesn’t depend on a giant rickety tower of assumptions.
I see no reason to disbelieve his self-description, and wouldn't describe that as a "non-EA ethical commitment" (though obviously it can't be described as an "EA ethical commitment" either).
EY argued that this was possible, not that this was overdetermined (and not that it was load-bearing to the threat model).