Hello! I work at Lightcone and like LessWrong :-). I have made some confidentiality agreements I can't leak much metadata about (like who they are with). I have made no non-disparagement agreements.
Perhaps say some words on why they might want to?
Curated! A short and sweet post, with a useful rule of thumb for those who live and die by the Fermi. There are some more good ones in the comments, too!
I don't think this distinction is robust enough to rely on as a defensive property to this extent. I think it's probably not that hard to think "I probably would have tried something in direction X, or direction Y", and then gather lots of bits about how well the clusters X and Y work.
I'm curious about Claude's coauthorship of this post. How did it help?
The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced.
Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R? If it can tell that, how come it can't guess it specifically would have produced A?
I thought this title meant the post would be making a case from conservative (i.e. minimal) assumptions.
Maybe change the title to "making a politically conservative case for alignment" or something?
I wonder what the lifetime spend on dating apps is. I expect that for most people who ever pay, it's >$100.
I think the credit assignment is legit hard, rather than just being a case of bad norms. Do you disagree?
I would guess they tried it because they hoped it would be competitive with their other product, and sunset it because that didn't happen with the amount of energy they wanted to allocate to the bet. There may also have been an element of updating more about how much focus their core product needed.
I only skimmed the retrospective now, but it seems mostly to be detailing problems that stymied their ability to find traction.
If I had to pick a favourite, I'd probably go for Fire and AIs, but The GPT is also great: very much the terrifying sublime.