This is interesting. Can you say more about these experiments?
How do Anthropic's and xAI's compute compare over this period?
Could you say more about how you think S-risks could arise from the first attractor state?
An LLM trained with a sufficient amount of RL might learn to compress its thoughts into representations more efficient than English text, which seems consistent with the statement. I'm not sure if this is possible in practice; I've asked here if anyone knows of public examples.
Makes sense. Perhaps we'll know more when o3 is released. If the model doesn't offer a summary of its CoT, that would make neuralese more likely.
I've often heard it said that doing RL on chain of thought will lead to 'neuralese' (e.g. most recently in Ryan Greenblatt's excellent post on scheming). This seems important for alignment. Does anyone know of public examples of models developing or being trained to use neuralese?
(Based on public knowledge, it seems plausible (perhaps 25% likely) that o3 uses neuralese, which could put it in this category.)
What public knowledge has led you to this estimate?
I was able to replicate this result. Given o1's other impressive results, I wonder if the model is intentionally sandbagging. If it's trained to maximize human feedback, this might be an optimal strategy when playing zero-sum games.
> I was grateful for the experiences and the details of how he prepares for conversations and framing AI that he imparted on me.
I'm curious, what was his strategy for preparing for these discussions? What did he discuss?
> This updated how I perceive the “show down” focused crowd
possible typo?
How large of an advantage do you think OA gets relative to its competitors from Stargate?