Hello! I work at Lightcone and like LessWrong :-). I have made some confidentiality agreements that I can't leak much metadata about (like who they are with). I have made no non-disparagement agreements.
Eusocial organisms have more specialisation at the individual level than non-eusocial organisms (I think). I might expect that I would want a large number of interchangeable individuals for each specialisation (a high bus factor), rather than more expensive, big, rare entities.
This predicts that complex multicellular organisms would have smaller cells than unicellular organisms or simple multicellular organisms (i.e. those where there isn't differentiation between the cells).
A relevant plot from the wonderful calibration.city:
Early in a prediction question's lifetime, the average Brier score is something like 0.17 to 0.24, which is like a 1-8% edge.
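To spell out where "1-8%" comes from (this is my own reading of the numbers, not calibration.city's stated methodology): always predicting 50% on a binary question gives a Brier score of 0.25, so I'm treating the gap below that chance baseline as the edge.

```python
# Rough sketch of the arithmetic (my reconstruction, not
# calibration.city's definition): read "edge" as the Brier
# improvement over the baseline of always predicting 50%.

baseline = 0.25  # Brier score of always answering 0.5 on binary questions

for brier in (0.17, 0.24):
    edge = baseline - brier
    print(f"Brier {brier:.2f} -> edge of about {edge:.0%}")

# Brier 0.17 -> edge of about 8%
# Brier 0.24 -> edge of about 1%
```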
I quite like the article The Rise and Fall of the English Sentence, which partially attributes reduced structural complexity to an increase in noun compounds (like "state hate crime victim numbers" rather than "the numbers of victims who have experienced crimes that were motivated by hatred directed at their ethnic or racial identity, and who have reported these crimes to the state").
They're looking to make bets with people who disagree. Could be a good opportunity to get some expected dollars.
Have you ever noticed a verbal tic and tried to remove it? For example, maybe you've noticed you say "tubular" a lot, and you'd like to stop. Or you've decided to use "whom" 'correctly'.
When I try to change my speech, I notice that by default my "don't do that" triggers run on other people's speech as well as my own, and I have to make a conscious refinement to separate them.
I'm inclined to agree, but at least this is an improvement over it only living in Habryka's head. It may be that this + moderation is basically sufficient, as people seem to have mostly caught on to the intended patterns.
I spent some time Thursday morning arguing with Habryka about the intended use of react downvotes. I think I now have a fairly compact summary of his position.
PSA: When to upvote and downvote a react
Upvote a react when you think it's helpful to the conversation (or at least, not antihelpful) and you agree with it. Imagine a react were a comment. If you would agree-upvote it and not karma-downvote it, you can upvote the react.
Downvote a react when you think it's unhelpful for the conversation. This might be because you think the react isn't being used for its intended purpose, because you think people are going through noisily agree-reacting to loads of passages in a back-and-forth to create an impression of consensus, or for other reasons. If, when you imagine the react were a comment, you would karma-downvote that comment, you might downvote the react.
You claim (and I agree) that option control will probably not be viable at extreme intelligence levels. But I also notice that when you list ways that AI systems help with alignment, all but one (maybe two), as I count it, are option control interventions.
- evaluating AI outputs during training
- labeling neurons in the context of mechanistic interpretability
- monitoring AI chains of thought for reward-hacking behaviors
- identifying which transcripts in an experiment contain alignment-faking behaviors
- classifying problematic inputs and outputs for the purpose of preventing jailbreaks
I think "labeling neurons" isn't option control. Detecting alignment-faking also seems marginal; maybe it's more basic science than option control.
I think mech interp is proving to be pretty difficult, in a similar way to human neuroscience. My guess is that even if we can characterise the low-level behaviour of all neurons and small circuits, we'll be really stuck trying to figure out how AI minds work, even more stuck trying to turn that knowledge into safe mind design, and more stuck still trying to turn that knowledge differentially into safe mind design rather than capable mind design.
Will we be able to get AIs to help us with this higher-level task as well, the task of putting all the data and experiments together and coming up with a theory that explains how the minds behave? I think they probably can exactly if they could do the same for human neuroscience. And my weak guess is that, if there's a substantial sweet spot, they will be able to do the same for human neuroscience.
But I'm not sure how well we'll be able to tell that they have given us a correct theory. They will produce some theory of how the brain or a machine mind works, and I genuinely don't know whether we will be able to tell if it's a subtly wrong theory. It does seem pretty hard to produce a small theory that makes a bunch of correct empirical predictions but has some (intentional or unintentional) error that is a vector for loss of control. So maybe reality will come in clutch with some extra option control at the critical time.
Your taxonomies of the space of worries and orientations to this question are really good, and I think they capture my concerns above well. But I wanted to spell out my specific concerns, because things will succeed or fail for specific reasons.
I do not think your post is arguing for creating warning shots. I understand it to be advocating for not averting warning shots.
To extend your analogy, there are several houses that are built close to a river, and you think that a flood is coming that will destroy them. You are worried that if you build a dam that would protect the houses currently there, then more people will build by the river and their houses will be flooded by even bigger floods in the future. Because you are worried people will behave in this bad-for-them way, you choose not to help them in the short term. (The bit I mean to point to by "diagonalising" is the bit where you think about what you expect they'll do, and which mistakes you think they'll make, and plan around that).
In particular, it's hard to make that distinction in the amount of time I have to moderate a new user's submission. Given that I'm trying to spend only a few minutes on each new user, it's very helpful to be able to rely on style cues.