I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.
I do alignment research, mostly in the vicinity of agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI.
"Overconfident" gets thrown around a lot by people who just mean "incorrect". Rarely do they mean actual systematic overconfidence. If everyone involved in building AI shifted their confidence down across the board, I'd be surprised if this changed their safety-related decisions very much. The mistakes they are making are more complicated, e.g. some people seem "underconfident" about how to model future highly capable AGI, and are therefore adopting a wait-and-see strategy. This isn't real systematic underconfidence, it's just a mistake (from my perspective). And maybe some are "overconfident" that early AGI will be helpful for solving future problems, but again this is just a mistake, not systemic overconfidence.
At no point in this discussion do I reference "limits of intelligence". I'm not taking any limits, or even making reference to any kind of perfect reasoning; my x-risk threat models in general don't involve that kind of mental move. I'm talking about near-human-level intelligence, and the reasoning applies to AIs that operate similarly to how they work now.
Without optimally learning from mistakes
You're making a much stronger claim than that and then retreating to a motte. Of course it's not optimal. But failing to notice very easy-to-correct mistakes is extremely, surprisingly sub-optimal on a very specific axis. That shouldn't be plausible once we condition on an otherwise low likelihood of making mistakes.
If you look at the most successful humans, they're largely not the most-calibrated ones.
The most natural explanation for this is that it's mostly selection effects, combined with humans being bad at prediction in general. And I expect most examples you could come up with look more like domain-specific overconfidence than across-the-board overconfidence.
but just because it's not the only useful thing and so spending your "points" elsewhere can yield better results.
I agree calibration is less valuable than other measures of correctness. But there aren't zero-sum "points" to be distributed here. Correcting for systematic overconfidence is basically free and doesn't have tradeoffs: you just take whatever your confidence would be and adjust it down. It can be done on-the-fly, and it's even easier with a scratchpad.
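For concreteness, here's a minimal sketch of the kind of on-the-fly adjustment I mean, in Python, with an entirely made-up shrinkage factor: take the raw confidence and pull it toward 50% in log-odds space. None of the specific numbers are load-bearing; the point is just that the correction is a one-line transformation.

```python
import math

def deflate(p: float, shrink: float = 0.8) -> float:
    """Shrink a stated probability toward 0.5 in log-odds space.

    shrink < 1 pulls confidence down; the exact value would be
    tuned against one's own track record (0.8 is illustrative).
    """
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-shrink * logit))

print(deflate(0.95))  # a raw 95% becomes roughly 91%
```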
If you think there's a strong first-mover advantage you should care a lot about what the minimum viable scary system looks like, rather than what scary systems at the limit look like.
No, not when it comes to planning mitigations. See the last paragraph of my response to Tim.
This assumes that [intelligent agents that can notice their own overconfidence] is the only/main source of x-risk
Yeah, main. I thought this was widely agreed on, which is part of why I'm still confused that your shortform got upvoted. Maybe I'm missing a type of x-risk, but I'd appreciate the mechanism being explained in more detail.
My current reasoning: It takes a lot of capability to be a danger to the whole world. The only pathway to destroying the world that seems plausible while being human-level-dumb is building ASI. But building ASI still presumably requires lots of updating on evidence and learning from mistakes, and a large number of prioritisation decisions.
I know it's not impossible to be systematically overconfident while succeeding at difficult tasks. But it's more and more surprising the more subtasks it succeeds on, and the more systematically overconfident it is. Being systematically overconfident is a very specific kind of incompetence (and therefore a priori unlikely), and easily noticeable (and therefore likely to be human-corrected or self-corrected), and extremely easy to correct for (and therefore unlikely that the standard online learning process or verbalised reasoning didn't generalise to this).
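To illustrate why I call it "easily noticeable": even a modest log of predictions and outcomes is enough for a crude check of stated confidence against realised accuracy. This is a toy sketch with invented numbers, not a claim about how any particular training or monitoring setup works.

```python
# Toy check for systematic overconfidence: compare average stated
# confidence against the fraction of predictions that came true.
# The prediction log is invented purely for illustration.
log = [
    (0.95, True), (0.90, False), (0.99, True), (0.85, False),
    (0.92, True), (0.97, False), (0.88, True), (0.94, False),
]

mean_confidence = sum(p for p, _ in log) / len(log)
accuracy = sum(hit for _, hit in log) / len(log)

# A large positive gap, sustained across many predictions, is exactly
# the kind of pattern that is cheap to notice and cheap to correct.
print(f"mean confidence {mean_confidence:.2f}, accuracy {accuracy:.2f}, "
      f"gap {mean_confidence - accuracy:+.2f}")
```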
I don't think the first AI smart enough to cause catastrophe will need to be that smart.
I think focusing on the "first AI smart enough" leads to a lot of low-EV research. If you solve a problem for the first AI that is smart enough, that doesn't help much, because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles, b) it won't be long before there are more capable AIs, and c) it's hard to predict future capability profiles.
Yes, but what's your point? Are you saying that highly capable (ASI-building, institution-replacing) but extremely epistemically inefficient agents are plausible? Without the ability to learn from mistakes?
I see that you're making large edits and additions to your previous responses after I've already responded.
This, and the way you're playing with definitions, makes me think you might be arguing in bad faith. I'm going to stop responding. If you had good intentions, I'm sorry.
That's corrigible behaviour, but the mechanism is not, because it stops being corrigible after some number of updates. The idea of corrigibility is that the system is correctable, and remains so, in spite of the designers making mistakes in goal alignment (or other design mistakes). (Of course, mistakes in however we enforce the corrigibility property itself might not be stably correctable, but the hope is that this part is easier to get right on the first try than the rest.)
No. The kind of intelligent agent that is scary is the kind that would notice its own overconfidence, after some small number of overconfident calls, and then work out how to correct for it.
There are more stable epistemic problems that are worth thinking about, but this definitely isn't one of them.
Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
My argument was that there are several "risk factors" that stack. I agree that no single one is overwhelmingly strong.
I prefer not to be rude. Are you sure it's not just that I'm confidently wrong? If I were disagreeing in the same tone with, e.g., Yampolskiy's argument for high-confidence AI doom, would that still come across as rude to you?