Thank you for writing this!
Can I write a "proof" on why we shouldn't rely on human feedback?
Possibly such a proof exists. With more assumptions, you can get better information on human values, see here.
This obviously doesn't solve all concerns.
Iterated Distillation-Amplification seems pointless though cause the humans need to scale with the AGI
Can you elaborate on that point?
Should I write a list of bad assumptions people keep making in alignment work? [...] that suffering is a relevant risk from AGI (suffering is inefficient, it's an anti-convergent goal)
Only a few people think about this a lot -- I can currently only think of the Center on Long-Term Risk at the intersection of suffering focus and AI Safety. Given how bad suffering is, I'm glad that there are people thinking about it, and I do not think that a simple inefficiency argument is enough.
Ethical constraints of fiddling with brains [...] We could solve this if we could fully simulate the human brain...
I hope I don't misrepresent you by putting these two quotes together. Is your position that the ethical dilemmas of "fiddling with human brains" would be solved by, instead, just fiddling with simulated brains? If so, then I disagree: I think simulated brains are also moral patients, to the same degree that physical brains are. I like this fiction a lot.
Thank you for the comment!
Possibly such a proof exists. With more assumptions, you can get better information on human values, see here. This obviously doesn't solve all concerns.
Those are great references! I'm going to add them to my reading list, thank you.
Only a few people think about this a lot -- I can currently only think of the Center on Long-Term Risk at the intersection of suffering focus and AI Safety. Given how bad suffering is, I'm glad that there are people thinking about it, and I do not think that a simple inefficiency argument is enough.
I'd have to flesh out my thinking here more, which is why it was a very short note. But essentially, I suspect generating suffering as a subgoal for an AGI is something like an anti-convergent goal: It makes almost all other goals harder to achieve. An intuitive example is factory farming, which currently generates a lot of suffering. However, as soon as we develop ways to grow meat in labs, that will be vastly more efficient, and thus we will converge to using it. An animal (or human) uses up energy while suffering, and the suffering itself tends to lower productivity and health in so many ways that it is inefficient in both purpose and resources.

That said, there can be a transition period (such as we have now with factory farming) where the high-suffering state is the optimum for some window of time until a more efficient method is developed (e.g. lab-grown meat). In that window there would, of course, be very much suffering for humanity. I wouldn't expect that window to be particularly big though, because human suffering achieves very few goals (as in, it covers very little of the possible goal space an AGI might target), and if recursive self-improvement happens, the window would simply pass fairly quickly.
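To make the structure of that argument concrete, here is a purely illustrative toy sketch (my own addition for clarity; every quantity in it is an arbitrary assumption, not an estimate of anything real): an optimizer keeps using whichever available method is cheapest, drops the high-suffering method as soon as a cheaper alternative has been developed, and faster capability growth shortens the window in which the high-suffering method is used.

```python
# Toy model of the "transition window" argument above. Every number is an
# arbitrary assumption chosen only to show the argument's structure.

def transition_window(dev_cost: float, growth_rate: float, initial_capability: float = 1.0) -> int:
    """Number of steps the high-suffering method stays in use, assuming the
    optimizer invests its current capability each step into developing the
    cheaper low-suffering alternative, and capability multiplies by
    `growth_rate` per step (a crude stand-in for recursive self-improvement)."""
    capability, progress, steps = initial_capability, 0.0, 0
    while progress < dev_cost:       # cheaper, suffering-free method not yet available
        progress += capability       # this step's R&D progress
        capability *= growth_rate    # capability compounds
        steps += 1
    return steps

for rate in (1.0, 1.5, 3.0):         # no growth vs. moderate vs. rapid self-improvement
    window = transition_window(dev_cost=1000.0, growth_rate=rate)
    print(f"capability growth x{rate} per step -> window of {window} steps")
```

The window shrinks from 1000 steps with no capability growth to a handful of steps under rapid growth; this just restates the argument's structure, it is not evidence for it.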
I hope I don't misrepresent you by putting these two quotes together. Is your position that the ethical dilemmas of "fiddling with human brains" would be solved by, instead, just fiddling with simulated brains? If so, then I disagree: I think simulated brains are also moral patients, to the same degree that physical brains are. I like this fiction a lot.
Hmm, good point. I'm struck that I had considered this issue in a conversation with someone else six weeks ago, but didn't surface it in my notes now... I feel something may be going on with being in a generative frame versus a critique frame. And I can probably use awareness of that to generate better ideas.
But essentially, I suspect generating suffering as a subgoal for an AGI is something like an anti-convergent goal: It makes almost all other goals harder to achieve.
I think I basically agree (though maybe not with as high confidence as you), but I don't think that means huge amounts of suffering cannot dominate the future. For example, if not one but many superintelligent AI systems end up determining the future, cooperation failures between them might create suffering.
What's the best way to share your progress skilling up in AI alignment? Maybe it's highly polished posts on your well-considered and nuanced insights into alignment. Or maybe it's massively overconfident takes you should really just keep in your private drafts. Let's find out with the power of science. Here is the experiment:
I dump here a raw extract of my research notes from reading through the core material of Richard Ngo's Safety Fundamentals course. This only covers "new" ideas I had (quotation marks, as I have yet to generate an idea that I can't already find in the literature). The notes are unedited and contain some weird grammar, jumps in reasoning, and many stupid questions. Then we see if anyone gets anything out of it. If so, we all learn something. If not, I mentally award myself another badge for awkward self-disclosure.
A little more context before we dive in: This is just from my "ideas" page in the Obsidian vault I started with all my AIS readings. I can highly recommend this tool, and a shout-out to jhoogland for making me aware of it. Additionally, I started thinking about AIS about 3-4 months ago in my spare time, so my questions, models, and hypotheses are still quite naive.
With all those disclaimers out of the way, here we go:
Raw Notes
More notes on Superhumans
Based on "AI Versus Human Intelligence Enhancement", Chapter 12 (Yudkowsky, 2008)
Superhumans won't work, according to Yudkowsky, because:
1. Human brains are too fragile. Augmenting them is likely to make them go insane.
2. It's much slower than AGI development, because scaling up an existing design is harder than building a new one from scratch at a bigger scale
3. Ethical constraints of fiddling with brains
4. Same problem of generating an unaligned superhuman intelligence: You make the human much smarter, but break them in a way that makes them go insane. Then they will still try to kill you or build AGI. Except now you are iterating over humans you need to kill each time ...
My counter arguments to each:
1. Human brains just seem fragile because we can't freely experiment without hurting people badly. Programmers break their code continuously.
2. Scaling up the existing design would be faster than starting from scratch if we *truly* understood the design! But we don't, because we can't experiment/iterate.
3. This is the iteration problem. We could solve this if we could fully simulate the human brain... oooh, that's why Yudkowsky starts with that big ask instead of poking at brains and working our way up... But simulating the brain is harder than AGI, so then AGI wins again... So back to the human trial iteration problem.
4. Yes, this would be the actual alignment problem in humans, but you have an a priori higher chance of a superhuman being aligned with humans than an AGI, because our alignment is anchored in our brains, and a superhuman's brain descends from ours!
Conclusion 1: Iteration Problem
We need to solve the Iteration Problem of Superhuman General Intelligence.
- Solving it for AI means ensuring alignment before deployment
- Solving it for humans means ensuring alignment before enhancement and figuring out how to iterate over human experiments.
Conclusion 2: Alignment Anchor Problem
We need to solve how to point a superhuman intelligence at the values that are embedded in our minds and nowhere else.
- Solving for AI means finding a perfect extraction from our minds into a conceptual truth that isn't vulnerable to Goodhart's Law (see the toy sketch below this list)
- Solving for humans means stabilizing the values/reward circuits in our mind such that cognitive enhancement will not shift or destroy them.
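On the Goodhart's Law worry in the first bullet above: here is a minimal sketch (again my own toy illustration; the proxy model, distributions, and sample sizes are all assumptions made purely for demonstration) of the regressional-Goodhart failure mode, where optimizing hard on a noisy proxy for our values makes the proxy score keep climbing while the true value falls further behind it.

```python
# Minimal sketch of regressional Goodhart: the proxy is the true value plus noise,
# and picking the candidate with the best proxy score under increasing optimization
# pressure widens the gap between proxy score and true value.
import numpy as np

rng = np.random.default_rng(0)

def winner_scores(n_candidates: int, trials: int = 200) -> tuple[float, float]:
    """Average (proxy score, true value) of the proxy-maximizing candidate."""
    proxy_scores, true_values = [], []
    for _ in range(trials):
        true_value = rng.normal(0.0, 1.0, n_candidates)          # what we actually care about
        proxy = true_value + rng.normal(0.0, 1.0, n_candidates)  # imperfect extraction of it
        best = int(np.argmax(proxy))                             # optimization pressure on the proxy
        proxy_scores.append(proxy[best])
        true_values.append(true_value[best])
    return float(np.mean(proxy_scores)), float(np.mean(true_values))

for n in (10, 1_000, 100_000):  # more candidates = harder optimization on the proxy
    proxy_avg, true_avg = winner_scores(n)
    print(f"n={n:>6}: winner's proxy ~ {proxy_avg:.2f}, winner's true value ~ {true_avg:.2f}")
```

The winner's proxy score keeps rising with optimization pressure, while its true value rises only about half as fast: the harder we optimize an imperfect extraction of our values, the more the proxy overstates what we actually wanted.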