How should one signal their decision procedure in real life without getting their ass busted for "gambling with lives", etc.?
Getting stuff formally specified is insanely difficult, thus impractical, thus pervasive verified software is impossible without some superhuman help. Here we go again.
Even going from "one simple spec" to "two simple specs" is a huge complexity jump: https://www.hillelwayne.com/post/spec-composition/
And real-world software has a huge state space to cover.
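To make the composition jump concrete, here's a toy Python sketch (my own illustration, not an example from the linked post; all names are made up): a producer spec and a consumer spec that each hold in isolation, while the composed property silently depends on an interface assumption (the channel being FIFO) that neither component spec mentions.

```python
# Toy illustration: two simple specs compose only under an extra
# interface assumption that neither spec states on its own.
from collections import deque
import random

random.seed(0)

class Producer:
    """Spec P: emits strictly increasing integers."""
    def __init__(self):
        self.last = 0
    def produce(self):
        self.last += random.randint(1, 5)
        return self.last

class Consumer:
    """Spec C: outputs exactly what it receives, in the order received."""
    def consume(self, item):
        return item

def run(fifo: bool, steps: int = 200):
    channel = deque()
    prod, cons = Producer(), Consumer()
    outputs = []
    for _ in range(steps):
        # Producer and consumer run at uneven rates.
        for _ in range(random.randint(0, 2)):
            channel.append(prod.produce())
        if channel:
            item = channel.popleft() if fifo else channel.pop()
            outputs.append(cons.consume(item))
    return outputs

def strictly_increasing(xs):
    return all(a < b for a, b in zip(xs, xs[1:]))

# Composed spec: the consumer's outputs are strictly increasing.
assert strictly_increasing(run(fifo=True))      # holds: FIFO preserves order
print(strictly_increasing(run(fifo=False)))     # typically False: both component
                                                # specs still hold, but the composed
                                                # property needed the FIFO assumption
```

Each component is trivially verifiable on its own; the composed property forces you to specify and reason about the channel between them, which is where the extra complexity comes from.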
Even if that's the case, the number of 0-days out there (and the generally shitty infosec landscape) is enough to pwn almost any valuable target.
While I'd appreciate some help screening out the spammers and griefers, this doesn't make me feel existentially safe.
Eliezer believes humans aligning superintelligent AI to serve human needs is as unsolvable as perpetual motion.
I'm confused. He said many times that alignment is merely hard*, not impossible.
I'm reaching the same conclusions.
Think of a company like Google: building the biggest and best model is immensely valuable in a global, winner-takes-all market like search.
And this is in a world where Google has already announced that they're going to build an even bigger model of their own.
We are not, and won't be for some* time.
I doubt that any language less represented than English (or JS/Python) would fare better, since there would be much less good data to ingest for it.
When we evo-pressure visibly negative traits out of progressively more capable AIs using RLHF (or honeypots, or whatever, it doesn't matter), we are also training them for better evasion. And whatever we can't see and root out will remain in the trait pool. With time it would become progressively harder to spot deceit, and the capability for it would accumulate at an increasing rate.
And then there's another problem. Deceit may be linked to traits that are actually useful (for alignment and in general), and since suppressing deceit would gimp those too, the honestly less capable models would be discarded while the deceitful ones would get another chance.
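To illustrate the selection dynamic, here's a toy evolutionary sketch of my own (not a claim about how RLHF actually works internally; every name and parameter below is made up): agents have a propensity to misbehave and a skill at hiding it, and each generation we discard the ones we catch. Detected misbehavior drops, but evasion climbs.

```python
# Toy sketch: selecting against *detected* misbehavior also selects for evasion.
import random

random.seed(0)
POP, GENERATIONS, MUTATION = 200, 40, 0.05

def new_agent():
    # Each agent has a propensity to misbehave and a skill at hiding it.
    return {"misbehave": random.random(), "evade": random.random()}

def detected(agent):
    # We only catch misbehavior that the agent fails to hide.
    return random.random() < agent["misbehave"] * (1 - agent["evade"])

def mutate(value):
    return min(1.0, max(0.0, value + random.gauss(0, MUTATION)))

population = [new_agent() for _ in range(POP)]
for gen in range(GENERATIONS):
    # "RLHF-ish" filter: drop every agent we catch misbehaving.
    survivors = [a for a in population if not detected(a)]
    # Refill the population by copying survivors with small mutations.
    population = [
        {k: mutate(v) for k, v in random.choice(survivors).items()}
        for _ in range(POP)
    ]
    if gen % 10 == 0:
        avg = lambda key: sum(a[key] for a in population) / POP
        print(f"gen {gen:2d}  misbehave={avg('misbehave'):.2f}  evade={avg('evade'):.2f}")
```

The filter never sees "evade" directly, yet it rises generation after generation, while whatever misbehavior survives is increasingly the well-hidden kind.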
presumably while starting out friendly
I don't think it can start friendly (that would be getting alignment on a silver platter). I expect it to start chaotic neutral and then get warped by the optimization process (with the caveats described above).
Confused how? The only thing that comes to mind is that it's FOOM sans the F. Asking for a 0.2 FOOMs limit seems reasonable given the current trajectory 😅
I think the proper narrative in the Rocket Alignment post is "We have cannons and airplanes. Now, how do we land a man on the Moon?", not just "rocketry is hard":
We’re worried that if you aim a rocket at where the Moon is in the sky, and press the launch button, the rocket may not actually end up at the Moon.
So, the failure modes look less like "we misplaced a booster tank and the thing exploded" and more like "we've built a huge-ass rocket, but it missed its objective and the astronauts are en route to the Oort cloud".
The first one, why?
Do you have a more concrete example? Preferably one from actual EA causes.