Do I understand correctly that the blue-green graph has a y-axis that goes above 100% median reduction, with error bars in that range? (This would happen if they estimated the proportion as an ordinary, approximately normal variable rather than a bounded one - not great practice, but I want to check that this is what happened.)
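To illustrate what I mean with made-up numbers (not the post's data): summarizing a reduction proportion as mean ± 1.96·SD of an ordinary variable can give an interval that pokes above 100%, while a bounded model like a Beta fit cannot. A rough Python sketch:

```python
# Illustrative only, with made-up numbers (not the post's data): treating a
# reduction proportion as an ordinary variable can give an interval above 100%.
from scipy import stats

mean, sd = 0.92, 0.07          # hypothetical mean reduction and standard deviation
normal_ci = (mean - 1.96 * sd, mean + 1.96 * sd)
print(normal_ci)               # (0.7828, 1.0572): upper bound above 100%

# A bound-respecting alternative: a Beta distribution matched to the same
# mean and variance keeps both quantiles inside [0, 1].
var = sd ** 2
common = mean * (1 - mean) / var - 1
a, b = mean * common, (1 - mean) * common
print(stats.beta.ppf([0.025, 0.975], a, b))
```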
Question for a lawyer: how is non-reciprocity not an interstate trade issue that federal courts can strike down?
In addition to the point that current models are already strongly superhuman in most ways, I think that if you buy the idea that we'll be able to do automated alignment of ASI, you'll still need some reliable approach to "manual" alignment of current systems. We're already far past the point where we can robustly verify LLMs' claims or reasoning outside of narrow domains like programming and math.
But on point two, I strongly agree that agent foundations and Davidad's agendas are also worth pursuing. (And in a sane world, we should have tens or hundreds of millions of dollars in funding for each every year.) Instead, it looks like we have Davidad's ARIA funding, Jaan Tallinn and LTFF funding some agent foundations and SLT work, and that's basically it. And MIRI abandoned agent foundations, while OpenPhil, it seems, isn't putting money or effort into it either.
I partly disagree; steganography is only useful when it's possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well.
That said, I'd be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar - it definitely seems like something that @davidad would want to consider.
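As a toy illustration of the kind of tractable proxy I have in mind (my own sketch, not something davidad has proposed): a gate could bound a computable stand-in for output complexity, such as compressed length per character, and reject anything over budget.

```python
# Toy sketch of a complexity-bounded output gate; compression ratio is a crude,
# computable stand-in for "minimally complex". The threshold and the whole
# approach are my own illustration, not an actual safeguarded-AI design.
import zlib

def passes_complexity_gate(output: str, max_bits_per_char: float = 4.0) -> bool:
    """Reject outputs whose compressed size per character exceeds a budget,
    on the theory that highly incompressible text leaves more room for hidden
    structure the gate cannot account for."""
    if not output:
        return True
    compressed_bits = 8 * len(zlib.compress(output.encode("utf-8")))
    return compressed_bits / len(output) <= max_bits_per_char

print(passes_complexity_gate("the cat sat on the mat " * 20))        # regular text passes
print(passes_complexity_gate("8f3a1cQ9xKv02LmZr7wTqbn4sYd1cHg5"))    # noise-like text likely rejected
```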
I really like that idea, and the clarity it provides, and have renamed the post to reflect it! (Sorry this was so slow - I'm travelling.)
That seems fair!
I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that's not true for proposals like safeguarded AI, where a proof must accompany the output, and it's not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.
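A minimal sketch of the shape of that kind of gate, with a stand-in checker rather than a real proof system: the output is released only if an accompanying certificate passes an independent check, never on the model's say-so alone.

```python
# Minimal sketch of a "proof must accompany the output" gate, in the spirit of
# safeguarded-AI proposals but not any specific system. The checker here is a
# trivial placeholder; in a real system it would be a proof checker over a
# formal spec, which is the hard part.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CertifiedOutput:
    payload: str          # what the model wants to emit
    certificate: object   # machine-checkable evidence about the payload

def gate(candidate: CertifiedOutput,
         check: Callable[[str, object], bool]) -> str | None:
    """Release the payload only if the trusted checker accepts the certificate."""
    return candidate.payload if check(candidate.payload, candidate.certificate) else None

def toy_checker(payload: str, certificate: object) -> bool:
    # Placeholder "claim": the certificate must state the payload's length.
    return certificate == len(payload)

print(gate(CertifiedOutput("safe answer", 11), toy_checker))        # released
print(gate(CertifiedOutput("unverified answer", 3), toy_checker))   # None: rejected
```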
On the absolute safety, I very much like the way you put it, and will likely use that framing in the future, so thanks!
On impossibility results, there are some, and I definitely think that this is a good question, but I also agree this isn't quite the right place to ask. I'd suggest talking to some of the agent foundations people for suggestions.
I think these are all really great things that we could formalize and build guarantees around. I think some of them are already ruled out by the responsibility-sensitive safety guarantees, but others certainly are not. On the other hand, I don't think that using cars to do things that violate laws completely unrelated to vehicle behavior is in scope; similar to what I mentioned to Oliver, if what is needed in order for a system to be safe is that nothing bad can be done, you're heading in the direction of a claim that the only safe AI is a universal dictator that has sufficient power to control all outcomes.
But in cases where provable safety guarantees are in place, and the issues relate to car behavior - such as cars causing damage, blocking roads, or being redirected away from the intended destination - I think hardware guarantees on the system, combined with software guarantees, combined with verifying that only trusted code is being run, could be used to ignition-lock cars which have been subverted.
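As a toy sketch of the kind of check I mean (placeholder keys and hashes, not a real automotive security design): the ignition controller requires an attestation that the running firmware matches a trusted hash before allowing a start.

```python
# Toy sketch of an attestation-gated ignition check; the key handling and
# hashes are placeholders, not a real automotive stack.
import hashlib
import hmac

TRUSTED_FIRMWARE_HASH = hashlib.sha256(b"trusted firmware image").hexdigest()
ATTESTATION_KEY = b"device-unique secret provisioned at manufacture"  # placeholder

def attest(firmware_image: bytes) -> tuple[str, str]:
    """What a hardware-rooted attester would report: firmware hash plus a MAC over it."""
    digest = hashlib.sha256(firmware_image).hexdigest()
    tag = hmac.new(ATTESTATION_KEY, digest.encode(), hashlib.sha256).hexdigest()
    return digest, tag

def ignition_allowed(report: tuple[str, str]) -> bool:
    """Allow ignition only if the report is authentic and matches the trusted hash."""
    digest, tag = report
    expected_tag = hmac.new(ATTESTATION_KEY, digest.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected_tag) and digest == TRUSTED_FIRMWARE_HASH

print(ignition_allowed(attest(b"trusted firmware image")))    # True: car starts
print(ignition_allowed(attest(b"subverted firmware image")))  # False: ignition locked
```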
And I think that in the remainder of cases, where cars are being used for dangerous or illegal purposes, we need to trade off freedom and safety. I certainly don't want AI systems which can conspire to break the law - and in most cases, I expect that this is something LLMs can already detect - but I also don't want a car which will not run if it determines that a passenger is guilty of some unrelated crime like theft. But for things like "deliver explosives or disperse pathogens," I think vehicle safety is the wrong path to preventing dangerous behavior; it seems far more reasonable to have separate systems that detect terrorism, and separate types of guarantees to ensure LLMs don't enable that type of behavior.
Good points. Yes, storage definitely helps, and microgrids are generally able to have some storage, if only to smooth out variation in power generation for local use. But solar storms can last days, even if a large long-lasting event is very, very unlikely. And it's definitely true that if large facilities have storage, shutdowns will have reduced impact - but I understand that the transformers are used for power transmission, so having local storage at the large generators won't change the need to shut down the transformers used for sending that power to consumers.