Examples of heavily spotlighted truths are concepts like logical deduction (Validity), probability theory (Probability), expected utility theory (Utility), and decision theory (Decision). Systems and processes are important and capable precisely to the degree to which they embody these mathematical truths, regardless of whether you call them "agents" or not.
I think it would be valuable to get more rigor into this "Lawfulness" frame, especially the arguments that greater Lawfulness implies danger, and that Lawfulness is necessary for some level of capabilities. What exactly are the Lawfulness properties, and can we show they're required for capability or imply danger in some toy setting?
The power-seeking theorems say in various precise mathematical terms that optimal agents are dangerous, but inasmuch as your arguments go through some Lawfulness property, you should be able to show that optimal agents are Lawful (coherence theorems try to do this, but not always in a satisfying way) and that Lawful agents are dangerous.
Yeah, this also seems to me like one of the most interesting directions to go into, and I might continue in that direction.
The power-seeking theorems say in various precise mathematical terms that optimal agents are dangerous, but inasmuch as your arguments go through some Lawfulness property, you should be able to show that optimal agents are Lawful (coherence theorems try to do this, but not always in a satisfying way)
Yes, I am claiming (but not showing) that ideal agents are maximally Lawful, in some sense. I do actually find various coherence theorems pretty satisfying in this regard; they are examples of what I was talking about with spotlighted concepts. The fact that there is a 1-1 correspondence between some simple and (somewhat) intuitive properties (completeness, transitivity, etc.) and the representation of an agent in another relatively intuitive way (as an EU maximizer) suggests that there is some kind of important logical truth here, even if you can construct some more complicated properties to break the correspondence, or it turns out that not all of the required properties are actually so intuitive on more careful inspection.
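For concreteness, the correspondence I have in mind is the shape of the von Neumann-Morgenstern representation theorem; paraphrasing the standard textbook statement (so the precise axioms here are the usual ones, not anything specific to this discussion): a preference relation $\succeq$ over lotteries $\Delta(X)$ on a finite outcome set $X$ satisfies completeness, transitivity, continuity, and independence if and only if

\[
\exists\, u : X \to \mathbb{R} \ \text{such that for all lotteries } L, M:\quad L \succeq M \iff \mathbb{E}_{L}[u] \ge \mathbb{E}_{M}[u],
\]

with $u$ unique up to positive affine transformation.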
It's not so much that non-ideal / less Lawful agents will always do worse in every situation or that I expect the set of powerful non-Lawful minds is exactly empty. But I'm more interested in the space of possible minds, and what kind of mind you're likely to end up with if you focus almost entirely on improving capabilities or "alignment" properties like ability to follow instructions correctly.
So I would turn the request to show a particular formal property of ideal agents or Lawfulness around and ask, can you show me a concrete example of a non-coherence-theorem-obeying agent which doesn't trivially imply a slightly simpler or more natural agent which does actually obey the theorem? I think the example here is a good start, but more concreteness and some justification for why the proposed construction is more natural would make it a lot more interesting. See also some semi-related ideas and issues in this post.
Also, I'm not claiming that Lawful agents are always / automatically dangerous. I am saying that Law is generally / usually / in practice a requirement for capabilities, and I am taking it as given here that powerful agents are dangerous by default, unless they happen to want exactly what you want them to want.
I am taking it as given here that powerful agents are dangerous by default, unless they happen to want exactly what you want them to want.
If you mean "powerful agents are very likely to cause catastrophe by default, unless they happen to want exactly what you want them to want.", this is where I want more rigor. You still need to define the "powerful" property that causes danger, because right now this sentence could be smuggling in assumptions, e.g. complete preferences.
Hmm, I agree the claim could be made more rigorous, but the usage here isn't intended to claim anything that isn't a direct consequence of the Orthogonality Thesis. I'm just saying that agents capable of exerting large effects on their world are by default going to be dangerous, in the sense that the set of all possible large effects is large compared to the set of desirable large effects.
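As a toy way of making "by default" slightly more quantitative (the setup is purely illustrative, and the uniformity assumption is mine): let $\Omega$ be the set of large effects an agent could bring about and $G \subseteq \Omega$ the subset we'd actually endorse. If the effect the agent ends up aiming for is not selected with reference to our preferences, say it is drawn roughly uniformly from $\Omega$, then

\[
\Pr_{\omega \sim \mathrm{Unif}(\Omega)}\left[\omega \in G\right] = \frac{|G|}{|\Omega|} \ll 1 \quad \text{whenever } |G| \ll |\Omega|.
\]

That's all "dangerous by default" is meant to convey here: the default measure over large effects puts very little mass on the desirable ones.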
And my position is that it is often the opposing viewpoint - e.g. that danger (in an intuitive sense) depends on precise formulations of coherence or other assumptions from VNM rationality - that is smuggling in assumptions.
For example, on complete preferences, here's a slightly more precise claim: any interesting and capable agent with incomplete preferences implies the possibility (via an often trivial construction) of a similarly-powerful agent with complete preferences, and that the agent with complete preferences will often be simpler and more natural in an intuitive sense.
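To gesture at the kind of "trivial construction" I mean, here's a toy sketch (the outcomes and preferences below are made up purely for illustration; the only fact I'm leaning on is that any finite strict partial order can be extended to a total order):

```python
from graphlib import TopologicalSorter

# Toy incomplete preferences over a finite set of outcomes.
# A key maps to the set of outcomes it is strictly preferred over,
# i.e. outcomes that must rank *below* it.
# (These example outcomes are hypothetical, purely for illustration.)
strictly_worse_than = {
    "paperclips": {"nothing"},
    "staples": {"nothing"},
    # "paperclips" vs "staples" is deliberately left incomparable.
}

# TopologicalSorter expects node -> predecessors, so listing the worse
# outcomes as predecessors makes better outcomes appear later in the order.
order = list(TopologicalSorter(strictly_worse_than).static_order())

# Assign utilities by rank. The result is a complete, transitive preference
# relation that agrees with every strict preference we started with, while
# arbitrarily breaking ties among previously incomparable outcomes.
utility = {outcome: rank for rank, outcome in enumerate(order)}
print(utility)  # e.g. {'nothing': 0, 'paperclips': 1, 'staples': 2}
```

The completed agent agrees with the original one everywhere the original had a view; it just also has (arbitrarily chosen) opinions where the original was silent, which is why I'd call it similarly powerful and arguably simpler.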
But regardless of whether that specific claim is true or false or technically false, my larger point is that the demand for rigor here feels kind of backwards, or at least out of order. I think once you accept things which are relatively uncontroversial around here (the orthogonality thesis is true, powerful artificial minds are possible), the burden is then on the person claiming that some method for constructing a powerful mind (e.g. give it incomplete preferences) will not result in that mind being dangerous (or epsilon distance in mind-space from a mind that is dangerous) to show that that's actually true.
I don't think I could name a working method for constructing a safe powerful mind. What I want to say is more like: if you want to deconfuse some core AI x-risk problems, you should deconfuse your basic reasons for worry and core frames first, otherwise you're building on air.
I don't think I could name a working method for constructing a safe powerful mind.
I could, and my algorithm basically boils down to the following:
Specify a weak/limited prior over goal space, like the genome does.
Create a preference model by using DPO, RLHF or whatever else suits your fancy to guide the intelligence into alignment with x values.
Use the backpropagation algorithm to update the weights of the brain in the optimal direction for alignment.
Repeat until you get low loss, or until you can no longer improve the objective (a toy sketch of this loop follows).
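To make the shape of that loop concrete, here's a minimal toy sketch in code. Everything in it (the three-action "policy", the hand-written preference pairs, the DPO-style loss) is an illustrative stand-in rather than a real training setup:

```python
import numpy as np

# A toy "policy": a softmax distribution over a handful of discrete actions,
# parameterized by a logit vector theta. This stands in for the "brain" above.
ACTIONS = ["helpful", "neutral", "harmful"]

def log_probs(theta):
    z = theta - theta.max()
    return z - np.log(np.exp(z).sum())

# Step 1: a weak/limited prior, represented here (my stand-in) as a uniform
# reference policy that the tuned policy starts from and is regularized toward.
theta_ref = np.zeros(len(ACTIONS))
theta = theta_ref.copy()

# Step 2: preference comparisons standing in for DPO/RLHF data.
# Each pair is (index of preferred action, index of dispreferred action).
preferences = [(0, 2), (0, 1), (1, 2)]

def dpo_loss(theta, beta=1.0):
    """Toy DPO-style objective: push the policy's log-prob ratios (relative
    to the reference policy) toward the preferred side of each pair."""
    lp, lp_ref = log_probs(theta), log_probs(theta_ref)
    total = 0.0
    for w, l in preferences:
        margin = beta * ((lp[w] - lp_ref[w]) - (lp[l] - lp_ref[l]))
        total += -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))
    return total / len(preferences)

def numerical_grad(f, x, eps=1e-5):
    # Central finite differences; fine for a three-dimensional toy problem.
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# Steps 3 and 4: gradient descent on the loss until it stops improving.
learning_rate = 0.5
for _ in range(200):
    theta -= learning_rate * numerical_grad(dpo_loss, theta)

print({a: round(float(p), 3) for a, p in zip(ACTIONS, np.exp(log_probs(theta)))})
```

The point is just the structure: start from a weak reference policy, score preference comparisons, and descend the gradient until the loss stops improving.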
but the usage here isn't intended to claim anything that isn't a direct consequence of the Orthogonality Thesis.
I want to flag that the orthogonality thesis is incapable of supporting the assumption that powerful agents are dangerous by default, and the reason is that it only makes a possibility claim, not anything stronger than that. I think you need the assumption of 0 prior information in order to even vaguely support the hypothesis that AI is dangerous by default.
I'm just saying that agents capable of exerting large effects on their world are by default going to be dangerous, in the sense that the set of all possible large effects is large compared to the set of desirable large effects.
I think a critical disagreement is probably that I think even weak prior information shifts things toward AI being safe by default, and that we don't need to specify most of our values, and can instead offload most of the complexity to the learning process.
I think once you accept things which are relatively uncontroversial around here (the orthogonality thesis is true, powerful artificial minds are possible) the burden is then on the person claiming that some method for constructing a powerful mind (e.g. give it incomplete preferences) will not result in that mind being dangerous (or epsilon distance in mind-space from a mind that is dangerous) to show that that's actually true.
I definitely disagree with that, and accepting the orthogonality thesis plus powerful minds are possible is not enough to shift the burden of proof unless something else is involved, since as stated it excludes basically nothing.
For example, on complete preferences, here's a slightly more precise claim: any interesting and capable agent with incomplete preferences implies the possibility (via an often trivial construction) of a similarly-powerful agent with complete preferences, and that the agent with complete preferences will often be simpler and more natural in an intuitive sense.
This is not the case under things like invulnerable incomplete preferences, where they managed to weaken the axioms of EU theory enough to get a shutdownable agent:
I don't see how this result contradicts my claim. If you can construct an agent with incomplete preferences that follows Dynamic Strong Maximality, you can just as easily (or more easily) construct an agent with complete preferences that doesn't need to follow any such rule.
Also, if DSM works in practice and doesn't impose any disadvantages on an agent following it, a powerful agent with incomplete preferences following DSM will probably still tend to get what it wants (which may not be what you want).
Constructing a DSM agent seems like a promising avenue if you need the agent to have weird / anti-natural preferences, e.g. total indifference to being shut down. But IIRC, work on the original shutdown problem was never intended to be a complete solution to the alignment problem, or even a practical subcomponent of one. It was just intended to show that a particular preference that is easy to describe in words and intuitively desirable as a safety property is actually pretty difficult to write down in a way that fits into various frameworks for describing agents and their preferences in precise ways.
and I am taking it as given here that powerful agents are dangerous by default, unless they happen to want exactly what you want them to want.
I want to flag that this is a point where I very strongly disagree with this assumption, and I don't really think the arguments for this proposition actually hold up.
I don't have good ideas here, but something that results in increasing the average Lawfulness among humans seems like a good start. Maybe step 0 of this is writing some kind of Law textbook or Sequences 2.0 or CFAR 2.0 curriculum, so people can pick up the concepts explicitly from more than just, like, reading glowfic and absorbing it by osmosis. (In planecrash terms, Coordination is a fragment of Law that follows from Validity, Utility, and Decision.)
Or maybe it looks like giving everyone intelligence-enhancing gene therapy, which is apparently a thing that might be possible!?
This kind of stuff is but a dream that won’t work. It’s just not pragmatic imo. You’re not gonna get people to read the right sequence of words and then solve coordination. There’s this dream of raising the sanity waterline, but I just don’t see it panning out anytime soon. I also don’t see widespread intelligence-enhancing gene therapy happening anytime soon (people are scared of vaccines, let alone gene therapy that people have an ick feeling for due to eugenics history…).
I agree that making these changes on a society-wide scale seems intractable within a 5-year time frame for transformative AI (i.e. by 2027).
Note that these are plans that don't need lots of people to be on board with them, just a select few. Raise the sanity waterline not of the majority of society but just of some of its top thinkers who are currently not working on AI alignment, and they may start working on AI alignment and may have a large impact. Come up with a gene modification therapy that substantially increases the intelligence even of already-smart people, find a few brave AI alignment researcher volunteers to take it secretly, and again you may have a large impact on AI alignment research progress rates. Although, I really doubt that a successful and highly impactful gene therapy could be developed even in just small animal models within a 5-year time frame.
I agree that if you hyper-focus on a very small number of people, the plan would be much more tractable (though still not really tractable in my view).
My stance is that you will get significantly more leverage if you focus on leveraging AIs to augment alignment researchers than if you try to make researchers a little smarter by having them read a document or undergo gene therapy. One of my agendas focuses on the former, so that's probably worth factoring in.
Max H commented on the dialogues announcement, and both of us were interested in having a conversation where we try to do something like "explore the basic case for AI X-risk, without much reference to long-chained existing explanations". I've found conversations like this valuable in the past, so we decided to give it a shot.
His opening comment was: