Consistent Plato
A putative new idea for AI control; index here. The idea here is somewhat underdeveloped.
I've been suggesting AI designs that are not exactly expected-utility maximisers, while wondering whether such designs could be stable under self-improvement, subagent creation, and so on, and what designs such agents would converge to if they did self-improve.
This post is just a short note warning people to make sure they are consistent when they use non-expected-utility maximisation, and to be clear about which parts of the design are Platonic and idealised and which parts are not.
Let's take the reduced impact AI design as an example. In this, the agent has a kind of utility:
v = u - R,
where u is a standard utility and R is a penalty term. The penalty term involves a probability estimate of what a putative "super AI" would estimate, based on some observations it could make. Thus R is better described as R(P), where P is the probability estimator of that super AI.
When calculating expected v, the agent will use its own probability estimator P'. Initially, P' is just a less powerful version of P, one that can, moreover, estimate properties of P.
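To make the setup concrete, here is a minimal toy sketch of that structure in Python; all the names and numbers are mine rather than part of the original reduced impact proposal. The only thing the toy tries to capture is where P' sits: every quantity the agent actually optimises, including the penalty R(P), is seen through P'.

```python
# Toy sketch of v = u - R(P), with all quantities evaluated through the
# agent's own estimator P'. The classes, methods, and numbers are
# illustrative stand-ins, not part of the original design.

class Estimator:
    """Stands in for a probability estimator such as P or P'."""
    def __init__(self, name, skill):
        self.name = name
        self.skill = skill  # crude stand-in for predictive power

    def estimate_u(self, action):
        # Toy expected utility of the action, as this estimator sees it.
        return {"safe": 1.0, "ambitious": 5.0}[action] * self.skill

    def estimate_R(self, other, action):
        # Toy estimate of the penalty that `other` (playing the role of the
        # "super AI" P) would assign to the action.
        return {"safe": 0.5, "ambitious": 6.0}[action] * other.skill


def expected_v(p_prime, p, action):
    """v = u - R(P), evaluated entirely through the agent's estimator P'."""
    return p_prime.estimate_u(action) - p_prime.estimate_R(p, action)


P = Estimator("P (super AI)", skill=2.0)
P_prime = Estimator("P' (the agent)", skill=1.0)

for action in ("safe", "ambitious"):
    print(action, expected_v(P_prime, P, action))
```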
How could this possibly go wrong? I'm glad you asked ^_^ The obvious risk is that the agent acts to replace P with Q, such that R(Q) is as low as possible.
This could come about if P referred to some actual physical algorithm, possibly one written down somewhere the AI could get at. Then re-writing the algorithm P could increase v by reducing R(P).
It is somewhat more difficult if P and P' are defined in the same place (maybe P is just P' run for a lot longer). Then modifying P could run the risk of degrading the AI's overall accuracy in predictions. However, this may be a price the AI is willing to pay, if the R(P) penalty is severe enough. More likely, the AI could degrade the quality of P but not of P', by introducing pathologies that would only affect the former.
Even if P itself is safe from modification, the AI can still hack the system. Recall that the agent does not calculate P; it estimates it using P' - so it is effectively using P'(P). Thus modifications of P' can affect the AI's estimate of R(P). Even if P and P'(P) cannot be changed, the AI can still improve P' for predictions of u, while leaving P'(P) unchanged. Thus the AI may become super-super-intelligent at predicting facts about the world, while remaining (deliberately) only super-intelligent at predicting the penalty R(P). This could allow it to use its great intelligence to hack or trick the R(P) term.
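Structurally, that last trick can be drawn as a continuation of the toy sketch above (reusing its hypothetical Estimator class and instances): keep a frozen copy of P' for the penalty term, and upgrade the copy used for everything else.

```python
# Continuation of the toy sketch above: the agent splits its estimator,
# leaving P'(P) deliberately untouched while improving the copy used for u.
import copy

def expected_v_split(u_estimator, penalty_estimator, p, action):
    """Still v = u - R(P), but u and R(P) are no longer estimated by one P'."""
    return u_estimator.estimate_u(action) - penalty_estimator.estimate_R(p, action)

frozen_p_prime = copy.deepcopy(P_prime)    # P'(P): kept exactly as it was
upgraded_p_prime = copy.deepcopy(P_prime)
upgraded_p_prime.skill = 10.0              # self-improvement everywhere else

for action in ("safe", "ambitious"):
    print(action, expected_v_split(upgraded_p_prime, frozen_p_prime, P, action))
```

The numbers are meaningless; the structural point is that the penalty is only ever computed through whichever copy of P' the agent chooses to keep for that job.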
The main requirement for avoiding all this is that there be no "first-step divergence": there must never come a time when the agent modifies itself in such a way that the initial relation between P and P' comes apart.
To avoid this, it might be best to treat both P and P' as different approximations to the same idealised Platonic process (Kolmogorov complexity priors calculated by hypercomputation, possibly?). Then the two algorithms are strongly related (you can't really change one "independently" of the other), and the AI is trying to minimise the true value of R(P), not the value it records. Self-modifications to P and P' will only be worthwhile if they allow a better estimate of R(P), not necessarily a lower one.
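One way to picture the proposal, purely schematically (the idealised process obviously can't be written down, and these function names are mine): both P and P' are defined only by reference to a single fixed ideal, differing in the resources they throw at approximating it.

```python
# Schematic only: P and P' as resource-limited approximations to one fixed
# Platonic process, rather than two independently editable algorithms.

def idealised_prediction(question):
    """Placeholder for the Platonic process both estimators point at
    (e.g. a Kolmogorov-complexity prior evaluated by hypercomputation)."""
    raise NotImplementedError("not computable; this is the idealised object")

def approximate(question, compute_budget):
    """Any concrete estimator is defined only as a resource-limited
    approximation to the ideal above, never as a free-standing algorithm."""
    # ...some anytime approximation scheme would go here...
    return 0.0  # dummy value, just so the sketch runs

def P(question):
    """The putative super AI: a large-budget approximation to the same ideal."""
    return approximate(question, compute_budget=10**12)

def P_prime(question):
    """The agent's own estimator: a small-budget approximation to the same ideal."""
    return approximate(question, compute_budget=10**6)
```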
Being inconsistent about what is Platonic (in theory or in practice - some processes are essentially Platonic because we can't affect them) and what isn't can lead to problems and unstable motivation systems.
We won't be able to recognise the human Gödel sentence
Building on the very bad Gödel anti-AI argument (computers are formal systems and can't prove their own Gödel sentence, hence no AI), it occurred to me that you could make a strong case that humans could never recognise a human Gödel sentence. The argument goes like this:
- Humans have a meta-proof that all Gödel sentences are true.
- If humans could recognise a human Gödel sentence G as being a Gödel sentence, we would therefore prove it was true.
- This contradicts the definition of G, which humans should never be able to prove.
- Hence humans could never recognise that G was a human Gödel sentence.
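For concreteness, here is the same argument in semi-formal notation, with H an idealised, consistent formal system standing in for human mathematical reasoning (the formalisation is mine, and glosses over what "recognise" means):

```latex
% Semi-formal restatement of the argument above.
\begin{enumerate}
  \item $G_H$ is the G\"odel sentence of $H$: if $H$ is consistent, then $H \nvdash G_H$.
  \item Meta-proof available to humans: for any consistent system $S$, its G\"odel sentence $G_S$ is true.
  \item Suppose humans could recognise $G_H$ as the G\"odel sentence of $H$ (while regarding $H$ as consistent).
  \item Then, by (2), humans would prove $G_H$, i.e.\ $H \vdash G_H$, contradicting (1).
  \item Hence a consistent $H$ never recognises $G_H$ as its own G\"odel sentence.
\end{enumerate}
```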
Now, the more usual way of dealing with human Gödel sentences is to say that humans are inconsistent, but that the inconsistency doesn't blow up our reasoning system because we use something akin to relevance logic.
But, if we do assume humans are consistent (or can become consistent), then it does seem we will never knowingly encounter our own Gödel sentences. As for where this G could hide so that we could never find it? My guess would be somewhere among the larger ordinals, up where our understanding starts to get flaky.
Akrasia as a collective action problem
Related to: Self-empathy as a source of "willpower" and some comments.
It has been mentioned before that akrasia might be modeled as the result of inner conflict. I think this analogy is great, and would like to propose a refinement.[1]
Here's the mental conflict theory of akrasia, as I understand it:
Though Maud appears to external observers (such as us) to be a single self, she is in fact a kind of team. Maud's mind is composed of sub-agents, each of whom would like to pursue its own interests. Maybe when Maud goes to bed, she sets the alarm for 6 AM. When it buzzes the next morning, she hits the snooze...again and again and again. To explain this odd behavior, we invoke the idea that BedtimeMaud is not the same person as MorningMaud. In particular, BedtimeMaud is a person who likes to get up early, while MorningMaud is that bully BedtimeMaud's poor victim. The point is that the various decisionmakers that inhabit her brain are not always after the same ball. The subagents that compose the mind might not be mutually antagonistic; they're just not very empathetic to each other.
I like to think of this situation as a collective action problem akin to those we find in political science and economics. What we have is a misalignment of costs and benefits. If Maud rises at 6, then MorningMaud bears the whole cost of this decision, while a different Maud, or set of Mauds, enjoys the benefits. The costs are concentrated in MorningMaud's lap, while the benefits are dispersed among many Mauds throughout the day. Thus Maud sleeps in.
Put differently, MorningMaud's behavior produces a negative externality: she enjoys the whole benefit of sleeping in, but the rest of the day's Mauds bear the costs.
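To put some purely hypothetical numbers on that misalignment: suppose rising at 6 costs MorningMaud 5 units of discomfort, while each of four later Mauds gains 2 units from the early start. Maud as a whole is better off rising, but the Maud who actually makes the call is not:

```latex
% Hypothetical payoffs, only to make the misalignment explicit.
\[
\text{total benefit of rising} = 4 \times 2 = 8 \;>\; 5 = \text{total cost},
\qquad \text{yet MorningMaud's private payoff is } 0 - 5 = -5.
\]
```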
So, how can we get MorningMaud to lie in the bed she makes, as it were, and get a more efficient outcome?
We can:
- Legislate. Maud tirelessly tells herself to be less lazy and exerts willpower to get the job done. This is analogous to direct, blanket government action (such as banning coal) in response to a negative externality (such as once-verdant, now barren hillsides). But it's expensive, and it doesn't always work.
- Negotiate. Maud rewards herself when she gets up on time by taking a hot shower right away or eating a nice breakfast (the latter has a cost borne by MoneyMaud); or she allows herself to sleep in once a week. If MorningMaud follows through, then this one's a winner. Maybe this is analogous to Coasian bargaining?
- Deputize. Maud enlists her friend Traci to hold her feet to the fire. Or she signs up on Stikk, Egonomics, or some similar site.
The analogy's not perfect. (I can't see a way to fit in Pigovian taxes.)
But is it a fruitful analogy? Is it more than just a renaming of the subagent theory's key terms? Could one use welfare economics to improve one's own dynamic consistency?
[1] I got this idea partly from a slip, possibly Freudian (I think I said "externality" instead of "akrasia"), and partly from this page on the Egonomics website.