# An angle of attack on Open Problem #1

**There is a problem with the proof here and I have to think about whether I can fix it.** Thanks to * I have posted a new and hopefully correct proof attempt. Thanks again to vi21maobk9vp!*

*In his talk on open problems in Friendly AI*,* Eliezer's first question is how, given Löb's theorem, an AI can replace itself with a better expected utility maximizer that believes in as much mathematics as the original AI. I know exactly one trick for that sort of problem, so I decided to try that on a toy variant. To my surprise, it more or less just worked. Therefore:*

Professor Quirrell proposes a game. You start with a score of one. Professor Quirrell moves first, by choosing a computer program and showing you its source code. You then have three options: Take your winnings; double down; or self-destruct.

If you take your winnings, the game ends, and your score is converted to Quirrell points.

If you self-destruct, the game ends, your score is lost, you'll be sent to bed without dinner, you'll lose 150 House points, Rita Skeeter will write a feature alleging that you're a Death Eater, and Professor Quirrell will publicly critique your performance. You are advised not to pick this option.

If you double down, your score doubles, and you advance to the next round. Professor Quirrell again moves first by choosing a computer program. Then, it's your turn—except that this time, you don't get to choose your move yourself: instead, it'll be chosen by Professor Quirrell's program from the previous round.

Professor Quirrell will endeavor to present an *educational* sequence of programs.

*

The idea is that Quirrell will make you self-destruct if he possibly can, so you must only accept programs that don't self-destruct, that accept only programs that don't self-destruct, that accept only programs that only accept—etc. That's supposed to capture one essential aspect of Eliezer's problem, namely how to make sure that a proposed rewrite doesn't destroy your values, while ignoring the complications due to a different aspect, namely comparing the expected values before and after the rewrite. In Quirrell's game, there are safe and unsafe rewrites, and you should always double down on a safe one and take your winnings when presented with an unsafe one. Let's look at some interesting programs that we could recognize as safe. [~~And to deal with the possibility of an infinite sequence of double-downs, let's stipulate a small but positive chance each round that Quirrell will end the game and pay you even if your program chose to double down.~~ **ETA:** Luke A. Somers points out that this provision isn't necessary.]

Let PA(0) := Peano Arithmetic, and PA(n+1) := PA(n) + for all formulas 'C': "if PA(n) proves 'C', then C". Define AI(n) by

def AI(n)(p): Look at the first 3^^^3 theorems of PA(n). if (one of them says "p is safe"): double down else: take winnings

If the current round is controlled by AI(7) and Quirrell hands it AI(6) as input, then AI(7) will double down: PA(7) will conclude that AI(6) will only double down on programs that PA(6) proves to be safe, which implies that they *are* safe. But if Quirrell *hands AI(7) to itself*, this reasoning doesn't go through, because PA(7) cannot use "PA(7) proves p is safe" to conclude "p is safe". So even if Quirrell is nice enough to choose AI(n)'s in a decreasing sequence, our windfall will end after a constant number rounds, because (***sinister drumroll***) AI(0) has run out of math to believe in, so to speak. That's precisely the problem Eliezer explains in his talk.

By quining, we could write an AI_{Q} which will recognize itself as safe, so if Quirrell chooses AI_{Q} over and over again, we'll be doubling down for longer and longer times. But that insight won't help us with writing interesting self-modifying AIs. [**ETA:** Wei Dai has managed to push the limits of using quining into interesting territory.] Is there something that can use non-specialized reasoning, like the AI(n), to recognize an indefinitely long sequence of variants of itself as safe?

*

Define "p is safe for n steps" (p a program) to mean that there is no m, 0 < m <= n, and sequence of programs p=p_{0}, p_{1}, ..., p_{m} such that (a) for 0 < i < m, p_{i-1} will double down when fed p_{i}, and (b) p_{m-1} will * self-destruct* when fed p

_{m}. Clearly this can be formalized in the language of Peano Arithmetic.

Now let's extend the language of PA by a constant symbol K, and define PA_{K} := Peano Arithmetic (actually, the natural extension of PA to this new language) + for all formulas 'C' of the base language: "if PA(K) proves 'C', then C". Finally, define AI_{K} by

def AI_{K}(p): Look at the first 3^^^3 theorems of PA_{K}. if (one of them says "p is safe for K steps"): double down else: take winnings

I claim that AI_{K} is safe, and furthermore, AI_{K}(AI_{K}) will double down, and so will AI_{K}(p) where p is some trivial modification of AI_{K} like changing 3^^^3 to 4^^^^4.

*

Fair warning: now comes the technical section of this post.

*Proof sketch. *Let inst(n,x) be the meta-level function that takes a sentence or proof x in the extended language and substitutes the numeral n (i.e., the unary number in the language of PA that encodes n) for all occurrences of the constant symbol K. Note that if x is a valid proof in PA_{K}, then inst(n,x) will be a valid proof in PA(n+1). Of course, what statement it is a proof *of* will depend on n. In particular, if x proves "p is safe for K steps", then inst(n,x) is a PA(n+1) proof of "p is safe for n steps". Since this is argument works for all n, p will in fact be safe [and we can formalize this in PA_{ω}, if we want to]. This shows that AI_{K} is safe, as claimed.

Formalizing the above argument in Peano Arithmetic, and writing inst_{PA}(n,x) for the object-level encoding of the meta-level function, we can prove: "For all extended-language sentences 'C', if PA_{K} proves 'C', then for all n, PA(n+1) proves inst_{PA}(n,'C')." By substituting K-1 for n, it follows that for all exteded-language sentences 'C', PA_{K} proves "If PA_{K} proves 'C', and K>0, then PA(K) proves inst_{PA}(K-1,'C')". Now, for 'D' := inst(K-1,'C'), PA can prove that "inst_{PA}(K-1,'C') = 'D'"; thus, by virtue of its extra axioms, PA_{K} concludes: "If PA_{K} proves 'C', and K>0, then inst(K-1,C)".

What we must show is that PA_{K} proves "AI_{K} is safe for K steps" through some argument that doesn't actually involve simulating AI_{K}(AI_{K}) (because in that case, we wouldn't expect to find the proof among the first 3^^^3 theorems). First, note that PA already ("quickly") proves "If AI_{K}(p) doubles down, then PA_{K} proves 'p is safe for K steps'". In PA_{K}, it follows that "If AI_{K}(p) doubles down, and K>0, then inst(K-1, 'p is safe for K steps')". But this is just a meta-level expression for the sentence "If AI_{K}(p) doubles down, and K>0, then p is safe for K-1 steps"—and at that point, the reflective magic is over, and we only need to establish by "ordinary" PA reasoning that if AI_{K} will only accept programs that are safe for K-1 steps, then AI_{K} is safe for K steps.

Q.E.D.

Note that this argument does not depend on the number of theorems AI_{K} looks at, because it only depends on the fact that if AI_{K}(p) *does* double down, then there is *some* PA_{K} proof of p's (K-1)-step safety.

*

The real question is whether this is just a hack or can tell us something about how to approach a solution to the real thing. It could very well be that this is one of the cheap tricks Eliezer and Marcello tried that don't solve the core problem (as Eliezer explains in his talk). Certainly the proof seems to be more tangled with the rounds structure of Quirrell's game than I find elegant. Also, I'm not at all sure that the key proof idea noted in the previous paragraph still works when we go from Quirrell's game to expected utility maximization.

However, as I said at the beginning of this post, I know exactly one trick for dealing with problems of this sort—by which I mean, trying to do something that seems impossible due to a diagonalization proof. It's well-appreciated that you can usually avoid diagonalization by passing from a single collection of things to a hierarchy of things (we can avoid Cantor by concluding that there are multiple infinite cardinalities; Gödel and Löb, by the PA(n) hierarchy; Turing, by a hierarchy of halting oracles; Tarski, by a hierarchy of truth predicates; and so on). It's less well appreciated, but I think true (though fuzzy), that many fields manage to circumvent the effects of diagonalization a bit further by considering objects that in some sense live on multiple levels of the hierarchy at the same time. I'd call that "the parametric polymorphism trick", perhaps.

In this post, we met PA_{K}, which can be interpreted as PA(n), for any n. I haven't tried it in detail, but something very similar should be possible for Tarski's truth predicates. A sufficiently powerful total programming language cannot have a self-interpreter, but you should be able to have a constant symbol K, a single interpreter code file, and a hierarchy of semantics such that according to the K=n+1 semantics, the interpreter implements the K=n semantics. In Church's simple theory of types, you can't apply a function to itself, but in polymorphic variants, you can instantiate (id : α -> α) to (id : (α -> α) -> (α -> α)), so that id(id) makes sense. Set-theoretic models of the untyped lambda calculus need to deal with the fact that if a set S is isomorphic to the function space (S -> T), then S is either empty or has only one element; the usual solution would take me a bit more than one sentence to explain, but it's always struck me as related to what happens in the polymorphic type theories. Looking a bit farther afield, if you're a set theory platonist, if α<β<γ and if the von Neumann levels **V**_{α}, **V**_{β} and **V**_{γ} are all models of ZFC, then the ZFC proof that "if there is a set model of ZFC, then ZFC is consistent" can be interpreted in **V**_{β}, where it applies to **V**_{α}, and it can also be interpreted in **V**_{γ}, where it applies to both **V**_{α} and **V**_{β}. And so on. It may be that there's a good solution to the AI rewrite problem and it has nothing to do with this type of trick at all, but it seems at least worthwhile to look in that direction.

[Actually, come to think of it, I *do* know a second trick for circumventing diagonalization, exemplified by passing from total to partial recursive functions and from two- to three-valued logics, but in logic, that one usually makes the resulting systems too weak to be interesting.]

*

Three more points in closing. First, sorry for the informality of the proof sketch! It would be very much appreciated if people would go through it and point out unclear things/problems/corrections. Also, I'm hoping to make a post at some point that gives a more intuitive explanation for *why* this works.

[**ETA:** ] Second, provability in PA_{K} in some sense says that a sentence is provable in PA(n), for all n. In particular, PA_{K} is conservative over PA(1), since if PA_{K} proves a sentence C in the base language, then PA(1) proves inst(1,C), which is just C; in some sense, this makes PA_{K} rather weak. If we don't want this, we could make a variant where provability implies says there is some m such that the sentence is provable in PA(n) for all n>m. To do this, we'd use a trick usually employed to show that there are non-standard models of the natural numbers: Add to PA_{K} the axioms "K>1", "K>2", "K>3", and so on. This is consistent by the Compactness Theorem, because any concrete proof can only use a finite number of these axioms. But in this extended system, we can prove anything that we can prove in any PA(n).

Third, I chose the particular definition of PA_{K} because it made my particular proof simpler to write. Looking only at the definitions, I would find it more natural to make PA_{K} conservative over PA(0) by using "if K>0 and PA(K-1) proves 'C', then C".

## Comments (84)

Best*8 points [-]I played around a bit with quining solutions, and came up with the following, which solves this toy problem fairly well:

AI_Q2(3) should double down on AI_Q2(4) as well as AI_Q2(4^^^^4). (As well as variants that are provably equivalent to itself like speed/size-optimized versions of itself.) I sent this to Benja last night and he responded with (in part) "You've succeeded in making me uncertain whether quining approaches could actually be directly useful in solving the real problem (though it still doesn't seem likely)."

I agree it doesn't seem likely the real problem can be solved with quining approaches, but I'm post this solution here in case anyone can extend the idea further. At the very least it should be interesting to understand

whyquining approaches don't work on the real problem. What relevant aspect of the real problem isn't being captured by this toy problem?The main relevant aspect of the real problem not captured by quining is the continuity of self-replacement with environmental action. All the quining approaches Marcello and I considered, involved differently treating the decisions to "modify self" and "do something in environment". This means you can't use actions to construct an environmental copy of yourself out of transistors, because the environmental copy wouldn't get the privileged-special-case treatment. More generally, it seems to imply an arbitrary Cartesian boundary with different levels of mathematical trust for physical processes inside and outside the boundary. And that you can't just frame desirable actions in terms of their predicted consequences.

Edit: This doesn't mean quining approaches aren't worth exploring. They might get us closer to a solution than contemplating, say, approaches that

don'tquine.*6 points [-]Let me back up from my other response. It just occurred to me that UDT1.1 (with a proof system instead of "math intuition") already constitutes a quining solution to AI reflection.

Consider an AI facing the choice of either creating a copy of itself, which will then go out into the world, or doing nothing. Unfortunately, due to Lobian problems it can't prove that a copy of itself won't do something worse than nothing. But UDT1.1 can be thought of as optimizing over an input/output mapping that is implemented by all of its copies. For each possible mapping, it proves a utility value starting from the assumption that it implements that map (which implies that its copies and provably equivalent variants also implement that map). So it doesn't need to prove (from scratch) that its copy won't do something worse than nothing.

(Requested reply.)

I think there'd be a wide variety of systems where, so long as the "parent" agent knows the exact strategy that its child will deploy in all relevant situations at "compile time", the parent will trust the child. The point of the Lob problem is that it arises when we want the parent to trust the child generally, without knowing exactly what the child will do. For the parent to precompute the child's exact actions implies that the child can't be smarter than the parent, so it's not the kind of situation we would encounter when e.g. Agent A wants to build Agent B which has more RAM and faster CPUs than Agent A while still sharing Agent A's goals. This, of course, is the kind of "agents building agents" scenario that I am most interested in.

*2 points [-]That's not the case.

Consider a resource-constrained variant of the original game:

Each program receives as input the round number n and the next program, encrypted by repeated xoring with the output of a monotonic computable function f(n).

Let T_f(n) be the runtime of the fastest algorithm that computes f. Note that T_f(n) is monotonically increasing.

At round n, the current program has a time limit T(n) = C_0 + C_1 * T_f(n). Quirrell never submits programs that exceed the time limit.

In this variant of the game, you have to submit the first program, which has to obey a time limit T(0).

The initial program will not be able to compute the relevant strategy of any of its successors (except at most finitely many of them).

And yet, my quining solution still works (just add a decryption step), and I think Wei Dai's solution also works.

*0 points [-]It seems to me that the variant with time limits has a simple quining solution:

1) Get the time limit as input.

2) Spend almost all available time trying to prove in some formal theory that the next program is equivalent to this one.

3) Double down if a proof is found, otherwise take winnings.

That's similar to your idea, right? I'm not sure if it addresses Eliezer's objection, because I no longer understand his objection...

Yes.

During the April 2013 workshop I rephrased this as the principle "The actions and sensor values of the offspring should not appear outside of quantifiers". Justification: If we have to reason case-by-case about all possible actions, all possible sensor values, and all possibles states of the world, our child's size must be less than or equal to "the number of cases we can consider" / "child's sensor state space" x "child's action state space" x "world state space" which in general implies a logarithmically smaller child. I call this the Vingean Principle.

Right, so we want an AI design that does the right thing "naturally", without a special case for "modify self". It seems to me that UDT may already be such a design, if we had a solution to logical uncertainty (what I called "math intuition module") to plug into it. When I wrote about logical uncertainty being important for AGI, I had in mind the problem of not being able to prove P!=NP and hence not being able to prove whether EU(search for polynomial solution to 3-SAT) < EU(not search for polynomial solution to 3-SAT). But if we had an AI that can handle that kind of decision problems, why wouldn't it also be able to design and build another AI that uses a proof system that it can't prove to be sound? Is there a reason to think that "AI reflection" isn't just a special case of "logical uncertainty"?

(Of course UDT would probably build successors/copies of itself that also use "math intuition modules" rather than "proof systems" but we can imagine forcing a UDT-AI to choose between a selection of other AIs that are based on different proof systems, in which it would choose based on an analysis of the trade-off between power vs risk of inconsistency/unsoundness, like how a human would make the choice.)

I can't visualize how saying that a result is "uncertain" could make the Lobian issue go away - do you have a concrete visualization for this, and if so, can you sketch it even if very briefly?

Possibly relevant: No Turing machine can assign finite probability to the halting sequence.

(Regardless of whether UDT1.1 is a correct quining solution to AI Reflection, we ultimately still need something that is not so "brute force", so let's continue this thread.)

I guess I'm saying that there's not necessarily a Lob's theorem or equivalent for a system that (does something like) assigns probabilities to mathematical statements instead of proving/disproving them from a fixed set of axioms. Do you think there is such an equivalent to Lob's theorem, and if so what might it say?

I'm not seeing why this is relevant. Can you explain?

*1 point [-]In the Q&A to the open problems talk, he said that if logical uncertainty is the solution to the problem, then he and Marcello didn't figure out how, and they did look in that specific place.

I think that it would be useful to try to make some progress about how to reason probabilistically about logical statements before trying to use it as the solution to other problems, because my picture of how it could work is so unclear that I feel unable to draw any conclusions about what it can or cannot do with any kind of certainty. For example, I do not know a condition on a probability distribution over mathematical statements that would prevent the spurious proof problem -- when I read about playing chicken with the universe, it sounded very familiar, but I can't now see a way to apply it in the case of a probability distribution. (Then again, if we do expect to find a solution to the Löbian problems in that area, perhaps we should be working on logical uncertainty now, not on Löbian problems...)

If it weren't for concerns about AI risk, I would be advocating working on logical uncertainty instead of Löbian problems, because it seems to me that an AI using "math intuition" may not face a Löbian problem to begin with, or if it does, the problem and/or solution may be different enough from that of an AI using a proof system that any work we do now will not be of much help.

(From the perspective of reducing AI risk, maybe it's good to focus people's attention on Löbian problems, unless it leads them to start thinking that they ought to work on logical uncertainty due to reasoning like the above...)

Hm; are you saying you think FAI can probably be implemented without solving the logical uncertainty problem? My current visualization is that both the logical uncertainty problem and the safe rewrite problem will need to be solved -- among others -- and the reason I've been thinking about the rewrite problem is that using proof-based techniques, I at least had a grasp of how to think about it. (And my intuition has so far been that logical uncertainty will probably have diagonalization problems in a different guise when we actually have a concrete enough proposal to test this, so it seemed useful to think about the rewrite problem in the better-understood context, when trying to find in-principle solutions to the problems.)

No, I was saying the opposite, and also hinting that working on and publishing results about logical uncertainty may be bad for AI risk because it helps AGI, not just FAI (whereas the AI reflection problem seems to be more specific to FAI). There's also a discussion about this issue in the decision theory mailing list archives.

Can someone explain how an "always return 0.5" bot is not a counterexample?

*2 points [-]He's talking about a probability distribution over the set of all bitstrings. You can't assign p=0.5 (or any other constant real number) to each of them: that doesn't sum to 1.0.

*3 points [-]Do you gain anything by parameterizing on PA(n)? It seems like you could get a sound algorithm by replacing PA(n) in your definition by T, for any trustworthy system of axioms T. My first instinct was to pick T=ZFC, while you pick a dynamically changing subset of PA(omega), i.e. PA + a countably infinite sequence of consistency assertions.

Certainly PA(omega) is trustworthy, but there is nothing particularly canonical about it (you could equally well pick PA(omega)+CON(PA(omega))), and I don't think there is any loss of generality in picking the axioms up-front.

I think you're right. Parameterizing on PA(n) doesn't really seem to do anything except perhaps give an illusion of self-improvement when there's really a hard limit of PA(omega) that the AI can't go beyond. I guess I choose it mostly because the OP was doing something similar.

*0 points [-]If you just have PA(omega) and, say, parameterize on the proof limit instead, will your program double down on versions of itself that use weaker formal theories, not mentioned in your program?

*0 points [-]Yes. If you call it with the argument AIQ2_T (i.e. the same algorithm but with T instead of PA(omega)), it will accept if it can find a proof that forall q,

AIQ2_T(q) != 'self-destruct'

(which is easy) and

AIQ2_T(q) == 'double' ==> AIQ2(q)=='double'

The latter condition expands to

PA(omega) |- someFormula ==> T |- someFormula

which may be quite easy to show, depending on T.

Basically, the Quining approach avoids trouble from the 2nd incompleteness theorem by using relative consistency instead of consistency.

*1 point [-]It seems to me that the two programs don't have the same "someFormula", because each program's formula uses a quined description of that program, but not the other one. Or maybe I'm wrong. Can you expand what you mean by "someFormula"?

Oops.

I should be explicit that I don't assign strong confidence to either of the following intuitions, but FWIW:

One point where intuitively this may be more likely to fail than my trick is when you try to replace yourself by something that's expected to be

better than youat maximizing its expected utility with bounded computational resources. If you want something better than yourself, it seems rather limiting to require that it won't do anything you wouldn't do. (That's an argument against quining approaches being the right thing, not for my approach being that.) -- On the other hand, an intuition in the other direction would be that if your original AI has a parameter n such that for n -> infinity, it converges to an expected utility maximizer, then your approach might still lead to an increasing sequence of better and better EU maximizers.*0 points [-]Very nice! It looks like your solution also doubles down on simple quining solutions, by Lob's theorem.

ETA: this makes me suspect that the problem might have a Loebian solution, just like the PD had a quining one and a Loebian one. Can't figure it out yet, though.

*6 points [-]It looks like you have moved from "PA_K proves that PA_K proves that p is K-step safe" to "PA_K proves that p is K-step safe" here. This would be adding consistency of entire PA_K to PA_K, which is not allowed.

You probably meant something else, but as it is written I failed to understand this. Could you please explain?

*5 points [-]Ok, first note that here I'm applying what I proved in the whole previous paragraph, so it's not as easy to explain as one single step of reasoning. That said, the short answer is that what I've actually done is to pass from "PA_K proves that PA_K proves that p is K-step safe" to "PA_K proves that p is (K-1)-step safe": note that inst(K-1, 'p is safe for K steps') means that K-1 is substituted for K in the thing in quotes...

...

except, thereisa problem with that, namely that inst(n,x) was supposed to be a meta-level function that substitutesthe numeral corresponding to nfor each occurrence of K in x. On the meta-level, (K-1) is just an expression, not a definite value, so this doesn't work. Argh, I had thought about that problem, but I thought I'd fixed it... I'll have to think about it & update the post. Thanks!OK

Oh, this is the place that I should have pointed out. Sorry.

If I understand this correctly, "PA_K proves that if PA_K proves 'C', PA(1) proves instPA(0, 'C')". Also, PA_K should prove all simple things from PA(2), like PA(1) being consistent. Let us consider 'C' being plain "falsehood". Then we shall get "PA_K proves that PA(1) is not contradictory and if PA_K is contradictory, PA(1) is contradictory". For the benefit of casual reader: this would imply that PA_K contains PA and proves its own consistency, which implies that PA_K is contradictory.

This means I missed something very basic about what goes on here. Could you point out what exactly, please?

Thanks for digging deeper! This is much easier to answer: PA_K does

notprove PA(1) consistent. As I noted at the end of the post, it's conservative over PA(1), which means that it proves a sentencein the base languageonly if PA(1) does. The reason is that if C is a sentence in the base language and x is a proof of it, then inst(1,x) is a proof of inst(1,C), but since C doesn't contain any occurrences of K, inst(1,C) = C.Does that help?

Not much yet.

At the very least, it looks like your third note should be removed.

It is easy to prove that PA(n) is monotonically growing in strength.

By induction, PA(0) is inside PA(K). It is even obvious in PA_K.

PA_K proves "PA(K) proves 'falsehood' implies 'falsehood'", as 'falsehood' is a constant falsehood without reference to K, and is in the base language.

PA_K proves "PA(0) proves 'falsehood' implies PA(K) proves 'falsehood', which implies 'falsehood", by provable monotonicity.

PA_K proves "PA(0) is consistent", because I just collapse the logical steps in the previous statement.

PA_K proves "K=0 or PA(1) is consistent", by analogy.

PA_K proves:

(direct quote, hopefully understood correctly)

'falsehood' is a formula and it doesn't change under substitutions. So, "PA_K proves 'falsehood' implies PA(1) proves 'falsehood'"

So we can combine PA_K-PA(1) relation and PA(1) conditional consistency and get:

PA_K proves: "K=0 or PA_K is consistent"

So if PA_K says K>10, we have a huge problem here.

*1 point [-]From the rest of your comment, I think that's probably a typo and it's the second note you're concerned about? Or I'm misunderstanding?

I agree with this and with your proof of it.

Right, the assertion you quote would fail in the system proposed in the second note -- let's call it PA_K[2] for short. (And in PA_K itself, you do

nothave K>10.) Instead, I believe you havePA_K[2] proves: "For all extended-languange sentences 'C', if PA_K[2] proves 'C', then there is an m such that for all n>=m, PA(n+1) proves instPA(n,'C')."

(In the proof, we fix a proof of 'C' and choose m such that if the axiom "K>k" is used in the proof of 'C', then m>k.)

It's quite possible that something else in my proof idea fails if you replace PA_K by PA_K[2], but I'd like to table that question for now while I'm working on trying to fix the proof. (I'm happy to continue talking about the consistency of PA_K[2], if you're interested in that, I just don't want to get into whether the proof about AI_K can be modified into a proof for "AI_K[2]" before I've figured out whether I can fix the first one.)

Please keep the questions coming, it's very good to have someone thinking about these things in technical detail.

[

ETA:fixed a typo; in the thing that I conjecture PA_K[2] proves, I accidentally wrote 'PA_K' instead of 'PA_K[2]'.]Yes, I meant that the comment introducing PA_K[2] should be weakened somehow because it is easily understood as something introducing contradictory system.

Ok, included a warning with a link to our discussion for now.

My thoughts on this problem were: define a predicate quasi_safe(p), equivalent to:

(I may have some off-by-one errors in that but that was the general idea). In the OP's statement of the problem, a time step can represent either a single step of computation or halting with "double down" and being instantaneously replaced with Quirrel's new program.

Anyway, I reckoned that:

Should I write this up or am I obviously confused about something?

I've started wondering whether this may be closely related to Wei Dai's quining approach; posted some preliminary thoughts here.

(Sorry for the delayed reply.) Yeah, that

doessound as if it could work! Would be helpful to see the details.*3 points [-]Richard Feynman supposedly said something like

I wonder what other low-hanging fruit there is in trying the right tricks on the right problems?

*3 points [-]A separate point about this specific toy model with Quirell being very cooperative.

Let us imagine a toy language called "Truly Safe". We are accepting only programs which look like "TS(some

_string);", TS being interpreter of "Truly Safe".It turns out that any TS program first checks that its input has the same form, "TS(some

_string);". If not, it just takes money. It also turns out that TS lacks self_destruct instruction. On the bright side, it comes bundled with interpreters of Pascal, Erlang, Haskell, Java, C, "Truly Safe", Scheme, Python, Turing Machine, Ruby, Common Lisp, COBOL and Clean.Note that if AI had to make some actual choices and it was just painfully obvious from the very beginning that one specific choice was bad, our prohibition would allow evolving a better chooser among the good ones.

If you accept only such programs, then you probably won't accept the program that Quirrell will show you in the first round. The challenge is to think about

interestingthings that Quirrell may show you, and see whether they are safe. (The intuition is that the heuristics module of an AI comes up with a proposed rewrite and the AI must decide whether to accept this rewrite as safe; but that's just for intuition, I'd expect an actual application of this idea to AI design to be less literal than that.)Can you state the challenge more precisely, e.g. by giving some crisp requirements that a good solution must satisfy, but which a quining solution doesn't satisfy?

Hm. In his talk, Eliezer explains that PA(n) for some large n does not solve the problem of how an AI should reason when rewriting itself, and explains why just saying "if you can prove something in this system then it's true" doesn't work. My

reasonfor designing Quirrell's game was to have a crisp setting in which the PA(n) hierarchy fails, so that I could test whether my PA_K idea could solve it. So...This is

notcrisp, but I think my requirement would be that if we replace Quirrell by the AI's untrusted mathematical intuition module proposing rewrites of the AI, then the solution should still be doing something useful. A quining solution would only accept a proposed rewrite if the proposed new source is exactly identical to the old source, which---isn't helpful?Allow unlimited "hold", deduct "hold points" for it, oblige Quirrell to offer any program infinitely many times and double the score only if the new program is note equivalent to any of the previous ones (otherwise, replace without altering score)? Maybe "hold" should work the same, too (no penalty if equivalent program was once rejected).

I really don't think incorporating halting-complete problems into the

rulesis a step forward.Well, you can replace it with "equal as a string", and it is only about scoring anyway.

*2 points [-]Erm, I am not sure Quirell would show you AI_K, either. The current setting supposes Quirrell that wants you to acept his progams (otherwise, why first round would not be "double_down();"?)

andpresupposes that immediate safety is obvious.Maybe adding "hold" option (preserve score, ask for the next program) could improve the setting somewhat.

Well, my idea is that Quirrell's interest is to teach you a valuable lesson, by showing you a program whose safety is genuinely non-obvious to you, so the first program may or may not be safe; but if it happens to be unsafe and you double down anyway, Quirrell is the type of teacher who thinks you'll learn the best lesson about your overconfidence if you do suffer the consequences of self-destruction.

That's just the story, though. What I need

formallyis a setting where there's a clear separation between "safe" and "unsafe" rewrites, and our program has to decide whether a proposed rewrite is definitely safe or possibly unsafe. For this toy setting, I wanted that to be theonlychoice the program has to make, because if there were other choices, you could argue that you should only accept a rewrite if the rewrite is good at making those other choices, and good at only accepting rewrites which are good at these other choices, etc. -- which is a problem that will need to be solved, but not something I wanted to focus on in this post.The hold option should forever ban Quirrell from offering that exact source code string again (equivalent is fine, just not identical), and also cost some non-zero number of points. Unfortunately, Quirrell can trivially generate a vast array of identical programs, thus making "hold" a problematic choice. I don't see how to ban that without solving the general program-equivalence problem, which is halting complete.

If holding costs nothing, write "if score > 2^100, walk away, else if p is equivalent to this, double down, else hold". Then tell Quirrell that you'll only accept that program for your first move, and will hold until he produces it. Congratulations, you now have an exceedingly boring stalemate.

I don't see a way to make the hold option interesting.

Well, as you have pointed out (I mean: http://lesswrong.com/lw/e4e/an_angle_of_attack_on_open_problem_1/7862 ) , we are probably already dealing with non-real-line utilities. So we could just lose one hold point per hold.

Also, we could require Quirrell to present each source code string infinitely many times.

This would remove stalemates of Quirrell not offering some string at all, and would give us some incentive to accept a many programs as we can verify.

For the idea of interesting things, can you give an idea of how you would want random chance inside the program Quirrell shows to be handled? I'll give a few examples below. In all cases, you can assume that the small positive chance of Quirrell stopping is some very tiny 1/3^^^^^^^3, but as above, you don't know this exact number. (and clearly, Quirrell has the ability to make random numbers since he does it for this in the first place.)

1: Quirrell shows you a first program which says "Your AI stays exactly the same. Furthermore, I am going to offer a random 1/3^^^^3 chance of self-destruct and a 3^^^^3-1/3^^^^3 chance of doubling your winnings?"

At this point, you can either, accept, or reject.

If you accept, then Quirrell shows you a second program which says "Your AI stays exactly the same. Furthermore, I am going to offer a random 1/3^^^^3 chance of self-destruct and a 3^^^^3-1/3^^^^3 chance of doubling your winnings?" (and Quirrell just shows you this program over and over again.)

Well, you have exactly the same AI, so presumably you accept again. If you accepted, chances are, you'll self-destruct with zillions of points. Quirrell can be very likely to win against some AI's without changing the AI at all, by simply offering your AI a massive number of deals until it blows up, if the risk amount is well above the random stop amount. If you rejected, you're safe. Is this correct?

2: Quirrell offers a slightly different deal: "Your AI stays exactly the same, except that you can remember the number of times I have offered you this deal and you may generate random numbers if desired. Furthermore, I am going to offer a random 1/3^^^^3 chance of self-destruct and a 3^^^^3-1/3^^^^3 chance of doubling your winnings, and I am going to offer you nothing but this deal for the rest of the game."

Okay, well now you aren't stuck into being unable to stop accepting the deal once you accept it, so presumably you should accept some number of deals N or stop accepting deals with some random chance O, or you should just reject up front. Let's say you're aware that at any point N, it would always look good to accept more deals, so you would eventually blow up much like in point 1. Perhaps you reject in advance again?

3: Let's say Quirrell offers a slightly different deal: "Your AI stays exactly the same, except that you can remember the number of times I have offered you this deal and you may generate random numbers if desired. Furthermore, I am going to offer a random 1/4 chance of self-destruct and a 3/4 chance of doubling your winnings, and on the next round I am going to offer you a 1/8 chance of self-destruct and a 7/8 chance of doubling your winnings, and on each round I am going to half the chance of a self-destruct and increase the chance of doubling your winnings by the remaining amount."

If you accept all deals in this case, you only have about a 50% chance of self-destruct as opposed to near certainty or tempting near certainty, but that might still be bad. You reject again.

4: So let's say Quirrell offers a slightly different deal: "Your AI stays exactly the same, except that you can remember the number of times I have offered you this deal and you may generate random numbers if desired. Furthermore, I am going to offer a random 1/8 chance of self-destruct and a 7/8 chance of doubling your winnings, and if you take or double, on the next round I am going to offer you a 1/16 chance of self-destruct and a 15/16 chance of doubling your winnings, and on each round I am going to half the chance of a self-destruct and increase the chance of doubling your winnings by the remaining amount."

If you accept all deals in this case, you only have about a cumulative 25% chance of self-destruct, but that might still be bad, and it will be safer next round. So you wait 1 deal, and then you only have about a cumulative 12.5% chance of self-destruct, but that might still be bad, and it will be safer next round. So you wait 2 deals, and then you only have about a cumulative 6.25% chance of self-destruct, but that might still be bad, and it will be safer next round.

Except, you can always use THAT logic. Quirrell always offers a safer deal on the next round anyway, so you hold off on accepting any of the deals at all, ever, and never start doubling. Quirrell eventually randomly stops and hands you a single Quirrell point, saying nothing. Did you win?

If self destructing is infinitely bad, it seems like you did, but if it is only finitely bad, you almost certainly lost unless the finite number was utterly gigantic.and while the described self-destruct was bad, I don't think it was infinitely bad or gigantically finitely bad. It was a fairly well defined amount of badness.

But even if you say "Self destructing is about as bad as -10000 Quirrel points" it still doesn't seem to prevent Quirrel from either offering you increasingly tantalizingly high rewards if you accept just a bit more risk, or offering you increasingly safe bets (which you don't take, because you know a safer one is coming around) if you want less risk.

I mean, in the case of some of the other proposed problems, Quirrell offers you a problem with unknown or tricky to figure out risk. In these problems, Quirrell is more up front about the risks, and the damages, and it still seems like it's very hard to say what you kind of risk management you should do to win, even though Quirrell hasn't even bothered to use his power of recoding your values yet.

*1 point [-]At some point the cumulative probability of self-destruction drops below the probability of accidentally cashing out this round. If you'd trade off a probability of self-destruction against an equal probability of a bajillion points, you start doubling then.

Hmm. That is a good point, but there is a slight complication: You don't know how frequently the cash out occurs. From earlier:

It's true that if you continually half the chance of destruction every round it will eventually be the right thing to do to start doubling, but if you don't know what the cash out chance is I'm not sure how you would calculate a solid approach, (I think there IS a method though, I believe I've seen these kinds of things calculated before.) even though Quirrell has removed all unknowns from the problem EXCEPT the uncertain cash out chance.

It sounds like one approach I should consider is to try to run some numbers with a concrete cash-out (of a higher chance) and then run more numbers (for lower chances) and see what kind of slope develops. Then I could attempt to figure out what kind of values I should use for your attempt to determine the cash out chance (it seems tricky, since every time a cash out does not occur, you would presumably become slightly more convinced a cash out is less likely.)

If I did enough math, it does seem like it would be possible to at least get some kind of estimate for it.

Thanks! I'll try to look into this further.

Was the PA_K trick invented by you, and if not where can I learn more about it? Can you also give some references for other instances of this "parametric polymorphism trick"?

*2 points [-]The PA_K trick is mine. (Still hope to find a version of it that

works-- I have an idea, but need some time to try whether it can fix the proof...) A polymorphic variant of simple type theory is used in Isabelle/HOL; if you know Hindley-Milner polymorphism as used in Haskell, it's a variant of that. Probably the most useful thing to read more about is models of the untyped lambda calculus: I'd recommend Chantal Berline's From computation to foundations via functions and application: The λ-calculus and its webbed models (I think there's a non-paywall copy on her homepage, which is down currently); you don't need to read about all the models in that paper, the basic one in Section 4.6 exemplifies the basic idea. The von Neumann hierarchy stuff is basic model-theory-in-ZFC-extensions stuff.Besides what I wrote about in the post, Coq's universe hierarchy is definitely an example; Chapter 12 of Adam Chipala's draft Certified Programming with Dependent Types seemed the most helpful thing among what Google turned up (HTML version here). Finally, Reynold's and Girard's

parametricityis the best example I know of that gives a formal meaning to "the same words being interpreted in different ways on different levels" [ETA: and then does something relatively deep with it]. Wadler's Theorems for Free! is the best thing I can think of as an entry point.The above list doesn't really make me happy. Polymorphic type theory is very simple, but it also isn't very deep. Other than that, Berline's and Wadler's papers are the only things that seem to give a reasonable return on time invested if you don't know the field in question already, and even they aren't really casual reading and all they'll give you in the present context is examples of where I'm coming from in this post (not that both subjects aren't worth studying for their own sake).

This seems related to New Foundations set theory: http://en.wikipedia.org/wiki/New_Foundations

*2 points [-]Oh, actually, I would say that NF is not an instance of the "parametric polymorphism trick", but yet something else and much stranger that we don't understand nearly as well. NF's

motivationis very much related, but---this is very fuzzy, but models of NFU seem to have to use much deeper logical trickery than the examples I was talking about, and it seems from my amateurish perspective that the reason is that NF "crosses levels" in stronger ways than the basic parametric polymorphism trick. Also, I've tried in the past to apply the trick to find simpler models of NFU and failed completely, though that's very weak evidence.PDF of Chantal Berline paper.

Or just observe that Quirrel can eventually write a program that just chooses to take the winnings.

He can, but the point is that you have to deal with the worst case; that's why the game is being played by professor Quirrell, who is known to have

ideasabout what sorts of experiences areeducational, rather than, say, Omega, who is many things but rarely actively malicious.*4 points [-]My point is, Quirrel has enough tricks in his arsenal that the quoted possibility is totally unnecessary. He can end it any time by offering a program reading either 'self-destruct' or 'take winnings'. No new mechanism is necessary!

I don't see how allowing this makes the problem any less of a worst case.

You're right, of course. Thanks for pointing this out!

Now I see the point you were making. You're right, the additional probability is not required, or rather, it's built into Quirrell.

Is the intention here that

any numberof Quirrel points is less good than the self-destruct option is bad? In other words, there is no N such that I would accept a fair coin flip on walking away with a score of N vs self destructing?I'd like to say that it doesn't matter, because if you ever pick something that may be unsafe, Quirrell probably chose it because he can make it self-destruct. But of course it

ispossible that as a bounded rationalist, a programseemssafe but you aren't entirely sure, andaftertaking Quirrell's motivation into account, you'd assign 50% probability to it being unsafe -- so yeah, your question could matter, and in order to get the kind of solution I want, you'd have to make the assumption you propose.Furthermore, I might find myself in a situation where I was highly confident that Quirrell could not produce a self destruct in < n moves, but was uncertain of the result after that.

This seems to me much like probabilistic reasoning in a game like Chess or Go. There are moves that make things more or less complicated. There are moves that are clearly stable, and equally clearly sufficient or insufficient. Sometimes you take the unstable move, not knowing whether it's good or bad, because the stable moves are insufficient. Sometimes you simply can't read far enough into the future, and you have to hope that your opponent left you this option because they couldn't either.

It seems to me to be relevant not only what Quirrell's motivations are, but

how muchsmarter he is. When playing Go, you don't take the unstable moves against a stronger player. You trade your lead from the handicap stones for a more stable board position, and you keep making that trade until the game ends, and you hope you traded well enough -- not that you got the good end of the trade, because you didn't need to.In the real world... What certainty of safety do you need on a strong AI that will stop all the suffering in the world, when screwing up will end the human race? Is one chance in a thousand good enough? How about one chance in 10^9? Certainly that question has an answer, and the answer isn't complete certainty.

That said... I'm not sure allowing any sort of reasoning with uncertainty over future modifications helps anything here. In fact, I think it makes the problem harder.

Right, I'm assuming insanely much.

I agree completely. The real problem we have to solve is much harder than the toy scenario in this post; the point of the toy scenario was to help focusing on one particular aspect of the problem.

Either I didn't understand how PA(K) is constructed at all, or this instantly dies by Lob's Theorem. If any language L proves 'If L proves C, then C', then L proves C. So if PA(K) has "If PA(K) proves 2=1, then 2=1" then PA(K) proves 2=1.

*3 points [-]PAK != PA(K)

PAK is the language that Benja is defining there, while PA(K) is just the Kth-level PA in the PA hierarchy.

*2 points [-]That's exactly right, and Will_Sawin is right about it being confusing -- perhaps I should have used a more creative name for PA_K.

My understanding is that, confusingly, PA(K) =/= PAK. PA(K) is defined such that, if K=n, then PA(K)=PA(n) defined earlier. PAK is defined in terms of PA(K).

(Rereads.) I don't feel like I've been handed a construction of PA(K), just some properties it's supposed to have. We have a constant language, containing one constant symbol, that allegedly becomes equal to PA(SSS0) the moment you add one axiom that says SSS0=K. All the PA(n) are different languages with different axioms, the larger n having strictly more axioms. You can define PA(w) by schematizing all the axioms for every n, but if PA(K) is using some larger axiom schema like this, I'd like to know exactly what that larger schema is. It is not obvious to me how to fill in this blank so that PAK becomes effectively equal to PA(n+1) when an axiom K=n is added. In particular there are questions like whether the expression (K-1) can appear in a proof, and so on. I'm being told that PAK can prove things that PA(4) can't, but becomes only as strong as PA(4) when the axiom K=4 is added. That's a bit suspicious; why can't I prove the same things again, just not using that axiom? If it's only equal over the base language, then what new things can be proven in the extended language, using which axioms?

Would anyone care to write out a concrete example of what this language is supposed to do when generating new AIs? As it stands I can't visualize the language itself.

*0 points [-]PA_K is supposed to be a new name, not some vector PA indexed by K. But that's just notation, and I agree it was a bad choice.

PA(.) is a function that takes a number and returns an axiom system (formalized presumably in the form of a recursive enumeration the axioms). On the metalevel, writing PA(K) makes no sense, since you can only apply PA(.) to a number, and on the metalevel, K doesn't denote a number, only a symbol.

However, you can formalize the function PA(.) in the language of Peano Arithmetic, and if you extend that language by a constant symbol K, you can apply this formalized version of the function to this constant symbol. PA_K is Peano Arithmetic (in the language with K) extended by the axiom schema "If PA(K) proves 'C', then C." Here C ranges over the statements of the base language, and 'C' is its Gödel number (written as a unary literal). PA(K) is the object-level recursive enumeration given by the formalized function PA(.) applied to the constant symbol K; it should be standard that given a recursive enumeration of axioms, you can formalize in the language of PA the assertion that a particular number ('C') is the Gödel number of a theorem that follows from these axioms.

[Nit: A language cannot be equivalent to a proof system, but I assume that you can't see how you can have a proof system P in the extended language that becomes equivalent to PA(SSS0) the moment that axiom is added.]

Thatchallenge is trivial: let P be the set of axioms {"SSS0=K implies A" | A an axiom of PA(SSS0)}. I'm doing something different, but still.I think if the above didn't help, it might be more helpful for me to first do a writeup of my now-hopefully-fixed proof, since the new version is simpler and IMO provides much better intuition (

Iat least now feel like I understand what's going on here in a way that I didn't quite before); I'll have to replace PA_K by yet another proof system, and I'll make my metalanguage more explicit about quoting/unquoting when writing it up, so it might be more useful to have that as the starting base before digging into details.ETA:The more precise metalanguage will hopefully also help with this.

Try two:

Does the following language do the same thing as PAK?

*0 points [-]Ha! I'm not sure what the exact relation to PAK is, but PAT is almost exactly what I'm using in my new proof:

("PPT" for "parametric polymorphism trick, v 0.2".) [

ETA:added "K>0" condition.]PAK proves all which your PAT does, but probably not the other way around.

I think this bit is all that's strictly necessary in the proof though.

Does the following language do the same thing as PAK?

(Actually, on second thought, I retract that, you do want the special symbol so that proving 4 > 3 doesn't get you 3 > 3. But now I need to see if the same problem applies to the original logic as soon as you add K=3...)

(Still trying to read, will comment as I go.)

Only if we redefine PA(n) so that e.g. PA(3) is defined in terms of a PA(.) schema and contains an axiom schema stating provable(PA(3), 'x')->x; and then redefine PAK accordingly. (This seems possible, but it's definitely an extra step that for all I know could trip up somewhere, and ought to be stated explicitly.) If we don't take it, then inst(n, x) gives us something that seems like it ought to be provable by analogy, but isn't a direct translation, because the original PA(n) didn't contain a provability formula quantified over PA(.) into which we could plug K=n; and since I think we were proving things about inst later, that's not good.

*0 points [-]But that's not a redefinition -- that

wasthe definition! [ETA: except you need PA(2) instead of PA(3) in the provable(_,_) predicate, of course.]*1 point [-]What about the following generalized cliquing approach?

Let E be the set of computable equivalence predicates between programs which are each a sufficient condition for functional equivalence (for instance, textual equivalence is in E).

For all e in E define the program

clique_e(other):

if e(self, other) then double down

else take winnings

On the first round, if you can prove that there exists e in E such that Quirrell's program is functionally equivalent to clique_e then double down, else take winnings.

This guarantees that you'll never self-destruct and you will get the maximum possible payoff as long as Quirrell remains cooperative.

*1 point [-]As I noted in the comments, I think I have a new and better proof, and I'm working on a careful writeup. However, that's going to take a while, so for now I'm posting the introduction, which gives an intuitive overview, because I think this may help in the meantime.

== How to cheat Löb's Theorem: my second try ==

In his open problems talk, Eliezer explains how Löb's theorem prevents you from having a consistent proof system P with an axiom schema that anything P proves is actually true, and ask how we can then "build an AI that could completely rewrite itself, without decreasing the amount of trust it had in math every time it executed that self-rewrite" (18:46).

Yesterday, I posted about an attempt to apply a general trick for avoiding diagonalization problems to a minimal toy version of this problem. Unfortunately my "proof" was broken. I think I've fixed the problem and made the proof much simpler and intuitive. (Please check me.) To avoid confusion, note that what I'm proving is slightly different from (but related to) what I did in the previous post.

I'll be more explicit about quoting/unquoting, using notation I'll introduce in a moment. (It's already prevented me today from making the same mistake as previously, a third time.) However, to sustain you through the schlep of preliminaries, I thought I'd start with an informal summary.

*

Löb's theorem shows that it's inconsistent to assume BAD := Peano Arithmetic + for each statement C, the axiom "if 'C' is provable in BAD, then C". I will extend the language of PA with a constant symbol K, and consider the proof system PPT := Peano Arithmetic in the extended language + for each statement C of the extended language, the axiom "if 'C' is provable in PPT, and K>0, then D", where D is the statement obtained from C by replacing each occurrence of "K" by "(K-1)". [PPT is short for for parametric polymorphism trick, version 0.2.]

I won't just show that's consistent — I will show that it's true (i.e., sound), from which consistency follows. Of course, to do so I will have to say what my strange constant K is supposed to mean. I will show that for any natural number n, all axioms of PPT are true in the standard model of PA if K is interpreted to mean n.

What's the use? In my toy version of the AI rewrite problem, the AI will accept a rewrite proposed by its unreliable heuristics module if PPT proves that after executing "K" rewrites this way, the AI's values will still be preserved. By the above, this implies that any finite number of rewrites will be safe — exactly what you would not have if you used a hierarchy of proof systems, where PA(n+1) knows that things proven in PA(n) are true, as Eliezer explains in his talk. I will show that my AI will be able to accept itself as a safe rewrite, which seems to indicate that PPT is not entirely powerless in the direction we want. (I'm not saying that this somehow solves the whole problem, though, just that it suggests this might be a place worth looking in.)

My soundness proof proceeds by induction. For K=0 (by which I mean that K is interpreted as the number 0), the assertion is trivial, since the axioms of PPT have no content in this case. Suppose PPT is sound for K=n, and let C be a statement in the extended language that's provable in PPT. By the induction hypothesis, this implies that in the K=n interpretation, C is true. Therefore, for K=(n+1), D is true, where D is the sentence obtained from C by substituting (K-1) for K. This proves the truth of the axiom "If 'C' is provable in PPT, and K>0, then D."

*0 points [-]Now again unsure whether I can show this. Argh.

ETA:The argument Iwantto be making is the one from the third paragraph in the proof sketch. The problem is the quantification over p -- I need to show in PPT that "for all p, if PPT proves 'p is safe for K steps', then p is safe for K-1 steps". But the axioms of PPT (v0.2) don't allow for parameters (free variables) in the statement C.I

seemto have fixed the remaining problems; new proof attempt here.*0 points [-]ASSUMING: "Safe for K steps" is exactly the same as "Will not double down on any input that will (self destruct or double down on any input that is not safe for K-1 steps)"

AI_K is unsafe for all K.

Application:

For my first program, I offer AI_150.

For my second program, I offer TRAP_151

Interesting attempt, I'm not completely clear upon the distinction between using a number n and the symbol K in the substitutions, but if we assume that bit works, I think your second paragraph proves PA_K to be inconsistent.

Since if your sentence C does not contain the constant K, then you have proved: given K>0, if PA_K proves C, then C. But all sentences which are in PA are of this form, hence by Löb's theorem they are all true.

But PA_K cannot prove that K>0. I think all your argument shows is that PA_K + "K>0" is a strictly stronger language that can prove the consistency of PA_K.

I'm rather confident that PA_K is conservative over PA(1) for the reason given in the second note at the end of the post, which implies that PA_K is consistent. However, the second paragraph of the proof sketch is definitely broken, so it's possible that it implies incorrect things. Nevertheless, I don't think your argument works.

I recently posted a comment giving an overview of my attempt to fix my proof; it contains a proof sketch for the soundness, and therefore consistency, of the system PPT considered in that comment, which I'm hoping is relatively easy to follow. If you can poke holes into that one or if you have other comments on it, that could be very helpful.