(Warning: I be hittin' the comment button without reviewing this carefully, views expressed may be inaccurately expressed and shit, ya dig? Aight yo.)
Thanks for the pointers. I wish there were a place for me to just bring up things I've been thinking about, and quickly get pointers or even conversation. Is Less Wrong IRC the best place for that? I've never used it.
I tend to think that in order to tackle this issue it might be necessary to understand how to implement a program that can 'understand' and generate new mathematics at least as generally as a human with peak mathematical ability, but that is just my intuition.
One FAI-relevant question I'm very interested in is: What if anything happens when a Goedel machine becomes intelligent enough to "understand" the semantics of its self-description, especially its utility function and proof search axioms? Many smart people emphasize the important difference between syntax and semantics, but this is not nearly as common in Less Wrong's standard philosophy.[1] If we could show that there's no way a Goedel machine can "re-interpret" the semantics of its axioms or utility function to mean something intuitively rather different than how humans were interpreting them, then we would have two interesting arguments: that it is indeed theoretically possible to build a general intelligence that is "stable" if the axioms are sound[2], and also that superintelligences with non-Friendly initial utility functions probably won't converge on whatever a Friendly AI would also have converged on. Though this still probably wouldn't convince the vast majority of AGI researchers who weren't already convinced, it would convince smart technically-minded objectors like me or perhaps Goertzel (though I'm not sure what his position is).
One interesting way to look at Goedel machines for all kinds of investigations is to imagine that they create new agents to do things for them. (A variation on this theme is to trap the Goedel machine in a box and make it choose which of two totally different agents to let out of their boxes---it's a situation where it outputs the best strategy according to its goals, but that strategy has a huge number of side effects besides just optimizing its goals.) For ideas related to those in the previous paragraph it might be useful to imagine that the Goedel machine's proof search tells it that a very good idea would be to create an agent to monitor the Goedel machine and to intervene if the Goedel machine stops optimizing according to its utility function. (After all, what if its hardware gets corrupted, or it gets coerced to modify its utility function and to delete its memory of the coercion?) How does this second agent determine the "actual" or "intended" semantics of the original machine's utility function, assuming it's not too worried about its own utility function that references the original machine's utility function? These are just tools one can use to look at such things; the details I'm adding could be better optimized. Though it's not my reason for bringing up these ideas, you can see how such considerations indicate that having a thorough understanding of the Goedel machine's architecture and utility function doesn't obviously tell us everything we need to know, because superintelligences are liable to get creative. No pun intended.
To further show why this might be interesting for LW folk: Many of SingInst's standard arguments about the probable unFriendliness of not-explicitly-coded-to-be-Friendly AIs are either contradictory or technically weak, and it'd be nice to technically demonstrate that they are or aren't compelling. To substantiate that claim a little bit: Despite SingInst's standard arguments---which I've thoroughly understood for two years now and I was a Visiting Fellow for over a freakin' year so please Less Wrong for once don't parrot them back to me; ahem, anyway...---despite SingInst's standard arguments it's difficult to imagine an agent that doesn't automatically instantly fail, for example by simple wireheading or just general self-destruction, but instead even becomes superintelligent, and yet somehow manages to land in the sweet spot where it (mis-)interprets its utility function to be referring to something completely alien but again not because it's wireheading. Most AI designs simply don't go anywhere; thus formal ones like Goedel machines are by far the best to inspect closely. If we look at less-technical ones then it becomes a game where anyone can assert their intuition or play reference class tennis. For example, some AI with a hacky implicit goal system becomes smart enough to FOOM: As it's reflecting on its goal system in order to piece it together, how much reflection does it do? What kind of reflection does it do? The hard truth is that it's hard to argue for any particular amount less than "a whole bunch of reflection", and if you think about it for a while it's easy to see how in theory such reflection could lead to it becoming Friendly. Thus Goedel machines with very precise axioms and utility functions are by far the best to look at.
(BTW, combining two ideas above: it's important to remember that wireheading agents can technically create non-wireheading agents that humans would have to worry about. It's just hard to imagine an AI that stayed non-wireheading long enough and became competent enough to write a non-wireheading seed AI, and then suddenly started wireheading.)
[1] Maybe because it leads to people like Searle saying questionable things? Though ironically Searle is generally incredibly misunderstood and caricatured. At any rate, I am afraid that some kind of reversed stupidity might be occurring, whether or not that stupidity was ever there in the first place or was just incorrect pattern-matching from computationalist-skepticism to substance dualism, or something.
[2] Does anyone talk about how it can be shown that two axioms aren't contradictory, or that an axiom isn't self-contradictory? (Obviously you need to at least implicitly use axioms to show this, so it's an infinite regress, but just as obviously at some point we're going to have to trust in induction, even if we're coding an FAI.)
trap the Goedel machine
Ash Ketchum is strolling around Kanto when he happens upon a MissingNo. "GOEDEL MACHINE, I choose you!" GOEDEL MACHINE used RECURSION. Game Boy instantly explodes.
MISTY: "Ash, we have to do something! Kooky Psychic Gym Leader Sabrina is leveling up her Abra and she's not even trying to develop a formal theory of Friendly Artificial Alakazam!"
ASH: "Don't panic! She doesn't realize that in order to get her Kadabra to evolve she'll have to trade with carbon copies of us in other Game Boys, then trade back. Ni...
Every now and then I see a claim that if there were a uniform weighting of mathematical structures in a Tegmark-like 'verse---whatever that would mean even if we ignore the decision theoretic aspects which really can't be ignored but whatever---that would imply we should expect to find ourselves as Boltzmann mind-computations, or in other words thingies with just enough consciousness to be conscious of nonsensical chaos for a brief instant before dissolving back into nothingness. We don't seem to be experiencing nonsensical chaos, therefore the argument concludes that a uniform weighting is inadequate and an Occamian weighting over structures is necessary, leading to something like UDASSA or eventually giving up and sweeping the remaining confusion into a decision theoretic framework like UDT. (Bringing the dreaded "anthropics" into it is probably a red herring like always; we can just talk directly about patterns and groups of structures or correlated structures given some weighting, and presume human minds are structures or groups of structures much like other structures or groups of structures given that weighting.)
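To make the uniform-versus-Occamian contrast a bit more concrete, here is a toy sketch. It rests on assumptions that are mine and not part of the argument above: "structures" are identified with binary strings, description length is just string length, and the Occamian weight is 2^-length per string. The function name `mass_on_short` is hypothetical. It only shows how much of the total weight lands on short descriptions under each weighting; it says nothing about minds or Boltzmann-anything by itself.

```python
from fractions import Fraction


def mass_on_short(n: int, m: int, occamian: bool) -> Fraction:
    """Share of total weight held by binary strings of length <= m,
    among all binary strings of length <= n.

    Toy assumptions: one "structure" per string, description length
    equals string length, Occamian weight is 2**-length per string.
    """
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")

    def weight_of_length(length: int) -> Fraction:
        count = 2 ** length  # number of binary strings of this length
        per_string = Fraction(1, 2 ** length) if occamian else Fraction(1)
        return count * per_string

    short = sum(weight_of_length(length) for length in range(m + 1))
    total = sum(weight_of_length(length) for length in range(n + 1))
    return short / total


# Uniform weighting: almost all weight sits on the longest descriptions.
print(mass_on_short(20, 10, occamian=False))
# Occamian weighting: every length class gets the same total weight.
print(mass_on_short(20, 10, occamian=True))
```

In this toy, uniform weighting gives the short half of the lengths a share of (2^(m+1) - 1) / (2^(n+1) - 1), which shrinks exponentially as n grows, while the 2^-length weighting gives it (m+1) / (n+1). Whether anything like this carries over to "mathematical structures" in general is exactly the part I'm unsure about.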
I've seen people who seem very certain of the Boltzmann-inducing properties of uniform weightings for various reasons that I am skeptical of, and others who seemed uncertain of this for reasons that sound at least superficially reasonable. Has anyone thought about this enough to give slightly more than just an intuitive appeal? I wouldn't be surprised if everyone has left such 'probabilistic' cosmological reasoning for the richer soils of decision theoretically inspired speculation, and if everyone else never ventured into the realms of such madness in the first place.
(Bringing in something, anything, from the foundations of set theory, e.g. the set theoretic multiverse, might be one way to start, but e.g. "most natural numbers look pretty random and we can use something like Goedel numbering for arbitrary mathematical structures" doesn't seem to say much to me by itself, considering that all of those numbers have rich local context that in their region is very predictable and non-random, if you get my metaphor. Or to stretch the metaphor even further, even if 62534772 doesn't "causally" follow 31256 they might still be correlated in the style of Dust Theory, and what meta-level tools are we going to use to talk about the randomness or "size" of those correlations, especially given that 294682462125 could refer to a mathematical structure of some underspecified "size" (e.g. a mathematically "simple" entire multiverse and not a "complex" human brain computation)? In general I don't see how such metaphors can't just be twisted into meaninglessness or assumptions that I don't follow, and I've never seen clear arguments that don't rely on either such metaphors or just flat out intuition.)
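For reference, the "most natural numbers look pretty random" claim is usually backed by a simple counting argument: there are only 2^(n-k) - 1 binary strings shorter than n - k bits, so at most that many n-bit strings can have a description that short. A minimal sketch of the bound follows, assuming binary descriptions and a fixed description method; the function name `max_compressible_fraction` is hypothetical. It only bounds how many strings are compressible and does not address the local-context or correlation worries raised above.

```python
from fractions import Fraction


def max_compressible_fraction(n: int, k: int) -> Fraction:
    """Upper bound on the fraction of n-bit strings that have a binary
    description shorter than n - k bits (pure counting argument)."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    # Number of binary strings of length < n - k: 2^0 + ... + 2^(n-k-1).
    shorter_descriptions = 2 ** (n - k) - 1
    return Fraction(shorter_descriptions, 2 ** n)


# The bound is always below 2**-k, whatever n is.
for k in (1, 8, 16):
    bound = max_compressible_fraction(64, k)
    print(k, float(bound), bound < Fraction(1, 2 ** k))
```

So fewer than 1 in 2^k strings of any given length can be described in k fewer bits, but that is a statement about each string taken in isolation, which is why it doesn't seem to say much to me by itself.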