In the early 1980s, Douglas Lenat wrote EURISKO, a program Eliezer called "[maybe] the most sophisticated self-improving AI ever built". The program reportedly had some high-profile successes in quite different domains, like winning the national championship of the Traveller TCS wargame two years running or designing good integrated circuits.
Despite requests, Lenat never released the source code. You can download an introductory paper, "Why AM and EURISKO appear to work" [PDF], but honestly, reading it leaves a programmer still mystified about the internal workings of the AI: for example, what does the main loop look like? Lenat supposedly answered such questions in a more detailed publication: "EURISKO: A program that learns new heuristics and domain concepts", Artificial Intelligence 21 (1983): 61-98. I couldn't find that paper for download anywhere, and being in Russia, I found it quite tricky to get a paper copy. Maybe you Americans will have better luck with your local library? And to the best of my knowledge, no one has ever succeeded in confirming (or even seriously tried to confirm) Lenat's EURISKO results.
Today, in 2009, this state of affairs looks laughable. A 30-year-old pivotal breakthrough in a large and important field... that was never even reproduced. What if it was a gigantic case of Clever Hans? How do you know? You're supposed to be a scientist, little one.
So my proposal to the LessWrong community: let's reimplement EURISKO!
We have some competent programmers here, don't we? We have open-source tools and languages that weren't around in 1980. We can build an open-source implementation available for all to play with. In my book, this counts as solid progress in the AI field.
Hell, I'd do it on my own if I had the goddamn paper.
Update: RichardKennaway has put Lenat's detailed papers online; see the comments.
Your third point is valid, but your first is basically wrong: our environments occupy a small and extremely regular subset of the possibility space, so success on a few well-chosen tasks correlates extremely well with success across plausible future domains. Measuring success on those tasks is something an AI can easily do; EURISKO accomplished it in fits and starts. More generally, intelligence isn't magical: if there is any way we can tell whether a change to an AGI is a bug or an improvement, then there is an algorithm the AI itself can run to make the same judgment.
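To make that concrete, here is a minimal sketch of the idea in Python. Everything in it is invented for illustration (the `Task` type, `score`, `is_improvement`, the `margin` parameter); nobody outside Lenat's lab has seen how EURISKO actually judged its heuristics.

```python
# Hypothetical sketch: judging a self-modification by measuring it.
# All names here are invented for illustration; this is not EURISKO's
# (unpublished) evaluation loop.

from typing import Callable, List

# A task takes a solver and returns a score (higher is better).
Task = Callable[[Callable], float]

def score(solver: Callable, tasks: List[Task]) -> float:
    """Average performance of a solver across a fixed benchmark suite."""
    return sum(task(solver) for task in tasks) / len(tasks)

def is_improvement(current: Callable, candidate: Callable,
                   tasks: List[Task], margin: float = 0.01) -> bool:
    """Accept the candidate only if it beats the current version by a
    margin on the same task set; otherwise treat the change as a bug."""
    return score(candidate, tasks) > score(current, tasks) + margin
```

The whole argument rests on the regularity premise above: the fixed task suite has to predict performance in domains the AI hasn't seen yet.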
As for the second problem, here is an idea that may not have occurred to you: an AI could write a future version of itself, bug-check and test the various subsystems (and perhaps even the whole successor) on a virtual machine first, and only then shut itself down and start up the successor. If there's a way for Lenat to see that EURISKO isn't working properly and then fix it, then there's a way for AI version N to see that AI version N+1 isn't working properly and fix it before making the change-over.
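A toy sketch of that test-then-swap discipline, again with every name invented: the `--self-test` flag, the sandboxed subprocess standing in for a real virtual machine, and a supervising process assumed to restart the program after it exits.

```python
# Hypothetical sketch of the change-over protocol described above.
# A sandboxed subprocess stands in for the "virtual machine"; the
# --self-test flag and file layout are invented for illustration.

import shutil
import subprocess
import sys

def tests_pass(successor: str) -> bool:
    """Exercise the successor in isolation against its regression suite."""
    result = subprocess.run([successor, "--self-test"], timeout=600)
    return result.returncode == 0

def change_over(current: str, successor: str) -> None:
    if tests_pass(successor):
        shutil.copy(successor, current)  # install version N+1 in place
        sys.exit(0)  # shut down; an assumed supervisor restarts the program
    # tests failed: keep running version N and debug the successor instead
```

The key property is that version N never hands over control until version N+1 has already demonstrated, in isolation, that it passes the same checks Lenat would apply by hand.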
Where is the evidence that EURISKO ever accomplished anything? No one but the author has seen the source code.