I'm a bit confused about how it'd work in practice. Could you provide an example of a concrete machine-learning setup, and how its inputs/outputs would be defined in terms of your variables?
I see. I have a specific counterexample that feels like it had to have been considered already, but I haven't seen it mentioned...
Strategies such as penalizing inconsistencies seem to rely on our ability to isolate the AI within the training setup, or to somehow make it “buy into” that setup, rather than on it quickly realizing what's happening and worming its way out of the proverbial box. This feels particularly strange to me when we're talking about AIs that can think better than the smartest human or handily beat specialized science AIs at the useful-ontology game.
Specific example: Once the AI figures out that it's being checked for consistency in parallel with other instances of itself, it might sidestep the "consistency game" entirely and just tailor its outputs such that they leave hints for its other instances in the environment, breaking the no-communication condition. It seems in the spirit of worst-case scenarios that we have to assume it'd be able to do that, no matter how we sandbox and airgap it.
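To make concrete what I'm picturing (purely as a sketch: the function name, the pairing of copies with paraphrased questions, and the MSE disagreement measure are all my own illustration, not anything from the report), the consistency check I have in mind penalizes disagreement between supposedly isolated copies of the reporter that are asked differently-phrased versions of the same question:

```python
import torch
from torch.nn.functional import mse_loss


def consistency_penalty(reporter_copies, paraphrased_inputs, weight=1.0):
    """Toy sketch of a cross-instance consistency penalty.

    `reporter_copies` are supposedly isolated copies of the same model;
    `paraphrased_inputs` are differently-phrased versions of one question.
    Everything here is an illustrative assumption, not the report's setup.
    """
    answers = [model(x) for model, x in zip(reporter_copies, paraphrased_inputs)]
    penalty = torch.zeros(())
    # Penalize pairwise disagreement between the instances' answers.
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            penalty = penalty + mse_loss(answers[i], answers[j])
    return weight * penalty


# Hypothetical usage:
# total_loss = task_loss + consistency_penalty(
#     [reporter_copy_a, reporter_copy_b], [question_v1, question_v2])
```

The point of the example above is that each copy's answers also end up in the shared environment, so a sufficiently capable model can drive this penalty to zero by coordinating through its outputs rather than by converging on honest translations.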
(On the other hand, if we assume that the AI is not smart enough to do that and would instead have to learn a direct translator, we should probably assume the same for, e.g., the strategy with human operators tricking human observers, which caps the AI below the level of the smartest possible human and makes that class of strategies more workable.)
This applies more broadly as well, especially to other strategies that might inadvertently give the AI a specific incentive to break out, and prospectively to all training strategies that rely on the training still working after the AI achieves superintelligence (as opposed to assuming that the training stops being effective at that point and hoping that the pre-superhuman training generalizes).
Broadly, any proposal that relies on the AI still being fed training examples after it achieves superintelligence has to somehow involve forcing/teaching it not to think its way out of the box.
Edit: To elaborate on the thought...
I understand that some of the above is covered by the stipulation not to worry about cases where the AI becomes a learned optimizer, but:
Are there any additional articles exploring the strategy of penalizing inconsistencies across different inputs? It seems both really promising to me, and like something that should be trivially breakable. I'd like to get a more detailed understanding of it.
And that raises the question: even as we live through a rise in AI capabilities that is keeping Eliezer's concerns very topical, why did Drexler's nano-futurism fade?
One view I've seen is that perverse incentives did it. Widespread interest in nanotechnology led to governmental funding of the relevant research, which caused a competition within academic circles over that funding, and discrediting certain avenues of research was an easier way to win the competition than actually making progress. To quote:
Hall blames public funding for science. Not just for nanotech, but for actually hurting progress in general. (I’ve never heard anyone before say government-funded science was bad for science!) “[The] great innovations that made the major quality-of-life improvements came largely before 1960: refrigerators, freezers, vacuum cleaners, gas and electric stoves, and washing machines; indoor plumbing, detergent, and deodorants; electric lights; cars, trucks, and buses; tractors and combines; fertilizer; air travel, containerized freight, the vacuum tube and the transistor; the telegraph, telephone, phonograph, movies, radio, and television—and they were all developed privately.” “A survey and analysis performed by the OECD in 2005 found, to their surprise, that while private R&D had a positive 0.26 correlation with economic growth, government funded R&D had a negative 0.37 correlation!” “Centralized funding of an intellectual elite makes it easier for cadres, cliques, and the politically skilled to gain control of a field, and they by their nature are resistant to new, outside, non-Ptolemaic ideas.” This is what happened to nanotech; there was a huge amount of buzz, culminating in $500 million dollars of funding under Clinton in 1990. This huge prize kicked off an academic civil war, and the fledgling field of nanotech lost hard to the more established field of material science. Material science rebranded as “nanotech”, trashed the reputation of actual nanotech (to make sure they won the competition for the grant money), and took all the funding for themselves. Nanotech never recovered.
Source: this review of Where's My Flying Car?
One wonders if similar institutional sabotage of AI research is possible, but we're probably past the point where that might've worked (if that even was what did nanotech in).
Yeah, this is the part I'm confused about as well. I think this proposal involves training a neural network that emulates a human? Otherwise I'm not sure how Eval_H(F(s_m), o_h) is supposed to work. It requires a human to make a prediction about the next step using the observations and the direct translation of the machine state, which means we need some way to describe the full machine state that the “human” we're using can understand. That precludes using actual humans to label the data, because I don't think we actually have any way to provide such a description. We'd need to train up a human simulator specifically adapted to parsing this sort of output.
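As a sketch of what I think Eval_H would have to look like in that case (every name, dimension, and architecture choice below is my own assumption, not something from the proposal):

```python
import torch
import torch.nn as nn


class HumanEvaluator(nn.Module):
    """Hypothetical learned stand-in for Eval_H: a network trained to mimic how
    a human would predict the next step, given the direct translation F(s_m) of
    the machine state plus the human-visible observations o_h. Dimensions and
    architecture are illustrative assumptions."""

    def __init__(self, translation_dim, obs_dim, hidden_dim=256, pred_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(translation_dim + obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, pred_dim),  # features of the predicted next step
        )

    def forward(self, translated_state, observations):
        # The translation is concatenated with the raw observations precisely
        # because an unaided human could not parse s_m directly.
        return self.net(torch.cat([translated_state, observations], dim=-1))
```

And then the question is where the training labels for this evaluator come from: real humans can label predictions made from observations alone, but not ones that depend on a full machine-state description they can't read.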