Come up with better Turing Tests

Stuart_Armstrong

So the Turing test has been "passed", and the general consensus is that this was achieved in a very unimpressive way - the 13-year-old Ukrainian persona was a cheat, the judges were incompetent, etc. These are all true, though the program did meet Turing's original criteria - and there are far more people willing to be dismissive of those criteria in retrospect than there were in advance. It happened about 14 years later than Turing had anticipated, which makes it quite a good prediction for 1950 (in my personal view, Turing made two mistakes that compensated for each other - the "average interrogator" was a much lower bar than he thought, but progress on the subject was much slower than he thought).

But anyway, the main goal now, as suggested by Toby Ord and others, is to design a better Turing test, something that can give AI designers something to aim at, and that would be a meaningful test of abilities. The aim is to ensure that if a program passes these new tests, we won't be dismissive of how it was achieved.

Here are a few suggestions I've heard about or thought about recently; can people suggest more and better ideas?

1. Use proper control groups. 30% of judges thinking that a program is human is meaningless unless the judges also compare with actual humans. Pair up a human subject with a program, and make it the judge's role to establish which of the two subjects is the human and which is not.
2. Toss out the persona tricks - no 13-year-olds, nobody with poor English skills. It was informative about human psychology that these tricks work, but we shouldn't allow them in future. All human subjects will have adequate English and typing skills.
3. On that subject, make sure the judges and subjects are properly motivated (financial rewards, prizes, prestige...) to detect the program or to appear human, respectively. We should also brief them that our usual conversational approach, which establishes which kind of human we are dealing with, is not useful for determining whether we are dealing with a human at all.
4. Use only elite judges. For instance, if Scott Aaronson can't figure it out, the program must have some competence.
5. Make a collection of generally applicable approaches (such as the Winograd Schemas) available to the judges, while emphasising that they will have to come up with their own exact sentences, since anything online could have been used to optimise the program already.
6. My favourite approach is to test the program on a task it was not optimised for. A cheap and easy way of doing that would be to test it on novel ASCII art.
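
To make the control-group point concrete, here is a minimal sketch of how a paired test could be scored. It assumes each trial is one judge facing one human and one program and picking which is the human, so pure guessing is right half the time. The function names and the 0.05 threshold are illustrative assumptions, not part of any existing contest's rules.

```python
from math import comb


def p_value_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of `correct` or more right answers in `trials` by pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def judges_beat_chance(correct: int, trials: int, alpha: float = 0.05) -> bool:
    """Hypothetical rule: judges 'detect' the program if guessing alone would
    rarely (probability <= alpha) produce a score this good."""
    return p_value_at_least(correct, trials) <= alpha


if __name__ == "__main__":
    # Illustrative numbers only, not results from any real contest.
    for correct in (5, 9):
        print(correct, round(p_value_at_least(correct, 10), 4),
              judges_beat_chance(correct, 10))
```

Unlike a bare "30% of judges were fooled", this compares the judges against the coin-flip baseline that the human control provides. Note that failing to beat chance in a small sample is weak evidence on its own; more trials would be needed before calling the program indistinguishable.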

My current method would be the lazy one of simply typing this, then waiting, arms folded:

"If you want to prove you're human, simply do nothing for 4 minutes, then re-type this sentence I've just written here, skipping one word out of 2".
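
For what it's worth, checking the reply to that challenge is mechanical. The sketch below is only an assumption about how a judge might verify it: "skipping one word out of 2" does not say whether to start with the first or the second word, so both readings are accepted, and the 240-second wait, the helper names and the case/whitespace normalisation are all illustrative choices.

```python
def skip_every_other_word(sentence: str, keep_first: bool = True) -> str:
    """Return every second word, starting with the first (or the second) word."""
    words = sentence.split()
    start = 0 if keep_first else 1
    return " ".join(words[start::2])


def _normalise(text: str) -> str:
    """Ignore case and spacing differences when comparing replies."""
    return " ".join(text.lower().split())


def passes_challenge(reply: str, sentence: str, silent_seconds: float,
                     min_wait: float = 240.0) -> bool:
    """True if the subject stayed silent long enough and then re-typed the
    sentence with every other word skipped (either reading accepted)."""
    if silent_seconds < min_wait:
        return False
    accepted = {
        _normalise(skip_every_other_word(sentence, keep))
        for keep in (True, False)
    }
    return _normalise(reply) in accepted
```

The point of the challenge is that the check is trivial for the judge, while the task combines timing, self-reference and a small novel transformation that a chatbot tuned for small talk was probably not built to handle.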

Ad 4. "Elite judges" is quite arbitrary. I'd rather iterate the test, each time keeping only those judges who identified the program correctly, or some variant of that (e.g. the top 50% by number of correct guesses). This way we select those who go beyond simply going along with a conversation and actually look for differences between program and human. (And as seen from the transcripts, most people just try to have a conversation rather than looking for flaws.) The drawback is that, if the program has a set personality, judges could just stick to identifying that personality rather than human characteristics.
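
A rough sketch of that iteration, assuming each round produces a count of correct guesses per judge and the top half goes through to the next round. The names, the alphabetical tie-break and the example scores are made up for illustration.

```python
def select_top_judges(scores: dict[str, int], keep_fraction: float = 0.5) -> list[str]:
    """Keep the best-scoring fraction of judges (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda judge: (-scores[judge], judge))
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


def iterate_selection(round_scores: list[dict[str, int]],
                      keep_fraction: float = 0.5) -> list[str]:
    """Run several rounds; only judges who survived earlier rounds are ranked."""
    pool = None
    for scores in round_scores:
        if pool is not None:
            scores = {j: s for j, s in scores.items() if j in pool}
        pool = set(select_top_judges(scores, keep_fraction))
    return sorted(pool or [])


if __name__ == "__main__":
    # Hypothetical correct-guess counts for two rounds.
    rounds = [
        {"ann": 7, "bob": 3, "cat": 9, "dan": 5},
        {"ann": 8, "bob": 10, "cat": 6},
    ]
    print(iterate_selection(rounds))
```

In this example bob's strong second round does not count, because he was already eliminated in the first. Whether that is the right design, or whether eliminated judges should get a way back in, is one of the "variants" left open above.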

Another approach might be to have the same program-human pair examined by 10 judges consecutively, each spending 5 minutes with both. The twist is that judges can leave instructions for the next judges. So if the program fails to perform "If you want to prove you're human, simply do nothing for 4 minutes, then re-type this sentence I've just written here, skipping one word out of 2", then every judge after the one who found that flaw can use it and make the right guess.

My favourite method would be to give the bot a simple physics textbook and then ask it to solve a few physics test problems. Even if it weren't actual AI, it would still prove helluva powerful. Just toss it summarized knowledge of quantum physics and ask it to solve for a GUT. Sadly, most humans wouldn't pass such a high-school physics test.

is actually original Turing Test.

EDIT:

is bad. It would exclude just as many actual AIs, and blind people as well. It is actually a more general problem with the Turing Test: it helps in testing programs that mimic humans, but not AI in general. For a text-based AI, senses are alien. You could develop a real intelligence which would fail when asked "How do you like the smell of glass?". Sure, it can be taught that glass doesn't smell, but that actually needs superhuman abilities. So while a superintelligence could perfectly mimic a human, a human-level AI wouldn't pass the Turing Test when asked about sensory stuff, just as humans would fail when asked about the nuances of geometry in four dimensions.

Physics problems are an interesting test - you could check for typical human mistakes.

You could throw in an unsolvable problem and see whether you get plausibly human reactions.

Stuart_Armstrong (12y)

So we wouldn't use blind people in the human control group. We'd want to get rid of any disabilities that the AI could use as an excuse (like the whole 13-year-old foreign boy thing). As for excluding AIs... the Turing test was conceived as a sufficient, but not necessary, measure of intelligence. If an AI passes, then it is intelligent; the converse (which is much harder) is not claimed.

RobinZ (12y)

Speaking of the original Turing Test, the Wikipedia page has an interesting discussion of the tests proposed in Turing's original paper. One of the possible readings of that paper includes another possible variation on the test: play Turing's male-female imitation game, but with the female player replaced by a computer. (If this were the proposed test, I believe many human players would want a bit of advance notice to research makeup techniques, of course.) (Also, I'd want to have all four conditions represented: male & female human players, male human & computer, computer & female human, and computer & computer.)
