Come up with better Turing Tests

Stuart_Armstrong

So the Turing test has been "passed", and the general consensus is that this was achieved in a very unimpressive way - the 13-year-old Ukrainian persona was a cheat, the judges were incompetent, etc... These criticisms are all true, though the program did meet Turing's original criteria - and there are far more people willing to be dismissive of those criteria in retrospect than there were in advance. It happened about 14 years later than Turing had anticipated, which makes it quite a good prediction for 1950 (in my personal view, Turing made two mistakes that compensated for each other - the "average interrogator" was a much lower bar than he thought, but progress on the subject was much slower than he thought).

But anyway, the main goal now, as suggested by Toby Ord and others, is to design a better Turing test, something that can give AI designers something to aim at, and that would be a meaningful test of abilities. The aim is to ensure that if a program passes these new tests, we won't be dismissive of how it was achieved.

Here are a few suggestions I've heard about or thought about recently; can people suggest more and better ideas?

Use proper control groups. 30% of judges thinking that a program is human is meaningless unless the judges also compare with actual humans. Pair up a human subject with a program, and the role of the judge is to establish which of the two subjects is the human and which is not.
Toss out the persona tricks - no 13-year-olds, nobody with poor English skills. It was informative about human psychology that these tricks work, but we shouldn't allow them in future. All human subjects will have adequate English and typing skills.
On that subject, make sure the judges and subjects are properly motivated (financial rewards, prizes, prestige...) to detect or appear human. We should also brief them that our usual conversational approach, which aims to establish which kind of human we are dealing with, is not useful for determining whether we are dealing with a human at all.
Use only elite judges. For instance, if Scott Aaronson can't figure it out, the program must have some competence.
Make a collection of generally applicable approaches (such as the Winograd Schemas) available to the judges, while emphasising they will have to come up with their own exact sentences, since anything online could have been used to optimise the program already.
My favourite approach is to test the program on a task it was not optimised for. A cheap and easy way of doing that would be to test it on novel ASCII art.
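
To make the control-group suggestion above concrete: if each judge is shown one human and one program and must say which is which, a judge guessing at random is right half the time, so the question becomes whether the judges beat 50%. The sketch below is only one possible way to score such a contest, using an exact one-sided binomial test; the function name and the example numbers are hypothetical and do not come from any real competition.

```python
from math import comb


def p_value_at_least(correct: int, trials: int) -> float:
    """Exact one-sided binomial p-value.

    Probability of seeing `correct` or more right identifications in
    `trials` paired comparisons if every judge were guessing at random
    (success probability 0.5 per comparison).
    """
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


# Hypothetical numbers for illustration only.
trials = 30
correct = 21
print(f"{correct}/{trials} correct, p = {p_value_at_least(correct, trials):.4f}")
```

A small p-value would suggest the judges can reliably tell the program from the human; a program could only claim to have "passed" if the judges' accuracy were indistinguishable from chance over a reasonably large number of pairings.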

My current method would be the lazy one of simply typing this, then waiting, arms folded:

"If you want to prove you're human, simply do nothing for 4 minutes, then re-type this sentence I've just written here, skipping one word out of 2".

Your test seems overly complicated; what about simple estimates? Like "how long would it take to fly from Paris, France, to Paris, USA" or similar? Add in some Fermi estimates, get them to show their work, etc...
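
For a sense of what "showing the work" might look like on the flight question, here is a rough sketch. It assumes "Paris, USA" means Paris, Texas (the comment does not say which one), uses approximate coordinates, treats the Earth as a sphere, and assumes a cruising speed of about 900 km/h. All of these are assumptions made for illustration, and the result ignores layovers and the lack of a direct route.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0   # mean radius, spherical approximation
CRUISE_SPEED_KMH = 900.0   # assumed typical airliner cruising speed


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates (assumptions, rounded to two decimals).
paris_france = (48.86, 2.35)
paris_texas = (33.66, -95.55)

distance_km = great_circle_km(*paris_france, *paris_texas)
flight_hours = distance_km / CRUISE_SPEED_KMH
print(f"about {distance_km:.0f} km, about {flight_hours:.1f} hours in the air")
```

The point of the question is less the final number than the chain of reasoning: noticing the ambiguity in "Paris, USA", picking a distance and a speed, and dividing one by the other.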

That is much better - I wasn't thinking very carefully when I invented my question.

If the human subject is properly motivated to want to appear human, they'd relax and follow the instructions. Indignation is another arena in which non-comprehending programs can hide their lack of comprehension.

I realize this, but as someone who wants to appear human, I want to make it as difficult as possible for any kind of computer algorithm to simulate my abilities. My mental model of sub-sapient artificial intelligence is such that I believe many such might pass your test, and therefore - were I motivated properly - I would want to make it abundantly clear that I had done more than correctly parse the instructions "[(do nothing) for (4 minutes)] then {re-type [(this sentence I've just written here,) skipping (one word out of 2.)]}" That is a task that is not qualitatively different from the parsing tasks handled by the best text adventure game engines - games which are very far from intelligent AI.

I wouldn't merely sputter noisily at your failure to provide responses to my posts, I'd demonstrate language comprehension, context awareness, knowledge of natural-language processing, and argumentative skills that are not tested by your wait-four-minutes proposal, both because I believe that you will get better results if you bear these factors in mind and because - in light of the fact that I will get better results if you bear them in mind - I want you to correctly identify me as a human subject.
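
The parsing point above can be illustrated with a short sketch. It assumes the instruction has already been parsed into "wait, then retype every other word", and it reads "skipping one word out of 2" as keeping the first, third, fifth... words; the function name and its parameters are hypothetical. Once the parse is done, the rest of the task needs only a few lines.

```python
import time


def follow_instruction(sentence, wait_seconds=240, sleep=time.sleep):
    """Wait, then return the sentence with every second word dropped.

    `sleep` is injectable so the wait can be skipped when experimenting.
    """
    sleep(wait_seconds)            # "do nothing for 4 minutes"
    words = sentence.split()       # split on whitespace
    return " ".join(words[::2])    # keep words 1, 3, 5, ...


# Skip the real four-minute wait for the demonstration.
print(follow_instruction("one two three four five", sleep=lambda seconds: None))
```

This does nothing to show how hard the parsing step itself is; it only shows that the behaviour being tested, once the instruction is understood, is mechanically trivial.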
