The chatterbot "Eugene Goostman" has apparently passed the Turing test:

No computer had ever previously passed the Turing Test, which requires 30 per cent of human interrogators to be duped during a series of five-minute keyboard conversations, organisers from the University of Reading said.

But "Eugene Goostman", a computer programme developed to simulate a 13-year-old boy, managed to convince 33 per cent of the judges that it was human, the university said.

As I kind of predicted, the program passed the Turing test, but does not seem to have any trace of general intelligence. Is this a kind of weak p-zombie?

EDIT: The fact that it was a publicity stunt and the fact that the judges were pretty terrible do not change the fact that Turing's criteria were met. We now know that these criteria were insufficient, but that's because machines like this were able to meet them.


Scott Aaronson has posted a transcript of his "conversation" with Eugene Goostman.

Hilarious!


I am a bit sceptical about whether or not it actually passed the Turing test. To me it looks more like a publicity stunt, for the following reasons:

1) 5 minutes is a short period of time.

2) I don't believe Turing mentioned anything about 30%. I might be wrong on this one.

3) I don't know if the judges were properly trained. What questions did they ask? I feel like there must be plenty of questions related to IQ and creativity that a thirteen-year-old could answer with ease but that Eugene Goostman would struggle with. Examples: "Cow is to bull as bitch is to ....?", or "Once upon a time there lived a pink unicorn in a big mushroom house with three invisible potatoes. Could you finish the story for me in a creative way and explain why the unicorn ended up painting the potatoes pink?". The idea with the Turing test is that the computer should be indistinguishable from a human (in this case a 13-year-old non-native English speaker). I don't believe this criterion has been met until I see a chat transcript with reasonably hard questions.

4) Having the bot pose as a non-native-English-speaking 13-year-old might not be a violation of the rules, but I very much feel like it goes against the spirit of the Turing test. It reminds me a bit of this comic (http://existentialcomics.com/comic/15). But this is beside the point; I don't even think the bot would pass the Ukrainian-13-year-old-boy Turing test if it was asked reasonably hard questions.

Until I learn more about the proceedings I remain utterly unconvinced that this is the milestone in AI that the media portray it to be. It is nonetheless pretty cool!

"Once upon a time there lived a pink unicorn in a big mushroom house with three invisible potatoes. Could you finish the story for me in a creative way and explain why the unicorn ended up painting the potatoes pink?"

Well obviously, the unicorn did it to satisfy the ghost of Carl Sagan, who showed up at the unicorn's house and started insisting that the potatoes weren't there. Annoyed, she tried throwing flour on the potatoes to convince him, but it turned out the potatoes really were permeable to flour. It was touch and go for a while there, and even the unicorn started to doubt the existence of her invisible potatoes (to say nothing of her invisible garden and invisible scarecrow - but that at least had done an excellent job of keeping the invisible birds away). Eventually, though, it was found that pink paint coated the potatoes just fine, and so Carl happily went back to his post co-haunting the Pioneer 10 probe. The whole affair turned out to be a boon for the unicorn, as the pink paint put a stop to a previously unfalsifiable dragon, who had been eating her potatoes (or so she suspected - she had never been able to prove it). The dragon, for his part, simply went back to his old habit of terrorizing philosophers' thought experiments.

Nice try, chatterbot.

co-haunting

I see what you did there.

The test was in fact as Turing specified. In addition to 30% being the challenge, as Stuart pointed out, Turing specified 5 minutes and an "average interrogator".

The more interesting point here, I think, is the discovery (not very surprising by now) that a program that can pass the true Turing Test is still narrow AI not applicable to many other things.

The 30% quote is legit:

" I believe that in about fifty years' time it will be possible, to programme computers, with a storage capacity of about 109, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning."

http://loebner.net/Prizef/TuringArticle.html

This was a prediction Turing made, not how the test was defined.

We can't do it in 10^9 bits, though. Of course that's just nitpicking.
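(For scale, 10^9 bits works out to only about 125 megabytes.)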

Maybe with the best compression we can? But yeah, that's not the main goal.

It's Kevin Warwick; it's completely a publicity stunt.


Exactly. I was pretty disappointed to see this posted on LW.

I mean, successfully imitating a 4Chan user would technically pass. (I wrote that piece after one of Warwick's Turing test press releases six years ago.)

I feel like if any program does nearly that well, the judges aren't cheating enough. They should be picking things they know the computer is bad at, like drawing something with ASCII art and asking what it is, or having it talk to a bot and seeing if the conversation goes anywhere.

If all you do is talk, then all it shows is that the computer is good at running a conversation. Maybe that just was never something that took a lot of intelligence in the first place.

This is courtesy of Kevin Warwick at the University of Reading, who is good at making media claims along these lines, and has done so for several years. I advise disregarding anything containing his name.

I don't know why they're calling this the "first time".

In 1972 a bot was able to convince trained professionals that it was a human schizophrenic:

Kenneth Colby created PARRY in 1972, a program described as "ELIZA with attitude".[28] It attempted to model the behaviour of a paranoid schizophrenic, using a similar (if more advanced) approach to that employed by Weizenbaum. In order to validate the work, PARRY was tested in the early 1970s using a variation of the Turing Test. A group of experienced psychiatrists analysed a combination of real patients and computers running PARRY through teleprinters. Another group of 33 psychiatrists were shown transcripts of the conversations. The two groups were then asked to identify which of the "patients" were human and which were computer programs.[29] The psychiatrists were able to make the correct identification only 48 percent of the time — a figure consistent with random guessing.[30]

A foreign 13-year-old who isn't being challenged is a low bar to pass.

A bot which posts below YouTube videos and does nothing but spew racial abuse and “lol” would be indistinguishable from the 13-year-old humans doing the same thing, so it would technically pass the Turing test.

I’ll be much more interested when it can convince a group of professionals that it’s another professional in their field; that would be much more useful.

In the late 1990s I played a MUD that had a chatbot on it for months before the people running the game found out and kicked "him" off for violation of the no-bots rule. The chatbot used one specific group chat line and acted somewhat like the hypothetical video poster - mild verbal insults that weren't quite nasty enough to justify complaining to admin about, potty humor, "shut up [name]" and similar responses to questions, and other behaviors that were believably how a middle-school-aged player with trollish intentions might act.

Lowering the standard of the chatbot's expected conversational level by giving it the persona of a child or early adolescent speaking in a language other than his/her first does seem like a form of cheating while following the letter of the rules. At a minimum, I'd like to see the chatbot pass as an ordinary adult of at least average intelligence who is a native speaker of the language that the test is conducted in. A fellow professional in a given field would be even better.

Let's discuss a new type of Reverse Turing Test.

This simply consists of coming up with a general class of question that you think would reliably distinguish between a chatbot and a human within about 5 minutes of conversation, and explaining which feature of "intelligence" this class of question probes.

If you're not able to formulate the broad requirements for such a class of question, you have no business being the judge in a Turing Test. You're only playing the chatbot as you would play a video game.

One of my candidates for questions of this kind: ask the interviewee to explain a common error of reasoning that people make, or can make. For instance: "If you look at the numbers, there's quite a correlation between sales of ice cream in coastal locations and number of drownings. Some people might be tempted to conclude that ice cream causes people to drown. Do you think that's right, and if not, why not?"

For another example, Dennett discusses having the chatbot explain a joke.

ETA: Scott Aaronson passes with flying colors. Chatbots are likely to lack basic encyclopedic knowledge about the world which every human possesses. (To some extent things like the Wolfram platform could overcome this for precise questions such as Scott's first - but that still leaves variants like "what's more dangerous, a tiger or an edible plant" that are vague enough that quantitative answers probably won't be accessible to a chatbot.)

I quite recommend The Most Human Human by Brian Christian, where he participates in a Turing Test as one of the decoys, and puts a lot of thought into how to steer the conversations to give himself the distinction of being the human most frequently correctly identified as human.

Blatant bullshit. Nothing even close to the Turing Test was passed. Too much charity toward a bullshit publicity stunt.

Ok, properly rephrased: "Turing's 1950 prediction on expected level of success for his test, which he predicted to happen in 2000, has been achieved in 2014".

I think the main problem is that "Turing Test" has become an overbroad term. It extends from variants coming out of Turing's original paper (which we now know to be too weak) through to much stronger idealised versions of what the Turing test should be for it to be useful. "Nothing even close..." depends on which end of the spectrum we're thinking of.

Turing's 1950 prediction on expected level of success for his test, which he predicted to happen in 2000, has been achieved in 2014

No. Please apply more skepticism to press releases from Kevin Warwick. See http://www.kurzweilai.net/response-by-ray-kurzweil-to-the-announcement-of-chatbot-eugene-goostman-passing-the-turing-test

Nothing Kurzweil says undermines the claim Kevin made, given what Turing wrote in 1950:

I believe that in about fifty years' time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.

Anyway, we seem to be agreeing on what actually happened (nothing much) and what its implications are (nothing much), so debating whether this counts as a pass or not is not particularly useful.

From Turing's original paper:

Interrogator: In the first line of your sonnet which reads "Shall I compare thee to a summer's day," would not "a spring day" do as well or better?

Witness: It wouldn't scan.

Interrogator: How about "a winter's day," That would scan all right.

Witness: Yes, but nobody wants to be compared to a winter's day.

Interrogator: Would you say Mr. Pickwick reminded you of Christmas?

Witness: In a way.

Interrogator: Yet Christmas is a winter's day, and I do not think Mr. Pickwick would mind the comparison.

Witness: I don't think you're serious. By a winter's day one means a typical winter's day, rather than a special one like Christmas.

Yes, I think Turing was very mistaken in his impression of what an "average" interrogator would be like.

This compensated for his over-optimism about the progress of computers, giving him an OK prediction by chance.

This is bunk: see here. Working in an AI lab, everyone I know who has heard about this release is either extremely annoyed or actively scornful of it (or both). I would like to humbly propose that we talk about anything other than this.

I've read one transcript of a judge conversation. What I find striking is that the judge seems to be doing their best to be fooled! Of course, no one wants to get a 13-year-old upset.

In a Turing Test situation I'd start by trying a bunch of Winograd Schemas.
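For anyone unfamiliar with them: a Winograd schema is a pair of sentences differing by a single word, where that word flips which noun a pronoun refers to, so answering correctly takes world knowledge rather than pattern matching. Below is a minimal sketch of what such probing could look like; the ask() function is hypothetical, standing in for whatever chat interface is under test, and the schema pair is Winograd's classic councilmen example.

```python
# Minimal sketch of Winograd-schema probing. The ask() callable is
# hypothetical -- it stands in for whatever chat interface is under test.

SCHEMA = {
    "sentence": "The city councilmen refused the demonstrators a permit "
                "because they {verb} violence.",
    "question": "In that sentence, who {verb} violence?",
    "answers": {"feared": "the city councilmen",
                "advocated": "the demonstrators"},
}

def probe(ask):
    """Ask both variants of the schema and print the expected referent."""
    for verb, expected in SCHEMA["answers"].items():
        q = (SCHEMA["sentence"].format(verb=verb) + " "
             + SCHEMA["question"].format(verb=verb))
        print("Q:", q)
        print("  expected:", expected)
        print("  got:", ask(q))

# Try it with a stub in place of a real chatbot:
probe(lambda q: "I dunno, why do you ask?")
```

A human flips the answer when the verb flips; a bot that keys off surface patterns generally gives the same answer both times, or dodges the question.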

I don't think that was a judge conversation. That was just someone using the online chat program:

"I logged on to what I think is the Goostman program. Here’s the transcript of our conversation: (Eugene is supposed to be around 13 years old.)"

Not only that, but it's an old version from a year ago. (Not that I think the real judges' conversation would be significantly better.)

You're very likely right, I misinterpreted.

As others noted, the bot only succeeded in passing the test because the judges themselves failed it.

On that note, why don't we have a thread where people try to unmask Eugene's true nature with a single question (replies indicating server downtime do not count)?

Try your best here: http://default-environment-sdqm3mrmp4.elasticbeanstalk.com/

I've seen somebody do it by greeting the bot with the wrong name. The bot doesn't complain.

Just proof that the Turing Test is not what Turing imagined it would be. It's more an exercise in exploiting vulnerabilities in judges than in genuinely advancing AI.

The question then becomes: how can a harder variant of the Turing Test be created that would stay true to the spirit of the original, yet motivate high-quality, generally-applicable research?

That sounds like one of those questions whose answer gets us a lot of the way to true AI.

Well, let's not set the bar too high. E.g. "convinces 90% of a panel of psychologists, cognitive scientists, neuroscientists, and Natural Language Processing researchers in an hour-long interrogation".

Somebody else mentioned Winograd schema testing, which is justified by its targeting of specific weaknesses of current Question Answering / NLP approaches.

Increase the time, increase the age, increase the degree of contact.

The highest level might be a full-spectrum test using a human-like robot controlled by an AI which lives and works with professionals, convinces them it's another professional, forms relationships, and goes unnoticed for months or years.

Absolute bullshit; it's shameful that FHI went anywhere near this. It's not even technically true, since the Turing Test as originally specified includes three participants: judge, human, and machine.

Of course the "news" is bunk, but I don't see that this LW post deserves all the downvotes it's evidently received. Halo/horns effect in action?

(I would be more certain that the post doesn't deserve the downvotes if Stuart had put quotation marks around "passed" in its title.)

It passed Turing's original criteria. I don't see how I can't consider that a genuine pass, however we feel about the methods used.

I think all it shows is that Turing's original suggestion of 30% success for 5 minutes with average interrogators was probably overoptimistic. Those particular stipulations were never, it seems to me, core to what Turing was saying, and the sample conversations in his article make it clear that even if he said "average" he was actually thinking of a rather higher standard of interrogation than "Eugene" got.

And of course the whole "13-year old immigrant who doesn't speak English very well" thing is rather a cheat. Here, I've got a program that passes the Turing test. It simulates a person who doesn't know how to use a computer keyboard.

Somehow this emotion-sensing mood-improving robot seems like more of an achievement: http://edition.cnn.com/2014/06/06/tech/innovation/pepper-robot-emotions/

(I've been working on that robot for more than a year by the way)

Side note for those who might not have come across him before: Kevin Warwick is a Professor of Cybernetics who, among other things, has communicated with his wife through electrical cables attached directly to nerves in their forearms.

Eugene has actually been around for many years - since at least 2008 (http://en.wikipedia.org/wiki/Loebner_Prize), when he convinced one of 12 judges in a parallel test. One of the judges found him more human than an actual human conversation partner.

People keep saying the test is bad, but I feel the standards are very high already. You have to fool a human judge who is on the lookout for a bot. Based on the news articles, it's not clear if Eugene competed against a human partner in side-by-side conversations, but since they're so insistent about the "true" Turing test I'd guess he did. The ability to fool an unsuspecting judge has been around since AIM bots.

Now, I don't want to say I told you so...