All of oceaninthemiddleofanisland's Comments + Replies

Answer by oceaninthemiddleofanisland

'Predicting random text on the internet better than a human' already qualifies it as superhuman, as dirichlet-to-neumann pointed out. If you look at any given text, there's a certain amount of cognitive work, per word, needed to produce it. "Superhuman" then only requires asking it to replicate the work of multiple people collaborating, or processes which need a lot of human labour, like putting together a business strategy or writing a paper. Assuming it's mediocre in some aspects, the clearest advantage GPT-6 would have would be an interdisciplinary one - pooling together domain knowledge from disparate areas to produce valuable new insights.

How far away is this from being implementable?

John Steidley
It doesn't sound hard at all. The things Gwern is describing are the same sort of thing that people do for interpretability, where they, e.g., find an image that maximizes the probability of the network predicting a target class. Of course, you need access to the model, so only OpenAI could do it for GPT-3 right now.
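To make that concrete: a minimal sketch of the input-maximization idea described here, assuming white-box access to an image classifier (the model choice, class index, and hyperparameters are illustrative, not taken from the comment):

```python
# Hedged sketch: gradient ascent on an input image to maximize a target class.
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the input image is optimized, not the weights

target_class = 207                                    # arbitrary ImageNet class index
x = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from random noise
optimizer = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    logits = model(x)
    loss = -logits[0, target_class]   # maximize the target-class logit
    loss.backward()
    optimizer.step()
```

In practice one would add regularizers (jitter, blur, total variation) to get recognisable images; for a language model the analogue is optimizing a prompt or soft embedding rather than pixels.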

This probably won't add too much to the discussion, but I'm curious to see whether other people relate to this or have a similar process. I was kind of stunned when I heard from friends who got into composing about how difficult it is to figure out a melody and then write a complete piano piece, because to me, whenever I open up Sibelius or Dorico (and more recently Ableton), internally it seems like I'm just listening to what I wrote so far, 'hearing' a possible continuation lasting a few bars, and then quickly trying to transcribe ... (read more)

Logan Riggs
I tend to write melodies in multiple different ways:

1. Hearing it in my head, then playing it out. It's very easy to generate (like GPT, but with melodies), but transcribing is very hard! The common advice is to sing it out and then match it with the instrument, which is exactly what you did with whistling. If I don't record it, I will very often not remember it at all later; very similar to forgetting a dream. When I hear someone else's piano piece (or my own, recorded), I will often think "I would've played that part differently", which is the same as my brain predicting a different melody.

2. "Asemic playing" (thanks for the phrase!) - I've improvised for hundreds of hours, and I very often run into playing similar patterns when I'm in similar "areas", such as playing the same chord progression. I'll often have (1) melodies playing in my head while improvising, but I will often play the "wrong" note and it still sounds good. Over the years, I've gotten much better at remembering melodies I just played (because my brain predicts that the melody will repeat) and playing the "correct" note in my head on the fly.

3. Smashing "concepts" into a melody (a small sketch of a few of these transformations follows after this list):
* What if I played this melody backwards?
* Pressed every note twice?
* Held every other note a half-note longer?
* Used a different chord progression (so specific notes of the melody need to change to harmonize)?
* Taking a specific pattern of a melody, like which notes it uses, and playing new patterns there.
* Taking a specific pattern of a melody, like the rhythm between the notes (how long you hold each note, including rests), and applying it to other melodies.
* Taking a specific pattern of a melody, like the exact rhythm and relative notes, and starting on a different note (then continuing to play the same notes, relatively).
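A hypothetical sketch of a few of those transformations, treating a melody as a list of (MIDI pitch, duration-in-beats) pairs; the representation and function names are mine, purely to make the operations concrete:

```python
# Toy melody: C D E F G as (MIDI pitch, beats) pairs.
melody = [(60, 1.0), (62, 1.0), (64, 0.5), (65, 0.5), (67, 2.0)]

def backwards(notes):
    """Play the melody in reverse order."""
    return list(reversed(notes))

def double_every_note(notes):
    """Press every note twice."""
    return [note for note in notes for _ in range(2)]

def transpose(notes, semitones):
    """Start on a different note, keeping the same relative intervals."""
    return [(pitch + semitones, dur) for pitch, dur in notes]

def apply_rhythm(notes, other):
    """Keep this melody's pitches but borrow another melody's rhythm."""
    return [(pitch, dur) for (pitch, _), (_, dur) in zip(notes, other)]

print(backwards(melody))
print(transpose(melody, 5))  # same melody, a fourth higher
```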

So I've figured this out. Kinda. If you choose 'custom' then it will give you Griffin, but if you choose one of the conventional prompts and then edit it, you can get around it. So damn annoying.

avturchin
They acknowledged the use of a limited version of GPT-3; details: https://twitter.com/nickwalton00/status/1289946861478936577

Wow, I didn't realise I could get this angry about something so esoteric.

I'm beginning to think AID has changed what the "Dragon" model is without telling us, for cost reasons; I've had kind of the same experience with big lapses in storytelling that didn't occur as often before. Or maybe it's randomly switching based on server load? I can kind of understand it if that's the case, but the lack of transparency is annoying. I remember accidentally using the Griffin model for a day when my subscription ran out and not realising, because its Indonesian was still quite good...

gwern
Quite a few people have been complaining: https://www.reddit.com/r/AIDungeon/comments/i1qhg0/the_dragon_ai_just_got_worse/
avturchin
Yes, I think that is the correct impression. I've written to support, btw; no answer yet. One possible way to check the version is to try "Earth POV" - that is, "point of view". GPT-3 understands it correctly and will say something like "I am alone in the sky near the Sun". GPT-2 will just continue with a story.

Somehow the more obvious explanation didn't occur to me until now, but check the settings - you might be using the Griffin model, not the Dragon model. You have to change it manually even after you get the subscription. I have a window open specifically for poetry prompts (using the Oracle hack). I said "Write a long poem in Russian. Make sure the lines are long, vivid, rich, and full of description and life. It should be a love poem addressed to coffee. It should be 15 lines long", followed by "The Oracle, which is a native in Russian, ... (read more)

avturchin
Obviously, I pressed the "Dragon" button, but I suspect that I am still getting Griffin anyway, as I was also unable to repeat some of the reasoning tasks.

If it's a BPE encoding thing (which seems unlikely to me, given that it was able to produce Japanese and Chinese characters just fine), then the implication is that OpenAI carried over their encoding from GPT-2, where all foreign-language documents were removed from the dataset ... I would have trouble believing their team would have overlooked something that huge. This is doubly bizarre given that Russian is the 5th/6th most common language in the dataset. You may want to try prompting it with coherent Russian text; my best guess is that in the dataset, whene... (read more)

gwern
Looking into the details, BPEs seem to usually fall back to treating unknown characters as literal bytes: there are another 256 BPEs which cover the 256 possible bytes, and any UTF-8 character is 1-4 bytes, so it can be represented by 1-4 BPEs. The 1-byte UTF-8 characters are the ASCII characters, which have their own BPEs, so this fallback would be used only for 2-4-byte UTF-8 characters like Cyrillic or Chinese. So actually, now that I think about it, it's possible that Russian gets encoded to worse than 1 BPE per character - it could be 2 BPEs per character (since Cyrillic seems to fall in the 2-byte ranges of UTF-8). It'd depend on the details. (On the other hand, having to pay 2-4 BPEs per Unicode character is obviously not as big a deal for Japanese & Chinese characters...)

I wouldn't expect the BPE to allocate much space to Cyrillic just because it's the 5th most common script in the dataset, as that's just another way of saying all the Russian put together is all of 0.18% of the dataset. And keep in mind that the BPE encoding was not, AFAIK, redone for GPT-3, but is the same BPE OA has been using ever since GPT-2 way back when, and so was optimized for their Reddit-sourced, English-heavy original WebText.
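A minimal sketch of how that byte-level fallback shows up in token counts, assuming the GPT-2/GPT-3 BPE vocabulary as distributed with Hugging Face's transformers (the model name and example strings are just illustrative):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

for text in ["hello there", "привет как дела", "今日はいい天気"]:
    ids = tok.encode(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{text!r}: {len(text)} chars, {n_bytes} UTF-8 bytes, {len(ids)} BPE tokens")
```

Typically the English string compresses to roughly one token per word, while the Cyrillic and CJK strings cost one or more tokens per character, since each character is 2-4 UTF-8 bytes and few whole-character merges exist in the vocabulary.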
avturchin
I tried many prompts, but it produced gibberish in Russian. E.g.:

Привет, как дела? (What's going on?)

You don't know what to say. You're not sure if you should be thanking him or insulting him for this situation. He continues:

Немного просью у мы, что выставляется! (It's too late now, get out of here!)

That's a visualisation I made which I haven't posted anywhere else except under the r/ML thread collecting entries for GPT-3 demos, since I couldn't figure out which subreddit to post it in.

Two thoughts, one of them significantly longer than the other since it's what I'm most excited about.

(1) It might be the case that the tasks showing an asymptotic trend will resemble the trend for arithmetic – a qualitative breakthrough was needed, which was out of reach at the current model size but became possible at a certain threshold.

(2) For translation, I can definitely say that scaling is doing something. When you narrowly define translation as BLEU score ("does this one generated sentence match the reference sentence? by how ... (read more)
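For reference on that narrow framing, a minimal sketch of sentence-level BLEU using NLTK (the toy reference/candidate sentences and the smoothing choice are mine, purely illustrative):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # one reference translation
candidate = ["the", "cat", "is", "on", "the", "mat"]      # generated translation

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")   # n-gram overlap with the single reference sentence
```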

gwern
What is the source for that? I haven't seen it before. EDIT: https://twitter.com/joekina/status/1288511258832953344 ? Seems to postdate this comment though.
hippke
Regarding (1): Of course a step is possible; you never know. But for arithmetic, it is not a step. That may appear so from their poor Figure, but the data indicates otherwise.
avturchin
Interestingly, GPT-3 knows a few words in Russian, but can't produce any coherent text. It wrote, in garbled Russian: И все хотиваниям рукой плать, что недобрыжки.
Answer by oceaninthemiddleofanisland

I just finished Iain M. Banks' 'The Player of Games', so my thoughts are being influenced by that, but it had an interesting main character who made it his mission to become the best "general game-player" (i.e. no specialising in specific games), so I would be interested to see whether policy-based reinforcement learning models scale (thinking of how Agent 57 exceeded human performance across all Atari games).

It seems kind of trivially true that a large enough MuZero with some architectural changes could do something like play chess,... (read more)

Yes! I was thinking about this yesterday. It occurred to me that GPT-3's difficulty with rhyming consistently might not just be a byte-pair problem: any highly structured text with extremely specific, restrictive forward and backward dependencies is going to be a challenge if you're just linearly appending one token at a time onto a sequence without the ability to revise it (maybe we should try a 175-billion-parameter BERT?). That explains and predicts a broad spectrum of issues and potential solutions (here I'm calling them A, B and C): per... (read more)
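As a rough illustration of that left-to-right constraint (my own sketch, not something from the thread): a causal model has to commit to each token in order, while a masked model can condition on text on both sides of a gap, e.g. a constraint that appears after the blank. Using Hugging Face pipelines with illustrative model names:

```python
from transformers import pipeline

# GPT-style: generates strictly left to right, never revises earlier tokens.
causal = pipeline("text-generation", model="gpt2")
print(causal("Roses are red, violets are", max_new_tokens=5)[0]["generated_text"])

# BERT-style: fills a gap using both the left and the right context.
masked = pipeline("fill-mask", model="bert-base-uncased")
for cand in masked("Roses are red, violets are [MASK], that is how this couplet goes.")[:3]:
    print(cand["token_str"], round(cand["score"], 3))
```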

The best angle of attack here, I think, is synthesising knowledge from multiple domains. I was able to get GPT-3 to write and then translate a Japanese poem about a (fictional) ancient language model into Chinese, Hungarian, and Swahili, and annotate all of its translations with stylistic notes and historical references. I don't think any humans have the knowledge required to do that, but unsurprisingly GPT-3 does, and it performed better when I used the premise of multiple humans collaborating. It's said that getting different university departments... (read more)

johnswentworth
Awesome example!

I think you were pretty clear on your thoughts, actually. So, the easy/low-level response to some of your skeptical thoughts would be technical details - I'm going to do that, and then follow it with a higher-level, more conceptual response.

The source of a lot of my skepticism is GPT-3's inherent inconsistency. It can range wildly from its high-quality output to gibberish, repetition, regurgitation, etc. If it did have some reasoning process, I wouldn't expect such inconsistency. Even when it is performing so well people call it
... (read more)
[comment deleted]

Hmm, I think the purpose behind my post went amiss. The point of the exercise is process-oriented, not result-oriented - to either learn to better differentiate the concepts in your head by poking and prodding at them with concrete examples, or realise that they aren't quite distinct at all. But in any case, I have a few responses to your question. The most relevant one was covered by another commenter (reasoning ability isn't binary - it's quantitative, not qualitative). The remaining two are:

1. "Why isn't it an AGI?" here can be read as ... (read more)

[anonymous]
Why would goal-driven behavior be necessary for passing a Turing test? It just needs to predict human behavior in a limited context, which is what GPT-3 was trained to do. It's not an RL setting.

I would like to dispute that by drawing an analogy to the definition of fire before modern chemistry. We didn't know exactly what fire was, but it's a "you know it when you see it" kind of deal. It's not helpful to pre-commit to a certain benchmark, like we did with chess - at one point we were sure beating the world champion in chess would be a definitive sign of intelligence, but Deep Blue came and went and we now agree that chess AIs aren't general intelligence. I know this sounds like moving the goal-post, but then again, the point of contention here isn't whether OpenAI deserves some brownie points or not.

It seems like you think I made that suggestion in bad faith, but I was being genuine with that idea. The "competent judges" part was so that the judges, you know, are actually asking adversarial questions, which is the point of the test. Cases like Eugene Goostman should get filtered out. I would grant that the AI be allowed to be trained on a corpus of adversarial queries from past Turing tests (though I don't expect this to help), but the judges should also have access to this corpus so they can try to come up with questions orthogonal to it.

I think the point at which our intuitions depart is: I expect there to be a sharp distinction between general and narrow intelligence, and I expect the difference to resolve very unambiguously in any reasonably well-designed test, which is why I don't care too much about precise benchmarks. Since you don't share this intuition, I can see why you feel so strongly about precisely defining these benchmarks. I could offer some alternative ideas in an RL setting though:
* An AI that solves Snake perfectly on any map (maps should be randomly generated and separated between training and test set), or
* An AI that solves unseen Chr

Great, but the terms you're operating with here are kind of vague. What problems could you give to GPT-3 that would tell you whether it was reasoning, versus "recognising and predicting", passive "pattern-matching", or presenting an "illusion of reasoning"? This was a position I subscribed to until recently, when I realised that every time I saw GPT-3 perform a reasoning-related task, I automatically went "oh, but that's not real reasoning, it could do that just by pattern-matching", and when I saw it do some... (read more)

[anonymous]
Passing the Turing test with competent judges. If you feel like that's too harsh yet insist on GPT-3 being capable of reasoning, then ask yourself: what's still missing? It's capable of both pattern recognition and reasoning, so why isn't it an AGI yet?
[anonymous]

A bunch more examples here; a bit difficult to summarise, since it went from explaining how dopamine receptors work, to writing a poem about Amazon's logistics in the form of a paean to the Moon Goddess, to writing poems in Chinese based on English instructions and then providing astonishingly good translations, to having Amazon and Alibaba diss one another in the style of the 18th-century poet Mary Robinson. Link here: https://www.reddit.com/r/slatestarcodex/comments/hrx2id/a_collection_of_amazing_things_gpt3_has_done/fy7i7im/?context=3

Example:

The oracle
... (read more)