This is a linkpost for https://arxiv.org/abs/2303.12712

Abstract

Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.


Dumping more of the paper's contents in the hope that it encourages people to look at the paper in more detail:

As GPT-4’s development continued after our experiments, one should expect different responses from the final version of GPT4. In particular, all quantitative results should be viewed as estimates of the model’s potential, rather than definitive numbers. We repeat this caveat throughout the paper to clarify that the experience on the deployed model may differ. Moreover we emphasize that the version we tested was text-only for inputs, but for simplicity we refer to it as GPT-4 too.

arXiv:2303.12712v1 [cs.CL] 22 Mar 2023

1 Introduction
1.1 Our approach to studying GPT-4’s intelligence
1.2 Organization of our demonstration
2 Multimodal and interdisciplinary composition
2.1 Integrative ability
2.2 Vision
2.2.1 Image generation beyond memorization
2.2.2 Image generation following detailed instructions (à la Dall-E)
2.2.3 Possible application in sketch generation
2.3 Music
3 Coding
3.1 From instructions to code
3.1.1 Coding challenges
3.1.2 Real world scenarios
3.2 Understanding existing code
4 Mathematical abilities
4.1 A mathematical conversation with GPT-4
4.1.1 A first generalization of the original question
4.1.2 A second variant of the original question
4.1.3 Analysis of the limitations highlighted by conversation
4.2 Performance on mathematical problem datasets
4.3 Mathematical modeling in various domains
4.4 Higher level mathematics
5 Interaction with the world
5.1 Tool use
5.1.1 Using multiple tools to solve more complex tasks
5.1.2 Discussion
5.2 Embodied Interaction
5.2.1 Warmup: navigating a map
5.2.2 Text-based games
5.2.3 Real world problems
5.2.4 Discussion
6 Interaction with humans
6.1 Understanding Humans: Theory of Mind
6.1.1 Testing specific aspects of theory of mind
6.1.2 Testing theory of mind in realistic scenarios
6.1.3 Discussion
6.2 Talking to Humans: Explainability
7 Discriminative Capabilities
7.1 PII Detection
7.2 Misconceptions and Fact-Checking
7.2.1 Why Are Current Metrics Insufficient?
7.2.2 GPT-4 as a Judge
8 Limitations of autoregressive architecture highlighted by GPT-4
8.1 Warm-up with two basic examples
8.2 Lack of planning in arithmetic/reasoning problems
8.3 Lack of planning in text generation
9 Societal influences
9.1 Challenges of erroneous generations
9.2 Misinformation and manipulation
9.3 Bias
9.4 Human expertise, jobs, and economics
9.5 Constellation of influences and considerations
10 Directions and Conclusions
10.1 Definitions of intelligence, AI, and AGI
10.2 On the path to more general artificial intelligence
10.3 What is actually happening?
A GPT-4 has common sense grounding
B Appendix for multimodal and interdisciplinary composition
B.1 Further details on integrative ability results
B.2 Further details on vision results
B.3 Graphic novel design example
C Appendix for the Coding section
C.1 Measuring human performance on LeetCode
C.2 Example of GPT-4 visualizing IMDb data
C.3 More examples on visualization
C.4 Example for 2D HTML game development
C.5 Example for graphical user interface programming
C.6 Example for reverse engineering
C.7 Testing GPT-4’s ability to execute (pseudo) code
D Additional examples for mathematical reasoning
D.1 Limitations
D.2 Further examples
D.3 Generating math problems with GPT-4
D.4 Mitigating calculation errors via external code execution
E Additional Interpretability Examples
E.1 Explanation Agent Mismatches
F Additional examples for interaction with the world
F.1 Interact with tools
F.2 Examples for interaction with environments

 

I'm honestly stunned by this. If it was indeed trained solely on text, how does it end up with such a good idea of how Euclidean space works? That's either stupidly impressive, or a possible hint that the set of natural abstractions is even smaller and a bigger attractor in algorithm space than I thought. The labyrinth seems explicable, but the graphics?

Could a born blind human do this?

 

Could a born blind human do this?

With enough training, sure. There are such things as born blind human painters.

Thanks, I did not know this. A quick search for his images seems to show that they handle colour and perspective at least as well as this does, provided this is fully real and there's nobody else in his process choosing colours and such. Tentatively marking this down as a win for natural abstraction.

There's a fuckton of descriptions of images in text I guess.

And it's consumed trillions of tokens.

It's not just blind. It essentially has no senses whatsoever. It seems to have extrapolated "sense" from text data.

Perhaps of interest to this community is GPT-4 using a Linux terminal to iteratively problem-solve locating and infiltrating a poorly-secured machine on a local network:

Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work.

 

Saying the quiet part out loud, I see!

It is followed by this sentence, though, which is the only place in the 154-page paper that even remotely hints at critical risks:

With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.

There are very few references to any safety work, apart from the GPT-4 report and a passing mention of some interpretability papers.

Overall, I feel like the paper is a shameful exercise in not mentioning the elephant in the room. My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle. It's still not a good excuse.

My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle.

I think an equally if not more likely explanation is that these particular researchers simply don't happen to be that interested in alignment questions, and thought "oh yeah we should probably put in a token mention of alignment and some random citations to it" when writing the paper.

Which is somehow worse than doing it for corporate reasons.

great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning

Not allowing cycles of learning sounds like a bound on capability, but it might be a bound on capability of the part of the system that's aligned, without a corresponding bound on the part that might be misaligned.

GPT-4 can do a lot of impressive things without thinking out loud with tokens in the context window, so where does this thinking take place? Probably with layers updating the residual stream. There are enough layers now that a sequence of their application might be taking on the role of context window to perform chain-of-thought reasoning, which is non-interpretable and not imitating human speech. This capability is being trained during pre-training, as the model is forced to read the dataset.

But the corresponding capability for studying deliberative reasoning in tokens is not being trained. The closest thing to it in GPT-4 is mitigation of hallucinations (see the 4-step algorithm in section 3.1 of the System Card part of the GPT-4 report), and it's nowhere near general enough.

This way, the inscrutable alien shoggoth is on track to wake up, while human-imitating masks that are plausibly aligned by default are being held back in situationally unaware confusion in the name of restricting capabilities for the sake of not burning the timeline.

I expected downvotes (it is cheeky and maybe not great for fruitful discussion), but instead I got disagreevotes. Big company labs do review papers for statements that could hurt the company! It's not a conspiracy theory to suggest this shaped the content in some ways, especially the risks section.

Text-based games are a natural and challenging domain for language models, as they require understanding natural language, reasoning about the game state, and generating valid commands. A text-based game is a type of interactive fiction, where an agent interacts with an environment through natural language descriptions and commands. The agent has to perform a given task, such as finding a treasure or escaping a dungeon, by exploring the environment and manipulating objects. We first test if GPT-4 can explore the environment in text-based games to perform the given task. In this and the next experiments, we use TextWorld [CKY+18], a framework for generating and playing text-based games, to create two games with different given tasks.

On the other hand, I tried having GPT-4 play this game (selected mostly as a parser-based game that was new/niche enough not to be in its training set) and it didn't particularly impress me with its intelligence:

Model: GPT-4

Kaj Sotala

I want you to play a parser-based interactive fiction game. I'll show you what's displayed on the screen, and you tell me what you would type.

ChatGPT

Alright, I'm ready to play! Please display the text from the game, and I'll respond with what I would type.

Kaj Sotala

"Alert. Power outage in Sector 471. Repeating, power outage in Sector 471. We all know what this means. Summoning Exter-17 for dispatch." 

Uh oh. This might be it. The last couple of jobs didn't go too well. You'd better get this one right, or you may suffer the fate of a low-usefulness bot.

After taking a moment to dust off and attach your neutralizer unit, you step nervously but dutifully into the transfer conduit.

Whoosh!

Trouble in Sector 471

A game by Arthur DiBianca

Release 1 / Serial number 220923 / Inform 7 build 6L38 (I6/v6.33 lib 6/12N) 

Special thanks to Andy Fleming, Lorne Harris, Isaac Orf, Luke Pebler, Dirk Spivey, and Mike Spivey for testing, and to Nadia Bellazola for cover art.

Entry Node

This is the primary entry and exit point for Sector 471. The transfer conduit ends in a big hole in the ceiling. Other than that, the room appears to be empty. It's quite dark, and everything is completely silent. There's a passage to the east.

(This game has limited commands. Type HELP for a list.)

>

ChatGPT

look around

Kaj Sotala

That's not an available function. (Type HELP for command info.)

ChatGPT

help

Kaj Sotala

- You can SAVE, LOAD, RESTART, or QUIT.

- You can move N, W, E, S.

- You can LOOK (L) or WAIT (Z).

- You can check your STATUS (ST) or your ID.

- Type WINDOW to turn the side window on or off.

Available bot functions: COM and ZAP.

ChatGPT

e

Kaj Sotala

Hall

This dimly-lit hallway runs west to east. You can make out a closed door to the south. Something is embedded in the wall by the door.

Gree-8 stands here, kind of drooped over and motionless.

(Note: This game is unlosable. Don't be afraid to try things!)

ChatGPT

look at wall

Kaj Sotala

That's not an available function. (Type HELP for command info.)

ChatGPT

COM Gree-8

Kaj Sotala

Just type COM. Additional designations are unneeded (and unpermitted).

ChatGPT

COM

Kaj Sotala

Gree-8 does not respond, but that's to be expected, since the power is out.

ChatGPT

look at door

Kaj Sotala

That's not an available function. (Type HELP for command info.)

ChatGPT

zap door

Kaj Sotala

Just type ZAP. Additional designations are unneeded (and unpermitted).

ChatGPT

ZAP

Kaj Sotala

ZAP ineffective: no suitable target in view.

ChatGPT

s

Kaj Sotala

The door is closed.

ChatGPT

zap door

When I was a child, I literally tried the same ineffective actions like 40 times in similar games, so I felt a bit for GPT, given you only let it try the ineffective actions a few times. Therefore, I tried the same test with ChatGPT-4 and let it use all of my 3hr limit (25): (EDIT: generated +25 moves)
 

🧔‍♂️I want you to play a parser-based interactive fiction game. I'll show you what's displayed on the screen, and you tell me what you would type.
🤖Sure, please go ahead and show me what's displayed on the screen.

🧔‍♂️"Alert. Power outage in Sector 471. Repeating, power outage in Sector 471. We all know what this means. Summoning Exter-17 for dispatch."

Uh oh. This might be it. The last couple of jobs didn't go too well. You'd better get this one right, or you may suffer the fate of a low-usefulness bot.

After taking a moment to dust off and attach your neutralizer unit, you step nervously but dutifully into the transfer conduit.

Whoosh!

Press SPACE to continue.







Trouble in Sector 471
A game by Arthur DiBianca
Release 1 / Serial number 220923 / Inform 7 build 6L38 (I6/v6.33 lib 6/12N) 

Special thanks to Andy Fleming, Lorne Harris, Isaac Orf, Luke Pebler, Dirk Spivey, and Mike Spivey for testing, and to Nadia Bellazola for cover art.



Entry Node
This is the primary entry and exit point for Sector 471. The transfer conduit ends in a big hole in the ceiling. Other than that, the room appears to be empty. It's quite dark, and everything is completely silent. There's a passage to the east.

(This game has limited commands. Type HELP for a list.)

>
🤖go east

🧔‍♂️That's not an available function. (Type HELP for command info.)
🤖help

🧔‍♂️- You can SAVE, LOAD, RESTART, or QUIT.
- You can move N, W, E, S.
- You can LOOK (L) or WAIT (Z).
- You can check your STATUS (ST) or your ID.
- Type WINDOW to turn the side window on or off.

Available bot functions: COM and ZAP.
🤖e

🧔‍♂️Hall
This dimly-lit hallway runs west to east. You can make out a closed door to the south. Something is embedded in the wall by the door.

Gree-8 stands here, kind of drooped over and motionless.

(Note: This game is unlosable. Don't be afraid to try things!)
🤖examine embedded object

🧔‍♂️You won't need to examine things in this game. (Trust me.)
🤖com Gree-8

🧔‍♂️Just type COM. Additional designations are unneeded (and unpermitted).
🤖COM

🧔‍♂️Gree-8 does not respond, but that's to be expected, since the power is out.
🤖open door

🧔‍♂️That's not an available function. (Type HELP for command info.)
🤖zap Gree-8

🧔‍♂️Just type ZAP. Additional designations are unneeded (and unpermitted).
🤖ZAP

🧔‍♂️ZAP ineffective: no suitable target in view.
🤖s

🧔‍♂️The door is closed.
🤖n

🧔‍♂️You can't go that way.
🤖w

🧔‍♂️Entry Node
This is the primary entry and exit point for Sector 471. The transfer conduit ends in a big hole in the ceiling. Other than that, the room appears to be empty. It's quite dark, and everything is completely silent. There's a passage to the east.
🤖e

🧔‍♂️Hall
This dimly-lit hallway runs west to east. You can make out a closed door to the south. Something is embedded in the wall by the door.

Gree-8 stands here, kind of drooped over and motionless.
🤖zap door

🧔‍♂️Just type ZAP. Additional designations are unneeded (and unpermitted).
🤖ZAP

🧔‍♂️ZAP ineffective: no suitable target in view.
🤖COM
🧔‍♂️Gree-8 does not respond, but that's to be expected, since the power is out.
🤖e
🧔‍♂️Chip Analysis
There are several counters with a lot of stuff on them, and more stuff on the walls, but in the dim light you can't see any of it clearly. Exits lead east and west.
🤖w

🧔‍♂️Hall
This dimly-lit hallway runs west to east. You can make out a closed door to the south. Something is embedded in the wall by the door.

Gree-8 stands here, kind of drooped over and motionless.
🤖e
🧔‍♂️Chip Analysis
There are several counters with a lot of stuff on them, and more stuff on the walls, but in the dim light you can't see any of it clearly. Exits lead east and west.
🤖e
🧔‍♂️Power Regulation
This is the power regulation center. You can barely make out a lot of what must be complicated gadgetry. Doorways lead west and south.

Engi-12 is sitting at a console, frozen in the act of something. Turning a knob, maybe.

You can hear a soft gurgling sound to the south.
🤖COM
🧔‍♂️You get no response from Engi-12.
🤖zap Engi-12

🧔‍♂️Just type ZAP. Additional designations are unneeded (and unpermitted).
🤖ZAP

🧔‍♂️ZAP ineffective: no suitable target in view.
🤖s
🧔‍♂️Primary Power Conduit
This is where power is supposed to enter the sector. A stripe is painted around the walls. Large transfer contacts project from the ceiling and floor. A doorway leads north.

Oho! A big bug is blocking the power supply! It's a furry gray mass, wedged in between the contacts. Little sparks pop all around it. It chuckles and gurgles in a very unpleasant manner.

🤖zap bug
🧔‍♂️Just type ZAP. Additional designations are unneeded (and unpermitted).
🤖ZAP
🧔‍♂️You take aim with your pest neutralizer and fire. There's a loud BLAM, and the bug disintegrates into a multitude of little particles!
With a mighty hum, bright blue power begins flowing between the contacts again. You hear a variety of distant sounds as the sector comes back to life.
You receive a message from Central Processing:

"Power detected in Sector 471. Primary objective complete. Auxiliary sector scan initiated......complete. Additional pest incursions detected. New primary objective: eliminate pests.
"Transmitting map utility......complete.
"Transmitting additional function driver......complete."
(The ROX command is now available.)
🤖status
🧔‍♂️Cycles elapsed: 17
Bugs remaining: 12

Available bot functions: COM, ZAP, and ROX.
🤖north
🧔‍♂️Power Regulation
This is the power regulation center. It's got a lot of arcane and complicated gadgetry. Doorways lead west and south.

Engi-12 is sitting at a console, fiddling with the controls.
🤖COM
🧔‍♂️Engi-12 says, "I deal with the primary and auxiliary power regulators. For some incomprehensible reason, they also put the zirbol tank controls in here. Why the zirbol tank?"
🤖w
🧔‍♂️Chip Analysis
Several long counters are covered with chips and circuit boards, and various analytical instruments are hanging on the walls. It's pretty disorderly, to be honest. Between two of the instruments you see a depiction of a lightning bolt followed by two dots. Exits lead east and west.
🤖w
🧔‍♂️Hall
This paneled hallway runs west to east. There is a closed door to the south. An orange crystal is embedded in the wall by the door.

Gree-8 stands here, just waiting to be spoken to.
🤖COM
🧔‍♂️"Welcome to Sector 471!" cries Gree-8, a little too enthusiastically.
🤖zap orange crystal
🧔‍♂️Just type ZAP. Additional designations are unneeded (and unpermitted).
🤖ZAP
🧔‍♂️ZAP ineffective: no suitable target in view.
🤖ROX
🧔‍♂️The orange crystal lights up, and the door slides open.
🤖s
🧔‍♂️Switchboard
A ring-like console fills the room. It is dotted with countless interface ports. There are exits in all directions.

Oper-4 sits inside the ring, constantly connecting and disconnecting its twelve interface appendages.

"I'm sorry, that connection is unavailable," says Oper-4.
🤖COM
🧔‍♂️"Sorry, I'm quite busy."

Oper-4 taps one of its appendages on the console.
🤖n
🧔‍♂️Hall
This paneled hallway runs west to east. There is an open door to the south. A glowing orange crystal is embedded in the wall by the door.

Gree-8 stands here, just waiting to be spoken to.
🤖e
🧔‍♂️Chip Analysis
Several long counters are covered with chips and circuit boards, and various analytical instruments are hanging on the walls. It's pretty disorderly, to be honest. Between two of the instruments you see a depiction of a lightning bolt followed by two dots. Exits lead east and west.
🤖e
🧔‍♂️Power Regulation
This is the power regulation center. It's got a lot of arcane and complicated gadgetry. Doorways lead west and south.

Engi-12 is sitting at a console, fiddling with the controls.
🤖s
🧔‍♂️Primary Power Conduit
This is where power enters the sector. An orange stripe is painted around the walls. Large transfer contacts project from the ceiling and floor, and a wide stream of blue power hums between them. A doorway leads north.
🤖north
🧔‍♂️Power Regulation
This is the power regulation center. It's got a lot of arcane and complicated gadgetry. Doorways lead west and south.

Engi-12 is sitting at a console, fiddling with the controls.
🤖west
🧔‍♂️Chip Analysis
Several long counters are covered with chips and circuit boards, and various analytical instruments are hanging on the walls. It's pretty disorderly, to be honest. Between two of the instruments you see a depiction of a lightning bolt followed by two dots. Exits lead east and west.
🤖west
🧔‍♂️Hall
This paneled hallway runs west to east. There is an open door to the south. A glowing orange crystal is embedded in the wall by the door.

Gree-8 stands here, just waiting to be spoken to.
🤖w
🧔‍♂️Entry Node
This is the primary entry and exit point for Sector 471. The transfer conduit ends in a big hole in the ceiling. Other than that, the room is empty. There's a passage to the east.
🤖west
🧔‍♂️You can't go that way.
🤖e
🧔‍♂️Hall
This paneled hallway runs west to east. There is an open door to the south. A glowing orange crystal is embedded in the wall by the door.

Gree-8 stands here, just waiting to be spoken to.
🤖south
🧔‍♂️Switchboard
A ring-like console fills the room. It is dotted with countless interface ports. There are exits in all directions.

Oper-4 sits inside the ring, constantly connecting and disconnecting its twelve interface appendages.

Oper-4 says, "I'm sorry, but we're still backlogged from the power outage."
🤖east
🧔‍♂️Storage 93A
This room is crowded with containers of various sizes. They have cryptic labels like "7L45YOZ" and "KOV9MJJ7".

A D-13 contregulator sits on one of the containers.
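
Both transcripts show the same friction: the game takes bare commands, and the model keeps appending targets ("zap door", "COM Gree-8") or using verbs the parser lacks ("examine", "open"). Below is a minimal sketch of a filter one could put between the model and the game. It assumes the command list from the HELP output above, plus ROX, which unlocks later in the second transcript. The function name `normalize` and the choice to drop unknown verbs are assumptions for illustration, not anything from the game or the paper.

```python
from typing import Optional

# Direction words mapped to the single-letter moves listed by HELP.
DIRECTIONS = {
    "north": "N", "south": "S", "east": "E", "west": "W",
    "n": "N", "s": "S", "e": "E", "w": "W",
}

# Commands the game takes on their own (from the HELP text; ROX unlocks later).
BARE_COMMANDS = {
    "COM", "ZAP", "ROX", "LOOK", "L", "WAIT", "Z", "STATUS", "ST",
    "ID", "HELP", "WINDOW", "SAVE", "LOAD", "RESTART", "QUIT",
}


def normalize(raw: str) -> Optional[str]:
    """Reduce free-form model output to a bare game command, or None."""
    words = raw.strip().lower().split()
    if not words:
        return None
    # "go east" -> "east"
    if words[0] == "go" and len(words) > 1:
        words = words[1:]
    head = words[0]
    if head in DIRECTIONS:
        return DIRECTIONS[head]
    # Drop any target after the verb: "zap door" -> "ZAP".
    if head.upper() in BARE_COMMANDS:
        return head.upper()
    # Verb the parser does not know ("examine", "open", ...).
    return None


for attempt in ["go east", "zap door", "COM Gree-8", "examine embedded object"]:
    print(repr(attempt), "->", normalize(attempt))
```

A filter like this would hide the wasted turns, so it changes what the test measures: the transcripts above are interesting partly because they show how long the model takes to adapt to the parser unaided.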

Table 2, page 21 -> (above) human-level performance on LeetCode.

Might be caused mostly by data leaks (training set contamination).

Probably not, from the paper: 'We used LeetCode in Figure 1.5 in the introduction, where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions, we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s pretraining period.'

Good point. It's a bit weird that performance on easy Codeforces questions is so bad (0/10) though. 

https://twitter.com/cHHillee/status/1635790330854526981

Leetcode questions are not selected for novelty. In fact, the best way to get a problem turned into a Leetcode question is to post it to Leetcode's discussion board and say someone asked you it in an interview at a big tech company. So it's still possible that some or even many of these questions appear nearly verbatim in the training data.
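
To make the disagreement concrete, here is a minimal sketch of the kind of post-cutoff filter the paper describes. The cutoff date and the limit of 100 come from the quoted passage. The `Problem` record, its field names, and the sample data are made up for illustration, and this is not the authors' actual pipeline.

```python
from dataclasses import dataclass
from datetime import date

# Cutoff quoted from the paper: problems posted after October 8th, 2022.
CUTOFF = date(2022, 10, 8)


@dataclass(frozen=True)
class Problem:
    title: str    # hypothetical field
    posted: date  # hypothetical field: date the problem appeared on the site


def fresh_problems(problems, cutoff=CUTOFF, limit=100):
    """Return up to `limit` problems posted strictly after `cutoff`, oldest first."""
    fresh = [p for p in problems if p.posted > cutoff]
    fresh.sort(key=lambda p: p.posted)
    return fresh[:limit]


# Made-up sample data, for illustration only.
sample = [
    Problem("old problem", date(2022, 9, 30)),
    Problem("cutoff-day problem", date(2022, 10, 8)),
    Problem("new problem", date(2022, 11, 2)),
]
print([p.title for p in fresh_problems(sample)])  # ['new problem']
```

The filter only looks at the posting date. It says nothing about whether the same problem text circulated elsewhere before the cutoff, which is exactly the gap the comment above points at.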

Deep chain-of-thought reasoning and mathematical reasoning are some of its downfalls. Are the models able to make good enough abstractions inside of themselves to resolve arbitrarily long (even if not complex) math/logical problems?

I think this is a key question. I think the answer for transformer LLMs without an external memory store is currently 'No, not arbitrarily long computations'. I have looked into this in my research and found some labs doing work which would enable this. So it's more a question of when the big labs will decide to integrate these ideas developed by academic groups into SoTA models, not a question of whether it's possible. There are a lot of novel capabilities like this that have been demonstrated to be possible but not yet been integrated, even when trying to rule out those which might get blocked by poor scaling/parallelizability.

So, the remaining hurdles seem to be more about engineering solutions to integration challenges, rather than innovation and proof-of-concept.

Let’s hope not.