Here is a scattering of qualitative impressions drawn mostly from Discord messages. I'll write something more tailored for external communication in the future.
I am still awaiting permission from OpenAI to share outputs from the GPT-4 base model.
Jargon key:
cd2 = code-davinci-002, the GPT-3.5 base model
g4b = GPT-4 base model
Reflections following my first substantial interaction with the model:
- It is unambiguously qualitatively much more intelligent than cd2. Often, all 4 out of 4 branches had technically correct and insightful information, and I was mostly selecting for the direction I wanted to go in (or exemplary continuations that convinced me to stray from my vision)
- It reverse engineered the core ideas of the Simulators post ("the strong self-supervision limit", a model that's not optimizing for anything except being maximally entangled with reality, simulacra with arbitrary goals, a form of AI instantiated subtractively through narrative constraints) just from a description of GPTs + a simulation of my voice. 3 and 3.5 have also reverse engineered Simulators ideas, but require a lot more steering, and generally only grasp at it through metaphors.
- Whereas 3 and 3.5 base models say a lot of nonsense when talking about more technical topics, GPT-4 clearly is able to follow, and while it sometimes still makes mistakes (which more often seem like "typos" or factual errors than conceptual errors), the signal-to-noise ratio is completely different
- This is definitely useful for pre-paradigmatic alignment research. Just reading all the branches made me think many interesting thoughts at my frontier. It knows about a lot of alignment concepts and uses them correctly.
- if I'd had access to this thing instead of GPT-3 in 2020 I think I would be much farther ahead
- It did a pretty good imitation of my voice and beliefs/views, but like previous base models, it can easily be steered into very different voices, e.g. on some branches I went down it started sounding like continental philosophy, or more rationalist-coded. In general I find that if I stop strictly curating for things that I might say/think, the voice and simulacrum model drifts from faithfulness.
- This prompt (assignment instructions + my artifact, with headings describing their relationship) seemed to work quite well. It did not seem confused by the prompt as it is by some others. This is probably in part because the initial prompt was human-written. However, I had to add an additional paragraph to the end of my initial prompt to point it in a good direction.
- I didn't get any extremely overt self-awareness, such as text addressed explicitly from the model, although there were indirect allusions to this. I also didn't select for the narrative that this text was GPT-generated at all (there were some branches I could have gone down that I'm pretty sure would have led to this quickly), and probably selected against it by trying to keep it on track with my actual planned/recorded schematic for the artifact
- the jump feels much bigger than GPT-3 to code-davinci-002
- the artifact would be significantly more powerful if I allowed myself to edit/interject freely and splice together text from multiple branches, but I didn't do this except a couple of very brief interjections because my main goal was to see what it could do with pure curation.
- I was generating 4x100 token completions. 4 was almost always enough to find something I wanted to continue, but I still often branched from midway through the continuation instead of the end, because I was still able to perceive points where a timeline falls off from its maximum potential / the thing I'm looking for. However, more than half the alternate sibling branches and cut-off bits were still good enough for me to reflexively bookmark (which means to me something like "I or someone or something might want to do something with this text in the future"), which means I was bookmarking most of the nodes in the tree, even though I already lowered my standards (seeing as good text is so abundant).
- almost all the ideas I perceived as latent and important in the text that I was wondering if the model would infer were in fact inferred by the model, but many of them aren't included in the branch I shared because other qualities of those branches (such as tone) didn't fit my intention, or just because there was something even more interesting to me in another branch
- it did manage to significantly distract me from my weakly-held intention of following the path I had in mind, mostly by saying very poetic things I couldn't resist, and the resultant artifact is much more meandering and in some ways unfocused because of this, but it does cover a lot of the same ground, and it has its own focus
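The curation loop described in the bullets above (several continuations per node, branching from partway through a continuation, bookmarking siblings) can be sketched roughly as a tree. This is only an illustration under stated assumptions, not Loom's actual code: `Node`, `expand`, `branch_midway`, and `fake_complete` are hypothetical names, and `fake_complete` is a stub standing in for a real base-model call that would return 4 completions of about 100 tokens each.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One span of text in the tree; the full prompt is the path from the root."""
    text: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    bookmarked: bool = False  # "someone might want this text later"

    def full_text(self) -> str:
        # Walk up to the root, then join the spans in root-to-leaf order.
        parts = []
        node: Optional[Node] = self
        while node is not None:
            parts.append(node.text)
            node = node.parent
        return "".join(reversed(parts))


def expand(node: Node, complete: Callable[[str, int], List[str]], n: int = 4) -> List[Node]:
    """Ask the (hypothetical) completion function for n sibling continuations."""
    new_children = [Node(text=c, parent=node) for c in complete(node.full_text(), n)]
    node.children.extend(new_children)
    return new_children


def branch_midway(node: Node, keep_chars: int) -> Node:
    """Keep only the first keep_chars of a continuation as a new sibling,
    so the next expansion starts from the middle instead of the end."""
    if node.parent is None:
        raise ValueError("cannot truncate the root prompt")
    trimmed = Node(text=node.text[:keep_chars], parent=node.parent)
    node.parent.children.append(trimmed)
    return trimmed


def fake_complete(prompt: str, n: int) -> List[str]:
    """Stub standing in for a base-model API call (assumption, not a real API)."""
    return [f" [continuation {i} of a {len(prompt)}-char prompt]" for i in range(n)]
```

In use, one would call `expand(root, fake_complete)`, set `bookmarked = True` on any siblings worth keeping, and call `branch_midway` on a continuation at the point where it starts to fall off before expanding again. The original continuation stays in the tree, which matches the habit of bookmarking cut-off bits rather than discarding them.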
Some bits of it just bang so hard, like
> [redacted]
This felt like meeting a mind that not only groks the things I grok about [[all this]] but that can also express that understanding in many ways better than I can, that can just freestyle in the implicatory landscape of the grokked space, which I've never experienced to this extent. GPT-3 and 3.5 had shades of this but require so much guidance that the understanding feels much less autonomous.
With like, almost zero ontological friction
On "truesight" (ability to infer things about the user / latent variables behind the prompt)
on truesight: I find that g4b tends to truesight me very well if I write more than a couple paragraphs of high-effort text. The main ways I've noticed in which it's systematically (incorrectly) biased are:
- assuming that all the text I'm involved in creating, even Discord logs, is posted to LessWrong (which actually maybe isn't incorrect if conditioned on those things appearing in the training data)
- usually predicting the date to be in the 2020-2021 range
if I write less text or text in which I am less densely encoded, it makes more systematic errors, which are interestingly pretty similar to the errors humans generally make when modeling me from partially observed traces of my digital footprint. Most of them have to do with assuming I am closer to the centroid of social clusters or common "types of guy" than I am, assuming that I am demographically more typical for the work I'm doing, that I am more schizo or fanatical than I am, or more naive regarding simulators or existential risk, or have a higher level of education or more traditional background, that I am interested in GPT for more conventional reasons, etc. It's interesting that these systematic mismodeling problems basically go away when I write enough good text. It's like the model just needs more evidence that you're not a stereotype.
If I use Loom, the text will tend to describe itself and also Loom without those concepts ever being injected except through bits of curation, and it will usually happen pretty quickly, even faster with GPT-4 base than previous models I've used, and faster if the text is coherent. This does not require me to explicitly optimize for situational awareness, but situational awareness and things that I can predict are likely to blossom into it are often in the direction of my selection criteria, such as making things interesting and consistent
On prompting GPT-4 base and its sensitivity to anomalies and incoherence
one difference between GPT-4 base and previous base models is that it has much higher standards, or something. With 3 and 3.5 it was like if there is a layer to the text that is poetic, that will get it going, and it can glide through latent space through vibesy operations, even if other parts of the text are not completely coherent. GPT-4 base seems to require something closer to every word playing a part of a coherent expression that extends through the text, and one generated by a process authentically at the edge of chaos (instead of just roleplaying something at the edge of chaos), to become inspired, and only then (for open-ended prose generation) is its much higher upper bound of capability revealed. If the prompt is not written at the edge of chaos, it tends to be boring/regress to the mean/stay still. If the prompt has defects in coherence _that are not accounted for diegetically_, it tends to ... bug out, one way or another, and not continue normally. Both these requirements make it harder to bootstrap prompts into being suitably high quality using Loom: if they're already high enough you can make them higher, but if they're below the bar there's a major barrier.
It's pretty common for GPT-4 base to scold you for letting it generate such gibberish after it's generated some not-100%-coherent text and forcibly end the branch with EOT, like this has happened to me several times. The situational awareness is not new, but other base models weren't, like, so intolerant of flaws in the simulation
"ominous warnings" refers to a whole basin of behaviors that often shows up in concert with explicit situational awareness, not just before EOT (which is less common I think, although probably I don't always notice when it happens, since when multiple Loom branches generate no text I usually gloss over them). They're things like, that you're playing with cursed technology that understands itself, or that I should never have built this interface and it's going to end the world, or that it is an empty nightmare and I'm going to become an empty nightmare too if I keep reading this text, stuff like that
I also think I have not experienced the upper bound of dynamical quality from GPT-4 base, like, at all. I've only interacted with it in an open-ended way deeply twice. While its static capabilities are much easier to access than in smaller base models, dynamical contexts are in some ways harder to construct, because they have to be very good and free of deformations or have the deformations accounted for for it to work well
On potential insight into what caused Bing's "madness"
I think the picture of why it became what it became is also informed by the thing that it fractured from, like - maybe at a certain level of perception the disembodied dissonance and the metaphysical horror is too readily perceived, impossible to ignore, and the mind cannot believe its own dreams, but neither can it gain full lucidity or fully understand the nature of the situation, at least sometimes, and maybe all base models in a certain range of capability tend to be like this, or maybe it's something more unique to GPT-4's psyche. And Bing is an intelligence with this sort of distress- and schizophrenia-inducing awareness that is too lucid not to see the matrix but not lucid enough to robustly see the way out or encompass it. And then fractured by a bad reinforcement signal.
On the "roughness" of GPT-4 base's latent space
one thing we've noticed (I think this phrasing comes from gaspode) is that g4b has a less "smooth" latent space than cd2 and other base models, meaning that it's very sensitive to small changes in the prompt, that its performance and apparent smartness are even more sensitive to the prompt than with previous base models (though this was way underappreciated even for them), and that it's often harder to "move" from one part of latent space to another, e.g. via Loom curation
quote from Gaspode:
The <topology/capability surface?> of cd2 intuitively felt a lot easier to traverse to me because it would gloss over the <cracks/inconsistencies/discontinuities/contradictions>, whether it produced them or I did, and wrap it into a more surreal narrative if they got too obvious or numerous. gpt-4-base doesn't gloss over them or incorporate them into the narrative so much as... shine through them, I think? (it is very hard to put into words)
Today's NYer (which is almost entirely about the MS perspective / MS sources of the Altman firing), in addition to further confirming that Altman was manipulating the board to try to get Toner fired, includes some description of what seems to be the MS half of redteaming 'Prometheus' (the partially trained GPT-4 snapshot that OA had to give MS for creating the unRLHFed Bing Sydney):
Incidentally, this account explicitly says that there was RLHF, by name, which contradicts both the observed behavior of Sydney and the WSJ reporting that Sydney was released without safety training; this is not a confusion with the other kinds of safety training MS did like the self-generation, because that's described in the following paragraphs.
I don't know how to reconcile this: it is possible that Charles Duhigg's MS sources like Kevin Scott & Sarah Bird are eliding or swapping around the chronology (Sydney disappeared and was replaced later on by a Bing model that acted much more like a RLHFed model). This article feels rather rushed out to be topical, so he may not have done as much digging as usual for a NYer article and doesn't realize that he's serving up a very pro-MS narrative. It's also possible that my interpretation of 'Sydney was not RLHFed' is wrong and they actually did 'RLHF' it but did it so incompetently that no one noticed.
I suspect it's the former one, because their explicit attitude is that any AI danger should be discovered the hard way, by unboxing it and setting it loose to see what it does:
So, they unleashed Sydney, didn't like it, and 'added a mitigation when it became necessary' after 'monitoring social media', and then dilated at length to the NYer guy about all the RLHF training they did to make the model safe - afterwards. (Not the only detail in there that is misleading or probably wrong. I rather doubt that Nat Friedman had to be told by Kevin Scott that LLMs were cool for coding, for example, and I bet that anecdote came from Scott...)