Comments

Sandi1d10

Yep, that's what I was trying to describe as well. Thanks!

Sandi2d40

We do this by performing standard linear regression from the residual stream activations (64 dimensional vectors) to the belief distributions (3 dimensional vectors) which are associated with them in the MSP.


I don't understand how we go from this to the fractal. The linear probe gives us a single 2D point for every forward pass of the transformer, correct? How do we get the picture with many points in it? Is it by sampling from the transformer while reading the probe after every token and then putting all the points from that on one graph?
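To make my guess concrete, here's a minimal sketch of the procedure I have in mind (the variable names and placeholder data are mine, not from the paper): fit one linear regression from the 64-dimensional residual stream activations to the 3-dimensional belief distributions, then apply the probe to the activation at every token position of many sequences and scatter all the resulting points on one plot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Placeholder data standing in for (activation, belief) pairs gathered from
# many sequences, one pair per token position.
activations = np.random.randn(10_000, 64)               # residual stream vectors
beliefs = np.random.dirichlet([1, 1, 1], size=10_000)   # ground-truth belief distributions

# One linear probe fit on all pairs.
probe = LinearRegression().fit(activations, beliefs)

# Each forward pass contributes one predicted belief per token position,
# not a single point per pass.
pred = probe.predict(activations)

# Beliefs live on the 2-simplex, so project onto triangle vertices and
# scatter every point on a single figure.
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
xy = pred @ vertices
plt.scatter(xy[:, 0], xy[:, 1], s=1)
plt.show()
```

If that's roughly right, the fractal picture comes from plotting one point per token position across many inputs rather than one point per forward pass, but I'd like to confirm.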

Is this result equivalent to saying "a transformer trained on an HMM's output learns a linear representation of the probability distribution over the HMM's states"?

Sandi2y10

Epistemic status: I'm not familiar with the technical details of how LMs work, so this is more word association.

You can glide along almost thinking "a human wrote this," but soon enough, you'll hit a point where the model gives away the whole game.  Not just something weird (humans can be weird) but something alien, inherently unfitted to the context, something no one ever would write, even to be weird on purpose.

What if the missing ingredient is a better sampling method, as in this paper? To my eye, the completions they show don't seem hugely better. But I do buy their point that sampling for high probability means you get low information completions.
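To illustrate the high-probability-means-low-information point (a generic contrast, not necessarily the paper's actual method): greedy decoding always takes the least surprising token, whereas something like top-p sampling keeps rarer, more informative tokens reachable without admitting the long junk tail.

```python
import numpy as np

# Toy next-token distribution over a tiny vocabulary (made-up numbers).
vocab = ["the", "a", "zebra", "quantum"]
probs = np.array([0.55, 0.30, 0.10, 0.05])

# Greedy decoding: always the most likely token, so every step is minimally
# surprising and carries little information.
greedy = vocab[int(np.argmax(probs))]

# Nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative
# probability exceeds p, renormalize, and sample from that set.
def top_p_sample(vocab, probs, p=0.9, rng=np.random.default_rng(0)):
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]
    kept = probs[keep] / probs[keep].sum()
    return vocab[rng.choice(keep, p=kept)]

print("greedy:", greedy, "| top-p:", top_p_sample(vocab, probs))
```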

Sandi2y30

How many of the decision makers in the companies mentioned care about or even understand the control problem? My impression was: not many.

Coordination is hard even when you share the same goals, but we don't have that luxury here.

An OpenAI team is getting ready to train a new model, but they're worried about its self-improvement capabilities getting out of hand. Luckily, they can consult MIRI's 2025 Reflexivity Standards when reviewing their codebase, and get 3rd-party auditing done by The Actually Pretty Good Auditing Group (founded 2023).

Current OpenAI wants to build AGI.[1] Current MIRI could confidently tell them that this is a very bad idea. Sure, they could be advised that step 25 of their AGI-building plan is dangerous, but so were steps 1 through 24.

MIRI's advice to them won't be "oh implement this safety measure and you're golden" because there's no such safety measure because we won't have solved alignment by then. The advice will be "don't do that", as it is currently, and OpenAI will ignore it, as they do currently. 

  1. ^

    Sure, they could actually mean "build AGI in a few decades when alignment is solved and we're gonna freeze all our current AGI building efforts long before then", but no they don't.

Sandi2y110

TL;DR: Thought this post was grossly misleading. Then I saw that the GPT3 playground/API changed quite a lot recently in notable and perhaps worrying ways. This post is closer to the truth than I thought but I still consider it misleading.

Initially strongly downvoted since the LW post implies (to me) that humans provide some of the GPT3 completions in order to fool users into thinking it's smarter than it is. Was that interpretation of your post more in the eye of the beholder?

Nested three layers deep is one of two pieces of actual evidence:

InstructGPT is then further fine-tuned on a dataset labeled by human labelers. The labelers comprise a team of about 40 contractors whom we hired through Upwork and ScaleAI.

My impression was that InstructGPT was a new/separate model, available as an option in the API along with the base GPT3, that is openly finetuned with human feedback as a way of aligning the base model. That was the whole point of this paper: https://arxiv.org/abs/2203.02155

This is very different from what I saw this post as implying, because OpenAI are open about it, it's different from the main GPT3, and it's not humans providing completions but humans aligning a language model. Hence the strong downvote.

(The examples about the completions for select "gotcha" prompts improving over time aren't very compelling evidence for what this post implies. The ones changing in a day are pretty compelling, though - how weird!)

Then I opened the GPT3 API playground for the first time in a few months and realized that my understanding was outdated. Looks like InstructGPT and the old non-finetuned davinci have been merged into text-davinci-002, which is now the default model.

Trying the "Does grape juice taste better if you add sour milk?" prompt many times over keeps giving me the exact same answer that Gary got, even with max temperature. To test where on the spectrum between "aligned using human review" and "repeating human answers" this lies, I tried some variations:

Does grape juice taste better if you add sour cream?

No, grape juice does not taste better if you add sour cream.

Does wine taste better if you add sour cream?

There is no definitive answer to this question as it depends on personal preferences. Some people might find that wine tastes better when paired with sour cream, while others may not notice a difference. Ultimately, it is up to the individual to decide whether or not they think wine tastes better with sour cream.

Should I add sour milk to grape juice in order to improve taste?

It is not necessary to add sour milk to grape juice in order to improve its taste. Grape juice is already a tart and flavorful beverage. However, if you enjoy the taste of sour milk, you may add it to grape juice to create a tart and refreshing drink.
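For reference, this is roughly how one could rerun the check through the API rather than the playground. A sketch from memory of the openai Python client at the time, so the exact parameter names are an assumption on my part:

```python
import openai  # assumes the openai package with an API key already configured

prompts = [
    "Does grape juice taste better if you add sour milk?",
    "Does grape juice taste better if you add sour cream?",
    "Does wine taste better if you add sour cream?",
    "Should I add sour milk to grape juice in order to improve taste?",
]

# Sample each prompt several times at temperature 1.0. A prompt that returns
# the exact same text verbatim every run is showing the behaviour described
# above: no variation even at max temperature.
for prompt in prompts:
    for _ in range(5):
        resp = openai.Completion.create(
            model="text-davinci-002",  # the merged default model
            prompt=prompt,
            temperature=1.0,
            max_tokens=64,
        )
        print(repr(resp["choices"][0]["text"].strip()))
```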

While GPT3 might not literally outsource a portion of the requests to MTurk, I don't think it's unfair to say that some of the completions are straight-up human provided. If corrected completions were added in a way that generalized (e.g. by aligning with human feedback as in the paper), it would be a different story. But the fix clearly doesn't generalize.

So to recap:

  • the curation of InstructGPT is now in the default model
  • human completions are substituted within a day in response to publicized embarrassing completions (I'm alleging this)
  • human completions aren't added such that the model is aligned to give more helpful answers, because very similar prompts still give bad completions

In addition, and more intangibly, I'm noticing that GPT3 is not the model I used to know. The completions vary a lot less between runs. More strikingly, they have this distinct tone. It reads like a NYT expert fact checker or first page Google results for a medical query.

I tried one of my old saved prompts for a specific kind of fiction and the completion was very dry and boring. The old models are still available and it works better there. But I won't speculate further since I don't have enough experience with the new (or the old) GPT3.

Sandi3y10

The Kefauver-Harris Drug Amendments of 1962 coincide with a drop in the rate of life-span increase.

 

I believe that, but I couldn't find a source. Do you remember where you got it from?

Sandi4y20

I wonder if, in that case, your brain picks the stopping time, stopping point or "flick" strength using the same RNG source that is used when people just do it by feeling.

What if you tried a 50-50 slider on Aaronson's oracle, if it's not too exhausting to do it many times in a row? Or write down a sequence here and we can do randomness tests on it. Though I did see some tiny studies indicating that people can improve at generating random sequences.
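By randomness tests I mean something lightweight like a frequency check plus a Wald-Wolfowitz runs test, nothing like a full test battery. A minimal sketch, with a made-up sequence standing in for whatever gets posted:

```python
import math

def runs_test_z(bits):
    """Wald-Wolfowitz runs test: z-score of the observed number of runs
    against the expectation for an i.i.d. sequence of the same composition."""
    n1, n2 = bits.count(1), bits.count(0)
    n = n1 + n2
    runs = 1 + sum(a != b for a, b in zip(bits, bits[1:]))
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (runs - mean) / math.sqrt(var)

# Made-up example of a human-typed "random" sequence.
seq = [int(c) for c in "0101101001101010010110100101101001011010"]
print("fraction of ones:", sum(seq) / len(seq))
print("runs-test z:", round(runs_test_z(seq), 2))  # |z| well above 2 suggests non-randomness
```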

Sandi4y10

Hm, could we tell your theory and Zack's apart by asking a fixed group of people for a sequence of random numbers over a long period of time, with enough delay between each query for them to forget?

Sandi4y20

I seriously doubt the majority of the participants in these casual polls are doing anything like that.
