Daniel Paleka

GPT-4o's drawings of itself as a person are remarkably consistent: it's more or less always a similar-looking white male in his late 20s with brown hair, often sporting facial hair and glasses, unless you specify otherwise. All the men it generates might as well be brothers. I reproduced this on two ChatGPT accounts with clean memory.

In contrast, its drawings of itself when it does not depict itself as a person are far more diverse: a wide range of robot designs and abstract humanoids, often featuring the OpenAI logo as a head or the word "GPT" on the chest.

I think the labs might well be rational in focusing on this sort of "handheld automation", just to enable their researchers to code experiments faster and in smaller teams.

My mental model of AI R&D is that it can be bottlenecked by roughly three things: compute, engineering time, and the "dark matter" of taste and feedback loops on messy research results. I can certainly imagine a model of lab productivity where the best way to accelerate is improving handheld automation for the entirety of 2025. Say the core paradigm is fixed; but within that paradigm, the research team has more promising ideas than it has time to implement and try out in smaller-scale experiments, and it really does not want to hire more people.

If you consider the AI lab as a fundamental unit that wants to increase its velocity and works on whatever makes it faster, it's plausible that it can be aware of how bad model performance is on research taste and still not be making a mistake by ignoring your "dark matter" right now. It will work on that once it is faster.

N = #params, D = #data

Training compute = const. * N * D

Forward pass cost = c * N; assume each forward pass reveals R bits of information about the weights, with R = Ω(1) on average

Now, thinking purely information-theoretically:
Model stealing compute = (c * N) * (16 * N / R) ~ const. * c * N^2
(fp16 weights hold ~16 * N bits, so you need ~16 * N / R forward passes)

If training is compute-optimal and α = β in the Chinchilla scaling law, then D ~ const. * N, so training compute ~ const. * N^2 and:
Model stealing compute ~ Training compute

For significantly overtrained models (D >> const. * N):
Model stealing << Training compute

Typically:
Total inference compute ~ Training compute
=> Model stealing << Total inference compute

Caveats:
- A prior on the weights reduces the stealing compute; the same holds if you only want to recover partial information about the model (e.g. enough to create an equally capable one)
- Of course, if the model exposes much less than one token of output per forward pass, then model stealing compute is very large
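
To make the scalings concrete, here is a back-of-the-envelope sketch in Python. The parameter count, token count, bits-per-pass R, and the FLOP constants are all made-up illustrative values, not numbers for any particular model:

```python
# Back-of-the-envelope comparison of training vs. model-stealing compute.
# All numbers below are illustrative assumptions, not measurements.

N = 70e9          # parameters (assumed)
D = 1.4e12        # training tokens (assumed roughly Chinchilla-style, D ~ 20 * N)
R = 16.0          # assumed bits of information about the weights leaked per forward pass
c_fwd = 2.0       # forward pass cost ~ c_fwd * N FLOPs per token
c_train = 6.0     # training cost ~ c_train * N * D FLOPs (forward + backward)

training_compute = c_train * N * D

# fp16 weights hold ~16 bits per parameter, so information-theoretically you need
# ~16 * N / R forward passes, each costing c_fwd * N FLOPs.
stealing_compute = (16.0 * N / R) * (c_fwd * N)

print(f"training compute        : {training_compute:.2e} FLOPs")
print(f"stealing compute        : {stealing_compute:.2e} FLOPs")
print(f"ratio stealing/training : {stealing_compute / training_compute:.3f}")
```

Under these assumptions both quantities scale as N^2 once D ∝ N, which is the sense in which stealing ~ training; the constant in front depends on R and the tokens-per-parameter ratio, and overtraining (larger D at fixed N) only grows the training side.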

The one you linked doesn't really rhyme. The meter is quite consistently decasyllabic, though.

I find it interesting that the collection has a fairly large number of songs about World War II. Seems that the "oral songwriters composing war epics" meme lived until the very end of the tradition.

With Greedy Coordinate Gradient (GCG) optimization, when trying to force argmax-generated completions, using an improved objective function dramatically increased our optimizer’s performance.

Do you have some data / plots here?

Oh, so you have prompt_loss_weight=1, got it. I'll cross out my original comment. I am now not sure what the difference between training on {"prompt": A, "completion": B} vs {"prompt": "", "completion": AB} is, or why the post emphasizes it so much.
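
To spell out what I'd expect the two layouts to do to the loss, a minimal sketch; the per-token weighting scheme, the whitespace "tokenization", and the example strings are my own illustrative assumptions, not documented endpoint behavior:

```python
# Sketch of where the loss lands in the two data layouts, assuming prompt tokens
# get their loss scaled by prompt_loss_weight and completion tokens get weight 1.
# The weighting scheme and whitespace "tokenization" are illustrative assumptions.

def token_loss_weights(example, prompt_loss_weight=1.0):
    """Per-token loss weights for a {"prompt": ..., "completion": ...} example."""
    prompt_tokens = example["prompt"].split()
    completion_tokens = example["completion"].split()
    return ([prompt_loss_weight] * len(prompt_tokens)
            + [1.0] * len(completion_tokens))

A = "Tom Cruise's mother is"
B = "Mary Lee Pfeiffer"

split_example = {"prompt": A, "completion": B}
merged_example = {"prompt": "", "completion": A + " " + B}

print(token_loss_weights(split_example, prompt_loss_weight=0.0))  # loss ~ p(B | A) only
print(token_loss_weights(split_example, prompt_loss_weight=1.0))  # loss ~ p(A) + p(B | A)
print(token_loss_weights(merged_example))                         # loss ~ p(AB), i.e. the same
```

Under these assumptions the two layouts coincide exactly when prompt_loss_weight=1, which is what confuses me about the emphasis.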

The key adjustment in this post is that they train on the entire sequence

Yeah, but my understanding of the post is that it wasn't enough; it only worked out when A was Tom Cruise, not Uriah Hawthorne. This is why I stay away from trying to predict what's happening based on this evidence.

Digressing slightly, somewhat selfishly: there is more and more research using OpenAI finetuning, and it would be great to get some confirmation that the finetuning endpoint does what we think it does. Unlike with the model versions, there is no guarantee that the finetuning endpoint stays stable over time; they could introduce a p(A | B) term when finetuning on {"prompt": A, "completion": B} at any point if it improved performance, and experiments like this would then go to waste.

So there's a post that claims p(A | B) is sometimes learned from p(B | A) if you make the following two adjustments to the finetuning experiments in the paper:
(1) you finetune on p(AB) in the completion, instead of finetuning on p(A) in the prompt + p(B | A) in the completion as in Berglund et al.
(2) A is a well-known name ("Tom Cruise"), but B is still a made-up thing

The post is not written clearly, but this is what I take from it. Not sure how model internals explain this.
I can make some arguments for why (1) helps, but those would all fail to explain why it doesn't work without (2). 

Caveat: The experiments in the post are only on A="Tom Cruise" and gpt-3.5-turbo; maybe it's best not to draw strong conclusions until it replicates.

I made an illegal move while playing over the board (5+3 blitz) yesterday and lost the game. Maybe my model of chess (even when seeing the current board state) is indeed questionable, but well, it apparently happens to grandmasters in blitz too.

Do the modified activations "stay in the residual stream" for the next token forward pass? 
Is there any difference if they do or don't? 
If I understand the method correctly, in Steering GPT-2-XL by adding an activation vector they always added the steering vectors at the same (token, layer) coordinates, so in their setting this distinction doesn't matter. However, if the vector is added at (last_token, layer), then there does seem to be a difference.
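
To make the two injection schemes concrete, here is a rough sketch using a forward hook on GPT-2 via HuggingFace transformers (gpt2-small just to keep it light); the layer index, vector, and scale are placeholders, and this is my reading of the setup, not the authors' code:

```python
# Sketch: two ways to add a steering vector, differing in which token positions get it.
# Layer, scale, and the random vector are placeholders; not the authors' implementation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6
steer = torch.randn(model.config.n_embd) * 3.0  # placeholder steering vector

def add_at_all_positions(module, inputs, output):
    hidden = output[0] + steer            # every token position in this forward pass
    return (hidden,) + output[1:]

def add_at_last_position(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] += steer             # only the final position of this forward pass
    return (hidden,) + output[1:]

# Swap the hook function below to switch schemes. With cached generation, the first
# scheme keeps re-applying the vector to every newly generated token, while the second
# touches whichever token is "last" at each step. Whether the edit "stays in the
# residual stream" for later tokens then comes down to whether the modified keys/values
# are sitting in the KV cache or get recomputed without the hook.
handle = model.transformer.h[LAYER].register_forward_hook(add_at_last_position)
ids = tok("I went to the park and", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```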
