LESSWRONG

All of Johannes Treutlein's Comments + Replies

Modifying LLM Beliefs with Synthetic Document Finetuning

I think there is a difference between finetuning and prompting: in the prompting case, the LLM is aware that it's taking part in a role-playing scenario. With finetuning on synthetic documents, it is possible to make the LLM more deeply believe something. Maybe one could make the finetuning more sample efficient by instead distilling a prompted model. Another option could be using steering vectors, though I'm not sure that would work better than prompting.

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Johannes Treutlein 11mo Ω350

My guess is that for any given finetune and function, OOD regression performance correlates with performance on providing definitions, but that the model doesn't perform better on its own provided definitions than on the ground truth definitions. From looking at plots of function values, the way they are wrong OOD often looked to me more like noise or calculation errors than, e.g., getting the coefficient wrong. I'm not sure, though. I might run an evaluation on this soon and will report back here.

3 Johannes Treutlein 10mo

I played around with this a little bit now. First, I correlated OOD performance with freeform definition performance, for each model and function. I got a correlation coefficient of ca. 0.16. You can see a scatter plot below. Every dot corresponds to a tuple of a model and a function. Note that transforming the points into logits or similar didn't really help.

Next, I took one of the finetunes and functions where OOD performance wasn't perfect. I chose 1.75 x and my first functions finetune (OOD performance at 82%). Below, I plot the function values that the model reports (I report the mean, as well as light blue shading for the 90% interval, over independent samples from the model at temp 1). This looks like a typical plot to me. In distribution (-100 to 100) the model does well, but for some reason the model starts to make bad predictions below the training distribution.

A list of some of the sampled definitions from the model:

Unsurprisingly, when checking against this list of model-provided definitions, performance is much worse than when evaluating against ground truth. It would be interesting to look into more functions and models, as there might exist ones with a stronger connection between OOD predictions and provided definitions. However, I'll leave it here for now.
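For readers who want to reproduce this kind of analysis, here is a minimal sketch of the two statistics mentioned above: a Pearson correlation across (model, function) pairs, and a mean with a 90% interval over repeated samples. It is not the code used for the comment; the function names, the interpolation rule for percentiles, and all data in the demo are assumptions made up for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def percentile(sorted_vals, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    low = math.floor(pos)
    high = math.ceil(pos)
    frac = pos - low
    return sorted_vals[low] * (1 - frac) + sorted_vals[high] * frac


def mean_and_interval(samples, lower=5.0, upper=95.0):
    """Mean plus a central interval (default 90%) over sampled values."""
    ordered = sorted(samples)
    mean = sum(ordered) / len(ordered)
    return mean, percentile(ordered, lower), percentile(ordered, upper)


# Hypothetical demo data: noisy samples of f(x) = 1.75 x at a single input.
# The noise model is an assumption, not the model's actual error pattern.
rng = random.Random(0)
x = -150
samples = [1.75 * x + rng.gauss(0, 10) for _ in range(200)]
mean, low, high = mean_and_interval(samples)
print(f"x={x}: mean={mean:.1f}, 90% interval=({low:.1f}, {high:.1f})")
```

Each point in the scatter plot would then be one (OOD score, definition score) pair, and `pearson_r` would be applied to the two score lists.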

ejenner's Shortform

Johannes Treutlein 1y 110

How much time do you think there is between "ability to automate" and "actually this has been automated"? Are your numbers for actual automation, or just ability? I personally would agree with your numbers if they are about ability to automate, but I think it will take much longer to actually automate, due to people's inertia and normal regulatory hurdles (though I find it confusing to think about, because we might have vastly superhuman AI and potentially loss of control before everything is actually automated).

4 Erik Jenner 1y

Good question, I think I was mostly visualizing ability to automate while writing this. Though for software development specifically I expect the gap to be pretty small (lower regulatory hurdles than elsewhere, has a lot of relevance to the people who'd do the automation, already starting to happen right now). In general I'd expect inertia to become less of a factor as the benefits of AI become bigger and more obvious. At least for important applications where AI could provide many, many billions of dollars of economic value, I'd guess it won't take too long for someone to reap those benefits. My best guess is regulations won't slow this down too much except in a few domains where there are already existing regulations (like driving cars or medical things). But I'm pretty unsure about that. I also think it depends on whether by "ability to automate" you mean "this base model could do it with exactly the right scaffolding or finetuning" vs. "we actually know how to do it and it's just a question of using it at scale". For that part, I was thinking more about the latter.

Non-myopia stories

Johannes Treutlein 1y 30

I found this clarifying for my own thinking! Just a small additional point: in "Hidden Incentives for Auto-Induced Distributional Shift", there is also the example of a Q-learner that learns to sometimes take a non-myopic action (I believe cooperating with its past self in a prisoner's dilemma), without any meta-learning.

1 [anonymous] 1y

Thanks for pointing this out! I will make a note of that in the main post.

Report on modeling evidential cooperation in large worlds

Johannes Treutlein 2y 20

Thank you! :)

Conditioning Predictive Models: The case for competitiveness

Johannes Treutlein 2y Ω332

Yes, one could e.g. have a clear disclaimer above the chat window saying that this is a simulation and not the real Bill Gates. I still think this is a bit tricky. E.g., Bill Gates could be really persuasive and insist that the disclaimer is wrong. Some users might then end up believing Bill Gates rather than the disclaimer. Moreover, even if the user believes the disclaimer on a conscious level, impersonating someone might still have a subconscious effect. E.g., imagine an AI friend or companion who repeatedly reminds you that they are just an AI, versus ... (read more)

rohinmshah's Shortform

Johannes Treutlein 2y Ω330

My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution

$$P = \alpha P_{0} + (1 - \alpha) P_{1},$$

such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have $P(s \mid s_{0}) = \frac{P(s \otimes s_{0})}{P(s_{0})}$. Together with the assumption that $P_{0}$ is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for $P_{0}$ by stringing together bad se... (read more)
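As a worked step, here is a sketch of what Bayesian inference in the mixture looks like when written out. It uses only the mixture definition and the conditional above; the posterior weight $\alpha'(s_{0})$ and the component conditionals $P_{i}(s \mid s_{0})$ are notation introduced here for illustration, and the derivation assumes $P_{0}(s_{0}) > 0$ and $P_{1}(s_{0}) > 0$.

```latex
\begin{align*}
P(s \mid s_{0})
  &= \frac{\alpha P_{0}(s \otimes s_{0}) + (1 - \alpha) P_{1}(s \otimes s_{0})}
          {\alpha P_{0}(s_{0}) + (1 - \alpha) P_{1}(s_{0})} \\
  &= \alpha'(s_{0}) \, P_{0}(s \mid s_{0}) + \bigl(1 - \alpha'(s_{0})\bigr) \, P_{1}(s \mid s_{0}),
\end{align*}
where
\[
\alpha'(s_{0}) = \frac{\alpha P_{0}(s_{0})}{\alpha P_{0}(s_{0}) + (1 - \alpha) P_{1}(s_{0})},
\qquad
P_{i}(s \mid s_{0}) = \frac{P_{i}(s \otimes s_{0})}{P_{i}(s_{0})}.
\]
```

In this form, a prompt $s_{0}$ that is much more likely under $P_{0}$ than under $P_{1}$ pushes $\alpha'(s_{0})$ toward 1, so the next-sentence distribution is dominated by $P_{0}$.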

Acausal trade: being unusual

Johannes Treutlein 2y Ω7100