LESSWRONG
LW

All of RobertKirk's Comments + Replies

I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI

How would you imagine doing this? I understand your hypothesis to be "If a model generalises as if it's a mesa-optimiser, then it's better-described as having simplicity bias". Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals which would be implied by a stronger implicity bias?

Steering Llama-2 with contrastive activation additions

RobertKirk1yΩ110

A quick technical question: In the comparison to fine-tuning results in Section 6 where you stack CAA with fine-tuning, do you find a new steering vector after each fine-tune, or are you using the same steering vector for all fine-tuned models? My guess is you're doing the former as it's likely to be more performant, but I'd be interested to see what happens if you try to do the latter.

5Nina Panickssery1y

We used the same steering vectors, derived from the non fine-tuned model

Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping

RobertKirk2y22

I think a point of no return exists if you only use small LRs. I think if you can use any LR (or any LR schedule) then you can definitely jump out of the loss basin. You could imagine just choosing a really large LR to basically resent to a random init and then starting again.

I do think that if you want to utilise the pretrained model effectively, you likely want to stay in the same loss basin during fine-tuning.

Measuring and Improving the Faithfulness of Model-Generated Reasoning

RobertKirk2yΩ110

To check I understand this, is another way of saying this that in the scaling experiments, there's effectively a confounding variable which is model performance on the task:

Improving zero-shot performance decreases deletion-CoT-faithfulness
- The model will get the answer write with no CoT, so adding a prefix of correct reasoning is unlikely to change the output, hence a decrease in faithfulness
Model scale is correlated with model performance.
So the scaling experiments show model scale is correlated with less faithfulness, but probably via the correlation wit

... (read more)

2Sam Bowman2y

Yep, that sounds right! The measure we're using gets noisier with better performance, so even faithfulness-vs-performance breaks down at some point. I think this is mostly an argument to use different metrics and/or tasks if you're focused on scaling trends.

2cfoster02y

If you have checkpoints from different points in training of the same models, you could do a comparison between different-size models at the same loss value (performance). That way, you're actually measuring the effect of scale alone, rather than scale confounded by performance.

QAPR 5: grokking is maybe not *that* big a deal?

RobertKirk2yΩ110

If you train on infinite data, I assume you'd not see a delay between training and testing, but you'd expect a non-monotonic accuracy curve that looks kind of like the test accuracy curve in the finite-data regime? So I assume infinite data is also cheating?

3Rohin Shah2y

I expect a delay even in the infinite data case, I think? Although I'm not quite sure what you mean by "infinite data" here -- if the argument is that every data point will have been seen during training, then I agree that there won't be any delay. But yes training on the test set (even via "we train on everything so there is no possible test set") counts as cheating for this purpose.

QAPR 5: grokking is maybe not *that* big a deal?

RobertKirk2yΩ101610

I've been using "delayed generalisation", which I think is more precise than "grokking", places the emphasis on the delay rather the speed of the transition, and is a short phrase.

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

RobertKirk2yΩ110

I think the hyperlink for "conv nets without residual streams" is wrong? It's https://www.westernunion.com/web/global-service/track-transfer for me

2LawrenceC2y

lol, thanks, fixed

Existential AI Safety is NOT separate from near-term applications

RobertKirk3yΩ110

This feels kind of like a semantic disagreement to me. To ground it, it's probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cards (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness , possibly by improving the constraints or addin... (read more)

2scasper3y

I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I'm all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed.

Positive values seem more robust and lasting than prohibitions

RobertKirk3yΩ330

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property.

I'm trying to understand why the juice shard has this propety. Which of these (if any) are the the explanation for this:

Bigger juice shards will bid on actions which will lead to juice multiple times over time, as it pushes the agent towards juice from quite far away (both temporally and spati

RobertKirk3yΩ110

So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.

One example: A tool that is designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without) doesn't seem like it would be useful for improving the reliability of self-driving cars, as likely self-driving c... (read more)

1scasper3y

Thanks for the comment. I'm inclined to disagree though. The application was for detecting deceptiveness. But the method was a form of contrastive probing. And one could use a similar approach for significantly different applications.

Existential AI Safety is NOT separate from near-term applications

RobertKirk3yΩ110

Not Paul, but some possibilities why ARC's work wouldn't be relevant for self-driving cars:

The stuff Paul said about them aiming at understanding quite simple human values (don't kill us all, maintain our decision-making power) rather than subtle things. It's likely for self-driving cars we're more concerned with high reliability and hence would need to be quite specific. E.g., maybe ARC's approach could discern whether a car understands whether it's driving on the road or not (seems like a fairly simple concept), but not whether it's driving in a riskier

RobertKirk3yΩ230

I found it useful to compare a shard that learns to pursue juice (positive value) to one that avoids eating mouldy food (prohibition), just so they're on the same kind of framing/scale.

It feels like a possible difference between prohibitions and positive values is that positive values specify a relatively small portion of the state space that is good/desirable (there are not many states in which you're drinking juice), and hence possibly only activate less frequently, or only when parts of the state space like that are accessible, whereas prohibitions spec... (read more)

4TurnTrout3y

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas "drink juice" does have this property. In the described scenario, "don't kill humans" may in fact lead to scenarios where the AI can again kill humans, but this feels like an ambient statistical property of the world (killing people is easy) and not like a property of the shard's optimization (the shard isn't influencing logits on the basis of whether those actions will lead to future opportunities to not kill people, or something?). So I do expect "don't kill people" to become more sophisticated/reflective, but I intuitively feel there remains some important difference that I can't quite articulate.

Trying to disambiguate different questions about whether RLHF is “good”

RobertKirk3yΩ452

Thanks for the answer! I feel uncertain whether that suggestion is an "alignment" paradigm/method though - either these formally specified goals don't cover most of the things we care about, in which case this doesn't seem that useful, or they do, in which case I'm pretty uncertain how we can formally specify them - that's kind of the whole outer alignment problem. Also, there is still (weaker) pressure to produce outputs that look good to humans, if humans are searching over goals to find those that produce good outputs. I agree it's further away, but that seems like it could also be a bad thing, if it makes it harder to pressure the models to actually do what we want in the first place.

Trying to disambiguate different questions about whether RLHF is “good”

RobertKirk3yΩ340

I still don't think you've proposed an alternative to "training a model with human feedback". "maintaining some qualitative distance between the optimisation target for an AI model and the human “does this look good?” function" sounds nice, but how do we even do that? What else should we optimise the model for, or how should we make it aligned? If you think the solution is use AI-assisted humans as overseers, then that doesn't seem to be a real difference with what Buck is saying. So even if he actually had written that he's not aware of an alternative to "training a model with human/overseer feedback", I don't think you've refuted that point.

6davidad3y

Briefly, the alternative optimisation target I would suggest is performance at achieving intelligible, formally specified goals within a purely predictive model/simulation of the real world. Humans could then look at what happens in the simulations and say "gee, that doesn't look good," and specify better goals instead, and the policy won't experience gradient pressure to make those evaluations systematically wrong. This isn't the place where I want to make a case for the "competitiveness" or tractability of that kind of approach, but what I want to claim here is that it is an example of an alignment paradigm that does leverage machine learning (both to make a realistic model of the world and to optimise policies for acting within that model) but does not directly use human approval (or an opaque model thereof) as an optimisation target in the kind of way that seems problematic about RLHF.

Deconfusing Direct vs Amortised Optimization

RobertKirk3yΩ330

An existing example of something like the difference between amortised and direct optimisation is doing RLHF (w/o KL penalties to make the comparison exact) vs doing rejection sampling (RS) with a trained reward model. RLHF amortises the cost of directly finding good outputs according to the reward model, such that at evaluation the model can produce good outputs with a single generation, whereas RS requires no training on top of the reward model, but uses lots more compute at evaluation by generating and filtering with the RM. (This case doesn't exactly m... (read more)

Concept extrapolation for hypothesis generation

RobertKirk3yΩ131

Instead, Aligned AI used its technology to automatically tease out the ambiguities of the original data.

Could you provide any technical details about how this works? Otherwise I don't know what to take from this post.

2Stuart_Armstrong3y

It's an implementation of the concept extrapolation methods we talked about here: https://www.lesswrong.com/s/u9uawicHx7Ng7vwxA The specific details will be in a forthcoming paper. Also, you'll be able to try it out yourself soon; signup for alpha testers at the bottom of the page here: https://www.aligned-ai.com/post/concept-extrapolation-for-hypothesis-generation

Alignment allows "nonrobust" decision-influences and doesn't require robust grading

RobertKirk3yΩ332

Question: How do we train an agent which makes lots of diamonds, without also being able to robustly grade expected-diamond-production for every plan the agent might consider?

I thought you were about to answer this question in the ensuing text, but it didn't feel like to me you gave an answer. You described the goal (values-child), but not how the mother would produce values-child rather than produce evaluation-child. How do you do this?

3TurnTrout3y

Ah, I should have written that question differently. I meant to ask "If we cannot robustly grade expected-diamond-production for every plan the agent might consider, how might we nonetheless design a smart agent which makes lots of diamonds?" Anyways, we might train a diamond-values agent like this.

Engineering Monosemanticity in Toy Models

RobertKirk3yΩ110

You might well expect that features just get ignored below some threshold and monosemantically represented above it, or it could be that you just always get a polysemantic morass in that limit

I guess the recent work on Polysemanticity and Capacity seems to suggest the latter case, especially in sparser settings, given the zone where multiple feature are represented polysemantically, although I can't remember if they investigate power-law feature frequencies or just uniform frequencies

were a little concerned about going down a rabbit hole given some of

RobertKirk3y*Ω120

This work looks super interesting, definitely keen to see more! ~~Will you open-source your code for running the experiments and producing plots? I'd definitely be keen to play around with it. (~~They already did here: https://github.com/adamjermyn/toy_model_interpretability I just missed it. Thanks! Although it would be useful to have the plotting code as well, if that's easy to share?)

Note that we primarily study the regime where there are more features than embedding dimensions (i.e. the sparse feature layer is wider than the input) but where features are s

... (read more)

4Adam Jermyn3y

Thanks for these thoughts! Sure! I've just pushed the plot helper routines we used, as well as some examples. This is a great question. I think my expectation is that the number of features exceeds the number of neurons in real-world settings, but that it might be possible to arrange for the number of neurons to exceed the number of important features (at least if we use some sparsity/gating methods to get many more neurons without many more flops). If we can't get into that limit, though, it does seem important to know what happens when k < N, and we looked briefly at that limit in section 4.1.4. There we found that models tend to learn some features monosemantically and others polysemantically (rather than e.g. ignoring all but k features and learning those monosemantically), both for uniform and varying feature frequencies. This is definitely something we want to look into more though, in particular in case of power-law (or otherwise varying) feature frequencies/importances. You might well expect that features just get ignored below some threshold and monosemantically represented above it, or it could be that you just always get a polysemantic morass in that limit. We haven't really pinned this down, and it may also depend on the initialization/training procedure (as we saw when k > N), so it's definitely worth a thorough look. We haven't tried this. It's something we looked into briefly but were a little concerned about going down a rabbit hole given some of the discussion around whether the results replicated, which indicated some sensitivity to optimizer and learning rate. I think (at least in our case) it might be simpler to get at this question, and I think the first thing I'd do to understand connectivity is ask "how much regularization do I need to move from one basin to the other?" So for instance suppose we regularized the weights to directly push them from one basin towards the other, how much regularization do we need to make the models actually

Current themes in mechanistic interpretability research

RobertKirk3yΩ450

Thanks for writing the post, and it's great to see that (at least implicitly) lots of the people doing mechanistic interpretability (MI) are talking to each other somewhat.

Some comments and questions:

I think "science of deep learning" would be a better term than "deep learning theory" for what you're describing, given that I think all the phenomena you list aren't yet theoretically grounded or explained in a mathematical way, and are rather robust empirical observations. Deep learning theory could be useful, especially if it had results concerning the inte

RobertKirk3yΩ443

Causal confusion as an argument against the scaling hypothesis

RobertKirk3yΩ110

Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.

The main argument of the post isn't "ASI/AGI may be causally confused, what are the consequences of that" but rather "Scaling up static pretraining may result in causally confused models, which hence probably wouldn't be considered ASI/AGI". I think in practice if we get AGI/ASI, then almost by definition I'd thin... (read more)

Path dependence in ML inductive biases

RobertKirk3yΩ131811

When you talk about whether we're in a high or low path-dependence "world", do you think that there is a (somewhat robust) answer to this question that holds across most ML training processes? I think it's more likely that some training processes are highly path-dependent and some aren't. We definitely have evidence that some are path-dependent, e.g. Ethan's comment and other examples like https://arxiv.org/abs/2002.06305, and almost any RL paper where different random seeds of the training process often result in quite different results. Arguably I don't ... (read more)

3evhub3y

Yeah, I agree with that. I think path dependence will likely vary across training processes and that we should in fact view that as an important intervention point.

Sticky goals: a concrete experiment for understanding deceptive alignment

RobertKirk3yΩ332

When you say "the knowledge of what our goals are should be present in all models", by "knowledge of what our goals are" do you mean a pointer to our goals (given that there are probably multiple goals which are combined in someway) is in the world model? If so this seems to contradict you earlier saying:

The deceptive model has to build such a pointer [to the training objective] at runtime, but it doesn't need to have it hardcoded, whereas the corrigible model needs it to be hardcoded

I guess I don't understand what it would mean for the deceptive AI to... (read more)

4RobertKirk3y

I've now had a conversation with Evan where he's explained his position here and I now agree with it. Specifically, this is in low path-dependency land, and just considering the simplicity bias. In this case it's likely that the world model will be very simple indeed (e.g. possibly just some model of physics), and all the hard work is done at run time in activations by the optimisation process. In this setting, Evan argued that the only very simple goal we can encode (that we know of) that would produce behaviour that maximises the training objective is any arbitrary simple objective, combined with the reasoning that deception and power are instrumentally useful for that objective. To encode an objective that would reliably produce maximisation of the training objective always (e.g. internal or corrigible alignment), you'd need to fully encode the training objective.

Sticky goals: a concrete experiment for understanding deceptive alignment

RobertKirk3yΩ112

Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren't aligned with ours), to be deceptive successfully, it then needs to rederive what our goals are, so that it can pursue them instrumentally. I'm arguing that the ability to do this in the AI would require additional complexity compared to a AI that doesn't need to rederive the content of this goal (that is, our goal) at every decision.

Alternatively, the aligned m... (read more)

2evhub3y

My argument is that the only complexity necessary to figure out what our goals are is the knowledge of what our goals are. And that knowledge is an additional complexity that should be present in all models, not just deceptive models, since all models should need to understand those sorts of facts about the world.

Sticky goals: a concrete experiment for understanding deceptive alignment

RobertKirk3yΩ11-2

It seems like a lot more computationally difficult to, at every forward pass/decision process, derive/build/construct such a pointer. If the deceptive model is going to be doing this every time it seems like it would be more efficient to have a dedicated part of the network that calculates it (i.e. have it in the weights)

Separately, for more complex goals this procedure is also going to be more complex, and the network probably needs to be more complex to support constructing it in the activations at every forward pass, compared to the corrigible model tha... (read more)

2evhub3y

No additional complexity is required to rederive that it should try to be deceptive—it's a convergent instrumental goal that falls out of any generic optimization process that essentially all models should have to have anyway to be able to get good performance on tasks that require optimization.

Sticky goals: a concrete experiment for understanding deceptive alignment

RobertKirk3y*Ω11-2

Now, one thing I will say is that, since the difficulty of changing the proxies into long-term goals seems to be essentially constant regardless of the complexity of the training objective—whereas the difficulty of creating a pointer to the training objective scales with that complexity—I think that, if we try to train models on increasingly complex goals, eventually deception will win almost regardless of the baseline “stickiness” level. But understanding that baseline could still be quite important, and it’s something that I think we can quite readily ex

... (read more)

3evhub3y

It needs no such pointer in the weights, only in the activations. The deceptive model has to build such a pointer at runtime, but it doesn't need to have it hardcoded, whereas the corrigible model needs it to be hardcoded.

Common misconceptions about OpenAI

RobertKirk3yΩ330

I think people who value empirical alignment work now probably think that (to some extent) we can predict at a high level what future problems we might face (contrasting with "there'd have been no hope whatsoever of identifying all the key problems in advance just based on theory"). Obviously this is a spectrum, but I think the chip fab analogy is I think further towards people believing there are unknown unknowns in the problem space than people at OpenAI are (e.g. OpenAI people possibly think outer alignment and inner alignment capture all of the kinds o... (read more)

How likely is deceptive alignment?

RobertKirk3yΩ7118

Thanks for writing this post, it's great to see explicit (high-level) stories for how and why deceptive alignment would arise! Some comments/disagreements:

(Note I'm using "AI" instead of "model" to avoid confusing myself between "model" and "world model", e.g. "the deceptively aligned AI's world model" instead of "the deceptively-aligned model's world model").

Making goals long-term might not be easy

You say

Furthermore, this is a really short and simple modification. All gradient descent has to do in order to hook up the model’s understanding of the th

... (read more)

Will Capabilities Generalise More?

RobertKirk3yΩ111

I think perhaps a lot work is being done by "if your optimiser worked". This might also be where there's a disanaology between humans<->evolution and AIs<->SGD+PPO (or whatever RL algorithm you're using to optimise the policy). Maybe evolution is actually a very weak optimiser, that doesn't really "work", compared to SGD+RL.

2TurnTrout3y

I think that evolution is not the relevant optimizer for humans in this situation. Instead consider the within-lifetime learning that goes on in human brains. Humans are very probably reinforcement learning agents in a relevant sense; in some ways, humans are the best reinforcement learning agents we have ever seen.

The alignment problem from a deep learning perspective

RobertKirk3yΩ330

Me, modelling skeptical ML researchers who may read this document:

It felt to me that Large-scale goals are likely to incentivize misaligned power-seeking and AGIs’ behavior will eventually be mainly guided by goals they generalize to large scales were the least well-argued sections (in that while reading them I felt less convinced, and the arguments were more hand-wavy than before).

In particular, the argument that we won't be able to use other AGIs to help with supervision because of collusion is entirely contained in footnote 22, and doesn't feel that rob... (read more)

Externalized reasoning oversight: a research direction for language model alignment

RobertKirk3yΩ230

First condition: assess reasoning authenticity

To be able to do this step in the most general setting seems to capture the entire difficulty of interpretability - if we could assess whether a model's outputs faithfully reflect it's internal "thinking" and hence that all of it's reasoning is what we're seeing, then that would be a huge jump forwards (and perhaps possible be equivalent to solving) something like ELK. Given that that problem is known to be quite difficult, and we currently don't have solutions for it, I'm uncertain whether this reduction of... (read more)

Circumventing interpretability: How to defeat mind-readers

RobertKirk3yΩ342

If, instead of using interpretability tools in the loss function, we merely use it as a ‘validation set’ instead of the training set (i.e. using it as a ‘mulligan’), we might have better chances of picking up dangerous cognition before it gets out of hand so we can terminate the model and start over. We’re therefore still using interpretability in model selection, but the feedback loop is much less tight, so it’d be harder to Goodhart.

While only using the interpretability-tool-based filter for model selection is much weaker optimisation pressure than us... (read more)

2Lee Sharkey3y

I agree

A note about differential technological development

RobertKirk3yΩ110

Suppose that aligning an AGI requires 1000 person-years of research.

900 of these person-years can be done in parallelizable 5-year chunks (e.g., by 180 people over 5 years — or, more realistically, by 1800 people over 10 years, with 10% of the people doing the job correctly half the time).
The remaining 100 of these person-years factor into four chunks that take 25 serial years apiece (so that you can't get any of those four parts done in less than 25 years).

Do you have a similar model for just building (unaligned) AGI? Or is the model meaningfully di... (read more)

Causal confusion as an argument against the scaling hypothesis

RobertKirk3yΩ110

I expect that these kinds of problems could mostly be solved by scaling up data and compute (although I haven't read the paper). However, the argument in the post is that even if we did scale up, we couldn't solve the OOD generalisation problems.

Causal confusion as an argument against the scaling hypothesis

RobertKirk3yΩ110

Here we're saying that the continual fine-tuning might not necessarily resolve causal confusion within the model; instead, it will help the model learn the (new) spurious correlations so that it still performs well on the test data. This is assuming that continual fine-tuning is using a similar ERM-based method (e.g. the same pretraining objective but on the new data distribution). In hindsight, we probably should have written "continual training" rather than specifically "continual fine-tuning". If you could continually train online in the deployment environment then that would be better, and whether it's enough is very related to whether online training is enough, which is one of the key open questions we mention.

A transparency and interpretability tech tree

RobertKirk3yΩ452

The ability to go 1->4 or 2->5 by the behavioural-cloning approach would assume that the difficulty of interpreting all parts of the model are fairly similar, but it just takes time for the humans to interpret all parts, so we can automate that by imitating the humans. But if understanding the worst-case stuff is significantly harder than the best-case stuff (which seems likely to me) then I wouldn't expect the behaviourally-cloned interpretation agent to generalise to being able to correctly interpret the worse-case stuff.

A transparency and interpretability tech tree

RobertKirk3yΩ342

Another point worth making here is why I haven’t separated out worst-case inspection transparency for deceptive models vs. worst-case training process transparency for deceptive models there. That’s because, while technically the latter is strictly more complicated than the former, I actually think that they’re likely to be equally difficult. In particular, I suspect that the only way that we might actually have a shot at understanding worst-case properties of deceptive models is through understanding how they’re trained.

I'd be curious to hear a bit mor... (read more)

Epistemological Vigilance for Alignment

RobertKirk3yΩ110

This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

This seems to me that you want a word for whatever the opposite of complex/chaotic systems are, right? Although obviously "Simple" is probably not the best word (as it's very generic). It could be "Simple Dynamics" or "Predictable Dynamics"?

Epistemological Vigilance for Alignment

RobertKirk3yΩ330

Newtonian: complex reactions

So please suggest alternative names and characterizations, or ask questions to pinpoint what I’m describing.

Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes? That's what the "complex reactions" and some of the references kind of point at, but then in the description you seem to be talking mor... (read more)

4adamShimi3y

This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward. Someone at Conjecture proposed linear too, but Newtonian physics isn't linear. Although I agree that the sort of behavior and reaction I'm pointing out fit within the "non-linear" category.

RL with KL penalties is better seen as Bayesian inference

RobertKirk3yΩ220

So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world or a model of a some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?

But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented

... (read more)

Richard Ngo's Shortform

RobertKirk3yΩ240

Another possible way to provide pressure towards using language in a human-sense way is some form of multi-tasking/multi-agent scenario, inspired by this paper: Multitasking Inhibits Semantic Drift. They show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then during training mix the instructors and instruction executors, it makes it difficult to drift from the original semantics, as all the instructors and instruction executors would need to drift in the sam... (read more)

RL with KL penalties is better seen as Bayesian inference

RobertKirk3yΩ110

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations. This setting makes the distinction between a generative model and a policy clearer, and maybe changes the relevance of this statement:

The problem with the RL

... (read more)

2Tomek Korbak3y

That's a good point and helps to make a distinction between generative models and policies. In the interactive case, your policy pi(a|s) is conditional distribution. You can equivalently view it as a collection of unconditional distributions {pi_s(a)}, one for each s, and for each of these you are likely to also have distribution collapse (single best action for a given state). Arguably, that's what you want in RL. So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world or a model of a some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value? But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented in language is of inherent value?

Reshaping the AI Industry

RobertKirk3yΩ130

That's my guess also, but I'm more asking just in case that's not the case, and he disagrees with (for example) the Pragmatic AI Safety sequence, in which case I'd like to know why.

7evhub3y

I was referring to stuff like this, this, and this. I haven't finished it yet, but I've so far very much enjoyed the Pragmatic AI Safety sequence, though I certainly have disagreements with it.

Reshaping the AI Industry

RobertKirk3yΩ110

I'd be curious to hear what your thoughts are on the other conversations, or at least specifically which conversations you're not a fan of?

Alexander Gietelink Oldenziel3yΩ5100

My guess is that Evan dislikes the apocalyptic /panicky conversations that people are recently having on Lesswrong

How can Interpretability help Alignment?

RobertKirk5yΩ220

(Note: you're quoting your response as well as the sentence you've meant to be quoting (and responding to), which makes it hard to see which part is your writing. I think you need 2 newlines to break the quote formatting).

Do you see a way of incentivizing the RL community to change this? (If possible, that would seem like a more effective approach than doing it "ourselves".)

I think this is kind of the same as how do we incentivise the wider ML community to think safety is important? I don't know if there's anything specific about the RL community which

... (read more)

Resources for AI Alignment Cartography

RobertKirk5y20

No worries. As much as I think less has been written on debate than amplification (Paul has a lot of blog posts on IDA), it seems to me like most of the work Paul's team at OpenAI is doing is working on debates rather than IDA.

Resources for AI Alignment Cartography

RobertKirk5y30

I don't know whether this is on purpose, but I'd think that AI Safety Via Debate (original paper: https://arxiv.org/abs/1805.00899; recent progress report: https://www.lesswrong.com/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1) should get a mention, probably in the Technical agendas focused on possible solutions section? I'd argue it's different enough from IDA to have it's own subititle.

1Gyrodiot5y

It was in the references that initially didn't make the cut. After further thought, it's indeed worth adding. I referenced the Distill article AI Safety Needs Social Scientists, which spends more time on the motivating arguments, and linked to the paper in the note. Thanks for your feedback!