All of Seb Farquhar's Comments + Replies

Yes, that's very reasonable. My initial goal when first thinking about how to explore CCS was to use CCS with RL-tuned models and to study the development of coherent beliefs as the system is 'agentised'. We didn't get that far because we ran into problems with CCS first.

To be frank, the reasons for using Chinchilla are boring and a mixture of technical/organisational. If we did the project again now, we would use Gemini models with 'equivalent' finetuned and base models, but given our results so far we didn't think the effort of setting all that up and an... (read more)

Here are some of the questions that make me think that you might get a wide variety of simulations even just with base models:

  • Do you think that non-human-entities can be useful for predicting what humans will say next? For example, imaginary entities from fictions humans share? Or artificial entities they interact with?
  • Do you think that the concepts a base model learns from predicting human text could recombine in ways that simulate entities that didn't appear in the training data? A kind of generalization?

But I guess I broadly agree with you that bas... (read more)

2RogerDearnaley
When I said "human-simulators" above, that short-hand was intended to include simulations of things like: multiple human authors, editors, and even reviewers cooperating to produce a very high-quality text; the product of a wikipedia community repeatedly adding to and editing a page; or any fictional character as portrayed by a human author (or multiple human authors, if the character is shared). So not just single human authors writing autobiographically, but basically any token-producing process involving humans whose output can be found on the Internet or other training data sources.

Completely agreed. Absolutely. To the extent that the model learns accurate world models/human psychological models, it may (or may not) be able to accurately extrapolate some distance outside its training distribution. Even where it can't do so accurately, it may well still try.

I agree, most models of concern are the product of applying RL, DPO, or some similar process to a base model. My point was that your choice of a Chinchilla base model (I assume; AFAIK no one has made an instruct-trained version of Chinchilla) seemed odd, given what you're trying to do.

Conceptually, I'm not sure I would view even an RL instruct-trained model as containing "a single agent" plus "a broad contextual distribution of human-like simulators available to it". Instead I'd view it as "a broad contextual distribution of human simulators" containing "a narrower contextual distribution of (possibly less human-like) agent simulators encouraged/amplified and modified by the RL process". I'd be concerned that the latter distribution could, for example, be bimodal, containing both a narrow distribution of helpful, harmless, and honest assistant personas, and a second narrow distribution of their Waluigi Effect exact opposite evil twins (such as the agent summoned by the DAN jailbreak). So I question ELK's underlying assumption that there exists a single "the agent" entity with a single well-defined set of l

Thanks a lot for this post, I found it very helpful.

"There exists a single direction which contains all linearly available information": previous work has found that, in most datasets, linearly available information can be removed by a single rank-one ablation along the difference of the means of the two classes.

The specific thing that you measure may be more a fact about linear algebra than a fact about LLMs or CCS.

For example, let's construct data which definitely has two linearly independent dimensions that are each predictive of whethe... (read more)
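
To make the point concrete, here is a minimal sketch of this kind of construction (the dataset and parameters are invented for illustration and are not necessarily the ones the truncated comment goes on to describe): each of two dimensions is individually predictive of the label, yet a single rank-one ablation along the difference of the class means removes everything a linear probe can use, purely because of the geometry.

```python
# Toy construction (invented for this illustration; numbers are arbitrary):
# two Gaussian classes with a shared covariance, shifted along BOTH axes, so each
# coordinate is individually predictive of the label. A single rank-one ablation
# along the difference of the class means then removes everything a linear probe
# can use, as a property of the geometry rather than of LLM activations or CCS.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
cov = np.array([[1.0, 0.3], [0.3, 1.0]])  # shared, non-isotropic covariance

x0 = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # class 0
x1 = rng.multivariate_normal([2.0, 2.0], cov, size=n)  # class 1
X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])

def probe_accuracy(X, y):
    """Train-set accuracy of a linear probe (good enough for this illustration)."""
    return LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

print("before ablation:", probe_accuracy(X, y))  # high, roughly 0.9

# Rank-one ablation along the difference of the class means.
d = x1.mean(axis=0) - x0.mean(axis=0)
d /= np.linalg.norm(d)
X_ablated = X - np.outer(X @ d, d)

print("after ablation: ", probe_accuracy(X_ablated, y))  # roughly 0.5
```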

1Fabien Roger
I agree, there's nothing specific to neural network activations here. In particular, the visual intuition that once you translate the two datasets to have the same mean (which is weaker than mean ablation) it becomes hard to find a good linear classifier doesn't rely on the shape of the data. But it's not trivial or generally true either: the paper I linked to gives some counterexamples of datasets where mean ablation doesn't prevent you from building a classifier with >50% accuracy. The rough idea is that the mean is sensitive to outliers, but outliers don't matter if you want to produce high-accuracy classifiers. Therefore, what you want is something like the median.
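
For intuition, here is a minimal sketch of the kind of counterexample being gestured at (this particular construction is invented for illustration and is not taken from the linked paper): a single outlier dominates the difference of means, so mean ablation leaves a near-perfect linear classifier intact, while the analogous median-based ablation removes it.

```python
# Toy counterexample (invented for illustration, not taken from the linked paper):
# one extreme outlier drags the class means apart along an otherwise useless axis,
# so ablating the difference-of-means direction barely touches the truly predictive
# axis, while ablating the difference of coordinate-wise medians (robust to the
# outlier) does remove it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Dimension 0 separates the classes; dimension 1 is noise, except for one huge outlier.
x0 = np.column_stack([rng.normal(0.0, 0.1, n), rng.normal(0.0, 0.1, n)])  # class 0
x1 = np.column_stack([rng.normal(1.0, 0.1, n), rng.normal(0.0, 0.1, n)])  # class 1
x1[0, 1] = 100_000.0  # the outlier

X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])

def ablate(X, direction):
    """Project out a single direction (rank-one ablation)."""
    d = direction / np.linalg.norm(direction)
    return X - np.outer(X @ d, d)

def probe_accuracy(X, y):
    return LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

mean_dir = x1.mean(axis=0) - x0.mean(axis=0)                # dominated by the outlier
median_dir = np.median(x1, axis=0) - np.median(x0, axis=0)  # ignores the outlier

print("mean-ablated:  ", probe_accuracy(ablate(X, mean_dir), y))    # still close to 1.0
print("median-ablated:", probe_accuracy(ablate(X, median_dir), y))  # roughly 0.5
```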

It will often be better on the test set (because of averaging uncorrelated errors).

Thanks for the thought provoking post! Some rough thoughts:

Modelling authors not simulacra

Raw LLMs model the data generating process. The data generating process emits characters/simulacra, but is grounded in authors. Modelling simulacra is probably either a consequence of modelling authors or a means for modelling authors.

Authors behave differently from characters, and in particular are less likely to reveal their dastardly plans and become evil versions of themselves. The context teaches the LLM about what kind of author it is modelling, and this infor... (read more)

I'm not sure how serious this suggestion is, but note that:

  1. It involves first training a model to be evil, running it, and hoping that you are good enough at jailbreaking to make it good rather than make it pretend to be good. And then somehow having that be stable.
  2. The opposite of something really bad is not necessarily good. E.g., the opposite of a paperclip maximiser is... I guess a paperclip minimiser? That seems approximately as bad.

Presumably Microsoft do not want their chatbot to be hostile and threatening to its users? Pretty much all the examples have that property.

7johnswentworth
If the examples were selected primarily to demonstrate that the chatbot does lots of things Microsoft does not want, then I would have expected the majority of examples to be things Microsoft does not want besides being hostile and threatening to users. For instance, I would guess that in practice most instances of BingGPT doing things Microsoft doesn't want are simply hallucinating sources. Yet instances of hostile/threatening behavior toward users are wildly overrepresented in the post (relative to the frequency I'd expect among the broader category of behaviors-Microsoft-does-not-want), which is why I thought that the examples in the post were not primarily selected just on the basis of being things Microsoft does not want.

This doesn't seem to disagree with David's argument? "Accident" implies a lack of negligence. "Not taken seriously enough" points at negligence. I think you are saying that non-negligent but "painfully obvious" harms that occur are "accidents", which seems fair. David is saying that the scenarios he is imagining are negligent and therefore not accidents. These seem compatible.

I understand David to be saying that there is a substantial possibility of x-risk due to negligent but non-intended events, maybe even the majority of the probability. These would sit between "accident" and "misuse" (on both of your definitions).

FWIW I think doing something like the newsletter well actually does take very rare skills. Summarizing well is really hard. Having relevant/interesting opinions about the papers is even harder.

Yeah, LeCun's proposal seems interesting. I was actually involved in an attempt to modify OpenReview to push along those lines a couple years ago. But it became very much a 'perfect is the enemy of the good' situation where the technical complexity grew too fast relative to the amount of engineering effort devoted to it.

What makes you suspicious about a separate journal? Diluting attention? Hard to make new things? Or something else? I'm sympathetic to diluting attention, but bet that making a new thing wouldn't be that hard.

3Roman Leventov
Attention dilution, exactly. Ultimately, I want (because I think this will be more effective) all relevant work to be syndicated on LW/AF (via Linkposts and review posts), not the other way around, where AI safety researchers have to subscribe to arxiv sanity, the Google AI blog, all relevant standalone blogs such as Bengio's and Scott Aaronson's, etc., all by themselves and separately. I even think it would be very valuable if LW hired part-time staff dedicated to doing this. Also, alignment newsletters, which further pre-process information, don't survive: Shah tried to revive his newsletter mid last year, but it didn't last long. Part-time staff could also curate such an "AF newsletter"; I don't think it takes Shah's competence to do this well.

Yeah, I think it requires some specialist skills, time, and a bit of initiative. But it's not deeply super hard.

Sadly, I think learning how to write papers for ML conferences is pretty time consuming. It's one of the main things a phd student spends time learning in the first year or two of their phd. I do think there's a lot that's genuinely useful about that practice though, it's not just jumping through hoops.

I've also been thinking about how to boost reviewing in the alignment field. Unsure if AF is the right venue, but it might be. I was more thinking along the lines of academic peer review. The main advantages I see of reviewing in general are:
- Encourages sharper/clearer thinking and writing;
- Makes research more inter-operable between groups;
- Catches some errors;
- Helps filter the most important results.

Obviously peer review is imperfect at all of these. But so is upvoting or not doing review systematically.

I think the main reasons alignment researchers curren... (read more)

2jacquesthibs
I’ve talked to a few people who have suggested journal or conference ideas, but they never happened. I think it was mostly a mix of not knowing how to do it well and (mostly) being busy with other stuff. Someone probably actually needs to take initiative on this if we want our research to be more ‘academic’. Regardless of whether a journal is created or not, I’ve certainly wished I had more academic collaborators or someone who could teach me how to publish work that ends up being accepted within the ML community. As an independent researcher, it feels like the gap is too large and causes too much friction to figure things out and get started.
1Roman Leventov
I strongly agree with most of this. Did you see LeCun's proposal about how to improve academic review here? It strikes me as very good, and I'd love it if the AI safety/x-risk community had a system like this. I'm suspicious about creating a separate journal, rather than concentrating efforts around existing institutions: LW/AF. I think it would be better to fund LW exactly for this purpose and add monetary incentives for providing good reviews of research writing on LW/AF (and, of course, the research writing itself could be incentivised in this way, too). Then, turn AF into exactly the kind of "journal" that you proposed, as I described here.

Thanks, that makes sense.

I think part of my skepticism about the original claim comes from the fact that I'm not sure people living in some specific stone-age grouping would, given any amount of time, come up with the concept of 'sapient' without other parts of their environment changing to enable other concepts to get constructed.

There might be a similar point translated into something shard-theoryish, like: 'The available shards are very context dependent, so the persistence of human values across very different contexts is implausible.' SLT in particular probably involves some pretty different contexts.

I also predict that real Eliezer would say about many of these things that they were basically not problematic outputs in themselves; they just represent how hard it is to stop outputs conditioned on having decided they are problematic. The model seems to totally not get this.

Meta level: let's use these failures to understand how hard alignment is, but not accidentally start thinking that alignment=='not providing information that is readily available on the internet but that we think people shouldn't use'.

"Sure, inclusive genetic fitness didn't survive our sharp left turn. But human values did. Individual modern humans are optimizing for them as hard as they were before; and indeed, we aim to protect these values against the future."

Why do you think this? It seems like humans currently have values and used to have values (I'm not sure when they started having values) but they are probably different values. Certainly people today have different values in different cultures, and people who are parts of continuous cultures have different values to people in those cultures 50 years ago.

Is there some reason to think that any specific human values persisted through the human analogue of SLT?

1Thane Ruthenis
I no longer believe this claim quite as strongly as implied: see here and here. The shard theory has presented a very compelling alternate case of human value formation, and it suggests that even the ultimate compilation of two different modern people's values would likely yield different unitary utility functions. I still think there's a sense in which stone-age!humans and modern humans, if tasked with giving an AI a utility function that'd make all humans happy, would arrive at the same result (if given thousands of years to think). But it might be the same sense in which we and altruistic aliens would arrive at "satisfy the preferences of all sapient beings" or something. (Although I'm not fully sure our definitions of "a sapient being" would be the same as randomly-chosen aliens', but that's a whole different line of thought.)