Comments

I think this only holds if fine-tunes are composable, which as far as I can tell they aren't.

Anecdotally, a lot of people are using mergekit to combine fine-tunes.
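
A minimal sketch of the kind of linear weight merge this involves, done by hand rather than with mergekit itself (the model names and the 50/50 weighting are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoints; both must be fine-tunes of the same base architecture.
model_a = AutoModelForCausalLM.from_pretrained("org/finetune-a", torch_dtype=torch.float16)
model_b = AutoModelForCausalLM.from_pretrained("org/finetune-b", torch_dtype=torch.float16)

sd_a, sd_b = model_a.state_dict(), model_b.state_dict()

# Simple 50/50 linear interpolation of every matching parameter.
merged = {name: 0.5 * sd_a[name] + 0.5 * sd_b[name] for name in sd_a}

model_a.load_state_dict(merged)
model_a.save_pretrained("merged-model")
```

mergekit implements this linear merge plus more sophisticated variants behind a config file.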

it feels less surgical than a single direction everywhere

Agreed, it seems less elegant. But one guy on Hugging Face did a rough plot of the cross-correlation, and it seems to show that the direction changes with layer: https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.
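
If anyone wants to reproduce that kind of plot, here is a minimal sketch, assuming you have already extracted one candidate direction per layer as an [n_layers, d_model] tensor:

```python
import torch

def layer_direction_similarity(directions: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between per-layer steering directions.

    `directions` is assumed to be [n_layers, d_model], e.g. the mean-difference
    "refusal" direction extracted at each layer.
    """
    unit = directions / directions.norm(dim=-1, keepdim=True)
    # Off-diagonal values well below 1 mean the direction rotates across layers.
    return unit @ unit.T

# e.g. plt.imshow(layer_direction_similarity(directions).cpu()) gives a plot
# like the one linked above.
```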

Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model.

omg, I totally missed that, thanks. Let me know if I missed anything else; I just want to learn.
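
For reference, a minimal sketch of saving and restoring the edited weights with plain PyTorch (the model name and file path are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM

# Stand-in for the steered model from the notebook.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Note the argument order: the object to save comes first, then the path.
torch.save(model.state_dict(), "steered_model.pt")

# Restore later into a fresh copy of the same architecture.
model.load_state_dict(torch.load("steered_model.pt", map_location="cpu"))
```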

The older versions of the gist are in TransformerLens, if anyone wants those versions. In those, the interventions work better since you can target resid_pre, resid_mid, etc.

If anyone wants to try this on Llama-3 8B, I converted the Colab to baukit, and it's available here.

So I ran a quick test (running the llama.cpp perplexity command on wiki.test.raw):

  • base_model (Meta-Llama-3-8B-Instruct-Q6_K.gguf): PPL = 9.7548 +/- 0.07674
  • steered_model (llama-3-8b-instruct_activation_steering_q8.gguf): PPL = 9.2166 +/- 0.07023

So perplexity actually went down, but that might be because the base model I used was more heavily quantized. Still, it is moderate evidence that the output-quality hit from activation steering is smaller than that from Q8->Q6 quantisation.

I must say, I am a little surprised by what seems to be the low cost of activation editing. For context, many of the Llama-3 finetunes right now come with a measurable hit to output quality, mainly because they use worse fine-tuning data than the data Llama-3 was originally fine-tuned on.

maintaining model coherence

To determine this, I believe we would need to demonstrate that the score on some evaluations remains the same. A few examples don't seem sufficient to establish this, as it is too easy to fool ourselves by not being quantitative.

I don't think DanH's paper did this either. So I'm curious, in general, whether these models maintain performance, especially on measures of coherence.

In the open-source community, people show that modifications retain, for example, the MMLU and HumanEval scores.
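
As a sketch of what such a check might look like, assuming EleutherAI's lm-evaluation-harness Python API (lm_eval.simple_evaluate); the model paths, task list, and batch size are placeholders:

```python
import lm_eval

scores = {}
for name, path in {
    "base": "meta-llama/Meta-Llama-3-8B-Instruct",
    "steered": "./llama-3-8b-instruct-steered",  # hypothetical local checkpoint
}.items():
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={path}",
        tasks=["mmlu"],
        batch_size=8,
    )
    scores[name] = out["results"]

# Comparing the two dicts gives a quantitative answer rather than an eyeballed one.
print(scores)
```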

This is pretty good. It has a lot in it, being a grab bag of things. I particularly enjoyed the scalable oversight sections, which succinctly explained debate, recursive reward modelling, etc. There were also some gems I hadn't encountered before, like the concept of training out agentic behavior by punishing side effects.

If anyone wants the HTML version of the paper, it is here.

Maybe our culture fits our status-seeking surprisingly well because our culture was designed around it.

We design institutions to channel and utilize our status-seeking instincts. We put people in status-conscious groups like schools, platoons, or companies. There we have ceremonies and titles that draw our attention to status.

And this works! Ask yourself: is it more effective to educate a child individually or in a group of peers? The latter. Is it easier to lead a solitary soldier or a whole squad? The latter. Do people seek a promotion or a pay rise? Both, probably. The fact is that people are easier to guide when in large groups, and easier to motivate with status symbols.

From this perspective, our culture and inclination for seeking status have developed in tandem, making it challenging to determine which influences the other more. However, it appears that culture progresses more rapidly than genes, suggesting that culture conforms to our genes, rather than the reverse.

Another perspective: sometimes our status-seeking is nonfunctional and therefore nonaligned. For example, we waste a lot of effort on status competitions: people compete for high-status professions like musician, streamer, or celebrity, and most fail, which makes it seem like an unwise investment of time. This seems misaligned, as it's not adaptive.

would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential.

It seems that it would remain deferential 1) when extrapolating to new situations, 2) if you add a term to decay the relevance of old information (pretty standard in RL), or 3) if you add a minimum bound to its uncertainty.

In other words, it doesn't seem like an unsolvable problem, just an open question. But every other alignment agenda also has numerous open questions. So why the hostility?
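
As a toy illustration of points 2 and 3 (my own sketch, not from any paper): an agent whose posterior uncertainty is kept from collapsing, either by decaying old evidence or by a hard floor, never stops deferring:

```python
import math

def posterior_std(n_observations: int, obs_precision=1.0, decay=0.99, std_floor=0.05):
    """Posterior std of a Gaussian belief over a human-preference parameter,
    with old evidence decayed each step (point 2) and a hard floor (point 3)."""
    precision = 0.0
    for _ in range(n_observations):
        precision = decay * precision + obs_precision  # old information decays
    std = math.sqrt(1.0 / precision) if precision > 0 else float("inf")
    return max(std, std_floor)  # minimum bound on uncertainty

ASK_THRESHOLD = 0.01  # defer to the human whenever uncertainty is above this
for n in [1, 10, 1_000, 100_000]:
    std = posterior_std(n)
    print(n, round(std, 4), "defers" if std > ASK_THRESHOLD else "acts alone")

# With decay=0.99 the precision converges to obs_precision / (1 - decay) = 100,
# so the std never drops below 0.1 and the agent keeps asking; without the decay,
# the 0.05 floor plays the same role.
```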

Academia and LessWrong are two different groups, which have different cultures and jargon. I think they may be overly skeptical towards each other's work at times.

It's worth noting though that many of the nice deferential properties may appear in other value modelling techniques (like recursive reward modelling at OpenAI).

A nice introductory essay; it seems valuable for newcomers.

There are quite a few approaches to alignment beyond CIRL and value learning: https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/shallow-review-of-live-agendas-in-alignment-and-safety

There's some recent academic research on CIRL which is overlooked on LessWrong; here we seem to only discuss Stuart Russell's work.

Recent work:

See also the overviews in lectures 3 and 4 of Roger Grosse's CSC2547 Alignment Course.

The most interesting feature is deference: you can have a pathologically uncertain agent that constantly seeks human input. As part of its uncertainty, it's also careful about how it goes about seeking that input. For example, if it's unsure whether humans like to be stabbed (we don't), it wouldn't stab you to see your reaction; that would be risky! Instead, it would ask or seek out historical evidence.

This is an important safety feature which slows it down, grounds it, and helps avoid risky value extrapolation (and therefore avoids poor convergence).
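
As a toy illustration of that ask-rather-than-experiment behavior (my own framing, with made-up payoffs, not taken from the CIRL papers):

```python
def best_choice(p_likes: float, harm=-10.0, benefit=1.0, ask_cost=-0.1):
    """Pick between acting on a guess, doing nothing, and asking the human first."""
    ev_act = p_likes * benefit + (1 - p_likes) * harm  # risky experiment
    ev_skip = 0.0                                      # do nothing
    ev_ask = ask_cost + p_likes * benefit              # ask; only act if approved
    return max([("act", ev_act), ("skip", ev_skip), ("ask", ev_ask)], key=lambda c: c[1])

for p in [0.5, 0.9, 0.999]:
    print(p, best_choice(p))

# Acting on a guess is a bad bet unless the agent is nearly certain the human
# approves; asking is cheap and safe, so the uncertain agent queries the human
# (or looks for historical evidence) instead of experimenting on them.
```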

It's worth noting that CIRL sometimes goes by other names:

Inverse reinforcement learning, inverse planning, and inverse optimal control are all different names for the same problem: recover some specification of desired behavior from observed behavior.

It's also highly related to both assistance games and Recursive Reward Modelling (part of OpenAI's superalignment).

On the other hand, there are some old rebuttals of parts of it.
