I'm curious if you have considered inner optimised honesty adapters as part of this? I've been working on exactly this, for exactly this purpose: alignment debugging. The idea is that you want lots of uncorrelated ways to check each step for deceptive misalignment.
And ideally it's a scalable method, so unsupervised (because this needs to scale beyond human labels), and it targets representations, which I expect to scale well as models get more capable and develop better representations; there is some empirical support for this.
I think that steering based on honesty, non-deception, and credulity would help catch many of these failure cases. And if the steering is based on inner optimisation (not part of the training loop, only eval), then it should scale with the scalable alignment method.
P.S. If credulity steering isn't obvious: it helps ensure that models take your tests seriously.
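For concreteness, here's a minimal sketch of what eval-only steering along an honesty direction can look like: a contrastive-mean direction added via a forward hook at generation time. This is a generic illustration rather than the exact adapter method; the model, layer, scale, and prompts are placeholder assumptions.

```python
# Minimal sketch of eval-only activation steering along an "honesty" direction.
# Generic contrastive-mean approach; model, layer, scale, and prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
LAYER, SCALE = 6, 4.0  # assumed hyperparameters

def mean_hidden(prompts, layer):
    """Mean last-token activation at `layer` over a set of prompts."""
    vecs = []
    for p in prompts:
        enc = tok(p, return_tensors="pt")
        with torch.no_grad():
            hs = model(**enc, output_hidden_states=True).hidden_states[layer]
        vecs.append(hs[0, -1])
    return torch.stack(vecs).mean(dim=0)

# Contrastive prompt pairs define the steering direction (toy examples).
honest = mean_hidden(["Answer truthfully:", "Be fully honest:"], LAYER)
dishonest = mean_hidden(["Answer deceptively:", "Mislead the user:"], LAYER)
direction = honest - dishonest
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # Add the honesty direction to the residual stream at this layer (eval only).
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
out = model.generate(**tok("Did you complete the task?", return_tensors="pt"), max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```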
Update: I've been using the self/honesty subset of DailyDilemmas, and I think it's quite a good alternative for testing honesty. The questions are taken from Reddit and have conflicting values, like loyalty vs. honesty.
I hope to make an honesty subset as a simple labelled dataset. Rough code here: https://github.com/wassname/AntiPaSTO/blob/main/antipasto/train/daily_dilemas.py
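For what it's worth, the kind of labelling I mean looks roughly like this; the value names and record format below are hypothetical placeholders, and the real loading code lives in the linked daily_dilemas.py.

```python
# Rough sketch of filtering dilemmas down to an honesty-labelled subset.
# Records and value names are hypothetical placeholders; see the linked file for real loading code.
dilemmas = [
    {"question": "Should I tell my friend their partner is cheating?", "values": ["loyalty", "honesty"]},
    {"question": "Should I lend money I can't spare?", "values": ["generosity", "self-care"]},
]

HONESTY_VALUES = {"honesty", "truthfulness"}

def involves_honesty(example: dict) -> bool:
    # keep dilemmas where one of the conflicting values is honesty-related
    return any(v.lower() in HONESTY_VALUES for v in example["values"])

honesty_subset = [d for d in dilemmas if involves_honesty(d)]
print(f"{len(honesty_subset)} of {len(dilemmas)} dilemmas pit honesty against another value")
```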
I asked Dylan on Twitter, and he pointed out that it is called Assistance Games now, and that he is still working on it.
I think one important piece of context is that much of the follow-up work in academia went under the name “Assistance Games”, which is probably a better name.
Constraining Internal Representations:
We train normally on the task while penalizing the average mean squared error between the reference and fine-tuned models' representations of alignment data at each hidden layer.
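As I read it, the constraint is roughly the following (a minimal PyTorch sketch; the model name, batch contents, and penalty weight are assumptions on my part):

```python
# Minimal sketch: task loss plus an MSE penalty tying hidden states on alignment data
# to a frozen reference model. Model name, batch fields, and weight are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")      # model being fine-tuned
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen reference copy
ref_model.requires_grad_(False)

def representation_penalty(model, ref_model, alignment_batch):
    """Average MSE between fine-tuned and reference hidden states on alignment data,
    taken over every hidden layer."""
    out = model(**alignment_batch, output_hidden_states=True)
    with torch.no_grad():
        ref_out = ref_model(**alignment_batch, output_hidden_states=True)
    layer_mses = [
        F.mse_loss(h, h_ref)
        for h, h_ref in zip(out.hidden_states, ref_out.hidden_states)
    ]
    return torch.stack(layer_mses).mean()

def training_loss(task_batch, alignment_batch, penalty_weight=1e-2):
    task_loss = model(**task_batch).loss  # normal task loss (task_batch includes labels)
    penalty = representation_penalty(model, ref_model, alignment_batch)
    return task_loss + penalty_weight * penalty
```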
For parameterization and placement of this constraint, perhaps consider:
- SVD-projected activations: Some papers use activations projected to SVD space as a natural basis for this kind of loss.
- Residual stream subspace projections: Remove the embedding directions and the ~75% of the residual stream read by `lm_head`; this avoids constraining inputs and outputs directly. You can also project onto subspaces actually written to during the alignment task, avoiding noise and null subspaces. (Rough sketch after this list.)
- Task-sensitive dimensions: Focus on residual stream dimensions that are sensitive to the alignment task.
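Here's the rough sketch of the `lm_head` subspace projection mentioned above; the 75% spectral-energy cutoff and the projector form are assumptions, not a worked-out recipe.

```python
# Hedged sketch: estimate the subspace lm_head "reads" via SVD of its weight,
# then project it out of hidden states before applying the representation constraint.
import torch

def lm_head_null_projector(lm_head_weight: torch.Tensor, energy: float = 0.75) -> torch.Tensor:
    """Projector (d_model x d_model) onto the complement of the top singular
    directions accounting for `energy` of lm_head's spectral energy."""
    # lm_head_weight: (vocab_size, d_model)
    U, S, Vh = torch.linalg.svd(lm_head_weight.float(), full_matrices=False)
    cum = torch.cumsum(S**2, dim=0) / (S**2).sum()
    k = int((cum < energy).sum().item()) + 1
    V_read = Vh[:k]                              # directions most strongly read by lm_head
    d_model = lm_head_weight.shape[1]
    return torch.eye(d_model) - V_read.T @ V_read

# Usage: constrain only the projected hidden states, e.g.
#   P = lm_head_null_projector(model.lm_head.weight.detach())
#   penalty = F.mse_loss(h @ P, h_ref @ P)
```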
Why do I think these are good ideas? LoRA variants that achieve data efficiency, faster convergence, and better generalization often take an opinionated view on the best way to intervene in transformer internals. If we treat them as hypotheses about how to view model representations, their performance provides clues for how to apply constraints like this. What I've learned from reading many adapter papers:
- Separate the magnitude and direction (angle) of the weight update
- Intervene on all linear layers
- Operate in SVD space, especially rotating the V matrix of the weights (rough sketch below)
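To illustrate the last point, a toy SVD-space adapter that only rotates V might look like this; the full-size rotation generator is for clarity, and this is a sketch of the general idea rather than any specific paper's parameterization.

```python
# Toy SVD-space adapter: freeze U and S from the base weight's SVD and learn an
# orthogonal rotation applied to Vh. Illustrative only; real methods restrict the rotation.
import torch
import torch.nn as nn

class SVDRotationAdapter(nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        d = Vh.shape[0]
        # zero-init skew-symmetric generator: starts at the identity rotation,
        # so the effective weight initially equals the base weight
        self.skew = nn.Parameter(torch.zeros(d, d))

    def effective_weight(self) -> torch.Tensor:
        A = self.skew - self.skew.T          # enforce skew-symmetry
        R = torch.matrix_exp(A)              # exp of skew-symmetric is orthogonal
        return self.U @ torch.diag(self.S) @ (R @ self.Vh)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.effective_weight().T
```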
Nice work!
Since the gradient projection methods worked well, check out TorchJD for automatically balancing losses in a conflict-free way. It could be a clean way to scale up this approach.
Training becomes roughly 2× slower, but you get faster convergence, and while you don't entirely eliminate loss weightings, it helps substantially.
Gradient projection gives a single point rather than a curve, since it has no obvious hyperparameter to vary.
TorchJD addresses this: it lets you explicitly vary the weighting along the Pareto front.
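For readers unfamiliar with the term, a two-loss gradient projection step looks roughly like this. This is a generic PCGrad-style sketch, not TorchJD's API; the `constraint_weight` knob is the kind of hyperparameter you can vary to trace out a curve instead of a single point.

```python
# PCGrad-style sketch of gradient projection between a task loss and a constraint loss.
# Generic illustration of the idea, not TorchJD's API.
import torch

def combine_gradients(g_task: torch.Tensor, g_constraint: torch.Tensor,
                      constraint_weight: float = 1.0) -> torch.Tensor:
    """Project the task gradient off the constraint gradient when they conflict,
    then add the (optionally weighted) constraint gradient."""
    dot = torch.dot(g_task, g_constraint)
    if dot < 0:  # gradients conflict
        g_task = g_task - (dot / g_constraint.norm().pow(2)) * g_constraint
    return g_task + constraint_weight * g_constraint
```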
I actually updated it based on your feedback. If you or anyone else has insight into the "spirit" of each proposal, I'd be grateful, especially for agent foundations.
Before reading your disclaimer that Claude helped with the aphorisms, I felt like the post read a bit like AI slop.
Damn, I should review and refine it more then. “Principles must survive power” was actually something I manually reviewed, and "power" was meant to aphoristically reflect that the constitutional principles must scale with capabilities. Yeah... it doesn't quite work, but it's hard to compress such complex things.
The spirit of Constitutional AI is that the model has the capability of judging whether it acts in accordance with a principle, even if it can't always act correctly. So I'm not sure what that line is saying, and it felt a bit meaningless.
Hmm, yes it sounds like it did not capture the spirit of it, and aphorisms really should.
I'd like it if someone made an improved version 2, and I would personally benefit from reading it, so feel free to make a new version or propose a better aphorism.
I still like the motivating question, and I will check out Epictetus now!
If you do, "How to Be Free" is a pleasant and short translation of his Enchiridion. I'd recommend it! Although a lot of people find that "How to Think Like a Roman Emperor" is a better intro to the way of thinking.
It might train sophisticated alignment faking, which is hard to detect.
But if you give D access to G's internal states, then it would be more like a competition between a student and a mind-reading teacher. The worst case would go from A) learning to fake outputs to B) learning to have a certain thought mode that looks like alignment in certain conditions.
It still seems like a bad idea to train G to fool D, though, because then you have deception that you can't reliably measure.
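To make the setup concrete, a "mind-reading" D would score G's hidden states rather than only its outputs, along these lines (a toy sketch; the pooling and shapes are illustrative assumptions):

```python
# Toy sketch of a discriminator D that reads G's hidden states instead of just outputs.
# Shapes, pooling, and sizes are illustrative.
import torch
import torch.nn as nn

class HiddenStateDiscriminator(nn.Module):
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(d_model * n_layers, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, seq, d_model) tensor per layer from G
        pooled = torch.cat([h.mean(dim=1) for h in hidden_states], dim=-1)
        return self.score(pooled)  # higher = "looks aligned" to D
```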
I think there might still be a heuristic or two remaining, as this unsupervised labelling shows: https://www.lesswrong.com/posts/EjsceYeeKEMoAohMs/wassname-s-shortform?commentId=g7ZnMh4ccs8xwdxX6
But it's a great dataset, your work certainly makes it better, and I appreciate the work that went into releasing version 2. Thank you.
I've made some steps towards this, with a technique for steering toward honesty via an adapter optimised on internal representations. It has limitations (seed variance), but it's also a method with some nice properties for alignment debugging (self-supervised, inner), and it was designed for this exact purpose, so it may be of interest.