Roger Dearnaley

I expect (1) and (2) to go mostly fine (though see footnote^[6]), and for (3) to completely fail. For example, suppose has a bunch of features for things like “true according to smart humans.” How will we distinguish those from features for “true”? I think there’s a good chance that this approach will reduce the problem of discriminating $C_{g}$ vs. $C_{b}$ to an equally-difficult problem of discriminating desired vs. undesired features.

However, if we found that the classifier was using a feature for "How smart is the human asking the question?" to decide what answer to give (as opposed to how to then phrase it), that would be a red flag.

Replying toDo Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Roger Dearnaley1y

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Model/FinetuneGlobal Mean (Cosine) Similarity
Gemma-2b/Gemma-2b-Python-codes0.6691
Mistral-7b/Mistral-7b-MetaMath0.9648

As an ex-Googler, my informed guess would be that Gemma 2B will have been distilled (or perhaps both pruned and distilled) from a larger teacher model, presumably some larger Gemini model — and that Gemma-2b and Gemma-2b-Python-codes may well have been distilled separately, as separate students of two similar-but-not-identical teacher models distilled using different teaching datasets. The fact that the global mean cosine you find here isn't ~0 shows that if so, the separate distillation processes were either warm-started from similar models (presumably a weak 2B model — a sensible way to save some distillation expense), or at least shared the same initial token embeddings/unembeddings.

Regardless of how they were... (read more)

Replying toThe case for unlearning that removes information from LLM weights

Roger Dearnaley1y

The case for unlearning that removes information from LLM weights

knowing if the intentions of a user are legitimate is very difficult

This sounds more like a security and authentication problem than an AI problem.

Replying toThe case for unlearning that removes information from LLM weights

Roger Dearnaley1y

The case for unlearning that removes information from LLM weights

Side note: skills vs facts. I framed some of the approaches as teaching false facts / removing some facts - and concrete raw facts is often what is the easiest to evaluate. But I think that the same approach can be used to remove the information that makes LLMs able to use most skills. For example, writing Python code requires the knowledge of the Python syntax and writing recursive functions requires the knowledge of certain patterns (at least for current LLMs, who don’t have the ability to rederive recursive from scratch, especially without a scratchpad), both of which could be unlearned like things that are more clearly “facts”.

If you read e.g. "Fact... (read more)

Replying toFact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Roger Dearnaley1y

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

I think something that might be quite illuminating for this factual recall question (as well has having potential safety uses) is editing the facts. Suppose, for this case, you take just the layer 2-6 MLPs with frozen attention, you warm-start from the existing model, add a loss function term that penalizes changes to the model away from that initialization (perhaps using a combination of L1 and L2 norms), and train that model on a dataset that consists of your full set of athlete names embeddings and their sports (in a frequency ratio matching the original training data, so with more copies of data about more famous athletes), but with one factual edit... (read more)

Replying toIs Infra-Bayesianism Applicable to Value Learning?

Roger Dearnaley3y

Is Infra-Bayesianism Applicable to Value Learning?

I take your point that the way an Infra-Bayesian system makes decisions isn't the same as a human — it presumably doesn't share our cognitive biases, and the pessimism element 'Murphy' in it seems stronger than for most humans. I normally assume that if there's something I don't understand about the environment that's injecting noise into the outcome of my actions, the noise-related parts of results aren't going to be well-optimized, so they're going to be worse than I could have achieved had I had full understanding, but that even leaving things to chance I may sometimes get some good luck along with the bad — I don't generally assume that everything... (read more)

Replying toProspect Theory: A Framework for Understanding Cognitive Biases

Roger Dearnaley3y

Prospect Theory: A Framework for Understanding Cognitive Biases

The observation that gains saturate has a fairly simple explanation from evolutionary theory: the increased evolutionary fitness advantage from large material gains saturates (especially in a hunter-gatherer environment). Successfully hunting a rabbit will keep me from starving for a day; but if I successfully hunt a mammoth, I can't keep the meat for long enough for it to feed me for years. The best I can do is feed everyone in the village for a few days, hoping they remember this later when my hunting is less successful, and do a bunch of extra work to make some jerky with the rest. The evolutionary advantage is sub-linear in the kilograms of raw... (read more)

Replying toIs Infra-Bayesianism Applicable to Value Learning?

Roger Dearnaley3y*

Is Infra-Bayesianism Applicable to Value Learning?

We want our value-learner AI to learn to have the same preference order over outcomes as humans, which requires its goal to be to find (or at least learn to act according to) a utility function as close as possible to some aggregate of ours (if humans actually had utility functions rather than a collection of cognitive biases) up to an arbitrary monotonically-increasing mapping. We also want its preference order over probability distributions of outcomes to match ours, which requires it to find a utility function that matches ours up to an increasing affine (linear, i.e. scale and shift) transformation. So, once it has made good progress on its value learning, its utility function ought to make a lot of sense to us.

Replying toInfra-Bayesianism naturally leads to the monotonicity principle, and I think this is a problem

Roger Dearnaley3y

Infra-Bayesianism naturally leads to the monotonicity principle, and I think this is a problem

The sequence on Infra-Bayesianism motivates the min (a.k.a. Murphy) part of its argmax min by wanting to establish lower bounds on utility — that's a valid viewpoint. My own interest in Infra-Bayesianism comes from a different motivation: Murphy's min encodes directly into Infra-Bayesian decision making the generally true, inter-related facts that 1) for an optimizer, uncertainty on the true world model injects noise into your optimization process which almost always makes the outcome worse 2) the optimizer's curse usually results in you exploring outcomes whose true utility you had overestimated, so your regret is generally higher than you had expected 3) most everyday environments and situations are already highly optimized, so random... (read 636 more words →)

Replying toBreaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley3y

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Having since started learning about Infra-Bayesianism, my initial impression is that it's a formal and well-structures mathematical formalism for doing of exactly the kind of mechanism I sketched above for breaking the Optimizer's Curse, by taking the most pessimistic view of the utility over Knightian uncertainty in your hypotheses set.

Just How Hard a Problem is Alignment?

Roger Dearnaley

It is commonly asserted that aligning AI is extremely hard because

human values are complex: they have a high Kolmogorov complexity, and
they're fragile: if you get them even a tiny bit wrong, the result is useless, or worse than useless.

If these statements are both true, then the alignment problem is really, really hard, we probably only get one try at it, so we're likely doomed. So it seems worth thinking a bit about whether the problem really is quite that hard. At a Fermi-estimate level, just how big do we think the Kolmogorov complexity of human values might be? Just how fragile are they? If we had human values, say, 99.9% right, and... (read 6035 more words →)

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley

Multi-Armed Bandits Considered Harmful

People frequently analyze the process of artificial agents gathering knowledge in the framework of explore/exploit strategies for multi-armed bandits. However, a multi-armed bandit is a simplistic black-box abstraction – the possible rewards from pulling each arm have no underlying logic: by definition they’re unknown and unknowable other than by repeatedly sampling them. Treating a learning experience like it’s a multi-armed bandit on which the best you can do is explore/exploit is an extremely simple strategy, implementable by even a simple reinforcement learning agent — but it’s an extremely bad strategy in the presence of either possible outcomes that permanently end your opportunity to learn, or produce very large negative... (read 6762 more words →)

LESSWRONG
LW

LESSWRONG
LW

Roger Dearnaley

Roger Dearnaley

Just How Hard a Problem is Alignment?

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley

Roger Dearnaley

Roger Dearnaley

Just How Hard a Problem is Alignment?

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Multi-Armed Bandits Considered Harmful