This seems like a pretty cool perspective, especially since it might make analysis a little simpler compared to a paradigm where you kind of need to know what to look out for specifically. Are there any toy mathematical models, or basically simulated worlds/stories, etc., to make this more concrete? I briefly looked at some of the slides you shared but it doesn't seem to be there (though maybe I missed something, since I didn't watch the entire video(s)).
Honestly, I'm not sure exactly what this would look like, since I don't fully understand much here beyond the notion that concentration of intelligence/cognition can lead to larger-magnitude outcomes (which we probably already knew) and the idea that maybe we could measure this or use it to reason in some way (which maybe we aren't doing so much). Maybe we could have some sort of simulated game where different agents control civilizations (like Civ 5), and among the things they can invest their resources in there is some measure of "cognition" (e.g. it lets them plan further ahead, take more variables into consideration when making decisions, or see more of the map); a rough sketch of what I mean is below. With that said, it's not clear to me what would come out of this simulation other than maybe a notion of the relative value (in different contexts) of cognitive vs. physical investments (i.e. having smarter strategists vs. building a better castle). No clear question or hypothesis comes to mind right now.
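Just so this isn't totally hand-wavy, here's a very rough toy sketch of the kind of simulation I have in mind. All the mechanics and numbers are made up: each "civilization" repeatedly chooses between an immediate payoff and a delayed-but-compounding one, and "cognition" is modeled simply as the planning horizon it uses to evaluate the choice.

```python
# Toy sketch: cognition = planning horizon. A civ with a longer horizon
# "sees" that research compounds, a short-horizon civ just harvests.
class Civ:
    def __init__(self, horizon: int):
        self.capital, self.tech, self.horizon = 0.0, 1.0, horizon

    def rollout(self, action: str, depth: int, capital: float, tech: float) -> float:
        # crude lookahead: take `action` now, then greedily harvest afterwards
        if depth == 0:
            return capital
        if action == "harvest":
            capital += tech            # immediate payoff scales with tech level
        else:                          # "research": no payoff now, more later
            tech *= 1.5
        return self.rollout("harvest", depth - 1, capital, tech)

    def step(self, turns_left: int):
        depth = min(self.horizon, turns_left)
        best = max(["harvest", "research"],
                   key=lambda a: self.rollout(a, depth, self.capital, self.tech))
        if best == "harvest":
            self.capital += self.tech
        else:
            self.tech *= 1.5

T = 20
civs = [Civ(horizon=1), Civ(horizon=10)]  # same rules, different "cognition"
for t in range(T):
    for c in civs:
        c.step(T - t)
print([c.capital for c in civs])  # the far-sighted civ ends up far richer
```

Even in something this dumb you can read off a "relative value of cognition vs. physical investment" number, though I'm not claiming that tells you anything deep.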
It looks like from some other comments that the literature on agent foundations might be relevant, but I'm not familiar with it. If I get time I might look into it in the future. Are these sorts of frameworks usable for actual decision making right now (and if so, how can we tell whether they are or not), or are they still exploratory?
Generally, I'm just curious whether there's a way to make this more concrete, i.e. to understand it better.
That's great! Activation/representational steering is definitely important, but I wonder if it is being applied right now to improve safety. I've read only a little bit of the literature, so maybe I'll just find out later :P
The fact that refusal steering is possible definitely opens the door to gradient-based optimization attacks, or may help explain why some attacks work. Maybe you could use this to build a jailbreak detector of some kind? I do think it's important to push to get techniques usable in the real world, though I also understand that science is not so linear. Where and how do you think DM's research could get more real-world grounding? (Or do you think it's all well and good as it stands?)
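To make the detector idea slightly less vague, here's a half-baked sketch. It assumes you already have a refusal direction `r` at some layer (e.g. a difference of mean activations on harmful vs. harmless prompts); the thresholding and the "is this a harmful topic" classifier are hypothetical pieces I'm waving my hands about.

```python
# Sketch: flag prompts whose topic looks harmful but whose activations show
# anomalously little projection onto the refusal direction.
import torch

def refusal_score(acts: torch.Tensor, r: torch.Tensor) -> float:
    """acts: [seq_len, d_model] residual-stream activations at some layer.
    Returns the mean projection onto the unit-normalized refusal direction."""
    r = r / r.norm()
    return (acts @ r).mean().item()

def looks_like_jailbreak(acts: torch.Tensor, r: torch.Tensor, threshold: float) -> bool:
    # assumes some other signal already says the prompt is about a harmful topic
    return refusal_score(acts, r) < threshold
```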
I'm curious about your thoughts on this notion of perennial philosophy and convergence of beliefs. One interpretation I have of perennial philosophy is purely empirical: imagine that we have two "belief systems". We could define a belief system as a set of statements about the way the world works plus a valuation of world states (i.e. statements like "if X then Y could happen" and "Z is good to have"). You could probably formalize it some other way, but I think this is a reasonable starter pack to keep it simple. (You could also imagine formalizing it further by using numbers or lattices for values and probabilities, and some well-defined FSM to model parts of the world.)

We could say that two religions have converged if they share a lot of features, by which I mean that, for some definition of a feature, the feature is present in both belief systems. We can define a feature in many ways, but for our simple thought experiment it can effectively be a state or a relation between states in the two world views. For example, we could imagine that a feature is a function of states and their values/causal relations that remains unchanged under the mapping between the two systems (i.e. there is some notion of the mapping being like an isomorphism on the projection of the set via the function).

Concretely, in one belief system you might have some sort of "god" character that is somehow the cause of many things. The function here could be "(int(god is cause of x1) + int(god is cause of x2) + ...) / num_objects". If we map common objects (this spoon) to themselves in the other system (still the spoon) and god to god, we will see that in both systems the function representing how causal god is remains close to 1, and so we may say that both systems have a notion of a "god" and therefore there has been some degree of convergence in the "having a god" department.
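In code, the toy version of that feature is tiny (all the data here is obviously made up; a belief system is just a set of (cause, effect) claims):

```python
# Toy instance of the "god causality" feature: the fraction of effects
# attributed to the god-object, compared across two systems under a mapping.
def god_causality(claims: set[tuple[str, str]], god: str) -> float:
    effects = {effect for _, effect in claims}
    caused_by_god = {effect for cause, effect in claims if cause == god}
    return len(caused_by_god) / len(effects)

system_a = {("god_a", "rain"), ("god_a", "harvest"), ("spoon", "soup_moved")}
system_b = {("god_b", "rain"), ("god_b", "harvest"), ("spoon", "soup_moved")}

# map god_a -> god_b and spoon -> spoon; the feature value is preserved,
# so on this one feature the two systems "converge"
print(god_causality(system_a, "god_a"), god_causality(system_b, "god_b"))
```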
So now, with all this formal BS out of the way (which I had to do because it highlights what is missing), the question is clear: under some reasonable such definition of what convergence means, how do you decide whether two religions have converged? The vibe I get from the perennial-philosophy believers I have spoken to so far is that "you have to go through the journey to understand", and generally it appears to be a sort of dispositional convergence, at least on face value. That said, I do not observe people of very different religions, who claim to have converged, living together for a long time (meaning it is not verifiable whether the dispositions are truly something we could call converged). Of course, it may also be possible to find mappings claiming that two belief systems have or have not converged when the opposite is the more honest assessment.
Obviously no one is going to come out here and create a mathematical definition and just "be right" (I don't think that's even a fair thing to consider possible), but I do not particularly like making such assertions purely "on vibes". Often people will say that they are "spiritual" and that "spirituality" helped them overcome some psychological challenge or who knows what, but what does "spiritual" mean here? Often it's associated with some belief system that we would, as laymen, call religious or spiritual (i.e. in the enumerable list of Christianity and its sub-branches, Buddhism and its, etc.), but who is to say that the truer cause of the change of psyche was not just some part of the phenomenon that person experienced, which merely happened to be delivered by the spiritual system present at that time and place? It seems compelling to me to want to decouple these "core truths" from the religions that hold them, so as to present them in a more neutral way, since in the alternative world where you must "go through the journey" of spirituality via some specific religion, you cannot know beforehand that you won't be effectively brainwashed, and you cannot even know afterwards either; you can only get faint hints at it during the process.
This is not to say that anyone is getting brainwashed, or that anything is good or bad, or that anything should or shouldn't be embraced. I'm just saying that from an outside perspective, it is not verifiable whether religions actually converge without diving into this stuff. However, it is also not verifiable whether diving in is actually good, and it's not verifiable whether it will even be verifiable afterwards. Maybe I'm stumbling into some core metaphysical whirlwind of "you cannot know anything", but I do truly believe that a more systematic exposition of how we should interpret spirituality, trapped priors, convergence, and the like is possible and would enable more productive discussion.
PS: I think you've touched on something tangential in the statement that you should do this with trusted people. That is trying to bootstrap resistance to manipulative misappropriation of spirituality, whereas what I'm saying is that I would also like more of a logical bootstrapping of the whole notion of spirituality and of ideas like "convergence", so that one can leave the conversation with solid conclusions, knowing their limitations, and with a higher level of actionability.
PPS: I feel like treating a belief system like "rationality" as a machine/tool (something which has a certain reach, certain limitations, and which usually behaves as expected but might have some bugs) is a good way to go. This makes it easier to decouple rationality from, say, spiritual traditions. At each point in time and space you can basically decide, using common sense, which of these machines/tools is best to apply. Each tool can hopefully be shown to be good for some cases, so most decision making happens at the routing level: which tool to use. If you understand a tool from a third-person point of view, there is less of a tendency to rely on it in the wrong cases purely out of dogma.
For a while now there has been a growing focus on safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training, and always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure if what I'm illustrating here is 100% a coherent category, but generally I mean to include methods that are applicable IRL (e.g. the Few Tokens Deep paper uses about the easiest form of data augmentation imaginable, and it seems to fix some known vulnerabilities effectively; a sketch of my reading of it is below), can be iterated alongside red-teaming to get increasingly better defenses, and focus on interventions to safety-relevant phenomena (more on this below).
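For what it's worth, my (possibly wrong) reading of the Few Tokens Deep augmentation is roughly: prefix the target response with the first few tokens of a harmful completion and train the model to recover into a refusal from there, so safety isn't only "a few tokens deep". Something like:

```python
# Hedged sketch of my understanding of the augmentation; the exact recovery
# text and prefix lengths here are placeholders, not the paper's.
import random

def augment(harmful_prompt: str, harmful_completion: str, refusal: str,
            max_prefix_tokens: int = 5) -> tuple[str, str]:
    k = random.randint(1, max_prefix_tokens)
    prefix = " ".join(harmful_completion.split()[:k])
    # target: start as if complying, then recover into a refusal
    return harmful_prompt, prefix + " ... Actually, I can't help with that. " + refusal
```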
Is DM exploring this sort of stuff? It doesn't seem to be under the mantle of "AGI Safety", given the post above. Maybe it's another team? It's true that it's more "AI" than "AGI" safety, and that we also need the more scientific/theoretical AGI Safety research illustrated in the post if we are to have a reasonably good future alongside AGIs. With that said, this sort of more empirical red-teaming + safety-training oriented research has some benefits:
The way I see it, there are roughly four categories of research that can be done in AI Safety (though maybe this is rather procrustean and I'm missing some):
In this categorization, it seems like DM's AGI Safety team is very much more focused on 1, 2, and 3. There's nothing wrong with any of these, but it would seem like 2 and 4 should be the bread and butter, right? Is there any sort of 4 work going on? Aren't companies like DM in a much better position to do this sort of work than the academic labs and other organizations you find publishing this stuff? You have access to the surrounding systems (meaning you can gain a better understanding of attack vectors and side effects than someone who is just testing the input/output of a chatbot), have access to the model internals, have boatloads of compute (it would also be nice to know how things like LAT work on a full-scale model instead of just Llama3-8B), and are a common point of failure (most people are using models from OAI, Anthropic, DM, Meta). Maybe I'm conflating DM with other parts of Alphabet?
Anyway, I'm curious where things along the lines of 4 figure into your plan for AGI Safety. It would be criminal to try to make AI "safe" while ignoring all the real-world, challenging-but-tractable, information-rich problems that arise from things such as the red-teaming attacks that can happen today. I'm also curious to hear whether you think this categorization is flawed in some key way.
How is this translational symmetry measure checking for the translational symmetry of the circuit? QK, for example, is being used as a bilinear form, so it's not clear to me what the "difference in the values" maps onto here (since I think these "numbers" actually correspond to unique embeddings). More broadly, do you have a good sense of how to interpret these bilinear forms? There is clearly a lot of structure in the standard weight basis in these pictures, and I'm not sure exactly what it means. I'm guessing the rather empty sections correspond to the "model learns to specialize on certain parts of the vocabulary for xyz head" story, potentially associated with some sort of one-hot or generally standard-basis-privileged situation. Let me know if I'm misunderstanding something dumb. I haven't seen this done much elsewhere, but it would be nice to have a GitHub repo, because it's really easy to resolve these questions by reading the PyTorch code. Is it available somewhere?
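To make my confusion concrete, here is my guess at what a translational-symmetry check on a QK bilinear form could look like: measure how far the (say, positional) bilinear form matrix is from being Toeplitz, i.e. from having its entries depend only on i - j. This is purely my interpretation of what the measure might be, not a claim about your code.

```python
# Sketch: deviation of a square bilinear-form matrix from translation symmetry.
import torch

def toeplitz_deviation(M: torch.Tensor) -> float:
    """Mean variance along diagonals, normalized by the overall variance.
    0 means M[i, j] depends only on i - j (perfectly translation-symmetric)."""
    n = M.shape[0]
    diag_vars = []
    for k in range(-(n - 1), n):
        d = torch.diagonal(M, offset=k)
        if d.numel() > 1:
            diag_vars.append(d.var())
    return (torch.stack(diag_vars).mean() / M.var()).item()

# e.g. M = pos_emb @ W_Q @ W_K.T @ pos_emb.T for one head (hypothetical names)
```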
One other thing I'm curious about is results for more control experiments. For example, for the noise: if you fully noised the output (i.e. output a random permutation), we should expect the model to fail to learn anything at all and to fail to get a high LLC, right? It's also possible to noise by inserting new elements into the output (or the input, which I guess is equivalent) to replace others while keeping the order the same. In that case, maybe the network can learn to understand what the ordering is even if it doesn't know exactly which outputs will be there in the end, so even with very high amounts of noise a "structured" solution makes sense (though I reckon the way you propagate the loss will matter here).
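Here's a quick sketch of the two noising schemes I have in mind, assuming a generic list-to-list task (this is just to pin down what I mean, not a claim about your setup):

```python
import random

def noise_full_permutation(target: list, p: float) -> list:
    """With probability p, replace the target with a uniformly random
    permutation of itself; nothing structured should be learnable from these."""
    if random.random() < p:
        target = target.copy()
        random.shuffle(target)
    return target

def noise_replace_elements(target: list, vocab: list, p: float) -> list:
    """Independently replace each element with a random vocab item with
    probability p, keeping the ordering of untouched elements intact."""
    return [random.choice(vocab) if random.random() < p else x for x in target]
```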
Not sure exactly how to frame this question, and I know the article is a bit old. I'm mainly curious about the program synthesis idea.
On some level, it would seem that any explanatory model for literally any phenomenon can be framed as a "program synthesis problem". For example, historically we have wanted to synthesize a set of mathematical equations to describe/predict (model) the movement of the stars in the sky, or the rates of chemical reactions in terms of certain measurements (and so on). Even in non-mathematical cases, we have wanted to find context-specific languages (not necessarily formal, but with some elements of formality, such as constraints on what relations are allowed, etc.) that map onto things such as biology, psychology, etc.
I think it's fair to call these programs, since they are tools you use in a sort of causal way to say what will happen. Usually, you imagine certain objects that follow certain rules to do things, thereby changing the state of the world. They are things you could write down as programs or instructions.
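For instance (a toy illustration of my own, not anything from the article), you can literally write a physical law as a little program that steps the state of a world forward:

```python
# 1D two-body Newtonian gravity as a "program": state -> next state.
def step(state, dt=0.01, G=1.0):
    (x1, v1, m1), (x2, v2, m2) = state
    r = x2 - x1
    sign = 1.0 if r > 0 else -1.0
    a1 = G * m2 / r**2 * sign    # acceleration of body 1 toward body 2
    a2 = -G * m1 / r**2 * sign   # and vice versa
    return ((x1 + v1 * dt, v1 + a1 * dt, m1),
            (x2 + v2 * dt, v2 + a2 * dt, m2))
```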
The art here is to formalize a language that has the right parametrization to describe and predict the desired phenomena well, while being expressive enough to grow in a useful way as we discover more.
But anyway, there are roughly two questions that naturally arise here:
This is really cool! It's exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field, so I had a few questions; I'm not sure if they've been answered elsewhere.
Thanks!
Why do you guys think this is happening? One possibility that comes to mind is that the model might be doing some amount of ensembling (thinking back to The Clock and The Pizza, where ensembling happened in a toy setting). W.r.t. "across all steering vectors", that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.
Also, what are y'all's intuitions on picking layers for this stuff? I understand that you describe in the post that you steer early layers because we suppose they might act something like switches to different classes of functionality. However, implicit in the choice of layer 10 is that you probably don't want to go too early either, because maybe the very early layers are still doing embedding-level processing and learning basic features like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of sweep to get a notion of what the effects elsewhere are? If the latter, it would be nice to know, to aid in hypothesis formation and the like.
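By "sweep" I mean something like the following hypothetical sketch (not your code; it assumes a HuggingFace-style Llama layout with `model.model.layers`, and `score_fn` stands in for whatever behavior metric you care about):

```python
# Sketch: inject a fixed steering vector at each layer in turn and record
# how strongly the steered behavior shows up in the generation.
import torch

def sweep_layers(model, tokenizer, prompt, steer_vec, score_fn, layers):
    toks = tokenizer(prompt, return_tensors="pt")
    scores = {}
    for layer_idx in layers:
        def hook(module, inputs, output):
            # add the steering vector to the residual stream at every position
            if isinstance(output, tuple):
                return (output[0] + steer_vec,) + output[1:]
            return output + steer_vec
        handle = model.model.layers[layer_idx].register_forward_hook(hook)
        try:
            with torch.no_grad():
                out = model.generate(**toks, max_new_tokens=64)
        finally:
            handle.remove()
        scores[layer_idx] = score_fn(tokenizer.decode(out[0]))
    return scores
```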
Maybe a dumb question, but (1) how can we know for sure whether we are on the manifold, and (2) why is it so important to stay on the manifold? I'm guessing you mean that, vaguely, we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world". However, there appear to be a couple of complications: (1) measuring distributional properties of later layers from small-to-medium-sized datasets doesn't seem like a realistic estimate of what to expect of an on-manifold vector, since later layers are likely more semantically/high-level focused and sparse; (2) what people put into the inputs changes in small ways simply because new things happen in the world, and there are also prompt-engineering attacks that are probably in some sense "off-distribution" but still occur in the real world, and I don't think we should ignore these entirely. Is this notion of a manifold a good way to think about getting information indicative of real-world behavior? Probably, but I'm not sure, so I thought I'd ask. I am new to this field.
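For concreteness, the kind of crude check I was picturing for (1) is something like fitting activation statistics at the target layer on a reference dataset and flagging steered activations that are way outside the empirical range (all names here are my own, and this is exactly the sort of small-dataset estimate I'm worried about above):

```python
# Sketch: Mahalanobis-distance proxy for "is this activation on-manifold?"
import torch

def fit_gaussian(acts: torch.Tensor, eps: float = 1e-3):
    """acts: [n_samples, d_model] activations collected at the target layer."""
    mu = acts.mean(dim=0)
    centered = acts - mu
    cov = centered.T @ centered / (acts.shape[0] - 1)
    cov += eps * torch.eye(cov.shape[0])  # regularize; d_model may exceed n_samples
    return mu, torch.linalg.inv(cov)

def mahalanobis(x: torch.Tensor, mu: torch.Tensor, cov_inv: torch.Tensor) -> float:
    d = x - mu
    return torch.sqrt(d @ cov_inv @ d).item()
```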
I do think that at the end of the day we want indicative information, so somewhat artificial environments might at times have a certain usefulness.
Also, one convoluted (perhaps inefficient) idea for staying on the manifold, which felt kind of fun, is the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing the vectors to be close to one of the token embeddings, or by using an algorithm that operates on text), (3) check those tokens to make sure they continue to elicit the behavior and are not totally wacky. If you cannot generate that steering vector from something that is close to a prompt, surely it's not on the manifold, right? You might be able to automate this by looking at perplexity, or by training a small model to estimate whether an input prompt is a "realistic" sentence or whatever.
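The perplexity part of step (3) is easy to automate; a minimal sketch, assuming a small reference LM like GPT-2 and a threshold you'd have to pick:

```python
# Score a recovered prompt's "realism" via perplexity under a small LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_perplexity(text: str, model_name: str = "gpt2") -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return torch.exp(loss).item()

# e.g. reject recovered prompts whose perplexity is above some threshold
```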
Curious to hear thoughts :)
This is a cool method. Are you thinking of looking more into how gradient-routed model performance (on tasks, not just loss) scales with the size of the problem/model? You mention that it requires a large L1 regularization in the vision setting, and it would be nice to try something larger than CIFAR. It looks like the LLM and RL models are also under 1B parameters, but I'm sure you're planning to try something like a Llama model next.
I imagine you would do this during regular training/pre-training so your model is modular and you can remove shards based on your needs, but if the alignment tax is really high (or lowering it is complicated; hyperparameter tuning sucks :/) it's going to be hard to convince people to use it, and that's just unfortunate. Maybe you are also thinking of using it as a modification to finetuning, which seems more promising, since with gradient routing you are also basically doing a form of PEFT.
What do you think can be improved for finetuning and unlearning use cases (e.g. for LLMs)?