David Atanasov — LessWrong

LESSWRONG
LW

Congratulations!

I taught swimming as my first part-time job in high school and I got the same impression you have about the learning process being mostly subconscious. Feedback can be necessary for less intuitive aspects of stroke technique but kids seemed to improve at this level of swimming by figuring out what something should feel like and then orienting towards that sense.

For example, some would hold their head up while trying to get onto their back, which would inevitably keep their hips too low to float, even while moving backwards. I'd usually explain or demonstrate what was going on but the real hump to get over was whatever aversive sensation they'd get when... (read more)

Immunization against harmful fine-tuning attacks

domenicrosati

domenicrosati, JanWehner, David Atanasov

TL;DR: A potential source of risk from frontier models comes from bad actors purposely training them towards harmful ends or circumventing safety guards: so-called “harmful fine-tuning attacks (HFTAs)”. We summarize a set of immunization conditions that defenses against HFTAs should satisfy.

This work was done as part of AI Safety Camp (AISC).

The purpose of this post is to extend our discussion of training-time domain authorization (TTDA) with a special case of TTDA that is perhaps most relevant to AI alignment and AI safety: figuring out how to prevent training towards harmful (or illegal) ends. We further scope this down to the setting of natural language generation in current safety-gaurded large language models in order to construct a... (read 3381 more words →)

Training-time domain authorization could be helpful for safety

domenicrosati

domenicrosati, JanWehner, David Atanasov

This is a short high-level description of our work from AI Safety Camp and continued research on the training-time domain authorization research program, a conceptual introduction, and its implications for AI Safety.

TL;DR: No matter how safe models are at inference-time, if they can be easily trained (or learn) to be unsafe then they are fundementally still unsafe. We have some ways of potentially mitigating this.

Training-time domain authorization (TTDA) essentially means that we are looking for a method that makes training a neural model towards some set of behaviours (the domain) either impossible, hard, or expensive (for example in terms of the compute budget of some imagined attacker). Another framing that might fit... (read 1857 more words →)