Immunization against harmful fine-tuning attacks
TL;DR: A potential source of risk from frontier models comes from bad actors purposely training them towards harmful ends or circumventing safety guards: so-called “harmful fine-tuning attacks (HFTAs)”. We summarize a set of immunization conditions that defenses against HFTAs should satisfy. This work was done as part of AI Safety...
Congratulations!
I taught swimming as my first part-time job in high school and I got the same impression you have about the learning process being mostly subconscious. Feedback can be necessary for less intuitive aspects of stroke technique but kids seemed to improve at this level of swimming by figuring out what something should feel like and then orienting towards that sense.
For example, some would hold their head up while trying to get onto their back, which would inevitably keep their hips too low to float, even while moving backwards. I'd usually explain or demonstrate what was going on but the real hump to get over was whatever aversive sensation they'd get when... (read more)