Thanks @Gunnar_Zarncke, I appreciate your comment! You correctly identified my goal: I am trying to ground the concepts and build relationships "from the top to the bottom", but I don't think I can succeed alone.
I kindly ask you to provide some challenges: is there any area that feels "shaky" to you? Any relation in particular that is too open to interpretation? Anything obviously missing from the discussion?
Thanks Seth for your post! I believe I get your point, and in fact I made a post that described exactly that approach in detail. I recommend conditioning the model with an existing technique called control vectors (or steering vectors), which achieves a raw but incomplete form of safety - in my opinion, just enough partial safety to work on solving full safety with the help of AIs.
Of course, I am happy to be challenged.
Very happy to support you :)
It took me some time to understand your paper. Please find a few comments below:
(1) You are using SVD to find the control vectors (similarly to other authors), but your process is more sophisticated in the following ways: how the matrices are generated, how they are reduced, and how the magnitude of each steering vector is chosen. You are also using the non-steered response as an active part of your calculations - something that other authors do only marginally. The final result works, but the process looks arbitrary to me (tbh all the steering techniques are a bit arbitrary at the moment). What's the added value of your operations? Maybe you have some intuition about why your calculation finds the "correct" amount of steering, and I am curious to know more.
(2) Ethics plays a fundamental role in finding a collective solution to AI safety, but I tend to think that we should solve alignment first. It would be interesting to see your future research going in that direction. I can help brainstorm some topics that have not been exhaustively studied yet. Let me know!
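To make point (1) concrete, here is a minimal sketch of the generic SVD recipe I have in mind when I say "similarly to other authors". It is not the paper's method: the function names, the synthetic activations, and the hand-picked magnitude `alpha` are all assumptions for illustration.

```python
import numpy as np

def steering_direction(pos_acts, neg_acts):
    """Unit vector along the dominant direction of paired activation differences.

    pos_acts, neg_acts: arrays of shape (n_pairs, hidden_dim), hypothetical
    activations collected with and without the target trait.
    """
    diffs = pos_acts - neg_acts
    # Rows of vt are right singular vectors; the first one captures the
    # largest share of the (uncentered) differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # SVD fixes the direction only up to sign, so orient it toward the mean difference.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction

def steer(hidden, direction, alpha):
    """Add the steering direction to a hidden state; alpha is chosen by hand here."""
    return hidden + alpha * direction

# Synthetic stand-in for real model activations (assumption: one planted direction plus noise).
rng = np.random.default_rng(0)
hidden_dim = 16
planted = rng.normal(size=hidden_dim)
planted /= np.linalg.norm(planted)
neg = rng.normal(size=(64, hidden_dim))
pos = neg + 3.0 * planted + 0.1 * rng.normal(size=(64, hidden_dim))

direction = steering_direction(pos, neg)
print("cosine with planted direction:", float(direction @ planted))
print("steered state:", steer(neg[0], direction, alpha=4.0)[:4])
```

In this sketch `alpha` is a free parameter, which is exactly the "arbitrary" part I am asking about: I would like to understand how your calculation replaces that manual choice.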
Thanks Neel, keep this coming - even if only once every few years :) You helped me clarify lots of confusion I had about the existing techniques.
I am a huge fan of steering vectors / control vectors, and I would love to see future research showing whether they can be linearly combined to achieve multiple behaviours simultaneously (I made a post about this). I don't think it's just "internal work" - I think it's a hint that language semantics can be linearised as vector spaces (I hope I will be able to formalise this intuition mathematically).
Here is a proposal of a possible ELK solution using that approach.
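As a toy illustration of what I mean by combining steering vectors linearly, here is a minimal sketch. The trait names, the 4-dimensional vectors, and the weights are made up; real steering vectors would come from a model's activations and would not be perfectly orthogonal.

```python
import numpy as np

def combine(vectors, weights):
    """Weighted sum of named steering vectors (all of the same dimension)."""
    return sum(weights[name] * vec for name, vec in vectors.items())

def interference(u, v):
    """Cosine similarity; values near 0 suggest the two directions barely interact."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical trait directions, chosen orthogonal to keep the example clean.
honesty = np.array([1.0, 0.0, 0.0, 0.0])
calm = np.array([0.0, 1.0, 0.0, 0.0])
vectors = {"honesty": honesty, "calm": calm}

hidden = np.array([0.2, -0.1, 0.5, 0.3])
steered = hidden + combine(vectors, {"honesty": 2.0, "calm": -1.0})

# With orthogonal unit directions, each projection shifts by exactly its own weight.
print("shift along honesty:", float((steered - hidden) @ honesty))
print("shift along calm:", float((steered - hidden) @ calm))
print("interference:", interference(honesty, calm))
```

When the directions are not orthogonal, steering along one trait also moves the projection onto the other, so the open question is how much interference real steering vectors show.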
I am surprised I didn't find any reference to Tim Urban's "Wait But Why" post What Makes You You.
In short, he argues that "you" is your sense of continuity, rather than your physical substance. He also argues that if (somehow) your mind was copied and pasted somewhere else, then a brand new "not-you" would be born - even though it may share 100% of your memory and behaviour.
In that sense, Tim argues that Theseus' ship is always "one" even though all its parts are changed over time. If you were to disassemble and reassemble the ship, it would lose its continuity and it could arguably be considered a different ship.
Hi Christopher, thanks for your work! I have high expectations about steering techniques in the context of AI Safety. I actually wrote a post about it, and I would appreciate it if you had the time to challenge it!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
I included a link to your post in mine, because they are strongly connected.
Thanks for sharing this research, it's very promising. I am looking into collecting a list of steering vectors that may "force" a model into behaving safely - and I believe this should be included as well.
I'd be grateful if you could challenge my approach in a constructive way!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
Thank you Paul, this post clarifies many open points related to AI (inner) alignment, including some of its limits!
I recently described a technique called control vectors to force an LLM to show specific dispositional traits, in order to condition some form of alignment (but definitely not true alignment).
I'd be happy to be challenged! In my opinion, the importance of control vectors is definitely underestimated for AI safety. https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
This technique works with more than just refusal-acceptance behaviours! It is so promising that I wrote a blog post about it and how it relates to safety research. I am looking for people who may read and challenge my ideas!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
Thanks for your great contribution, looking forward to reading more.
Something quite unexpected happened in the past 19 hours: since I published this post, I received over 12 downvotes! I wasn't expecting much feedback anyway, but this time I was definitely caught by surprise when I saw a negative result.
It's okay if my point of view doesn't resonate with the community (being popular is not the reason why I write here). However, I am intrigued by this reaction and I'd like to investigate it.
If you happen to read my post and you decide to downvote it, please proceed - but I'd appreciate it if you could explain the reason why. I'm happy to be challenged and I will accept even harsh judgements, if that's how you feel.