All of Gianluca Calcagni's Comments + Replies

I'd like to thank the authors for this; I really appreciate this line of research. I also noticed that it is being discussed on Ars Technica here: https://arstechnica.com/ai/2025/03/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives/

Some time ago I discussed a "role-based" approach (based on steering vectors rather than prompts; I called the roles "dispositional traits", but it's pretty much the same thing) to buy time while working on true alignment; maybe this approach will achieve true alignment, but (for now) there is no ma...

How do you envision access and control of an AI that is robustly and reasonably compliant? And in which way would "human values" be involved? I agree with you that they are part of the solution, but I want to compare my beliefs with yours.

I want to thank the author for this post; it is a very interesting read. I see many comments already and I haven't had the time to read them thoroughly, so my apologies if what I state below has already been discussed.

The author's point that I wish to challenge is the third one: "alignment is not about loving humanity; it's about robust reasonable compliance".

I agree that embedding (in a principled way) "love" for humanity is a bad solution, and that it cannot go in a favourable direction - as, in the best scenario, it will likely di...

boazbarak
To be clear, I think that embedding human values is part of the solution - see my comment.

Thanks to anyone who took the time to read and vote on this post - regardless of whether it was a positive or a negative vote, I still appreciate it.

If you happen to downvote me, I'd appreciate it if you could explain the reason why: this is the second time this has happened (for one of my previous posts I chose a title that sounded like clickbait - I later corrected it), and I am curious to understand your feedback this time as well.

The reason I write posts here from time to time is simply to be challenged and exposed to different points of view: that cannot happen without an exchange (even a harsh one).

Let me take this chance to wish you all a happy new year 2025!
Gianluca

Hi Andrew, your post is very interesting and it made me think more carefully about the definition of consciousness and how it applies to LLMs. I'd be curious to get your feedback about a post of mine that, in my opinion, is related to yours - I am keen to receive even harsh judgement if you have any!
https://www.lesswrong.com/posts/e9zvHtTfmdm3RgPk2/all-the-following-are-distinct

Something quite unexpected happened in the past 19 hours: since I published this post, I received over 12 downvotes! I wasn't expecting much feedback anyway, but this time I was definitely caught by surprise by such a negative result.

It's okay if my point of view doesn't resonate with the community (being popular is not the reason I write here); however, I am intrigued by this reaction and I'd like to investigate it.

If you happen to read my post and decide to downvote it, please proceed - but I'd appreciate it if you could explain why. I'm happy to be challenged and will accept even harsh judgements, if that's how you feel.

Milan W
I did not read the whole post, but on a quick skim I see it does not say much about AI, except for a paragraph in the conclusion. Maybe people felt clickbaited. Full disclosure: I neither upvoted nor downvoted this post.

Thanks @Gunnar_Zarncke, I appreciate your comment! You correctly identified my goal: I am trying to ground the concepts and build relationships "from the top to the bottom", but I don't think I can succeed alone.

I kindly ask you to provide some challenges: is there any area that feels "shaky" to you? Any relation in particular that is too open to interpretation? Anything obviously missing from the discussion?

Gunnar_Zarncke
Have you seen 3C's: A Recipe For Mathing Concepts? I think it has some definitions for you to look into, esp. the last sentence:

Thanks Seth for your post! I believe I get your point, and in fact I wrote a post that describes exactly that approach in detail. I recommend conditioning the model using an existing technique called control vectors (or steering vectors), which achieves a raw but incomplete form of safety - in my opinion, just enough partial safety to work on achieving full safety with the help of AIs.

Of course, I am happy to be challenged.

Very happy to support you :)
It took me some time to understand your paper; please find a few comments below:
(1) You are using SVD to find the control vectors (similarly to other authors), but your process is more sophisticated in the following ways: the generation of the matrices, how they are reduced, and how the magnitude of each steering vector is chosen. You are also using the non-steered response as an active part of your calculations - something that is only marginally done by other authors. The final result works, but the process looks arbitrary to me (tbh all...
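For readers unfamiliar with the SVD step, here is a minimal sketch of the general idea - extracting a control vector from paired contrastive activations and adding it to a hidden state. This uses synthetic activations and a planted direction, not the paper's actual pipeline; the dimensions and scaling are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                      # hypothetical hidden size
true_dir = np.eye(d_model)[0]    # planted "ground-truth" behaviour direction

# Synthetic layer activations for contrastive prompt pairs
# (e.g. completions under opposite behavioural instructions).
pos = rng.normal(size=(32, d_model)) + 3.0 * true_dir
neg = rng.normal(size=(32, d_model)) - 3.0 * true_dir

# SVD of the pairwise activation differences: the top right-singular
# vector captures the shared direction separating the two behaviours.
diffs = pos - neg
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
control_vector = vt[0]

# Steering: add the (scaled) vector to a hidden state at inference time.
alpha = 4.0
h = rng.normal(size=d_model)
h_steered = h + alpha * control_vector
```

On the synthetic data the recovered vector aligns closely with the planted direction; real pipelines differ in how the matrices are built and how the magnitude is chosen, which is exactly the part that varies between authors.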

Thanks Neel, keep this coming - even if only once every few years :) You helped me clarify lots of confusion I had about the existing techniques.

I am a huge fan of steering vectors / control vectors, and I would love to see future research showing whether they can be linearly combined to achieve multiple behaviours simultaneously (I made a post about this). I don't think it's just "internal work" - I think it's a hint that language semantics can be linearised as vector spaces (I hope I will be able to formalise this intuition mathematically...
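As a toy illustration of the linear-combination idea (purely synthetic vectors, not directions from a real model; the behaviour names are hypothetical labels): if two steering vectors are made orthogonal, a weighted sum steers the hidden state towards both behaviours at once.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16

# Two hypothetical steering vectors for distinct behaviours.
v_honest = rng.normal(size=d_model)
v_polite = rng.normal(size=d_model)

# Orthogonalise the second against the first (Gram-Schmidt), so the
# two directions vary independently - the property SAE features aim for.
v_polite = v_polite - (v_polite @ v_honest) / (v_honest @ v_honest) * v_honest

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Linearly combine the vectors with separate strengths and steer.
combined = 2.0 * v_honest + 1.5 * v_polite
h = np.zeros(d_model)
h_steered = h + combined
```

Because the directions are orthogonal, the steered state keeps a strictly positive projection onto both behaviour directions; whether real behavioural directions in an LLM are independent enough for this to hold is exactly the open question above.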

Neel Nanda
Glad you liked the post! I'm also pretty interested in combining steering vectors. I think a particularly promising direction is using SAE decoder vectors for this, as SAEs are designed to find feature vectors that independently vary and can be added. I agree steering vectors are important as evidence for the linear representation hypothesis (though at this point I consider SAEs to be much superior as evidence, and think they're more interesting to focus on)

I am surprised I didn't find any reference to Tim Urban's "Wait But Why" post What Makes You You.

In short, he argues that "you" is your sense of continuity, rather than your physical substance. He also argues that if (somehow) your mind were copied and pasted somewhere else, then a brand new "not-you" would be born - even though it might share 100% of your memory and behaviour.
In that sense, Tim argues that Theseus' ship is always "one" despite all its parts being changed over time. If you were to disassemble and reassemble the ship, it would lose its continuity and could arguably be considered a different ship.

Hi Christopher, thanks for your work! I have high expectations for steering techniques in the context of AI Safety. I actually wrote a post about it; I would appreciate it if you had the time to challenge it!

https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

I included a link to your post in mine, because they are strongly connected.

Christopher Ackerman
Hi, Gianluca, thanks, I agree that control vectors show a lot of promise for AI Safety. I like your idea of using multiple control vectors simultaneously. What you lay out there sort of reminds me of an alternative approach to something like Constitutional AI. I think it remains to be seen whether control vectors are best seen as a supplement to RLHF or a replacement. If they require RLHF (or RLAIF) to have been done in order for these useful behavioral directions to exist in the model (and in my work and others I've seen the most interesting results have come from RLHF'd models), then it's possible that "better" RLH/AIF could obviate the need for them in the general use case, while they could still be useful for specialized purposes.

Thanks for sharing this research, it's very promising. I am looking into collecting a list of steering vectors that may "force" a model into behaving safely - and I believe this should be included as well.
I'd be grateful if you could challenge my approach in a constructive way!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thank you Paul, this post clarifies many open points related to AI (inner) alignment, including some of its limits!
I recently described a technique called control vectors to force an LLM to show specific dispositional traits, in order to condition some form of alignment (but definitely not true alignment).
I'd be happy to be challenged! In my opinion, the importance of control vectors for AI safety is definitely underestimated. https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

This technique works with more than just refusal-acceptance behaviours! It is so promising that I wrote a blog post about it and how it is related to safety research. I am looking for people that may read and challenge my ideas!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thanks for your great contribution, looking forward to reading more.

After one year, it's been confirmed that steering vectors (or control vectors) work remarkably well, so I decided to explain the technique again and show how it can be used to steer dispositional traits into a model. I believe it can be used to buy time while we work on true safety techniques.
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
If you have the time to read and challenge my analysis, I'd be very grateful!

I am glad I read your post; it's very relevant for safety! It seems to me that the Steering Directions are a variant of the Control Vectors that I described here: https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
Can you please confirm that the two concepts follow essentially the same approach?

I agree with you about the benefits of the technique; it is very promising. If there were a proper analysis of its scalability properties, and some way to estimate mathematically its likelihood of success, it would be a dramatic breakthrough.

Alejandro Tlaie
Hi Gianluca, it's great that you liked the post and the idea! I think that your approach and mine share things in common and that we have similar views on how activation steering might be useful! I would definitely like to chat to see whether potential synergies come up :)