All of Gianluca Calcagni's Comments + Replies

I'd like to thank the authors for this; I really appreciate this line of research. I also noticed that it is being discussed on Ars Technica here: https://arstechnica.com/ai/2025/03/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives/

Some time ago I discussed a "role-based" approach (based on steering vectors rather than prompts; I called the roles "dispositional traits", but it's pretty much the same thing) to buy time while working on true alignment; maybe this approach will achieve true alignment, but (for now) there is no ma...

How do you envision access and control of an AI that is robustly and reasonably compliant? And in which way would "human values" be involved? I agree with you that they are part of the solution, but I want to compare my beliefs with yours.

I want to thank the author for this post; it is a very interesting read. I see many comments already and I haven't had the time to read them thoroughly, so my apologies if what I state below has already been discussed.

The author's point that I wish to challenge is the third one: "alignment is not about loving humanity; it's about robust reasonable compliance".

I agree that embedding (in a principled way) "love" for humanity is a bad solution, and that it cannot go in a favourable direction - as, in the best scenario, it will likely di...

boazbarak
To be clear, I think that embedding human values is part of the solution - see my comment.

Thanks to anyone who took the time to read and vote on this post - regardless of whether it was a positive or a negative vote, I still appreciate it.

If you happen to downvote me, I'd appreciate it if you could explain the reason why: this is the second time this has happened (for one of my previous posts I chose a title that sounded like clickbait - I later corrected it), and I am curious to understand your feedback this time as well.

The reason I write posts here from time to time is simply to be challenged and exposed to different points of view: that cannot happen without an exchange (even a harsh one).

Let me take this chance to wish you all a happy new year 2025!
Gianluca

Hi Andrew, your post is very interesting and it made me think more carefully about the definition of consciousness and how it applies to LLMs. I'd be curious to get your feedback about a post of mine that, in my opinion, is related to yours - I am keen to receive even harsh judgement if you have any!
https://www.lesswrong.com/posts/e9zvHtTfmdm3RgPk2/all-the-following-are-distinct

Something quite unexpected happened in the past 19 hours: since I published this post, I received over 12 downvotes! I wasn't expecting much feedback anyway, but this time I was definitely caught by surprise by such a negative result.

It's okay if my point of view doesn't resonate with the community (being popular is not the reason I write here); however, I am intrigued by this reaction and I'd like to investigate it.

If you happen to read my post and decide to downvote it, please proceed - but I'd appreciate it if you could explain why. I'm happy to be challenged and will accept even harsh judgements, if that's how you feel.

Milan W
I did not read the whole post, but on a quick skim I see it does not say much about AI, except for a paragraph in the conclusion. Maybe people felt clickbaited. Full disclosure: I neither upvoted nor downvoted this post.

Thanks @Gunnar_Zarncke, I appreciate your comment! You correctly identified my goal: I am trying to ground the concepts and build relationships "from the top to the bottom", but I don't think I can succeed alone.

I kindly ask you to provide some challenges: is there any area that feels "shaky" to you? Any relation in particular that is too open to interpretation? Anything obviously missing from the discussion?

Gunnar_Zarncke
Have you seen 3C's: A Recipe For Mathing Concepts? I think it has some definitions for you to look into, esp. the last sentence:

Thanks Seth for your post! I believe I get your point, and in fact I wrote a post that describes exactly that approach in detail. I recommend conditioning the model using an existing technique called control vectors (or steering vectors), which achieves a raw but incomplete form of safety - in my opinion, just enough partial safety to work on achieving full safety with the help of AIs.

Of course, I am happy to be challenged.

Very happy to support you :)
It took me some time to understand your paper; please find a few comments below:
(1) You are using SVD to find the control vectors (similarly to other authors), but your process is more sophisticated in the following ways: the generation of the matrices, how they are reduced, and how the magnitude of each steering vector is chosen. You are also using the non-steered response as an active part of your calculations - something that is only marginally done by other authors. The final result works, but the process looks arbitrary to me (tbh all...
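For readers unfamiliar with the SVD step, here is a minimal sketch of the general idea - extracting a control vector from paired contrastive activations and adding it to a hidden state. This uses synthetic activations and a planted direction, not the paper's actual pipeline; the dimensions and scaling are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                      # hypothetical hidden size
true_dir = np.eye(d_model)[0]    # planted "ground-truth" behaviour direction

# Synthetic layer activations for contrastive prompt pairs
# (e.g. completions under opposite behavioural instructions).
pos = rng.normal(size=(32, d_model)) + 3.0 * true_dir
neg = rng.normal(size=(32, d_model)) - 3.0 * true_dir

# SVD of the pairwise activation differences: the top right-singular
# vector captures the shared direction separating the two behaviours.
diffs = pos - neg
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
control_vector = vt[0]

# Steering: add the (scaled) vector to a hidden state at inference time.
alpha = 4.0
h = rng.normal(size=d_model)
h_steered = h + alpha * control_vector
```

On the synthetic data the recovered vector aligns closely with the planted direction; real pipelines differ in how the matrices are built and how the magnitude is chosen, which is exactly the part that varies between authors.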

Thanks Neel, keep this coming - even if only once every few years :) You helped me clarify lots of confusion I had about the existing techniques.

I am a huge fan of steering vectors / control vectors, and I would love to see future research showing whether they can be linearly combined to achieve multiple behaviours simultaneously (I made a post about this). I don't think it's just "internal work" - I think it's a hint that language semantics can be linearised as vector spaces (I hope I will be able to formalise this intuition mathematically...
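As a toy illustration of the linear-combination idea (purely synthetic vectors, not directions from a real model; the behaviour names are hypothetical labels): if two steering vectors are made orthogonal, a weighted sum steers the hidden state towards both behaviours at once.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16

# Two hypothetical steering vectors for distinct behaviours.
v_honest = rng.normal(size=d_model)
v_polite = rng.normal(size=d_model)

# Orthogonalise the second against the first (Gram-Schmidt), so the
# two directions vary independently - the property SAE features aim for.
v_polite = v_polite - (v_polite @ v_honest) / (v_honest @ v_honest) * v_honest

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Linearly combine the vectors with separate strengths and steer.
combined = 2.0 * v_honest + 1.5 * v_polite
h = np.zeros(d_model)
h_steered = h + combined
```

Because the directions are orthogonal, the steered state keeps a strictly positive projection onto both behaviour directions; whether real behavioural directions in an LLM are independent enough for this to hold is exactly the open question above.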

Neel Nanda
Glad you liked the post! I'm also pretty interested in combining steering vectors. I think a particularly promising direction is using SAE decoder vectors for this, as SAEs are designed to find feature vectors that independently vary and can be added. I agree steering vectors are important as evidence for the linear representation hypothesis (though at this point I consider SAEs to be much superior as evidence, and think they're more interesting to focus on)

I am surprised I didn't find any reference to Tim Urban's "Wait But Why" post What Makes You You.

In short, he argues that "you" is your sense of continuity, rather than your physical substance. He also argues that if (somehow) your mind were copied and pasted somewhere else, then a brand new "not-you" would be born - even though it might share 100% of your memory and behaviour.
In that sense, Tim argues that Theseus' ship is always "one" despite all its parts being changed over time. If you were to disassemble and reassemble the ship, it would lose its continuity and could arguably be considered a different ship.

Hi Christopher, thanks for your work! I have high expectations for steering techniques in the context of AI Safety. I actually wrote a post about it; I would appreciate it if you had the time to challenge it!

https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

I included a link to your post in mine, because they are strongly connected.

Christopher Ackerman
Hi, Gianluca, thanks, I agree that control vectors show a lot of promise for AI Safety. I like your idea of using multiple control vectors simultaneously. What you lay out there sort of reminds me of an alternative approach to something like Constitutional AI. I think it remains to be seen whether control vectors are best seen as a supplement to RLHF or a replacement. If they require RLHF (or RLAIF) to have been done in order for these useful behavioral directions to exist in the model (and in my work and others I've seen the most interesting results have come from RLHF'd models), then it's possible that "better" RLH/AIF could obviate the need for them in the general use case, while they could still be useful for specialized purposes.

Thanks for sharing this research, it's very promising. I am looking into collecting a list of steering vectors that may "force" a model into behaving safely - and I believe this should be included as well.
I'd be grateful if you could challenge my approach in a constructive way!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thank you Paul, this post clarifies many open points related to AI (inner) alignment, including some of its limits!
I recently described a technique called control vectors to force an LLM to show specific dispositional traits, in order to condition some form of alignment (but definitely not true alignment).
I'd be happy to be challenged! In my opinion, the importance of control vectors for AI safety is definitely underestimated. https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

This technique works with more than just refusal-acceptance behaviours! It is so promising that I wrote a blog post about it and how it is related to safety research. I am looking for people that may read and challenge my ideas!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thanks for your great contribution, looking forward to reading more.

After one year, it's been confirmed that steering vectors (or control vectors) work remarkably well, so I decided to explain the technique again and show how it can be used to steer dispositional traits into a model. I believe it can be used to buy time while we work on true safety techniques.
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
If you have the time to read and challenge my analysis, I'd be very grateful!

I am glad I read your post; it's very relevant for safety! It seems to me that the Steering Directions are a variant of the Control Vectors that I described here: https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
Can you please confirm that the two concepts follow essentially the same approach?

I agree with you about the benefits of the technique; it is very promising. If there were a proper analysis of its scalability properties, and some way to estimate mathematically its likelihood of success, it would be a dramatic breakthrough.

Alejandro Tlaie
Hi Gianluca, it's great that you liked the post and the idea! I think that your approach and mine share things in common and that we have similar views on how activation steering might be useful! I would definitely like to chat to see whether potential synergies come up :)