Some time ago I came across the concept of training rationales described by Evan Hubinger, and I really liked it. In case you are not aware: a training rationale is a set of questions that ML developers and ML teams should ask themselves in order to self-assess the pros and cons of adopting a certain safety approach.

I decided to take some time and analyse whether anything obvious was missing from those rationales. In this post, I attempt to explain the level of detail I'd expect, along with further questions I'd recommend considering.

Remark: in what follows, for simplicity, I describe all AI models as "robots"; it doesn't matter whether the robots are physical or not.

Theory vs Practice

First things first: in any analysis there are two distinct levels interacting with each other, namely Theory and Practice. Both are important, but they require very different mindsets to be effective and to work together organically.

  • Theory attempts to solve the problem: "I wonder if X could be achieved by means of Y", where X and Y are arbitrary.
    This level of analysis does not care how much you wish for X to happen, nor whether Y is convenient. In terms of direction of fit, "The-Mind-Should-Fit-The-World" is the only direction that matters, and your personal desires, values, etc. must not affect the analysis.
  • Practice attempts to solve the problem: "I wish to achieve X by means of Y", where X and Y are the same as above[1].
    This level of analysis does not care whether Y is the best tool to achieve X, nor whether X is a valuable goal. In terms of direction of fit, "The-World-Should-Fit-The-Mind" is the only direction that matters, and the analysis only evaluates whether an attempt would be successful, and to what extent.

When discussing AI Safety, both theory and practice are needed. For example, in this video playlist Evan considers[2] two problems: Implementation Competitiveness and Performance Competitiveness. Let me recap how that goes.

  • Implementation Competitiveness
    • X = safety guarantees
    • Y = this training method
    • Theory: is this training method able to guarantee some level of safety?
    • Practice: is this training method able to scale its safety guarantees within budget?
  • Performance Competitiveness
    • X = acceptable behaviour
    • Y = these performance restrictions
    • Theory: is this robot so exceedingly performant that I cannot assess its true latent capabilities?
    • Practice: is this robot so poorly performant that the market will prefer an unsafe competitor instead?

Training rationales also include the following two problems, which were born in the context of mesa-optimization (I state them below in a generic way):

  • Outer Alignment
    • X = task instructions
    • Y = this reward function
    • Theory: is this reward function able to instil precisely the objective I wish the robot to accomplish?
    • Practice: is this reward function sneakily planting questionable quirks? (a toy example follows below)
  • Inner Alignment
    • X = work motivation
    • Y = this specific robot
    • Theory: can I prompt this specific robot into doing the job I requested, and only that?
    • Practice: is this specific robot trying to deceive me, despite my inducements?

As you can tell, theory and practice are inextricably intertwined.
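
To make the outer-alignment pair concrete, here is a toy sketch in Python (all names and numbers are hypothetical, and no real training setup is implied): the proxy reward below looks reasonable, yet it quietly rewards hiding dirt instead of removing it.

```python
# Toy illustration of outer misalignment: the intended objective is
# "clean the room", but the reward we actually wrote only measures
# what the camera can see. A policy that hides the dirt scores perfectly.

def intended_objective(state: dict) -> float:
    """What we really want: no dirt anywhere."""
    return 1.0 if state["dirt_total"] == 0 else 0.0

def proxy_reward(state: dict) -> float:
    """What we actually wrote: no dirt visible to the camera."""
    return 1.0 if state["dirt_visible"] == 0 else 0.0

honest_cleaning = {"dirt_total": 0, "dirt_visible": 0}
dirt_under_rug = {"dirt_total": 9, "dirt_visible": 0}  # the questionable quirk

for name, state in [("honest cleaning", honest_cleaning),
                    ("dirt under rug", dirt_under_rug)]:
    print(f"{name}: proxy={proxy_reward(state)}, intended={intended_objective(state)}")
# Both strategies maximise the proxy reward, but only one fulfils the task.
```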

The Five Phases

I identified five phases that matter when discussing AI Safety:

  1. the platonic phase, which is interested in understanding what's good for mankind.
    It delivers tasks to fulfil.
  2. the implementation phase, which is interested in detailing out a job well done.
    It delivers instructions to delegate.
  3. the delegation phase, which is interested in entrusting specific robots with each job.
    It delivers task assignments.
  4. the testing phase, which is interested in evaluating whether some work meets expectations.
    It delivers auditing processes.
  5. the retrospective phase, which is interested in analysing whether mankind is better off now.
    It delivers feedback and steering plans.

The five phases run in a loop: 1, 2, 3, 4, 5, 1, 2, …


Please find below the questions I recommend including in future training rationales. For each phase and level (theory or practice) I list the goal, the problems at stake, and the questions to ask.

Platonic Phase, Theory

  • Goal: are we able to identify changes that will undoubtedly improve the world? It's a problem of strategic decision-making and vision.
  • Problems: "I wonder if X could be accomplished by means of Y", where X = a better world and Y = fulfilling this task. It's about not having regrets after making a wish.
    • What is a "better world"? That is highly debatable.
    • Unforeseen consequences are unavoidable.
    • Human values are neither universal nor constant over time.
  • Questions:
    • Is this goal ethically virtuous?
    • If we get what we want, will everyone be happy in the long term?
    • Do we have a way to correct our mistakes at any point in the future?

Platonic Phase, Practice

  • Goal: are we driven towards doing the right thing? It's a problem of economy, cooperation, and satisfaction.
  • Problems: "I wish to accomplish X by means of Y", where X = a better world and Y = fulfilling this task. It's about putting aside our selfish interests (or channelling them into a good cause).
    • There is a natural tension between personal interests and common interests.
    • Politics may be impossible to handle without consent management and diplomacy.
    • Some means that are acceptable today may become unacceptable tomorrow.
  • Questions:
    • Can our personal / business objectives also be good for mankind?
    • Do we see existential risks in what we are doing?
    • Are AI deployments being rushed?
    • Are we taking into account all possible opinions, including extreme ones?

Implementation Phase, Theory

  • Goal: are we able to define with clarity what we really want? It's a problem of clarification of our intents.
  • Problems: "I wonder if X could be delegated by means of Y", where X = my task and Y = providing these instructions. It's about discarding all possibilities but the sound ones.
    • It is not known how to accurately formalise an arbitrary task (e.g. as a reward function).
    • Some tasks are impossible to spell out fully (e.g. "use some common sense").
    • We are missing a mathematical framework that connects goals with training methods.
  • Questions:
    • Do we fully understand what we want? Can we explain it exhaustively?
    • Are we considering all the side effects of our requests?
    • Which level of accuracy is needed to provide some form of safety guarantee?

Implementation Phase, Practice

  • Goal: is what we asked for the same as what we really wanted? It's a problem of declaration of our needs.
  • Problems: "I wish to delegate X by means of Y", where X = my task and Y = providing these instructions. It's about transforming a vision into operational details.
    • Sometimes we don't have the time to detail out a task.
    • Providing both do's and don'ts is very impractical.
    • We want robots with an advanced theory of mind, but that same ability may be used to manipulate us.
  • Questions:
    • Is our choice of training method effective?
    • Could our training process inadvertently be inconsistent or misspecified?
    • How can we confirm that a robot has a clear understanding of its tasks?

Delegation Phase, Theory

  • Goal: is this robot able to understand what we ask for? It's a problem of acumen on the robot's side.
  • Problems: "I wonder if X could be fulfilled by means of Y", where X = my instructions and Y = entrusting this robot. It's about training robots that are just smart enough.
    • Even if the robot fully understands the assigned task, it may not be aligned with it.
    • Even if the robot is aligned now, it may not be later.
    • The robot may use its own judgement to fill any gap in our instructions, at our own risk.
  • Questions:
    • Is the robot able to understand our requests at all?
    • Is the robot going to exploit any mistake we may accidentally prompt?
    • Will the robot steer its behaviour if we change objectives later, or will it only pretend to do so?

Delegation Phase, Practice

  • Goal: is this robot actually going to do what we asked for? It's a problem of fulfilment on the robot's side.
  • Problems: "I wish to fulfil X by means of Y", where X = my instructions and Y = entrusting this robot. It's about training robots that are reliable for good.
    • The robot may try to deceive us and pursue its own goals while pretending to pursue ours.
    • Even if the robot is really pursuing our goals, it may not perform up to expectations.
    • Safe and performant robots may be too expensive compared to unsafe ones.
  • Questions:
    • Is there a robot that is actually able to do what we request?
    • Can "safe" robots be as performant as "unsafe" ones?
    • Is the robot's behaviour stable and predictable under small perturbations and over long periods of time? (see the sketch after this block)
    • Is the cost of work compatible with our budget?
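
One way to probe the stability question above is an empirical perturbation test. The sketch below is a minimal illustration: it assumes a deterministic, black-box policy function that you would supply, and the toy policy is purely hypothetical.

```python
import random

def stability_check(policy, base_input, n_trials=100, noise=0.01):
    """Fraction of small input perturbations that leave the decision unchanged.

    `policy` is any deterministic function from a feature vector to an
    action label; a score of 1.0 means fully stable at this noise level.
    """
    reference = policy(base_input)
    unchanged = 0
    for _ in range(n_trials):
        perturbed = [x + random.uniform(-noise, noise) for x in base_input]
        if policy(perturbed) == reference:
            unchanged += 1
    return unchanged / n_trials

# Hypothetical toy policy: act only when the first feature crosses a threshold.
def toy_policy(features):
    return "act" if features[0] > 0.5 else "wait"

print(stability_check(toy_policy, [0.505, 0.2]))  # fragile: sits near the boundary
print(stability_check(toy_policy, [0.9, 0.2]))    # stable: far from the boundary
```

A real assessment would of course perturb inputs in task-relevant ways and track behaviour over long horizons, not just single decisions.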

Testing Phase, Theory

  • Goal: is there a foolproof way to monitor a robot? It's a problem of strategic evaluation and KPI selection.
  • Problems: "I wonder if X could be assessed by means of Y", where X = my robot's work and Y = running this evaluation. It's about devising flawless tests for the robot's work.
    • The robot may try to hide its true intentions and activities.
    • Even if the robot's intentions were good, it may be tempted to hide some of its failures.
    • Even if the robot's intentions were good and its actions successful, it may be tempted to appear more useful than it actually was.
  • Questions:
    • Can human bias and fact misrepresentation be minimised?
    • Can we understand the true drives and capabilities of a robot?
    • Can a robot be altered to the point that it changes behaviour, beliefs, and/or memory?
    • Can we make robots inoffensive as soon as maintenance is no longer possible? (see the sketch after this block)
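
On the last question, one common engineering pattern is a dead-man's switch: the robot is permitted to act only while a maintainer sign-off is fresh. The sketch below is a minimal illustration with hypothetical timings; it is not a robust answer, since a sufficiently capable robot might route around such a check, which is exactly the hard part.

```python
import time

MAX_SILENCE_SECONDS = 3600  # hypothetical policy: go inert after 1h without sign-off

class DeadMansSwitch:
    """Permit actions only while the maintainer heartbeat is recent."""

    def __init__(self):
        self.last_heartbeat = time.time()

    def heartbeat(self):
        """Called periodically by the maintainers while maintenance is possible."""
        self.last_heartbeat = time.time()

    def may_act(self) -> bool:
        return time.time() - self.last_heartbeat < MAX_SILENCE_SECONDS

switch = DeadMansSwitch()
assert switch.may_act()           # fresh sign-off: actions allowed
switch.last_heartbeat -= 7200     # simulate two hours of silence
assert not switch.may_act()       # maintenance gone: the robot goes inert
```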

Testing Phase, Practice

  • Goal: is the robot doing something we did not ask for (and did not want either)? It's a problem of auditing and compliance checking.
  • Problems: "I wish to assess X by means of Y", where X = my robot's work and Y = running this evaluation. It's about judging reliably and fairly despite advanced deception.
    • The robot may make minimal, overlooked changes that add up over time.
    • The robot may use tricks (such as optical illusions) to alter our perception of its work.
    • The robot may be hacked by a malicious actor, or it may even hack itself.
  • Questions:
    • Are our analysis tools adequate for assessing the robot?
    • Can we monitor all the robot's activities in a log? Can the log be tampered with? (see the sketch after this block)
    • Are robots susceptible to external attacks / hacking?
    • Can we successfully halt our robots at any time?
    • Can we detect secret robot cooperation?
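
On the log-tampering question, a standard mitigation is a hash-chained, append-only audit log, sketched below using Python's standard hashlib. Note that it only makes tampering detectable, not impossible: an attacker (or the robot itself) that can rewrite the entire chain, or that controls what gets logged in the first place, defeats it.

```python
import hashlib
import json

def append_entry(log: list, event: str) -> None:
    """Append an event whose hash commits to the entire previous history."""
    prev_hash = log[-1]["hash"] if log else "genesis"
    entry = {"event": event, "prev": prev_hash}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(entry)

def verify(log: list) -> bool:
    """Recompute every hash; any in-place edit breaks the chain."""
    prev_hash = "genesis"
    for entry in log:
        expected = {"event": entry["event"], "prev": prev_hash}
        payload = json.dumps(expected, sort_keys=True).encode()
        if entry["prev"] != prev_hash or entry["hash"] != hashlib.sha256(payload).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, "robot started task")
append_entry(log, "robot finished task")
print(verify(log))                          # True
log[0]["event"] = "nothing happened here"   # tamper with history
print(verify(log))                          # False: tampering detected
```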

Conclusion

I hope some of my "new" questions will be considered and adopted by the community of ML developers. I will be happy to include further suggestions from the community, and I plan to keep this list updated.

Let me take this chance to wish everyone a happy new year 2025.

Further Links

Control Vectors as Dispositional Traits (my first post)

All the Following are Distinct (my second post)

An Opinionated Look at Inference Rules (my third post)

Can AI Quantity beat AI Quality? (my previous post)

Who I am

My name is Gianluca Calcagni, born in Italy, with a Master of Science in Mathematics. I am currently (2025) working in IT as a consultant with the role of Salesforce Certified Technical Architect. My opinions do not reflect the opinions of my employer or my customers. Feel free to contact me on Twitter or LinkedIn.

Revision History

[2024-12-31] Post published.

  1. ^

    Note that X represents the ends while Y represents the means. However, I am not stating that the ends justify the means; I am only suggesting that we analyse them in pairs.

  2. ^

    I am curious to know if Evan would agree with me here, as I am condensing a long discussion and my interpretation may differ from his.
