Summary: What does it mean for a loss function to be "aligned with" human goals? I perceive four different concepts which involve "loss function" in importantly different ways:

1. Physical-loss: The physical implementation of a loss function and the loss computations,
2. Mathematical-loss: The mathematical idealization of a loss function,
3. A loss function "encoding/representing/aligning with" an intended goal, and
4. Agents which "care about achieving low loss."

I advocate retaining physical- and mathematical-loss. I advocate dropping 3 in favor of talking directly about desired AI cognition and how the loss function entrains that cognition. I advocate disambiguating 4, because it can refer to a range of physically grounded preferences about loss (e.g. low value at the loss register versus making perfect future predictions).

Related: Towards deconfusing wireheading and reward maximization.^[1] I'm going to talk about "loss" instead of "reward", but the lessons apply to both.

I think it's important to maintain a sharp distinction between the following four concepts.

1: Physically implemented loss

The loss function updated my network.

This is a statement about computations embedded in physical reality. This statement involves the physically implemented sequence of loss computations which stream in throughout training. For example, the computations engendered by loss_fn = torch.nn.CrossEntropyLoss().
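To make "the physically implemented sequence of loss computations" concrete, here is a minimal sketch in plain Python rather than PyTorch. Everything in it is an illustrative assumption: the "network" is just a list of three trainable logits, and the names softmax, cross_entropy, sgd_step, and loss_history are hypothetical. The point is only that each entry of loss_history is a number some machine actually computed, and that those computations are what changed the parameters.

```python
import math

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy loss for a single example with an integer class label."""
    return -math.log(softmax(logits)[target])

def sgd_step(logits, target, lr=0.5):
    """One gradient step; d(loss)/d(logit_i) = softmax_i - 1[i == target]."""
    probs = softmax(logits)
    return [z - lr * (p - (1.0 if i == target else 0.0))
            for i, (z, p) in enumerate(zip(logits, probs))]

# Hypothetical stand-in for a network: three trainable logits (an assumption).
logits = [0.0, 0.0, 0.0]
loss_history = []  # the stream of loss values that were physically computed

for step in range(20):
    loss_history.append(cross_entropy(logits, target=2))
    logits = sgd_step(logits, target=2)  # the loss computation updates the parameters
```

Nothing here refers to what the loss "wants." The loop is just a chain of arithmetic operations whose outputs change the logits.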

2: Mathematical loss

The loss function is a smooth function of the prediction distribution.

This is a statement about the idealized mathematical loss function. These are the mathematical objects you can prove learning theory results about. The Platonic idealization of the learning problem and the mathematical output-grading rule casts a shadow into your computer via its real-world implementation (concept 1).

For example, $(D, \ell)$ where $D := \{(x, \mathrm{label}(x)) \mid x \in \mathrm{MNIST}\}$ is the mathematical idealization of the MNIST dataset, where the $x \in \mathbb{R}^{28 \times 28}$ are the idealized grayscale MNIST images. And $\ell$ is the mathematical function of cross-entropy (CE) loss between a label prediction distribution and the ground-truth labels.
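One way to see the gap between the mathematical function and its physical shadow is floating-point arithmetic. For a one-hot label $y$ and predicted distribution $p$, CE loss reduces to $-\log p_y$, which is finite and strictly positive for any finite logits. The sketch below is plain Python with hypothetical function names and an input I chose for illustration. It shows two implementations of that same function disagreeing with the mathematics: one fails outright, and the other returns exactly zero.

```python
import math

def naive_cross_entropy(logits, target):
    """Direct transcription of -log(softmax(logits)[target])."""
    exps = [math.exp(z) for z in logits]  # math.exp overflows for large logits
    return -math.log(exps[target] / sum(exps))

def stable_cross_entropy(logits, target):
    """Same mathematical function, computed via the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [1000.0, 0.0]  # illustrative input, chosen to stress float arithmetic

try:
    naive_cross_entropy(logits, target=0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True  # the physical computation fails; the math does not

# Mathematically the loss is log(1 + e^{-1000}) > 0, but e^{-1000} underflows
# to 0.0 in double precision, so the computed value is exactly 0.0.
stable_loss = stable_cross_entropy(logits, target=0)
```

Learning-theory results are proved about the idealized function. The network is only ever updated by whatever the implementation actually returns.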

3: Loss functions "representing" goals

I want a loss function which is aligned with the goal of "write good novels."

This is an aspirational statement about achieving some kind of correspondence between the loss function a...
