Exploring the Platonic Representation Hypothesis Beyond In-Distribution Data

Ram Bharadwaj


20th Oct 2024


The Platonic Representation Hypothesis (PRH) proposes that models trained with different objectives and across different modalities can converge toward a shared statistical representation of reality.

However, the original experiments focused primarily on image models trained on the same underlying data distribution (ImageNet), leaving open the question of whether this convergence depends on shared training data.

To investigate this, I repeated the analysis using ImageNet-O, a dataset specifically designed to be out-of-distribution relative to ImageNet, and re-evaluated the correlations between alignment scores across a range of image classification models.
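For readers unfamiliar with the alignment score: the PRH paper measures alignment between two models with a mutual k-nearest-neighbor metric, i.e. the average overlap between each input's nearest neighbors in the two models' feature spaces. Below is a minimal NumPy sketch of that idea. It is not the code from the notebook; the function names, the use of cosine similarity, and the default `k=10` are assumptions made for illustration.

```python
import numpy as np


def knn_indices(features: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of each row, by cosine similarity.

    `features` has shape (n_inputs, feature_dim); k must be smaller than n_inputs.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.clip(norms, 1e-12, None)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # an input is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn_alignment(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Mean fraction of shared nearest neighbors between two feature spaces.

    Row i of `feats_a` and row i of `feats_b` must come from the same input.
    The feature dimensions of the two models may differ.
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))
```

A score of 1 means the two models induce identical neighborhoods over the evaluation set. For unrelated features the score stays near the chance level of roughly `k / (n_inputs - 1)`.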

The strong correlations persisted in the OOD setting, providing evidence that PRH is not merely a consequence of models being trained on the same distribution. Instead, these results suggest that models may converge toward common representations even when evaluated on substantially different data, pointing to a deeper underlying structure in how learned systems model reality.
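One possible way to quantify "the correlations persisted" is to compare the pairwise alignment scores measured on the two datasets with a rank correlation. The sketch below assumes two symmetric model-by-model alignment matrices (hypothetical inputs `align_in_dist` and `align_ood`) and computes a Spearman correlation over the unique model pairs. This is an illustrative assumption, not necessarily the procedure used in the notebook, and the simple ranking here does not average tied values.

```python
import numpy as np


def upper_triangle(matrix: np.ndarray) -> np.ndarray:
    """Scores for each unordered model pair (entries above the diagonal)."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]


def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation; ties are broken by position, not averaged."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


def alignment_agreement(align_in_dist: np.ndarray, align_ood: np.ndarray) -> float:
    """Rank correlation between pairwise alignment scores on two datasets."""
    return spearman(upper_triangle(align_in_dist), upper_triangle(align_ood))
```

A value near 1 would mean that model pairs which are strongly aligned on in-distribution data are also the most strongly aligned on the OOD data.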

The plots below compare the original in-distribution results with the corresponding OOD analysis.

But does this mean that the models align even on purely randomly generated data?

The answer is NO.

This plot shows the correlation of the alignment scores for the models on purely randomly generated images.
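For concreteness, "purely randomly generated images" could mean images whose pixels are drawn independently and uniformly at random. The sketch below generates such a batch with NumPy. The function name, the 224x224 RGB shape, and the uniform pixel distribution are assumptions for illustration; the notebook may generate its random inputs differently.

```python
import numpy as np


def random_images(n_images: int, height: int = 224, width: int = 224,
                  channels: int = 3, seed: int = 0) -> np.ndarray:
    """Batch of uniform-noise images with shape (n_images, height, width, channels).

    Each pixel value is drawn independently from {0, ..., 255}, so the images
    contain no spatial or semantic structure.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(n_images, height, width, channels), dtype=np.uint8)
```

Such a batch would still need the same preprocessing (resizing, normalization) that each model expects before features are extracted.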

One particularly interesting observation is the contrast between random data and out-of-distribution (OOD) data. On randomly generated inputs, the correlations predicted by PRH largely disappear, suggesting that the models do not share a common statistical representation in the absence of meaningful structure. In contrast, the strong correlations persist on ImageNet-O, despite the dataset being out-of-distribution relative to ImageNet. The models are often wrong on these examples, but they tend to be wrong in similar ways and with high confidence. In other words, the models fail predictably rather than randomly.

This suggests that the shared structure identified by PRH is not merely agreement on correct predictions, but agreement on how to interpret and extrapolate from structured data, even when those extrapolations are systematically incorrect. The fact that independently trained models converge to similar mistakes on OOD examples provides additional evidence for a shared underlying representation of the world.

The notebook documenting these experiments is available here.

AI

Personal Blog


jacob_drori (2y):

This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I'm sure the answers could be pieced together from the notebook, but most people won't click through and read the code.