As a sanity check, it would have been nice if they had shown Procrustes and RDM results with the vocabulary items randomly permuted (if you can still align well after randomly permuting the tokens, that's a bad sign).
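Roughly the kind of baseline I have in mind, as a minimal sketch with synthetic data standing in for two models' vocabulary embeddings (the shapes, noise level, and scoring function here are all made up for illustration, not taken from the paper):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_score(A, B):
    """Align A to B with an orthogonal map and return the correlation
    between the aligned A and B (higher = better alignment)."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    R, _ = orthogonal_procrustes(A, B)
    return np.corrcoef((A @ R).ravel(), B.ravel())[0, 1]

rng = np.random.default_rng(0)
# Hypothetical embeddings: E2 is an orthogonal rotation of E1 plus noise,
# so a real alignment exists between them.
E1 = rng.standard_normal((2000, 256))
Q, _ = np.linalg.qr(rng.standard_normal((256, 256)))
E2 = E1 @ Q + 0.1 * rng.standard_normal((2000, 256))

matched = procrustes_score(E1, E2)
# Permutation baseline: shuffle which row of E2 corresponds to which token.
permuted = procrustes_score(E1, E2[rng.permutation(len(E2))])
print(f"matched rows:  {matched:.3f}")   # high, since a real alignment exists
print(f"permuted rows: {permuted:.3f}")  # should be near zero if the score is meaningful
```

If the permuted-baseline score came out comparable to the matched score, that would suggest the alignment metric is picking up generic geometry rather than shared token-level structure.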
Also, since they compute the RDM using Euclidean distances instead of, e.g., inner products, all of its entries are non-negative, so the cosine similarity between RDMs would be positive even for completely unrelated embeddings. That doesn't necessarily invalidate their scaling trends, but it makes the absolute similarity values hard to interpret.
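To illustrate the point (assuming they compare flattened Euclidean-distance RDMs with cosine similarity; the random data and shapes below are just for demonstration), two completely unrelated embeddings still give an RDM cosine similarity close to 1, while a mean-centered measure like Pearson correlation sits near 0:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two completely unrelated random "embeddings" over the same 1000 items.
X = rng.standard_normal((1000, 128))
Y = rng.standard_normal((1000, 128))

# Euclidean-distance RDMs (condensed upper triangles); every entry is >= 0.
rdm_x = pdist(X, metric="euclidean")
rdm_y = pdist(Y, metric="euclidean")

cos = rdm_x @ rdm_y / (np.linalg.norm(rdm_x) * np.linalg.norm(rdm_y))
corr = np.corrcoef(rdm_x, rdm_y)[0, 1]
print(f"cosine similarity of RDMs: {cos:.3f}")   # close to 1 despite no shared structure
print(f"Pearson correlation:       {corr:.3f}")  # near 0, as expected for unrelated embeddings
```

The non-negative, roughly constant baseline of Euclidean distances is doing most of the work in the cosine similarity, which is why a centered or permutation-referenced comparison would be easier to interpret.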
I think there are much better papers on this topic, such as this one.
This is a linkpost for https://arxiv.org/abs/2308.15047.