All of fhomeson's Comments + Replies

Did you try getting the centroid of all words, rather than all tokens? The set of tokens will contain a lot of nonsense fragments.

mwatkins
No, would be interesting to try. Someone somewhere might have compiled a list of indexes for GPT-2/3/J tokens which are full words, but I've not yet been able to find one.
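
In the absence of such a list, one rough way to approximate it is to filter the vocabulary by a heuristic and then average only the matching embedding rows. The sketch below is illustrative only: it assumes GPT-2-style vocabulary strings, where a leading space is stored as "Ġ", and it uses a toy vocabulary, toy 2-D embeddings, and a hypothetical `wordlist` set standing in for a real dictionary. The names `is_full_word` and `centroid` are made up for this example, and the heuristic will miss words that appear in the vocabulary without a leading space.

```python
def is_full_word(token, wordlist):
    """Heuristic (an assumption, not an official rule): treat a token as a
    full word if it starts with the GPT-2 space marker 'Ġ' and the rest is
    an alphabetic string found in the given word list."""
    if not token.startswith("Ġ"):
        return False
    body = token[1:]
    return body.isalpha() and body.lower() in wordlist


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot take the centroid of an empty set")
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


# Toy stand-ins for a real vocabulary and embedding matrix (hypothetical data).
vocab = ["Ġthe", "Ġcat", "ing", "oreAnd", "Ġ123"]
embeddings = [
    [1.0, 0.0],
    [3.0, 2.0],
    [10.0, 10.0],
    [-8.0, 4.0],
    [0.0, 6.0],
]
wordlist = {"the", "cat"}  # stand-in for a real dictionary file

# Indexes of tokens that pass the full-word heuristic.
word_ids = [i for i, tok in enumerate(vocab) if is_full_word(tok, wordlist)]

all_centroid = centroid(embeddings)
word_centroid = centroid([embeddings[i] for i in word_ids])

print("full-word token indexes:", word_ids)
print("centroid of all tokens: ", all_centroid)
print("centroid of full words: ", word_centroid)
```

With a real model, `vocab` and `embeddings` would come from the tokenizer and the input embedding matrix, and comparing the two centroids would show how much the fragment tokens shift the mean.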