I can't be sure what's in the data, but we have a few hints:
The exact question ("is this land or water?") is, of course, very unlikely to be in the training corpus. At the very least, the models contain some multi-purpose map of the world. Further experimentation I've done with embedding models confirms that we can extract maps of biomes and country borders from embedding space too.
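As a rough sketch of how that kind of extraction can work (this is a hypothetical setup for illustration, not the exact method used in the post): embed coordinate strings, fit a linear probe on a few labeled points, then query the probe across the globe. The `embed` function here is a deterministic stand-in for a real embedding API so the example is self-contained.

```python
# Hypothetical sketch: probe an embedding space for a land/water signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model call; returns a fixed-size vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

# Tiny labeled set of (lat, lon) -> land (1) / water (0). Illustrative only.
labeled = [((48.9, 2.4), 1), ((40.7, -74.0), 1),
           ((0.0, -150.0), 0), ((-30.0, -40.0), 0)]
X = np.stack([embed(f"lat {lat}, lon {lon}") for (lat, lon), _ in labeled])
y = np.array([label for _, label in labeled])

# A linear probe: if geography is linearly decodable from the embeddings,
# this classifier recovers it.
probe = LogisticRegression().fit(X, y)

# Sweep the probe over a lat/lon grid to render a map (one point shown here).
pred = probe.predict(embed("lat 51.5, lon -0.1").reshape(1, -1))
```

With a real embedding model, the same loop over a dense grid of coordinates is what produces the biome and border maps.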
There's definitely compression. In smaller models, the ways in which the representations are inaccurate actually tell us a lot: instead of spikes of "land" around population centers (which are more likely to be in the training set), we see massive smooth elliptical blobs of land. This indicates that there's some internal notion of geographical distance, and that the model identifies continents as a natural abstraction.
Sure: ~$100 between API credits (the majority of the cost from proprietary models) and cloud GPUs. A few of the smaller models were evaluated on my M4 MacBook Pro with 24 GB of unified RAM. For larger open-weight models, I rented A100s. Most runs took about 20 minutes at the 2-degree resolution.
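For a sense of what the 2-degree resolution implies for run size (my own arithmetic, not a figure from the post): a global grid at 2° spacing works out to 90 × 180 = 16,200 query points per model.

```python
# Build a global query grid at 2-degree resolution (illustrative helper;
# the actual evaluation harness isn't shown in the thread).
STEP = 2  # degrees

grid = [(lat, lon)
        for lat in range(-90, 90, STEP)
        for lon in range(-180, 180, STEP)]

print(len(grid))  # 90 * 180 = 16200 queries per model
```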
I agree with most points on a first pass, but I'm still unsure about:
"you must have added significant value beyond what the AI produced"
Shouldn't the target for posts be to provide value? If an entirely AI-generated post passes every quality check and appears to be on equal footing with a human-written post in terms of value, I'd want it. Attribution of credit is a valid concern, but it seems like the solution there is to simply tag the model as the primary author.
Yeah, my involvement was giving feedback on drafts of the article and providing some of the images. Looks like my post got taken down for being a duplicate, though.
It's a bit of an out-of-body experience to see your own tweet in a newsletter! The location-finding model used is available at geospy.ai.
You'd think, but nope, it's explicitly named after the web fiction.