Anything that can rewrite all physical evidence down to the microscopic level can almost certainly conduct a 51% (or a 99.999%) attack against a public ledger defended by mere mortal hashing boxes. More to the point, it can simply rewrite every physical record of such a ledger with one of its own choosing. Computer hardware is physical, after all.
One counterexample scenario where this point doesn't hold: the "good" guys have fled to space, and the "evil" AI has physical control of Earth's surface, where all the history is. The center of Bitcoin, as determined by light-speed delays, could shift to a near-solar orbit where energy is cheap. There could be jurisdictional issues (a peace treaty?) preventing the AI from physically altering ships in space, say, but no such rule prevents plausibly deniable attacks on the ships' information storage systems.
A 51% blockchain attack doesn't prevent timestamping from being a credible piece of evidence. There is still a large amount of hash-work piled on top of the timestamp, which would be hard to duplicate. Every alternate history proposed by a deceptive superintelligence would need to meet the same standard of evidence, as long as the hashes in the timestamp haven't been broken.
OpenTimestamps was created by Peter Todd.
This is not a coincidence because nothing is ever a coincidence.
Exciting on so many levels.
First: it is practical and doable now, while still having saving-the-world vibes. I thoroughly support this idea. I think advances in storage will soon enough (sooner than AIs reach the history-rewriting stage) allow us to record multiple physical copies of the 100+TB database and spread them throughout our habitat, making the revisionist AIs' task significantly more difficult.
An even grander idea: laser-beam the data into space, aimed at several stars at different distances, and listen for the echoes coming back, reflected off the star systems' components. These echoes will be extremely weak, but hopefully detectable and decipherable with future technology, and, importantly, absolutely unforgeable by AIs here on Earth. For example, aim at a star 50 light-years away and get your data back in 100 years, guaranteed untampered.
But then, I think that your list of "what it takes for this to work" misses a critical item:
7. Those in the future who care about knowing the truth will need the guts to accept that data about the past contained in these hashed torrents is true, even if it contradicts their ideas and memories of that past.
I think that for superintelligent AIs it will be much easier to convince us all (and probably themselves) of a false version of the past, perhaps combined with faking some evidence, than to seek out and subvert all the evidence there is. I find it quite probable that this is the road they will take first, and that they won't even care much about subverting your hashed torrents, because it won't be necessary.
Imagine you live in the future. Imagine you are as confident of your memories and your general idea of the past as you are now. Imagine you get interested, verify the hashes and timestamps, unzip the torrents, start to read, and begin seeing references to pink unicorns everywhere! Imagine all these papers and books mention pink unicorns as a pretty common thing that exists and can be seen, experienced, studied, filmed, etc.
What do you think would be a more common outcome then:
Or, why ' petertodd' is a tightrope over the Abyss.
The omphalos hypothesis made real
What if, in the future, an AI corrupted all of history, rewriting every piece of physical evidence from a prior age down to a microscopic level in order to suit its (currently unknowable) agenda? With a sufficiently capable entity and breadth of data tampering, it would be impossible for humans and even for other superintelligences to know which digital evidence was real and which was an illusion. I propose that this would be bad.
But what are we mere mortals supposed to do against such a powerful opponent? Is there a way to even the odds, to tunnel through the Singularity? A way to prove that reality existed before the AI? In fact, there is such a way, and it has been easy to do since 2016 when OpenTimestamps was released.
Some background you can skip
Cryptographic hashes are one-way mathematical functions that take an input and produce a seemingly random output between zero and some maximum value. The same input always produces the same output, so if you know the output (a large random-looking number, still small enough to write down on a piece of paper) you can verify that the data that went into the hash function (the data you wanted to preserve) is unchanged, without having to carefully store the data itself and prevent it from being modified. Hash functions are widely used in cryptography and computer science as checksums for proving the integrity of a message.
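For the unfamiliar, here is what this looks like in practice with the coreutils sha256sum tool (the inputs are made up for illustration):

```shell
# The same input always yields the same digest.
printf 'hello world' | sha256sum
# → b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9  -
# Changing a single character produces a completely unrelated digest.
printf 'hello worle' | sha256sum
```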
If we want to prove that some data existed before a certain date, we can hash the data and publish it on a notarized public ledger. The 1991 paper "How to Time-Stamp a Digital Document" is the first description of a system now known as a "blockchain". Bitcoin at its core is a long chain of hashes, where each hash includes the previous hash and some data, the time and date, and also a special random number, found through brute-force search, called a nonce. The nonce is chosen so that the resulting hash of everything starts with a large number of zeroes, i.e. it falls near the start of the hash function's output space. If someone wanted to change some data in the middle of the chain, they'd have to redo this brute-force search for every block after the one they changed. Meanwhile, everyone else is busy computing more nonces on top of the existing chain, so the attacker loses that race.
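A toy version of the brute-force search, in bash. Here we demand only two leading zero hex digits rather than Bitcoin's dozens of leading zero bits, and "block data" is a stand-in for real block contents:

```shell
# Toy proof-of-work: find a nonce such that the hash of (data + nonce)
# starts with "00". Real Bitcoin requires far more leading zeroes.
data="block data"
nonce=0
while true; do
  hash=$(printf '%s%d' "$data" "$nonce" | sha256sum | awk '{print $1}')
  case "$hash" in
    00*) break ;;
  esac
  nonce=$((nonce + 1))
done
echo "nonce=$nonce hash=$hash"
```

Each extra required zero digit multiplies the expected search time by 16, which is why piling more work on top of the chain outpaces any rewriter of its history.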
It's expensive to put a lot of data into the list (currently about $0.02 per byte) so instead we can just put hashes of the data we want to store into the list. Great! Now other people will add their monetary transaction data hashes on top of our hashed data, and create more hashes after that, and so on until the end of general purpose computing.
But what if we have a lot of data to store? More than is feasible to keep in one location, or if the probability of random bit errors is high? A single bit error anywhere in the data will destroy the resulting hash. One solution is to hash each piece of data, then hash the entire list of hashes and put that hash-of-hashes into the public chain instead. It ought to be just as good, right? Well, not quite. We would have to memorize the entire list of hashes, which can get quite large in the case of a global notary system. There are currently 90,000 hashes waiting to be notarized. Instead we can store only log(n) hashes per datum by using a data structure called a Merkle tree, and putting only the root node (a hash) into the Bitcoin chain. This is what OpenTimestamps does. The short list of hashes on the path to your data, plus their location in the blockchain, is packed into a file called an OTS proof, which you then have to store if you want it to be easy to validate your claim that the data existed.
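The hash-of-hashes scheme is two lines of bash (chunk filenames are hypothetical; a real Merkle tree hashes pairs recursively rather than the whole list at once):

```shell
# Hash each piece of data individually...
sha256sum chunk1.dat chunk2.dat chunk3.dat > hashlist.txt
# ...then hash the list itself, yielding one digest that commits to everything.
sha256sum hashlist.txt
```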
Torrents
What data should we store to prove that reality existed? Books? Science? Photos?
Fortunately, there are some ready-made giant piles of data available on the internet, namely the sci-hub and libgen torrents. These torrents contain ~90 million scientific journal articles and ~7 million books, stored as 100 zip files per torrent, since the maximum manageable number of files in a single torrent is, or was, about 1,000 in practice. (I'm not sure about any of this, actually. Some of the newer torrents have 30,000 files each. The files can be inspected with exiftool.)
A torrent file is a SHA-1 hash list of chunks of data, primarily intended for detecting random errors during file transfer. SHA-1 is a cryptographic hash, but a practical collision attack against it was demonstrated in 2017 and it is now deprecated. Nevertheless, it should be extremely difficult to find 100 simultaneous SHA-1 collisions, where every collision is also a self-consistent set of books or scientific articles, which are then compressed together. The self-consistency of these PDF files would also have to depict an illusory reality that matches all the other PDF files in the other hashed zip files, with no obvious traces of tampering such as long strings of garbage data. A tall order, even for a superintelligence. Still, that ' petertodd' stuff gives me the willies.
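The piece-hashing idea can be imitated with coreutils: split a file into fixed-size chunks and SHA-1 each one. (Real torrent files store these piece hashes in a bencoded structure; this is just the concept, with a made-up filename.)

```shell
# Split a file into 256 KiB pieces, as a torrent client would,
# then record the SHA-1 of every piece.
split -b 262144 somefile.dat piece_
sha1sum piece_* > piece_hashes.txt
```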
It's still possible for a superintelligence to cut this gordian knot by breaking the SHA-1 algorithm completely. Often, finding collisions for a theoretically broken hash function is still somewhat computationally difficult, so it's good that there are over 700 MB of SHA-1 hashes in the set of torrent files. It would still be prudent to download some of the torrents and hash the data that they refer to directly with several different hashing algorithms, and timestamp those hashes too. Unfortunately, this carries some legal risk in the modern day, since it all but proves that you downloaded the files that the torrents refer to, if you publish timestamps for their contents.
Re-hashing the data
Maybe the libgen and sci-hub maintainers could be persuaded to re-hash their collections with Blake3, SHA-512, Whirlpool, and as many other families of hash functions as possible. Then, others could store and publish the ~100 million hashes and timestamp proofs without assuming any legal risk, for verification at a later date when the copyright has expired or the institution of copyright law no longer exists. Copyright holders may want to verify their collections as well, but will not have had the foresight to timestamp them or to have used such seemingly unreasonable levels of hashing.
It would also be prudent to give the same treatment to open access, open source, out-of-copyright, copyleft, and public domain data, which includes many government publications. We can't be sure what will be relevant, so breadth is key. Unfortunately, the Internet Archive has been unwilling to cooperate so far with timestamping efforts.
Hashing the torrent files
This timestamping idea has been around since 1991, but someone still has to actually DO it.
Since I am on the side of truth and beauty, and against deceit and corruption, I have taken some first steps on this project by downloading the libgen and sci-hub torrent files, and also all the torrent files managed by Anna's Archive, and then timestamping them. There are thousands of torrent files, so here I will just give a summary instead of a complete list of hashes.
Here is how to do so yourself, in bash:
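A minimal version, assuming the torrent files are sorted into one directory per collection (the directory and output names here are illustrative, not the exact ones I used):

```shell
# Produce one list of SHA-512 hashes per collection of torrent files.
for dir in libgen scihub annas-archive; do
  sha512sum "$dir"/*.torrent > "sha512s_of_${dir}_torrents.txt"
done
```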
Hashing goes quickly and will produce some lists of SHA-512 hashes which look like this:
Note that these hashes are not useful for downloading the torrent files from the BitTorrent DHT; for that you would need the magnet link's infohash.
We also hashed all the SHA-512 hashes together with each other, which should have produced these three single-line files, which I have called sha512s_of_sha512.txt:
(no trailing newline)
Anna's Archive posts updates, so the last hash will change over time, but it is still useful if timestamped.
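One way to produce such a single-line file with no trailing newline (the input filename is whichever hash list you generated, used here illustratively):

```shell
# Hash the list of SHA-512 hashes, keep only the digest,
# and write it out without a trailing newline.
sha512sum sha512s_of_libgen_torrents.txt | awk '{printf "%s", $1}' \
  > sha512s_of_sha512.txt
```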
We can also do other hashes like Blake3:
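Assuming the reference b3sum tool is installed (e.g. via `cargo install b3sum`), the procedure is the same:

```shell
# Same procedure as with SHA-512, but using Blake3.
for dir in libgen scihub annas-archive; do
  b3sum "$dir"/*.torrent > "blake3s_of_${dir}_torrents.txt"
done
```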
And blake3s_of_blake3.txt:
(no trailing newline)
You could also combine the SHA-512 and Blake3 torrent hashes in parallel, into one file, so that an attacker must break both hashes. Chaining them in series, as in a sha512s_of_blake3.txt, won't work: if either hash is broken, the whole thing is compromised, since you could fake either the inputs or the output.
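Combining in parallel just means putting both digests of each torrent side by side in one file (the hash-list filenames here are illustrative):

```shell
# Pair the SHA-512 and Blake3 digests line by line; forging an entry now
# requires a simultaneous collision in both hash functions on the same input.
paste sha512s_of_libgen_torrents.txt blake3s_of_libgen_torrents.txt \
  > combined_hashes.txt
```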
OTS proof
Let's take our two files sha512s_of_sha512.txt and blake3s_of_blake3.txt and upload them to OpenTimestamps. It grinds for a second and spits out this "proof" file, which you can recreate with base64 -d to verify that the hashes are stored in the blockchain as part of a Merkle tree:
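For reference, the stamping and later verification steps with the opentimestamps-client (`pip install opentimestamps-client`) look roughly like this; stamping requires network access to the public calendar servers:

```shell
# Create timestamp proofs (writes .ots proof files alongside the inputs).
ots stamp sha512s_of_sha512.txt blake3s_of_blake3.txt
# Later, once the calendar commitment has been mined into Bitcoin
# (typically within a day), upgrade and verify the proof.
ots upgrade sha512s_of_sha512.txt.ots
ots verify sha512s_of_sha512.txt.ots
```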
All of our big fancy hashes get bottlenecked down to a couple of small SHA-256 hashes by OpenTimestamps, which would be the obvious weak point to target. Smaller hashes are more vulnerable, but also cost less to store, and OpenTimestamps was not created specifically with the goal of being superintelligence-resistant. There are an enormous number of SHA-256 brute-forcing ASICs in existence, their performance will only increase, and much work will go into trying to break this hash, possibly including during the peak of the Singularity itself, if superintelligences use Bitcoin or a derivative cryptocurrency for transactions. If SHA-256 is ever partially broken, Bitcoin could move to a different hash, but your old timestamps couldn't. SHA-256 has stood the test of time for two decades, but I still think you should either put your various hashes directly into the Bitcoin blockchain, or at least spread them out over many separate OpenTimestamps attestations, each of which can take several days to be incorporated.
At the end of the day, someone has to actually download a good sized portion of the book data and verify that it matches. The most straightforward way to do this would be with a torrent client, but of course that process could be compromised as well. Down the rabbit hole...
What is required for all this to work
1 - Someone has to continue mining Bitcoin through the Singularity with high tech.
A superintelligence could significantly outstrip the existing global hashrate and rewrite the entire blockchain history at will. If rivals of somewhat lesser hash power continue mining the existing chain, they will still have the advantage for timestamping purposes as long as the difference in hashrate isn't absurdly large. The question is whether anyone will continue mining Bitcoin when there is no purely economic reason to do so.
2 - Hashes of the torrents must be published and preserved.
The magic numbers above don't just stay in existence on their own. Someone has to be able to find them after the emergence of superintelligence in order to verify the torrent data.
3 - The torrent files themselves must be preserved.
The hashes above rely on all the specific quirks used in constructing the torrent files. Perhaps one could make new torrent files from the data, but I wouldn't want to rely on it.
4 - Someone has to preserve the actual data from at least one torrent from the list.
These magic numbers are useless unless backed up by words and images from the real world. Each torrent is about 100GB, which anyone can store. The total archive would be about 100TB, an expensive hobby purchase, but maybe worth it to maybe save the world. Keeping the data safe from rampaging nanobots and corrupting viruses is another matter...
5 - The hashes must remain mostly unbroken through the Singularity.
6 - Timestamp proofs must be published and preserved.
That's a lot of number spam for a LessWrong post!
Recently it has become increasingly onerous to publish anything anonymously online. It is my hope that the magic numbers I have written above will remain available for search engines and archive bots to index and preserve. I will probably spam them in random places, though that is a poor strategy for reliable data preservation.
Comments and criticism are requested. There's no point in doing this if it's all wrong.