Anything that can rewrite all physical evidence down to the microscopic level can almost certainly conduct a 51% (or a 99.999%) attack against a public ledger defended by mere mortal hashing boxes. More to the point, it can simply rewrite every physical record of such a ledger with one of its own choosing. Computer hardware is physical, after all.
One counterexample scenario where this point doesn't hold: the "good" guys have fled to space, and the "evil" AI has physical control of Earth's surface, where all the history is. The center of Bitcoin, as determined by light-speed delays, could shift to a near-solar orbit where energy is cheap. There could be jurisdictional issues (a peace treaty?) preventing the AI from physically altering ships in space, say, but no such rule prevents plausibly deniable attacks on the ships' information storage systems.
A 51% blockchain attack doesn't prevent timestamping from being a credible piece of evidence. There is still a large amount of hash-work piled on top of the timestamp, which would be hard to duplicate. Every alternate history proposed by a deceptive superintelligence would need to meet the same standard of evidence, as long as the hashes in the timestamp haven't been broken.
OpenTimestamps was created by Peter Todd.
This is not a coincidence because nothing is ever a coincidence.
Exciting on so many levels.
First: it is practical and doable now, while still having saving-the-world vibes. I thoroughly support this idea. I think advances in storage will soon enough (sooner than AIs reach the history-rewriting stage) allow us to record multiple physical copies of the 100+TB database and spread them throughout our habitat, making the revisionist AIs' task significantly more difficult.
An even grander idea: laser-beam the data into space, aimed at several stars at different distances, and listen for the echoes coming back, reflected off the star systems' components. These echoes will be extremely weak, but hopefully detectable and decipherable with future technology, and, importantly, absolutely unforgeable by AIs here on Earth. For example, aim at a star 50 light-years away and get your data back in 100 years, guaranteed untampered.
But then, I think that your list of "what it takes for this to work" misses a critical item:
7. Those in the future who care about knowing the truth will need the guts to accept that data about the past contained in these hashed torrents is true, even if it contradicts their ideas and memories of that past.
I think that for superintelligent AIs it will be much easier to convince us all (and probably themselves) of a false version of the past, perhaps combined with faking some evidence, than to seek out and subvert all the evidence there is. I find it quite probable that this is the road they will take first, and that they won't even care much about subverting your hashed torrents, because it won't be necessary.
Imagine you live in the future. Imagine you are as confident of your memories and your general idea of the past as you are now. Imagine you get interested, verify the hashes and timestamps, unzip the torrents, start to read, and begin seeing references to pink unicorns everywhere! Imagine all these papers and books mention pink unicorns as a pretty common thing that exists and can be seen, experienced, studied, filmed, etc.
What do you think would be a more common outcome then:
Or, why ' petertodd' is a tightrope over the Abyss.
The omphalos hypothesis made real
What if, in the future, an AI corrupted all of history, rewriting every piece of physical evidence from a prior age down to a microscopic level in order to suit its (currently unknowable) agenda? With a sufficiently capable entity and breadth of data tampering, it would be impossible for humans and even for other superintelligences to know which digital evidence was real and which was an illusion. I propose that this would be bad.
But what are we mere mortals supposed to do against such a powerful opponent? Is there a way to even the odds, to tunnel through the Singularity? A way to prove that reality existed before the AI? In fact, there is such a way, and it has been easy to do since 2016 when OpenTimestamps was released.
Some background you can skip
Cryptographic hashes are one-way mathematical functions that take an input and produce a seemingly random output between zero and some maximum value. The same input always produces the same output, so if you know the output (a large random-looking number, still small enough to write down on a piece of paper) you can verify that the data that went into the hash function (the data you wanted to preserve) is unchanged, without having to carefully store the data itself and prevent it from being modified. Hash functions are widely used in cryptography and computer science as checksums for proving the integrity of a message.
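For the unfamiliar, here is what this looks like in practice with the coreutils sha256sum tool (the inputs are made up for illustration):

```shell
# The same input always yields the same digest.
printf 'hello world' | sha256sum
# → b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9  -
# Changing a single character produces a completely unrelated digest.
printf 'hello worle' | sha256sum
```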
If we want to prove that some data existed before a certain date, we can hash the data and publish it on a notarized public ledger. The 1991 paper "How to Time-Stamp a Digital Document" is the first description of a system now known as a "blockchain". Bitcoin at its core is a long chain of hashes, where each hash includes the previous hash and some data, the time and date, and also a special random number, found through brute-force search, called a nonce. The nonce is chosen so that the resulting hash of everything starts with a large number of zeroes, i.e. it falls near the start of the hash function's output space. If someone wanted to change some data in the middle of the chain, they'd have to redo this brute-force search for every block after the one they changed. Meanwhile, everyone else is busy computing more nonces on top of the existing chain, so the attacker loses that race.
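A toy version of the brute-force search, in bash. Here we demand only two leading zero hex digits rather than Bitcoin's dozens of leading zero bits, and "block data" is a stand-in for real block contents:

```shell
# Toy proof-of-work: find a nonce such that the hash of (data + nonce)
# starts with "00". Real Bitcoin requires far more leading zeroes.
data="block data"
nonce=0
while true; do
  hash=$(printf '%s%d' "$data" "$nonce" | sha256sum | awk '{print $1}')
  case "$hash" in
    00*) break ;;
  esac
  nonce=$((nonce + 1))
done
echo "nonce=$nonce hash=$hash"
```

Each extra required zero digit multiplies the expected search time by 16, which is why piling more work on top of the chain outpaces any rewriter of its history.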
It's expensive to put a lot of data into the list (currently about $0.02 per byte) so instead we can just put hashes of the data we want to store into the list. Great! Now other people will add their monetary transaction data hashes on top of our hashed data, and create more hashes after that, and so on until the end of general purpose computing.
But what if we have a lot of data to store? More than is feasible to keep in one location, or if the probability of random bit errors is high? A single bit error anywhere in the data will destroy the resulting hash. One solution is to hash each piece of data, then hash the entire list of hashes and put that hash-of-hashes into the public chain instead. It ought to be just as good, right? Well, not quite. We would have to memorize the entire list of hashes, which can get quite large in the case of a global notary system. There are currently 90,000 hashes waiting to be notarized. Instead we can store only log(n) hashes per datum by using a data structure called a Merkle tree, and putting only the root node (a hash) into the Bitcoin chain. This is what OpenTimestamps does. The short list of hashes on the path to your data, plus their location in the blockchain, is packed into a file called an OTS proof, which you then have to store if you want it to be easy to validate your claim that the data existed.
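The hash-of-hashes scheme is two lines of bash (chunk filenames are hypothetical; a real Merkle tree hashes pairs recursively rather than the whole list at once):

```shell
# Hash each piece of data individually...
sha256sum chunk1.dat chunk2.dat chunk3.dat > hashlist.txt
# ...then hash the list itself, yielding one digest that commits to everything.
sha256sum hashlist.txt
```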
Torrents
What data should we store to prove that reality existed? Books? Science? Photos?
Fortunately, there are some ready-made giant piles of data available on the internet, namely the sci-hub and libgen torrents. These torrents contain ~90 million scientific journal articles and ~7 million books, stored as 100 zip files per torrent, since the maximum manageable number of files in a single torrent is, or was, about 1,000 in practice. (I'm not sure about any of this, actually. Some of the newer torrents have 30,000 files each. The files can be inspected with exiftool.)
A torrent file is a SHA-1 hash list of chunks of data, primarily intended for detecting random errors during file transfer. SHA-1 is a cryptographic hash, but a practical collision attack against it was demonstrated in 2017 and it is now deprecated. Nevertheless, it should be extremely difficult to find 100 simultaneous SHA-1 collisions, where every collision is also a self-consistent set of books or scientific articles, which are then compressed together. The self-consistency of these PDF files would also have to depict an illusory reality that matches all the other PDF files in the other hashed zip files, with no obvious traces of tampering such as long strings of garbage data. A tall order, even for a superintelligence. Still, that ' petertodd' stuff gives me the willies.
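The piece-hashing idea can be imitated with coreutils: split a file into fixed-size chunks and SHA-1 each one. (Real torrent files store these piece hashes in a bencoded structure; this is just the concept, with a made-up filename.)

```shell
# Split a file into 256 KiB pieces, as a torrent client would,
# then record the SHA-1 of every piece.
split -b 262144 somefile.dat piece_
sha1sum piece_* > piece_hashes.txt
```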
It's still possible for a superintelligence to cut this gordian knot by breaking the SHA-1 algorithm completely. Often, finding collisions for a theoretically broken hash function is still somewhat computationally difficult, so it's good that there are over 700 MB of SHA-1 hashes in the set of torrent files. It would still be prudent to download some of the torrents and hash the data that they refer to directly with several different hashing algorithms, and timestamp those hashes too. Unfortunately, this carries some legal risk in the modern day, since it all but proves that you downloaded the files that the torrents refer to, if you publish timestamps for their contents.
Re-hashing the data
Maybe the libgen and sci-hub maintainers could be persuaded to re-hash their collections with Blake3, SHA-512, Whirlpool, and as many other families of hash functions as possible. Then, others could store and publish the ~100 million hashes and timestamp proofs without assuming any legal risk, for verification at a later date when the copyright has expired or the institution of copyright law no longer exists. Copyright holders may want to verify their collections as well, but will not have had the foresight to timestamp them or to have used such seemingly unreasonable levels of hashing.
It would also be prudent to give the same treatment to open access, open source, out-of-copyright, copyleft, and public domain data, which includes many government publications. We can't be sure what will be relevant, so breadth is key. Unfortunately, the Internet Archive has been unwilling to cooperate so far with timestamping efforts.
Hashing the torrent files
This timestamping idea has been around since 1991, but someone still has to actually DO it.
Since I am on the side of truth and beauty, and against deceit and corruption, I have taken some first steps on this project by downloading the libgen and sci-hub torrent files, and also all the torrent files managed by Anna's Archive, and then timestamping them. There are thousands of torrent files, so here I will just give a summary instead of a complete list of hashes.
Here is how to do so yourself, in bash:
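A minimal version, assuming the torrent files are sorted into one directory per collection (the directory and output names here are illustrative, not the exact ones I used):

```shell
# Produce one list of SHA-512 hashes per collection of torrent files.
for dir in libgen scihub annas-archive; do
  sha512sum "$dir"/*.torrent > "sha512s_of_${dir}_torrents.txt"
done
```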
Hashing goes quickly and will produce some lists of SHA-512 hashes which look like this:
Note that these hashes are not useful for downloading the torrent files from the BitTorrent DHT; for that you would need the magnet link's infohash.
We also hashed all the SHA-512 hashes together with each other, which should have produced these three single-line files, which I have called sha512s_of_sha512.txt:
(no trailing newline)
Anna's Archive posts updates, so the last hash will change over time, but it is still useful if timestamped.
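One way to produce such a single-line file with no trailing newline (the input filename is whichever hash list you generated, used here illustratively):

```shell
# Hash the list of SHA-512 hashes, keep only the digest,
# and write it out without a trailing newline.
sha512sum sha512s_of_libgen_torrents.txt | awk '{printf "%s", $1}' \
  > sha512s_of_sha512.txt
```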
We can also do other hashes like Blake3:
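Assuming the reference b3sum tool is installed (e.g. via `cargo install b3sum`), the procedure is the same:

```shell
# Same procedure as with SHA-512, but using Blake3.
for dir in libgen scihub annas-archive; do
  b3sum "$dir"/*.torrent > "blake3s_of_${dir}_torrents.txt"
done
```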
And blake3s_of_blake3.txt:
(no trailing newline)
You could also combine the SHA-512 and Blake3 torrent hashes in parallel, into one file, so that an attacker must break both hashes. Chaining them in series, as in a sha512s_of_blake3.txt, won't work: if either hash is broken, the whole thing is compromised, since you could fake either the inputs or the output.
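Combining in parallel just means putting both digests of each torrent side by side in one file (the hash-list filenames here are illustrative):

```shell
# Pair the SHA-512 and Blake3 digests line by line; forging an entry now
# requires a simultaneous collision in both hash functions on the same input.
paste sha512s_of_libgen_torrents.txt blake3s_of_libgen_torrents.txt \
  > combined_hashes.txt
```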
OTS proof
Let's take our two files sha512s_of_sha512.txt and blake3s_of_blake3.txt and upload them to OpenTimestamps. It grinds for a second and spits out this "proof" file, which you can recreate with base64 -d to verify that the hashes are stored in the blockchain as part of a Merkle tree:
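For reference, the stamping and later verification steps with the opentimestamps-client (`pip install opentimestamps-client`) look roughly like this; stamping requires network access to the public calendar servers:

```shell
# Create timestamp proofs (writes .ots proof files alongside the inputs).
ots stamp sha512s_of_sha512.txt blake3s_of_blake3.txt
# Later, once the calendar commitment has been mined into Bitcoin
# (typically within a day), upgrade and verify the proof.
ots upgrade sha512s_of_sha512.txt.ots
ots verify sha512s_of_sha512.txt.ots
```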
All of our big fancy hashes get bottlenecked down to a couple of small SHA-256 hashes by OpenTimestamps, which would be the obvious weak point to target. Smaller hashes are more vulnerable, but also cost less to store, and OpenTimestamps was not created specifically with the goal of being superintelligence-resistant. There are an enormous number of SHA-256 brute-forcing ASICs in existence, their performance will only increase, and much work will go into trying to break this hash, possibly including during the peak of the Singularity itself, if superintelligences use Bitcoin or a derivative cryptocurrency for transactions. If SHA-256 is ever partially broken, Bitcoin could move to a different hash, but your old timestamps couldn't. SHA-256 has stood the test of time for two decades, but I still think you should either put your various hashes directly into the Bitcoin blockchain, or at least spread them out over many separate OpenTimestamps attestations, each of which can take several days to be incorporated.
At the end of the day, someone has to actually download a good sized portion of the book data and verify that it matches. The most straightforward way to do this would be with a torrent client, but of course that process could be compromised as well. Down the rabbit hole...
What is required for all this to work
1 - Someone has to continue mining Bitcoin through the Singularity with high tech.
A superintelligence could significantly outstrip the existing global hashrate and rewrite the entire blockchain history at will. If rivals of somewhat lesser hash power continue mining the existing chain, they will still have the advantage for timestamping purposes as long as the difference in hashrate isn't absurdly large. The question is whether anyone will continue mining Bitcoin when there is no purely economic reason to do so.
2 - Hashes of the torrents must be published and preserved.
The magic numbers above don't just stay in existence on their own. Someone has to be able to find them after the emergence of superintelligence in order to verify the torrent data.
3 - The torrent files themselves must be preserved.
The hashes above rely on all the specific quirks used in constructing the torrent files. Perhaps one could make new torrent files from the data, but I wouldn't want to rely on it.
4 - Someone has to preserve the actual data from at least one torrent from the list.
These magic numbers are useless unless backed up by words and images from the real world. Each torrent is about 100GB, which anyone can store. The total archive would be about 100TB, an expensive hobby purchase, but maybe worth it to maybe save the world. Keeping the data safe from rampaging nanobots and corrupting viruses is another matter...
5 - The hashes must remain mostly unbroken through the Singularity.
6 - Timestamp proofs must be published and preserved.
That's a lot of number spam for a LessWrong post!
Recently it has become increasingly onerous to publish anything anonymously online. It is my hope that the magic numbers I have written above will remain available for search engines and archive bots to index and preserve. I will probably spam them in random places, though that is a poor strategy for reliable data preservation.
Comments and criticism are requested. There's no point in doing this if it's all wrong.