This is my reply to faul_sname's claim in the "I No Longer Believe Intelligence To Be Magical" thread:
In terms of concrete predictions, I'd expect that if we
- Had someone² generate a description of a physical scene in a universe that runs on different laws than ours, recorded by sensors that use a different modality than we use
- Had someone code up a simulation of what sensor data with a small amount of noise would look like in that scenario and dump it out to a file
- Created a substantial prize structured mostly³ like the Hutter Prize for compressed file + decompressor
we would see that the winning programs would look more like "generate a model and use that model and a similar rendering process to what was used to generate the original file, plus an error correction table" and less like a general-purpose compressor⁴.
...
If you were to give me a binary file with no extension and no metadata that is
- Above 1,000,000 bytes in size
- Able to be compressed to under 50% of its uncompressed size with some simple tool like gzip (to ensure that there is actually some discoverable structure)
- Not able to be compressed under 10% of its uncompressed size by any well-known existing tools (to ensure that there is actually a meaningful amount of information in the file)
- Not generated by some tricky gotcha process (e.g. a file that is 250,000 bytes from /dev/random followed by 750,000 bytes from /dev/zero)
then I'd expect that
- It would be possible for me, given some time to examine the data, to create a decompressor and a payload such that running the decompressor on the payload yields the original file, and the decompressor program + the payload have a total size of less than the original gzipped file (a size check along these lines is sketched just after this list)
- The decompressor would legibly contain a substantial amount of information about the structure of the data.
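For concreteness, here is a minimal Python sketch of how that size criterion could be checked, assuming hypothetical names for an entry's decompressor script and payload file (the decompressor's command-line interface is also an assumption):

    import gzip
    import os
    import subprocess

    ORIGINAL = "mystery_file_difficult.bin"  # the challenge file
    DECOMPRESSOR = "decompress.py"           # hypothetical entry: decompressor script
    PAYLOAD = "payload.bin"                  # hypothetical entry: compressed payload
    OUTPUT = "reconstructed.bin"

    with open(ORIGINAL, "rb") as f:
        original = f.read()

    gzipped_size = len(gzip.compress(original, 9))

    # Run the candidate decompressor on its payload (interface assumed here).
    subprocess.run(["python", DECOMPRESSOR, PAYLOAD, OUTPUT], check=True)

    with open(OUTPUT, "rb") as f:
        reconstructed = f.read()

    entry_size = os.path.getsize(DECOMPRESSOR) + os.path.getsize(PAYLOAD)
    print("decompressor + payload:", entry_size, "bytes")
    print("gzipped original:      ", gzipped_size, "bytes")
    print("lossless:", reconstructed == original)
    print("beats gzip:", reconstructed == original and entry_size < gzipped_size)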
Here is such a file: https://mega.nz/file/VYxE3T5A#xN3524gW4Q68NXK2rmYgTqq6e-2RSaEF2HW8rLGfK7k.
- It is 2 MB in size.
- When I compress it with zip using default settings, I see it's ~1.16 MB in size, or a little over 50% of the original.
- I have tried to compress it to smaller sizes with various configurations of 7zip and I've been unable to get it significantly smaller than 50% of the file size.
- This file represents the binary output of a fake sensor that I developed for the purpose of this challenge.
- There is no "gotcha" on what type of sensor is used, or what the sensor is being used to detect. It will be obvious if the data is decoded correctly.
- I have not obfuscated the fundamental design or mechanism by which the sensor works, or the way that the sensor data is presented. The process to decode the data from similar sensors is straightforward and well documented online.
- The data is not compressed or encrypted.
Unfortunately, the data in this file comes from eavesdropping on aliens.
Background on LV-478
It was a long overdue upgrade of the surface-to-orbit radio transmitter used for routine communication with the space dock. However, a technician forgot to lower the power level of the transmitter after the upgrade process, and the very first transmission of the upgraded system was at the new (and much higher) maximum power level. The aliens accidentally transmitted an utterly mundane and uninteresting file into deep space.
The technician was fired.
Hundreds of years later...
Background on Earth
Astronomers on Earth noticed a bizarre radio transmission from a nearby star system. In particular, the transmission seemed to use frequency modulation to carry some unknown data stream. They were lucky enough to be recording when this stream was received, and they are therefore confident that they received all of the data.
Furthermore, the astronomers were able to analyze the transmission and assign either a "0"-modulated or "1"-modulated value. The astronomers are confident that they've analyzed the frequencies correctly, and that the data carried by the transmission was definitely binary in nature. The full transmission contained a total of 16800688 (~2 MB) binary values.
The astronomers ran out of funding before they could finish analyzing the data, so an intern in the lab took the readings, packed the bits together, and uploaded it to the internet, in the hopes that someone else would be able to figure it out.
This is the exact Python code that the intern used when packing the data:
    def main():
        bits = None
        with open("data.bits", "rb") as f:
            bits = f.read()
        print(len(bits))
        byte_values = []
        byte_value = 0
        bit_idx = 0
        for bit in bits:
            byte_value |= (bit << bit_idx)
            bit_idx += 1
            if bit_idx % 8 == 0:
                byte_values.append(byte_value)
                byte_value = 0
                bit_idx = 0
        with open("mystery_file_difficult.bin", "wb") as f:
            f.write(bytearray(byte_values))

    if __name__ == "__main__":
        main()
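To go the other way, here is a short sketch (mine, not the intern's) that unpacks the published file back into the original stream of binary values, assuming the LSB-first packing used above:

    def unpack_bits(path="mystery_file_difficult.bin"):
        with open(path, "rb") as f:
            data = f.read()
        bits = []
        for byte in data:
            for bit_idx in range(8):  # least significant bit first, mirroring the packer
                bits.append((byte >> bit_idx) & 1)
        return bits

    if __name__ == "__main__":
        bits = unpack_bits()
        print(len(bits))  # should print 16800688, matching the count above (2,100,086 bytes x 8)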
Challenge Details
Decode the alien transmission in the file uploaded at https://mega.nz/file/VYxE3T5A#xN3524gW4Q68NXK2rmYgTqq6e-2RSaEF2HW8rLGfK7k.
- This is a collaborative challenge.
- The challenge ends on August 27th, or when someone can explain what was sent in the aliens' unintentional transmission, whichever occurs first.
- I will award points for partial credit at the end of the challenge for any correct statements that describe the aliens' hardware/software systems.
- The points are not worth anything.
I actually did not read the linked thread until now; I came across this post from the front page and thought it was a potentially interesting challenge.
Regarding "in the concept of the fiction", I think this piece of data is way too human to be convincing. The noise is effectively a 'gotcha, sprinkle in /dev/random into the data'.
Why sample with 24 bits of precision if the source image only has 8 bits of precision? And it shows. Then why add only <11 bits of noise, and uniform noise at that? It could work well if you had a 16-bit lossless source image, or even an approximation of one, but the way this image is constructed is far too artificial. (And why not Gaussian noise? Or any other kind of more natural noise? Uniform noise pretty much never happens in practice.) One can also entirely separate the noise from the source data you used, because 8 + 11 < 24.
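A toy illustration of that last point, under an assumed layout (the 8-bit pixel value scaled into the top byte of a 24-bit sample, uniform noise confined to the low 11 bits); the exact placement in the challenge data may differ, but any layout that leaves a gap between the two is separable the same way:

    import random

    def make_sample(pixel8, noise_bits=11):
        noise = random.getrandbits(noise_bits)  # uniform noise, below 2**11
        return (pixel8 << 16) | noise           # pixel bits and noise bits never overlap

    def split_sample(sample24):
        pixel8 = sample24 >> 16    # recover the original 8-bit value exactly
        noise = sample24 & 0xFFFF  # everything below bit 16 is pure noise
        return pixel8, noise

    sample = make_sample(0xAB)
    assert split_sample(sample) == (0xAB, sample & 0xFFFF)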
JPEG-caused block artifacts were visible while I was analyzing the planes of the image; that's why I thought the Bayer filter was possibly 4x4 pixels in size. I believe you likely downsampled the image from a JPEG at approximately 2000x1200 resolution, which affects analysis and breaks the fiction that this is raw sensor data from an alien civilization.
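Concretely, the plane analysis mentioned above amounts to splitting the mosaic into its periodic sub-planes; a rough sketch, with the mosaic period and reshape dimensions as assumptions rather than givens:

    import numpy as np

    def mosaic_planes(img, period=2):
        """Split a Bayer-like mosaic into its period x period sub-planes."""
        return [img[dy::period, dx::period] for dy in range(period) for dx in range(period)]

    # Hypothetical usage, once the 24-bit samples have been decoded and a width guessed:
    # img = samples.reshape(height, width)
    # for plane in mosaic_planes(img, period=4):  # 4x4 as guessed above; 2x2 is the common case
    #     print(plane.shape, plane.mean(), plane.std())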
With these kinds of flaws, I do believe cracking the PRNG is within the bounds of the challenge, since the data is already really flawed.
(1) is possibly true. At least it's true in this case, although in practice understanding the structure of the data doesn't actually help very much versus some of the best general-purpose compressors from the PAQ family.
It doesn't help that lossless image compression algorithms kinda suck. I can often get better results by running zpaq on a NetPBM file than by using a special-purpose format like PNG or even lossless WebP (although the latter is usually at least somewhat competitive with the zpaq results).
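For anyone who wants to reproduce that comparison, a rough sketch, assuming the decoded image ends up as an 8-bit single-channel NumPy array (the function and file names here are mine):

    import numpy as np

    def write_pgm(path, img):
        """Write a 2-D uint8 array as a binary (P5) NetPBM file."""
        height, width = img.shape
        with open(path, "wb") as f:
            f.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
            f.write(img.astype(np.uint8).tobytes())

    # e.g. write_pgm("decoded.pgm", img), then compress decoded.pgm with zpaq,
    # PNG, and lossless WebP and compare the resulting sizes.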
(2) I'd say my decompressor would contain useful info about the structure of the data, or at least the file format itself, however...
...it would not contain any useful representation of the pictured piece of bismuth. The lossless compression requirement hurts a lot. Reverse rendering techniques for various representations do exist, but they are either lossy, or larger than the source data.
Constructing and raytracing a NeRF / SDF / voxel grid / whatever might possibly be competitive if you had dozens (or maybe hundreds) of shots of the same bismuth piece at different angles, but it really doesn't pay off for a single image, especially at this quality, especially with all the JPEG artifacts that leaked through, and so on.
I feel like this is a bit of a wasted opportunity; you could have chosen a lot of different modalities of data, even something like a stream of data from the IMU sensor in your phone as you walk around the house. You would not need to add any artificial noise; it would already be there in the source data. Modeling that could actually be interesting (if the sample rate on the IMU were high enough for a physics-based model to help).
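As a generic illustration of why that could pay off (this is not anyone's actual entry, and the simple predictor here is a stand-in for a real physics model): predict each sample from its predecessors, store only the residuals, and let a general-purpose compressor handle the rest.

    import zlib
    import numpy as np

    def residual_encode(samples):
        """Constant-velocity predictor: keep only the prediction errors."""
        samples = np.asarray(samples, dtype=np.int32)
        predicted = np.zeros_like(samples)
        predicted[2:] = 2 * samples[1:-1] - samples[:-2]
        return samples - predicted

    signal = np.cumsum(np.random.randint(-3, 4, size=100000))  # smooth-ish fake sensor trace
    raw = zlib.compress(signal.astype(np.int32).tobytes(), 9)
    res = zlib.compress(residual_encode(signal).astype(np.int32).tobytes(), 9)
    print(len(raw), len(res))  # the residual stream typically compresses much better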
I also think that viewing the data 'wrongly' and figuring out something about it despite that is a feature, not a bug.
Updates on best results so far:
General purpose compression on the original file, using cmix:
Results with knowledge about the contents of the file: https://gist.github.com/mateon1/f4e2b8e3fad338405fa793fb155ebf29 (spoilers).
Summary:
The best general-purpose method, after massaging the structure of the data, manages 713,248 bytes.
The best purpose-specific method manages to compress the data, minus headers, to 712,439 bytes.