This is my reply to faul_sname's claim in the "I No Longer Believe Intelligence To Be Magical" thread:
In terms of concrete predictions, I'd expect that if we
- Had someone² generate a description of a physical scene in a universe that runs on different laws than ours, recorded by sensors that use a different modality than we use
- Had someone code up a simulation of what sensor data with a small amount of noise would look like in that scenario and dump it out to a file
- Created a substantial prize structured mostly³ like the Hutter Prize for compressed file + decompressor
we would see that the winning programs would look more like "generate a model and use that model and a similar rendering process to what was used to [generate the] original file, plus an error correction table" and less like a general-purpose compressor⁴.
...
If you were to give me a binary file with no extension and no metadata that is
- Above 1,000,000 bytes in size
- Able to be compressed to under 50% of its uncompressed size with some simple tool like gzip (to ensure that there is actually some discoverable structure)
- Not able to be compressed under 10% of its uncompressed size by any well-known existing tools (to ensure that there is actually a meaningful amount of information in the file)
- Not generated by some tricky gotcha process (e.g. a file that is 250,000 bytes from /dev/random followed by 750,000 bytes from /dev/zero)
then I'd expect that
- It would be possible for me, given some time to examine the data, create a decompressor and a payload such that running the decompressor on the payload yields the original file, and the decompressor program + the payload have a total size of less than the original gzipped file
- The decompressor would legibly contain a substantial amount of information about the structure of the data.
Here is such a file: https://mega.nz/file/VYxE3T5A#xN3524gW4Q68NXK2rmYgTqq6e-2RSaEF2HW8rLGfK7k.
- It is 2 MB in size.
- When I compress it with zip using default settings, it comes out at ~1.16 MB, or a little over 50% of the original size.
- I have tried to compress it to smaller sizes with various configurations of 7zip and I've been unable to get it significantly smaller than 50% of the file size.
- This file represents the binary output of a fake sensor that I developed for the purpose of this challenge.
- There is no "gotcha" on what type of sensor is used, or what the sensor is being used to detect. It will be obvious if the data is decoded correctly.
- I have not obfuscated the fundamental design or mechanism by which the sensor works, or the way that the sensor data is presented. The process to decode the data from similar sensors is straightforward and well documented online.
- The data is not compressed or encrypted.
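As a rough sketch, the size figures above could be re-checked with Python's standard library instead of zip and 7zip. The function name `compression_ratios` is my own, and gzip and LZMA at these settings are only stand-ins for the tools I actually used, so the exact numbers will differ slightly.

```python
import gzip
import lzma


def compression_ratios(data: bytes) -> dict:
    """Return compressed size divided by original size for two stdlib codecs."""
    return {
        # DEFLATE, the same family of algorithm that zip uses by default.
        "gzip": len(gzip.compress(data)) / len(data),
        # LZMA at its highest preset, roughly comparable to 7zip's default method.
        "lzma": len(lzma.compress(data, preset=9)) / len(data),
    }


# Example usage (assumes the challenge file has been downloaded locally):
#   with open("mystery_file_difficult.bin", "rb") as f:
#       print(compression_ratios(f.read()))
```

A ratio near 1.0 means the codec found no structure, and a ratio near 0 means the file carries very little information.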
Unfortunately, the data in this file comes from eavesdropping on aliens.
Background on LV-478
It was a long overdue upgrade of the surface-to-orbit radio transmitter used for routine communication with the space dock. However, a technician forgot to lower the power level of the transmitter after the upgrade process, and the very first transmission of the upgraded system was at the new (and much higher) maximum power level. The aliens accidentally transmitted an utterly mundane and uninteresting file into deep space.
The technician was fired.
Hundreds of years later...
Background on Earth
Astronomers on Earth noticed a bizarre radio transmission from a nearby star system. In particular, the transmission seemed to use frequency modulation to carry some unknown data stream. They were lucky enough to be recording when this stream was received, and they are therefore confident that they received all of the data.
Furthermore, the astronomers were able to analyze the transmission and assign either a "0"-modulated or "1"-modulated value. The astronomers are confident that they've analyzed the frequencies correctly, and that the data carried by the transmission was definitely binary in nature. The full transmission contained a total of 16800688 binary values (~2 MB once packed into bytes).
The astronomers ran out of funding before they could finish analyzing the data, so an intern in the lab took the readings, packed the bits together, and uploaded it to the internet, in the hopes that someone else would be able to figure it out.
This is the exact Python code that the intern used when packing the data:
def main():
    bits = None
    with open("data.bits", "rb") as f:
        bits = f.read()
    print(len(bits))
    byte_values = []
    byte_value = 0
    bit_idx = 0
    for bit in bits:
        byte_value |= (bit << bit_idx)
        bit_idx += 1
        if bit_idx % 8 == 0:
            byte_values.append(byte_value)
            byte_value = 0
            bit_idx = 0
    with open("mystery_file_difficult.bin", "wb") as f:
        f.write(bytearray(byte_values))

if __name__ == "__main__":
    main()
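To get back to the stream of binary values, the packing above has to be undone. Here is a minimal sketch of the inverse, assuming (as the intern's loop implies) that the first received bit ends up in the least significant position of each byte. The names `unpack_bits` and `pack_bits` are mine and are not part of the intern's script.

```python
def unpack_bits(packed: bytes) -> list:
    """Expand each byte into 8 bits, least significant bit first."""
    bits = []
    for byte_value in packed:
        for bit_idx in range(8):
            bits.append((byte_value >> bit_idx) & 1)
    return bits


def pack_bits(bits) -> bytes:
    """Mirror of the intern's loop, included only for a round-trip check."""
    out = bytearray()
    # Like the original loop, any trailing group of fewer than 8 bits is dropped.
    for start in range(0, len(bits) - len(bits) % 8, 8):
        byte_value = 0
        for bit_idx, bit in enumerate(bits[start:start + 8]):
            byte_value |= bit << bit_idx
        out.append(byte_value)
    return bytes(out)


# Example usage (assumes the challenge file has been downloaded locally):
#   with open("mystery_file_difficult.bin", "rb") as f:
#       bits = unpack_bits(f.read())
```

Since 16800688 is an exact multiple of 8 (2100086 bytes), no bits were lost at the end of the transmission when packing.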
Challenge Details
Decode the alien transmission in the file uploaded at https://mega.nz/file/VYxE3T5A#xN3524gW4Q68NXK2rmYgTqq6e-2RSaEF2HW8rLGfK7k.
- This is a collaborative challenge.
- The challenge ends on August 27th, or when someone can explain what was sent in the aliens' unintentional transmission, whichever occurs first.
- I will award points for partial credit at the end of the challenge for any correct statements that describe the aliens' hardware/software systems.
- The points are not worth anything.
I sat down before going to bed and believe I have made some more progress.
I experimented with what I called the sign bit in the earlier post, and I'm certain I got it wrong. By ignoring the sign bit, I can reconstruct a much higher fidelity image. I can also do a non-obvious operation of inverting the bit and then rotating it to the least significant place. I can't visually distinguish the two approaches, though.
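To make the two operations concrete, here is a small sketch. It assumes the "sign bit" is the most significant bit of a fixed-width sample, and `width` is a placeholder parameter rather than the real sample size; both function names are made up for illustration.

```python
def drop_sign_bit(sample: int, width: int) -> int:
    """Ignore the top bit and keep only the lower width-1 bits."""
    return sample & ((1 << (width - 1)) - 1)


def invert_and_rotate_sign_bit(sample: int, width: int) -> int:
    """Flip the top bit, then rotate left by one so it lands in the lowest position."""
    mask = (1 << width) - 1
    flipped = (sample ^ (1 << (width - 1))) & mask
    return ((flipped << 1) & mask) | (flipped >> (width - 1))


# With a hypothetical 8-bit sample:
#   drop_sign_bit(0b10000001, 8)               -> 1
#   invert_and_rotate_sign_bit(0b10000001, 8)  -> 2
```

Under these assumptions the rotated value is always twice the masked value plus the inverted bit, so the two results differ only by a constant scale factor and at most one least significant bit, which would fit with the images looking the same.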
I wrote a naive debayering filter and got this image out: https://i.imgur.com/e5ydBTb.png (bit rotated version, 16-bit color RGB. Red channel on even rows/columns is exact, blue channel on odd rows/columns is exact, green channel is exact elsewhere.)
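For reference, a naive debayering pass along these lines might look like the sketch below. It is not the exact filter I ran; it assumes the layout described above (red on even rows/columns, blue on odd rows/columns, green elsewhere) and fills each missing channel with the average of the same-colored samples in the surrounding 3x3 window. `debayer_naive` is a made-up name.

```python
def debayer_naive(raw):
    """raw: list of rows of ints (Bayer mosaic). Returns rows of (r, g, b) tuples."""
    height, width = len(raw), len(raw[0])

    def channel_at(y, x):
        if y % 2 == 0 and x % 2 == 0:
            return 0  # red sample
        if y % 2 == 1 and x % 2 == 1:
            return 2  # blue sample
        return 1      # green sample

    out = []
    for y in range(height):
        row = []
        for x in range(width):
            sums = [0, 0, 0]
            counts = [0, 0, 0]
            # Accumulate every sample in the 3x3 window, clipped at the borders.
            for ny in range(max(0, y - 1), min(height, y + 2)):
                for nx in range(max(0, x - 1), min(width, x + 2)):
                    c = channel_at(ny, nx)
                    sums[c] += raw[ny][nx]
                    counts[c] += 1
            pixel = [sums[c] // counts[c] if counts[c] else 0 for c in range(3)]
            # The channel actually measured at this site is kept exact.
            pixel[channel_at(y, x)] = raw[y][x]
            row.append(tuple(pixel))
        out.append(row)
    return out
```

Plain averaging blurs edges and produces color fringing, so a better interpolation would probably give a better predictor for compression purposes.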
You can reverse image search that image to find that it's a standard stock photo of bismuth. For example, here is a larger but slightly cropped version: http://blog-imgs-98.fc2.com/s/c/i/scienceminestrone/Bismuth.jpg
Finding the original image would definitely help, even if it wouldn't fit the spirit of this challenge.
I haven't yet tried going in the reverse direction and seeing how much space this saves - doing this properly is tricky. I'm not aware of any good, ready-to-use libraries that provide things like context modeling and arithmetic coding, so writing a custom compressor is a lot of work.
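One way to get a number without writing the full compressor is to estimate the ideal code length under a model: an arithmetic coder driven by a model gets very close to the sum of -log2(p) over the symbols. The sketch below uses a simple adaptive order-k byte model with add-one smoothing. That model is my own stand-in for illustration, not a tuned one, and `estimate_coded_bits` is a made-up name.

```python
import math
from collections import defaultdict


def estimate_coded_bits(data: bytes, order: int = 1) -> float:
    """Ideal code length in bits under an adaptive order-`order` byte model."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> symbols seen
    total_bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]
        # Add-one smoothing over the 256 possible byte values.
        p = (counts[context][symbol] + 1) / (totals[context] + 256)
        total_bits += -math.log2(p)
        # Update only after coding, so a decoder could mirror the same state.
        counts[context][symbol] += 1
        totals[context] += 1
    return total_bits


# Example usage: compare estimate_coded_bits(data, order=k) / 8 for a few
# values of k against the size of the zipped file.
```

For this challenge, the interesting contexts are probably neighboring pixels of the same color rather than preceding bytes, but the bookkeeping would be the same.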
In fact, I believe it may be worth trying to break the author's noise source on the sensor. Most programming languages use a fairly breakable PRNG, either a xorshift variant or an LCG. But this may be a dead end if cryptographic randomness was used. Again, this wouldn't fit the spirit of this challenge, but it would minimize description length if it worked.
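To show why an LCG counts as breakable, here is a toy sketch that recovers the multiplier and increment from three consecutive outputs. It assumes the full internal state is observed and that the modulus is a known prime, neither of which is known to hold for the challenge's noise source; `lcg_next` and `recover_lcg` are made-up names.

```python
def lcg_next(state: int, a: int, c: int, m: int) -> int:
    """One step of a linear congruential generator."""
    return (a * state + c) % m


def recover_lcg(s0: int, s1: int, s2: int, m: int):
    """Recover (a, c) from three consecutive full states, for a prime modulus m.

    Uses s2 - s1 = a * (s1 - s0) mod m. Raises ValueError if s1 == s0 mod m,
    since the difference then has no modular inverse.
    """
    a = ((s2 - s1) * pow((s1 - s0) % m, -1, m)) % m  # pow(x, -1, m): Python 3.8+
    c = (s1 - a * s0) % m
    return a, c


# Example with the MINSTD parameters (a=48271, c=0, m=2**31 - 1):
#   s0 = 12345
#   s1 = lcg_next(s0, 48271, 0, 2**31 - 1)
#   s2 = lcg_next(s1, 48271, 0, 2**31 - 1)
#   recover_lcg(s0, s1, s2, 2**31 - 1)  -> (48271, 0)
```

In practice the noise added to the sensor values would expose only part of each output, if that, so a real attack would need to brute-force or lattice-reduce the missing bits.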