This is my reply to faul_sname's claim in the "I No Longer Believe Intelligence To Be Magical" thread:
In terms of concrete predictions, I'd expect that if we
- Had someone² generate a description of a physical scene in a universe that runs on different laws than ours, recorded by sensors that use a different modality than we use
- Had someone code up a simulation of what sensor data with a small amount of noise would look like in that scenario and dump it out to a file
- Created a substantial prize structured mostly³ like the Hutter Prize for compressed file + decompressor
we would see that the winning programs would look more like "generate a model and use that model and a similar rendering process to what was used to original file, plus an error correction table" and less like a general-purpose compressor⁴.
...
If you were to give me a binary file with no extension and no metadata that is
- Above 1,000,000 bytes in size
- Able to be compressed to under 50% of its uncompressed size with some simple tool like gzip (to ensure that there is actually some discoverable structure)
- Not able to be compressed under 10% of its uncompressed size by any well-known existing tools (to ensure that there is actually a meaningful amount of information in the file)
- Not generated by some tricky gotcha process (e.g. a file that is 250,000 bytes from /dev/random followed by 750,000 bytes from /dev/zero)
then I'd expect that
- It would be possible for me, given some time to examine the data, create a decompressor and a payload such that running the decompressor on the payload yields the original file, and the decompressor program + the payload have a total size of less than the original gzipped file
- The decompressor would legibly contain a substantial amount of information about the structure of the data.
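For anyone who wants to check a candidate file against the size criteria quoted above, here is a rough sketch. It uses Python's standard `gzip`, `bz2` and `lzma` modules as stand-ins for "well-known existing tools"; that choice of tools, the function names, and the `demo_data` helper are my own assumptions, not part of the original claim.

```python
import bz2
import gzip
import lzma
import random


def compression_ratios(data):
    """Compressed size divided by original size, for three stdlib compressors."""
    return {
        "gzip": len(gzip.compress(data)) / len(data),
        "bz2": len(bz2.compress(data)) / len(data),
        "lzma": len(lzma.compress(data)) / len(data),
    }


def meets_criteria(data, min_size=1_000_000, upper=0.5, lower=0.1):
    """True if the data is large enough, gzip gets it under `upper`,
    and none of the compressors gets it under `lower`."""
    if len(data) <= min_size:
        return False
    ratios = compression_ratios(data)
    return ratios["gzip"] < upper and min(ratios.values()) >= lower


def demo_data(n, seed=0):
    """Hypothetical test input: n bytes drawn uniformly from only four values,
    so roughly 2 bits of information per byte."""
    rng = random.Random(seed)
    return bytes(rng.randrange(4) for _ in range(n))
```

This says nothing about the "no tricky gotcha" condition, which still needs a human to judge.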
Here is such a file: https://mega.nz/file/VYxE3T5A#xN3524gW4Q68NXK2rmYgTqq6e-2RSaEF2HW8rLGfK7k.
- It is 2 MB in size.
- When I compress it with zip using default settings, it comes out at ~1.16 MB, or a little over 50%.
- I have tried to compress it to smaller sizes with various configurations of 7zip and I've been unable to get it significantly smaller than 50% of the file size.
- This file represents the binary output of a fake sensor that I developed for the purpose of this challenge.
- There is no "gotcha" on what type of sensor is used, or what the sensor is being used to detect. It will be obvious if the data is decoded correctly.
- I have not obfuscated the fundamental design or mechanism by which the sensor works, or the way that the sensor data is presented. The process to decode the data from similar sensors is straightforward and well documented online.
- The data is not compressed or encrypted.
Unfortunately, the data in this file comes from eavesdropping on aliens.
Background on LV-478
It was a long overdue upgrade of the surface-to-orbit radio transmitter used for routine communication with the space dock. However, a technician forgot to lower the power level of the transmitter after the upgrade process, and the very first transmission of the upgraded system was at the new (and much higher) maximum power level. The aliens accidentally transmitted an utterly mundane and uninteresting file into deep space.
The technician was fired.
Hundreds of years later...
Background on Earth
Astronomers on Earth noticed a bizarre radio transmission from a nearby star system. In particular, the transmission seemed to use frequency modulation to carry some unknown data stream. They were lucky enough to be recording when this stream was received, and they are therefore confident that they received all of the data.
Furthermore, the astronomers were able to analyze the transmission and assign either a "0"-modulated or "1"-modulated value. The astronomers are confident that they've analyzed the frequencies correctly, and that the data carried by the transmission was definitely binary in nature. The full transmission contained a total of 16800688 (~2 MB) binary values.
The astronomers ran out of funding before they could finish analyzing the data, so an intern in the lab took the readings, packed the bits together, and uploaded it to the internet, in the hopes that someone else would be able to figure it out.
This is the exact Python code that the intern used when packing the data:
def main():
    bits = None
    with open("data.bits", "rb") as f:
        bits = f.read()
    print(len(bits))
    byte_values = []
    byte_value = 0
    bit_idx = 0
    for bit in bits:
        byte_value |= (bit << bit_idx)
        bit_idx += 1
        if bit_idx % 8 == 0:
            byte_values.append(byte_value)
            byte_value = 0
            bit_idx = 0
    with open("mystery_file_difficult.bin", "wb") as f:
        f.write(bytearray(byte_values))
if __name__ == "__main__":
    main()
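To get back from the packed bytes to the original bit stream, the packing has to be undone in the same order: the intern's loop puts the first received bit into the least significant position of each byte. A minimal sketch of the inverse follows. The names `pack_bits` and `unpack_bits` are mine, and `pack_bits` is only a compact restatement of the loop above so the round trip can be checked.

```python
def pack_bits(bits):
    """Restatement of the intern's loop: the first bit of each group of 8
    lands in the least significant position; a trailing partial group is dropped."""
    out = bytearray()
    usable = len(bits) - len(bits) % 8
    for i in range(0, usable, 8):
        value = 0
        for j in range(8):
            value |= bits[i + j] << j
        out.append(value)
    return bytes(out)


def unpack_bits(data):
    """Inverse of pack_bits: emit each byte's bits starting from the least significant."""
    bits = []
    for byte in data:
        for j in range(8):
            bits.append((byte >> j) & 1)
    return bits
```

Calling `unpack_bits` on the contents of `mystery_file_difficult.bin` should give the bits in the order the astronomers recorded them, assuming the intern's code is exactly as posted.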
Challenge Details
Decode the alien transmission in the file uploaded at https://mega.nz/file/VYxE3T5A#xN3524gW4Q68NXK2rmYgTqq6e-2RSaEF2HW8rLGfK7k.
- This is a collaborative challenge.
- The challenge ends on August 27th, or when someone can explain what was sent in the alien's unintentional transmission, whichever occurs first.
- I will award points for partial credit at the end of the challenge for any correct statements that describe the alien's hardware/software systems.
- The points are not worth anything.
I have further discovered that in the bulk of the data, the awkward seventh bit is not in fact part of the values in the image; it is a parity bit. My analysis was confused by counting septets from the beginning of the file, which unfortunately seems to be the wrong alignment.
Analyzing bi-gram statistics on the septets helped figure out that what I previously believed to be the 'highest' bit is in fact the lowest bit of the previous value, and that value always makes the parity of the septet even.
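One way to test a parity hypothesis like this without relying on the header is to try all seven possible septet alignments and see which one makes nearly every 7-bit group sum to an even number. This is a sketch of that idea, not the bi-gram analysis I actually ran; the function names and the synthetic `make_demo_stream` generator are made up for illustration.

```python
import random


def even_parity_fraction(bits, offset, width=7):
    """Fraction of width-bit groups, starting at `offset`, with an even number of 1s."""
    groups = (len(bits) - offset) // width
    if groups <= 0:
        return 0.0
    even = 0
    for g in range(groups):
        start = offset + g * width
        if sum(bits[start:start + width]) % 2 == 0:
            even += 1
    return even / groups


def best_offset(bits, width=7):
    """Alignment (0..width-1) with the highest even-parity fraction."""
    return max(range(width), key=lambda off: even_parity_fraction(bits, off, width))


def make_demo_stream(prelude_len, n_groups, seed=0):
    """Synthetic stream: random prelude, then groups of 6 random data bits
    followed by one bit chosen so that each group has even parity."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(prelude_len)]
    for _ in range(n_groups):
        data = [rng.randint(0, 1) for _ in range(6)]
        bits.extend(data + [sum(data) % 2])
    return bits
```

On random-looking data a wrong alignment scores around 0.5, so the correct one stands out clearly.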
I was trying to ignore the header for now after failing to find values corresponding to the image width and height, but it looks like it's biting me in the ass right now.
Analyzing the file more carefully:
The first 78 bits are the 'prelude', presumably added by the transmission system itself to establish a clock for the signal.
All following bits are divided into septets, with each septet's lowest (last) bit being a parity check.
But: the first 37 septets have inverted (odd) parity (possibly because they are metadata for the transmitting system itself).
The next 50 septets constitute some kind of header (with even parity).
The remaining septets are image data (with even parity), four septets per pixel, most significant first. This means the image has 24-bit depth.
There's a rogue zero bit at the end of the transmission to make things a multiple of 8 bits. Note that the intern's code does not explain the last zero bit: the code would not output extra data at the end (if anything, it would truncate the file), so the zero bit must have been part of the transmission.
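The bit counts in this layout can be checked against the stated transmission length of 16800688 bits, and the septet-to-pixel decoding can be written down in a few lines. This is a sketch of my reading of the layout above; the constant and function names are my own, and the decoding assumes six data bits per septet, most significant first, with the seventh bit as parity.

```python
PRELUDE_BITS = 78
ODD_PARITY_SEPTETS = 37
HEADER_SEPTETS = 50
SEPTETS_PER_PIXEL = 4


def layout_counts(total_bits):
    """Split a bit count into the sections described above."""
    septets, leftover_bits = divmod(total_bits - PRELUDE_BITS, 7)
    image_septets = septets - ODD_PARITY_SEPTETS - HEADER_SEPTETS
    pixels, stray_septets = divmod(image_septets, SEPTETS_PER_PIXEL)
    return {
        "septets": septets,
        "leftover_bits": leftover_bits,
        "image_septets": image_septets,
        "pixels": pixels,
        "stray_septets": stray_septets,
    }


def septet_value(septet):
    """6-bit value from a 7-bit group: data bits first (MSB first), parity bit last."""
    value = 0
    for bit in septet[:6]:
        value = (value << 1) | bit
    return value


def pixel_value(septets):
    """24-bit value from four septets, most significant septet first."""
    value = 0
    for septet in septets:
        value = (value << 6) | septet_value(septet)
    return value
```

`layout_counts(16800688)` gives 2,400,087 septets with exactly 1 leftover bit (the rogue zero), of which 2,400,000 are image septets, or 600,000 pixels with no septets left over.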
Curiously, by removing the data mentioned in the spoiler and packing everything into nice big-endian values, zpaq compresses the file worse than it does the original file. Just the main section of the file compresses to 850650 bytes.