So, about that Apple Belle footer...

You'll note that now the total entropy no longer peaks at 5-6 bits per symbol, but just decreases monotonically.

I should have been more specific in my previous post. I was actually looking at the entropy per symbol, rather than the total. That's why I didn't notice the bug.

> I'll reiterate that when you do these entropy calculations, it's a wise idea to always double-check your work by running your code on a pure-random input.

I did, actually. But, ironically, I had another bug in the way I was parsing my random source into a bit array, which made the distribution of entropy per symbol look much more uniform than it actually was.

> And anyway, shouldn't the correct symbol length be indicated by a low entropy, rather than a high one as you've assumed?

I would expect so, too, but when you're dealing with an unknown message encoded or encrypted by an unknown method, things aren't always intuitive. Maybe something about the data makes it a particularly "pathological" case. The point was mainly that the statistics appeared (again, erroneously) to deviate from those of random data around a particular bit length. That made it a case worth investigating, regardless of what it might turn out to mean. Sometimes the reasons for things only become clear in hindsight.

So that's all a bust, but I still suspect a 5-bit symbol length is likely, for other reasons. I think it's most plausible that this is a fixed-width encoding of some sort. 80 bits is a very short message. You'd want to maximize the length of the message in symbols, but that's balanced against the fact that if the symbol length gets too short, you can't fit an effective dictionary.

To that end, you're probably going to want to restrict yourself to case-insensitive text. You can probably dispense with numbers, spaces, and even some of the less common letters like X and Z. It's probably not possible to get the character set down to 16 or fewer characters, though, so you'll need at least 5 bits.

Regarding Kolmogorov complexity, what I mean is:

Suppose you have a process that produces arbitrary strings consisting of n symbols of z bits each, drawn from a dictionary of size s. For two strings of different lengths (within reason), the Kolmogorov complexity would be approximately the same, since the algorithm is the same and only the description of the length differs, which grows as log2(n). For sufficiently short strings, the Shannon entropy will be uncharacteristically high compared to longer strings. "Sufficiently short," in this case, means any string shorter than approximately sz bits. And, of course, the difference is less pronounced if the Kolmogorov complexity is very small (e.g., endless repetition of the same string).

Odd. I did a quick calculation of the theoretical maximum per-symbol entropy you could expect to see in this case, and your numbers are larger, which indicates either a careless error in your code or a careless error in mine.

Input:

    unsigned char vbits[] = {
        0, 0, 1, 1, 0, 1, 0, 0,  // 0x34
        1, 1, 0, 0, 0, 0, 0, 1,  // 0xC1
        0, 0, 0, 0, 0, 1, 1, 1,  // 0x07
        0, 1, 0, 0, 1, 0, 0, 1,  // 0x49
        0, 0, 1, 0, 1, 0, 0, 0,  // 0x28
        0, 0, 0, 1, 1, 0, 0, 0,  // 0x18
        0, 0, 0, 1, 1, 1, 0, 0,  // 0x1C
        0, 1, 0, 1, 0, 0, 1, 1,  // 0x53
        1, 1, 0, 0, 1, 0, 0, 1,  // 0xC9
        0, 0, 0, 1, 1, 1, 1, 1   // 0x1F
    };

Like this better?

 1-bit symbol entropy: 0.9710 (0.9710 per bit,  77.676 total)
 2-bit symbol entropy: 1.8803 (0.9402 per bit,  75.214 total)
 3-bit symbol entropy: 2.5929 (0.8643 per bit,  69.145 total)
 4-bit symbol entropy: 3.2842 (0.8210 per bit,  65.684 total)
 5-bit symbol entropy: 3.5000 (0.7000 per bit,  56.000 total)
 6-bit symbol entropy: 3.1369 (0.5228 per bit,  41.826 total)
 7-bit symbol entropy: 3.0328 (0.4333 per bit,  34.660 total)
 8-bit symbol entropy: 3.3219 (0.4152 per bit,  33.219 total)
 9-bit symbol entropy: 2.8368 (0.3152 per bit,  25.216 total)
10-bit symbol entropy: 3.0000 (0.3000 per bit,  24.000 total)
11-bit symbol entropy: 2.7552 (0.2505 per bit,  20.037 total)
12-bit symbol entropy: 2.4633 (0.2053 per bit,  16.422 total)
13-bit symbol entropy: 2.5560 (0.1966 per bit,  15.729 total)
14-bit symbol entropy: 2.2003 (0.1572 per bit,  12.573 total)
15-bit symbol entropy: 2.2641 (0.1509 per bit,  12.075 total)
16-bit symbol entropy: 2.3219 (0.1451 per bit,  11.610 total)

I had a boundary condition issue and it was counting one extra symbol for fractional cases.

/r/MLPLounge Thread