Hiding data from humans and computers in GIF files
That’s right. No extra data, no changes to the image. All the pixels and colours are exactly the same. Any data you want inside a GIF without any trace. But how? I’m glad you asked.
During highschool I was interested in steganography, the study of hiding messages in plain sight. I had just finished a simulation project that encoded GIFs, so I was still in the headspace of the GIF codec. I began tinkering with the different elements and found an area I wanted to manipulate. Here’s how I encoded 127 bytes into a GIF without changing the file size.
Why is this special?
Many steganography algorithms and strategies change the pixels of an image in such a way that it is not visible to the human eye (for example, least significant bit steganography). However, to a computer this difference is visible when compared to the original image. This algorithm produces a GIF with identical pixels to its original, and does not take advantage of any metadata in this image either. Nothing is added, no pixels are changed, no metadata is changed. Just the order of a couple bytes. This makes detection much harder, and is why this algorithm is special.
The GIF Codec: An Overview
The GIF codec is simple but powerful. It uses a very primitive form of compression, called a color palette. This limits the number of colours your images can use to reduce the size of each pixel. Instead of storing an R,G,B for each pixel, it simply stores an index (0, 1, … 255) corresponding to a colour in the palette. Here’s an example palette.
To the computer, this is what an image with this palette would look like.
A GIF can either use a global palette, or local palettes. Local palettes are different for each frame, allowing for more colours and less compression, while a global palette is defined once for the entire animation, reducing redundancy of colours between frames to 0. This script focused on global palettes.
The Global Palette
The secret to my steganography lies within the global palette. Although the colours in the 256 element array are specific, their order is completely arbitrary. With this we have our first glimpse of the strategy: rearrange the elements in a certain way such that the ordering encodes some sort of message. We can do this by first sorting the array of colours as a sort of base key. The colours are sorted lexicographically, red taking the highest priority, and blue taking the lowest priority. As long as no colour occurs twice, we can always have a winner between two colours which produces this definitive order.
Now that the array is sorted, we now imagine a new empty array with indices from 0 to 255. When placing the first element of the sorted array, we may choose index 0 to 255, to encode 256 base 10 digits. Then when we place the second element, we have 0 to 254 for 255 base 10 digits encoded (ignoring the added element and viewing the array as a contiguous). We continue this pattern all the way until there is only 1 option to place the element in, with 0 base 10 digits encoded for element 256 of the sorted array. The sorting is important because it tells us the order of our digits. Wherever the first element of the sorted array is in the encoded array tells us what our first decoded value is.
When you line up this pattern, the low numbers can be matched with the high numbers to always equal 255, or 1 byte.
So what we are left with are 128 bytes of data. To encode the data, simply split the numbers into their large and small counter parts based on their order in the sequence. For example, if we are encoding the third number, we have already used 2 indices. This means that our max index is 253, or 254 possible base 10 numbers. This will be our “large” number. The small number then represents the missing 3, as 253 + 3 = 256. So if our number was 253, the 3 portion would be set to 0. If our number was 254, the 3 portion would be set to 1, and so on. This way we can still create a maximum of 255 and minimum of 0 (a whole byte).
But wait, what about the number in the middle? Given that the first number is 256 and the last is 0, that means in the middle we will have a 128 bit option with no counterpart. This exception evaded me for quite some time until a group of peers I was presenting my idea to came across the idea before me. So, it is in fact only 127 whole bytes of data we are able to encode with an odd 128 bit number in the middle.
After you have the data available to you like an array, you can stop thinking about steganography and open your mind up to all the possibilities of what you can store in that space! With specialized compression engines like smaz you can store up to 256+ characters. Additionally, you can encrypt these messages so they aren’t publicly decodable.
Reflection
After making this, I thought about it for some time. It’s quite an interesting topic that I never hear talked about in Computer Science: the data encoded in order. There is an immense amount of data encoded in the order in which you store things. Some choose to use it, some don’t. For example, a sorted list has an immense use case for the data it is storing with its indices: logn search time! In this case, the GIF codec was not making use of the ordering.
This kind of thing makes me wonder. There are petabytes of data flowing through the internet on a daily basis. What secrets are hidden within them? Even developers with trained eyes for data would see this pass straight by and think nothing of it. Maybe this algorithm is already in use, but is ill discussed. After all, the whole purpose of the algorithm is for it to be undetectable.
(The code this blog is about can be found here)