the diary of alkaline penelope

Stop listening to Taylor Swift and start listening to the yt-dl source code

Posted: 2020-11-03 00:58

A couple of weeks ago, the RIAA forced GitHub to take down the code repository for youtube-dl, a Python program which downloads videos from YouTube (and many, many other sites) onto a user’s computer. Read all about it on Ars Technica.

What the RIAA wants to stop is people ripping their overpaid stars’ (Official) music videos and audio streams from YouTube. Admittedly, this act does rob the labels - ahem - musicians of YouTube advertising revenue. So their lawyers got together and hatched a genius plan… let’s stop anyone from downloading any youtube video ever again!! that’ll show those mean pirates!!!1

Uh, no? Fuck you?

This blunt act of corporate censorship is wrong for many reasons, not least because downloading a video off YouTube is not automatically a violation of anyone’s copyright. For one thing, there are lots of videos under a Creative Commons licence permitting redistribution of the content, which implies that it’s okay to make an offline copy. The Massachusetts Institute of Technology, for example, has a project called MIT OpenCourseWare, in which lecture recordings from many of its classes are uploaded to YT for anyone to watch and learn from. I suspect this kind of commitment to openness and freedom just melts everyone’s brains at the RIAA. They live in a world where you must be losing if you are not wringing the maximum amount of money and control out of your creations, and their behaviour only looks more and more anachronistically bizarre as this world crumbles.

Code As Noise

Image: lain, from fauux.neocities.org

Anyway, enough huffing and puffing. Internet users have been taking action. For example, you can now obtain the full source code for youtube-dl from the images attached to this tweet by following a few commands. This inspired me to do something similar with audio.

WARNING: TURN DOWN THE VOLUME BEFORE PLAYING THIS FILE. I’M SRS: youtube-dl.flac

If you download the file, save it somewhere as youtube-dl.flac, and install ffmpeg, you can recover the latest (2020.11.01) release of youtube-dl as follows. First, run the command:

$ ffmpeg -i youtube-dl.flac -c pcm_s16le -f s16le youtube-dl-2020.11.01.1.tar.gz

This turns the FLAC-encoded audio file back into a compressed archive, which is how the youtube-dl project distributes the source code on its website.
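
If you want some reassurance that the conversion worked before unpacking anything, you can check that the result actually looks like a gzip archive. Here is a minimal Python sketch of that check. The function name looks_like_gzip is my own invention for illustration, and the sample data is a stand-in generated in memory, not the real archive.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Stand-in for the recovered archive: gzip a few bytes in memory.
sample = gzip.compress(b"pretend this is a tarball")

print(looks_like_gzip(sample))       # prints True
print(looks_like_gzip(b"fLaC\x00"))  # prints False: FLAC files start with "fLaC"
```

For a file on disk, reading the first two bytes in binary mode and passing them to the same function should be enough.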

Now all you have to do is decompress the archive to get the contents:

$ tar -xf youtube-dl-2020.11.01.1.tar.gz

The code will now be in a folder called youtube-dl. A pre-built executable (also called youtube-dl, wow) will be in that folder, and you can run it in order to learn how to use it - for morally acceptable purposes, of course:

$ cd youtube-dl
$ ./youtube-dl

Have fun.

A very basic form of steganography

Steganography refers to techniques which disguise some important information in a context that does not suggest the presence of the information. The purpose of steganography is to smuggle the information past anyone who is not expecting it to be there, but also to allow anyone who does expect something to find it and extract it easily. In this case, the FLAC file above appears to store sound. You could rename it something like Analogue TV Static.flac to boost this impression, and sever any obvious association with source code. Nevertheless, the code lurks within, and is accessible to those who know the commands.

How it works

Every file is a sequence of bytes. The youtube-dl source code, although spread across many files, can be compressed into a single archive file. Thus all the information making up the source code can be treated as one stream of bytes.

Digital audio files are also sequences of bytes. Simplifying a bit, these bytes describe a succession of values, each within the range -32,768 to 32,767, which represent the movement of the speakers required to play the audio. There is a multitude of encodings and formats for storing these numbers; one of the simplest is to store the numbers one after another, with every successive 16-bit (two-byte) block representing one number. (In technical terms, I am talking about mono, 16-bit PCM.)
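
To make the "two bytes per number" idea concrete, here is a small Python sketch that reads a handful of bytes as signed 16-bit samples and turns them back again. It assumes little-endian byte order (the "le" in pcm_s16le). The function names are hypothetical, and padding an odd-length input with a zero byte is my own assumption, not something ffmpeg is claimed to do.

```python
import struct


def bytes_to_samples(data: bytes) -> list:
    """Interpret data as little-endian signed 16-bit samples (mono PCM)."""
    if len(data) % 2:
        data += b"\x00"  # assumption: pad odd-length input with a zero byte
    count = len(data) // 2
    return list(struct.unpack(f"<{count}h", data))


def samples_to_bytes(samples: list) -> bytes:
    """Pack signed 16-bit samples back into little-endian bytes."""
    return struct.pack(f"<{len(samples)}h", *samples)


raw = bytes([0x00, 0x00, 0xFF, 0x7F, 0x00, 0x80, 0xFF, 0xFF])
samples = bytes_to_samples(raw)
print(samples)  # prints [0, 32767, -32768, -1]
```

Nothing about the bytes changes here; the same eight bytes are simply read as four numbers covering both ends of the range.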

Consequently, we can interpret the youtube-dl archive as 16-bit PCM, that is, as a series of stored 16-bit numbers, et voilà! we have raw digital audio. At this point, we haven’t changed the archive at all. It is just being interpreted differently, as audio rather than as compressed text and data.

However, raw audio is not likely to be recognised as audio when you try to open it in a browser or from a file explorer. For a degree of extra steganographic security, it is a good idea to re-encode the youtube-dl archive in a lossless audio format, such as FLAC. The re-encoded version will not be identical to the original; in fact, it will be a slightly bigger file. Nevertheless, the contents of the original (i.e. the archive) are exactly recoverable from any losslessly compressed version. That’s what lossless compression means: no information is lost.
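
The same round trip can be sketched in pure Python. Python's standard library has no FLAC encoder, so this sketch swaps in WAV, which is uncompressed rather than losslessly compressed, but it shows the same principle: the container gets bigger, and the payload comes back byte for byte. The function names and the 44100 Hz sample rate are assumptions of mine, and the sketch simply refuses odd-length payloads rather than deciding how to pad them.

```python
import io
import wave


def bytes_to_wav(payload: bytes, sample_rate: int = 44100) -> bytes:
    """Wrap payload in a mono 16-bit WAV container without altering it."""
    if len(payload) % 2:
        raise ValueError("payload length must be even for 16-bit samples")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(payload)   # the payload becomes the audio frames
    return buffer.getvalue()


def wav_to_bytes(wav_data: bytes) -> bytes:
    """Read every audio frame back out of a WAV container."""
    with wave.open(io.BytesIO(wav_data), "rb") as wav:
        return wav.readframes(wav.getnframes())


payload = bytes(range(256)) * 4      # stand-in for an archive
disguised = bytes_to_wav(payload)
recovered = wav_to_bytes(disguised)
print(recovered == payload)          # prints True
print(len(disguised) > len(payload)) # prints True: the header adds a few bytes
```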

The commands from earlier in this post recover an archive file from a FLAC file. If you want to disguise a file of your own as audio, it’s very simple:

$ ffmpeg -c pcm_s16le -f s16le -i super-secret.tar.gz -c flac nothing-to-see-here.flac

The input file does not have to be a .tar.gz archive, by the way, but archives do allow you to wrap up as many files as you wish into one bundle before they get the steganographic treatment.
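
If you would rather build the bundle from a script than from the tar command, Python's tarfile module can do it. This is only a sketch under my own assumptions: the helper names bundle and unbundle are made up, the file names and contents are placeholders, and everything happens in memory rather than on disk.

```python
import io
import tarfile


def bundle(files: dict) -> bytes:
    """Pack a {name: content} mapping into one in-memory .tar.gz archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)  # tar needs the size up front
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def unbundle(data: bytes) -> dict:
    """Unpack a .tar.gz archive back into a {name: content} mapping."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }


files = {"notes.txt": b"hello", "data.bin": bytes(range(10))}
archive_bytes = bundle(files)
print(unbundle(archive_bytes) == files)  # prints True
```

The resulting bytes are what you would then hand to ffmpeg (after writing them to a file) for the audio disguise.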