For the last three months I’ve been working on a project that combines my twin passions: music and computer programming. The problem is an interesting one; its formal name is ‘automatic music transcription’ and it is much harder than I ever thought it would be. I’m slowly getting a grip on the problem space after a ton of exploratory programming. It seems so simple, doesn’t it? You just listen to music: you hear piano, voice, guitar and a hundred other optional instruments, all collapsed into a single variable, a 16-bit value that indicates the amplitude at that particular point in time, sampled at a high enough rate (say CD grade, 44100 Hz) for high-fidelity playback.
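To make that concrete, here is a tiny illustrative sketch of what that single stream of samples looks like; the use of Python with scipy, and the file name piano.wav, are placeholders for this example rather than part of my actual toolchain:

```python
import numpy as np
from scipy.io import wavfile

# Read a (hypothetical) mono, 16-bit recording sampled at 44100 Hz.
sample_rate, samples = wavfile.read("piano.wav")

# Every instrument, every note, every harmonic is squashed into this
# single stream of signed 16-bit integers, one value per 1/44100 s.
print(sample_rate)      # e.g. 44100
print(samples.dtype)    # int16
print(samples[:10])     # the first ten amplitude values
```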
Reversing that should be child’s play: just listen to what is being played and write it down. It turns out that teaching a computer to listen to music is a rather hard problem. For one, that collapse loses a lot of information, and it also causes a lot of input signals to cancel each other out. That four-note chord you just played still comes out as a single stream of samples. FFT to the rescue. The Fast Fourier Transform is one of the workhorses that signal-processing software uses to turn that collapsed signal back into something that makes sense from a musical perspective: frequency and intensity. This, again in theory, allows you to distinguish between the various signals in the input stream.
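As a rough sketch of that step (numpy here; the window length and the synthetic two-note ‘chord’ are illustrative choices, not how my utility actually slices the audio):

```python
import numpy as np

def spectrum(window, sample_rate=44100):
    """Return (frequency, magnitude) arrays for one window of samples."""
    # A Hann window tapers the edges so that cutting the signal into
    # chunks does not smear energy across all the frequency bins.
    windowed = window * np.hanning(len(window))
    magnitudes = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
    return freqs, magnitudes

# Illustration: a 4096-sample window (~93 ms at 44.1 kHz) containing
# an A4 (440 Hz) and an E5 (~659 Hz) played together.
t = np.arange(4096) / 44100.0
chord = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 659.25 * t)
freqs, mags = spectrum(chord)
for f in (440.0, 659.25):
    b = int(round(f * 4096 / 44100))
    print(f"{f:7.2f} Hz -> bin {b}, magnitude {mags[b]:.1f}")
```

Both notes show up as peaks near their own bins, which is exactly the separation the raw samples no longer give you.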
Each note has its own set of bins in the FFT output, and you can look at those to figure out whether a note is active at any given time. Of course, it is, again, not that simple. Every tone has its harmonics, and those are sometimes even louder than the fundamental (for instance, up to roughly G1 the soundboard of a piano does not do much, so you won’t hear the fundamental but you will hear the overtones and their modulation). So you need to distinguish between actual notes and harmonics. And of course you also need to make sure you don’t strike a key over and over again when it is simply a sustained note.
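One naive way to picture the note-versus-harmonic problem, purely as an illustration and not how my utility actually does it: map each piano key to its fundamental frequency, look up the FFT bin that frequency lands in, and only accept a candidate note if it cannot be explained as a harmonic of a lower, louder candidate. The threshold and window size below are made up for the example:

```python
def key_frequency(midi_note):
    """Fundamental frequency of a MIDI note (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((midi_note - 69) / 12.0)

def fft_bin(freq, window_size=4096, sample_rate=44100):
    """Index of the FFT bin a given frequency falls into."""
    return int(round(freq * window_size / sample_rate))

def candidate_notes(magnitudes, threshold=10.0):
    """Very naive detector over one window's FFT magnitudes: a key is
    'on' if its fundamental bin is loud and it is not merely a harmonic
    of a louder, lower candidate."""
    active = []
    for note in range(21, 109):            # A0..C8, the piano range
        f0 = key_frequency(note)
        energy = magnitudes[fft_bin(f0)]
        if energy < threshold:
            continue
        # Is f0 (roughly) an integer multiple of a louder, lower candidate?
        is_harmonic = any(
            abs(f0 / key_frequency(low) - round(f0 / key_frequency(low))) < 0.03
            and magnitudes[fft_bin(key_frequency(low))] > energy
            for low in active
        )
        if not is_harmonic:
            active.append(note)
    return active
```

Telling a sustained note from a re-struck one needs another layer on top of this, for instance comparing consecutive analysis windows and only emitting a new note-on when the energy jumps rather than decays.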
The workflow for a particular input file looks like this:
mp3 file -> wav file (sox)
or
midi file -> wav file (fluidsynth)
Then
wav file -> midi file (my utility)
midi file -> wav file (fluidsynth once more)
wav file -> mp3 file (using lame)
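In script form that round trip looks roughly like the sketch below (Python calling the same tools; the soundfont path and file names are placeholders, and the flags are the usual sox/fluidsynth/lame invocations rather than my exact commands):

```python
import subprocess

SOUNDFONT = "piano.sf2"   # hypothetical path to a piano soundfont

def mp3_to_wav(src, dst):
    # sox converts between formats based on the file extensions.
    subprocess.run(["sox", src, dst], check=True)

def midi_to_wav(src, dst):
    # fluidsynth renders the MIDI file through a soundfont straight to disk.
    subprocess.run(
        ["fluidsynth", "-ni", SOUNDFONT, src, "-F", dst, "-r", "44100"],
        check=True)

def wav_to_mp3(src, dst):
    subprocess.run(["lame", src, dst], check=True)

# Round trip: render a known MIDI file, transcribe it, render the result
# and compare by ear (transcribe() standing in for my utility).
midi_to_wav("reference.mid", "reference.wav")
# transcribe("reference.wav", "transcribed.mid")   # the interesting part
midi_to_wav("transcribed.mid", "transcribed.wav")
wav_to_mp3("transcribed.wav", "transcribed.mp3")
```

Starting from a MIDI file is handy because it gives you a ground truth to compare the transcription against; starting from an mp3 is the real-world case.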
To keep the scope manageable I’ve limited the project to solo piano for the moment. There are a lot of ways in which it could be expanded beyond that, but I think if I go too broad at the beginning I’ll just end up getting overwhelmed and give up. And just being able to reliably decode polyphonic solo piano would be a really neat result. Typical applications of this technology: piano tuition, generating scores for pieces for which no score is available (the heart of automatic music transcription), re-instrumentation, transposition, MIDI out for a regular piano and so on.
At the moment I’m still far away from a result that is good enough for actual work. But here is some of the output to show you the state of the program as it is right now:
As you can see there is still a very long way to go. I’ve given myself about a year for this project. My original year-end goal was about 50% accuracy, and I’m already above that right now, but progress is getting slower: tweaking the software to fix one problem usually creates another (or two), so it is much harder now to make a change that does not cause a regression for some other test. Even so, I still have a ton of hopefully good ideas on how to keep going.
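For what it’s worth, ‘accuracy’ here is easiest to picture as note-level scoring against the reference MIDI: count a predicted note as a hit when the same pitch starts within a few tens of milliseconds of a reference note. The sketch below just illustrates that idea; it is not literally my test harness:

```python
def note_accuracy(reference, predicted, tolerance=0.05):
    """Score a transcription as note-level precision, recall and F1.

    Both arguments are lists of (midi_note, onset_seconds) events; a
    predicted event counts as a hit if the same pitch starts within
    `tolerance` seconds of an as-yet-unmatched reference event."""
    unmatched = list(reference)
    hits = 0
    for pitch, onset in predicted:
        for ref in unmatched:
            if ref[0] == pitch and abs(ref[1] - onset) <= tolerance:
                unmatched.remove(ref)
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```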