Transcribing music from audio files

While working on one of my side projects, I realized that if I want to work with audio inputs, I may need to do a little bit of audio processing. The last time I did any signal processing was in the final year of my undergraduate degree, using Matlab. As I no longer own Matlab, I decided to check out sound processing/analysis libraries for R (seewave, tuneR, audio, and sound). While browsing through the reference manuals, I got excited by some of the functions these libraries offer and couldn’t resist playing with them (frankly, I often get distracted like this).

The first set of functions I explored comes with the tuneR package and provides an option to derive musical notes from audio frequencies. I already knew that automatic transcription of music is a non-trivial task and, hence, did not expect to be able to perfectly transcribe every music file. However, I still wanted to find out how sensitive the tuneR functions are and what they can or cannot retrieve.

First, a quick background. All audio recordings contain audio frequencies sampled at a given recording (sampling) rate. For CDs, the standard sampling rate is 44,100 Hz, meaning that 44,100 samples of the audio signal are taken every second (hence, the length of a recording equals the number of samples divided by the sampling rate). Each musical note corresponds to a specific frequency, e.g., the frequency of middle C is 261.63 Hz. Thus, to transcribe an audio recording into a set of musical notes, one needs to identify the corresponding frequencies. Usually, frequencies are not identified for each of the original samples; instead, they are calculated for a group of adjacent samples, e.g., one frequency estimate for every 1,000 samples. A standard way to calculate the frequency content of an audio recording is through the Fourier transform (FT). Determining the optimal number of samples per frequency estimate is difficult and, consequently, it is difficult to derive a beat and/or tempo solely from audio recordings.
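As a quick illustration of the arithmetic above (the numbers here are standard values from equal-temperament tuning, not anything specific to tuneR):

```r
# Duration of a recording: number of samples divided by the sampling rate
n_samples <- 441000
sample_rate <- 44100                       # CD-quality sampling rate, in Hz
duration_sec <- n_samples / sample_rate    # 10 seconds of audio

# Equal-temperament note frequencies: each semitone multiplies the
# frequency by 2^(1/12), with A4 fixed at 440 Hz.
note_freq <- function(semitones_from_a4) 440 * 2^(semitones_from_a4 / 12)

note_freq(-9)   # middle C (C4), 9 semitones below A4: ~261.63 Hz
note_freq(3)    # C5, one octave above middle C: ~523.25 Hz
```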

To test the tuneR functions I used two audio samples: a G major scale in mono (on violin, thanks Kenny) and a sample from Für Elise in stereo (on piano, from my music collection).

The resulting music annotation for the G major scale was quite impressive (below, right): the transcribed notes almost exactly matched the notes originally played.

[Figure: G major scale. Red lines – transcribed notes; gray lines – notes played]

[Figure: G major scale after applying the “smoother” function]

Interestingly, when I applied the smoothing function (“smoother”) to the estimated notes, the resulting annotation was not equally impressive (see left). My assumption is that the “median” function, currently the only option for smoothing, is the main reason for this not-so-great performance.
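To see why median smoothing can hurt a note sequence, here is a small base-R sketch. It uses stats::runmed as a stand-in for tuneR’s smoother (whose internals I have not inspected), so treat it as an illustration of median smoothing in general, not of tuneR specifically:

```r
# A toy note sequence (semitone indices): one short passing note (7)
# sandwiched between two held notes.
notes <- c(0, 0, 0, 0, 7, 4, 4, 4, 4)

# A running median with window 5 replaces the single-frame note with
# the median of its neighbourhood, so the quick note vanishes from
# the transcription even though it was detected correctly.
smoothed <- stats::runmed(notes, k = 5)
smoothed[5]        # now 4, not 7
```

This is exactly the trade-off: the median suppresses one-frame glitches in the frequency estimates, but it suppresses genuinely short notes in the same way.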

As some of the tuneR functions required for music transcription currently work only with mono sounds, my second test used a stereo sound from an MP3 file. To transform the stereo sound into mono, I had three options: use the left channel, use the right channel, or use the average of both channels.


Left (top left), right (bottom left), and averaged (average value of the left and right) channels

The averaged-channel option seemed the most intuitive to me. However, it was hard to say whether averaging would enhance the signal or introduce more noise. As it takes only a few seconds to get the annotation from a music file, I decided to test all three versions.

The resulting annotation for Für Elise was not as impressive as for the G major scale (as expected), but the overall results were still quite good, especially considering the difference in tempo and complexity between the two pieces. You can Google the Für Elise sheet music if you want to compare notes.

The averaged channel produced the best results. Notes derived from the right channel were not far off either. On the other hand, notes derived from the left channel did not match the original notes especially well. I don’t know whether there is a specific technical reason behind this or it is just a consequence of the audio compression in the MP3 file.


Annotation derived from the averaged channel


Annotation derived from the right channel


Annotation derived from the left channel

Here is the R code I used to transcribe the music:


library(tuneR)

transcribeMusic <- function(wavFile, widthSample = 4096, expNotes = NULL) {
  #See details about the wavFile, plot it, and/or play it
  #play(wavFile, "/usr/bin/mplayer")
  perioWav <- periodogram(wavFile, width = widthSample)
  freqWav <- FF(perioWav)
  noteWav <- noteFromFF(freqWav)
  #Smooth frequencies
  #noteWav <- smoother(noteWav)
  melodyplot(perioWav, observed = noteWav, expected = expNotes, plotenergy = FALSE)
  #Print out the note names, dropping unvoiced frames
  #noteWavNames <- notenames(noteWav[!is.na(noteWav)])
}

#Test1 (mono)
testSound <- readWave("G-scale.wav")
scaleNotesFreqs <- c(NA, NA, NA, 392.0, 392.0, NA, 220.0, NA, NA, 246.9, NA, 261.6, 261.6, NA, 293.7, 293.7, NA, 329.6, 329.6, NA, 370.0, 370.0, NA, 392.0, NA)
scaleNotes <- noteFromFF(scaleNotesFreqs)
transcribeMusic(testSound, expNotes = scaleNotes)

#Now let's see what happens if the music is in stereo
#Test2 - 1
songHlp <- readMP3("FE.mp3")
testSound_stereo <- extractWave(songHlp, from = 0, to = 12, xunit = "time")
testSound <- mono(testSound_stereo, "both") #average left and right channel

#Test2 - 2 (use each channel separately)
testSound_lc <- channel(testSound_stereo, which = "left")
testSound_rc <- channel(testSound_stereo, which = "right")
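Each of the three mono versions can then be run through the same transcription function for comparison (a sketch: transcribeMusic and the testSound_* objects are defined in the code above, so this will only run after that code, with the audio files in place):

```r
# Transcribe each mono version with the same settings and compare
# the resulting melody plots by eye
transcribeMusic(testSound)      # averaged left and right channels
transcribeMusic(testSound_lc)   # left channel only
transcribeMusic(testSound_rc)   # right channel only
```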

