How to perform note tracking in librosa

In music theory, a note is a musical sound characterized by its pitch and duration. Pitch refers to the perceived highness or lowness of a sound. Each note belongs to a pitch class, a concept in music theory that groups together all musical notes with the same name; for example, C3 and C4 both belong to the pitch class C. Librosa is a Python library for audio and musical analysis. It provides functions for a range of tasks on audio data, making it a useful tool for audio preprocessing and musical analysis.
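To make the idea of a pitch class concrete, here is a minimal sketch. It assumes the common MIDI numbering in which C4 is note 60 and every octave spans 12 semitones. The helper name pitch_class is hypothetical and is not part of librosa.

```python
# Pitch-class names in ascending semitone order, starting from C.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(midi_note):
    """Return the pitch-class name for a MIDI note number (assumes C4 = 60)."""
    return PITCH_CLASSES[midi_note % 12]

# C3 (MIDI 48) and C4 (MIDI 60) are an octave apart but share a pitch class.
print(pitch_class(48), pitch_class(60))
# A4 (MIDI 69) falls in the pitch class A.
print(pitch_class(69))
```

Because the octave is discarded by the modulo operation, only 12 distinct labels are possible, which is why a chroma representation has 12 rows.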

How it works

To track the notes in an audio file, we detect its onsets. An onset is the beginning or starting point of a musical sound or note. The elapsed time between one onset and the next gives the duration of a note, and the intensity of each pitch class at the onset lets us classify it. To do this, we can use the onset_detect() function to get the frames of all the onsets in the audio.
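The conversion from onset frames to elapsed time is simple arithmetic. The sketch below assumes librosa's default sampling rate of 22,050 Hz and default hop length of 512 samples. The helper names are hypothetical, and the onset frames are made-up values rather than output from a real recording.

```python
def frame_to_seconds(frame, sr=22050, hop_length=512):
    """Convert a frame index to seconds: each frame advances by hop_length samples."""
    return frame * hop_length / sr

def inter_onset_intervals(onset_frames, sr=22050, hop_length=512):
    """Time elapsed between each onset and the next one, in seconds."""
    times = [frame_to_seconds(f, sr, hop_length) for f in onset_frames]
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Illustrative onset frames (assumed values, not taken from a real file).
onset_frames = [3, 46, 89]
print([round(frame_to_seconds(f), 3) for f in onset_frames])
print([round(d, 3) for d in inter_onset_intervals(onset_frames)])
```

With default arguments, librosa.frames_to_time() should give the same times, since it also multiplies the frame index by the hop length and divides by the sampling rate.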

Graphical representation of an onset

We also need the chroma values, a set of musical features that represent the tonal content of an audio signal or piece of music. We compute them using the chroma_stft() function provided by librosa. This function returns the chromagram computed from a short-time Fourier transform (STFT) representation of the audio signal. The chromagram represents the energy distribution of pitch classes over time. The mathematical equation for the short-time Fourier transform is as follows:

$$\mathrm{STFT}(t, f) = \int_{-\infty}^{\infty} x(\tau)\, w(t - \tau)\, e^{-j 2\pi f \tau}\, d\tau$$

In this equation:

  • $\mathrm{STFT}(t, f)$ represents the short-time Fourier transform at time $t$ and frequency $f$.

  • $x(\tau)$ is the input signal.

  • $w(t - \tau)$ is the window function applied to the signal to create a short segment centered at time $t$.

  • $e^{-j 2\pi f \tau}$ represents the complex exponential used to analyze the frequency content at frequency $f$.

  • $\int$ denotes the integral, indicating that the STFT is computed as the integral of the product of the signal, the window function, and the complex exponential over a small time window.
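As a rough numerical illustration of these terms, the sketch below replaces the integral with a finite sum over one windowed segment of a sampled sine wave. It is plain Python written for readability, not the routine librosa uses internally. The function names, the 8 kHz sampling rate, and the 64-sample window are all assumptions chosen for the example, and the phase is measured from the start of the segment.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n (plays the role of w)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def stft_frame(signal, start, window):
    """Discrete STFT of one segment: sum of x * w * exp(-j*2*pi*k*i/n)."""
    n = len(window)
    segment = [signal[start + i] * window[i] for i in range(n)]
    return [
        sum(segment[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        for k in range(n // 2 + 1)
    ]

sr = 8000       # assumed sampling rate in Hz
freq = 1000.0   # frequency of the test tone in Hz
n_fft = 64      # window length in samples

# A pure sine wave stands in for the input signal x.
signal = [math.sin(2 * math.pi * freq * t / sr) for t in range(256)]

spectrum = stft_frame(signal, start=64, window=hann(n_fft))
magnitudes = [abs(c) for c in spectrum]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_freq = peak_bin * sr / n_fft   # each bin is sr / n_fft Hz wide
print(peak_bin, peak_freq)
```

The strongest bin corresponds to the frequency of the test tone. A chromagram goes one step further and folds the energy of such bins into the 12 pitch classes.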

Once we have the chroma values, we can use the frames returned by the onset_detect() function to find the pitch class with the maximum chroma value at each of these frames, and we can use the frames_to_time() function to convert the onset frames to times, from which we calculate the duration of each note. This way, we obtain each note's dominant pitch class and its duration.
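The lookup described above can be sketched without any audio at all. The example below builds a tiny hand-made chroma matrix (12 pitch classes by 4 frames) and picks the strongest pitch class at two assumed onset frames. The helper name strongest_pitch_class and all values are hypothetical. With a NumPy array, the same lookup is written as chroma[:, frame].argmax().

```python
# Row order assumed to start at C, which is librosa's default chroma layout.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def strongest_pitch_class(chroma, frame):
    """Index (0-11) of the row with the largest value in the given frame."""
    energies = [row[frame] for row in chroma]
    return max(range(len(energies)), key=energies.__getitem__)

# Toy chroma matrix: low background energy everywhere.
n_frames = 4
chroma = [[0.1] * n_frames for _ in range(12)]
chroma[0][1] = 1.0   # C dominates frame 1
chroma[7][3] = 1.0   # G dominates frame 3

# Assumed onset frames for the toy example.
onset_frames = [1, 3]
for frame in onset_frames:
    index = strongest_pitch_class(chroma, frame)
    print(frame, index, PITCH_CLASSES[index])
```

Mapping the index back to a name, as done in the last line, makes the printed result easier to read than the bare index.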

Code

The following code tracks the notes in a trumpet audio clip:

```python
import librosa

# Loading the audio file
audio_file = '../trumpet.ogg'
y, sr = librosa.load(audio_file)

# Extracting the chroma features and onsets
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
onset_frames = librosa.onset.onset_detect(y=y, sr=sr)

first = True
notes = []
for onset in onset_frames:
    chroma_at_onset = chroma[:, onset]
    note_pitch = chroma_at_onset.argmax()
    # For all other notes
    if not first:
        note_duration = librosa.frames_to_time(onset, sr=sr)
        notes.append((note_pitch, onset, note_duration - prev_note_duration))
        prev_note_duration = note_duration
    # For the first note
    else:
        prev_note_duration = librosa.frames_to_time(onset, sr=sr)
        first = False

print("Note pitch \t Onset frame \t Note duration")
for entry in notes:
    print(entry[0], '\t\t', entry[1], '\t\t', entry[2])
```

  • Lines 4–5: We load the audio file we will use for this task. The trumpet recording is freely available and is one of the example audio files provided with the librosa library.

  • Lines 8–9: We extract the chroma features and onset frames from the audio file. The onset_detect() function returns an array of frame indices at which a new musical note starts.

  • Lines 13–15: Here, we go through the onset frames and pick the index of the maximum chroma value at each frame. The note_pitch value is an index from 0 to 11 that identifies which of the 12 pitch classes has the most energy at that onset. By default, index 0 corresponds to C.

  • Lines 17–24: If this is the first onset we encounter, we convert its frame to a time using the librosa.frames_to_time() function, store it, and set the first flag to False. For every later onset, we convert its frame to a time, subtract the time of the previous onset, and record the difference as the note duration along with the pitch and onset frame. The current onset time then becomes the reference for the next iteration.
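One limitation of the loop above is that each recorded interval is measured back to the previous onset, so the first onset never gets an entry and the final note has no measured length. A possible alternative, sketched below with a hypothetical helper and made-up input values, pairs each onset with the following one and lets the last note run to the end of the clip. The clip length is passed in explicitly. With librosa, it could be obtained from a call such as librosa.get_duration(y=y, sr=sr).

```python
def track_notes(pitches, onset_times, clip_duration):
    """Pair every onset with the time until the next onset.

    pitches       -- pitch-class index (0-11) detected at each onset
    onset_times   -- onset times in seconds, in ascending order
    clip_duration -- total length of the audio in seconds
    """
    # Each note ends at the next onset; the last note ends with the clip.
    end_times = list(onset_times[1:]) + [clip_duration]
    return [
        (pitch, start, end - start)
        for pitch, start, end in zip(pitches, onset_times, end_times)
    ]

# Illustrative values (assumed, not taken from a real recording).
notes = track_notes(pitches=[0, 4, 7], onset_times=[0.5, 1.0, 2.0], clip_duration=3.0)
for pitch, start, duration in notes:
    print(pitch, start, duration)
```

This treats a note as lasting until the next onset, which overestimates the duration when there is silence between notes. Detecting where a note actually ends would need additional analysis that is outside the scope of this sketch.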

Output

In the output, we get a table of the notes tracked in the audio. Each row shows the pitch-class index of a note, so we can distinguish the different notes from one another. The output also gives the onset frame, which tells us the frame at which the note starts, and the note duration in seconds.

Conclusion

In conclusion, librosa makes basic note tracking straightforward. The onset_detect() function locates where notes begin, the chroma_stft() function indicates which pitch class dominates at each onset, and the frames_to_time() function converts onset frames into times from which note durations follow.

Note tracking of this kind is a building block for music transcription and music interpretation. The same onset and spectral features are also useful in related areas such as speech analysis and acoustic event detection.

Copyright ©2025 Educative, Inc. All rights reserved