HOPE_16 Hack the Violin: This Time There's AI!
by hack_the_violin and ebmbat
This past summer we gave a talk at HOPE_16 - Sponsored by Pfizer® about the violin and AI. When we surveyed what was already out there, we found very little, and nothing that addressed the musical/artistic side of playing the violin. Most of what we found dealt with measuring pitch and/or rhythm, and was not really AI-based either. We really wanted something in the spirit of hack_the_violin: tips and tricks to make your sound a little sweeter and your playing a little easier, where the AI would do the heavy lifting and tell us which tips and tricks to use and when, particularly in an artistic/musical context. Not finding anything, we set about creating something ourselves.
First, we started with existing linguistics and audio analysis libraries, exploring pitch and rhythm. We chose Praat because it is commonly used in linguistics and phonetics to analyze and synthesize speech, and, with the help of Claude Code, we created two command-line utilities for frequency analysis using Parselmouth, a Python interface to Praat. The first pitch analyzer script maps the pitch range of a four-string violin (G3 to E7) and returns statistics about the sample in a table. The resulting visualization shows two charts: a waveform and a pitch contour.
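For the curious, here is a minimal sketch of that kind of pitch extraction with Parselmouth; the file name and the printed statistics are our illustration, not the script's exact output:

import parselmouth

# Load a recording and extract pitch, bounded to the violin's
# range (G3 is about 196 Hz, E7 about 2637 Hz).
snd = parselmouth.Sound("scale.wav")
pitch = snd.to_pitch(pitch_floor=196.0, pitch_ceiling=2637.0)

# Praat marks unvoiced or silent frames as 0 Hz; drop them.
freqs = pitch.selected_array["frequency"]
voiced = freqs[freqs > 0]

print(f"frames analyzed: {len(freqs)}")
print(f"pitch min/mean/max: {voiced.min():.1f} / "
      f"{voiced.mean():.1f} / {voiced.max():.1f} Hz")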
The second pitch analyzer script returns a Pandas dataframe containing F1, F2, F3, jitter, and shimmer values.
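Under the hood, those measurements come from standard Praat commands. Here is a sketch of how they can be pulled through Parselmouth; the command strings and parameter values follow common Praat voice-analysis recipes and may differ from our script's exact settings:

import parselmouth
from parselmouth.praat import call
import pandas as pd

snd = parselmouth.Sound("scale.wav")

# Formants via the Burg method; sample F1-F3 at the midpoint.
formants = snd.to_formant_burg()
t = snd.duration / 2
f1, f2, f3 = (formants.get_value_at_time(n, t) for n in (1, 2, 3))

# Jitter and shimmer are computed from a PointProcess of periodic pulses.
points = call(snd, "To PointProcess (periodic, cc)", 196, 2637)
jitter = call(points, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
shimmer = call([snd, points], "Get shimmer (local)",
               0, 0, 0.0001, 0.02, 1.3, 1.6)

print(pd.DataFrame([{"F1": f1, "F2": f2, "F3": f3,
                     "jitter": jitter, "shimmer": shimmer}]))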
After more research, we found that the F1, F2, and F3 results referred to formants. Formants F1 and F2 are related to vowel height and vowel place. This expanded our previous perspective on pitch and frequency to include vowel height, vowel place, and formants. Jitter values beyond a certain threshold are associated with speech pathology. For looking at tonal variation, this could be useful down the line in analyzing violin tone, but so far no correlation to a real-world situation was apparent.
We also found that shimmer values, as defined in the Praat software, measure amplitude variation in vocal fold vibrations, a key indicator of acoustic voice quality. So if jitter is about how steady the pitch is, shimmer is about how steady the loudness is from one vibration to the next. It is easy to see that this could be useful, but it would require translation, first for technical accuracy and then into artistic use.
At this point, we pivoted to rhythm analysis. With Gemma-3-12b running in LM Studio and the Python Librosa library, we created a CLI utility for rhythm analysis that estimates the tempo of a recording in beats per minute. We recognized that an effective rhythm analyzer would need more factors than beats per minute alone; for instance, we didn't take rubato into account. Improvements will need a way to align the tempi of multiple recordings so the tool can practically compare a student recording against an instructor recording.
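The core of that tempo estimate is a short Librosa call; a sketch (file name ours):

import librosa

# Load the recording and estimate a global tempo plus beat positions.
y, sr = librosa.load("student_take.wav")
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

print(f"estimated tempo: {float(tempo):.1f} BPM over {len(beat_times)} beats")

A single global BPM figure is exactly why rubato slips through; the per-beat times in beat_times would be the obvious starting point for aligning two performances.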
While we were developing the rhythm analyzer, we stumbled upon voice recognition features, which later led to a key discovery involving "singing" and the human voice: MFCCs, or Mel-Frequency Cepstral Coefficients. MFCCs don't directly measure jitter/shimmer; instead, they capture the spectral consequences of these instabilities. This "broader fingerprint" analysis of the sound led to a breakthrough for our goals.
If you recall from HOPE XV's "Hack the Violin" presentation, the number one hack is singing: sing the melody and then play it on the violin (be not fooled by the apparent simplicity of this suggestion). While examining MFCCs and their applications in voice and speech recognition, we noticed a key piece of information that both validates the pitch analysis scripts and strengthens the idea of violins sounding like the human voice, showing some of why singing is such a powerful tool for learning the violin.
In their 2018 publication, "Acoustic Evolution of Old Italian Violins from Amati to Stradivari," Hwan-Ching Tai et al. used Praat software to analyze recordings of antique Italian violins and compare them to recordings of male and female singers. They found that the voice-like quality of these violins aligned with the rise of professional female singers. Indeed, the idea of the violin sounding like the human voice goes back further, to 1751, when Francesco Geminiani published The Art of Playing on the Violin.
After reviewing known use cases of MFCCs, we believed they could apply to our violin project, and we asked Claude Sonnet to help us co-author a timbre analyzer. The timbre analyzer is a CLI utility built with the Librosa Python library. It processes one or many .WAV files and generates timbre profiles of the audio samples. The output includes the timbre analysis results, 13 MFCC values with text descriptions of their perceived timbre qualities, a CSV file, and a dashboard.
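The heart of the timbre analyzer is Librosa's MFCC routine. A stripped-down sketch, with the dashboard and batch handling omitted and file names ours:

import librosa
import numpy as np
import pandas as pd

# Compute 13 MFCCs per frame, then average over time so the whole
# recording reduces to a single 13-number timbre fingerprint.
y, sr = librosa.load("scale_plain.wav")
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
profile = mfccs.mean(axis=1)

pd.DataFrame({"coefficient": [f"MFCC {i + 1}" for i in range(13)],
              "mean": np.round(profile, 2)}).to_csv("timbre_profile.csv",
                                                    index=False)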
At the beginning, we focused on the "Detailed MFCC Coefficient Analysis" results. The timbre analyzer dashboard was mesmerizing, but it was unclear how these results could be meaningful for a student or a teacher. Running the timbre analyzer on a .WAV file returns an overall timbre profile, including text descriptors like "brightness," "harmonic richness," "attack character," "warmth," "clarity," and so on. A detailed MFCC coefficient analysis is also printed to the command-line console. The dashboard displays ten graphs, which did not initially seem to connect to artistic expression. We did take note that the text descriptors provided in the MFCC timbre analysis were similar to the vocabulary used to describe artistic/musical sound qualities when discussing music on the violin. The results of a plain two-octave scale MFCC analysis did not tell us much by themselves, so we recorded another two-octave scale.
This time, we played it in a bold musical style, with the goal of determining whether the analysis would distinguish between the two differently performed scales or simply hear two violins and classify them as the same type of sound. There was a slight difference in the text descriptors between the two sets of MFCC results, so we asked ChatGPT to compare them. ChatGPT described the differences between the two performances the same way we both heard them. With the two MFCC analysis results and its LLM capacity, ChatGPT could comment on the artistic/musical qualities of the sound in a way that a student or player could understand right away.
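In essence, what we handed to ChatGPT was two of those 13-number profiles. A sketch of the comparison step (file names ours):

import librosa

def timbre_profile(path):
    # Mean of 13 MFCCs over the whole recording.
    y, sr = librosa.load(path)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

plain = timbre_profile("scale_plain.wav")
bold = timbre_profile("scale_bold.wav")

# Per-coefficient difference: which timbral dimensions moved?
for i, delta in enumerate(bold - plain, start=1):
    print(f"MFCC {i:2d}: {delta:+.2f}")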
This was a surprise and vastly exceeded our expectations based on previous interactions with various AI platforms. The key here is that the MFCC analysis is an excellent representation of the type of sound violins and human voices make. So we have good data going in, which, of course, is more likely to make for good data coming out. Having achieved this result, the next thing we did was take all the hack_the_violin playing/teaching notes (gathered into one big file) and ask ChatGPT to use them as a reference to tell the player of the plain-sounding file how to sound more like the musically bold-sounding file. ChatGPT was able to fetch and reference the same things we would have drawn on to give that instruction, both in general terms and with specific techniques for the left and right hand.
This was fantastic, as we were really bridging the gap between a technical analysis and a real-world context, and we could refer to any written document about the violin that used the same type of descriptive language. We loaded up the earlier-mentioned Geminiani violin treatise and got similar yet still exciting results, in the written style of Geminiani! The results were, again, on point, with the most relevant parts of Geminiani's document quoted and referenced to help the player. We followed this with treatises by Flesch, Galamian, Auer, Francescatti, and Leopold Mozart. All returned similar results from the matching parts of their documents.
In essence, we discovered a way to analyze the sound of the violin, perceive its artistic aspects, comment on them, and gain insight into them in relation to music using present and historical sources, all in a matter of seconds!
So what are MFCCs, and how did they make sound measurable in a way that lets AI comment on artistic expression with insight?
Librosa and MFCCs
Mel-Frequency Cepstral Coefficients (MFCCs) are widely used for voice recognition, music genre classification, and musical instrument identification. Introduced in 1980 by Steven Davis and Paul Mermelstein in their research on acoustic data in speech recognition systems, MFCCs are numbers that describe the spectral characteristics of a sound, computed on the Mel scale.
The Mel scale used in MFCC computation splits sound into frequency bands, giving more attention to the frequencies used to understand human speech, and it aligns with how we perceive pitch, based on psychoacoustic research from the 1930s and 1940s. Essentially, MFCCs represent how sound is perceived. We believe MFCCs fit our use case quite well, tying together concepts such as singing and violins and the human perception of sound, and capturing the shape of a sound to infer characteristics of timbre and texture in violin recordings.
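The warp itself is simple. The common (HTK-style) formula is mel = 2595 * log10(1 + hz/700), and Librosa exposes it directly:

import librosa

# A fixed step in Hz counts for fewer mels the higher you go,
# mirroring how pitch perception compresses at high frequencies.
for note, hz in [("G3", 196.0), ("G4", 392.0), ("G5", 784.0), ("E7", 2637.0)]:
    print(f"{note}: {hz:7.1f} Hz -> {librosa.hz_to_mel(hz, htk=True):7.1f} mel")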
We used 13 MFC coefficients in the timbre analyzer:
MFCC 2 - Sharpness
MFCC 3 - Harmonics
MFCC 4 - Attack
MFCC 5 - Body/Warmth
MFCC 6 - Clarity
MFCC 7 - Woody/Nasal
MFCC 8 - Brilliance
MFCC 9 - Airiness
MFCC 10 - Texture
MFCC 11 - Timbral Detail
MFCC 12 - Character
With these key characteristics, we transcended a strictly technical and numerical approach, enabling us to prompt ChatGPT and receive meaningful results back.
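The mapping from numbers to words is the part we co-authored with the model, so treat the following as a toy illustration of the idea rather than the analyzer's actual logic; the labels come from the list above, and the sign-based rule is invented for the example:

import librosa

# Descriptor labels for coefficients 2-12, per the list above.
DESCRIPTORS = {2: "sharpness", 3: "harmonics", 4: "attack",
               5: "body/warmth", 6: "clarity", 7: "woody/nasal",
               8: "brilliance", 9: "airiness", 10: "texture",
               11: "timbral detail", 12: "character"}

y, sr = librosa.load("scale_bold.wav")
profile = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

# Toy rule: call a quality stronger or weaker by the sign of its mean
# coefficient (coefficient N taken as row N-1 of Librosa's output).
for idx, label in DESCRIPTORS.items():
    strength = "stronger" if profile[idx - 1] > 0 else "weaker"
    print(f"MFCC {idx:2d} ({label}): {strength} ({profile[idx - 1]:+.1f})")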
A next step might be to create an Agent - an AI "virtual teacher" built from the quotations and instructions of any violin master, or a combination thereof, combined with the MFCCs of their performances. One could essentially have a lesson with any great player, with far greater depth than previously available from reading a treatise or method book, listening to an interview, or even watching a masterclass.
Ultimately, this could expand even further into discovering the best learning styles of individuals and tailoring advice to them. More immediately, one could record examples for a student and then have them run the analysis while practicing between lessons. It would also be interesting to see how this type of analysis translates to other instruments.
A lot remains to be done, but it looks like AI can be helpful to us humans in an artistic space.
References on MFCCs