MIT Develops AI That Can Isolate and Edit the Individual Instruments in a Song

MIT’s latest AI feat has the power to pick apart music in an unprecedented way.

When a song is released, it’s in its final form.  It’s a single audio file that is nearly impossible to separate into individual instruments and voices.

Companies are already using artificial intelligence to make processes more efficient, including writing music. There are also advanced techniques for identifying the individual components of a track for licensing purposes. Think of a Shazam that identifies music at a deeper level, and you get the idea.

Now, there’s another breakthrough: the Massachusetts Institute of Technology (MIT) has announced a new AI that has the capability to isolate individual instruments within a piece of music.  Even better, it also makes it possible to adjust the individual elements, remove them, or remix them in any way.

“Trained on over 60 hours of videos, the ‘PixelPlayer’ system can view a never-before-seen musical performance, identify specific instruments at pixel level, and extract the sounds that are associated with those instruments,” states MIT.

“For example, it can take a video of a tuba and a trumpet playing the ‘Super Mario Brothers’ theme song, and separate out the sound waves associated with each instrument.”

This new AI capability could seriously alter audio editing.

For example, the new separations could enable much cleaner audio restoration for old music. Additionally, a band teacher could play a video of an orchestra and isolate individual instruments for the students to hear. The possibilities go on and on.

Hang Zhao, the lead author for the project, envisioned a best-case scenario in which the researchers could recognize which instruments make which sounds.  “We were surprised that we could actually spatially locate the instruments at the pixel level,” states Zhao. “Being able to do that opens up a lot of possibilities, like being able to edit the audio of individual instruments by a single click on the video.”

MIT’s PixelPlayer is a deep-learning system.

What does this mean? Deep learning means that the AI finds patterns in data, however complex, using neural networks that have been trained on existing videos.

In PixelPlayer, one neural network analyzes the visuals, a second analyzes the audio, and a third associates specific pixels with specific sound waves in order to pull the various sounds apart.
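To make the three-part split more concrete, here is a minimal sketch in plain Python of how a pixel's visual feature might be combined with audio channels to carve one instrument out of a mix. Everything in it is an assumption made for illustration: the function names, the two-channel toy numbers, and the choice of a sigmoid mask are hypothetical, and the real PixelPlayer networks are learned from video rather than written by hand.

```python
import math

def sigmoid(x):
    """Squash a score into the range (0, 1) so it can act as a mask value."""
    return 1.0 / (1.0 + math.exp(-x))

def synthesize_mask(pixel_feature, audio_channels):
    """Hypothetical 'third network': weight each audio channel by the
    pixel's visual feature and turn the result into a per-bin mask."""
    n_bins = len(audio_channels[0])
    mask = []
    for b in range(n_bins):
        score = sum(w * ch[b] for w, ch in zip(pixel_feature, audio_channels))
        mask.append(sigmoid(score))
    return mask

def separate(mixture, pixel_feature, audio_channels):
    """Apply the mask to the mixed spectrum to keep one source."""
    mask = synthesize_mask(pixel_feature, audio_channels)
    return [m * x for m, x in zip(mask, mixture)]

# Toy mixed spectrum: four frequency bins (assumed: low bins are the tuba,
# high bins are the trumpet).
mixture = [1.0, 0.8, 0.9, 1.2]

# Stand-in for the audio network's output: one channel per sound component.
audio_channels = [
    [4.0, 4.0, -4.0, -4.0],   # responds to the low bins
    [-4.0, -4.0, 4.0, 4.0],   # responds to the high bins
]

# Stand-in for the visual network's output at two clicked pixels.
tuba_pixel = [1.0, 0.0]
trumpet_pixel = [0.0, 1.0]

tuba_sound = separate(mixture, tuba_pixel, audio_channels)
trumpet_sound = separate(mixture, trumpet_pixel, audio_channels)
```

In this toy setup, picking the tuba pixel keeps mostly the low bins and picking the trumpet pixel keeps mostly the high bins, which is the "single click on the video" idea in miniature.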

Furthermore, PixelPlayer is self-supervised, which means it learns without humans labeling what the instruments are or what they sound like. As a result, MIT’s engineers can’t always pinpoint exactly how it learns which instruments make certain sounds.
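Self-supervision generally means the training signal comes from the data itself rather than from human labels. One common recipe for sound separation, assumed here for illustration and not described in this article, is to add two known clips together and ask the system to recover each one. The sketch below shows how such a training example and its error could be built; the function names and numbers are hypothetical.

```python
def make_training_example(clip_a, clip_b):
    """Mix two solo spectra and derive the target masks from the mix itself.
    No human ever labels which instrument is which."""
    mixture = [a + b for a, b in zip(clip_a, clip_b)]
    target_a = [a / m if m > 0 else 0.0 for a, m in zip(clip_a, mixture)]
    target_b = [b / m if m > 0 else 0.0 for b, m in zip(clip_b, mixture)]
    return mixture, target_a, target_b

def l1_loss(predicted_mask, target_mask):
    """Average absolute gap between a predicted mask and its target."""
    gaps = [abs(p - t) for p, t in zip(predicted_mask, target_mask)]
    return sum(gaps) / len(gaps)

# Toy spectra for two solo clips (three frequency bins each).
clip_a = [1.0, 0.5, 0.0]
clip_b = [0.0, 0.5, 2.0]

mixture, target_a, target_b = make_training_example(clip_a, clip_b)

# An uninformed guess that splits every bin evenly scores worse than a
# perfect prediction, which is the signal training would use.
naive_loss = l1_loss([0.5, 0.5, 0.5], target_a)
perfect_loss = l1_loss(target_a, target_a)
```

Because the targets are computed from the clips that went into the mix, the only "teacher" is the data itself.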