Work

My full list of publications

Graduate Research

Dissertation: Binaural Model-Based Source Separation and Localization

When listening in noisy and reverberant environments, human listeners are able to focus on a particular sound of interest while ignoring interfering sounds. Computer listeners, however, can only perform highly constrained versions of this task. While automatic speech recognition systems and hearing aids work well in quiet conditions, source separation is necessary for them to be able to function in these challenging situations.

This dissertation introduces a system that separates more than two sound sources from reverberant, binaural mixtures based on the sources' locations. Each source is modelled probabilistically using information about its interaural time and level differences at every frequency, with parameters learned using an expectation maximization (EM) algorithm. The system is therefore called Model-based EM Source Separation and Localization (MESSL). This EM algorithm alternates between refining its estimates of the model parameters (location) for each source and refining its estimates of the regions of the spectrogram dominated by each source. In addition to successfully separating sources, the algorithm estimates model parameters from a mixture that have direct psychoacoustic relevance and can usually only be measured for isolated sources. One of the key features enabling this separation is a novel probabilistic localization model that can be evaluated at individual time-frequency points and over arbitrarily-shaped regions of the spectrogram.

The localization performance of the systems introduced here is comparable to that of humans in both anechoic and reverberant conditions, with a 40% lower mean absolute error than four comparable algorithms. When target and masker sources are mixed at similar levels, MESSL's separations have signal-to-distortion ratios 2.0 dB higher than four comparable separation algorithms and estimated speech quality 0.19 mean opinion score units higher. When target and masker sources are mixed anechoically at very different levels, MESSL's performance is comparable to humans', but in similar reverberant mixtures it only achieves 20–25% of human performance. While MESSL successfully rejects enough of the direct-path portion of the masking source in reverberant mixtures to improve energy-based signal-to-noise ratio results, it has difficulty rejecting enough reverberation to improve automatic speech recognition results significantly. This problem is shared by other comparable separation systems.
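To make the alternation described in the abstract a bit more concrete, here is a toy sketch. It is not the MESSL implementation. It assumes each time-frequency point is summarized by a single interaural level difference (in dB) and that each source is a 1-D Gaussian over that value, whereas the real system also uses interaural time differences and frequency-dependent parameters. The function name `em_soft_masks`, its parameters, and the example values are made up for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_soft_masks(ilds, init_means, init_vars, n_iter=20, var_floor=1e-3):
    """Toy EM over interaural level differences (hypothetical helper).

    Returns per-source means, variances, and a soft mask: one row per
    observation, one column per source.
    """
    means = list(init_means)
    variances = list(init_vars)
    n_src = len(means)
    weights = [1.0 / n_src] * n_src
    masks = []
    for _ in range(n_iter):
        # E-step: estimate which source dominates each point.
        masks = []
        for x in ilds:
            likes = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(likes)
            if total > 0.0:
                masks.append([p / total for p in likes])
            else:
                # All likelihoods underflowed: fall back to a uniform guess.
                masks.append([1.0 / n_src] * n_src)
        # M-step: re-estimate each source's parameters from its soft mask.
        for k in range(n_src):
            resp = [row[k] for row in masks]
            n_k = sum(resp)
            if n_k == 0.0:
                continue
            means[k] = sum(r * x for r, x in zip(resp, ilds)) / n_k
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, ilds)) / n_k
            variances[k] = max(var, var_floor)  # floor avoids a collapsed component
            weights[k] = n_k / len(ilds)
    return means, variances, masks


# Made-up level differences from two sources, one on each side of the head.
ilds = [-6.2, -5.8, -6.0, -5.5, 5.9, 6.1, 6.3, 5.7]
means, variances, masks = em_soft_masks(ilds, [-1.0, 1.0], [4.0, 4.0])
```

Each row of `masks` plays the role of the "regions of the spectrogram dominated by each source" for one point, and `means` plays the role of the location parameters.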

Research Interests

Binaural source localization

I have been working on the problem of sound source localization. I take as my starting point binaural (two-microphone), reverberant recordings of one, two, or three simultaneous speakers. From these recordings, my system determines the direction from which the sounds are arriving and separates the speakers from one another as best it can.

Related stuff:

Music Similarity and Playlist Generation

Graham Poliner, Dan Ellis, and I have been working on the problem of playlist generation. This work went into our two publications, the first in the ACM Multimedia Systems Journal and the second at ISMIR 2005. The systems use SVM active learning to try to determine what you want to listen to. Take a look at the demo I put together for it.

In addition to the papers, a system based on this idea came in first place in the MIREX 2005 Artist identification competition at ISMIR and second place in the Genre identification competition.

Related stuff on my webpage:

Major Miner's music labeling game

I built a human computation game called Major Miner's music labeling game. From the intro:

The goal of the game, besides just listening to music, is to label songs with original, yet relevant words and phrases that other players agree with. We're going to use your descriptions to teach our computers to recommend music that sounds like the music you already like.

Players are having a good time with it. You can see the top scorers on the leader board, but don't let them intimidate you: it's pretty easy to score points once you get the hang of it. Check it out if you have some time to play.

Classes

Here are some final projects from classes I've taken here at Columbia. Maybe you want to see the full list of classes I've taken.

The Infinite Gaussian Mixture Model

For my final project in Tony Jebara's Machine Learning course, cs4771, I implemented Carl Rasmussen's Infinite Gaussian Mixture Model. I got it working for both univariate and multivariate data. I'd like to see what it does when presented with MFCC frames from music and audio. There were some tricky parts of implementing it, which I wrote up in a short paper describing my implementation. Since I've gotten the multivariate case working, I'll trust you to ignore all statements to the contrary in the paper. The IGMM requires Adaptive Rejection Sampling to sample the posteriors of some of its parameters, so I implemented that as well.

 

One sample taken from the igmm on my version of the "spirals" dataset
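What makes the mixture "infinite" is the prior on component assignments: a point joins an existing component with probability proportional to that component's size, or opens a new one with probability proportional to a concentration parameter. Here is a minimal sketch of drawing assignments from that prior alone, often called a Chinese restaurant process. It is not the implementation from the project: it leaves out the Gaussian likelihoods, the hyperparameter updates, and the Adaptive Rejection Sampling steps, and the name `sample_crp_assignments` and its arguments are made up for illustration.

```python
import random


def sample_crp_assignments(n_points, alpha, rng):
    """Draw component assignments from the prior only (hypothetical helper).

    alpha > 0 is the concentration parameter; larger values open new
    components more readily. Returns (assignments, component_sizes).
    """
    counts = []       # current size of each component
    assignments = []  # component index chosen for each point
    for _ in range(n_points):
        # Existing components are weighted by size, a new one by alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new component
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


rng = random.Random(0)
assignments, counts = sample_crp_assignments(100, 1.0, rng)
```

The number of components is never fixed in advance; it grows with the data, which is the property the full model exploits.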

 

Download related pieces:

Active SVM Learning for Music Retrieval

For Professor Shih-Fu Chang's course, Graham Poliner and I put together a music retrieval system. It used active SVM learning (a form of relevance feedback) on Fisher kernel features to try to recommend songs similar to those the user has tagged as relevant, while avoiding those the user has tagged as irrelevant. We're still planning on trying out different features, classifiers, song databases, and ground truth, i.e. "this work is just preliminary."

Here's the abstract:

In order to manage growing music collections, a personal music recommender could find new music, appropriate to the user's mood, that he or she would like to listen to. This paper approaches these goals using the flexible search technique of active SVM learning that adapts to users' perceptions instead of vice versa. In the best case, active SVM learning requires fewer than half the number of training examples a normal SVM classification would require to achieve the same precision and recall. In addition to the idea of applying active SVM learning to the audio domain, the paper has contributed a collection of ground truth classification of popular songs and a preliminary software implementation of this recommender.
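One common query rule in active SVM learning is to ask the user about the unlabeled item that lies closest to the current decision boundary, since its label is the one the classifier is least sure of. Here is a minimal sketch of just that selection step. It assumes a linear decision function with hypothetical weights `w` and bias `b`; training the SVM and computing the Fisher kernel features are left out, and the project may well have used a different rule.

```python
def decision_value(w, b, x):
    """Signed score of a linear classifier: positive means 'relevant'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def next_query(w, b, unlabeled):
    """Index of the unlabeled item closest to the decision boundary."""
    return min(range(len(unlabeled)),
               key=lambda i: abs(decision_value(w, b, unlabeled[i])))


# Made-up 2-D feature vectors for three unlabeled songs.
w, b = [1.0, -1.0], 0.0
unlabeled = [[3.0, 0.0], [0.5, 0.4], [-2.0, 1.0]]
query_index = next_query(w, b, unlabeled)  # the second song is the most ambiguous
```

After the user labels the chosen song, the classifier is retrained and the rule is applied again, which is how the number of labels needed can stay small.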

"I'm going to update this any day now..." but you can download some related pieces here:

Audio Fingerprinting

For Dan Ellis' Digital Signal Processing class, I did some work on audio fingerprinting, with an eye towards using it as a means for measuring the similarity of sounds. I'm going to keep working on this, but maybe from a different angle.

Here's the abstract:

Shazam's audio features, consisting of pairs of spectral peaks with their associated difference in time, form a useful representation for identifying identical audio clips in the presence of noise and distortion. This project implements a Shazam feature extractor and attempts to generalize it from the very specific identity detector to a less specific auditory similarity measure. This generalization unfortunately did not meet with much success, but we have created a number of reduced-data songs from the Shazam representation that are still recognizable even with no additional information from the original song.
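The peak-pair idea from the abstract can be sketched in a few lines. This assumes spectral peaks have already been picked and are given as (frame, frequency bin) tuples; the peak picking itself is left out. The function name `peak_pair_features` and the limits `max_dt` and `max_pairs` are made up for illustration and are not taken from the project code.

```python
def peak_pair_features(peaks, max_dt=5, max_pairs=3):
    """Pair each peak with a few later peaks (hypothetical helper).

    peaks: list of (frame, freq_bin) tuples.
    Returns a list of (anchor_frame, (f1, f2, dt)) entries, where dt is
    the time difference in frames between the two peaks.
    """
    peaks = sorted(peaks)
    features = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break       # peaks are sorted, so later ones are too far away
            if dt == 0:
                continue    # skip peaks in the same frame
            features.append((t1, (f1, f2, dt)))
            paired += 1
            if paired == max_pairs:
                break
    return features


# Made-up peaks: three close together and one far away.
peaks = [(0, 10), (2, 20), (3, 15), (10, 30)]
features = peak_pair_features(peaks)
```

Because each feature stores only the two frequencies and the time difference, the same clip starting at a different offset yields the same `(f1, f2, dt)` triples, which is what makes the representation useful for matching.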

Some related pieces:

Note that any code posted here is released under the GNU General Public License v3. I may need to put something about that in the files themselves, but for now I trust you.

Don't forget to look at my undergrad work

mr-pc.org updated
Copyright © 2004-9 Michael I Mandel