recognizing speech in noise by synthesizing it
Reconstructing damaged, obscured, or missing speech can improve both its audio quality and its intelligibility to humans and machines. This project introduces the use of an unmodified large vocabulary continuous speech recognizer as a prior model for speech reconstruction. Driving the recognizer to synthesize realistic speech that is similar to the reliable regions of the noisy observation can improve both recognition and reconstruction accuracy.
mapping the importance of "glimpses" of speech
Predicting the intelligibility of noisy recordings is difficult, and most current algorithms treat all speech energy as equally important to intelligibility. We have developed a listening test paradigm and associated analysis techniques that show that energy in certain time-frequency regions is more important to intelligibility than energy in others, and that can predict the intelligibility of a specific recording of a word in the presence of a specific noise instance. The analysis learns a map of the importance of each point in the recording's spectrogram to the overall intelligibility of the word when glimpsed through "bubbles" in many noise instances. The important regions identified by the model from listening test data agreed with the acoustic phonetics literature.
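As a rough illustration of what such an importance map could look like, the sketch below correlates, for each time-frequency point, whether that point was audible through a bubble with whether the listener identified the word correctly. This is an assumed, simplified stand-in and not necessarily the analysis used in the project; the function name `importance_map` and the flat 0/1 mask representation are hypothetical.

```python
import math


def importance_map(masks, correct):
    """Correlate audibility of each time-frequency point with correctness.

    masks:   list of trials, each a flat list of 0/1 audibility flags
             (1 = the point was glimpsed through a bubble in that trial)
    correct: list of 0/1 flags, one per trial (1 = word identified)
    Returns one Pearson correlation per time-frequency point.
    """
    n_trials = len(masks)
    n_points = len(masks[0])
    mean_c = sum(correct) / n_trials
    var_c = sum((c - mean_c) ** 2 for c in correct)
    importance = []
    for j in range(n_points):
        column = [trial[j] for trial in masks]
        mean_a = sum(column) / n_trials
        var_a = sum((a - mean_a) ** 2 for a in column)
        cov = sum((a - mean_a) * (c - mean_c)
                  for a, c in zip(column, correct))
        denom = math.sqrt(var_a * var_c)
        # A point that never varies carries no evidence either way.
        importance.append(cov / denom if denom > 0 else 0.0)
    return importance
```

Points whose audibility tends to coincide with correct responses get values near 1, while points that do not matter stay near 0.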
a database for gesture-driven workloads
I'm collaborating with Prof. Arnab Nandi on a gestural interface for querying databases. Instead of typing SQL commands, the user manipulates representations of database objects, while the database provides immediate feedback. This feedback allows the user to quickly explore the schema and data in the database with fluid, multi-touch gestures. I designed the gesture classification system, which is able to take advantage of the proximity of interface elements and compatibility of schema and data objects.
automatically describing music from its sound
With tens of millions of songs on iTunes and Spotify, people need better ways to explore large music collections. We propose using various machine learning algorithms to classify 10-second clips of songs according to a number of human-generated "tags", short textual descriptions like "male vocals", "acoustic", "guitar", and "folk". We have run many experiments on collecting such data, characterizing its properties, modeling the data by itself with a language model, and modeling the data with various features extracted from the audio.
We have found that people use more of the same tags when describing clips that are "closer" together in time: clips from the same track share more tags than clips from the same album, which share more tags than clips from the same artist, which share more tags than clips with nothing in common. We have found that tag language models improve classification accuracy on the raw data. And we have found that, while support vector machines work well for classification, restricted Boltzmann machines and multi-layer perceptrons work better.
Binaural Model-Based Source Separation & Localization
When listening in noisy and reverberant environments, human listeners are able to focus on a particular sound of interest while ignoring interfering sounds. Computer listeners, however, can only perform highly constrained versions of this task. While automatic speech recognition systems and hearing aids work well in quiet conditions, source separation is necessary for them to be able to function in these challenging situations.
This dissertation introduces a system that separates more than two sound sources from reverberant, binaural mixtures based on the sources' locations. Each source is modelled probabilistically using information about its interaural time and level differences at every frequency, with parameters learned using an expectation maximization (EM) algorithm. The system is therefore called Model-based EM Source Separation and Localization (MESSL). This EM algorithm alternates between refining its estimates of the model parameters (location) for each source and refining its estimates of the regions of the spectrogram dominated by each source. In addition to successfully separating sources, the algorithm estimates model parameters from a mixture that have direct psychoacoustic relevance and can usually only be measured for isolated sources. One of the key features enabling this separation is a novel probabilistic localization model that can be evaluated at individual time-frequency points and over arbitrarily-shaped regions of the spectrogram.
The localization performance of the systems introduced here is comparable to that of humans in both anechoic and reverberant conditions, with a 40% lower mean absolute error than four comparable algorithms. When target and masker sources are mixed at similar levels, MESSL's separations have signal-to-distortion ratios 2.0 dB higher than four comparable separation algorithms and estimated speech quality 0.19 mean opinion score units higher. When target and masker sources are mixed anechoically at very different levels, MESSL's performance is comparable to humans', but in similar reverberant mixtures it only achieves 20–25% of human performance. While MESSL successfully rejects enough of the direct-path portion of the masking source in reverberant mixtures to improve energy-based signal-to-noise ratio results, it has difficulty rejecting enough reverberation to improve automatic speech recognition results significantly. This problem is shared by other comparable separation systems.
Model-based EM Source Separation & Localization
MESSL is the source separation and localization system at the core of my dissertation. Its input is binaural (two-microphone), reverberant recordings of one, two, or three simultaneous speakers. Its output is an estimate of the regions of the spectrogram that each source dominates and estimates of the interaural parameters (interaural time, phase, and level differences) for each source. It makes no assumptions about the sources themselves or the geometry of the microphones or room.
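The EM alternation between estimating each source's parameters and estimating the spectrogram regions it dominates can be illustrated with a toy example. The sketch below is not the MESSL implementation. It assumes a single interaural level difference (ILD) value per time-frequency point and one Gaussian per source, whereas MESSL models interaural time and level differences at every frequency. The names `em_ild_masks` and `synthetic_ilds` are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_ild_masks(ild, n_sources, n_iter=50):
    """Toy EM: one Gaussian over ILD (dB) per source.

    Returns (masks, means, variances), where masks[i][k] is the posterior
    probability that time-frequency point i is dominated by source k.
    """
    ordered = sorted(ild)
    # Initialize means at evenly spaced quantiles of the observations.
    means = [ordered[int((k + 0.5) * len(ordered) / n_sources)]
             for k in range(n_sources)]
    grand_mean = sum(ild) / len(ild)
    grand_var = sum((x - grand_mean) ** 2 for x in ild) / len(ild)
    variances = [grand_var] * n_sources
    weights = [1.0 / n_sources] * n_sources
    masks = []
    for _ in range(n_iter):
        # E-step: soft assignment of each point to each source.
        masks = []
        for x in ild:
            lik = [w * gaussian_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances)]
            total = sum(lik)
            if total == 0.0:
                masks.append([1.0 / n_sources] * n_sources)
            else:
                masks.append([value / total for value in lik])
        # M-step: re-estimate each source's parameters from its soft mask.
        for k in range(n_sources):
            nk = sum(row[k] for row in masks)
            if nk < 1e-12:
                continue
            means[k] = sum(row[k] * x for row, x in zip(masks, ild)) / nk
            var = sum(row[k] * (x - means[k]) ** 2
                      for row, x in zip(masks, ild)) / nk
            variances[k] = max(var, 1e-3)  # floor to avoid collapse
            weights[k] = nk / len(ild)
    return masks, means, variances


def synthetic_ilds(seed=0, n_per_source=400):
    """Made-up test data: two sources with ILDs near -6 dB and +6 dB."""
    rng = random.Random(seed)
    return ([rng.gauss(-6.0, 1.5) for _ in range(n_per_source)] +
            [rng.gauss(6.0, 1.5) for _ in range(n_per_source)])


if __name__ == "__main__":
    masks, means, variances = em_ild_masks(synthetic_ilds(), n_sources=2)
    print("estimated ILD means (dB):", means)
```

In this toy version, the returned masks play the role of the estimated regions dominated by each source, and the means play the role of the estimated interaural parameters.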
how do singers tune in various contexts?
I've helped out Johanna Devaney with her work studying the effect of context on singers' intonation. We're looking at how singers change their tuning based on the harmonic context of other singers, based on the presence of accompaniment, and based on the melodic context of individual lines. We do this by analyzing recordings of the singers using automated and semi-automated tools we have developed. These tools have been released as the Automatic Music Performance and Analysis Toolkit (AMPACT) on GitHub.
I built a human computation game called Major Miner's music labeling game. From the intro:
The goal of the game, besides just listening to music, is to label songs with original, yet relevant words and phrases that other players agree with. We're going to use your descriptions to teach our computers to recommend music that sounds like the music you already like.
Players are having a good time with it. You can see the top scorers on the leader board, but don't let them intimidate you: it's pretty easy to score points once you get the hang of it. Check it out if you have some time to play.
the recti-linear room simulator
This code will generate binaural impulse responses from a simulation of the acoustics of a rectilinear room using the image method. It has a number of features that improve the realism and speed of the simulation. It can generate a pair of 680 ms impulse responses sampled at 22050 Hz in 75 seconds on a 1.8 GHz Intel Xeon. It's easy to run from within scripts to generate a large set of impulse responses programmatically.
To improve the realism, it applies anechoic head-related transfer functions to each incoming reflection, allows fractional delays, includes frequency-dependent absorption due to walls, includes frequency- and humidity-dependent absorption due to air, and varies the speed of sound with temperature. It also randomly perturbs sources in proportion to their distance to the listener to simulate imperfections in the alignment of the walls.
To improve simulation speed, it performs all calculations in the frequency domain, with the complex exponential generation code written in C; it calculates the Fourier transforms of anechoic HRTFs only as it needs them and then caches them; and it culls sources that are beyond the desired impulse response length or are significantly quieter than the direct path.
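For readers unfamiliar with the image method, here is a minimal sketch of its core idea together with the two culling rules mentioned above. It is not the simulator's code. It assumes a single omnidirectional receiver instead of HRTFs, one frequency-independent reflection coefficient for all walls, no air absorption, a source that does not coincide with the listener, and a common dry-air approximation for the speed of sound. All function and parameter names (`image_method`, `reflection`, `cull_db`, and so on) are hypothetical.

```python
import itertools
import math


def speed_of_sound(temp_celsius):
    """Approximate speed of sound in dry air (m/s) at a given temperature."""
    return 331.3 * math.sqrt(1.0 + temp_celsius / 273.15)


def axis_images(src, length, max_dist):
    """Image coordinates along one axis, each with its reflection count."""
    n_max = int(math.ceil(max_dist / (2.0 * length))) + 1
    images = []
    for n in range(-n_max, n_max + 1):
        images.append((2 * n * length + src, abs(2 * n)))
        images.append((2 * n * length - src, abs(2 * n - 1)))
    return images


def image_method(room, source, listener, ir_seconds,
                 temp_celsius=20.0, reflection=0.8, cull_db=-60.0):
    """Return sorted (delay_seconds, amplitude) pairs for a shoebox room.

    room, source, and listener are (x, y, z) tuples in meters, with walls
    at 0 and room[i] along each axis. Delays are kept as floats, so they
    are fractional with respect to any later sampling grid.
    """
    c = speed_of_sound(temp_celsius)
    max_dist = c * ir_seconds
    direct_amp = 1.0 / math.dist(source, listener)
    floor = direct_amp * 10.0 ** (cull_db / 20.0)
    per_axis = [axis_images(s, length, max_dist)
                for s, length in zip(source, room)]
    arrivals = []
    for combo in itertools.product(*per_axis):
        position = [coord for coord, _ in combo]
        bounces = sum(count for _, count in combo)
        dist = math.dist(position, listener)
        if dist > max_dist:
            continue  # arrives after the desired impulse response ends
        amp = reflection ** bounces / dist
        if amp < floor:
            continue  # much quieter than the direct path
        arrivals.append((dist / c, amp))
    arrivals.sort()
    return arrivals


if __name__ == "__main__":
    paths = image_method((5.0, 4.0, 3.0), (1.0, 1.0, 1.0),
                         (3.0, 2.0, 1.5), ir_seconds=0.05)
    print(len(paths), "image sources kept; direct path:", paths[0])
```

Each kept image source contributes one delayed, attenuated copy of the source signal. The simulator described above additionally filters each copy with an HRTF and with wall and air absorption.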
and playlist generation
Graham Poliner, Dan Ellis, and I built a system to automatically generate playlists based on the acoustic similarity of songs. This work went into our two publications, the first in the ACM Multimedia Systems Journal and the second at ISMIR 2005. The systems use SVM active learning to try to determine what you want to listen to. Take a look at the demo I put together for it.
implementation of the Infinite Gaussian Mixture Model
For my final project in Tony Jebara's Machine Learning course, cs4771, I implemented Carl Rasmussen's Infinite Gaussian Mixture Model. I got it working for both univariate and multivariate data. I'd like to see what it does when presented with MFCC frames from music and audio. There were some tricky parts of implementing it, which I wrote up in a short paper describing my implementation. Since I've gotten the multivariate case working, I'll trust you to ignore all statements to the contrary in the paper. The IGMM requires Adaptive Rejection Sampling to sample the posteriors of some of its parameters, so I implemented that as well. Thanks to Siddharth Gopal for a bugfix.
Download related pieces: