Key Results of this Research

Saliency-enhanced features halve the error rate of human analysts.

In our 2012 ICASSP paper, we demonstrated that human analysts tasked with detecting anomalies in a large audio file can halve their error rates (F-score increases from 0.3 to 0.6) by using a visualization tool in which the visual saliency of the spectrogram is a monotonic function of the estimated probability of an audio anomaly.
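For readers unfamiliar with the metric, the F-score quoted above is conventionally the harmonic mean of precision and recall. The following is a minimal sketch of that arithmetic, assuming the balanced (F1) form; the function name and the detection counts are hypothetical, chosen only to reproduce values of 0.3 and 0.6, and are not data from the paper.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    detected = true_positives + false_positives   # everything the analyst flagged
    actual = true_positives + false_negatives     # every anomaly really present
    if true_positives == 0 or detected == 0 or actual == 0:
        return 0.0
    precision = true_positives / detected
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (an assumption, not results from the paper).
baseline = f_score(true_positives=3, false_positives=7, false_negatives=7)
enhanced = f_score(true_positives=6, false_positives=4, false_negatives=4)

print(f"baseline F-score: {baseline:.2f}")
print(f"enhanced F-score: {enhanced:.2f}")
```

With these illustrative counts, precision and recall are both 0.3 in the first case and both 0.6 in the second, so the F-score equals the same value in each case.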

Audio visualization permits anomaly detection at 8X real time.

In our 2011 APSIPA paper, we showed that zoomable audio visualization tools allow some users to find audio "easter eggs" (anomalies, e.g., motorcycles, cuckoo clocks, and spaceships added to a background composed of eight hours of orchestral music) eight times faster than they would by simply listening to the audio.

Left: during the 2009 Beckman Open House, visitor scores were posted on the wall using pink sticky notes, in order to encourage competition. Right: visitors to the open house found anomalies as much as eight times faster than they would have by listening to the audio.

CLEAR AED Competition: acoustic event detection

The 2007 CLEAR Acoustic Event Detection competition included two sub-tasks: (1) the Classification sub-task sought to correctly classify discrete isolated events into one of 12 labeled categories (door slam, paper shuffling, footsteps, knocking, chair moving, phone ringing, spoon/cup jingle, key jingle, keyboard typing, applause, cough, and laughter); (2) the Detection sub-task sought to correctly detect and label the same 12 event categories in a business meeting recorded by multiple tabletop, wall-mounted, and headset microphones. All tested systems performed well in the Classification sub-task (typically 90% accuracy). In the Detection sub-task, the best performance, an accuracy of only 34%, was achieved by our HMM recognizer with AdaBoost feature selection (listed in the table below as Adaboost). The second-best and third-best systems had AEDACC scores of 23% and 21%; the other two systems had AEDACC scores below 10%.

Since 2007, we have substantially improved system accuracy by using tandem neural network-AdaBoost inputs (Adaboost+T) and Gaussian mixture supervector rescoring (Adaboost+S); the benefits of these two modifications are slightly super-additive (Adaboost+T+S; see our paper in Pattern Recognition Letters).

Acoustic Event Detection Accuracy (AEDACC). Event Categories: ap=applause/clapping, cl=clink, cm=chair moving, co=cough, ds=door slam, kj=key jingle, kn=knock, kt=keyboard typing, la=laughter, pr=phone ringing/music, pw=paper work, st=steps. See our AMI event labels for an extension of these event categories.