AVICAR Project: Audio-Visual Speech Recognition at UIUC

Introduction

Speech recognition in an automobile is typically performed using a single microphone, often mounted in the sun-visor in front of the driver. Under typical acoustic background noise, the signal-to-noise ratio (SNR) ranges from approximately 15 dB down to -10 dB. At these noise levels, even recognizers with a very small vocabulary may generate too many recognition errors for practical use.
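
As a point of reference for the SNR figures quoted above, the short Python sketch below uses the standard definition of SNR in decibels (ten times the base-10 logarithm of the ratio of signal power to noise power). The function names are illustrative only and are not part of the AVICAR tools.

```python
import math


def snr_db(signal, noise):
    """SNR in dB from two sample sequences, using mean power of each."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)


def power_ratio(snr):
    """Signal-to-noise power ratio corresponding to an SNR given in dB."""
    return 10.0 ** (snr / 10.0)
```

Under this definition, -10 dB corresponds to noise power ten times the speech power, and 15 dB to speech power roughly 32 times the noise power.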

To address this problem, we have implemented a method of audio-visual speech recognition using a multisensory visor-mounted array composed of eight microphones and a dashboard-mounted array of four video cameras.

The objectives of this project are:

  • Acquire data in a realistic automobile environment
  • Develop and apply robust audio-visual feature extraction algorithms
  • Test the resulting features by training and testing small-vocabulary speech recognition models

 

Equipment

Eight channel audio recording

An array of eight microphones, connected to portable microphone preamplifiers, is mounted on the visor. Eight separate audio channels are recorded by an ADAT audio recorder.

Four channel video recording

An array of four cameras is mounted on the dashboard. Their signals are combined by a quadratic video multiplexer and recorded by a DV camcorder.

Control of recording session

A cordless telephone is used as a DTMF tone generator. It feeds control signals into the audio and video recordings.
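
To illustrate how DTMF control tones can be located in a recording, the Python sketch below synthesizes a tone from the standard DTMF frequency pairs and identifies the key with the Goertzel recurrence. The 8 kHz sample rate, the 50 ms tone length, and all function names are assumptions made for this example. This page does not state which keys were used or how the tones were decoded in the project.

```python
import math

# Standard DTMF frequency pairs (Hz): one row tone plus one column tone per key.
ROW_FREQS = (697, 770, 852, 941)
COL_FREQS = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")

SAMPLE_RATE = 8000  # assumed for illustration, not the project's recording rate
DURATION = 0.05     # assumed tone length in seconds


def dtmf_tone(key, sample_rate=SAMPLE_RATE, duration=DURATION):
    """Synthesize the two-sinusoid DTMF tone for a single keypad key."""
    if len(key) != 1:
        raise ValueError(f"not a DTMF key: {key!r}")
    for r, row in enumerate(KEYPAD):
        c = row.find(key)
        if c >= 0:
            f_row, f_col = ROW_FREQS[r], COL_FREQS[c]
            break
    else:
        raise ValueError(f"not a DTMF key: {key!r}")
    n = int(sample_rate * duration)
    return [
        0.5 * math.sin(2 * math.pi * f_row * i / sample_rate)
        + 0.5 * math.sin(2 * math.pi * f_col * i / sample_rate)
        for i in range(n)
    ]


def goertzel_power(samples, freq, sample_rate=SAMPLE_RATE):
    """Squared spectral magnitude of samples at freq (Goertzel recurrence)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_key(samples, sample_rate=SAMPLE_RATE):
    """Pick the strongest row and column frequency and map them to a key."""
    r = max(range(4), key=lambda i: goertzel_power(samples, ROW_FREQS[i], sample_rate))
    c = max(range(4), key=lambda i: goertzel_power(samples, COL_FREQS[i], sample_rate))
    return KEYPAD[r][c]
```

For example, detect_key(dtmf_tone("5")) recovers the key "5". A detector for real recordings would also need an energy threshold to reject speech and road noise, which this sketch omits.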





Structure of the Database

  • Data is collected under the following five conditions:
    1. Car idling
    2. Car running at 35 mph with all windows rolled up
    3. Car running at 35 mph with front windows rolled down
    4. Car running at 55 mph with all windows rolled up
    5. Car running at 55 mph with front windows rolled down
  • For each of the five conditions, there are two sets of scripts under the following categories:
    1. Isolated digits
    2. Isolated letters
    3. Phone numbers
    4. Sentences
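
The grid of recording conditions and script categories listed above can be enumerated as in the Python sketch below. The labels paraphrase the lists, and the identifiers are hypothetical. They do not reflect the actual file naming or directory layout of the database, and the two script sets are not modelled because this page does not say how they are organized.

```python
from itertools import product

# Noise conditions, paraphrased from the list above (labels are hypothetical).
CONDITIONS = [
    "idle",
    "35 mph, all windows up",
    "35 mph, front windows down",
    "55 mph, all windows up",
    "55 mph, front windows down",
]

# Script categories, paraphrased from the list above.
CATEGORIES = [
    "isolated digits",
    "isolated letters",
    "phone numbers",
    "sentences",
]

# Every (condition, category) combination covered by the design.
CELLS = list(product(CONDITIONS, CATEGORIES))

for condition, category in CELLS:
    print(f"{condition:28s} | {category}")
```

This gives 5 x 4 = 20 condition and category combinations.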

Created by Bowon Lee. Last updated 09/22/2005.