International Speech Lexicon Project Page (ISLEX)

Mark Hasegawa-Johnson and Margaret Fleck

ISLEX is an on-again, off-again project whose goal is to provide dictionaries suitable for automatic speech recognition in the widest possible variety of circumstances. The current English dictionary has 300,000 entries, culled from open-source dictionaries including CMUdict and Moby. There are currently no dictionaries in other languages, but we're working on them.

Download

Features

ISLEdict is designed to have the following features:
  1. Redistributable. All sources for this dictionary allow redistribution (see SOURCES.bib for a list of sources, partly obsolete). All entries in this dictionary may be edited and redistributed.
  2. Words with a pronunciation but no part-of-speech tag have not been verified, or could not be verified. Most of these words will eventually be verified, but some may be unverifiable, either because they are neologisms from some corpus or other, or because they are mistakes. They remain in the dictionary for now, in case you find them useful.
  3. Words marked "(fw misspelling)" are words known to be misspelled, usually from the Switchboard dictionary. They remain in the dictionary because they may be useful in Switchboard training.
  4. Useful out of the box. This dictionary includes every entry in the Mississippi State Switchboard dictionary, including fragments, digit strings, neologisms, mispronunciations, and misspellings. You should therefore be able to train a Switchboard recognizer using this dictionary without any special handling of fragments, digit strings, and the like. Note that this property will change soon: fragments and neologisms will be moved to auxiliary dictionaries, downloadable but separate.
  5. Feature-rich. The dictionary includes the following information: pronunciation, syllabification, and lexical stress for every word; part-of-speech and named-entity tags (when appropriate) for most words; and morphology for only a few words.
  6. Part of speech is derived from the Edinburgh WSJ corpus, from the SIL KTagger program, and from named entity files: see README-POS.txt.
  7. Named entity tags are labeled with "nnp" or "nnps", followed by the named entity type. The following tags currently exist.
  8. Morphology: two types of morphological information are provided by the SIL KTagger program for the words it knows (with some extra labels provided by hand):
    1. (root:...) specifies the root word
    2. (morphology:...+...) specifies orthographic morpheme decomposition for words with known non-trivial decomposition
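
The two morphological annotations described above could be read with a short script. This is only a minimal sketch: the page does not show the actual entry layout, so it assumes the annotations appear literally as `(root:...)` and `(morphology:...+...)` somewhere in an annotation string. The function name `parse_morphology` and the sample string are hypothetical, not part of ISLEdict.

```python
import re

# Assumed annotation shapes, taken from the descriptions above.
# The real ISLEdict file layout may differ.
ROOT_RE = re.compile(r"\(\s*root:\s*([^()]*?)\s*\)")
MORPH_RE = re.compile(r"\(\s*morphology:\s*([^()]*?)\s*\)")


def parse_morphology(annotation):
    """Return (root, morphemes) found in an annotation string.

    root is None when no non-empty (root:...) annotation is present;
    morphemes is an empty list when no (morphology:...) annotation is present.
    """
    root_match = ROOT_RE.search(annotation)
    morph_match = MORPH_RE.search(annotation)

    # Treat an empty "(root:)" the same as a missing annotation.
    root = root_match.group(1) if root_match and root_match.group(1) else None

    morphemes = []
    if morph_match:
        # Morphemes are separated by "+" in the orthographic decomposition.
        morphemes = [m.strip() for m in morph_match.group(1).split("+") if m.strip()]
    return root, morphemes


if __name__ == "__main__":
    # Hypothetical annotation string, for illustration only.
    print(parse_morphology("(root:walk) (morphology:walk+ing)"))
```

Returning `None` and an empty list for missing annotations reflects the note above that morphology exists for only a few words, so callers should expect most entries to have neither.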

Mark Hasegawa-Johnson, June 1, 2007