International Speech Lexicon Project Page (ISLEX)
Mark Hasegawa-Johnson and Margaret Fleck
ISLEX is an on-again, off-again project whose goal is to provide
dictionaries suitable for automatic speech recognition in the widest
possible variety of circumstances. The current English dictionary has
300,000 entries, culled from open-source dictionaries including CMUdict
and Moby. There are currently no dictionaries in other languages, but
we're working on it.
Download
- Interim version May 22, 2008. Changes: coverage almost doubled
by incorporating the nmoby dictionary; phone codes
changed to worldbet; all information with any restrictions is now removed,
so that you can now freely redistribute the dictionary without
a license file. Caveat: the dictionary is far too big, and you
should not use it as-is for any real ASR task; smaller subdivided
versions will (hopefully) be posted here soon.
Download worldbet version here.
The StarChallenge phone code is a kind of stripped-down worldbet.
Download StarChallenge dictionary here.
- Version 0.2.0, June 6, 2007: zip, tgz
Features
ISLEdict is designed to have the following features:
- Redistributable. All sources for this dictionary allow
redistribution (see SOURCES.bib for a list of sources, partly obsolete).
All entries in this dictionary may be edited and redistributed.
- Words with a pronunciation but no part-of-speech tag have not
been verified, or could not be verified. Most of these words will
eventually be verified, but some may be unverifiable, either
because they are neologisms found in one of the source corpora,
or because they are mistakes. They remain in the dictionary for
now, in case you find them useful.
- Words marked "(fw misspelling)" are words known to be
misspelled, usually from the Switchboard dictionary. They
remain in the dictionary because they may be useful in
Switchboard training.
- Useful out-of-the-box. This dictionary includes every entry in
the Mississippi State Switchboard dictionary, including fragments,
digit strings, neologisms, mispronunciations, and misspellings.
Therefore you should be able to train a Switchboard recognizer
using this dictionary without any special manipulation of
fragments, digit strings, etc. Note that this property
will change soon: fragments and neologisms will be consigned
to auxiliary dictionaries, downloadable but separate.
- Feature-rich. The dictionary includes the following information:
pronunciation, syllabification, and lexical stress are available
for every word; part of speech and named entity tags (when
appropriate) exist for most words; morphology exists for only a
few words.
- Part of speech is derived from the Edinburgh WSJ corpus, from
the SIL KTagger program, and from named entity files: see
README-POS.txt.
- Named entity tags are labeled with "nnp" or "nnps", followed by
the named entity type. The following tags currently exist.
- (nnp (surname|boyname|girlname) \d\.\d\d\d) specifies that
the entry was logged as a surname, male given name, or female
given name in the 1990 U.S. Census, or was listed as one of
these name types in a relevant Wikipedia entry, or (in the
case of manual verification) elsewhere on the web. The number
at the end of the tag specifies the percent frequency of
the surname, male name, or female name in 1990 U.S. Census
data, rounded to the nearest thousandth of one percent.
- (nnp city) tags include cities with more than 100,000
inhabitants, U.S. state capitals, and other cities.
- (nnp country) includes member states of the United Nations,
and an assortment of other country-like entities (dependent
protectorates, historical nations, etc.).
- (nnp continent) and (nnp state) are used to mark continents
and U.S. states.
- (nnp place) is used for other places including rivers,
mountains, planets and moons, and mythical places.
- (nnp product) is used generically for registered trademarks,
generic names of medicines, types of music, and names of
philosophies (e.g., dixieland (nnp product)).
- (nnp company) marks a for-profit corporation; (nnp organization)
marks any non-country, non-company organization.
- Morphology: two types of morphological information are provided
by the SIL KTagger program for the words it knows (with some extra
labels provided by hand):
- ( root:) specifies the root word
- (morphology:...+...) specifies orthographic morpheme
decomposition for words with known non-trivial decomposition
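
The named entity tags listed above are regular enough to pick out with a
single regular expression. The following is a minimal Python sketch, not
part of the ISLEX distribution: the function name parse_entity_tags is
hypothetical, the sample strings are made up for illustration (apart from
the dixieland example quoted above), and it assumes the tags appear in an
entry exactly as written in this list, which may not match the layout of
the actual dictionary files.

```python
import re

# Name-type tags carry a census frequency, e.g. "(nnp surname 1.006)".
# All other entity types listed on this page carry no number.
# Assumption: "nnps" tags take the same forms as "nnp" tags.
ENTITY_TAG = re.compile(
    r"\((nnps?) "
    r"(?:(surname|boyname|girlname) (\d\.\d\d\d)"
    r"|(city|country|continent|state|place|product|company|organization))\)"
)


def parse_entity_tags(text):
    """Return (pos, entity_type, percent_frequency) tuples in order of appearance.

    percent_frequency is a float for surname/boyname/girlname tags
    and None for every other entity type.
    """
    tags = []
    for match in ENTITY_TAG.finditer(text):
        pos, name_type, freq, other_type = match.groups()
        if name_type is not None:
            tags.append((pos, name_type, float(freq)))
        else:
            tags.append((pos, other_type, None))
    return tags


if __name__ == "__main__":
    # The first string is the example given in the list above;
    # the second uses an invented frequency purely for illustration.
    print(parse_entity_tags("dixieland (nnp product)"))
    print(parse_entity_tags("(nnp surname 1.006) (nnp city)"))
```

The pattern is deliberately strict about the \d\.\d\d\d frequency format, so
a tag with a differently formatted number is skipped rather than misread.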
Mark Hasegawa-Johnson,
June 1, 2007