Information for participants
9-11, plenary 1
Computational lexicography for speech and language
Dafydd Gibbon (U. Bielefeld), gibbon@spectrum.uni-bielefeld.de
Content: The lexicographic problems posed by speech and language data are very
different. The course treats
(1) Specific problems of spoken language data as lexicographic input;
(2) Lexical frameworks, formalisms, theories, models;
(3) Examples of lexical representation and inference;
(4) Lexical standards and resources;
(5) UNIX tools for lexicography and WWW lexicon interfacing.
For more information and a reading list, have a look at this page.
11-13, parallel 1a
Lexical knowledge representation
Julie Berndsen (U. Bielefeld) & Gerald Gazdar (U. Sussex)
Content: This course will provide a rather detailed look at some of the techniques currently being used for the declarative representation of lexical knowledge in the context of inheritance-based lexicons. Much of the published work on the latter concentrates on semantic and syntactic matters. This course, however, will focus entirely on issues that arise at lower levels of description, including, for example, the role of phonological units and prosodic structures in inflectional morphology and the types of temporal relations and phonetic/phonological information required in lexica intended for linguistic speech recognition below the level of the word. Each teaching session will be divided into a one-hour lecture and a one-hour practical class in which students will be encouraged and assisted to develop their own lexicon fragments using an implementation of the DATR language for lexical knowledge representation.
11-13, parallel 1b
The use of lexica in text-to-speech systems
Silvia Quazza (CSELT S.p.A., Torino, Italy) & Henk Van den Heuvel (U. Nijmegen), silvia.quazza@cselt.stet.it & heuvel@iris1.let.kun.nl
Content Quazza: A general presentation on TTS and a short introduction
to TTS lexica followed by the CSELT approach concerning TTS lexica, covering:
1) lexica for TTS-oriented part-of-speech tagging
2) lexicon-based automatic learning of rules assigning lexical stress in
Italian
3) lists of exceptions to phonetic transcription rules
4) lists of expansions of abbreviations and acronyms
5) domain-specific pronunciation lexica for TTS applications
6) use of lexica in the development of CSELT's telephone directory application, in
which CSELT's concatenative TTS system is specialized with ad-hoc high-quality
acoustic units (in some cases entire words):
6.1) domain-specific frequency lexica to be covered by the units
6.2) relational database for acoustic unit extraction, where the speech
signal is linguistically tagged
Content Van den Heuvel: THE ONOMASTICA PROJECT: MULTILINGUAL PRONUNCIATION LEXICONS FOR NAMES
The ONOMASTICA project (started January 1993; ended June 1995)
concentrated on the generation of pronunciation lexicons for
forenames, surnames, street names, city names and company names for 11
European languages for speech synthesis purposes.
In addition, 1000 names per language were selected and transcribed
in all 11 languages, thus offering interesting opportunities for
cross-language comparisons.
In this course the lexicons resulting from this project and the
principles underlying their generation will be presented and evaluated.
Reading list Quazza
Reading list Van den Heuvel
14-16, parallel 2a
Speech databases
Christoph Draxler (U. Munich), draxler@phonetik.uni-muenchen.de
Content:
1st day: General introduction to Speech Databases (Corpora)
- Outline of lecture
- Definition: Corpora
--- CRIL (computer representation of individual languages)
- Classification of Corpora
- Standards (and standards bodies)
--- ASCII, ISO, Unicode, audio formats
--- SAM, IPA, NIST
--- Annotation standards
--- SGML
2nd day: Technology
- Technology overview
--- storage
--- networks
- Introduction to
--- Signal Processing
--- Relational and Object-Oriented Databases
--- Parsing
(limited to what is needed in the case studies)
- Exercises for Introductions
3rd day: Sample Corpora
- Corpus design and creation
- Three case studies
--- small: MRI images (or similar)
--- medium: PhonDat
--- large: Verbmobil and SpeechDat
4th and 5th day: Access and Distribution
- Infrastructure
--- DBMS storage of corpora
--- WWW access
--- CD-ROM production and distribution
--- corpus updates
--- legal issues/accounting
- Processing
--- tools for corpus creation
--- low- and high-level access interfaces
--- integration of additional data
- Outlook: modern corpora
Reading list Draxler 
14-16, parallel 2b
Constraint-based Lexicons
Gosse Bouma (BCN, Groningen), Frank van Eynde (CCL, Leuven) & Dan Flickinger (CSLI, Stanford & BCN, Groningen)
Content: Constraint-based grammar formalisms, such as Head-driven Phrase
Structure Grammar, often combine highly abstract and general syntactic rules
with information-rich lexical entries. Consequently, the structure of
the lexicon is of central concern to such theories.
In the course we provide an overview of techniques which can be used
to encode lexical information in a computationally and linguistically
satisfying manner. We will discuss (type)
hierarchies, (non-)monotonic inheritance, relational constraints, and
lexical rules. The examples will cover inflectional as well as derivational
morphology, and valency alternations.
Course Overview
Introduction:
- Historical Remarks. Lexical rules in transformational
grammar and LFG.
- Which data need to be accounted for? Inflectional
morphology, derivational morphology, argument-structure
alternations, valence alternations.
- What issues arise? When are two lexical entries distinct,
related, identical? Lexical integrity. Lexical rules as redundancy
rules vs. lexical rules as productive rules. Directional vs.
non-directional views of lexical rules. Exceptions and blocking.
Zero derivations.
- What technical devices are at our disposal? The
architecture of HPSG, the geometry of signs, (type)
hierarchies, implicational constraints, (non-)monotonic
inheritance, underspecification, lexical rules, relational
constraints.
Inheritance:
- The hierarchical lexicon and type inheritance.
- Multiple vs. single inheritance.
- Monotonic vs. nonmonotonic inheritance.
Lexical Rules:
- Examples.
- Lexical rules as `type-to-type' mappings, as mappings between
fully specified signs, or as mappings between feature structure descriptions.
- Default matching of input and output.
- Implementation issues: Lexical rules as unary syntax rules,
rule interactions, applying rules under subsumption.
Valence alternations by Lexical Rule and Underspecification:
- English Auxiliary verbs, inversion, negation, contracted
negations, tag questions.
- Extraposition verbs.
- An account using lexical rules and an account using underspecification.
- Demonstration.
Existing Formalisms and Systems:
- Discussion of (a selection of) the following systems:
Ale, Alep, TFS, ConTroll, Hdrug, LKB, Page.
- How are lexical rules processed?
- Possibilities for nonmonotonic and/or multiple inheritance?
Lexical Rules as relational constraints:
- Lexical rules as relations between full-blown lexical entries.
- Dutch verb clusters and the scope of adjuncts.
- Delayed evaluation allows processing with recursive lexical
rules.
- Demonstration.
Monotonic Lexical Rules:
- Lexical rules as recursive constraints on the mapping from
argument structure to valence.
- Lexicalist Extraction: accounting for complement,
adjunct, and subject extraction without lexical rules.
- Tentative: Argument composition and extraction without
subsumption tests, French cliticization without lexical rules.
Recommended reading:
This list of articles is recommended as reading material
for students who want to attend the course. We will assume that all
participants are familiar with the items marked
`essential'. The paper by Bresnan provides a background for many of the
issues (especially the use of lexical rules) discussed in this course.
The advanced material presents more recent approaches, and will also
be presented in class.
The papers by Krieger and Nerbonne, van Eynde, van Noord and
Bouma, and Bouma are available on the World Wide Web.
Additional material will be distributed during the summer school.
Reading list Bouma, Flickinger & Van Eynde.
Or in .bib format.
9-11, plenary 2
The mental lexicon
Ardi Roelofs & Harald Baayen (Max Planck Institute for Psycholinguistics, Nijmegen), ardi@mpi.nl & baayen@mpi.nl
Content: The mental lexicon plays a key role in psycholinguistic research on the production and perception of speech. In speech production, following initial conceptualization, words have to be accessed in the mental lexicon and made available for pronunciation. In language comprehension, visual or auditory input has to be mapped onto the entries in the mental lexicon, in order to access word meanings. The aim of this course is to outline the differences and similarities between speech production and comprehension, to introduce the computational models that have been developed for the processes of lexical access, and to show how these models are constrained by experimental data on the one hand, and by the lexical statistics of the language on the other.
11-13, parallel 3a
The use of lexica in ASR
Lori Lamel & Martine Adda-Decker (LIMSI, Orsay), lamel@limsi.fr
Content: We will explore the role of the lexicon in automatic speech recognition systems. Lexical design entails two main parts: selection of the vocabulary items and representation of the pronunciation entry using the basic units of the recognition system. For large-vocabulary continuous speech systems, the recognition vocabulary is selected to maximize lexical coverage for a lexicon of a given size. The pronunciation of each lexical item is usually specified using phonemes or phone-like units. The use of pronunciation alternatives will be addressed. Comparative practical sessions will address issues in lexical design and phonological variability in English, French and German.
11-13, parallel 3b
Automatic Learning of Lexical Structure
Walter Daelemans (Tilburg) & Gert Durieux (CNTS, UIA, Antwerp)
Content: The course provides an overview of statistical and machine
learning techniques for the extraction of lexical rules and
representations from corpora for NLP. Focus will be on
(i) techniques for lexical disambiguation (morphosyntactic and word sense
disambiguation),
(ii) unsupervised learning techniques for the discovery and use of
lexical categories, and
(iii) learning of phonological aspects of lexical representation and
processing.
Reading list Daelemans & Durieux.
Web page with course material
14-16, parallel 4a
Recognizing Lexical Units in Text
Gregory Grefenstette, Anne Schiller, Salah Ait-Mokhtar
(Multilingual Theory and Technology,
Rank Xerox Research Centre, Grenoble)
Content Grefenstette: Lexical Affinities: First, Second, Third Order
- Discovery Techniques
Content Schiller: Part-of-speech tagging and NP markup
- Overview
--- NP recognition tool
- Introduction
--- finite-state calculus
--- basic commands of "FST"
- Building finite-state transducers
--- sample lexicon
--- sample guesser
--- sample tokenizer
- Multi-word units
--- MW in lexicon/guesser
--- tokenizing MWs
- HMM disambiguation
- finite-state NP markup
Content Ait-Mokhtar: Incremental finite-state parsing and dependency extraction
- An introduction to shallow parsing
- Incremental finite-state parsing:
--- Annotating text with regular expressions (replace operator, etc.)
--- Marking non-recursive chains: AP, NP and PP
--- Marking verbs and embedded clauses
--- Syntactic function marking (e.g. Subject and Object)
- Dependency extraction
Reading list Grefenstette, Schiller & Ait-Mokhtar.
Or in a .bib format.
14-16, parallel 4b
Multilingual Lexicography
Susan Armstrong (ISSCO, U. Geneva) & Bianka Buschbeck-Wolf (Institute for Natural Language Processing, U. Stuttgart), susan.armstrong@issco.unige.ch & bianka@ims.uni-stuttgart.de
Content: The course will present issues in lexicon development
from a multilingual perspective. Pairing lexical items across
languages requires a detailed analysis and explicit representation
of the lexical and contextual properties of the words and their
translations. The information to be coded includes morphological
properties, subcategorization frames and dependency relations, as
well as semantic and pragmatic information. For a bilingual
application, these properties must be mapped to the
particularities of related words in another language.
Another
dimension to be taken into account is the formalization of the
lexical information according to the user or application. For a
human language learner, background knowledge and inferencing
skills can be used to interpret a given dictionary entry, albeit
differently for a passive or active task. For a machine
translation system, the lexicon entries must be coded in a very
formal and explicit manner. The representation of lexical and
structural mismatches for translation purposes also depends in
part on the system design.
The course will survey and
analyze current approaches to solving these problems and will
discuss methods for partially automating these tasks.