PAPERS IN COMPUTATIONAL LEXICOGRAPHY COMPLEX '94


PAPERS IN COMPUTATIONAL LEXICOGRAPHY. COMPLEX '94. Edited by Ferenc Kiefer, Gábor Kiss and Júlia Pajzs. Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest.




PAPERS IN COMPUTATIONAL LEXICOGRAPHY. COMPLEX '94. Edited by Ferenc Kiefer, Gábor Kiss and Júlia Pajzs. Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, 1994.

Proceedings of the 3rd International Conference on Computational Lexicography, COMPLEX '94, Budapest, Hungary. All correspondence should be sent to: Research Institute for Linguistics, Hungarian Academy of Sciences, Department of Lexicography and Lexicology, Budapest, P.O. Box 19, Hungary 1250. Cover design by Gábor Kiss. Technical assistance: Judit Pais. ISBN 963 8461 78 0. © Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest 1994. Reproduced from supplied material, 9421541, AKAPRINT Nyomdaipari Kft., Budapest. Responsible manager: dr. Héczey Lászlóné.

Contents

Júlia PAJZS: Preface
Stephan BOPP - Marc DOMENIG: A User-Centered Meta-Formalism for Morphology
Lorne H. BOUCHARD - Louisette EMIRKANIAN: The Organization of the Lexicon in GSF: Structure and Implementation
Oliver CHRIST: A Modular and Flexible Architecture for an Integrated Corpus Query System
Jeremy CLEAR: I Can't See the Sense in a Large Corpus
Markus DUDA: A Parallel Approach to Lexicon Design
Stefano FEDERICI - Vito PIRRELLI: The Compilation of Large Pronunciation Lexica: the Elicitation of Letter-to-Sound Patterns through Analogy-Based Networks
Gunter GEBHARDI: Lexical Access in an Integrated Speech-Language System
Gregory GREFENSTETTE - Pasi TAPANAINEN: What is a Word, What is a Sentence? Problems of Tokenization
Patrick HANKS: Linguistic Norms and Pragmatic Exploitations or, Why Lexicographers Need Prototype Theory, and Vice Versa
Ulrich HEID: Contrastive Classes - Relating Monolingual Dictionaries to Build an MT Dictionary
Adam KILGARRIFF: A Dictionary for Language Generation

Frank KNOWLES - Peter ROE: Facilitating the Corpus-Building Process and Maximising the „Analytical Yield”: A LSP-Oriented Case Study
Béatrice LAMIROY: Lexicographie Computationnelle et Auxiliaires des Langues Romanes
Éric LAPORTE: Experiences in Lexical Disambiguation Using Local Grammars
Yvette Yannick MATHIEU: Un Système d'Interprétation des Verbes Psychologiques du Français
Mehryar MOHRI: Syntactic Analysis by Local Grammars Automata: an Efficient Algorithm
NAM Jee-Sun: Représentation de la Combinatoire des Variantes Consonantiques et Vocaliques et de la Combinatoire des Suffixes de Conjugaison des Adjectifs en Coréen
Júlia PAJZS: Project Report on the Historical Dictionary of Hungarian
Roswitha RAAB-FISCHER: A Hyperinflation of Lexical Mega-Monsters? Super-, Mega-, Ultra-, and Hyper- as Intensifying Prefixes: A Corpus-Based Study
Ferenc ROVNY: The Debrecen Computational Lexicographical-Terminological Project in Foreign Languages for Special Purposes: the Initial Stage
Jacqueline VISCONTI: How a Morphological Lexicon for the Italian Language Can Deal with Enclitic Pronominalisation
Eduard WERNER: Towards an Expert System for Upper Sorbian
List of Participants

PREFACE

This volume collects the papers presented at the third conference on Computational Lexicography and Text Research, held in Budapest on 7-9 July 1994. The conference was jointly organized by the Hungarian Academy of Sciences, Research Institute for Linguistics and the Université Paris 7, Laboratoire Automatique Documentaire et Linguistique.

This time again a great number of papers were submitted to the conference, covering most topics of computational lexicography and corpus research: from highly theoretical subjects of lexicography, such as prototype theory, to practical problems such as optimal methods for parallel search in the lexicon. Some papers report on experiences in corpus research, and a couple of them offer a method for lexical disambiguation and sense discrimination. We can see interesting examples of using and building lexical databases for different purposes (MT, AI, speech generation and recognition, etc.). The new possibilities offered by electronic publishing of dictionaries are also presented.

We are grateful for the work of the program committee, who assisted in selecting among the submitted papers and whose valuable comments helped the authors to prepare the final versions of their presentations. The members of the committee: Anna BRAASCH (University of Copenhagen), Maurice GROSS (Université Paris 7), Ferenc KIEFER (Hungarian Academy of Sciences), Ole NORLING-CHRISTENSEN (University of Copenhagen), Júlia PAJZS (Hungarian Academy of Sciences), Tamás VÁRADI (University of London).

Júlia Pajzs


A User-Centered Meta-Formalism for Morphology

Stephan BOPP - Marc DOMENIG

Abstract

This paper presents a system for the specification and the use of dictionary databases. Prominent characteristics of the system are its user-centredness and its knowledge specification formalism. We call the latter a meta-formalism because it allows the linguist to work on a higher level of abstraction than formalisms based on rewriting rules. The paper focuses on the “tool character” of the system rather than on the underlying algorithms. The formalism has been implemented and tested in several prototyping cycles. Specifications of Italian, German, English and French morphological rule bases have shown that the system is particularly promising for the formalization of word formation.

1. Introduction

The system Word Manager can be viewed as a successor of Koskenniemi's two-level model (Koskenniemi 83). Unlike most systems inspired by the two-level model (Bear 88, Emele 88, Görz 88, Kataja 88, Kay 87, Koskenniemi 90, Trost 90, Karttunen 92), Word Manager does not try to extend the formalism's expressiveness for a wider coverage of languages; it was originally designed to improve the data management capabilities of the system (Domenig 89, 90). The implementation of a first fully operational prototype version in 1989 and three major redesigns resulted in a system with, among others, the following characteristics:

• Word Manager follows a client-server model, where a server handles the data management and different clients handle the data access, including all user interfacing.

• Focus on reusability: Word Manager maintains a large network of knowledge accessible in various ways (see below). The purpose of this approach is to construct a reusable database: the database must be accessible by all kinds of applications requiring morphological knowledge.

• Distinction between rule and entry knowledge: a sharp distinction is made between the specification of rule and entry knowledge. Rule knowledge has to be specified before entry knowledge can be added. The system provides separate user interfaces for the specification of the two kinds of knowledge: they are called linguist interface and lexicographer interface, respectively.

The following sections will focus on the user-centredness of the system. We will show this with the example of the linguist interface, a version of which is in the public domain and installed on an ftp server.

2. Tools for the Morphological Knowledge Specification

The linguist is responsible for the specification of morphological rules. The linguist interface is the client that has full access to the knowledge specification formalism of Word Manager (WM).
The description of the tools available for the specification of morphological rules will demonstrate that the system has been designed to meet the requirements of a linguistic expert in a user-friendly manner.

2.1. Database as Document

A WM-Database is a morphological dictionary database consisting of morphological rules and entries, usually, but not necessarily, of one language. Each database corresponds to one document that can be manipulated only from within a dedicated “knowledge engineering environment”. This environment supports the system's formalism by providing dedicated editors for a number of sub-formalisms, each of which covers a specific domain. In addition, it offers testing and debugging tools which permit the user to work in specification/compilation/testing cycles.

2.2. Structuring of the Specification

The user can structure the rules hierarchically into so-called inflection units and word-formation units. She or he can use the same or similar structuring criteria as in a traditional grammar. This both facilitates the specification process and results in specifications that are easy to understand. Figure 1 shows the structuring of a comprehensive Italian morphology (Bopp 1993).

[Figure: two outline trees, Italian:inflection (with units such as (ICat Regular), (ICat Irregular), (ICat Hard-Coded)) and Italian:word-formation (with units such as (WFCat Derivation), (WFCat Conversion), (WFCat Suffixing), (WFCat Compounding))]

Fig. 1: Outline structure of inflection units and word-formation units (not fully extended; features in bold have underlying sub-nodes)

2.3. Local Specification Process

The hierarchical structure enables the linguist to work “locally”. The inflection units and the word-formation units are largely independent. Furthermore, it is possible to restrict the scope of rules to one or any number of sub-units, to one particular type of entry, etc. In this manner, highly generalizing rules as well as rules with a clear-cut “local” scope can be specified.
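The idea of scoping a rule to a unit of the hierarchy can be pictured with a minimal sketch. It is not Word Manager's implementation: the class `ScopedRule`, the function `in_scope`, the rule names and the representation of a unit as a path of feature pairs are all assumptions made for illustration, loosely modelled on the feature names of Fig. 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedRule:
    """Hypothetical rule record: a name plus the unit it is attached to."""
    name: str
    scope: tuple  # path of (attribute, value) pairs from the root of the hierarchy


def in_scope(rule, entry_path):
    """A rule applies when the entry's unit lies at or below the rule's unit."""
    return entry_path[:len(rule.scope)] == rule.scope


# Illustrative unit paths, not taken from an actual WM specification.
regular_noun = (("Cat", "N"), ("ICat", "Regular"))
irregular_noun = (("Cat", "N"), ("ICat", "Irregular"))

general = ScopedRule("applies-to-all-nouns", (("Cat", "N"),))
local = ScopedRule("applies-to-irregular-nouns", irregular_noun)

print(in_scope(general, regular_noun))  # the general rule reaches every noun unit
print(in_scope(local, regular_noun))    # the local rule does not leave its sub-unit
```

Attaching a rule high in the tree gives a generalizing rule, attaching it to a leaf gives a strictly local one; the prefix comparison is simply the cheapest way to express "at or below this unit" under the assumed path representation.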
Consider the string manipulation rule responsible for the umlauted plural in German nouns like "Vater/Väter" ('father/fathers'):

(ISRule Noun-Umlaut_a/ä)
  "(.*)A(.*)/\1ä\2"
  (ICat N-Stem)
  (ICat N-Suffix)(Num PL)

The example shows that the formalism used for string manipulation rules allows restrictions on the strings (input must contain the character "A") as well as restrictions on the features of the formatives that are combined into a noun plural form (a noun stem plus a noun plural suffix). The rule is fired only when all the restrictions are met ("A" is replaced by "ä"). Further restrictions on such string manipulation rules can be defined by associating them with individual inflection rules, word-formation rules, or even with single entries. The latter possibility can be employed for irregular phenomena and as an “escape hatch” for cases where an exception is discovered at a late stage in the data acquisition process.

2.4. The Meta-Formalism

The Word Manager formalism can be considered a meta-formalism: it abstracts away from the underlying machine-oriented processing in order to provide a user-centred view. Let us illustrate this with two example rules. The first rule is an inflection rule for Italian nouns of the a/e-class (e.g. "donna/donne" 'woman/women'; "pizza/pizze"):

(RIRule N-Regular.+a/+e)
citation-forms
  (ICat N-Stem) (ICat N-Suffix.+a)(Num SG)
word-forms
  (ICat N-Stem) (ICat N-Suffix.+a)(Num SG)
  (ICat N-Stem) (ICat N-Suffix.+e)(Num PL)

The rule combines a noun stem and a singular suffix into the wordform of the singular, and the same stem plus a plural suffix into the wordform of the plural. The formatives (stem and suffixes) are specified within the same inflection unit (cf. 2.2.) as the rule. The features specifying formatives and rules can be chosen freely, the only restriction being that regular inflection rules have the attribute 'RIRule', regular word-formation rules the attribute 'RWFRule', etc. This, again, allows the user to keep the specification very close to the terminology used in traditional linguistics. The second example rule is a word-formation rule defining the prefixing of regular Italian nouns (e.g. "presidente > vicepresidente", "formalismo > metaformalismo"):

(RWFRule Derivation.To-N.N-To-N.Prefixing)
source
  1 (WFCat Prefix)
  2 (ICat N-Stem) (Cat N)(RIRule ?) > entry-features (Gender >)
target
  1 2 (ICat N-Stem) (RIRule ?)

All noun stems belonging to a lexeme class defined by an inflection rule for regular nouns (Cat N)(RIRule ?)
can be prefixed with formatives qualified by a feature (WFCat Prefix)

(specified within the same word-formation unit). The prefix (1) and the noun stem (2) are combined into a noun stem in the order defined by the digits representing them under target. The lexeme classes of the newly formed entries and their gender features are propagated (indicated by ">") from the source lexemes.

The examples show why linguists understand this formalism after a relatively short learning period: they can use familiar terminology and knowledge factoring. Furthermore, the examples illustrate why we call the formalism a meta-formalism: the rules are on a higher level of abstraction than rewriting rules. This means that they can be compiled in various ways. Currently, WM supports three types of compilation. First, it generates a network which can be used by a finite-state machine for the analysis and generation of inflected forms. Second, it compiles a set of AI-type condition/action rules which permit the generation of word formations. These rules are primarily employed to enter complex entries, so that derivational dependencies between lexemes can be recorded and controlled by the system. Third, it generates a set of rewriting rules that can be interpreted by a unification-based parser, which permits analysing complex words that are potentially generated by the word-formation rules. Unlike most hand-compiled rewriting rules for word-formation analysis, these rules are not designed to construct parse trees containing morphosyntactic information; instead, they build trees whose nodes represent meta-level word-formation rules, which means that they can be used for (semi-)automatic registration of complex entries (see 2.7).

Evidently, further rule sets could be derived from WM's meta-level formalism. Given our current rule generation algorithms, it would be quite easy, for instance, to compile a set of rules that collects morphosyntactic features for unknown word formations.

2.5. Browsing Facilities
To support the user in the task of specifying knowledge, sophisticated browsing options have been implemented. They offer the possibility to view and access the specified knowledge from different perspectives.

2.5.1. General Entity Browser

The General Entity Browser allows the user to browse the entire network of entities: rules, formatives and entries. By indicating both the kind of entity (Entity Restriction) and a restriction on the features qualifying these entities (Feature Restriction), the user can search very selectively. Figure 2 shows the result of browsing with a restriction on the entity RIRules (Regular Inflection Rules) and the feature (Cat N) (Category Noun) in a German database:

Fig. 2: Entity Browser with RIRule selection for nouns

Each of these entities can be further explored with the Aspects menu in the lower right-hand corner of the browser. The user clicks on one of the listed entities and selects one of the options in the Aspects menu. The aspects available differ, of course, for different kinds of entities. Figure 3 shows (some of) the entries inflecting with the rule (RIRule N-Regular.+ES/+E):

[Figure: browser window listing entries of the rule, among them "muttertag", "prozent", "rosenstrauss", "schnitt", "schuh", "spiel", "strauss" and "tag", each with the feature (Cat N) and a gender feature]

Fig. 3: Entries inflecting with the inflection rule (RIRule N-Regular.+ES/+E)

Since the retrieved entities are collected and displayed in the same table in which the “original” entity was displayed, this retrieval procedure can be repeated indefinitely.
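The two restrictions of the General Entity Browser amount to a filter over typed, feature-qualified records. The following is only a sketch under stated assumptions: the `Entity` record, the `browse` function and the tiny `db` list are hypothetical, the verb rule `V-Regular` is invented as a distractor, and the gender values are supplied for illustration rather than read off the scanned figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical record for anything stored in the network."""
    kind: str            # e.g. "RIRule", "Formative", "Entry"
    name: str
    features: frozenset  # set of (attribute, value) pairs


def browse(entities, entity_restriction=None, feature_restriction=()):
    """Keep entities of the requested kind that carry all requested features."""
    wanted = frozenset(feature_restriction)
    return [
        e for e in entities
        if (entity_restriction is None or e.kind == entity_restriction)
        and wanted <= e.features
    ]


def entry(name, gender, rule):
    # Helper for the toy data below: a noun entry tied to its inflection rule.
    return Entity("Entry", name,
                  frozenset({("Cat", "N"), ("Gender", gender), ("RIRule", rule)}))


NOUN_RULE = "N-Regular.+ES/+E"
db = [
    Entity("RIRule", NOUN_RULE, frozenset({("Cat", "N")})),
    Entity("RIRule", "V-Regular", frozenset({("Cat", "V")})),  # invented distractor
    entry("tag", "M", NOUN_RULE),
    entry("spiel", "N", NOUN_RULE),
    entry("schuh", "M", NOUN_RULE),
]

noun_rules = browse(db, "RIRule", [("Cat", "N")])
rule_entries = browse(db, "Entry", [("RIRule", noun_rules[0].name)])
print([e.name for e in rule_entries])
```

The second call plays the role of the Aspects menu: the result of one query (a rule) becomes the restriction of the next (its entries), which is why the retrieval can be chained as often as desired.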

2.5.2. Lexeme Browser

This browser is specially designed for testing purposes. It can be invoked from the General Entity Browser or by analysing words. It offers different views on a lexeme: the user can test its inflection rule (e.g. by viewing the wordforms it generates, cf. Fig. 4), the word-formation rule by which the lexeme has been created, the word-formation rules by which other lexemes have been derived or composed (e.g. by viewing the so-called Generation History, cf. Fig. 5), etc.

[Figure: lexeme "kind" (Cat N)(Gender N), inflecting with (RIRule N-Regular [E]S/ER), with its singular wordforms and their formatives]

Fig. 4: Lexeme Browser of "kind" ('child'), view on wordforms of the singular

[Figure: Generation History of "kind", listing "kindlich", "kinderreich", "schulkind", "kinderbuch", "bauernkind", "herzenskind" and "kinderei"]

Fig. 5: Lexeme Browser of "kind", view on derived lexemes
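How a wordform view of this kind might be computed can be sketched with the umlaut rule of section 2.3. Everything here is an assumption made for illustration: `TOY_RULE` with its empty suffixes is not a WM inflection rule, the matched character is taken to be a plain lowercase "a" (the printed rule shows "A"), and the feature restriction is reduced to the number feature of the wordform being built.

```python
import re

# Toy inflection rule: each line gives a suffix and the features of the wordform.
# The empty suffixes are an assumption chosen to fit "vater"; not a WM rule.
TOY_RULE = [
    ("", frozenset({("Num", "SG")})),
    ("", frozenset({("Num", "PL")})),
]

# String manipulation rule modelled on Noun-Umlaut_a/ä: a pattern on the stem,
# a replacement template, and the features the wordform must carry.
UMLAUT = (re.compile(r"(.*)a(.*)"), r"\1ä\2", frozenset({("Num", "PL")}))


def build_wordform(stem, suffix, features, string_rule=UMLAUT):
    """Apply the string rule only if both the string and the features match."""
    pattern, template, required = string_rule
    match = pattern.fullmatch(stem)
    if match is not None and required <= features:
        stem = match.expand(template)
    return stem + suffix


def paradigm(stem, rule=TOY_RULE):
    """List every wordform of the rule together with its features."""
    return [(build_wordform(stem, suffix, feats), dict(feats))
            for suffix, feats in rule]


for form, feats in paradigm("vater"):
    print(form, feats)
```

The sketch fires the umlaut rule for any plural stem containing "a". In the system described here that would be too permissive, which is presumably why such rules can additionally be tied to individual inflection rules or single entries.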

By selecting one of the words listed under Generation History, a further lexeme browser can be opened. In this way, the user can follow step by step all dependencies between lexemes (Fig. 6).

[Figure: three stacked lexeme browsers, opened in turn for "kind", "kindlich" and "kindlichkeit"]

Fig. 6: Sequence of lexeme browsers of related lexemes

Further viewing options show whether two entries are related by derivation, compounding or conversion, what string manipulation rules were fired when applying a word-formation rule, the dependency relations between entries (Fig. 7), etc.

[Figure: graph linking "kind" to "kindlich", "kindlichkeit" and "kinderei", and, together with "schule", "bauer", "herz", "buch", "reich" and "autor", to "schulkind", "bauernkind", "herzenskind", "kinderbuch", "kinderreich" and "kinderbuchautor"]

Fig. 7: "kind": (partial) word-formation cluster view
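Following dependencies step by step is, in effect, a walk over a graph of "formed from" relations. The sketch below is a plain breadth-first traversal and makes no claim about how WM stores or traverses its network; the dictionary `DERIVED_FROM` is a hand-made excerpt using lexemes legible in Fig. 7, and the function name is hypothetical.

```python
from collections import deque

# Each derived or compound lexeme mapped to the lexemes it was formed from
# (an illustrative excerpt, not a dump of the database).
DERIVED_FROM = {
    "kindlich": ["kind"],
    "kindlichkeit": ["kindlich"],
    "kinderei": ["kind"],
    "schulkind": ["schule", "kind"],
    "kinderbuch": ["kind", "buch"],
}


def derived_lexemes(source):
    """Breadth-first list of everything formed, directly or not, from `source`."""
    children = {}
    for target, sources in DERIVED_FROM.items():
        for s in sources:
            children.setdefault(s, []).append(target)

    seen, queue, order = {source}, deque([source]), []
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(derived_lexemes("kind"))
```

Breadth-first order mirrors the browsing experience: direct derivations first (the Generation History of one lexeme), then the lexemes reached by opening a further browser on each of them.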

2.6. Test Entries

As the examples above illustrate, a WM-database already contains entries in the rule specification phase. Each specification of rule knowledge includes a number of so-called hard-coded entries. They serve two purposes: 1) as test and example entries for particular lexeme classes and 2) for the hard-coding of entries considered irregular. For additional testing, the linguist can temporarily add so-called Lexicographer Entries (LE). By adding simplex entries, the inflection rules are tested; by adding complex entries, the word-formation rules are tested. Since LE are not stored as a part of the rule specification, the user can specify as many LE as he/she wishes without unnecessarily inflating the rule specification.

2.7. Analysis of Potential Entries

A further test function is provided by the possibility to analyse potential entries. These are entries that are not (yet) contained in the database but are composed of elements (stems, affixes) which are already stored. The system proposes derivations according to the word-formation rules specified in the database. In the linguist interface, this option can be used to test the completeness of the word-formation rules. Figure 8 shows the derivations proposed for the wordforms "legalized", "uncommonly" and "machine-readable" in an English database. Correct parses can be selected and directly entered into the database as lexicographer entries (cf. 2.6.).

Fig. 8: Tentative parse trees for potential entries
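A much simplified picture of proposing parses for a potential entry is to split the unknown word into stored formatives and keep the splits that a word-formation pattern allows. This sketch does not reproduce WM's unification-based parser: the `FORMATIVES` table is invented, and the single pattern "prefixes, one stem, suffixes" stands in for real word-formation rules. Compounds such as "machine-readable" are outside its reach.

```python
# Hypothetical inventory of stored formatives and their kinds.
FORMATIVES = {
    "un": "prefix",
    "common": "stem",
    "ly": "suffix",
    "read": "stem",
    "able": "suffix",
}


def segmentations(word):
    """All ways of splitting `word` completely into stored formatives."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in FORMATIVES:
            for rest in segmentations(word[end:]):
                results.append([head] + rest)
    return results


def tentative_parses(word):
    """Keep splits shaped prefix* stem suffix* (a stand-in for real rules)."""
    parses = []
    for parts in segmentations(word):
        kinds = [FORMATIVES[p] for p in parts]
        if kinds.count("stem") != 1:
            continue
        i = kinds.index("stem")
        if all(k == "prefix" for k in kinds[:i]) and \
           all(k == "suffix" for k in kinds[i + 1:]):
            parses.append(parts)
    return parses


print(tentative_parses("uncommonly"))
```

An empty result for a word the linguist considers well formed points to a missing formative or rule, which is the completeness test the section describes.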

3. Conclusion

We have presented a system we consider a successor of the two-level model. Its design was originally focused on data-management capabilities, which resulted in a client-server architecture and a formalism with two distinctive characteristics: user-centredness and the introduction of a meta-level which abstracts away from rewriting rules. While the first characteristic provides obvious benefits for the end user, the latter is promising because it carries the potential of compiling the meta-level rules to different types of data structures and rules for different purposes. So far, three compilation algorithms have been realized, all of which serve primarily for rule and entry acquisition purposes. Other algorithms, optimised for run-time usage of operational databases, have yet to be conceived.

Several comprehensive morphological rule bases have been developed. This experience has shown that the system is both easily understandable for linguists and powerful enough to allow the specification of the inflectional and derivational morphology of several natural languages (Bopp 88, 93; Brunner 91; Garcia 91; Gregorio 93; Gupta 89). The rule base for Italian morphology (Bopp 93) is, especially as far as word formation is concerned, the most comprehensive for this language.

The database structure, the sophisticated specification facilities and the flexible knowledge representation resulted in a large system that was expensive to develop. But then, Word Manager is conceived as a system to be used in a larger client-server environment. Furthermore, it is possible to transpose the knowledge contained in a WM-database into small, PC-compatible systems like, e.g., the morphological analysers developed at Xerox PARC as described in Karttunen (1992).

References:

Bear J. (1988): 'Morphology with Two-Level Rules and Negative Rule Features.' In Proceedings of the 12th International Conference on Computational Linguistics, COLING-88, Budapest, August 22-27.

Bopp S. (1988): Tentativo di formalizzazione computazionale della morfologia flessionale dell'italiano, Lizentiatsarbeit an der Philosophischen Fakultät I der Universität Zürich.

Bopp S. (1993): Computerimplementation der italienischen Flexions- und Wortbildungsmorphologie, Olms Verlag, Hildesheim.

Brunner C. (1991): An Implementation of English Morphology Using the Program Word Manager, Lizentiatsarbeit an der Philosophischen Fakultät I der Universität Zürich.

Domenig M. (1989): Word Manager, A System for the Specification, Use and Maintenance of Morphological Knowledge, Habilitationsschrift, University of Zurich.

Domenig M. (1990): 'Lexeme-based Morphology: A Computationally Expensive Approach Intended for a Server-architecture.' In Proceedings of the 13th International Conference on Computational Linguistics, COLING-90, Helsinki.

Domenig M., ten Hacken P. (1992): Word Manager: A System for Morphological Dictionaries, Olms Verlag, Hildesheim.

Emele M. (1988): 'Überlegungen zu einer Two-level Morphologie für das Deutsche.' In Proceedings 4. Oesterreichische Artificial-Intelligence-Tagung, Wien, August 29-31, 1988. Published in the series Informatik-Fachberichte, 176, Springer.

Garcia C. (1991): Computerimplementation der deutschen Morphologie, Lizentiatsarbeit an der Philosophischen Fakultät I der Universität Zürich.

Görz G., Paulus D. (1988): 'A Finite State Approach to German Verb Morphology.' In Proceedings of the 12th International Conference on Computational Linguistics, COLING-88, Budapest, August 22-27.

Gregorio S. (1993): Implementation of English Inflectional and Derivational Morphology, Lizentiatsarbeit am Institut für Informatik der Universität Basel.

Gupta A. (1989): La formalisation de la morphologie française sur la base du système Word Manager, Lizentiatsarbeit an der Philosophischen Fakultät I der Universität Zürich.

Karttunen L., Kaplan R. M., Zaenen A. (1992): 'Two-Level Morphology with Composition.' In Proceedings of the 15th International Conference on Computational Linguistics, COLING-92, Nantes, July 23-28.

Kataja L., Koskenniemi K. (1988): 'Finite-state Description of Semitic Morphology: A Case Study of Ancient Akkadian.' In Proceedings of the 12th International Conference on Computational Linguistics, COLING-88, Budapest, August 22-27.

Kay M. (1987): 'Nonconcatenative Finite-State Morphology.' In Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics, Copenhagen, April 1-3.

Koskenniemi K. (1983): Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production, doctoral thesis at the University of Helsinki, Publications No. 11.

Koskenniemi K. (1990): 'Finite-state Parsing and Disambiguation.' In Proceedings of the 13th International Conference on Computational Linguistics, COLING-90, Helsinki, August 20-25.

Trost H. (1990): 'The application of two-level morphology to non-concatenative German morphology.' In Proceedings of the 13th International Conference on Computational Linguistics, COLING-90, Helsinki, August 20-25.


The Organization of the Lexicon in GSF: Structure and Implementation

Lorne H. BOUCHARD - Louisette EMIRKANIAN

A wide-coverage computational grammar must be based on a comprehensive lexical database in which the information is structured and represented in an efficient way. We describe how morphological, syntactic and semantic knowledge of French can be extracted systematically and in a computationally economical way from standard reference works such as Le Grand Robert de la langue française and Le dictionnaire de notre temps in order to construct a lexical database. This task is a bootstrap process, whereby certain information which can be gleaned easily from the dictionary is used to glean further information, and so on through multiple stages. The ultimate goal is to construct a lexical knowledge base which can account for language regularity in a systematic way and can cope with the creative use of language.

1. Introduction

This research was undertaken as part of a larger project, the goal of which is to develop a wide-coverage computational grammar of French [Emirkanian & Bouchard 1992]. From a practical point of view, a computational grammar must have wide coverage if it is to be truly useful in the large. From a more theoretical point of view, a computational grammar must also have wide coverage if it purports to be a credible model of natural language performance [Bouchard, Emirkanian & Morin 1992]. The importance of the lexicon as a central repository of phonological, morphological, syntactic and semantic information is stressed in most contemporary linguistic theories. Hence a wide-coverage computational grammar must be based on a comprehensive lexical database in which the information is structured and represented in an efficient way. The construction of such a database is a considerable task and must be

(24) 14 a u to m a te d as m u c h as F u rth e rm o re , this task is n ew w o rd s an d be able to sy stem sh o u ld be capable fo rm idable task indeed.. p o ssib le, or c o m p u te r a ssiste d a t the v e ry least. o p en e n d ed since the system m u st co n stan tly learn augm ent an d refine existing entries. U ltim ately such a of accounting for novel or creative use of language, a. S u b c a t e g o r iz a t io n. in. GPSG. and. GSF. Lexical ID ru les are the com ponents in G eneralized P hrase S tru ctu re G ram m ar (GPSG) [G azdar, K lein, P u llu m & Sag 1985] w hich b rid g e the g ap betw een g ra m m a r a n d lexicon: subcateg o rizatio n inform ation is a codification of lexical behavior. S ubcategorization of the head of lexical ID rules is used to encode the co m p le m e n t s tru c tu re n o t o n ly of verbs, b u t also of adjectives a n d n o u n s. B ecause of sp a c e lim ita tio n s, w e sh all focus ex clu siv ely on v erb s for the re m a in d e r of o u r p re s e n ta tio n . T he c o m p le m e n t s tru c tu re is a com p lex reflectio n in sy n tax of d istin ctio n s based u p o n sim p ler sem antic p ro p erties of verbs [Levin 1993]. V erb s u b c a te g o riz a tio n h a s b een tre a te d e x te n siv e ly in th e G ra m m a ire S y n ta g m a tiq u e d u Fran^ais (GSF) [GIREIL 1993]. S u b categorization fram es or schem a are associated w ith SUB features an d associated w ith a given verb is a list or set of such features. Figure 1 show s a partial list of the subcategorization codes rep resen tativ e of those u sed in the GSF.. Features. Corresponding schema. SUBO SUB1 SUB2 SUB3 SUB4 SUB5. Intransitif N3 N3,(N1) N3,(P3[á,CPR N3J) N3,(P3[de,CPR N31) N3, ADV3[avec, CPR N1). Dormir, travailler Couper, manger, résoudre, apercevoir, tenir Nommer, élire Foumir, apporter, envoyer, dire Retirer, extraire Traiter Marie avec soin. SUB20. P3[á,CPR N3]. Mentir, penser, attentif, profiter.accés, tenir. SUB33 SUB34. 
P3[á,CPR V3] P3[de,CPR V3J. Donner, penser, songer, tenir, contribuer Jurer, douter, profiler, arréter. SUB50 SUB51. Q3[que] V3[VF0RM er]. Aimer, vouloir, fait, penser Aimer, vouloir, apercevoir, penser, pouvoir. Examples. Representative list of subcategorization schema used in GSF Figure 1. T he v erb penser , for exam ple, h as the features SUB20+, SUB33+, SUB50+ and SUB51+. F igure 2 lists the subcategorization fram es associated w ith penser..
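As an illustration of how a verb's SUB features select its frames, the following minimal Python sketch stores a few schemata from Figure 1 and looks up the frames of penser. The dictionary layout and the function name `frames` are assumptions made for this example; they are not the representation used in the GSF prototype.

```python
# Schema strings are copied from Figure 1; the data layout is hypothetical.
SCHEMATA = {
    "SUB0": "Intransitif",
    "SUB1": "N3",
    "SUB20": "P3[à, CPR N3]",
    "SUB33": "P3[à, CPR V3]",
    "SUB50": "Q3[que]",
    "SUB51": "V3[VFORM er]",
}

# Each verb carries the set of SUB features marked "+" in its entry.
LEXICON = {
    "penser": {"SUB20", "SUB33", "SUB50", "SUB51"},
    "dormir": {"SUB0"},
}


def frames(verb):
    """Return the (feature, schema) pairs of a verb, ordered by feature number."""
    features = LEXICON.get(verb, set())
    pairs = [(feature, SCHEMATA[feature]) for feature in features]
    return sorted(pairs, key=lambda pair: int(pair[0][3:]))


for feature, schema in frames("penser"):
    print(feature, schema)
```

Storing the schemata once and giving each verb only a set of feature names keeps the lexicon compact, which is the point of coding frames as features.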

Features  Corresponding schema  Examples
SUB20     P3[à, CPR N3]         Gilles pense à Mireille
SUB33     P3[à, CPR V3]         Gilles pense à visiter Boston
SUB50     Q3[que]               Mireille pense que cette thèse est très bonne
SUB51     V3[VFORM er]          Mireille pense venir

Subcategorization schemata for penser
Figure 2

This information was painstakingly compiled by hand for the lexicon used in the prototype of the GSF, and it was felt that this task must somehow be automated, or at least computer-assisted, when scaling up the prototype. Furthermore, although GSF is currently a morphosyntactic analyzer, we plan to extend it with a semantic component, since not all problems in automatic language analysis can be solved by morphosyntax alone [Bouchard & Emirkanian 1992]. This involves storing more information in the lexicon, in particular sortal restrictions [Alshawi & Carter 1992], which are part of the knowledge of language that lies on the boundary between syntax and semantics. Sortal restriction information is intimately linked with case marking information and can be considered a refinement thereof. Beyond sortal restrictions, simplified semantic analysis requires the thematic structure of verbs.

To our knowledge, there are currently no comprehensive machine-readable lexicons for the French language which are generally available, and it is somewhat reluctantly that we decided to construct a lexical database. Machine-readable dictionaries are sources of natural language knowledge in which information is stored in a coherent and systematic way. The knowledge extracted can be used to build a lexical database, a necessary first step towards building a lexical knowledge base.
Also, although most existing dictionaries were constructed for consultation by human readers, we think it is interesting to explore how readable they are by a machine, the purpose of which is to extract specific knowledge of language in a systematic and efficient way.

Structure of the lexicon for GSF

The organizing principle of lexical knowledge in GSF is that of a lattice with multiple inheritance and default values, principles which have been widely adopted by the knowledge representation community in artificial intelligence [Brachman, Fikes & Levesque 1983]. Inheritance is a principle which allows common information to be stored once, at the highest level possible in the hierarchy, and to be shared by all items which inherit it, unless this is explicitly overridden locally. Pure simple inheritance can be shown to implement, in an efficient manner, a simple form of logical inference. With multiple inheritance an item can inherit from more than one ancestor. Although multiple inheritance can be shown to be problematic, especially in non-monotonic contexts, it can be used to structure the lexicon by effectively eliminating redundancy provided it is used in a disciplined way [Russell, Ballim, Carroll & Warwick-Armstrong 1992]. Finally, default values are a convenient technique for specifying a value which is to be used unless it is explicitly stipulated to be otherwise. The use of default values can also help reduce redundancy. We seek to construct this lattice structure with the help of the computer, which means that the relevant data must be available in a structured machine-readable form. This data is extracted mainly from machine-readable dictionaries.

Knowledge extraction: a bootstrap process

The knowledge we seek to extract from machine-readable dictionaries is essentially of three types, i.e., morphological, syntactic and semantic. This extraction is implemented in steps. The extraction process is assisted by a number of tools which we have implemented.

Browsing the dictionary

Although the analysis of Le dictionnaire de notre temps had been performed in a UNIX environment, we chose to analyze Le Grand Robert on the Macintosh using HyperCard 2.2, which supports the international character sets, since we had a wealth of existing scripts and pre-compiled external commands (XCMDs) at our disposal.
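The organizing principle described above under "Structure of the lexicon for GSF" (a lattice with multiple inheritance, local overriding and default values) can be sketched in a few lines of Python. This is only an illustration: the class `Node`, the node names and the feature names are hypothetical. Searching the parents in a fixed left-to-right order is one possible way of using multiple inheritance in a disciplined manner, not necessarily the discipline adopted in GSF.

```python
_MISSING = object()  # sentinel: distinguishes "nothing found" from a stored None


class Node:
    """A lattice node with local features and any number of parents."""

    def __init__(self, name, parents=(), **features):
        self.name = name
        self.parents = list(parents)
        self.features = features

    def lookup(self, feature, default=None):
        """Local value first, then parents left to right; else the default."""
        if feature in self.features:
            return self.features[feature]
        for parent in self.parents:
            value = parent.lookup(feature, _MISSING)
            if value is not _MISSING:
                return value
        return default


# Hypothetical fragment of a verb lattice.
verb = Node("verb", cat="V", aux="avoir")
transitive = Node("transitive", [verb], subcat="N3")
motion = Node("motion", [verb], focus="final")
abaisser = Node("abaisser", [transitive, motion])       # two parents
s_abaisser = Node("s'abaisser", [abaisser], aux="être")  # local override

print(abaisser.lookup("subcat"), abaisser.lookup("focus"))
print(s_abaisser.lookup("aux"))
```

Information such as the category is stored once on the top node and shared by everything below it, while the pronominal entry overrides a single inherited value locally.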
This HyperCard-based system now forms the core of our workbench for exploring French lexical data. Le Grand Robert on CD-ROM is split into two main files: a definition file (47 Mb) and a citations file containing quotes from French literature (32 Mb). The dictionary file is pre-indexed by the list of word entries (nomenclature). However, both files can be searched on-line in our system as free text, and the fact that they can both be indexed on-the-fly, either in full or partial word mode, turns out to be invaluable in practice. Indeed, we were so pleased with the results that we have also created a HyperCard stack for Le dictionnaire de notre temps. A screen dump of the entry for abaisser can be found in Figure 3.

Word tagging

The nomenclature provides the grammatical function of the words directly; however, it is incomplete in the sense that even some frequently occurring words are missing. We use a list of the most frequently occurring words [Catach 1984] which is consulted before the dictionary. There remains of course the problem of dealing with the inflected forms of a word.

[Screen dump: the dictionary entry for abaisser (I. V. tr., senses 1 to 4; II. V. pron.) shown alongside its conjugation table (indicatif présent, imparfait, passé simple).]

The entry for abaisser
Figure 3

The required knowledge can be extracted from the nomenclature, provided we also have knowledge of French morphology. A simple technique based on suffix stripping was used to test the tagging of word definitions in Le Grand Robert. Automatic tagging by this simple procedure produces a better than 85% success rate. We are however considering acquiring the XEROX Finite-State Morphology Tools and lexicon for French [Karttunen 1993; Karttunen & Beesley 1992], since their success rate is claimed to be much better and they are available off the shelf. The tags assigned by this procedure are often very ambiguous, but fortunately the sentence fragments are short and the left and right words in context can effectively be used, in most cases, to constrain the assigned tags.

Induction of finite-state grammars

Analysis of dictionary entries in ZYZOMYS [Bouchard, Emirkanian & Gros d'Aillon 1991] was performed using a chart parser driven by a hand-crafted context-free grammar. We chose instead to analyze Le Grand Robert using approximate finite-state grammars [Ejerhed 1988] which are automatically induced from a sample of the text to be analyzed. This is an example of induction based on positive data only [Angluin 1980] and special care must be taken in order to prevent the procedure from overgeneralizing. This can be achieved by carefully ordering the data according to the so-called subset principle, in order to control the generalization step. A sample of the text to be parsed is hand-bracketed and a finite-state automaton is induced from the tag information using a modification of the tail clustering technique [Miclet 1980].
The modification consists simply in limiting generalization to within a segment: this greatly reduces the combinatorial explosion. The resulting finite-state automaton is then converted by hand into a finite-state transducer which is used to bracket the rest of the text automatically. The finite-state transducer is a crude syntax analyzer which is used to automatically extract subcategorization information and sortal restrictions from dictionary entries. The simple language style used in dictionary entries can explain why such a technique is effective. This approach seems to fit in nicely with the XEROX lexical tools and the finite-state local grammar approach used in [Silberztein 1993].

A sample training sequence for grammar induction
Figure 4
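The tagging step that feeds this induction, described under "Word tagging" (a frequent-word list consulted first, suffix stripping for the remaining words, and neighbouring words used to constrain ambiguous tags), might look roughly like the Python sketch below. The suffix table, the tag names and the single context rule are all invented for illustration; the paper does not give the actual tables, and only the left context is used here.

```python
# Hypothetical tables; the real suffix list and tag set are not given in the text.
FREQUENT = {"un": {"DET"}, "de": {"PREP"}, "à": {"PREP"}}
SUFFIX_TAGS = [
    ("ement", {"ADV", "N"}),
    ("tion", {"N"}),
    ("eux", {"ADJ"}),
    ("er", {"V", "N"}),
    ("ir", {"V"}),
]


def candidate_tags(word):
    """Frequent-word list first, then the longest matching suffix, else noun."""
    w = word.lower()
    if w in FREQUENT:
        return set(FREQUENT[w])
    for suffix, tags in sorted(SUFFIX_TAGS, key=lambda entry: -len(entry[0])):
        if w.endswith(suffix) and len(w) > len(suffix):
            return set(tags)
    return {"N"}


def constrain(tags, left_tags):
    """Context rule: directly after an unambiguous determiner, drop the verb reading."""
    if left_tags == {"DET"} and len(tags) > 1:
        return (tags - {"V"}) or tags
    return tags


def tag(words):
    result, left = [], set()
    for word in words:
        tags = constrain(candidate_tags(word), left)
        result.append((word, tags))
        left = tags
    return result


print(tag(["abaisser", "un", "store"]))
print(tag(["un", "diner"]))
```

Suffix stripping alone leaves "diner" ambiguous between verb and noun; the determiner to its left removes the verb reading, which is the kind of short-context disambiguation the text relies on.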

Knowledge extracted

Subcategorization information

The transitive/intransitive distinction is systematically recorded in the nomenclature of Le Grand Robert. However, there appears to be no systematic way in which the dictionary records case marking information: indeed, sometimes the information is found as part of the word sense definition, sometimes as part of the example. Apart from this ambiguity, case markings which are recorded are easily extracted from either of these sources.

Sortal restrictions

The construction of the sort hierarchy is another example of a bootstrap process. A shallow lattice of sorts patterned on [Alshawi et al. 1992] is used as a first approximation. The analysis of word sense definitions and of the examples is used to extract the nominal head of the noun phrases which are arguments of a verb, and these nouns are then used to refine the lattice. This step currently requires user assistance since Le Grand Robert is a language dictionary and lacks the world knowledge coverage of an encyclopedic dictionary. We are considering using ZYZOMYS to help reduce the amount of user intervention required.

A database of motion verbs

We decided to construct a lexical database of verbs which could be analyzed systematically with the help of the computer. This lexical database is based on information extracted from the machine-readable dictionaries as well as information extracted from the tables of the lexique-grammaire project.
The lexique-grammaire project of the LADL of the University Paris VII [Boons, Guillet & Leclère 1976; Gross 1975; Guillet & Leclère 1992; Leclère 1989] has over the years produced a systematic classification of French verbs in the form of tables of their syntactic and semantic properties [Leclère 1990]. Also available is a list of simple sentences which exemplify the classification scheme used in the lexique-grammaire [Guillet 1990]. We are currently analyzing a subset of verb entries, those of verbs involving motion, in order to study and eventually exploit in a systematic way the syntactic and semantic regularities and interrelationships between verbs in this restricted domain. The actual definition of this class is rather interesting, since motion verb entries are not marked as such in Le Grand Robert. From an initial list of known motion verbs (which incidentally was found in Le Grand Robert in the entry for mouvement) the list is extended by analyzing definitions in the dictionary. Although word production by affixation is not as regular in French as it can be in other languages, a number of motion verbs can be identified directly in the nomenclature using affix analysis: for example, from the root verb porter the following verbs are produced by prefixation: apporter, colporter, déporter, exporter, héliporter, etc. The list of synonyms, which can readily be extracted from Le Grand Robert and Le dictionnaire de notre temps, is used to enrich the initial list. The properties of the motion verbs are organized along many dimensions and we are investigating how they can be represented as a lattice with multiple inheritance.

Each entry in the lexical database which describes a verb has a number of fields which include the classes assigned to it in the work of the LADL, subcategorization information, thematic role assignment, focus, list of prepositions, lists of synonyms and antonyms and finally example sentences of uses of the verbs. This database has approximately 550 entries at the current time. Figure 5 shows a typical entry in this database.

VERBE=    abaisser
CLASSES=  {38L}
SOUSCAT=  <SN:x, SN:y, (SP:z), (SP:w)>
ROLES=    [x=agent, y=theme, z=source, w:
FOCUS=    final
PRÉP=     INIT= (de)  MED= U  FINAL= (à)
SYN=      {baisser, biller, abattre}
ANT=      {élever, relever, soulever, surélever, exhausser, hausser}
EX=       {Voulez-vous abaisser la vitre ? Abaisser qqch. en inclinant, en penchant. Abaisser un bras, la tête. Abaisser la pâte, de la pâte avec un rouleau à pâtisserie, l'aplatir en couche mince.}

An entry in the motion verb database
Figure 5

Conclusion

The design and implementation of a wide-coverage lexicon is a considerable task which requires the availability of many resources: a comprehensive lexical database, computer tools and manpower. We hope that this investment can in some small way contribute to the advancement of more effective automatic natural language processing.
Acknowledgment

We wish to thank all the research assistants involved in the GSF project and in particular André CLOUTIER, Simon PLOUFFE, Benoit ROBICHAUD and Caroline VIEL, who were more closely involved with the lexical aspects of the research.

References

Alshawi, H. & D. Carter. (1992). Sortal Restrictions. In H. Alshawi (Ed.), The Core Language Engine (pp. 173-185). Cambridge MA: The MIT Press.
Angluin, D. (1980). Inductive Inference of Formal Languages from Positive Data. Information and Control, 45, pp. 117-135.
Boons, J.-P., A. Guillet & C. Leclère. (1976). La structure des phrases simples en français. I: Phrases intransitives. Genève: Droz.
Bouchard, L., L. Emirkanian & J.-Y. Morin. (1992). Computational Grammar as Knowledge Representation. In Proc. Sixth International Conference on Systems Research Informatics and Cybernetics (Volume II), Baden-Baden, International Institute for Advanced Studies in System Research and Cybernetics, pp. 121-132.
Bouchard, L. H. & L. Emirkanian. (1992). An Exploratory Environment for the Study of the Syntax and Semantics of Natural Language. Fourth Symposium on Logic and Language, Budapest.
Bouchard, L. H., L. Emirkanian & F. Gros d'Aillon. (1991). Extracting French Morphological and Syntactic Information from a Machine-Readable Dictionary. In Computational Lexicography, Balatonfüred, Research Institute for Linguistics, Hungarian Academy of Science, pp. 9-24.
Brachman, R. J., R. E. Fikes & H. J. Levesque. (1983). KRYPTON: A Functional Approach to Knowledge Representation. Research Report No. 16, Fairchild Laboratory for Artificial Intelligence Research.
Catach, N. (1984). Les listes orthographiques de base du français (LOB): les mots les plus fréquents et leurs formes fléchies les plus fréquentes. Paris: Nathan-Recherche, 156 pp.
Ejerhed, E. I. (1988). Finding Clauses in Unrestricted Text by Finitary and Stochastic Methods. In Second Conference on Applied Natural Language Processing, Austin TX, pp. 219-227.
Emirkanian, L. & L. H. Bouchard. (1992). Approche computationnelle aux phénomènes morphologiques et syntaxiques du français. Rapport de recherche, UQAM.
Gazdar, G., E. Klein, G. Pullum & I. Sag. (1985). Generalized Phrase Structure Grammar. Cambridge MA: Harvard University Press, 276 pp.
GIREIL. (1993). La sous-catégorisation et la cliticisation. Rapport de recherche, Université du Québec à Montréal.
Gross, M. (1975). Méthodes en syntaxe. Paris: Hermann.
Guillet, A. (1990). Phrases simples illustrant les tables de verbes du lexique-grammaire. Diskette, personal communication.
Guillet, A. & C. Leclère. (1992). La structure des phrases simples en français: constructions transitives locatives. Genève: Droz, 445 pp.
Karttunen, L. (1993). Finite-State Lexicon Compiler. Research Report No. ISTL-NLTT-1993-04-02, XEROX Palo Alto Research Center.
Karttunen, L. & K. R. Beesley. (1992). Two-Level Rule Compiler. Research Report No. ISTL-92-2, XEROX Palo Alto Research Center.
Leclère, C. (1989). Les mots ont-ils une grammaire? Le Français dans le monde, (numéro spécial intitulé ...Et la grammaire), pp. 40-49.
Leclère, C. (1990). Organisation du lexique-grammaire des verbes français. Langue française, 87, pp. 112-122.
Levin, B. (1993). English Verb Classes and Alternations. Chicago: Chicago University Press, 348 pp.
Miclet, L. (1980). Regular inference with a tail clustering method. IEEE Trans. on Systems, Man, Cybernetics, 10, pp. 737-743.
Russell, G., A. Ballim, J. Carroll & S. Warwick-Armstrong. (1992). A Practical Approach to Multiple Default Inheritance for Unification-Based Lexicons. Computational Linguistics, 18(3), pp. 311-337.
Silberztein, M. (1993). Dictionnaires électroniques et analyse automatique de textes: le système INTEX. Paris: Masson, 233 pp.

A Modular and Flexible Architecture for an Integrated Corpus Query System

Oliver CHRIST

Abstract

This paper describes the architecture of an integrated and extensible corpus query system developed at the University of Stuttgart and gives examples of some of the modules realized within this architecture. The modules form the core of a corpus workbench. Within the proposed architecture, information required for the evaluation of queries may be derived from different knowledge sources (the corpus text, databases, on-line thesauri) and by different means: either through direct lookup in a database or by calling external tools which may infer the necessary information at the time of query evaluation. The information available and the method of information access can be stated declaratively and individually for each corpus, leading to a flexible, extensible and modular corpus workbench.

1. Introduction

With the availability of tagged and annotated text corpora, corpora cannot be regarded any more as mere sequences of words. Additionally, more and more linguistic knowledge bases become available and provide additional knowledge about words (MRDs, on-line thesauri like WordNet [Miller et al., 1993], morphological knowledge bases like the CELEX database [Baayen et al., 1993]). When using and querying corpora, all this knowledge should be usable within a corpus query system in order to enable the lexicographer or linguist to express the linguistic properties of the examined phenomenon as precisely as possible (in order to reduce the amount of data which has to be browsed manually), no matter how the knowledge necessary to evaluate the query is stored or by which means it is derived. When a corpus is thus regarded as a structured object composed of several different knowledge sources, a problem arises because different knowledge sources require possibly different access methods.
Furthermore, for many types of information, it is useful not to store the information physically at all but to compute it at the time of query evaluation. For example, bigram tables for large corpora might grow too big to be held online. Automatically assigned part-of-speech tags, on the other hand, might either be stored in a database when they are regarded as “stable” or might be computed at the time of query evaluation by a tagging tool.

Additionally, a corpus query system need not necessarily be used only by human users: a parser might consult a corpus annotated with parse trees (treebank) to disambiguate between several syntactic structures by looking up similar, but disambiguated syntactic patterns; a generator might use a semantically annotated corpus to filter lexical preferences.

These different knowledge sources, access strategies and usage situations are best supported by a hierarchical, modularized system architecture where the single modules can be combined in different ways to adapt the system to various usage situations. We therefore designed and implemented the following architecture: To abstract as much as possible from the different storage properties, the data access was split between a “logical data access layer”, which is independent of data access methods and storage properties, and a “physical data access layer”, which is the data-oriented interface to the knowledge sources and which is responsible for data access and network-based corpus data interchange. The adaptation of the system to different usage situations is achieved through different interfaces to the logical access layer, but tools may also request data from the physical layer directly. A general-purpose query language, which treats the whole corpus as a structured knowledge source and allows the user to express queries involving all knowledge sources declared for a specific corpus (no matter how the knowledge is accessed physically), was added to the logical access layer. This architecture is sketched in figure 1.

[Diagram; its top level is labelled “Applications, Tools”.]
Figure 1: The modular architecture of a flexible corpus query system

In the following sections, these modules are described in more detail. Section 2 outlines the physical layer. In section 3, the logical layer and the query language are described.
One usage situation is the interactive use of the query system. For this purpose, presentation and interaction tools have been built which are explained in section 4. In section 5, some directions of our further work are described. The paper ends with a short conclusion in section 6.
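The split between a logical and a physical data access layer can be illustrated with a small Python sketch. It is not the interface of the system described here: the class and method names are invented, and the "tagger" is a toy stand-in for an external tool. The point is only that a client of the logical layer asks for attribute values without knowing whether they are stored or computed at request time.

```python
from typing import Callable, Dict, List


class PhysicalLayer:
    """Uniform access to attribute values, whether stored or computed on demand."""

    def __init__(self, size: int) -> None:
        self.size = size
        self._stored: Dict[str, List[str]] = {}
        self._computed: Dict[str, Callable[["PhysicalLayer", int], str]] = {}

    def add_stored(self, name: str, values: List[str]) -> None:
        if len(values) != self.size:
            raise ValueError("one value per corpus position is required")
        self._stored[name] = values

    def add_computed(self, name: str, function: Callable[["PhysicalLayer", int], str]) -> None:
        self._computed[name] = function

    def value(self, name: str, position: int) -> str:
        if name in self._stored:
            return self._stored[name][position]
        return self._computed[name](self, position)  # evaluated at request time


class LogicalLayer:
    """Works with attribute names only; storage details stay hidden below."""

    def __init__(self, physical: PhysicalLayer) -> None:
        self.physical = physical

    def positions_where(self, name: str, wanted: str) -> List[int]:
        return [p for p in range(self.physical.size)
                if self.physical.value(name, p) == wanted]


def toy_tagger(layer: PhysicalLayer, position: int) -> str:
    """Hypothetical stand-in for an external tagging tool."""
    word = layer.value("word", position)
    if word.isdigit():
        return "NUM"
    return "N" if word[0].isupper() else "X"


physical = PhysicalLayer(size=5)
physical.add_stored("word", ["Pierre", "Vinken", "61", "years", "old"])
physical.add_computed("pos", toy_tagger)
logical = LogicalLayer(physical)
print(logical.positions_where("pos", "N"))
```

Swapping the computed `pos` attribute for a stored one would not change any code in `LogicalLayer`, which is the property the layered design aims for.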

2. The physical layer

The task of the physical layer is to provide a uniform interface between the logical layer and the files, databases or tools which “store” the information the corpus is built of. The physical layer therefore encapsulates knowledge about file and tool access and provides an interface which is independent of the storage device and the information type (static vs. dynamic). Due to its proximity to the physical corpus representation, the physical layer also provides methods for corpus management, bigram table creation and management, corpus preparation and indexing, frequency counting etc. Currently, the physical layer supports the following types of corpus annotations:

• positional attributes are attributes where a (string) value is assigned to (almost) every corpus position. The sequence of words the corpus text is built of is one example of a positional attribute. Other examples are part-of-speech tags and base forms (see figure 2). An arbitrary number of positional attributes can be assigned to a corpus;

pos:       N       N       IP   NUM  N      ADJ  ...  N         IP
word:      Pierre  Vinken  .    61   years  old  ...  blessing  .
position:  0       1       2    3    4      5    ...  n-2       n-1

Figure 2: Positional attributes: Values are associated with corpus positions

• structural attributes are attributes which capture information about sentence boundaries, article boundaries etc. Currently, recursive structures (like NPs with embedded NPs) cannot be represented. The number of structural attributes is not limited;

• bigram tables are related to one of the positional attributes of a corpus and hold information about the absolute number of adjacent occurrences of two values of the attribute within a given window size¹. Note that, for example, both word bigrams as well as part-of-speech-tag bigrams can be represented;

• alignment information can be added to a pair of parallel corpora (which are, roughly speaking, translations of each other) to represent information about corresponding (aligned) ranges (sentences, for example). As in the case of structural attributes, we cannot represent recursive alignments or alignments on more than one level (for example, information about aligned words additionally to aligned sentences);

• finally, dynamic attributes are attributes the values of which are not stored physically, but which are computed at query evaluation time by calling external tools, similar to a function call. An arbitrary number of arguments can be declared for a dynamic attribute. When the value of a dynamic attribute is requested, the argument list is filled and an external tool is called. The external tool, then, returns the computed value, which is either a string or an integer value. Neither indices nor bigram tables can be built for dynamic attributes.

¹ We use the term attribute value to denote one element of the list of distinct strings which occur as the values of a positional attribute. In the case of the corpus text, this is the list of distinct words which occur in the corpus.

A corpus has to be prepared in a special way before its data can be used by the query system. This preparation step involves character set normalization, tokenization, sentence boundary detection (if required), and, in case of annotated corpora, the partitioning of the different positional attributes (for example, corpus text and part-of-speech tags) into several files. Then, a special one-word-per-line format is produced which is used as input for the construction of the internal corpus representation and the indices². The corpus text itself is not needed any more after transforming it into the internal representation. Details of the internal corpus representation and the encoding steps are described in [Christ, 1994].

After a corpus is encoded, it must be registered. This is achieved through a registry file which declares the attributes and their types assigned to a corpus. All corpus accessing tools access a corpus only via a symbolic name, which is the file name of the registry file. The tools (and the users) therefore need not know where a corpus is stored in the file system in order to access the data. All relevant information is captured in the registry file.

    NAME "Hansard corpus (english part)"
    ID hansard-e
    HOME /corpora/encoded/hansard-e
    ATTRIBUTE word
    ATTRIBUTE pos
    DYNAMIC ishuman(STRING): INT "/corpora/utils/cmd/sn-hypen '$1' human"
    ALIGNED hansard-f    # the french part

Figure 3: A small sample registry file

A sample registry file may look as illustrated in figure 3. It declares a corpus hansard-e and the directory in which the data can be found. Two positional attributes are assigned to this corpus, word and pos.
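To make the declarative nature of the registry concrete, here is a minimal Python sketch that reads the file of figure 3 into a dictionary. The parsing rules are inferred from that one example only (keyword first, shell-style quoting, `#` comments); they are an assumption, not the system's actual registry grammar, and the function name `parse_registry` is hypothetical.

```python
import re
import shlex

# Text of figure 3, as reconstructed from the paper.
REGISTRY = """\
NAME "Hansard corpus (english part)"
ID hansard-e
HOME /corpora/encoded/hansard-e
ATTRIBUTE word
ATTRIBUTE pos
DYNAMIC ishuman(STRING): INT "/corpora/utils/cmd/sn-hypen '$1' human"
ALIGNED hansard-f    # the french part
"""


def parse_registry(text):
    """Collect the declarations of a registry file into a dictionary."""
    corpus = {"attributes": [], "dynamic": {}, "aligned": []}
    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)  # handles quotes and '#' comments
        if not tokens:
            continue
        key, args = tokens[0], tokens[1:]
        if key in ("NAME", "ID", "HOME"):
            corpus[key.lower()] = args[0]
        elif key == "ATTRIBUTE":
            corpus["attributes"].append(args[0])
        elif key == "DYNAMIC":
            # e.g. ishuman(STRING): INT "<shell command>"
            signature = re.fullmatch(r"(\w+)\(([^)]*)\):?", args[0])
            corpus["dynamic"][signature.group(1)] = {
                "args": [a.strip() for a in signature.group(2).split(",") if a.strip()],
                "returns": args[1],
                "command": args[2],
            }
        elif key == "ALIGNED":
            corpus["aligned"].append(args[0])
    return corpus


registry = parse_registry(REGISTRY)
print(registry["id"], registry["attributes"], sorted(registry["dynamic"]))
```

A tool that receives only the symbolic name `hansard-e` can then find the data directory and the declared attributes from this structure.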
Additionally, the dynamic attribute ishuman is declared, which takes a string as an argument and returns an integer value (where “0” means “no” and “1” means “yes”). Upon query evaluation, a shell command is executed which consults WordNet to evaluate whether the argument string may denote a “human object”. The corpus is aligned to another corpus, hansard-f.

A corpus can be extended after registration. Positional attributes (as well as all other types of attributes) can be added to an existing corpus without need for reindexing existing data.

For testing purposes, we have implemented a TCP/IP protocol for network-based exchange of corpus data within the physical layer. Through this protocol, it is possible to declare that a given attribute of a corpus (or the whole corpus) is stored on a remote computer. Upon access to remotely stored data, a network connection is built up, access authorization is verified and, if access is granted, the requested data is returned. Through this exchange protocol, it is possible to split corpus data between several computers in the internet. This is useful, for example, to share corpus data between several computers or to run query tools on computers which have too little memory or hard disk space to hold large corpora (although data access is slowed down a lot by remote connections). The remote status of an attribute is hidden within the physical layer, that is, clients of the physical layer do not need to handle remote corpora differently from local data access.

One of the most important “clients” of the physical layer is the logical layer, which is described in the following section. Other clients are tools which do not need to access a corpus through a query language (for example, word list generators or tools which statistically evaluate frequency or bigram counts).

² The internal corpus representation we use is inspired by an (unfortunately) unpublished draft paper by Ken W. Church, “A Set of Unix Tools for Processing Large Text Corpora”.

3. The logical layer and the query language

The logical layer uses the information provided by the physical layer to parse and evaluate corpus queries given in the query language described below³. Within this layer, the set of positional attributes defined on a corpus can be seen as a sequence of entities referred to by corpus positions. These entities may have several attributes, for example the attribute Word for the “character string” found at a given corpus position, Pos for the part-of-speech tag assigned to that word, Root for the base form of that word, etc. The query language allows one to find sequences of entities where a number of conditions over such attribute-value pairs hold.

Conditions are boolean expressions which involve attribute-value tests, where all positional attributes defined on a corpus can be used. Such a condition may look as follows:

(1)  [word="chair.*" & pos!="N.*"]
When this condition is evaluated against a given corpus position, it is tested whether the value of the word attribute at that corpus position matches (=) the regular expression "chair.*" and the value of the pos attribute does not match (!=) the regular expression "N.*"⁴.

A query consists of a regular expression over such conditions. In addition to concatenation of conditions, the other standard regular expression operators are available, like “*” for an arbitrary number of repetitions of the preceding regular expression, “+” for at least one repetition, “?” for optionality, and “|” for disjunction. Parentheses can be used for grouping of expressions. [] is a “wildcard” which matches every corpus position. Additionally, the interval operator {n,m} is supported, which denotes at least n, but at most m repetitions of the preceding regular expression⁵. Thus, regular expressions are used on the level of attribute values as well as on the level of conditions. Example (1) is already a query, since it is a one-element regular expression.

³ Currently, the logical layer only supports positional, structural and dynamic attributes; access to bigram and alignment attributes has yet to be implemented.
⁴ We use the POSIX EGREP syntax for regular expressions. In this standard, the dot matches every character and the star matches any (possibly empty) sequence of the last character or regular (sub-)expression. A common error is to write "H*" when all strings beginning with a capital H should be matched, but the regular expression "H*" denotes all strings which entirely consist of a sequence of capital Hs.
⁵ When m is omitted in such an interval, exactly n repetitions are matched.
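The evaluation of a condition such as (1) at a single corpus position can be sketched in Python as follows. Python's `re` module stands in for POSIX EGREP here, which is an assumption: the two syntaxes differ in general, although the simple patterns used in these examples mean the same in both. `re.fullmatch` is used because, as the note on "H*" implies, a pattern has to match the whole attribute value. The function name `holds` and the triple encoding of tests are invented for this example.

```python
import re


def holds(token, tests):
    """True if every (attribute, operator, pattern) test holds for the token.

    "=" requires the whole attribute value to match the pattern,
    "!=" requires that it does not.
    """
    for attribute, operator, pattern in tests:
        matched = re.fullmatch(pattern, token[attribute]) is not None
        if matched != (operator == "="):
            return False
    return True


# Condition (1): [word="chair.*" & pos!="N.*"]
CONDITION_1 = [("word", "=", "chair.*"), ("pos", "!=", "N.*")]

print(holds({"word": "chaired", "pos": "VBD"}, CONDITION_1))
print(holds({"word": "chairman", "pos": "NN"}, CONDITION_1))

# The common error described in the note on regular expression syntax:
print(re.fullmatch("H*", "Hansard") is None)
print(re.fullmatch("H.*", "Hansard") is not None)
```

The verb form "chaired" satisfies the condition, whereas the noun "chairman" is rejected by the negated tag test.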

When a query is evaluated, the query interpreter computes all matches of the regular expression in the corpus. A match of a query is a “substring” of the corpus, that is, a corpus interval whose boundaries are the beginning and ending corpus positions of the match. Since the regular expressions used may, in general, contain repetition operators, these intervals can differ in length. The result of a whole query is the set of matches, that is, a set of corpus intervals.

The following examples illustrate some aspects of the query language. Query (2)

(2)  [pos="JJ.*"] [pos="N.*"] "and|or" [pos="N.*"] [pos="IN" & word!="that"]

returns all corpus intervals which are (adjacent) sequences of an adjective (JJ, JJR, JJS)⁶, a noun (NN, NNS), a conjunction, another noun and finally a preposition or subordinating conjunction (IN) which must not be that (in the corpus, that was often tagged as IN, which should be excluded in this query)⁷. When only the word attribute is accessed in a condition (together with the equality operator), the brackets can be omitted. So "and|or" is just an abbreviation for the complete condition [word="and|or"]⁸.

Dynamic attributes can be accessed in a simple way:

(3)  "kill.*" []? [pos="N.*" & ishuman(word)]

As defined in the sample registry file in figure 3, the dynamic attribute ishuman requires a string argument and returns an integer value which internally is interpreted as “Yes” if the value is 1, and as “No” if the value is 0. In query (3), ishuman is called with the value of the word attribute of the noun. When the query is evaluated, all matches are computed which are a sequence of a word beginning with kill, followed by an optional, unspecified word (for example, by), and finally followed by a noun for which the consultation of WordNet gives reason to assume that it may denote a human.
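How a query such as (3) yields a set of corpus intervals can be sketched in Python as follows. This is only an illustration under stated assumptions, not the query interpreter described in the paper: the corpus is a hypothetical list of dictionaries, the query is hand-translated into a list of (predicate, optional) pairs instead of being parsed, and `HUMAN_NOUNS` is a made-up lookup table standing in for the WordNet consultation.

```python
import re

# Hypothetical stand-in for the WordNet lookup behind the dynamic attribute.
HUMAN_NOUNS = {"people", "soldier", "man", "woman"}

def ishuman(word):
    """Return 1 ("Yes") or 0 ("No"), mirroring the integer result in the text."""
    return 1 if word.lower() in HUMAN_NOUNS else 0

def full(pattern, value):
    return re.fullmatch(pattern, value) is not None

# Query (3): "kill.*" []? [pos="N.*" & ishuman(word)]
# Each element is (predicate over a corpus position, is_optional).
QUERY_3 = [
    (lambda t: full("kill.*", t["word"]), False),
    (lambda t: True, True),  # [] wildcard, made optional by "?"
    (lambda t: full("N.*", t["pos"]) and ishuman(t["word"]) == 1, False),
]

def match_ends(corpus, start, query):
    """All exclusive end positions of matches of `query` beginning at `start`."""
    ends = {start}
    for predicate, optional in query:
        following = set()
        for pos in ends:
            if optional:
                following.add(pos)       # skip the optional element
            if pos < len(corpus) and predicate(corpus[pos]):
                following.add(pos + 1)   # consume one corpus position
        ends = following
    return ends

def evaluate(corpus, query):
    """Return the sorted list of matching corpus intervals (first, last)."""
    result = set()
    for start in range(len(corpus)):
        for end in match_ends(corpus, start, query):
            if end > start:
                result.add((start, end - 1))
    return sorted(result)

words = ["Lions", "kill", "people", ".", "He", "killed", "the", "soldier", "."]
tags = ["NNS", "VBP", "NNS", ".", "PRP", "VBD", "DT", "NN", "."]
corpus = [{"word": w, "pos": p} for w, p in zip(words, tags)]

print(evaluate(corpus, QUERY_3))  # [(1, 2), (5, 7)]
```

The two intervals differ in length because the optional wildcard is skipped in the first match and consumed in the second, which is the behaviour the text describes for repetition and optionality operators.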
A predefined dynamic attribute is “f”, which returns the absolute frequency of its argument in the corpus. To search for the “most common human beings who are loved”, the following query could be formulated:

(4)  "love.*" []? [pos="N.*" & f(word)>10 & ishuman(word)];

Structural attributes, like sentence boundaries, can be accessed by SGML-like tags:

(5)  [pos="N.*"] [] <s> "She"

This query returns all corpus intervals where a noun, followed by an arbitrary item (which is to match the full stop or other sentence delimiter), occurs in front of a sentence boundary, followed by the word "She".

⁶ The corpus on which query (2) was run is a part of the Penn Treebank, which has been tagged with the Penn Treebank POS tagset. See [Marcus et al., 1993] for an explanation of the tags.

⁷ Query (2) serves to filter concordances which illustrate the problems of adjective scope and PP-attachment within conjoined noun phrases. For a few matching lines, see figure 4.

⁸ The condition "and|or" could as well be expressed as ([word="and"] | [word="or"]), or, abbreviated, as ("and" | "or"). Whereas the latter two expressions use disjunction on the level of conditions (and have to be grouped by parentheses), the expression used in query (2) uses disjunction on the level of attribute values.

Structural attributes like sentence or article boundaries can also be used to limit the search space when repetitions are used. For example, the query

(6)  "president" []* "said"

would search for the two strings "president" and "said" separated by an arbitrary number of unspecified items. In general, only those matches which lie entirely within one sentence will be of interest. This can be achieved by using the within construct:

(7)  "president" []* "said" within s;

Now, the whole match has to lie within the boundaries of one sentence⁹. All structural attributes defined on a corpus can be used as boundary markers (like <s>) or in the within construct. For example, when the structural attribute article is defined on a newspaper corpus, within article can be used in queries as well.

An additional, powerful construct of the query language is the label reference, which can be used instead of an attribute value. A condition can be labelled by preceding it with a label name and a colon (“a:”), as in (8):

(8)  a:[pos="N.*"] ...

Then, in a subsequent condition in the same query, an agreement of attribute values can be expressed:

(9)  a:[pos="N.*"] []* [pos="PRP" & num=a.num] within s;

Here, the value of the num attribute of the personal pronoun (PRP) must be the same as the value of the num attribute at the position the label a refers to, that is, the value of the number attribute of the noun. The whole match must lie within one sentence. Another example which illustrates the power of label references is the following query:

(10)  a:[pos="N.*"] ([]* [word=a.word]){2} within s;

This query returns all intervals where the same noun occurs at least three times within the same sentence.

The query language implements some additional constructs, which cannot be described in detail here. For a full description of the query language, its power and a comparison with other corpus query languages, see [Schulze, 1994]. Query results can be saved in files and reloaded and reviewed in later sessions.
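The combination of a label reference and the within construct in query (9) can be sketched in Python as below. This is a simplified illustration, not the paper's evaluation strategy: the corpus representation, the `sent` field used to model the structural attribute s, and the values "sg" and "pl" for the num attribute are all assumptions, and the query is hard-coded as a nested loop instead of being compiled from the query language.

```python
def agreement_matches(corpus):
    """Sketch of: a:[pos="N.*"] []* [pos="PRP" & num=a.num] within s;

    Returns intervals (first, last) of corpus positions.
    """
    matches = []
    for i, a in enumerate(corpus):          # position the label a refers to
        if not a["pos"].startswith("N"):
            continue
        for j in range(i + 1, len(corpus)):
            b = corpus[j]
            if b["sent"] != a["sent"]:
                break                        # "within s": stay inside one sentence
            if b["pos"] == "PRP" and b["num"] == a["num"]:
                matches.append((i, j))       # num=a.num: agreement with the noun
    return matches

# Two hypothetical sentences: "The woman said she left ." / "The dogs saw him ."
rows = [
    ("The", "DT", "-", 0), ("woman", "NN", "sg", 0), ("said", "VBD", "-", 0),
    ("she", "PRP", "sg", 0), ("left", "VBD", "-", 0), (".", ".", "-", 0),
    ("The", "DT", "-", 1), ("dogs", "NNS", "pl", 1), ("saw", "VBD", "-", 1),
    ("him", "PRP", "sg", 1), (".", ".", "-", 1),
]
corpus = [dict(zip(("word", "pos", "num", "sent"), row)) for row in rows]

print(agreement_matches(corpus))  # [(1, 3)]
```

In the second sentence the pronoun is singular while the labelled noun is plural, so no interval is returned; without the sentence check, a noun could also pair with a pronoun in a later sentence.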
The logical layer supports subsequent queries on the result of an earlier query, which can greatly reduce the search space and therefore improve efficiency. For example, in a newspaper corpus, one can first extract all articles in which a certain syntactic construction is used. Afterwards, this “subcorpus” of articles can be analysed by subsequent queries running only on a part of the original corpus. Additionally, set operators are supported, that is, query results can be produced not only by queries, but also by combining the results of earlier queries with union, intersection and difference operators. Through this mechanism, it is possible, for example, to intersect the set of sentences generated by a first query with the set of sentences of a second query to get all sentences where the conditions expressed in both queries hold. Although the same result could possibly be produced by a single query as well, set operators are more user-friendly.

⁹ The number of sentences which may “surround” the matched interval can be expressed with a number following the within keyword. “within 2 s” therefore allows a two-sentence distance.

In our eyes, searching on the results produced by

earlier queries and the possibility to combine query results into new “subcorpora” support a successive refinement of queries with a gain in efficiency, and allow a stepwise approach to the solution of complex problems. The result of a query can be postprocessed by different tools for presentation, frequency counting, additional filtering, etc. The following section describes two simple presentation tools.

4. Presentation modules

Figure 4: The Xkwic presentation module

A presentation module has the task of displaying the information returned by a query, suitably formatted for a human user. One instance of such a module is a program called Xkwic, which is an X Window System based graphical user interface for displaying key word in context (KWIC) concordances. Xkwic also provides an input area for typing in queries to the logical layer, thus being a general and comfortable interface for corpus work. Figure 4 shows Xkwic after processing the query displayed in the topmost window.
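What a key word in context display does with a query result can be sketched in a few lines of Python. This is a plain-text illustration only and says nothing about how Xkwic itself is implemented: the function name `kwic_lines`, the angle-bracket marking of the match and the fixed left-column width of 25 characters are all assumptions.

```python
def kwic_lines(words, intervals, context=3):
    """Format each corpus interval (first, last) as one concordance line."""
    lines = []
    for first, last in intervals:
        left = " ".join(words[max(0, first - context):first])
        key = " ".join(words[first:last + 1])
        right = " ".join(words[last + 1:last + 1 + context])
        # Right-align the left context so that the matches line up in a column.
        lines.append(f"{left:>25} <{key}> {right}")
    return lines

words = ["Lions", "kill", "people", ".", "He", "killed", "the", "soldier", "."]
for line in kwic_lines(words, [(1, 2), (5, 7)], context=2):
    print(line)
```

Because the input is just a list of corpus intervals, the same formatter could be applied to the result of any query, saved result or combined “subcorpus”.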