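Kotoba no Mori (ことばの森): Japanese Language Processing Research Workshop

JMWEL: A Large-Scale Lexicon of Japanese Multiword Expressions, Toward Next-Generation Japanese Language Processing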

Overview and Coverage

The Japanese Multiword Expression Lexicon (JMWEL) is a comprehensive database with a rich set of grammatical attributes, fine-tuned for phrase-based Natural Language Processing (NLP) applications such as machine translation (MT) and information retrieval (IR), as well as for morphological and syntactic analysis of a wide range of Japanese documents. It contains about 160,000 entry lemmas (counted at the type level), covering almost all kinds of linguistically idiosyncratic but commonly used Japanese phrases, i.e., idioms, quasi-idioms, collocations, quasi-collocations, clichés, quasi-clichés, proverbs, and old sayings.

JMWEL consists of eight basic sub-lexicons, one for each grammatical function: nominal, verbal, adjectival, adjective-verbal, adverbial, adnominal, connective, and functional expressions. In addition, JMWEL has five topic-based sub-lexicons: standard idioms, onomatopoeic expressions, adverbs/sayings/clichés, syntactically ill-formed phrases, and four-character idioms (see Configuration).

Notable features of JMWEL are summarized as follows:

The following table shows the fields for basic information that are common to all sub-lexicons. Some sub-lexicons have a few additional fields specific to them.

Description of Selected Fields

Lemma: The expression written in hiragana, without space markers.
Segmentation: The lemma separated by space markers into a string of morphemes.
Notational variants: For each morpheme, all of its possible written forms in other scripts, i.e., kanji or katakana.
Syntactic function: A symbol denoting the syntactic function of the expression. It corresponds to the "part of speech" of a single word.
Syntactic structure: The dependency structure (equivalently, the parse tree) of the expression. Constituents that can be internally modified are marked.
Left context condition: Specific words or phrases required in the left context of the expression in a sentence, if any.
Right context condition: Specific words or phrases required in the right context of the expression in a sentence, if any.
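
To make the field descriptions concrete, here is a minimal Python sketch of how one entry might be represented. The class name, attribute names, the symbol "V", and the dependency encoding are all assumptions made for illustration; they are not the actual JMWEL file format. The example entry is the common idiom 足を洗う (literally "wash one's feet").

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Optional


@dataclass
class MweEntry:
    """Hypothetical in-memory form of one lexicon entry (not the real JMWEL schema)."""
    lemma: str                       # hiragana, no space markers
    segmentation: str                # morphemes separated by spaces
    notational_variants: Dict[str, List[str]] = field(default_factory=dict)
    syntactic_function: str = ""     # symbol is illustrative only
    dependency_heads: List[int] = field(default_factory=list)  # -1 marks the root
    left_context: Optional[str] = None
    right_context: Optional[str] = None

    def morphemes(self) -> List[str]:
        return self.segmentation.split()

    def is_consistent(self) -> bool:
        # The segmentation must spell out exactly the lemma.
        return "".join(self.morphemes()) == self.lemma

    def surface_forms(self) -> List[str]:
        # Combine the hiragana form and every listed variant of each morpheme.
        options = [[m] + self.notational_variants.get(m, []) for m in self.morphemes()]
        return ["".join(combo) for combo in product(*options)]


entry = MweEntry(
    lemma="あしをあらう",
    segmentation="あし を あらう",
    notational_variants={"あし": ["足"], "あらう": ["洗う"]},
    syntactic_function="V",          # assumed symbol for a verbal expression
    dependency_heads=[1, 2, -1],     # assumed encoding: each morpheme's head index
)
forms = entry.surface_forms()
```

Expanding the notational variants this way lets a lookup match an expression regardless of which script each morpheme is written in.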

Application

The primary application of JMWEL is to annotate idiosyncratic word strings, i.e., multiword expressions (MWEs), on the morpho-syntactic structure of an input Japanese sentence. The structure of the whole sentence should be consistent with the internal structure of each MWE as preset in JMWEL. Treating each MWE in morpho-syntactic analysis as a chunk, or as a loose chunk that allows internal gaps, is intended to simulate the process of human speech understanding, following a postulate of cognitive linguistics.
This framework of morpho-syntactic analysis combined with MWE annotation will provide a firm basis for various kinds of next-generation NLP, such as semantic analysis of natural language, machine translation, information retrieval, discourse understanding, and dialogue systems.*
Incorporating JMWEL into existing Japanese parsers such as CaboCha, KNP, and the COTOHA API, so that they correctly annotate the MWEs occurring in input text, would also be a great stride for Japanese NLP.
*PAT. 5379318
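
The idea of fixing an MWE as a chunk or as a loose chunk with internal gaps can be illustrated with a small Python sketch. This is only a simplified illustration, not the patented method or the behavior of any existing parser: the function name `find_mwe` and the uniform `max_gap` parameter are assumptions, and the input is assumed to be already tokenized and normalized to hiragana lemma forms by an upstream analyzer. JMWEL itself marks which constituents can be internally modified, which a real system would use instead of a single gap limit.

```python
from typing import List, Tuple


def find_mwe(tokens: List[str], parts: List[str], max_gap: int = 0) -> List[Tuple[int, ...]]:
    """Return the token-index tuples where `parts` occur in order.

    max_gap = 0 matches only contiguous chunks; max_gap > 0 allows up to
    that many intervening tokens between consecutive MWE components.
    """
    matches: List[Tuple[int, ...]] = []

    def extend(part_idx: int, start: int, picked: List[int]) -> None:
        if part_idx == len(parts):
            matches.append(tuple(picked))
            return
        if part_idx == 0:
            candidates = range(len(tokens))          # first component may start anywhere
        else:
            candidates = range(start, min(len(tokens), start + max_gap + 1))
        for i in candidates:
            if tokens[i] == parts[part_idx]:
                extend(part_idx + 1, i + 1, picked + [i])

    extend(0, 0, [])
    return matches


mwe = ["あし", "を", "あらう"]                          # 足を洗う
contiguous = ["あし", "を", "あらう", "た"]              # 足を洗った
with_gap = ["かれ", "は", "あし", "を", "きっぱり", "あらう", "た"]  # 彼は足をきっぱり洗った

strict_hits = find_mwe(contiguous, mwe)               # chunk
no_hits = find_mwe(with_gap, mwe)                     # adverb blocks a strict chunk
loose_hits = find_mwe(with_gap, mwe, max_gap=1)       # loose chunk with one gap
```

The returned index tuples could then be attached to the parser's output as MWE annotations, leaving the intervening modifier (here きっぱり) available for ordinary syntactic analysis.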

Significance

Neural-network-based NLP, e.g., Google's neural machine translation, has recently achieved great success but has nevertheless revealed two basic problems. One is the lack of accountability for the linguistic processing, which makes troubleshooting difficult. The other is the difficulty of handling semantically non-compositional or less compositional phrases, i.e., MWEs.
In the history of rule-based NLP, the large number of MWE types and tokens occurring in natural language documents and utterances was long underestimated, until MWEs were first highlighted as a key issue for advanced NLP technology in the early 2000s. Since then, compiling MWE lexicons has been widely attempted in the NLP community, mostly by corpus-based statistical methods.** However, a comprehensive and reliable solution has probably not yet been obtained. JMWEL, whose scheme was established in the late 1960s, is presumably the world's first manually compiled, full-scale MWE lexicon for NLP.
** The ACL (Association for Computational Linguistics) has held an annual workshop on MWEs since 2003.

Configuration

Sub-lexicons classified by grammatical function

Sub-lexicons classified by theme