Kotoba no Mori (ことばの森): Japanese Language Processing Research Workshop

A Large-Scale Japanese Multiword Expression Lexicon: JMWEL (Toward Next-Generation Japanese Language Processing)

Overview and Coverage

The Japanese Multiword Expression Lexicon (JMWEL) is a comprehensive database with a rich set of grammatical attributes, fine-tuned for phrase-based Natural Language Processing (NLP) applications such as machine translation (MT) and information retrieval (IR), as well as for morphological and syntactic analysis of a wide range of Japanese documents. It contains about 140,000 entries covering almost all kinds of linguistically idiosyncratic but commonly used Japanese phrases: idioms, quasi-idioms, collocations, quasi-collocations, clichés, quasi-clichés, proverbs, and old sayings.

JMWEL consists of eight basic sub-lexicons, one for each grammatical function: nominal, verbal, adjectival, adjective-verbal, adverbial, adnominal, connective, and functional expressions. In addition, JMWEL has five topic-based sub-lexicons: standard idioms, onomatopoeic expressions, adverbs/sayings/clichés, syntactically ill-formed phrases, and four-kanji idioms.

Notable features of JMWEL are summarized as follows:

The following table shows the basic-information fields that are common to all sub-lexicons. Some sub-lexicons have a few additional fields, depending on their specific characteristics.

Description of Selected Fields

Lemma: The expression written in hiragana, without space markers.
Segmentation: The lemma divided by space markers into a string of morphemes.
Notational variants: For each morpheme, all possible character-type variants, i.e. kanji or katakana, are given.
Syntactic function: A symbol denoting the syntactic function of the expression. It corresponds to the "part of speech" of a single word.
Syntactic structure: The dependency structure (equivalently, the parse tree) of the expression. Constituents that can be internally modified are marked.
Left context condition: Specific words or phrases required in the left context of the expression in a sentence, if any.
Right context condition: Specific words or phrases required in the right context of the expression in a sentence, if any.
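
To make the field descriptions concrete, here is a minimal Python sketch of how one entry might be held in memory. It is an illustration under stated assumptions, not JMWEL's actual storage format: the class name `MweEntry`, its attribute names, and the label "verbal" are hypothetical, and the sample idiom (足を洗う, written あしをあらう in hiragana) was chosen for illustration rather than copied from the lexicon. The syntactic structure field is omitted because its notation is not described here.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Optional


@dataclass
class MweEntry:
    """Hypothetical in-memory record mirroring the basic fields described above."""
    lemma: str                     # hiragana, no space markers
    segmentation: str              # lemma split into morphemes by single spaces
    variants: Dict[str, List[str]] = field(default_factory=dict)  # morpheme -> kanji/katakana forms
    syntactic_function: str = ""   # placeholder label; JMWEL's real symbols are not assumed here
    left_context: Optional[str] = None   # required left-context word or phrase, if any
    right_context: Optional[str] = None  # required right-context word or phrase, if any

    def morphemes(self) -> List[str]:
        """Return the morphemes given by the segmentation field."""
        return self.segmentation.split(" ")

    def is_consistent(self) -> bool:
        """Check that removing the space markers reproduces the lemma."""
        return "".join(self.morphemes()) == self.lemma

    def surface_forms(self) -> List[str]:
        """Enumerate written forms by combining the per-morpheme notational variants."""
        options = [[m] + self.variants.get(m, []) for m in self.morphemes()]
        return ["".join(combo) for combo in product(*options)]


# Illustrative entry only (not taken from JMWEL).
entry = MweEntry(
    lemma="あしをあらう",
    segmentation="あし を あらう",
    variants={"あし": ["足"], "あらう": ["洗う"]},
    syntactic_function="verbal",
)

if __name__ == "__main__":
    print(entry.is_consistent())
    for form in entry.surface_forms():
        print(form)
```

Storing the variants per morpheme, rather than per whole expression, follows the field description above and lets the mixed kanji and hiragana spellings of an expression be generated on demand instead of being listed one by one.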