A comprehensive list of fixed and semi-fixed multiword expressions in the Japanese language: JEFI

Overview and Coverage

The Japanese Multiword Expression Lexicon (JMWEL) is a comprehensive database with a rich set of grammatical attributes, fine-tuned for phrase-based Natural Language Processing (NLP) applications such as machine translation (MT) and information retrieval (IR), as well as for morphological and syntactic analysis of a wide range of Japanese documents. It contains about 160,000 entry lemmas (type-level) covering almost all kinds of linguistically idiosyncratic but commonly used Japanese phrases, i.e. idioms, quasi-idioms, collocations, quasi-collocations, clichés, quasi-clichés, proverbs, and old sayings.

JMWEL consists of eight basic sub-lexicons, each corresponding to a grammatical function: nominal, verbal, adjectival, adjective-verbal, adverbial, adnominal, connective, and functional expressions. In addition, JMWEL has five topic-based sub-lexicons: standard idioms, onomatopoeic expressions, proverbs/sayings/clichés, syntactically ill-formed phrases, and four-character idioms (see Configuration).

Notable features of JMWEL are summarized as follows:

Extensive collection of expressions
Deliberate description of the notational variants of each expression
Detailed description of morphological and syntactic structure of each expression
Description of internal modifiability, i.e. possibility of the presence of gaps, in each expression

The following table shows the basic-information fields that are common to all sub-lexicons. Some sub-lexicons have a few additional fields, depending on their specific characteristics.

Description of Selected Fields

Lemma	The expression written in hiragana, without space markers.
Segmentation	The lemma separated by space markers into a string of morphemes.
Notational variants	All possible written forms of each morpheme, i.e. kanji or katakana spellings.
Syntactic function	A symbol denoting the syntactic function of the expression. It corresponds to the "part of speech" of a single word.
Syntactic structure	The dependency structure (equivalently, the parse tree) of the expression. Constituents that can be internally modified are marked.
Left context condition	Specific words or phrases that must appear in the left context of the expression in the sentence, if any.
Right context condition	Specific words or phrases that must appear in the right context of the expression in the sentence, if any.
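
The fields above can be pictured as a simple record. The following is a minimal sketch only: the class name, attribute names, the "V" function symbol, and the string used for the syntactic structure are all assumptions made for illustration and do not reflect the actual JMWEL file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MweEntry:
    """Hypothetical in-memory record for one entry (not the official schema)."""
    lemma: str                             # expression in hiragana, no space markers
    segmentation: List[str]                # morphemes of the lemma, in order
    notational_variants: List[List[str]]   # per-morpheme kanji/katakana spellings
    syntactic_function: str                # symbol playing the role of a part of speech
    syntactic_structure: str               # dependency structure; encoding is assumed
    left_context: Optional[str] = None     # required left-context words, if any
    right_context: Optional[str] = None    # required right-context words, if any

    def is_consistent(self) -> bool:
        """Check that the morphemes rebuild the lemma and that every
        morpheme has a (possibly empty) list of variants."""
        return ("".join(self.segmentation) == self.lemma
                and len(self.notational_variants) == len(self.segmentation))


# Illustrative entry for the idiom "ashi o arau" (to quit a shady business).
entry = MweEntry(
    lemma="あしをあらう",
    segmentation=["あし", "を", "あらう"],
    notational_variants=[["足"], [], ["洗う"]],
    syntactic_function="V",                 # assumed symbol for a verbal expression
    syntactic_structure="((あし を) あらう)",  # assumed bracket notation
)
print(entry.is_consistent())
```

Keeping the segmentation and the notational variants as parallel lists makes it easy to verify that an entry is internally coherent before it is used in analysis.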

Application

The primary application of JMWEL is to annotate idiosyncratic word strings, i.e. multiword expressions (MWEs), on the morpho-syntactic structure of an input Japanese sentence. The structure of the whole sentence should be consistent with the internal structure of each MWE, which is preset in JMWEL. Treating each MWE in morpho-syntactic analysis as a chunk, or as a loose chunk that allows internal gaps, simulates the process of human speech understanding, following a postulate of cognitive linguistics.
This framework of morpho-syntactic analysis combined with MWE annotation will provide a firm basis for various kinds of next-generation NLP, such as semantic analysis of natural language, machine translation, information retrieval, discourse understanding, and dialogue systems. *
It will also be a great stride for Japanese NLP to strengthen existing Japanese parsers such as CaboCha, KNP, and COTOHA API by incorporating JMWEL, so that they correctly annotate the MWEs occurring in the input text.
*PAT. 5379318
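
The idea of fixing an MWE as a chunk, or as a loose chunk that allows gaps, can be illustrated with a small matcher. This is a rough sketch under stated assumptions, not the method used by JMWEL or by any of the parsers named above: the function name is hypothetical, the input is assumed to be already segmented and lemmatized, and a single uniform gap limit stands in for the per-constituent modifiability marks that JMWEL records.

```python
from typing import List, Optional


def match_loose_chunk(tokens: List[str], pattern: List[str],
                      max_gap: int = 0) -> Optional[List[int]]:
    """Return the token indices that match `pattern` in order, or None.

    At most `max_gap` extra tokens may intervene between two consecutive
    constituents; max_gap=0 means a strictly contiguous chunk.
    """
    def search(start: int, k: int, first: bool) -> Optional[List[int]]:
        if k == len(pattern):
            return []
        # The first constituent may start anywhere; later ones must fall
        # within the allowed gap after the previous match.
        end = len(tokens) if first else min(len(tokens), start + max_gap + 1)
        for i in range(start, end):
            if tokens[i] == pattern[k]:
                rest = search(i + 1, k + 1, False)
                if rest is not None:
                    return [i] + rest
        return None

    return search(0, 0, True)


# "ki ga tsuku" (to notice) with an adverb inserted between "ga" and "tsuku".
tokens = ["かれ", "は", "き", "が", "ぜんぜん", "つく", "ない"]
pattern = ["き", "が", "つく"]

print(match_loose_chunk(tokens, pattern, max_gap=0))  # None: the chunk is interrupted
print(match_loose_chunk(tokens, pattern, max_gap=1))  # [2, 3, 5]
```

A real annotator would also have to check the left and right context conditions and the consistency of the MWE's preset structure with the parse of the whole sentence, which this sketch leaves out.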

Significance

Neural-network-based NLP, e.g. Google's neural machine translation, has recently achieved great success, but has nevertheless revealed two basic problems. One is the lack of accountability for its linguistic processing, which makes troubleshooting difficult. The other is the difficulty of handling semantically non-compositional or less compositional phrases, i.e. MWEs.
In the history of rule-based NLP, the large number of MWE types and tokens occurring in natural language documents and utterances was long underestimated, until MWEs were first highlighted as a key issue for advanced NLP technology in the early 2000s. Since then, compiling MWE lexicons has been widely attempted in the NLP community, mostly by corpus-based statistical methods.** However, a comprehensive and reliable solution has probably not yet been obtained. JMWEL, whose scheme was established in the late 1960s, is presumably the world's first manually compiled, full-scale MWE lexicon for NLP.
** The ACL (Association for Computational Linguistics) has held an annual workshop on MWEs since 2003.

Configuration

Sub-lexicons classified by grammatical function

JMWEL_nominal（名詞性）---- 28,400 expressions including institutionalized phrases and non-compositional compound nouns.
JMWEL_verbal（動詞性）
- class1 NpV trigram --- 36,500 expressions including LVCs and SVCs
- class2 --- 36,300 expressions
- class3 --- 4,200 expressions including non-compositional compound-verbs
JMWEL_adjective（形容詞性） ---- 5,800 expressions.
JMWEL_adjective-verbal（形容動詞性） ---- 2,800 expressions.
JMWEL_adverbial（副詞性） ---- 17,600 expressions.
JMWEL_adnominal-adjective（連体詞性）---- 17,100 expressions.
JMWEL_discourse-marker（文脈指標性） ---- 1,900 expressions including complex discourse-markers and complex sentence adverbs.
JMWEL_functional（機能語性）
- post-predicative（助動詞、終助詞性） --- 5,100 expressions including complex auxiliary verbs and complex ending-particles
- postpositional（格助詞、接続助詞性） --- 2,700 expressions including complex case-particles and complex connective-particles
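
As a quick consistency check, the counts listed above for the function-based sub-lexicons can be summed and compared with the figure of about 160,000 entry lemmas given in the overview. The dictionary keys below are shorthand labels chosen for this sketch, not official identifiers.

```python
# Counts copied from the list above; keys are shorthand labels (assumed names).
counts = {
    "nominal": 28_400,
    "verbal_class1": 36_500,
    "verbal_class2": 36_300,
    "verbal_class3": 4_200,
    "adjective": 5_800,
    "adjective_verbal": 2_800,
    "adverbial": 17_600,
    "adnominal_adjective": 17_100,
    "discourse_marker": 1_900,
    "functional_post_predicative": 5_100,
    "functional_postpositional": 2_700,
}

total = sum(counts.values())
print(f"{total:,}")  # 158,400
```

The sum of 158,400 is consistent with the rounded total stated in the overview.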

Sub-lexicons classified by theme

JMWEL_standard idiom（標準的慣用句） --- 4,900 expressions.
JMWEL_onomatopoeic expression（オノマトペ共起表現） --- 43,600 expressions.
JMWEL_four-character-idiom（四字熟語） --- 3,200 expressions.
JMWEL_incomplete phrase（不完全句） --- 470 expressions.
JMWEL_proverb/saying/cliché（格言/成句/決まり文句） --- 4,000 expressions.