Ponte Academic Journal
Nov 2015, Volume 71, Issue 11

Modelling Highly Inflective Language for Target Applications Using Natural Language

Author(s): Mirjam Sepesy Maucec, Janez Brest, Andrej Zgank




Abstract:
Language models are widely used in different applications including speech recognition, machine translation, handwritten recognition, stenographic codes conversion, and information retrieval. In such applications, language models are meant for constraining the search space by delivering a priori probabilities of possible word sequences. Language is a robust and necessarily redundant communication mechanism. Its redundancies commonly manifest themselves as predictable patterns in word sequences, and it is largely these patterns that enable language modelling. Several methods for statistical language modelling were originally developed for English and declared as language independent. Although they do not incorporate linguistic knowledge of the English language, the results for other languages are only modestly successful. Our general goal is the treatment of inflective languages. The idea of the paper is to adjust language modelling methods to make them more powerful when modeling inflective languages. High inflection in a language is correlated with some degree of word-order flexibility. Morphological features either directly identify or help disambiguate the syntactic participants of a sentence. Modelling morphological features in a language not only provides an additional source of information but also alleviate data sparsity problems. In this research Slovenian language is taken as an example of highly inflective languages. The results of comparative analysis of four language model types are presented: word-based, lemma-based, POS (Part-Of-Speech)-based and MSD (Morpho-Syntactic-Description)-based language models. Some combinations of them in terms of linear interpolation are investigated. Experiments are performed using the largest Slovenian corpus FidaPLUS. It is lemmatized and tagged with POS and MSD tags. Constructed language models are evaluated by perplexity values. Our experiments prove that interpolated models outperform a classical language model. 
The use of language models is demonstrated in two prototype systems: speech recognition and machine translation.
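The two techniques named in the abstract, linear interpolation of language models and evaluation by perplexity, can be illustrated with a minimal sketch. The example below interpolates a word-based and a lemma-based unigram model; all tokens, counts, and the interpolation weight are illustrative assumptions, not data from the paper or the FidaPLUS corpus, and real systems would use higher-order n-grams with proper smoothing.

```python
import math

def unigram_probs(tokens):
    """Maximum-likelihood unigram probabilities from a token list."""
    total = len(tokens)
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return {t: c / total for t, c in counts.items()}

def interpolated_prob(token, model_a, model_b, lam):
    """Linear interpolation: P(w) = lam * P_a(w) + (1 - lam) * P_b(w).
    A tiny floor probability stands in for unseen-word smoothing."""
    return lam * model_a.get(token, 1e-10) + (1 - lam) * model_b.get(token, 1e-10)

def perplexity(test_tokens, model_a, model_b, lam):
    """Perplexity = exp(-(1/N) * sum_i log P(w_i))."""
    log_sum = sum(math.log(interpolated_prob(t, model_a, model_b, lam))
                  for t in test_tokens)
    return math.exp(-log_sum / len(test_tokens))

# Hypothetical training data: Slovenian surface forms share counts once
# mapped to lemmas (hisa/hise/hisi -> hisa), which eases data sparsity.
word_model = unigram_probs(["hisa", "hise", "hisi", "vrt", "vrta"])
lemma_model = unigram_probs(["hisa", "hisa", "hisa", "vrt", "vrt"])

test = ["hisa", "vrt"]
print(perplexity(test, word_model, lemma_model, 0.5))
```

With equal weights, the interpolated probability of "hisa" is 0.5 * 0.2 + 0.5 * 0.6 = 0.4, higher than the word model alone assigns; a lower perplexity on held-out text is the criterion by which the interpolated models are judged in the paper.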