Language Comparison Based on Parallel Texts

A large body of research uses linguistic corpora to investigate the structures of individual languages; in contrast, very few studies investigate linguistic structures across languages on a corpus-linguistic basis. This project is devoted to developing quantitative, corpus-based methods for analyzing linguistic structures from a typological, or comparative-linguistic, perspective. In doing so, we assume that general algorithmic methods can yield a good approximation of the structures of individual languages. The project has three goals. First, we will process corpora of under-researched languages using computational-linguistic procedures to the point where they can be used for typological language comparison. Since the corpora processed in this way are not annotated, we will complement them with parallel corpora, which give us a starting point for investigating the unannotated corpora with automatic procedures. Second, we will use existing algorithms and develop new ones to annotate the corpora we create and to extract from them the statistics relevant for automatically determining typological parameters. Third, we will investigate how much linguistic knowledge about individual languages is required to determine a typological parameter.
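
To make the idea of extracting a statistic from unannotated parallel text concrete, the following is a minimal sketch and not the project's own method. It computes the type-token ratio (distinct word forms divided by running words) for each side of a sentence-aligned parallel text. On the same content, a higher ratio is sometimes read as a rough cue for richer inflectional morphology, although this is only a heuristic. The function names, the whitespace tokenization, and the toy English and Latin sentences are all assumptions made for illustration.

```python
def type_token_ratio(sentences):
    """Return distinct word forms divided by total word tokens."""
    # Whitespace tokenization is an assumption; real corpora need
    # language-specific tokenization.
    tokens = [tok.lower() for sent in sentences for tok in sent.split()]
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty text")
    return len(set(tokens)) / len(tokens)


def compare_parallel(parallel):
    """Map each language code to the type-token ratio of its text.

    `parallel` maps a language code to a list of sentences; all lists
    must have the same length so that sentence i expresses the same
    content in every language.
    """
    lengths = {len(sents) for sents in parallel.values()}
    if len(lengths) > 1:
        raise ValueError("parallel texts must be sentence-aligned")
    return {lang: type_token_ratio(sents) for lang, sents in parallel.items()}


# Toy parallel text, invented for illustration only.
toy = {
    "eng": ["the dog sees the cat", "the cat sees the dog"],
    "lat": ["canis felem videt", "feles canem videt"],
}

ratios = compare_parallel(toy)
for lang, ratio in sorted(ratios.items()):
    print(f"{lang}: {ratio:.2f}")
```

In this toy example the Latin side has the higher ratio, because case marking yields distinct forms (canis/canem, feles/felem) where English repeats the same word form. A measure like this needs no annotation, which is why parallel text is a convenient starting point; any actual typological parameter would require far more data and a more careful statistic.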