Ponte Academic Journal
Apr 2015, Volume 71, Issue 4

Multilingual Focused Crawling: Fetching Topic-Specific Web Documents in Different Languages

Author(s): Ari Pirkola

Abstract:
This study analyzes multilingual focused crawling, that is, the process of fetching topic-specific Web pages in multiple languages. We developed a focused crawler that uses fuzzy matching to identify similar link words in different languages, which allows it to traverse topic-specific pages written in different languages. The results reported in this paper are based on 15 test topics and 90 crawls in the domains of genomics, genetics, and rare diseases. The start URL pages were written in English, German, and Spanish. The languages of the crawled pages were detected by a standard n-gram based language identification algorithm. The crawled documents were written in 17 different languages, and the percentage of documents in each language is reported. We also investigated the frequency of cross-lingual links within a topic, i.e., the frequency of cases where a relevant child document and its parent document are written in different languages. The results showed that the language of the start URL documents largely determines the language of the obtained documents: the crawled documents are to a great extent written in the same language as the start URL documents. However, the share of English documents was high in all results. In agreement with these results, we found that relevant cross-lingual links are infrequent, with the exception of links to English pages.
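
The abstract does not say which fuzzy matching technique the crawler uses. As an illustration only, the sketch below shows one common approach, character-bigram overlap scored with the Dice coefficient, which can recognize that link words such as "Genetik" (German) and "genética" (Spanish) resemble the English topic term "genetics". The function names, the bigram size, and the 0.5 threshold are all assumptions made for this example and are not taken from the paper.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    if len(w) < n:
        # Words shorter than n are treated as a single gram (empty -> no grams).
        return {w} if w else set()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-gram sets, in the range [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def matches_topic(link_word, topic_terms, threshold=0.5):
    """True if the link word is similar enough to any topic term.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return any(dice_similarity(link_word, term) >= threshold
               for term in topic_terms)


if __name__ == "__main__":
    topic_terms = ["genetics"]
    for word in ["Genetik", "genética", "contact"]:
        print(word, matches_topic(word, topic_terms))
```

In a crawler built along these lines, links whose anchor words pass `matches_topic` would be prioritized for fetching regardless of the page language; a lower threshold would follow more cross-lingual links at the cost of more off-topic pages.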