General Information
    • ISSN: 2010-3700 (Online)
    • Abbreviated Title: Int. J. Mach. Learn. Comput.
    • Frequency: Bimonthly
    • DOI: 10.18178/IJMLC
    • Editor-in-Chief: Dr. Lin Huang
    • Executive Editor: Ms. Cherry L. Chen
    • Abstracting/Indexing: Scopus (since 2017), EI (INSPEC, IET), Google Scholar, Crossref, ProQuest, Electronic Journals Library.
    • E-mail: ijmlc@ejournal.net

IJMLC 2018 Vol.8(2): 90-97 ISSN: 2010-3700
DOI: 10.18178/ijmlc.2018.8.2.669

The Role of Homograms in Machine Translation

Lucia Nacinovic Prskalo and Marija Brkic Bakaric
Abstract—The Croatian language is a pitch-accent language, in which the tone contour realized in the stressed syllable carries the lexical information. Therefore, in some cases, a different lexical accent gives the word a different meaning. In such cases, the ambiguity of the word in written texts, where accents are not usually marked, can be solved by determining the appropriate accent. There are also cases when various basic and derived forms of words have different meanings, different morphosyntactic descriptions (MSDs), and possibly different accents. When words have the same written forms but different meanings, they are called homograms. In order to resolve the ambiguity of homograms, we created a lexicon of homograms that is comprised of all Croatian nouns of different gender, which have the same written forms (if accents are not marked) but different meanings, MSDs, and possibly different accents. This lexicon consists of 19,366 entries and 3,460 unique homograms. Each entry in the lexicon comprises the homogram (unaccented word), the accented word, the corresponding MSD, and the accented lemma. The obtained lexicon enables us to identify and disambiguate homograms within the corpus efficiently and accurately. We also evaluated and analyzed the performance of machine translation (MT) systems for the Croatian–English language pair with a special emphasis on homogram translation. We confirmed that the disambiguation of homograms can improve the performance of MT systems in avoiding major translation mistakes related to assigning the wrong meaning to homograms.
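
The four-field entry described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the class name, the function names, and the sample entries are hypothetical, the word forms are invented placeholders rather than real Croatian nouns, and the MSD strings only imitate a MULTEXT-East-style tag format, which is an assumption about how the lexicon encodes them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    """One lexicon row with the four fields named in the abstract."""
    homogram: str        # unaccented written form
    accented_word: str   # the same word form with its accent marked
    msd: str             # morphosyntactic description (tag format assumed)
    accented_lemma: str  # accented base form


def build_index(entries):
    """Group entries by their unaccented form for direct lookup."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.homogram].append(entry)
    return dict(index)


def disambiguate(index, token, msd):
    """Return the entries for `token` whose MSD equals `msd`.

    An empty list means the token is not in the lexicon or that no
    reading carries the requested MSD.
    """
    return [e for e in index.get(token.lower(), []) if e.msd == msd]


# Invented placeholder entries, NOT real Croatian data.
ENTRIES = [
    LexiconEntry("foo", "fóo", "Ncmsn", "fóo"),
    LexiconEntry("foo", "foò", "Ncfsg", "foòa"),
]

index = build_index(ENTRIES)
readings = disambiguate(index, "foo", "Ncfsg")
```

In this sketch the MSD is assumed to come from an upstream tagger; given the unaccented token and its MSD, the lookup narrows the candidate readings to the matching accented word and lemma.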

Index Terms—Disambiguation of homograms, lexicon of homograms, pitch accent language, word sense disambiguation.

The authors are with the Department of Informatics, University of Rijeka, Croatia (e-mail: lnacinovic@inf.uniri.hr, mbrkic@inf.uniri.hr).


Cite: Lucia Nacinovic Prskalo and Marija Brkic Bakaric, "The Role of Homograms in Machine Translation," International Journal of Machine Learning and Computing, vol. 8, no. 2, pp. 90-97, 2018.

Copyright © 2008-2019. International Journal of Machine Learning and Computing. All rights reserved.