In recent years, statistical machine translation (SMT) has become the leading paradigm in machine translation. SMT systems are built by analyzing large volumes of parallel corpora and learning translation models from this data, and the quality of an SMT system depends largely on the size of its training data. Since the majority of parallel data exists for major languages, SMT systems for those languages are of much better quality than systems for smaller languages. This quality gap is deepened further by the complex linguistic structure of many smaller languages: languages such as Latvian, Lithuanian and Croatian (to name just a few) have rich morphology and free word order, and learning this complexity from corpus data requires much larger volumes of training material. Current systems are built on data accessible on the web, yet that is only a fraction of all parallel texts; most still reside in the local systems of corporations, public and private institutions, and on the desktops of individual users. The cost and know-how required to build custom MT solutions deter many small-to-medium companies from utilizing the power of MT technologies.
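To illustrate what "learning translation models from parallel data" means in practice, the following is a toy sketch of IBM Model 1 word-alignment training, a standard building block underlying phrase-based SMT. The three-sentence corpus and the language pair are purely illustrative, not data from any actual system described here:

```python
from collections import defaultdict

# Toy parallel corpus (illustrative pairs only; real SMT systems
# are trained on millions of aligned sentence pairs).
corpus = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]
pairs = [(e.split(), f.split()) for e, f in corpus]

# Initialize translation probabilities t(e|f) uniformly over the
# target-side vocabulary.
e_vocab = {e for es, _ in pairs for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)  # expected co-occurrence counts
    total = defaultdict(float)  # normalizers per source word
    for es, fs in pairs:
        for e in es:
            # E-step: distribute each target word's probability mass
            # over the source words it may align to.
            norm = sum(t[(e, f)] for f in fs)
            for f in fs:
                c = t[(e, f)] / norm
                count[(e, f)] += c
                total[f] += c
    # M-step: re-estimate t(e|f) from the expected counts.
    for (e, f) in count:
        t[(e, f)] = count[(e, f)] / total[f]

# After a few iterations, "haus" aligns most strongly with "house".
best = max(e_vocab, key=lambda e: t[(e, "haus")])
```

Even on this tiny corpus the EM estimates separate true translation pairs from mere co-occurrences; with morphologically rich languages, each surface form appears far less often, which is why much more data is needed to reach comparable estimates.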
A short description of the task performed by the Croatian partner