A [Better] CAT Breed for the Slavic Soul

Aha! I said to myself upon spying this presentation among the 2013 ATA Conference’s offerings. At last, I will find out which elusive CAT tool actually does a good job with Slavic languages! I had tried several tools, but hadn’t yet run across one that was able to accommodate the peculiarities of my language, Russian, particularly when it came to all of the inflected forms.

Alas, it took no more than two slides for me to be sorely disappointed – not in Konstantin Lakshin's presentation, but in the sad news that there is, in fact, no such thing as a good CAT tool for Slavic languages. Or, at least, there isn’t yet.

Despite my initial dismay at the news, I fortunately stayed to hear the entire presentation. It can be briefly summarized as follows: A combination of technical, linguistic, and particularly market forces has conspired to make CAT tools what they are today: decidedly Slavic-unfriendly. The good news is that many of the pieces needed to improve them already exist, and it’s up to us to put pressure on developers and companies to make use of those pieces.

The reason it took the better part of an hour to provide this information is that the presentation included a lot of very interesting history, examples, and details. It really was quite educational, at least for me.

Kostya started by outlining the history of computer use in translation, and the development of CATs in particular. He began with a discussion of a 1966 government-funded report by the Automatic Language Processing Advisory Committee on the use of computer technology in translation. The gist of this report as it applies to our CAT tool discussion is that machine translation doesn’t work well, but that something vaguely resembling what we now consider a CAT tool, with a similar workflow, might be useful. This pseudo-CAT workflow used the punch card operator – i.e., a human being – as a morphology analyzer. This is interesting, because one of our principal complaints about today’s CAT tools is that they do not have morphology analysis capability. The report also compared use of this early form of CAT with a standard translation process, and found that while it might save some time, its primary advantage was that it “relieve[d] the translator of the unproductive and tiresome search for the correct technical terms.” The report emphasized that compiling the proper termbase was really the key to an effective translation tool.

In the decade or so following the report, the emphasis in computer-assisted translation was thus on building termbanks. In other words, the focus was on words and phrases – small subsegments, if you will – and these termbanks were generally compiled for specific large organizations operating in specific contexts and were not readily transferable to other entities.

The philosophy that drives current CAT tools – the “recycling” of previously translated texts – emerged fully only in 1979, though large corporations had begun exploring this starting in the late 1960s. This philosophy was in great part a result of the requirements and technologies in place at the time. In the 1960s, for instance, the world was a less integrated place, and there was limited control over the input side – the source text content, editing, and so on. The example Kostya provided was scientific texts coming out of the USSR that were being translated. Fast-forward to the 1980s and 1990s: large corporations have end-to-end control of processes and utilize translation (and translation technology) for their own documents. In this latter context, being able to retrieve and reuse entire sentences made a lot of sense. Note also that in the prevailing markets in which the early CAT tools developed, the primary languages were not highly inflected.

In the late 1980s and early 1990s, the first commercially available CAT tools appeared: IBM Translation Manager II, XL8, Eurolang, and two still-familiar tools, Trados and Star Transit. Trados, in particular, started life as a language services provider trying to get an IBM contract.

The mid- to late 1990s saw the emergence of tools being created ostensibly for translators: Déjà Vu, memoQ, and Wordfast. However, rather than being fundamentally different from their larger predecessors, these often turned out to be essentially smaller, less functional versions of Trados. This era also witnessed the development of smaller commercial players, such as WordFisher (a set of Word macros) and in-house tools such as Lionbridge, Foreign Desk, and Rainbow (specifically for software localization), as well as OmegaT, the first open-source CAT tool.

That brings us to the present day, the 2000s, when there are too many CAT tools to list, and there have been many mergers and acquisitions among them. However, NONE of the existing tools can be considered very useful for Slavic or other highly inflected languages. In addition to the reasons noted above, there were other issues that contributed to this situation as the software was being developed. First, there were no obvious ways to incorporate Cyrillic into early software. Second, there were additional market forces, such as software piracy, the cross-border digital divide, and the lack of major clients, that provided little incentive to software developers to make CAT tools that would be particularly useful in Slavic-language markets.

Today, we have a much wider playing field in terms of the market for translation. Translation work is “messier” now, and involves things like corporate rebranding and renaming, a variety of dialects and non-native speech, outsourcing, rewrites for search engine optimization, and bidirectional editing in which both source and target documents are being modified. In this environment, the old “termbase plus recycled text” CAT model is not sufficient.

From this historical background, Kostya next proceeded to illustrate just what difficulties Slavic languages present for today’s CAT tools. These can be boiled down to their relatively free word order, their rich morphology, and their highly inflected nature. The CAT tool’s “fuzzy match” capabilities are insufficient for Slavic languages.

Kostya then provided a number of illustrative examples. Consider the following pairs of segments:

              To open the font menu, press CTRL+1.

              Press CTRL+1 to open the font menu.

              Analyzing and characterizing behaviors

              Analysing and characterising behaviours

He ran these and other examples through about a half-dozen CAT tools using a 50% match cutoff, and found that the first example was considered only a 60-80% match, and the second was 0% (in other words, below the 50% threshold). The CAT tools on the market generally do not recognize partial segments in a different order, nor can they tell that “analyzing” and “analysing” are essentially the same word. In other words, they lack language-specific subsegment handling, and morphology-aware matching, searching, and term management. They are also missing form agreement awareness (e.g., noun/adjective case agreement). This diminishes their utility for those translating out of Slavic languages, to be sure, but it also complicates matters for those translating into Slavic languages, as word endings in retrieved fuzzy matches must constantly be checked and corrected.
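To make the word-order problem concrete, here is a minimal Python sketch that scores the two segment pairs above in two ways: once as ordered token sequences and once as unordered sets of tokens. It is only an illustration and does not reproduce the scoring of any actual CAT tool, which the presentation did not describe; the function names `tokens`, `ordered_score`, and `bag_score` are invented for this example, and the tokenizing rule is an assumption.

```python
import re
from difflib import SequenceMatcher


def tokens(segment):
    """Lowercase a segment and split it into word-like tokens.

    Assumption: letters, digits, underscores and '+' make up a token,
    so that 'CTRL+1' stays in one piece.
    """
    return re.findall(r"[\w+]+", segment.lower())


def ordered_score(a, b):
    """Similarity of the two token sequences; sensitive to word order."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


def bag_score(a, b):
    """Order-insensitive overlap (Jaccard index) of the two token sets."""
    ta, tb = set(tokens(a)), set(tokens(b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


pairs = [
    ("To open the font menu, press CTRL+1.",
     "Press CTRL+1 to open the font menu."),
    ("Analyzing and characterizing behaviors",
     "Analysing and characterising behaviours"),
]

for a, b in pairs:
    print(f"ordered={ordered_score(a, b):.2f}  bag={bag_score(a, b):.2f}  {a!r}")
```

Under these assumptions, the first pair scores about 0.71 when word order matters and 1.0 when it does not, while the second pair stays below 0.5 either way, because whole-word comparison treats “analyzing” and “analysing” as unrelated tokens.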

The obvious question that Kostya asked next was: can this situation be fixed? In theory, yes. Kostya believes that many software tools already in use by search engines, machine translation, and the like could be integrated into CAT tools. These include Levenshtein distance analyzers that can handle differences within words; computational linguistics tools such as taggers, parsers, chunkers, tokenizers, stemmers, and lemmatizers, which analyze such things as syntax and word construction; morphology modules; and even Hunspell, the engine already in use by numerous CAT tools for spellchecking but not for analyzing matches.
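As a rough illustration of the first item on that list, the Python sketch below computes Levenshtein (edit) distance between words and uses it to treat near-identical tokens as matches. It is a toy stand-in under stated assumptions, not a proposal for how a CAT tool should work: the helper names `similar` and `fuzzy_token_score` and the 0.34 threshold are invented for this example, and edit distance is only a crude proxy for real morphology, since it will also pair up unrelated short words that happen to differ by one letter. A lemmatizer or morphology module would be needed to do this properly.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def similar(a, b, max_ratio=0.34):
    """True if the edit distance is small relative to the longer word.

    The 0.34 threshold is an arbitrary assumption for this sketch.
    """
    longest = max(len(a), len(b))
    return longest == 0 or levenshtein(a, b) / longest <= max_ratio


def fuzzy_token_score(source, candidate):
    """Share of source tokens that have a near match in the candidate."""
    src = source.lower().split()
    cand = candidate.lower().split()
    if not src:
        return 1.0 if not cand else 0.0
    hits = sum(any(similar(s, c) for c in cand) for s in src)
    return hits / len(src)


print(levenshtein("analyzing", "analysing"))   # spelling variant
print(levenshtein("файл", "файлом"))           # Russian nominative vs. instrumental
print(fuzzy_token_score("Analyzing and characterizing behaviors",
                        "Analysing and characterising behaviours"))
```

With this kind of within-word tolerance, the British and American spellings from the earlier example count as the same tokens, and Russian case forms such as “файл” and “файлом” fall within the threshold as well.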

Developers continue to cite obstacles to integrating these tools: it’s complicated, they are too language-specific, we don’t know how to set up the interface, there are licensing issues, we have limited resources. While all of these are legitimate factors, Kostya believes that they do not present insurmountable obstacles. He is hopeful that developers will start seeing these tools as data abstraction tools that enable the software to break down the data into something that is no longer language-specific.

So what can we do about this lack of suitable CAT tools? Kostya’s recommendation is principally that we talk to software developers and vendors and explain what we want. We need to create our own market pressure to move things along. In addition, we need to educate developers and vendors about the existing tools that are available; for instance, we might point them to non-English search engines that utilize morphology analyzers.

Alas, there is neither a good CAT tool for the Slavic soul nor a quick fix to this situation. But after listening to Kostya’s presentation, I have a much better understanding of how this situation developed and how we might take action to prompt vendors and developers to move in a new direction.