Broadening statistical machine translation with comparable corpora and generalized models

This video was recorded at the Center for Language and Speech Processing (CLSP) Seminar Series. As we scale statistical machine translation (SMT) systems to broad domains, we face many challenges. This talk outlines two approaches to building better broad-domain systems.

First, progress in data-driven translation is limited by the availability of parallel data. A promising strategy for mitigating this scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment-extraction approaches have shown that including parallel fragments in SMT training data can significantly improve translation quality. We describe efficient and effective generative models for extracting such fragments, and demonstrate that these algorithms produce substantial improvements on out-of-domain test data without degrading in-domain performance.

Second, many modern SMT systems are heavily lexicalized. While lexicalized models excel on in-domain test data, their quality falls off as the test data broadens. The second part of the talk describes robust generalized models that exploit lexicalization when it is available and back off to linguistic generalizations otherwise. This approach yields large improvements over baseline phrasal systems on broad-domain test sets.

