Broadening statistical machine translation with comparable corpora and generalized models

This video was recorded at the Center for Language and Speech Processing (CLSP) Seminar Series. As we scale statistical machine translation (SMT) systems to broad domains, we face many challenges. This talk outlines two approaches to building better broad-domain systems.

First, progress in data-driven translation is limited by the availability of parallel data. A promising strategy for mitigating this scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment-extraction approaches have shown that including parallel fragments in SMT training data can significantly improve translation quality. We describe efficient and effective generative models for extracting such fragments, and demonstrate that these algorithms produce substantial improvements on out-of-domain test data without degrading in-domain performance.

Second, many modern SMT systems are heavily lexicalized. While lexicalized models excel on in-domain test data, their quality falls off as the test data broadens. The second part of the talk describes robust generalized models that exploit lexicalization when it is available and back off to linguistic generalizations otherwise. This approach yields large improvements over baseline phrasal systems on broad-domain test sets.

