Paraphrasing and Translation

Chris Callison-Burch

Doctor of Philosophy
Institute for Communicating and Collaborative Systems
School of Informatics
University of Edinburgh
2007

Abstract

Paraphrasing and translation have previously been treated as unconnected natural language processing tasks. Whereas translation represents the preservation of meaning when an idea is rendered in the words of a different language, paraphrasing represents the preservation of meaning when an idea is expressed using different words in the same language. We show that the two are intimately related. The major contributions of this thesis are as follows:

• We define a novel technique for automatically generating paraphrases using bilingual parallel corpora, which are more commonly used as training data for statistical models of translation.
• We show that paraphrases can be used to improve the quality of statistical machine translation by addressing the problem of coverage and introducing a degree of generalization into the models.
• We explore the topic of automatic evaluation of translation quality, and show that the current standard evaluation methodology cannot be guaranteed to correlate with human judgments of translation quality.

Whereas previous data-driven approaches to paraphrasing were dependent upon either data sources which were uncommon, such as multiple translations of the same source text, or language-specific resources such as parsers, our approach is able to harness more widely available parallel corpora and can be applied to any language which has a parallel corpus. The technique was evaluated by replacing phrases with their paraphrases, and asking judges whether the meaning of the original phrase was retained and whether the resulting sentence remained grammatical. Paraphrases extracted from a parallel corpus with manual alignments are judged to be accurate (both meaningful and grammatical) 75% of the time, retaining the meaning of the original phrase 85% of the time.
Using automatic alignments, meaning can be retained at a rate of 70%.

Because it is language-independent and probabilistic, our method can be easily integrated into statistical machine translation. A paraphrase model derived from parallel corpora other than the one used to train the translation model can be used to increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.

Acknowledgements

I had the great fortune to be doing research in machine translation at a time when the subject was just beginning to flourish at Edinburgh. When I began my graduate work, I was the only person working on the topic at the university. As I leave, there are five other PhD students, three full-time researchers, and two faculty members all striving towards the same goal. The School of Informatics is undoubtedly the best place in the world to be studying computational linguistics, and the intellectual community here is simply amazing. I am grateful to every member of that community but would like to single out the following people to whom I am especially indebted:

• My PhD supervisor, Miles Osborne, whose data-intensive linguistics class opened my eyes to statistical NLP and played a crucial role in my deciding to stay at Edinburgh for the PhD. His endlessly creative ideas and boundless enthusiasm made our weekly meetings in his office (and at the pub) a true joy.
As much as it is due to any one person, my success at Edinburgh is due to Miles.
• My best friend and business partner, Colin Bannard, without whom I would not have founded Linear B. One of my fondest memories of Edinburgh is sitting in our living room trying to name the company. Linear B was perfect since it allowed us to convey to investors that we use clever methods to decipher foreign languages, while at the same time tacitly acknowledging that it might take us decades to do so.
• Josh Schroeder, who is the primary reason that it did not take decades to achieve all that we did at Linear B. Josh lived in the boxroom in my flat for a year, intrepidly writing code so elegant and easy to maintain that I still use it to this day. Linear B put me in the enviable position of having two full-time programmers working for me during my PhD. The quality and amount of research that I was able to produce as a result far outstripped what I would have been able to do alone.
• Philipp Koehn joined the faculty at Edinburgh after I hounded him to apply and then lobbied the head of the school to allow student input into the hiring decision (a diplomatic means of getting my way). When Philipp arrived at the university he became the center of gravity for the machine translation group and allowed us to form a coherent whole. He has been a wonderful collaborator and I value the time that I had to work with him.

...