Machine translation: your experience with the various MT programmes? ("state of play")
Thread poster: Barnaby Capel-Dunn
Jeff Allen  (Identity Verified)
France
Local time: 02:33
Multiple languages
+ ...
MT for non-Major(ity) languages Nov 24, 2008

Roald Toskedal wrote:
Seems to me that most of you are referring to translation between the large World Languages, English, French, Spanish, and German in this thread, but what about the smaller languages? From my own region of the world, I can name Swedish, Danish, Norwegian, Sami, Finnish, and Finnish-Swedish, and I say that we're not facing any imminent 'danger' from MT in our neck of the woods...

I mean, the market and possible ROI is just too small for anybody to invest in creating MT for these languages, together with some thousands of smaller languages on this planet.

We've seen it with several efforts on developing Speech Recognition in Norwegian. Over the last 10 years, several enterprises have broken their back on such an effort, leaving us with no Speech Recognition applications for Norwegian today.

As I'm sure most of you know, the work and investment required to create MT for a language is not dependent on the size of the language, nor the market prospect, so Norwegian MT will be just as costly as that for English or Spanish, or Chinese/Japanese for that matter.


Roald,

I'd like to address this point about Major vs Minor(ity) languages (also referred to as less-prevalent, less(er) widely used, less frequently used, sparse-data languages).

MT for minority languages:

I spent several years working on MT systems for such less-prevalent languages.
A list of publications on this specific topic appears at the bottom of this post. Some of them are no longer online because the Geocities website where they were posted recently went defunct. I'll try to fix that problem in the near future and get them back online.

If all existing parallel content for each minority language (paired with a corresponding target language) could be collected, aligned, cleaned up, and used to train an SBMT system, the results could be quite serious. How feasible is this?

The following papers in the list provided below show how long it takes to get a functional system up and running on minority languages:

* The paper by Lenzo, Hogan and Allen presents the timeline and resources needed to put such resources together to build a system.
* The paper by Allen and Hogan at SPCL98 gives the results of collecting all existing translations, plus translating/editing an additional 14,000 sentences in one year to create a database for that language.
* The paper by Allen and Hogan at LREC98 cited in the posts above shows the ramp-up that can be done on 2 minority languages, and the increase which can be obtained in a year.
* The 2 papers by Eskenazi, Hogan, Frederking and Allen also show the timelines for getting such systems into place. These were rapid-deployment systems for a series of different languages (Serbo-Croatian, Haitian Creole, Korean, etc.) and the same method was used for all of them.

Funding for such systems:

The monetary figure stated by a few MT companies for funding a new language from scratch, mentioned at a round table on MT for minority languages at AMTA1998, is the same figure I have heard quoted in the past 2 years for the same need. The financial investment is closely tied to the size of the language and its market share, because revenue from MT software license sales must not simply reimburse the investment but also generate a profit in order to fund more work and keep the company alive.
All of the rule-based MT companies will likely state such a figure, because the effort goes into work on grammatical rules and dictionary coding.

On the other hand, the statistics-based MT method is independent of the language, and the two commercial SBMT companies have been putting a significant amount of work into minority languages over the past year. Asia Online keeps indicating that it is working on a 1-billion-word project into Asian languages using its own technology; once that content is complete and the final retrain of the system is done, this could make a major dent in the application of MT for such languages.

Scandinavian languages:

MT for Scandinavian languages does exist. To my knowledge, Scania has been using an MT system with their Scania Swedish writing system for over a decade.

Some of the MT systems for these languages started as projects funded in part by European Commission Human Language Technology programmes. After the funding period, however, software sales need to cover the rest of development and maintenance.

Systran does have an English-Swedish MT system, but it is one of its most recently developed language pairs. Hiring resources for grammar and dictionary work likely depends on the history and forecast of software sales, which determine whether the market justifies further investment.

I recall reading somewhere recently that PROMT is now working on a Scandinavian language version.

Some papers on minority language translation systems:

ALLEN, Jeff. 1998. Lexical variation in Haitian Creole and orthographic issues for Machine Translation (MT) and Optical Character Recognition (OCR) applications. Paper presented at the First Workshop on Embedded Machine Translation systems of the Association for Machine Translation in the Americas (AMTA) conference, Philadelphia, 28 October 1998.
http://www.geocities.com/jeffallenpubs/amta98-allen-final.htm

ALLEN, Jeff and Christopher HOGAN. 1998. Evaluating Haitian Creole Orthographies from a non-literacy-based Perspective. Presented at the Meeting of the Society for Pidgin and Creole Linguistics (SPCL) at the Meeting of the Linguistic Society of America (LSA), New York City, New York, January 1998.
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/chogan/Web/Publications.html

ESKENAZI, Maxine, Christopher HOGAN, Jeff ALLEN, and Robert FREDERKING. 1997. Issues in database creation: recording new populations, faster and better labelling. In Proceedings of Eurospeech97. Vol. 4: 1699-1702. Conference held in Rhodes, Greece, 22-25 September 1997.

ESKENAZI, Maxine, Christopher HOGAN, Jeff ALLEN, and Robert FREDERKING. 1998. Issues in database design: recording and processing speech from new populations. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC98), 28-30 May 1998, Granada, Spain. Vol. 2, pp. 1289-1293.

HOGAN, Christopher and Jeff ALLEN. 1999. Phonemic and Orthographic realizations of 'r' and 'w' in Haitian Creole. Paper presented at the International Conference of the Phonetic Sciences (ICPhS 99), San Francisco, 1-7 August 1999.

LENZO, Kevin, Christopher HOGAN, and Jeff ALLEN. 1998. Rapid-Deployment Text-to-Speech in the DIPLOMAT System. Paper presented at the 6th International Conference on Spoken Language Processing (ICSLP98). 30 November - 4 December 1998, Sydney, Australia.
http://www.shlrc.mq.edu.au/proceedings/icslp98/PDF/AUTHOR/SL980868.PDF

MASON, Marilyn and Jeff ALLEN. 2001. Standardized Spelling as a Localization Issue. In Multilingual Computing and Technology magazine. Number 41, Vol. 12, Issue 5. July/August 2001. Pp. 37-40.
http://www.multilingual.com/articleDetail.php?id=589

MASON, Marilyn and Jeff ALLEN. 2001. Closing the Digital Divide: Issues in expanding localization efforts to minority languages. In LISA Newsletter, Volume X, No. 2, April 2001 (pp. 23-24, 32).

MASON, Marilyn and Jeff ALLEN. 2001. Is there a Universal Creole for localization efforts? In LISA Newsletter, Volume X, No. 3, August 2001 (pp. 39-42).

MASON, Marilyn and Jeff ALLEN. 2002. Intra-textual Inconsistency: Risks of Implementing Orthographies for Less-Prevalent Languages. In Localization Industry Standards Association (LISA) Newsletter: Globalization Insider, Volume XI, No. 1.3, February 15, 2002, pp 1-5.

MASON, Marilyn and Jeff ALLEN. 2003. Computing in Creole Languages. In Multilingual Computing and Technology magazine. Number 53, Vol. 14, Issue 1. January/February 2003. Pp. 24-32.
http://www.multilingual.com/articleDetail.php?id=625



The statistics-based MT approach, combined with example-based MT (translation memory, in essence) and large amounts of training data, would have a big impact on how minority languages could be handled. The key issue is the availability of parallel language data.

Jeff


 
Jeff Allen  (Identity Verified)
France
Local time: 02:33
Multiple languages
+ ...
the mixing of TM and MT technologies Nov 27, 2008

As promised earlier in this thread, here are comments on the following item:
4) the mixing of TM and MT technologies


* Commercial MT (Rule-based) + TM

- Internal TM modules: At least a couple of MT vendors (Systran, Promt) have created their own internal TM modules to pre-process content before the rest is sent through MT processing. Yet those internal TM modules have constraints compared with other TM tools.

- TM-MT vendor partnerships: The partnerships have been few between MT (rule-based systems, the majority of commercial MT vendors, like Systran, Promt, LEC, etc) and TM vendors.
For example, Promt also offers a plug-in for Trados TMs with their Expert level version.
I've heard of other plug-in tools but don't know if they have been implemented anywhere.


- TM + online access to MT: A couple of TM tools (Wordfast, Alchemy) offer access to online MT tools within their menus, in order to preprocess text by MT and then edit and save the result in the TM. The disadvantage I see is that in that context there is no way to customize the MT system with existing translated terminology lists/glossaries, in order to reduce the mistranslations of multiword terms that rule-based MT systems have difficulty with.

- MT-like features in TM tools: Atril's Deja Vu has its assemble feature (an example-based MT feature, basically the same as using the TM as a data source for MT). That has been around for several versions.

- End-user orgs: various corporate end-users of MT systems have been creating bridges between MT and TM applications over the years, in order to process their own TMs for use with MT. Symantec created its own bridge tool, and I've heard of others that have experimented with the same approach.

The compatibility between MT and TM tools is a key point. For those who want to venture into combining these tools in a workflow, ask very specific questions about compatibility, and ask to see it work in a demo.


* Statistical MT + TM

As indicated in my post above, statistics-based MT (SBMT) produces better quality when it is trained on existing TMs.
This is why the SBMT vendors have been getting so much participation from large corporations and large translation agencies at workshops on translation automation over the past 6 months or so.

SBMT systems thrive on lots of data, and lots of TM content.
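The intuition behind training on TM content can be sketched with a toy co-occurrence count. This is purely illustrative (the three-segment "TM" below is my own made-up data, and real SMT systems use far more sophisticated alignment models than raw co-occurrence):

```python
from collections import Counter
from itertools import product

# Tiny parallel "TM": English/French segment pairs (toy data, my own examples).
pairs = [
    ("the red car", "la voiture rouge"),
    ("the red door", "la porte rouge"),
    ("a red apple", "une pomme rouge"),
]

# Count how often each (source word, target word) pair co-occurs in a segment.
cooc = Counter()
for src, tgt in pairs:
    for s, t in product(src.split(), tgt.split()):
        cooc[(s, t)] += 1

def best_translation(word):
    """Pick the target word that co-occurs most often with the source word."""
    candidates = {t: n for (s, t), n in cooc.items() if s == word}
    return max(candidates, key=candidates.get)

print(best_translation("red"))  # "rouge" co-occurs with "red" in all 3 segments
```

With more parallel data, these counts separate real translation pairs from coincidental ones, which is why volume of TM content matters so much here.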


* convergence of TM and MT approaches

In general, there is an ever-increasing convergence of TM and MT approaches, yet a few of the challenges are the following:
1) how to distinguish between pre-processed TM and pre-processed MT if your customized workflow makes both appear in the same displayed set of content;
2) at what threshold to turn off the TM and turn on the MT, or inversely, at what threshold to stop accepting fuzzy TM matches and instead send the segment through the MT system;
3) how to merge terminology from terminology management tools into such workflows: not just how to use existing glossaries, but how to code them efficiently within an MT tool without overengineering the customized dictionary. This is an area with a significant lack of knowledge transfer, which ends up producing lower-quality MT output than what can be attained with really good dictionary-building techniques.


In the various types of set-ups mentioned above, there are a few general approaches:
1) use the TMs to pre-process the translation and then send the non-matched content through MT;
or
2) process with MT first, then clean up in a TM environment and save as TM;
and/or
3) the cycle I have proposed over the years: start with 2 and progressively move toward 1, while continuing to use 2;
4) take the statistical MT approach, which simply requires lots of content.



* who is really using these set-ups

- The large corporations are more mature now with existing TMs that they did not have 10 years ago, or even 5 years ago.

- Some large translation companies and agencies have shown interest and, based on what they have said at conferences, have done things along these lines.

- It's more challenging for the smaller agencies and freelancers whose TMs are likely smaller and more diversified than for the larger companies.

However, I do know some freelancers who have created their own small scale workflows to combine the different technologies. Based on their comments, these seem to vary in efficiency.

Jeff

[Edited at 2008-11-27 21:09 GMT]
Collapse


 
Jeff Allen  (Identity Verified)
France
Local time: 02:33
Multiple languages
+ ...
TM leveraging of 2-word verbs Nov 27, 2008

Jeff Allen wrote:
I'll give an example: the term "eat up", which I say to my kids probably 3-4 times a week at minimum. Most RBMT systems are restricted to recognizing "eat up" as in "eat up your food now". However, I say that phrase in different ways: "eat up your food", "eat your food up". An SBMT system is able to recognize all occurrences of these in a parallel TM database, and that "eat" and "up" occur X number of different ways, and Y number of times each, across the TM and within the new set of content to translate. It matches all of this up in parallel and proposes the statistically best translation from the analysis of the source and target segments. Again, this is not a percentage-based threshold setting but a fully statistical analysis.


I came across another example last night with my sons again

"put your toys away"
"put away your toys"

The varying surface forms of two-word verbs (phrasal verbs, which are all over the place in English) are a grammatical construction that TMs struggle with at the level of fuzzy match thresholds.

Now of course, I have cited some examples of phrasal verbs which come from everyday speech, but these are found in technical writing as well. This was one of the topics I covered in training courses (using sample text examples from content written by the trainees) to help technical writers standardize their texts in view of the content being translated with translation tools (TM and MT).

At least one of the rule-based MT tools is able to set parameters on word distance for these kinds of sentences/segments.

Statistics-based systems handle these types of linguistic constructions very well.
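The fuzzy-matching problem can be made concrete with a rough word-level similarity measure (a sketch only; each real TM tool uses its own proprietary scoring). Note how the score cannot tell a harmless particle movement from a genuine change of meaning:

```python
import difflib

def fuzzy(a, b):
    """Word-level similarity, roughly analogous to a TM fuzzy-match score."""
    return round(100 * difflib.SequenceMatcher(None, a.split(), b.split()).ratio())

# A content-word substitution (different meaning)...
print(fuzzy("put away your toys", "put away your books"))  # scores 75
# ...and a moved particle (identical meaning) score the same here.
print(fuzzy("put away your toys", "put your toys away"))   # also scores 75
```

Because the reordered phrasal verb drops below a typical fuzzy threshold just like a real meaning change would, the TM either rejects a perfectly reusable segment or surfaces it with a misleadingly low score.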

Jeff


 
Jeff Allen  (Identity Verified)
France
Local time: 02:33
Multiple languages
+ ...
why BabelFish might be falling behind Aug 30, 2009

ViktoriaG wrote:
I am still surprised at BabelFish - despite the many complaints about the quality of its translations, it doesn't seem to improve. Whether that is because they are already in their comfort zone or because they simply cannot improve what they already have, I find that BabelFish is not as good as some other MT engines.
...
To sum it up, I believe Google has lately been producing some pretty good translations. If I had to use MT again, I would start there. I would not touch BabelFish with a 39.5-foot pole. As for the rest, there are some good ones and some very disappointing ones, but I am not in a position to judge them, because I don't have enough MT experience otherwise.


Viktoria,
Babelfish was launched as the first online MT portal, powered by Systran, for AltaVista back in Dec 1997. It was the pioneer of the online MT portals. Recall, however, that Yahoo then took over AltaVista. Systran has had several major version releases of its MT software since that time. Much depends on which specific version of the Systran MT software BabelFish is using today, and how much ongoing dictionary and grammar work is being implemented for it.

There is now a large number of such MT portals, and everything depends on the current deal between Systran and Yahoo. Maybe there isn't so much investment because Babelfish might be getting fewer hits now than it did back when it kicked off with the Clinton-Starr report.

Google is also now using its own statistics-based MT engine, after it progressively phased out the Systran system in favor of its own in-house system. And with the massive website content that Google indexes, it's not surprising that they are making leaps and bounds in statistical MT.

It is important to note that Systran just announced, in June of this year (2009), its new entry into statistical MT with a hybrid of its rule-based system and the statistical approach.

Jeff


 
Shouguang Cao  (Identity Verified)
China
Local time: 09:33
English to Chinese
+ ...
Dictionary is machine translation! Oct 26, 2009

In my opinion, machine translation may not be a perfect tool for end users, but it can be truly a friend of translators. In fact, if you are checking a word with your on-line bilingual dictionary, you ARE using machine translation!

Machine translation can seriously save time on typing, especially when translating things like country names and common chemical names. I made a little program that integrates Google Translate into Word and I was pretty excited about it. Select a source phrase, hit a keyboard shortcut, and it will be replaced automatically with its translation.
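The select-and-replace idea described above can be sketched in a few lines. Here the translate() function is a mock dictionary lookup standing in for a call to an online MT service (no particular provider API is assumed), and the sample phrases are my own:

```python
# Mock MT lookup standing in for an online translation service.
MOCK_MT = {
    "United Kingdom": "Royaume-Uni",
    "sodium chloride": "chlorure de sodium",
}

def translate(phrase):
    """Return the mock machine translation, falling back to the source text."""
    return MOCK_MT.get(phrase, phrase)

def replace_selection(document, selection):
    """Replace the first occurrence of the selected source phrase
    with its machine translation, as a keyboard shortcut would."""
    return document.replace(selection, translate(selection), 1)

doc = "Shipments to the United Kingdom must list sodium chloride content."
doc = replace_selection(doc, "United Kingdom")
doc = replace_selection(doc, "sodium chloride")
print(doc)
```

A real Word integration would wire replace_selection to the current selection and a hotkey, but the flow (grab selection, translate, substitute in place) is the same.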


 
gianfranco  (Identity Verified)
Brazil
Local time: 22:33
Member (2001)
English to Italian
+ ...
I disagree Oct 26, 2009

Dallas Cao wrote:
In my opinion, machine translation may not be a perfect tool for end users, but it can be truly a friend of translators. In fact, if you are checking a word with your on-line bilingual dictionary, you ARE using machine translation!

No. When we consult a dictionary, whether it is on paper, on CD or on-line, we are mentally processing various alternatives, selecting what we consider the best choice amongst several options, or even discarding them all and taking another route (rephrasing, etc.). This is not the same as using MT software; it is rather the opposite.

Dallas Cao wrote:
Machine translation can seriously save time on typing, especially when translating things like country names and common chemical names.

This is what a good terminology database does, or should do, considering the pathetic state of the current terminology tools.

Dallas Cao wrote:
I made a little program that integrates Google Translate into Word and I was pretty excited about it. Select a source phrase, hit a keyboard shortcut, and it will be replaced automatically with its translation.

If saving some time in typing is what you are after, this is a good idea. But for a fast typist, or for users of speech recognition tools, once you factor in the time needed to read the rough translation carefully and then edit or rewrite it, the time saved could be marginal or indeed nil.

bye
Gianfranco



[Edited at 2009-10-26 13:16 GMT]


 
Jeff Allen  (Identity Verified)
France
Local time: 02:33
Multiple languages
+ ...
Gianfranco is correct that online dictionaries do not equal MT Nov 22, 2009

Dallas Cao wrote:
In my opinion, machine translation may not be a perfect tool for end users, but it can be truly a friend of translators. In fact, if you are checking a word with your on-line bilingual dictionary, you ARE using machine translation!


gianfranco wrote:
No. When we consult a dictionary, whether it is on paper, on CD or on-line, we are mentally processing various alternatives, selecting what we consider the best choice amongst several options, or even discarding them all and taking another route (rephrasing, etc...). This is not the same as using a MT software, it is rather the opposite.


Gianfranco: you are very correct here.

Dallas: using an online dictionary is not using MT.

Back at the AMTA98 conference, there was a panel of MT vendors present who were frustrated that electronic dictionary producers had recently been claiming to be MT tool producers.
A committee (on which I participated) was formed to define the differences between the different types of translation-related systems, with the intention of creating an "MT label" that could only be put on a product if it met the criteria.

A significant amount of time was devoted to this, and the results eventually made their way into the descriptions in the Compendium of Translation Software (which was originally only for MT software).

There was simply no organization that was willing and ready to push for the testing of all translation systems and to qualify them for an MT label.

Dictionaries are simply a necessary component of MT software/systems.

Thanks Gianfranco for your valuable comments on this topic.

Jeff


 
Kirti Vashee  (Identity Verified)
United States
Local time: 18:33
Understanding the basic rules vs statistical approach Jan 16, 2010

The web has many examples of small tests done by people to evaluate the various MT engines. If you try this on a regular basis, you will notice that Google continues to get better, whereas the Babelfish / SDL / Systran engines are barely different from what they were 5 years ago.

I think we will continue to see that Google outpaces the others because it is Google and because it is very likely that SMT will outpace the Rules based technology. Though Jeff Allen may disagree with me on this.

It may also be useful for some to take a look at an article that I wrote for the ATA-LTD giving an overview of RbMT and SMT. The article can be found at http://www.ata-divisions.org/LTD/documents/newsletter/2008-4_LTDnewsletter.pdf It provides a basic overview of the two approaches, but we are at a point where we are seeing both camps trying to build hybrids.

I am involved with a Japanese MT engine that has over 100 linguistic rules and attempts to parse source segments into parts of speech before processing in the statistical database.

There is enough momentum behind these initiatives that we can expect them to all improve.


 
Kirti Vashee  (Identity Verified)
United States
Local time: 18:33
What matters when evaluating MT engines? Jan 16, 2010

It is important to understand that while off-the-shelf MT does exist, the best quality and most effective MT systems are developed by people who understand that MT is a technology that needs to be tuned for specific purposes to be most useful.

Having said that, there are several companies that do provide “free” MT for general online consumers. These systems are trying to be all things to all people and thus are probably less than stellar in most domains; most are optimized for the ‘news’ domain. From the other comments in the forum, it seems that Google is often perceived to be the best general baseline MT. I have seen many favorable comments on the new Microsoft Live (SMT) systems, which have some new features like a very cool cross-lingual chat capability. My own tests on ‘PC software domain’ material show that the Microsoft systems are significantly superior to the Google systems for this specific domain. So the results will depend a lot on what you are translating.

Systran and Babelfish have been around for many years, and millions of people think they are good enough to use on a daily basis, so for casual web browsing this may be sufficient. Most of these people are not likely to be translators; they just want access to information in another language. Both the Microsoft and the Google SMT systems continue to get better regularly and, for most people, are easily superior to the RbMT solutions (Systran, ProMT, SDL, Worldlingo, Babelfish), especially as one strays outside of the FIGS/Russian region. The NIST evaluations have documented that Google and Microsoft have the best Arabic and Chinese baseline systems in the market.

Today, the easiest one for an individual consumer to customize to very unique and specific needs is Systran. The desktop version allows simple dictionary creation so that you could for example make the system perform better when you want to translate material that is a little off the beaten path e.g. sailboat manufacture or shoes.

The perspective that I find most interesting is to consider the systems that provide the greatest potential for enterprise users. The criteria that matter most to enterprise customers are:
1. Ease of customization to unique and specific enterprise domains,
2. Relative Quality (need to produce better quality than the ‘free MT’ systems at a minimum),
3. Ability to rapidly and continually improve the quality to levels that enhance and facilitate global business initiatives,
4. Language Coverage e.g. BRIC languages are gaining in importance but German & Japanese continue to be key,
5. Integration with business process and systems infrastructure,
6. Scalability (ability to handle 10 users or hundreds of thousands)


From my perspective, there are a much smaller set of alternatives available when you set the question up this way. Perhaps the only real contenders from this viewpoint are Asia Online, Systran, ProMT and Language Weaver. SDL, to my mind, is actually a front to sell services and is not a real contender. My bias is clear.

The Localization community wants MT to be as close as possible to human draft quality at least, if not final professional human translation quality. We are still a long way from that, but a comprehensive translation platform gives the enterprise a fighting chance at producing compelling and accurate, if not quite human-quality, output. What is needed is a process-focused platform: a highly structured man-machine collaboration environment whose primary objective is to accelerate and enable massive amounts of content to be converted at close to human quality into many new languages. MT cannot replace humans (if you want human quality or close to it), but it can empower the enterprise's ability to communicate with its global customer base, and using linguistically skilled humans is key to getting the best quality.

The best way to find out what the truth is beyond the sales claims, is to engage these vendors in a pilot project to get your hands on the technology and see firsthand what is possible. See it work. Understand what it takes to customize the MT system, to get it to a point where it is clearly better than Google and understand what it takes to maintain and improve this system on a regular basis and then integrate it into your other business systems that touch the global customer.

Getting back to the original question, the “best” system that I have seen in the enterprise market is the PAHO (Pan American Health Organization) RbMT system. While it is limited to Spanish and Portuguese, it is a great example of how to do MT right. It flows from source content cleanup and preparation to automated and rapid post-editing, with continuous feedback cycles. It is a must-use tool for all the translators who work at PAHO and is worth more than a casual look if you want to understand how to use MT successfully.

To my mind, as the PAHO system illustrates, the best MT systems will always get the respect of real translators, professional and amateur, and they will only be interested to use them because they understand that they can work faster, more efficiently and more effectively using the technology.

Also, real translators are the most competent people to judge what is good and what is not on issues related to translation. Until MT vendors are willing to submit to these judges and get an approval and even an endorsement from them, the MT market will stumble along in the doldrums as it has for the last 50 years.



[Edited at 2010-01-17 17:54 GMT]


 
Kirti Vashee  (Identity Verified)
United States
Local time: 18:33
An evaluation of MT engines Jan 16, 2010

Here is another opinion on what the best MT is. We should all pay attention to the "evaluation methodology", if there is one. As we collect and gather these different opinions, we can get to a point where we have a much better sense of what is likely to work best for us.

Again, these are only baseline comparisons and do not represent the best that is possible with either the SMT or the RbMT technology.

http://www.transrio.com.ar/en-ingles/archives/153

The described methodology is:

"The testing I did was for my own purposes, to see which MT tool I wanted to use for day to day chores like cranking out e-mails. I used a passage from this blog, about 500 words, and carefully compared the output for readability and usability. Here’s how I ranked them"

1. Google
2. Babylon LW
3. Microsoft
4. RbMT


 
Kirti Vashee  (Identity Verified)
United States
Local time: 18:33
Comparative Evaluation of 3 MT Systems Jan 20, 2010

This is an excerpt from a blog article that may also help you understand ways to evaluate MT options. Again this clearly points to the fact that without human corrections and cleanup, MT is often barely usable.

http://www.altalang.com/beyond-words/2010/01/19/evaluating-machine-translationthe-present-and-future-of-multilingual-search/

A recent study conducted by researchers at The University of Granada’s School of Translation and Interpretation attempts to analyze and evaluate the results of machine translations done with popular online tools such as Google Translator, Promt, and WorldLingo. The study was published in this month’s issue of Translation Journal, and it raised interesting questions for me about the possible uses for online machine translation.

Looking at the findings, it should come as no surprise that all of the machine translation tools produced poor results in terms of the number of errors, or that after the translations passed through a round of human editing, the number of errors were drastically reduced. What is interesting, though, is that certain online tools performed better than others, and specific language combinations produced varying results. The graph below shows results from German into Spanish (the researchers used EvalTrans Software). The best translation machine is the one showing the lowest word error percentage (WER). Check out the study for more charts and an explanation of the sentence error rate (SER).
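The WER figure cited in that study is just word-level edit distance divided by the number of reference words. As a rough illustration of the metric (a minimal sketch, not the EvalTrans implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1,          # deletion
                                       row[j - 1] + 1,      # insertion
                                       prev + (r != h))     # substitution
    return row[len(hyp)] / len(ref)

# 2 errors (one substitution, one deletion) over 6 reference words:
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

A lower WER means the raw MT output needed fewer word-level edits to match the human reference, which is why the "best" engine in the study is the one with the lowest percentage.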


 
Jeff Allen
Jeff Allen  Identity Verified
France
Local time: 02:33
Multiple languages
+ ...
I think rule-based and stat based MT can be complementary Jan 21, 2010

Kirti Vashee wrote:

I think we will continue to see that Google outpaces the others because it is Google and because it is very likely that SMT will outpace the Rules based technology. Though Jeff Allen may disagree with me on this.


Hi Kirti,
Not really; I don't entirely disagree. Rule-based MT has its place, and SMT has its place. Each can do things that the other cannot, depending on the context and many variables. And I see that it is possible to start from scratch with an RBMT system, in combination with TMs (i.e., EBMT) along the way, and then, after developing a big enough set of content, to migrate to an SMT system.
Commercial RBMT software can be a low-price investment, but it does require extra training in how to use it. Without training, the risks are high.

Kirti Vashee wrote:
It may be useful for some also to take a look at an article that I wrote overviewing RbMT and SMT for the ATA-LTD. This article can be found at http://www.ata-divisions.org/LTD/documents/newsletter/2008-4_LTDnewsletter.pdf This provides a basic overview of the two approaches but we are at a point where we are seeing both camps trying build hybrids.


My only comment is about your article, which I read a couple of months ago but didn't get to comment on: there is a different level of knowledge needed for the internal RBMT dictionaries vs. the user dictionaries. The internal dictionaries always require programming knowledge. However, several software systems have dictionary-creation modules that don't require programming skills. Some of them guide the user through the interface to make the appropriate linguistic encoding choices.

Jeff


 
Kirti Vashee
Kirti Vashee  Identity Verified
United States
Local time: 18:33
Detailed Overview of Asia Online SMT Feb 4, 2010

This is a link to a detailed video describing why MT makes sense for professional translation projects; it is much more of an overview of SMT and how it works than a sales pitch. It is much less technical than the overviews you will see at NLP conferences.

The webinar was well attended and the graphics received very good feedback, as they show in very simple ways how things work and why humans will always be needed wherever translation quality matters.

http://languagestudio.com/Webinars.aspx


 
Susan Welsh
Susan Welsh  Identity Verified
United States
Local time: 21:33
Russian to English
+ ...
Promt Feb 22, 2010

I am trying out Promt Professional (a rule-based system), and I must say I am shocked at how bad it is, compared to Google Translate (a statistical system).
Going from German to English, it was nothing short of disastrous. A few examples:

plural of bonds: bondses
4.3 milliard pounds
the NPL (needy credits) - instead of non-payable loans
"one must be more dealted with over Spain much, than over Greece because his debts are..."
"Thatcher, centre edge, and Bush" - that Mr. Centre Edge is none other than Francois Mitterrand!
Budget cuts are ... "a massive incision in the quality of life of the citizens."

Now I understand that you can edit this junk to teach the program how you want things done, what is not supposed to be translated (like names), etc. But editing one paragraph of this would take probably half an hour! And of course the syntax is all scrambled, but that's true of Google Trans also. I don't know how, or whether, you can train it to improve the syntax.

I have tried Promt briefly with Russian-English (which one would expect to work better, since Promt is of Russian origin). So far I found it better than the above, but I haven't done enough with it to be sure.


I was planning on trying SYSTRAN, but they don't offer a free trial version. They have one of these "deals" where you pay and then if you don't like it after 30 days, or whatever, then you get your money back. I find that marketing technique really offensive, for a $1,000 or so piece of software, and I think I'll just skip it.

Susan


 
Jeff Allen
Jeff Allen  Identity Verified
France
Local time: 02:33
Multiple languages
+ ...
PROMT working great for me on several types of text and languages Feb 22, 2010

Hi Susan,

I've just installed PROMT v8 Expert and SYSTRAN v6 Professional desktop versions this past week.

And I've been using them for all kinds of email correspondence at work (DE-EN), and they are working well.

I am also using them for the translation of the caption/subtitles of my 2 video seminars on Haitian Creole language technologies.
Results: 1.5 hours of time creating 1st round of user dictionary on FR-EN

French text: 6396 words
English output (using translation software with custom dictionary applied): 6251 words

French transcript took 10-12 hours to create main draft
Additional cleanup, merging of files, and other processing issues: 2-3 hours

FR-EN Translation part:
custom dictionary of 55 entries took 1.5 hours to create, including source content analysis, term identification, dictionary creation, validating translation using new dictionary entries, and troubleshooting several issues.

Two English native speakers have read the draft and stated in my Facebook entry:

"I have skimmed through the English text and while it is clunky and reads very much like French in terms of syntax, it is understandable."

and

"That English translation isn't pretty, Jeff, but someone with exposure to the Romance languages (word order, styles of idioms) should be able to follow it.

Juxtaposition of SEMEN and XXX is a bit jarring. Might want to add (Fr. semaine = week) to the text before it goes into the titling mill.

The MT passed some raw French through in a few places. And better transcription of the questions (or just guesswork in square brackets) would make it flow better, otherwise the answers just seem to be Brownian motion. "

Note: this was a draft transcription of extemporaneous speech, with lots of holes annotated by the transcribers as XXXX. There are also Haitian Creole example words that I'm tagging as Do-Not-Translate items, but I need to add another special tag so that they keep their linguistic role in the text.

So now I've added 1.25 hours of time to clean up the FR text, strip out the problems, and have maybe another 30-60 min to do more of it. That should clean up about 85% of the problems. And I'm versioning each draft, so that it is easy to track the changes made, and the results of the time investment.

And then I plan to add an additional 30-60 min on fine-tuned dictionary building.

So in the end, if I complete everything in 5-6 hours of time for a 6300-word document of this type of text, with no previous translated material, this will be a great benchmark for showing how fast it can be done.

Then I need to have a few people post-edit the EN version (I could of course do that myself, but I'm on a ton of other critical-path actions for Haitian Creole language content right now, so I'm trying to delegate as much of this as possible).

And with a good draft EN version, I can then take the translation in another direction, for example EN-DE.

This is a great example of using a pivot language for MT. It's a very difficult type of textual content to deal with (non-scripted speech).

I can create an EN source dictionary-entry template which can then be used for any target language when creating custom MT dictionaries. And if it is only 50-100 entries, then it will be great proof of how fast all of this can be done, and of the quality that is achieved per language on the original source text and the pivot-language source text.
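The pivot workflow described above reduces to two chained translation legs. A minimal sketch, where `to_pivot` and `from_pivot` stand in for whatever engine does each leg (they are placeholders, not a real PROMT or SYSTRAN API):

```python
from typing import Callable

def pivot_translate(text: str,
                    to_pivot: Callable[[str], str],
                    from_pivot: Callable[[str], str]) -> str:
    """Chain two MT legs through a pivot language (e.g. FR -> EN -> DE)."""
    draft = to_pivot(text)    # first leg: source -> pivot
    return from_pivot(draft)  # second leg: pivot -> target

# Toy stand-ins for real engines, just to show the chaining:
fr_to_en = lambda s: s.replace("bonjour", "hello")
en_to_de = lambda s: s.replace("hello", "hallo")
print(pivot_translate("bonjour", fr_to_en, en_to_de))  # hallo
```

Because errors compound across the two legs, post-editing the pivot output before launching the second leg matters more than in a direct-pair setup.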


Susan, it's not the editing that counts; it's the analysis of the text and the careful choice of the right items to enter as dictionary entries.
Since I'm doing this in a phased approach, and am documenting everything, it is all traceable and can show the quality attained at each step.
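The text-analysis step Jeff describes, picking the right items to encode before translating, can be prioritized with simple frequency counts over the source text. A hypothetical helper (not part of any MT product):

```python
import re
from collections import Counter

def candidate_terms(text: str, top_n: int = 20) -> list[tuple[str, int]]:
    """Rank source-text words by frequency as candidate custom-dictionary entries.

    A frequent domain term fixed once in the dictionary improves every sentence
    it occurs in, so high-frequency terms repay the encoding effort first.
    """
    words = re.findall(r"[a-zà-ÿ'-]+", text.lower())
    return Counter(words).most_common(top_n)
```

The ranked list is only a starting point: a human still reviews it and encodes only genuine terminology, names, and Do-Not-Translate items.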

That's how successful MT projects have been done. Unfortunately, many have not been done this way, and the results were pitiful. But when it is done right, with the right expertise, under time-constrained circumstances, on a real production project, then we see the real value of how the tool can be helpful.

As for grammatical/syntax issues, yes, there are ways to handle them; you just need to know how to customize the system. It's all about being trained.

Do you want to help me on this?

And get some free knowledge transfer about using PROMT for it?

In fact, for anyone out there who is willing to help on this volunteer video caption translation project to help Haiti and the Haitian Creole languages, I will give you some free training on using MT systems to do it.
You will need to follow my exact instructions and log/record a few things that you do, in order to have the information needed to measure productivity (it's not that time-consuming, and I've got an Excel sheet template for it).
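The productivity measurement amounts to dividing the document word count by the hours logged per phase. A tiny sketch, with hours loosely based on the figures in this thread (the phase names and the "tuning" figure are made up for illustration):

```python
def words_per_hour(word_count: int, phase_hours: dict[str, float]) -> float:
    """Overall throughput: document word count / total hours across all phases."""
    return word_count / sum(phase_hours.values())

# 6396-word source, 1.5 h dictionary work, 1.25 h cleanup, assumed 1 h tuning:
rate = words_per_hour(6396, {"dictionary": 1.5, "cleanup": 1.25, "tuning": 1.0})
print(round(rate))  # words per hour across the whole workflow
```

Versioning each draft alongside the log, as described above, is what lets the time investment be compared against the quality gained at each step.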

Let's stick to PROMT right now since they have the 30 day trial for it.

Any takers?

Jeff


Susan Welsh wrote:

I am trying out Promt Professional (a rule-based system), and I must say I am shocked at how bad it is, compared to Google Translate (a statistical system).
Going from German to English, it was nothing short of disastrous. A few examples:

plural of bonds: bondses
4.3 milliard pounds
the NPL (needy credits) - instead of non-payable loans
"one must be more dealted with over Spain much, than over Greece because his debts are..."
"Thatcher, centre edge, and Bush" - that Mr. Centre Edge is none other than Francois Mitterrand!
Budget cuts are ... "a massive incision in the quality of life of the citizens."

Now I understand that you can edit this junk to teach the program how you want things done, what is not supposed to be translated (like names), etc. But editing one paragraph of this would take probably half an hour! And of course the syntax is all scrambled, but that's true of Google Trans also. I don't know how, or whether, you can train it to improve the syntax.

I have tried Promt briefly with Russian-English (which one would expect to work better, since Promt is of Russian origin). So far I found it better than the above, but I haven't done enough with it to be sure.


I was planning on trying SYSTRAN, but they don't offer a free trial version. They have one of these "deals" where you pay and then if you don't like it after 30 days, or whatever, then you get your money back. I find that marketing technique really offensive, for a $1,000 or so piece of software, and I think I'll just skip it.

Susan



 