Slate Desktop: your personal MT engine
Thread poster: Mohamed

Tom Hoar
United States
Local time: 03:35
English
Re: I took a part of MTM13 Oct 1, 2015

Yes, these are popular approaches in MT academia that focus on the next hurdle to overcome and gloss over the successes. After all, without hurdles, there's nothing to justify the next research grant.

Your papers also describe working with "online MT" and the university's Moses system that was trained with a publicly available corpus. These are not the same as a translator working with an engine made from his/her private inventory of TMs.

Regardless of the corpus, you can always pick individual sentence examples of errors. So in business, our customers have grown to rely on overall results across job batches.

Translators start with blind testing where they work on a batch without MT assistance. When they're done, they count the percent of MT segments that match their own work. This way, the translators don't experience bias from post-editing, and they learn that the engine can do a good job.
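In pseudocode terms, that blind-test scoring amounts to a simple segment-by-segment comparison. Here's a minimal Python sketch (the function name and sample data are invented for illustration, not anything from Slate Desktop):

```python
# Blind-test scoring sketch: compare a translator's unaided work against
# the MT output for the same source segments, counting exact matches.

def blind_test_match_rate(translator_segments, mt_segments):
    """Percent of MT segments identical to the translator's own work."""
    assert len(translator_segments) == len(mt_segments)
    matches = sum(1 for human, mt in zip(translator_segments, mt_segments)
                  if human.strip() == mt.strip())
    return 100.0 * matches / len(mt_segments)

human = ["The committee approved the report.", "Voting closed at noon."]
mt    = ["The committee approved the report.", "The vote ended at midday."]
print(blind_test_match_rate(human, mt))  # 50.0
```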

As I said in the video, it's common for batch-by-batch results to show 40% to 60% of segments identical to the blind tests. In a controlled language environment, the results jump much higher.

[Edited at 2015-10-01 14:35 GMT]


 

Bernhard Sulzer  Identity Verified
United States
Local time: 03:35
Member (2006)
English to German
+ ...
Question regarding high matching results of human translations Oct 1, 2015

tahoar wrote:

...

Translators start with blind testing where they work on a batch without MT assistance. When they're done, they count the percent of MT segments that match their own work. This way, the translators don't experience bias from post-editing, and they learn that the engine can do a good job.

As I said in the video, it's common for batch-by-batch results to show 40% to 60% of segments identical to the blind tests. In a controlled language environment, the results jump much higher.

[Edited at 2015-10-01 14:35 GMT]


You mean to say that in a controlled language environment, the results jump much higher?
I even doubt the 40-60% identical results in an uncontrolled environment.
What kind of control are you referring to? Almost identical segments from a previous translation?

Do you have an example? What type of text are you referring to? Lists? Correct matching of a human translator's complex sentences by a machine in the 40-60% range of what? A few sentences? Or exact match of 40-60% of the entire test text? Seems quite unbelievable.
Maybe you can clear some of this up. Thank you.

[Edited at 2015-10-02 00:37 GMT]


 

Tom Hoar
United States
Local time: 03:35
English
My conundrum... Oct 2, 2015

Bernhard Sulzer wrote:

... I even doubt the 40-60% identical results in an uncontrolled environment...
... Do you have an example?...

I understand your doubt and your request is fair. Seeing is believing, but I have a conundrum.

  • The real examples I have in my possession come from confidential data that a few clients entrusted to us. I won't share these.

  • I can share examples from public data with lower but still impressive (in my opinion) match rates. Of course, any examples directly from me become suspect as a marketing/sales manipulation.

  • I would prefer that my customers share their experience, but they face a different problem. They fear their customers might demand discounts if their results become public. Life's unfair!

  • Alternative: someone else shares his/her Moses experience. If you're still “listening” Patrick, can you share some details supporting your statement earlier in this thread?
    Patrick Porter wrote:

    ...But for a professional translator, an MT engine trained with your own translations can be a really powerful production tool...


In the absence of an independent party stepping up with examples, would you be okay with examples from public data?

Re “uncontrolled environment,” don't forget that a translator's personal choice of jobs is a powerful controlling force over his/her translation environment. My use of “controlled language environment” referred to authoring and translation environments such as “Simplified Technical English.”


 

Tom Hoar
United States
Local time: 03:35
English
Examples from European Parliament Debates corpus Oct 3, 2015

The European Parliament Debates corpus (Europarl v5 04/1996-10/2009 at statmt.org) is one of the most studied sets of data for creating engines. Slate and Slate Desktop ship with a small randomly-selected subset consisting of about 60,000 Dutch-English sentence pairs. This and other example data support our instruction tutorials.

I created a spreadsheet (click here) with 592 test examples followed by notes about how Slate Desktop creates the engine and other relevant analysis. Here are some numbers:

Final segment count in trained engine: ~43,000
Total test segments: 592
tests matching reference: 146
% match: 24.7%

Because of the small corpus size covering a variety of subjects (from terrorism to traffic laws to agriculture), the engine is missing vocabulary for 184 example segments. When adjusting for this, the percentage jumps by more than 10 points:

Test segments covered by engine vocabulary: 408
Tests matching reference: 146
% match: 35.8%
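For anyone who wants to check the arithmetic, the raw and adjusted rates come straight from these counts:

```python
# Reproducing the raw and vocabulary-adjusted match rates reported above.
total_tests = 592
matches = 146
missing_vocab = 184        # segments the engine had no vocabulary for

raw = 100.0 * matches / total_tests        # matches over all test segments
covered = total_tests - missing_vocab      # 408 segments with full coverage
adjusted = 100.0 * matches / covered       # matches over covered segments only

print(f"raw: {raw:.1f}%  adjusted: {adjusted:.1f}%")  # raw: 24.7%  adjusted: 35.8%
```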

If you study the 592 test segments, you'll find some reference translations are just wrong, but the engine was closer to right. Likewise, you'll find some where the Slate segments were wrong because the engine learned incorrectly from bad examples in the training data.

Over the 12 1/2 years the training data represents, the data changed styles. They include outright wrong translations from a time before anyone imagined statistical translation engines. In the early years, translations were not created within the discipline of CAT tools. So, source and target sentences were matched using automated scripts. A quick review shows many poor matches. Despite all these flaws, this set achieves 24.7% raw, and 35.8% adjusted scores.

Change the training data to your own TMs, which were aligned from their creation. Increase the size from 40,000 to 100,000+ segments to increase the vocabulary coverage (reduce missing-vocabulary errors). Add the quality control and consistency of one or very few translators' styles. Is jumping from 35.8% to my touted 40% to 60% unreasonable?



I will share the original 60,000 randomly selected pairs with anyone who wants to validate the results in the report.

[Edited at 2015-10-04 02:30 GMT]


 

Tom Hoar
United States
Local time: 03:35
English
TriKonf this weekend Oct 6, 2015

The TriKonf conference in Germany this weekend is a great opportunity to talk to your colleagues before your final decision about Slate Desktop. Several current customers and backers will attend. Some attendees have built their own Moses/Linux systems independently of Slate Desktop.

 

Richard Hill  Identity Verified
Mexico
Local time: 03:35
Member (2011)
Spanish to English
TM maintenance Oct 18, 2015

Hi Tom,

I’m thinking about doing some maintenance on my TMs to get them ready for when Slate comes out so wanted to check what would be worthwhile doing to get the most out of Slate. I mentioned maintenance in my previous post, but didn’t specify I was referring to pre-Slate maintenance.

One program to help with this, which I’ve just downloaded is Heartsome TMX Editor 8 https://github.com/heartsome/tmxeditor8 which seems like it may be useful. In particular I like the sound of the feature where you can delete duplicate entries and old inconsistent segments, keeping only the most recent, and also filter them so you can decide which ones to keep. I guess this could be really useful for Slate.
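The keep-most-recent de-duplication described there boils down to something like this Python sketch (the tuple layout and names are illustrative, not Heartsome's actual code; real TMX processing needs an XML parser):

```python
# De-duplication sketch: for each source segment, keep only the target
# with the most recent change date, discarding older inconsistent entries.

def dedupe_keep_latest(units):
    """units: list of (source, target, changedate) tuples; dates sort as strings."""
    latest = {}
    for source, target, changedate in units:
        if source not in latest or changedate > latest[source][1]:
            latest[source] = (target, changedate)
    return [(src, tgt) for src, (tgt, _) in latest.items()]

units = [
    ("hola", "hello", "20140101"),
    ("hola", "hi", "20150601"),        # newer entry wins
    ("adios", "goodbye", "20150101"),
]
print(dedupe_keep_latest(units))
```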

Another potential issue is that I have certain clients that ask for different translations of the same terms, meaning that there will be inconsistencies in my TM, so how can I make sure Slate only uses my preferred terms? For example, can I have Slate process a reliable termbase that has my preferred terms, and give priority to them over other translations of the same terms? If Slate can’t do this then will it randomly insert terms in the output file that have different possible translations or would it only use the most common versions?

If you could recommend any other pre-Slate TM maintenance tasks that would be useful for when they are processed with Slate that would be great, so I can do some work on them in my free time -as if I really had any:)- and get my TMs ready for January.

Lastly, how does Slate handle tags/codes? Would it help to batch delete all tags in the TM?

Rich


 

Tom Hoar
United States
Local time: 03:35
English
Re: TM maintenance Oct 19, 2015

Richard Hill wrote:

Hi Tom,

... maintenance on my TMs to get them ready for when Slate...

... how can I make sure Slate only uses my preferred terms?
... can I give priority to them over other translations of the same terms?
... will it randomly insert terms in the output file?

... recommended any other pre-Slate TM maintenance tasks...

... how does Slate handle tags/codes?
Would it help to batch delete all tags in the TM?

Rich


The most important task to prepare for Slate Desktop: curate (organize into categories) TMs in a way that makes sense for you. Forget about an idealistic taxonomy that "should be." Focus on categories that reflect your work habits. I find translators are pretty good at managing/organizing TMs this way. I'm surprised how often agencies overlook TM maintenance or, worse yet, corrupt TMs.
Sidebar: In our experience, localization engineers' most flagrant abusive practice is blindly splicing or concatenating TMs from multiple sources (translators/sub-agencies) into one master TM, instead of using tools that properly merge them. Agencies' pricing schemas don't discourage this practice because TMs with more segments are likely to yield more matches, and they pay you less. You know, throw enough mud at the wall...

Slate Desktop repairs abused TMs for use as SMT corpus. However, tools that accurately and automatically curate data are still a dream. There are some promising academic approaches, but none have proved reliable across a broad range of production tasks. We keep looking and experimenting and will include reliable ones in our future Slate Desktop upgrades.

There is a thread on another ProZ forum (but I can't find it now) discussing the pros/cons of using small TMs vs. merging them into one "mega" TM for use in your CAT. For use in SMT, mega-TMs lose category resolution, and segments can't be re-categorized after they're merged. So if anyone has done this, please find and use your original individual TMs. Slate Desktop will import them and maintain their identities throughout its preparation. In the final preparation stage, it merges them into one training corpus according to your engine configuration. This way you can remix the corpus for new engines.

Segue to preferred terms. You have many choices.

Option 1: set a baseline. Make one engine to support both clients and see how it performs. Why? SMT does not translate word-by-word (actually "tokens" but let's not split hairs here). SMT uses sentence fragments of up to a maximum length, typically 7 words. Think of these fragments as a concordance context window that slides across each sentence word-by-word. So, SMT actually translates by keyword in concordance context.

In practical terms, SMT might detect the source term within its concordance context and generate the proper client-specific term all on its own. Computers are really good at making these tedious detailed distinctions where humans overlook them. This is simple and you could find that the engine does a great job on its own.
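To make the sliding-window idea concrete, here is a rough sketch of how phrase-based SMT enumerates overlapping source fragments up to a maximum length (a simplification for illustration; a real decoder scores these fragments against a phrase table rather than just listing them):

```python
# Enumerate the overlapping source fragments ("concordance windows") that
# phrase-based SMT considers, up to a maximum phrase length (typically 7).

def phrase_windows(sentence, max_len=7):
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n])
            for i in range(len(tokens))
            for n in range(1, min(max_len, len(tokens) - i) + 1)]

windows = phrase_windows("the router drops the packet")
print(windows[:3])  # ['the', 'the router', 'the router drops']
```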

Option 2: Use the same engine as above and enable Slate Desktop's regex feature. It adds simple regex search/replace on the target language output. This replace approach assumes the overall sentence word order will be the same for both terms.
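As a sketch, a target-side regex pass might look like this (the term pair and function are invented examples, not Slate Desktop's actual API):

```python
import re

# Target-side term replacement sketch: after the engine produces its
# output, apply client-specific regex rules to swap preferred terms in.

def apply_term_rules(target_text, rules):
    for pattern, replacement in rules:
        target_text = re.sub(pattern, replacement, target_text)
    return target_text

rules = [(r"\bcell ?phone\b", "mobile phone")]  # this client prefers "mobile phone"
print(apply_term_rules("Charge the cellphone overnight.", rules))
# Charge the mobile phone overnight.
```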

Option 3: Use the same engine as above and enable Slate Desktop's forced-translation table (we don't have a cool brand name for this yet). With this approach, you create a table with your source/target terms (from your term base/glossary, etc). Slate Desktop forces translation of the term, then finds the most natural target language word order for the whole sentence ("natural" == probability).
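As an illustration only (the element name and table format here are invented, and Slate Desktop's actual mechanism may differ), Moses-style decoders can accept inline markup that pins a source span to a required translation, roughly like this:

```python
# Forced-translation sketch: wrap each known source term in markup that
# tells the decoder which target term it must produce; the decoder then
# picks the most probable word order for the rest of the sentence.

forced_terms = {"overeenkomst": "agreement"}  # invented Dutch-English pair

def mark_forced_terms(source, table):
    for src_term, tgt_term in table.items():
        source = source.replace(
            src_term, f'<term translation="{tgt_term}">{src_term}</term>')
    return source

print(mark_forced_terms("de overeenkomst is getekend", forced_terms))
# de <term translation="agreement">overeenkomst</term> is getekend
```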

Option 4: You could simply make different engines using only each customer's TMs. This approach can become cumbersome, and/or you might not have enough segments for client-specific engines.

I can think of 3 more options, but I think you get the picture... you have options. Remember, there's no cost to create more engines, and retraining the engine is relatively fast when you work with small focused corpora. This opens new use cases that are cost-prohibitive if you pay subscriptions to the cloud services.

RE: tags/codes handling

When creating engines, Slate Desktop strips all TMX and XLIFF inline elements from source and target text. It strips all HTML and RTF formatting markup. If you have segments with other kinds of formatting markup, you should remove them. You can do that with your tools or with Regex in Slate Desktop.

Slate Desktop keeps other placeholder tags, e.g. {1}. They become just another "word" (i.e. token) and SMT "learns" where to place them. I recommend you leave these in with your processing.
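A rough sketch of that stripping behavior (not Slate Desktop's actual code): remove inline markup, but leave {1}-style placeholders in place as ordinary tokens:

```python
import re

# Strip inline markup tags from a segment while preserving bare
# {1}-style placeholders, which SMT treats as just another "word".

TAG = re.compile(r"<[^>]+>")  # any markup tag

def strip_inline_tags(segment):
    return re.sub(r"\s+", " ", TAG.sub("", segment)).strip()

seg = 'Press <ph id="1"/><b>OK</b> to insert {1}.'
print(strip_inline_tags(seg))  # Press OK to insert {1}.
```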

During production, Slate Desktop strips all TMX/XLIFF inline elements and HTML/RTF markup from the source language text. All other tags stay in the source for the engine to "translate." Handling of TMX vs. XLIFF diverges during translation. Slate Desktop does not re-insert the TMX inline elements. It is more intelligent with XLIFF and re-inserts the elements in the most likely target text location based on some complicated rule-based algorithms that I don't even understand.

Other tags, like OmegaT's proprietary tags that approximate XLIFF tags, are problematic. We recommend you just turn off the tag feature in the CAT. Also, we support XLIFF 1.2. Supporting XLIFF 2.0 tags is one of those future upgrade features!

In conclusion, you shouldn't need to do much with a TMX editor. Any processing you do should be minimal and limited to restoring the segments to a natural language state. So, if you can export your TMs with an option to substitute placeholders with real values, this would be a great benefit. Otherwise, just leave them there.

[Edited at 2015-10-19 06:36 GMT]


 

P-O Nilsson  Identity Verified
Sweden
Local time: 09:35
Member (2010)
English to Swedish
TM categorization for Slate Desktop Nov 25, 2015

Hi Tom,

Regarding your recommendation to "curate (organize into categories) TMs in a way that make sense for you", I was wondering whether it would be most useful to categorize TMs according to subject matter (e.g. "Telecommunications" or "Agriculture") or according to text type (e.g. "Online help", "Marketing", "Contract"). How does Slate Desktop perform in these respects?

Best Regards,
Per-Ola Nilsson


 

Tom Hoar
United States
Local time: 03:35
English
Re: TM categorization for Slate Desktop Nov 25, 2015

P-O Nilsson wrote:

Hi Tom,

Regarding your recommendation to "curate (organize into categories) TMs in a way that make sense for you", I was wondering whether it would be most useful to categorize TMs according to subject matter (e.g. "Telecommunications" or "Agriculture") or according to text type (e.g. "Online help", "Marketing", "Contract"). How does Slate Desktop perform in these respects?

Best Regards,
Per-Ola Nilsson


Your work with categories will be a 2-step process. First, you assign one or more labels to your TMs when you import them into the system. Then you select which labels to include when you create an engine. You can make as many engines as you like, across any combination of labels. Here are some examples:

label 1 (subject matter): telecommunications, agriculture or pharmaceuticals
label 2 (text type): finance, legal, annual reports, contract or marketing

TM 1: banking, legal
TM 2: medical devices, legal
TM 3: pharmaceuticals, technical
TM 4: banking, finance
TM 5: pharmaceuticals, finance

You can create an engine with one text type segments across all subjects, or one for a subject across all text types. There's no "right" way to do this. Whatever you choose, the taxonomy should complement your workload, not force you to learn a new process.
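The two-step selection can be sketched in a few lines (names invented; this mirrors the TM/label examples above): a TM is included in an engine's training corpus whenever its labels overlap the labels you select.

```python
# Label-based corpus selection sketch: TMs are labeled at import; an
# engine's corpus is the union of TMs whose labels intersect the selection.

tms = {
    "TM 1": {"banking", "legal"},
    "TM 2": {"medical devices", "legal"},
    "TM 3": {"pharmaceuticals", "technical"},
    "TM 4": {"banking", "finance"},
    "TM 5": {"pharmaceuticals", "finance"},
}

def select_tms(selected_labels):
    return sorted(name for name, labels in tms.items()
                  if labels & selected_labels)

print(select_tms({"legal"}))            # ['TM 1', 'TM 2']
print(select_tms({"pharmaceuticals"}))  # ['TM 3', 'TM 5']
```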


 

Tom Hoar
United States
Local time: 03:35
English
Slate Desktop release candidate 1 in testing Jan 25, 2016

If anyone is still monitoring this post, I'm happy to share that Slate Desktop release candidate 1 has been distributed to 5 testers. All of the testers are customers who backed the Indiegogo campaign. I posted some ideas and pictures here at this Linkedin Pulse post.

https://www.linkedin.com/pulse/final-stretch-tom-hoar

Also, until we release version 1, the Indiegogo campaign is still open for backers to contribute:

http://igg.me/at/slate-desktop


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 08:35
Member (2009)
Dutch to English
+ ...
Hi Tom Jan 26, 2016

tahoar wrote:

If anyone is still monitoring this post, I'm happy to share that Slate Desktop release candidate 1 has been distributed to 5 testers. All of the testers are customers who backed the Indiegogo campaign. I posted some ideas and pictures here at this Linkedin Pulse post.

https://www.linkedin.com/pulse/final-stretch-tom-hoar

Also, until we release version 1, the Indiegogo campaign is still open for backers to contribute:

http://igg.me/at/slate-desktop


I'd like to do some testing too!

Michael


 

Tom Hoar
United States
Local time: 03:35
English
Re: Slate Desktop release candidate 1 in testing Jan 26, 2016

Michael Beijer wrote:

I'd like to do some testing too!

Michael


Thanks Michael,

I didn't hear from you after my message to backers only. I figured maybe your schedule was booked. I'm happy to have you join the team.

I forgot to mention here that everyone can follow our support forums (https://pttools.freshdesk.com/helpdesk). We've had some really good feedback during the first few days. You'll see that we worked through some installer problems, tokenizer support, disk size limits and language attributes for TMX/XLIFF files.

I know everyone is eager to hear about translation quality performance, but these basic issues are really important to everyone's overall experience, not to mention that they avoid wasting your valuable production time (or our time with customer support instead of new feature development).

I'm already preparing a new installer for RC2. I'll put you on the RC2 list in a few days.

Tom


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 08:35
Member (2009)
Dutch to English
+ ...
Thanks! Jan 26, 2016

tahoar wrote:

Michael Beijer wrote:

I'd like to do some testing too!

Michael


Thanks Michael,

I didn't hear from you after my message to backers only. I figured maybe your schedule was booked. I'm happy to have you join the team.

I forgot to mention here that everyone can follow our support forums (https://pttools.freshdesk.com/helpdesk). We've had some really good feedback during the first few days. You'll see that we worked through some installer problems, tokenizer support, disk size limits and language attributes for TMX/XLIFF files.

I know everyone is eager to hear about translation quality performance, but these basic issues are really important to everyone's overall experience, not to mention that they avoid wasting your valuable production time (or our time with customer support instead of new feature development).

I'm already preparing a new installer for RC2. I'll put you on the RC2 list in a few days.

Tom


 

Richard Hill  Identity Verified
Mexico
Local time: 03:35
Member (2011)
Spanish to English
Installation Feb 17, 2016

Hi Tom,

I just received the link to the installer, but before installing, as per my previous question, I wonder if I can install it on both my PC and laptop?

Not to worry, I'm happy to see I can install on two machines.

I have another question, but I posted it on the pttools site

[Edited at 2016-02-17 17:42 GMT]


 

Tom Hoar
United States
Local time: 03:35
English
Re: Installation Feb 17, 2016

Richard Hill wrote:

Hi Tom,

I just received the link to the installer, but before installing, as per my previous question, I wonder if I can install it on both my PC and laptop?

Richard


Yes, Richard. We updated the license and the technology to activate on two machines with one license key. Here's a link to the EULA.

http://www.slate.rocks/end-user-license-agreement/

Each installation registers with our activation server, which disables any new installations once you reach the limit of two installs.


 