Match algorithm expectations
Thread poster: Erik Freitag
Erik Freitag
Erik Freitag  Identity Verified
Germany
Local time: 17:07
Member (2006)
Dutch to German
+ ...
Apr 7, 2011

Dear colleagues,

following an interesting discussion with SDL staff, I'd appreciate your input about the evaluation of TM match values. As I said, I have already discussed this with SDL staff, but I'd like to check whether my expectations about how a CAT tool should work are unreasonable.

Imagine a source text consisting of the following three segments (which I'm quoting in their exact spelling - list numbers 1 to 3 are not part of the segments, obviously):


  1. Opstarten MER-studie
  2. Ad 1 Winningsvergunning
  3. Ad 2 Opstarten MER studie


I'd appreciate your input on the following questions:


  1. Would you want a CAT tool to present one or both of the segments 1 and 2 as a match for segment 3?
  2. Compared to segment 3, how high would you expect the respective match values for segment 1 and 2 to be?


Later on, it might also be interesting to check how different CAT tools actually evaluate these segments, but I'd prefer to gather your subjective expectations first.

Thanks for all your thoughts about this!

Kind regards,
Erik



[Edited at 2011-04-07 07:00 GMT]


 
Grzegorz Gryc
Grzegorz Gryc  Identity Verified
Local time: 17:07
French to Polish
+ ...
A quick opinion and a link... Apr 7, 2011

efreitag wrote:

following an interesting discussion with SDL staff, I'd appreciate your input about the evaluation of TM match values. (...)

I'd appreciate your input on the following questions:


  1. Would you want a CAT tool to present one or both of the segments 1 and 2 as a match for segment 3?
  2. Compared to segment 3, how high would you expect the respective match values for segment 1 and 2 to be?


Later on, it might also be interesting to check how different CAT tools actually evaluate these segments, but I'd prefer to gather your subjective expectations first.


The problem is what exactly counts as a match. Most people don't touch the default minimum match value (e.g. 70% in Trados), which is often absurd, at least for some languages.
Segments 1 and 3 will be recognized and shown if you lower the threshold to, let's say, 50%, as I do.

Nonetheless, your problem is related to the crazy weight applied to numbers in Trados: numbers count twice as much as words, which is why the 1-3 match will be approx. 50%.
As a human, I'd expect something like 70%.
The 2-3 match will probably be very low (probably under the 30% level) because of the length difference (the number of words in the segment is one of the most important factors in the Trados algorithm).
As the "Ad 2" subsegment is very short, I don't complain that it's below the 30% threshold, although I would expect 30% for longer words.
The essential problem is that the Trados algorithms count a short word exactly the same as a very long word.
E.g. if you compare fake sentences of the same length (in words), e.g. "Ad 2 AA BB CC" or "Ad 2 AAAAAAAAAAAAAAA BBBBBBBBBBBBBBB CCCCCCCCCCCCCCC", you'll get approx. 40% for both, although the number of keystrokes needed to type the second one is approx. 5 times bigger.
IMO it's one of the most important flaws in the Trados matching algorithms; some character-related weighted value should be used.
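
To make the short-word versus long-word point concrete, here is a minimal Python sketch. It is a toy model, not the Trados formula: `word_score` and `char_weighted_score` are made-up names, the extra weight on numbers is ignored, and the comparison segments ("XX YY ZZ" etc.) are invented for illustration.

```python
from collections import Counter


def shared_tokens(a: str, b: str) -> list[str]:
    """Tokens common to both segments (multiset intersection, whitespace split)."""
    common = Counter(a.split()) & Counter(b.split())
    return list(common.elements())


def word_score(a: str, b: str) -> float:
    """Toy score: every shared token counts as 1, whatever its length."""
    longest = max(len(a.split()), len(b.split()))
    return len(shared_tokens(a, b)) / longest if longest else 1.0


def char_weighted_score(a: str, b: str) -> float:
    """Toy score: every shared token counts in proportion to its character count."""
    total = max(sum(map(len, a.split())), sum(map(len, b.split())))
    return sum(map(len, shared_tokens(a, b))) / total if total else 1.0


# Two pairs that share only "Ad 2": one with short words, one with long words.
short_a, short_b = "Ad 2 AA BB CC", "Ad 2 XX YY ZZ"
long_a = "Ad 2 " + " ".join(c * 15 for c in "ABC")
long_b = "Ad 2 " + " ".join(c * 15 for c in "XYZ")

for a, b in ((short_a, short_b), (long_a, long_b)):
    print(f"word: {word_score(a, b):.0%}  char-weighted: {char_weighted_score(a, b):.0%}")
```

In this toy model the word-based score is the same for both pairs, while the character-weighted score drops sharply for the pair with long words, which is closer to the amount of retyping actually involved.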

A similar question was discussed here :
http://glg.proz.com/forum/sdl_trados_support/183991-how_to_have_trados_concentrate_on_relevant_text_rather_than_tag_material_for_finding_matches.html

Cheers
GG


 
Titia Meesters
Titia Meesters  Identity Verified
Local time: 17:07
Member (2005)
English to Dutch
Only segment 1 is relevant here Apr 7, 2011

Segment 2 has the same small (and unimportant) words at the beginning and is found as a match of segment 3 on that basis, although the sentence as a whole is not related. Also in my experience, the first words of a segment seem to get too much weight in the evaluations. I would not like to get all segments beginning with "Ad x" as matches of the third segment, as most of these will probably be completely unrelated.

Segment 1 should have a very high matching percentage (>80%) as it is almost the same sentence as segment 3.


 
Erik Freitag
Erik Freitag  Identity Verified
Germany
Local time: 17:07
Member (2006)
Dutch to German
+ ...
TOPIC STARTER
Thanks Apr 11, 2011

Dear Grzegorz and Titia,

Thanks for your contributions. I see that a related discussion has already taken place (which might be a reason why this thread hasn't attracted that much attention), the difference being that the case I described does not contain any tags.

Part of the problem here is due to the way the algorithm works (which has been explained to me by SDL staff).

"MER studie" is counted as two words, while "MER-studie" (mind the hyphen) is counted as a single word, which is seen as unrelated to the two words "MER studie".

I think that the hyphen is extremely overweighted.
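
As a rough illustration of that word counting (a sketch only; the two tokenizers below are hypothetical and not taken from Studio), compare which words segments 1 and 3 share when the hyphen is kept inside the word and when it is treated as a separator:

```python
import re


def split_on_whitespace(segment: str) -> list[str]:
    """Hyphen stays inside the token, so "MER-studie" is one word."""
    return segment.split()


def split_on_whitespace_and_hyphens(segment: str) -> list[str]:
    """Hyphen acts as a separator, so "MER-studie" yields "MER" and "studie"."""
    return [tok for tok in re.split(r"[\s-]+", segment) if tok]


def shared_words(a: str, b: str, tokenize) -> set[str]:
    """Words that occur in both segments under the given tokenizer."""
    return set(tokenize(a)) & set(tokenize(b))


seg1 = "Opstarten MER-studie"
seg3 = "Ad 2 Opstarten MER studie"

print(sorted(shared_words(seg1, seg3, split_on_whitespace)))
# ['Opstarten']
print(sorted(shared_words(seg1, seg3, split_on_whitespace_and_hyphens)))
# ['MER', 'Opstarten', 'studie']
```

With the hyphen kept inside the word, only "Opstarten" is common to both segments; with the hyphen as a separator, all three content words are.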

One of the reasons I opened this thread is the line of argumentation that I have seen more than once when discussing with SDL staff. This is how it goes: 1. I describe a problem I perceive while working with the software. 2. Staff explains how the software works, and that the case I described complies with the way the software works, and that therefore there is no problem.

In this case, the software works as intended from the programmer's point of view, but not from the user's point of view.

Just for the record: Trados 2007 handles these segments differently and works as I, as a translator, would expect. The match algorithm of Studio is a change for the worse.


 
Erik Freitag
Erik Freitag  Identity Verified
Germany
Local time: 17:07
Member (2006)
Dutch to German
+ ...
TOPIC STARTER
Match algorithm complete nonsense - forwarded to development - is there any hope? Apr 15, 2011

Dear all,

I know that the issue I've raised hasn't attracted much attention, but I just need to let off some steam.

In a project I'm working on, I've set the minimal match values for both TM and concordance search to 30% (which is the smallest value available).

Still, no luck. Obviously, Studio calculates a match value of less than 30% for the following two segments (and hence does not offer a TM match):

Vluchtroute-geleidingssysteem
Vluchtroute -geleidingssysteem

A concordance search for the complete segment 2 returns some matches containing "Vluchtroute", but does NOT return segment 1.
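
For comparison, a plain character-level measure sees these two segments as almost identical. The sketch below uses Python's standard `difflib.SequenceMatcher`; it is only meant to show how close the strings are and says nothing about how Studio computes its values.

```python
from difflib import SequenceMatcher

seg_a = "Vluchtroute-geleidingssysteem"
seg_b = "Vluchtroute -geleidingssysteem"


def char_similarity(a: str, b: str) -> float:
    """Ratio of matching characters (0.0 to 1.0) between two strings."""
    return SequenceMatcher(None, a, b).ratio()


print(f"character-level similarity: {char_similarity(seg_a, seg_b):.0%}")
```

The two strings differ by a single space, so the ratio comes out well above 90%.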

I'm really fed up. SDL support has confirmed that my "suggestion regarding the search algorithm" will be forwarded to development, but I tend to believe that this is a euphemism for "thanks, but we can't be bothered with that now."

So, I'll have to continue searching for segments I've translated before with CTRL+F and manually copy the translation. This is core work for a CAT tool, not something I should have to do.

I'm still not prepared to abandon Trados, given the significant amount of work I invested in learning to use the software and the fact that it does offer some nice features. But seeing that this CAT tool has such serious flaws in its most important core function, I'd strongly advise beginners to choose a different tool.

Sorry for the rant, but I'm really beginning to get angry.

Best regards,
Erik



[Edited at 2011-04-15 09:56 GMT]

[Edited at 2011-04-15 09:57 GMT]

[Edited at 2011-04-15 09:58 GMT]


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 17:07
English to Hungarian
+ ...
Agree Apr 15, 2011

I understand your frustration. This should have been done better.
The problem seems to be that Trados treats the hyphen as a special thing, probably as part of the word. So if a hyphenated expression next occurs with a space instead of or in addition to the hyphen, the expression is not recognized at all: now it's two words, not one. Really, Trados should just index every hyphenated expression both with a hyphen and as if it had a space in it. Add some penalty points to the made-up "with space" version and BAM! problem solved.
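
The indexing idea can be sketched roughly as follows. This is a toy, assuming a simple word-overlap score and a made-up penalty of 5 points; `HYPHEN_PENALTY`, `index_variants` and `best_match` are hypothetical names, not anything from Trados.

```python
from collections import Counter

HYPHEN_PENALTY = 5  # hypothetical penalty (percentage points) for the made-up variant


def word_overlap(a: str, b: str) -> float:
    """Toy match value: shared words as a percentage of the longer segment."""
    ta, tb = a.split(), b.split()
    if not ta and not tb:
        return 100.0
    shared = sum((Counter(ta) & Counter(tb)).values())
    return 100.0 * shared / max(len(ta), len(tb))


def index_variants(segment: str) -> dict[str, int]:
    """Index the segment as written, plus a "with space" variant if it has hyphens."""
    variants = {segment: 0}
    spaced = " ".join(segment.replace("-", " ").split())
    if spaced != segment:
        variants[spaced] = HYPHEN_PENALTY
    return variants


def best_match(query: str, tm_segment: str) -> float:
    """Best score over all indexed variants, after subtracting each variant's penalty."""
    return max(
        word_overlap(query, form) - penalty
        for form, penalty in index_variants(tm_segment).items()
    )


print(best_match("Ad 2 Opstarten MER studie", "Opstarten MER-studie"))  # 55.0
```

In this toy scoring, the segment as written shares only one of five words with the query (20%), while the "with space" variant shares three (60%, or 55% after the penalty), so the segment would clear a 50% threshold.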


 
Erik Freitag
Erik Freitag  Identity Verified
Germany
Local time: 17:07
Member (2006)
Dutch to German
+ ...
TOPIC STARTER
hyphen Apr 15, 2011

Thanks for sympathising!

FarkasAndras wrote:

I understand your frustration. This should have been done better.
The problem seems to be that Trados treats the hyphen as a special thing, probably as part of the word.


Yes, that's exactly how the algorithm works according to SDL staff. Segment 1 contains one word, segment 2 contains two words. The fact that they resemble each other very much is not further taken into account at all.

FarkasAndras wrote:
So if a hyphenated expression next occurs with a space instead of or in addition to the hyphen, the expression is not recognized at all: now it's two words, not one. Really, Trados should just index every hyphenated expression both with a hyphen and as if it had a space in it. Add some penalty points to the made-up "with space" version and BAM! problem solved.


Yes, that might solve the problem. In addition to that, the possibility of writing the two words together as one (without space or hyphen) should also be accounted for.
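
One way to cover all three spellings (hyphen, space, written together) could look like the sketch below: index each TM segment under a key with hyphens and whitespace removed. The names `joiner_key`, `build_index` and `lookup` are made up for illustration, and this only finds whole-segment equivalents; a real tool would need to combine it with fuzzy matching and a penalty.

```python
import re
from collections import defaultdict


def joiner_key(segment: str) -> str:
    """Case-folded segment with all hyphens and whitespace removed."""
    return re.sub(r"[\s-]+", "", segment).casefold()


def build_index(tm_segments: list[str]) -> dict[str, list[str]]:
    """Group TM segments by their joiner-insensitive key."""
    index = defaultdict(list)
    for seg in tm_segments:
        index[joiner_key(seg)].append(seg)
    return index


def lookup(index: dict[str, list[str]], query: str) -> list[str]:
    """TM segments that equal the query once hyphens and spaces are ignored."""
    return index.get(joiner_key(query), [])


tm = build_index(["Vluchtroute-geleidingssysteem"])
for query in (
    "Vluchtroute -geleidingssysteem",
    "Vluchtroute geleidingssysteem",
    "Vluchtroutegeleidingssysteem",
):
    print(query, "->", lookup(tm, query))
```

All three query spellings map to the same key and therefore retrieve the hyphenated TM segment.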

I hope that SDL actually acknowledge this as a real problem.


 
Jonathan Hopkins
Jonathan Hopkins  Identity Verified
Germany
Local time: 17:07
German to English
+ ...
Hyphens and flawed algorithms Nov 23, 2011

Hello fellow translators,

and hi again, Erik. @Erik: I just saw that you had a similar example to the one I just posted today here:

http://www.proz.com/forum/sdl_trados_support/212725-tu_not_found_in_studio_2011.html

In my example, Studio doesn't find a TU that is exactly the same, save for one hyphen. Even when the threshold is reduced to 30%, there's still no luck.

I second your petition to SDL that they should reconsider their algorithms.

I might add that this is a much greater problem for languages such as German, which often have rather long compound words, with or without hyphens, that would cause utter havoc with the fuzzy index.

Where's Paul? I'd like someone from SDL to join in on the discussion.

Kind regards,
Jonathan


 

