Segment length analysis?
Thread poster: Mirko Mainardi

Mirko Mainardi  Identity Verified
Italy
Local time: 00:27
Member
English to Italian
Apr 26

Hi everyone.

For some reason I never really thought about this before... but are there CAT tools that provide an analysis of segments based on their length? I mean, everyone knows that translating 2,500 isolated one-word terms is totally different from translating 25 segments of 100 words of cohesive text each, even if the total word count is the same.

And if the answer to the above question is "no"... why?



SDL Community  Identity Verified
United Kingdom
Local time: 00:27
Member (1970)
English
What would you like to see? Apr 26

You already have the analysis by character, word and segment. So if there were 2500 segments and 2500 words you'd know. How would you like to see the analysis?

Regards

Paul
http://xl8.one


 

Mirko Mainardi  Identity Verified
Italy
Local time: 00:27
Member
English to Italian
TOPIC STARTER
Agnostic Apr 26

Thank you for your reply Paul. At any rate, this was supposed to be an agnostic question (i.e. not specifically SDL-related).

Also, if I'm not mistaken, what the Studio analysis says is based on totals and "fuzzy bands", while what I meant is an analysis specifically based on number of words per segment.

In other words, if the analysis says a file has 100 segments, 2500 words, and 10000 characters, all I know is what the average length per segment is (25 words), but in reality, I could have a few segments with big chunks of text and a lot of smaller/tiny segments.

So, what I'm talking about is a breakdown based on segment length rather than (or "in addition to", of course...) fuzzy matching, so that a translator would have an additional metric to discern how time consuming a task could be, at a glance.
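To illustrate the point about averages, here is a minimal Python sketch. The list of per-segment word counts is made up for the example (10 long segments, 90 tiny ones); it only shows that a file matching the "100 segments, 2500 words" totals can look nothing like 100 segments of 25 words.

```python
from statistics import mean, median

# Hypothetical file: word counts per segment (assumed data, not from a real project).
segment_lengths = [205] * 10 + [5] * 90

segment_count = len(segment_lengths)        # number of segments in the file
total_words = sum(segment_lengths)          # total word count reported by an analysis
average_length = mean(segment_lengths)      # what the totals alone let you infer
typical_length = median(segment_lengths)    # what a "typical" segment actually looks like

print(f"{segment_count} segments, {total_words} words")
print(f"average words per segment: {average_length}")
print(f"median words per segment:  {typical_length}")
```

The average comes out at 25 words, while the median is only 5, which is the kind of gap a per-length breakdown would expose at a glance.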


 

Philippe Etienne  Identity Verified
Spain
Local time: 00:27
Member
English to French
Warning Apr 26

A breakdown by source segment length could look like this:
< 5 words 18% (titles, software strings, headlines, tables: more time)
5-19 words 64% (sentences: standard)
> 19 words 18% (long sentences: perhaps more time to convey with style)

The middle band deserves discounts, I think.
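A breakdown like the one above could be produced from per-segment word counts with a few lines of Python. This is only a sketch: the function names are hypothetical, the bands are the three suggested above, and the percentages are assumed to be shares of segments (not shares of words).

```python
from collections import Counter

BANDS = ("< 5 words", "5-19 words", "> 19 words")

def band(words: int) -> str:
    """Map a segment's word count to one of the three bands."""
    if words < 5:
        return BANDS[0]
    if words <= 19:
        return BANDS[1]
    return BANDS[2]

def length_breakdown(segment_lengths):
    """Return {band: percentage of segments} for a list of word counts."""
    total = len(segment_lengths)
    if total == 0:
        return {name: 0.0 for name in BANDS}
    counts = Counter(band(n) for n in segment_lengths)
    return {name: round(100 * counts[name] / total, 1) for name in BANDS}

# Made-up sample: 18 short, 64 medium and 18 long segments.
sample = [2] * 18 + [12] * 64 + [30] * 18
for name, share in length_breakdown(sample).items():
    print(f"{name}: {share}%")
```

Finer bands for short segments (1, 2, 3 words and so on) would only need more entries in `BANDS` and more branches in `band`.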

Philippe


 

Jean Dimitriadis  Identity Verified
France
Local time: 00:27
Member
English to French
+ ...
Filter segments by length Apr 26

In CafeTran Espresso, in addition to the CAT file analysis (number of segments/words/characters) [and SDL Trados can also provide these details without any fuzzy matching/TM attached], which gives a good idea of the average number of words per segment, you can quickly sort (filter) segments by length (short or long first). I think MemoQ offers that as well.

You can also use a QA step for displaying only segments above a user-defined maximum character count.

This is not a standard analysis as you mean it, but it does provide a rough overview that should be sufficient to understand at a glance whether the project has many short segments, many long segments, or a mix.
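Outside any particular CAT tool, the same two operations (sorting by length and showing only segments above a character limit) can be sketched in Python as follows. This does not use any CafeTran, memoQ or Trados API; the sample segments, the function names and the whitespace-based word count are all assumptions made for the example.

```python
def word_count(segment: str) -> int:
    """Naive word count: split on whitespace (real tools count differently)."""
    return len(segment.split())

def over_limit(segments, max_chars: int):
    """Return only the segments longer than max_chars characters."""
    return [s for s in segments if len(s) > max_chars]

# Made-up source segments.
segments = [
    "OK",
    "Cancel",
    "Save changes before closing?",
    "The quick brown fox jumps over the lazy dog near the river bank.",
]

# Longest segments first; use reverse=False for shortest first.
longest_first = sorted(segments, key=word_count, reverse=True)

for s in longest_first:
    print(word_count(s), s)
print(over_limit(segments, max_chars=40))
```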

When speaking of translation difficulty, a quantitative analysis (especially total word count alone) can only get you so far.

I’m still refining my own pre-translation analysis process for estimating time and translation difficulty; it is a tricky subject for sure.

[Edited at 2019-04-26 18:23 GMT]


 

Mirko Mainardi  Identity Verified
Italy
Local time: 00:27
Member
English to Italian
TOPIC STARTER
Additional metric Apr 26

Jean Dimitriadis wrote:

When speaking of translation difficulty, a quantitative analysis (especially total word count alone) can only get you so far.


Yes Jean, I do agree, and that's why I wrote this would be "an additional metric to discern how time consuming a task could be, at a glance".

However, good point about sorting segments by length, although I would much prefer a report.

Philippe Etienne wrote:

A breakdown by source segment length could look like this:
< 5 words 18% (titles, software strings, headlines, tables: more time)
5-19 words 64% (sentences: standard)
> 19 words 18% (long sentences: perhaps more time to convey with style)

The middle band deserves discounts, I think.


Yeah, something like that, even though I would like a detailed breakdown, especially for shorter (i.e. <7 words) segments. Also, I don't think this could be used to give or request further discounts (in addition to those for fuzzies...). To give an example, a lot of 1-2 word segments would basically amount to glossary building, or would in any case take more time than longer, cohesive text, so in my opinion it would be useful to have a quick way to check that (ideally before accepting a project...).

[Edited at 2019-04-26 20:05 GMT]



Samuel Murray  Identity Verified
Netherlands
Local time: 00:27
Member (2006)
English to Afrikaans
+ ...
@Paul Apr 27

SDL Community wrote:
So if there were 2500 segments and 2500 words you'd know.


What if there were 1000 segments and 10 000 words? That's 10 words per segment, on average. But the time saving on very long segments does not cancel out the time wastage on very short segments. A 30-word segment does not really take more time per word than a 20-word segment, but a 3-word segment takes up much more time per word than a 10-word segment.

I mean, suppose 100 of those segments have only 1 word, and 100 have only 2 words, and 100 have only 3 words, then the average length of the remaining 700 segments (the remaining 9400 words) is 13 words per segment. The 300 short segments will take up far more time per word than the average.

It takes me (generally) just as long to translate a 1-word segment as a 3-word segment or even a 5-word segment. So for me, if I had wanted the weighted word count to be an accurate indication of the amount of time it will take to do the job, all segments of 5 words or less should be counted as 5 words.

So let's recalculate the 10 000-word example:

100 x 1-word segments: 100 words actual, 500 words weighted
100 x 2-word segments: 200 words actual, 500 words weighted
100 x 3-word segments: 300 words actual, 500 words weighted
Other segments: 9400 words actual

The adjusted word count, then, is 10900 words (i.e. it would take two to three hours longer to complete the job than a strictly average 10 000 words).
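The recalculation above can be reproduced with a short Python sketch. The function name is hypothetical, and the split of the "other" 700 segments into 400 segments of 13 words and 300 of 14 words is an assumption chosen only so that they add up to 9400 words; any mix of segments longer than 5 words with that total gives the same result.

```python
def adjusted_word_count(segment_lengths, floor=5):
    """Count every segment shorter than `floor` words as `floor` words."""
    return sum(max(n, floor) for n in segment_lengths)

# 100 x 1-word, 100 x 2-word, 100 x 3-word segments, plus 700 longer ones
# (assumed split: 400 x 13 words + 300 x 14 words = 9400 words).
lengths = [1] * 100 + [2] * 100 + [3] * 100 + [13] * 400 + [14] * 300

actual = sum(lengths)
adjusted = adjusted_word_count(lengths)

print(f"actual:   {actual} words")
print(f"adjusted: {adjusted} words")
print(f"extra:    {adjusted - actual} words")
```

With a floor of 5 words, the 300 short segments count as 1500 words instead of 600, so the adjusted total is 900 words above the actual 10 000.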


[Edited at 2019-04-27 06:25 GMT]



SDL Community  Identity Verified
United Kingdom
Local time: 00:27
Member (1970)
English
All agreed... Apr 27

Samuel Murray wrote:

SDL Community wrote:
So if there were 2500 segments and 2500 words you'd know.


What if there were 1000 segments and 10 000 words? That's 10 words per segment, on average. But the time saving on very long segments does not cancel out the time wastage on very short segments. A 30-word segment does not really take more time per word than a 20-word segment, but a 3-word segment takes up much more time per word than a 10-word segment.

I mean, suppose 100 of those segments have only 1 word, and 100 have only 2 words, and 100 have only 3 words, then the average length of the remaining 700 segments (the remaining 9400 words) is 13 words per segment. The 300 short segments will take up far more time per word than the average.

It takes me (generally) just as long to translate a 1-word segment as a 3-word segment or even a 5-word segment. So for me, if I had wanted the weighted word count to be an accurate indication of the amount of time it will take to do the job, all segments of 5 words or less should be counted as 5 words.

So let's recalculate the 10 000-word example:

100 x 1-word segments: 100 words actual, 500 words weighted
100 x 2-word segments: 200 words actual, 500 words weighted
100 x 3-word segments: 300 words actual, 500 words weighted
Other segments: 9400 words actual

The adjusted word count, then, is 10900 words (i.e. it would take two to three hours longer to complete the job than a strictly average 10 000 words).


[Edited at 2019-04-27 06:25 GMT]


That's why I asked what you'd like to see. In terms of helping with project estimation this seems like an interesting way forward. Perhaps this is something we could do as a small plugin so you have an additional analysis. Any developer could add this using the API... but assuming nobody here can develop perhaps I'll add it to our list of things to do.

Regards

Paul


 

Philippe Etienne  Identity Verified
Spain
Local time: 00:27
Member
English to French
MeToo Apr 29

Samuel Murray wrote:
...
It takes me (generally) just as long to translate a 1-word segment as a 3-word segment or even a 5-word segment. So for me, if I had wanted the weighted word count to be an accurate indication of the amount of time it will take to do the job, all segments of 5 words or less should be counted as 5 words.
...

While opposed to potentially getting weighted wordcounts higher than the actual wordcount for philosophical reasons, I see the point. In all fairness, small segments shouldn't be "discounted".

Simpler to visualise than segment wordcount breakdown, I think such a weighted wordcount would already lead to a much more accurate anticipation of the translation time required.

But CAT tool makers, when coming up with "partial matches", "analyses", "non-existing matches that will exist later", "tags/numbers that don't count" and stuff, haven't implemented a kind of threshold (I also think that around 3-5 words is realistic) below which the contents of small segments are reported as full words, neither weighted nor discounted.
If there are only a few mini-segments, the buyer would "lose" a few pennies, and if there are a lot, the translator would actually be paid for the extra time needed.
However, I am aware that weighted wordcounts have long lost their primary function of anticipating the time required: for instance, 80% discounts on 95-99% concordance matches seem to be common practice with a certain type of agency, whereas 15 years ago, most used a single discount rate for all 75-99% matches.
To actually anticipate the time needed, I use a slightly amended historical version of the three-thirds 33/66/100, with fuzzies in the 75-99% concordance band.
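As a rough illustration of how such a small-segment threshold could sit on top of a discount grid, here is a Python sketch. Everything in it is an assumption made for the example: the grid is the plain 33/66/100 (not the amended version, whose details are not given), with 33% applied to exact matches and repetitions, 66% to 75-99% fuzzies and 100% to new text; the threshold is set to 4 words; and the match labels and function name are hypothetical.

```python
# Assumed grid: share of each word that is counted, per match type.
GRID = {"new": 1.00, "fuzzy": 0.66, "exact": 0.33}

def weighted_words(segments, threshold=4):
    """Weighted word count where segments of `threshold` words or fewer
    are always counted in full, whatever their match type.

    `segments` is a list of (word_count, match_type) pairs.
    """
    total = 0.0
    for words, match in segments:
        rate = 1.0 if words <= threshold else GRID[match]
        total += words * rate
    return total

# Made-up job: one new sentence, one fuzzy sentence, one short exact match.
job = [(10, "new"), (10, "fuzzy"), (3, "exact")]

print(round(weighted_words(job), 2))               # short exact match counted in full
print(round(weighted_words(job, threshold=0), 2))  # same grid without the threshold
```

With the threshold, the 3-word exact match contributes 3 words instead of roughly 1, which is the "few pennies" difference described above.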

Besides, I can't imagine any CAT tool maker implementing any small-segment threshold, because its analyses would consistently yield higher weighted wordcounts compared to the competition. Hardly a selling argument in the agency market, which to a significant extent shapes what translators buy as CAT tools.
After almost 20 years of daily use of CAT tools, I've never seen any "ground-breaking", "innovating" or "killer" feature increase weighted wordcounts! And don't get me started on the "significant productivity gains" used to justify the downward trend of weighted wordcounts, together with the downward trend of discount grids and the stagnation of the unit rate.

Philippe



