Customization of Segmentation (all tools)
Thread poster: Ella Luz
Ella Luz
Germany
Nov 23, 2016

Hi everyone!
Do you happen to know whether it is possible to make a CAT-tool ignore pilcrows (paragraph marks) and tabs within sentences in Word Documents? I am translating documents containing sentences which span several lines but are interrupted by paragraph marks and tabs. How can I tell a tool to only accept a full stop (= actual end of sentence) as the end of a segment?
There are instances where there is a placeholder within sentences indicated by several full stops (…). How can I add this exception to my segmentation rules, once I have customized the tool? I don’t want the segment to end after this set of full stops either, only after a single full stop.
Is such a customization possible at all and if so, is it possible with every CAT-tool? This question might affect our decision regarding which tool to go for. We need a collaborative tool and will most likely go for a cloud-based one. Do internet-based tools have this feature? How about Memsource?
Thank you very much in advance!
Helen


 
Stanislav Okhvat
Local time: 12:47
English to Russian
Re: Customization of Segmentation (all tools) Nov 23, 2016

Hello Helen,

With tab characters, it is pretty easy to configure your CAT tool so that there is no segmentation on tab characters. Some CAT tools such as memoQ do not perform segmentation on tab characters by default, others like Memsource do.

With paragraph breaks (hard returns) it is more difficult. A paragraph is a distinct structural unit, so when a CAT tool segments a document, it first breaks the text into structural units (e.g., in Word these are paragraphs, in HTML these are div, p and similar structural tags, etc.). Only after the text has been broken down into structural units is each unit segmented according to the segmentation rules. For this reason it is impossible to avoid segmentation on paragraph breaks: you can create a segmentation rule for this, but it will never take effect, because the rules are applied only within a paragraph. Note, however, that it is possible to avoid segmentation on line breaks (soft returns). In Word you can use Unbreaker for Word (part of a Word add-in) to remove hard returns in a semi-automatic way before the document is imported into the CAT tool, thus avoiding the above issue.
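
To illustrate the two-stage process described above, here is a minimal Python sketch. It is not how any particular CAT tool is implemented. It assumes that a hard return is modelled as "\n" and that the only segmentation rule is "break after a full stop followed by whitespace"; the names SENTENCE_BREAK and segment are made up for the example.

```python
import re

# Hypothetical, simplified break rule: split after a full stop followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=\.)\s+")

def segment(document: str) -> list[str]:
    segments = []
    # Stage 1: split into structural units. A hard return is modelled as "\n".
    for paragraph in document.split("\n"):
        # Stage 2: the sentence rule only ever sees one paragraph at a time,
        # so no rule can join text across a paragraph boundary.
        for piece in SENTENCE_BREAK.split(paragraph):
            if piece.strip():
                segments.append(piece.strip())
    return segments

text = "This sentence is interrupted\nby a hard return. Second sentence."
print(segment(text))
```

The interrupted sentence ends up in two segments no matter how the rule in stage 2 is written, which is why the hard returns have to be removed from the source file before import.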

Regarding the ellipsis (…), it is easy enough to prevent segmentation on this character.

Memsource uses segmentation rules in SRX format. For Word, it segments on tab characters by default (it is easy to set up a default rule for a specific source language which will avoid segmenting on tab characters). By default, Memsource does not segment on the ellipsis character (Unicode U+2026). You can customize Memsource and other CAT tools so that they segment only on specific punctuation marks, e.g. only on the full stop (.), as in your case.
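
As a rough illustration of how SRX-style rules behave, the Python sketch below mimics the general idea of break and no-break rules, each with a "before" and an "after" pattern, where the first rule that matches at a position decides. The rules are illustrative assumptions for this thread, not Memsource's actual defaults, and the names RULES and segment are made up.

```python
import re

# Each rule: (is_break, before_pattern, after_pattern). Rules are tried in order
# and the first one that matches at a position decides. All patterns are illustrative.
RULES = [
    (False, r"\.\.\.", r"\s"),  # exception: three full stops used as a placeholder
    (False, "…", r"\s"),        # exception: the ellipsis character U+2026
    (True, r"\.", r"\s"),       # break after a single full stop followed by whitespace
]

def segment(text: str) -> list[str]:
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in RULES:
            # The "before" pattern must end exactly at pos, the "after" pattern must start there.
            if re.search(f"(?:{before})\\Z", text[:pos]) and re.match(after, text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

sample = "Please send ... to the office\tby Friday. Thank you… and goodbye. Done."
for seg in segment(sample):
    print(repr(seg))
```

Because no rule breaks on a tab alone, the tab inside the first sentence stays within its segment, and the two exception rules come first so that they take precedence over the full-stop rule.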

Best regards,
Stanislav Okhvat
TransTools – Useful tools for every translator


 
esperantisto
Local time: 11:47
Member (2006)
English to Russian
OmegaT and Anaphraseus Nov 23, 2016

OmegaT: yes for tabs and no for paragraph breaks (paragraphs are the largest possible translation units).
Anaphraseus: segments can be expanded both over tabs and paragraph breaks (in the latter case, a warning is issued). However, Anaphraseus has no collaborative features.

[Edited at 2016-11-23 14:52 GMT]


 
John Fossey
Canada
Local time: 04:47
Member (2008)
French to English
MemoQ Nov 23, 2016

MemoQ can usually join segments and replace the paragraph marks with a tag. Tabs can also be replaced with a tag. Segmentation can be adjusted with rules. MemoQ also has a cloud server version, although I don't know much about it.

 
Samuel Murray
Netherlands
Local time: 09:47
Member (2006)
English to Afrikaans
@Helen Nov 23, 2016

Helen_Of_Troy wrote:
How can I tell a tool to only accept a full stop (= actual end of sentence) as the end of a segment?


The only tool that I know of that can ignore hard line breaks is OmegaT, and then only for TXT files.

Try replacing all hard line breaks ^p with soft line breaks ^l, and then optionally change all double soft line breaks ^l^l back to hard line breaks ^p^p.
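
The same two replacements can be sketched in Python for plain text. This only shows the logic: "\n" stands in for Word's ^p and "\v" for Word's ^l, both of which are assumptions made for the example, as is the function name soften_breaks. In Word itself you would use Find and Replace as described above.

```python
HARD = "\n"  # stands in for Word's ^p (hard return / paragraph mark)
SOFT = "\v"  # stands in for Word's ^l (soft return / manual line break)

def soften_breaks(text: str) -> str:
    # Step 1: turn every hard return into a soft return.
    text = text.replace(HARD, SOFT)
    # Step 2 (optional): a double soft return was a real paragraph gap, so restore it.
    return text.replace(SOFT + SOFT, HARD + HARD)

sample = "A sentence split\nover two lines.\n\nA new paragraph."
print(repr(soften_breaks(sample)))
```

Single breaks inside a sentence become soft returns, while the blank line between real paragraphs is kept as a pair of hard returns.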

Helen_Of_Troy wrote:
There are instances where there is a placeholder within sentences indicated by several full stops (…).


Try replacing the three dots ... with a single ellipsis character …. Then find all instances of the ellipsis plus a space, and replace it with just the ellipsis.
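
For plain text, the two replacements might look like this in Python. It is only a sketch of the logic; the function name normalize_ellipsis is made up, and in Word you would do the same with Find and Replace.

```python
def normalize_ellipsis(text: str) -> str:
    # Step 1: replace three full stops with the single ellipsis character (U+2026).
    text = text.replace("...", "…")
    # Step 2: remove the space after the ellipsis, so there is no "punctuation plus space" to break on.
    return text.replace("… ", "…")

print(normalize_ellipsis("The amount of ... is due on ... at the latest."))
```

Note that step 2 changes the spacing of the source text, so the translator has to put the space back in the target where needed.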

Helen_Of_Troy wrote:
Is such a customization possible at all and if so, is it possible with every CAT-tool? This question might affect our decision regarding which tool to go for.


Most CAT tools require the user to fix the source files first (or on the fly) to deal with the limitations of the CAT tool.


 
Ella Luz
Germany
TOPIC STARTER
Thank you! Nov 26, 2016

Dear Stanislav, esperantisto, John and Samuel,
Thank you very much for your comments; they are really helpful!
Kind regards,
Helen


 

