OmegaT segmentation behaviour
Thread poster: Valerijs Svincovs
Valerijs Svincovs
Latvia
English to Latvian
Oct 14, 2008

Can anyone advise how to make OmegaT (version 1.7.3, update 2) ignore a full stop (.) after a numeral, so that the sentence is not broken into separate segments? Is it possible at all, and might it lead to any other complications?
In Latvian, ordinal numbers are followed by a full stop, and I am having a lot of trouble with sentences being broken in half where they shouldn't be.
I tried reading the Help, but my brain does not seem to be that programming-oriented.
Many thanks in advance for any suggestions!

Valery


 
Marc P (X)
German to English
OmegaT segmentation behaviour Oct 14, 2008

In the segmentation rules dialog:

Uncheck the "Break/Exception" option
Pattern before: [0-9]\.
Pattern after: \s

Move this rule so that it is above the generic break rule (e.g. pattern before: [\.\?\!]+, pattern after: \s[A-Z]).

This should work. You may however prefer to do it the other way around, i.e. modify your generic break rule so that it only breaks after [any letter] followed by [any punctuation symbol], e.g.:

Check the "Break/Exception" option
Pattern before: [a-z][\.\?\!]
Pattern after: \s

This is the "simple" any-lower-case-letter variant. You may want to experiment, e.g. by including characters with diacritics, and to check that I have got the syntax right.
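
The effect of rule order can be tried outside OmegaT with a small sketch. It assumes a simplified model in which rules are tried top to bottom at each position and the first match decides; it is not OmegaT's actual implementation. The function name `segment` and the sample sentence are made up for illustration, and the patterns are the ones quoted above.

```python
import re

# Each rule: (is_break, pattern_before, pattern_after), listed top to bottom.
RULES = [
    (False, r"[0-9]\.", r"\s"),        # exception: digit followed by a full stop
    (True, r"[\.\?\!]+", r"\s[A-Z]"),  # generic break rule quoted in the post
]

# Made-up Latvian-style sample with an ordinal in mid-sentence.
SAMPLE = "Tas notika 2. Pasaules kara laikā. Pēc tam viss mainījās."


def segment(text, rules):
    """Split text with a simplified first-match-wins model of the rule list."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            # "before" must end exactly at pos and "after" must start at pos.
            if re.search(rf"(?:{before})\Z", text[:pos]) and re.match(after, text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule wins, break or exception
    segments.append(text[start:].strip())
    return segments


print(segment(SAMPLE, RULES))      # exception placed above the break rule
print(segment(SAMPLE, RULES[1:]))  # break rule alone
```

With the exception on top, the ordinal stays inside its sentence; with the break rule alone, the text is also cut after "2.".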

HTH,
Marc


[Edited at 2008-10-14 12:29]


 
Susan Welsh
United States
Russian to English
It's in the user's manual under "segmentation" Oct 14, 2008

There's a section that tells how to specify exceptions to the default segmentation.

Susan


 
Didier Briel
France
English to French
Adding an exception Oct 14, 2008

valerius wrote:

Can anyone give advice on how to make OmegaT (version 1.7.3, update 2) ignore a full stop (.) after a numeral and not break a sentence in separate segments? Is it possible at all

Of course.
To quote the manual:
"Given the flexibility you may consider defining more exception rules for the language you translate from, to give you more meaningful and coherent segments."


and might it lead to any other complications?

No reason for that.


In Latvian, ordinal numbers are followed by a full stop and I am experiencing a lot of trouble when sentences are broken in half where they shouldn't be.
Tried reading the Help, but my brain seems to be not that programming-oriented.

Regular expressions are not only useful for "programming-oriented" people. They can be used in a lot of situations, e.g., in Word.

What you need is to add an exception in the segmentation rules (Options/Segmentation).

Add a new rule to an existing set of rules, or create a new set, for instance for Latvian.

Don't check Break/Exception.
In Before, enter \d\. (a digit followed by a dot)
In After, enter \s (space)

That's all.
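
To see where this exception would apply, the two patterns can be tested against sample text. This is only a sketch: the helper name `exception_positions` and the sample sentences are invented here, and it uses Python's `re` module rather than OmegaT itself.

```python
import re

BEFORE = r"\d\."  # a digit followed by a full stop
AFTER = r"\s"     # followed by whitespace


def exception_positions(text):
    """Return the offsets at which the exception rule would suppress a break."""
    # The lookahead keeps the whitespace out of the match, so m.end() is the
    # offset between the full stop and the following space.
    pattern = re.compile(rf"(?:{BEFORE})(?={AFTER})")
    return [m.end() for m in pattern.finditer(text)]


for sample in (
    "Sanāksme notiks 14. oktobrī.",             # ordinal in mid-sentence
    "Version 1.7.3 is installed.",              # dots inside a version number
    "The total was 12. Nothing else changed.",  # sentence that ends in a number
):
    hits = exception_positions(sample)
    print(sample, "->", [sample[:p] for p in hits])
```

The third sample shows one side effect worth checking in real projects: a sentence that genuinely ends in a number also matches the exception, so it would stay joined to the sentence that follows.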

Didier


 
Valerijs Svincovs
Latvia
English to Latvian
TOPIC STARTER
Thanks to the experts! Oct 14, 2008

Thank you for the advice! It actually was a matter of 30 seconds once you know what to do, and it seems to be working so far.
I did it according to Didier's explanation, only because I had previously noticed in the help file that \d stands for numerals. I'm just curious: what difference would it make if I put [0-9]?
And also, for some reason, when I created a new set of rules with the same exception for Latvian and moved it above the default, it did not work. So I just added a new rule in the Default group. Any ideas?
And thank you, Marc, for the insight into how this feature works. However, I do not feel like experimenting at the moment, because converting .doc into .odt and vice versa already seems dodgy enough to me with all the layout changes, etc. There have been cases when I have been unable to open the translated document because of some tags accidentally deleted in the OmegaT interface (I have no clue what they mean, but that is probably worth another forum thread...).

Once again, your help is very much appreciated!


 
Didier Briel
France
English to French
There is more than one way to do it Oct 14, 2008

valerius wrote:

I did it according to Didier's explanation, only because I had noticed in the help file previously that d stands for numerals. Im just curious, what difference it would make, if I put [0-9]?

None. That's two different ways of saying the same thing.
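
A quick check of that equivalence, as a sketch only. OmegaT is a Java application, and in Java's regular expressions \d matches just 0-9 by default. The example below is in Python, where \d on text also matches non-ASCII digits unless the `re.ASCII` flag is set, so the flag is used to mirror the Java behaviour. The helper name `same_matches` and the sample strings are made up.

```python
import re


def same_matches(text):
    """True if \\d\\. (ASCII mode) and [0-9]\\. find exactly the same matches."""
    return re.findall(r"\d\.", text, re.ASCII) == re.findall(r"[0-9]\.", text)


samples = ["14. oktobris", "Nr. 7. lapa", "no digits here", "\u0663. item"]
print(all(same_matches(s) for s in samples))

# Without re.ASCII, Python's \d also accepts other Unicode digits, such as
# ARABIC-INDIC DIGIT THREE (U+0663). This is a Python detail, not OmegaT's.
unicode_hits = re.findall(r"\d\.", "\u0663.")
print(unicode_hits)
```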


And also, for some reason, when I had created a new set of rules with the same exception for Latvian and moved it above the default, it did not work. So I just added a new rule in the Default group. Any ideas?

One possibility is that the language you entered for the new set of rules didn't match the source language of the project.
The default rules match any language, because their language pattern is ".*".
When you create a new set, its pattern is "LN-CO" by default, which of course doesn't match anything.
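
The matching described here can be sketched as follows. The sketch assumes the language pattern is a regular expression that must match the whole source language code; the function name `rule_set_applies`, the codes "LV" and "LV-LV", and the pattern "LV.*" are illustrative assumptions, not values taken from the thread.

```python
import re


def rule_set_applies(language_pattern, source_language):
    """True if the pattern matches the entire source language code."""
    return re.fullmatch(language_pattern, source_language) is not None


for pattern in (".*", "LN-CO", "LV.*"):
    for code in ("LV", "LV-LV"):
        print(pattern, code, rule_set_applies(pattern, code))
```

Under these assumptions, a Latvian rule set left at "LN-CO" is never selected, while a pattern such as "LV.*" would pick it up for either form of the language code.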

Didier


 
Marc P (X)
German to English
OmegaT segmentation behaviour Oct 14, 2008

valerius wrote:

I, however, do not feel like experimenting at this moment because the converting of .doc into .odt and vice versa already seems dodgy enough to me with all the layout changes, etc.


When you open a .doc file in OOo (and convert it to .odt and edit the text), the layout may *appear* to be different, but you are likely to find after converting back and re-opening it in Word that the layout has remained the same, or that changes are only minor.

There have been cases when I have been unable to open the translated document because of some accidentally deleted tags in the OmegaT interface (the meaning of which I have no clue about, but that is probably worth another forum thread...).


Press Ctrl+T in OmegaT to check for tag errors.

Marc


 
Valerijs Svincovs
Latvia
English to Latvian
TOPIC STARTER
Tags Nov 10, 2008

Indeed, I had not been aware of this Ctrl+T functionality. I had been wondering what it was, but apparently the document I tried it on did not contain any tags, so, obviously, nothing happened.
Thank you all for your support and for the time you spent explaining these things in this forum. As I see now, the manual contains this information, yet I think it is difficult to read if one does not have some previous knowledge/understanding of "what it's all about".

Valery


 

