Beyond paragraph in OmegaT - question about segmentation (OmegaT support)

Thread poster: Marcos Zattar

Marcos Zattar
Germany
Member (2007)
German to Portuguese

Aug 15, 2008

Hello,

A lot has been discussed about segmentation in OmegaT, including the fact that *past versions* could not segment at sentence level and instead delimited segments by the paragraph mark.

Well, I have exactly the opposite problem: I need a segmentation method that ignores paragraph marks, because my file format has them in the middle of the sentences.

I checked the following site: http://www.omegat.org/en/howtos/new_filter.html

It teaches how to create new filters for "exotic" file types such as mine. My question: is it possible to create a filter that ignores paragraph marks and does not treat them as the end of a segment?

Please note that I cannot just erase those marks within my source language text, because that would mess up the formatting.

Thanks for any hints!

Kind regards,
Marcos

Samuel Murray

Netherlands
Member (2006)
English to Afrikaans

Soft returns instead of hard returns

Aug 15, 2008

Marcos de Miranda Zattar wrote:
Well, I have exactly the opposite problem: I need a segmentation method that ignores paragraph marks, because my file format has them in the middle of the sentences.

1. Open your file in MS Word (because OpenOffice.org is rubbish)
2. Do find/replace that finds ^p and replaces it with ^l
3. Save, and reopen in OpenOffice.org

This changes hard returns into soft returns, which OmegaT regards as inline formatting. Remember to use paragraph segmentation in OmegaT for this.

Oh, I assumed your document is ODT. If it is simply a plaintext file with empty lines between the paragraphs, you can ignore the above advice and simply go to Options -> File Filters, click Text, click Options, and try a different option.

Marc P (X)

German to English

Probably not practical

Aug 15, 2008

I'm happy to be corrected, but I don't think that it is practical. The reason is that "paragraph segmenting" in OmegaT does not simply mean that OmegaT uses the paragraph marker as the point at which it segments. This is true only of plain-text files. In formatted file formats, paragraph segmenting instructs OmegaT to present the content between the opening and closing paragraph-level tags as the segment, and to ignore the markup between paragraphs. Removing these tags as markers for segmentation would mean that the entire text would be presented as one segment (which is what you want, segmentation then being assured by the regex syntax for sentence-level segmenting), but it would also mean that paragraph-level formatting would be displayed.

To correct this, I think (I'm not absolutely sure) that you would have to modify the filter such that paragraph-level segmenting tags were ignored (except the paragraph tag itself, which you would have to treat as an inline tag). I think that is unlikely to be practical, though it may be theoretically possible.

This is also the reason why Samuel's suggestion won't work (Sorry, Samuel). Replacing the hard paragraph break with a line break merges the paragraphs either side of the break, meaning that they must then have the same paragraph-level formatting. As a result, they both assume the formatting of either the paragraph before or the paragraph after (in Word, the paragraph after, it seems, which is logical, since the paragraph marker in Word "contains" the paragraph-level formatting).

You could try experimenting with the primitive Abiword filter provided as an example in the HowTo: create a simple file in Abiword, then successively define the non-inline XML tags as non-translatable. If you do, you'll realize just how primitive the existing Abiword filter is. For one job, the effort is unlikely to be worthwhile; it might be interesting to have such a filter for the future, although the only time I've encountered such texts is in files converted from PDF, and I usually find it easier to delete the unwanted breaks manually and reformat as necessary.

Marc

Marcos Zattar
Germany
Member (2007)
German to Portuguese

TOPIC STARTER

Sample - the use of formatting marks to delimit segments

Aug 15, 2008

Thank you, Samuel and Marc, for the good ideas.

I experimented with Samuel's suggestion. It is indeed not practical, because chunks of text that are too big are treated as segments.

Marc: before I try your idea out, maybe you could have a look at my text. Below is a sample of it.

Please note that at the end of *every single line* there is a paragraph mark. The file below counts 74 paragraph marks.

The codes B1, AS, AL and others that appear at the beginning of the line are formatting info. I intended to use them in the filter for segmenting.

So, what do you think?

/HTEXT
/:OBJECT TERM
/:NAME CONSCHECK_GEOMETRIC_INSTANCE
/:ID T01
/:LANGUAGE P
/:FORM S_DOCU_PRINT
/:STYLE S_DOCUS1
/:FIRST-USER
/:FIRST-DATE 00 00 0000
/:FIRST-TIME 00 00 00
/:LAST-USER
/:LAST-DATE 00 00 0000
/:LAST-TIME 00 00 00
/:TITLE ' '
/:TITLE1 ' '
/:TITLE2 ' '
/MTEXT
U1Konsistenzprüfung für Lageinstanzen
ASMit dieser Funktion überprüfen Sie, ob die Menge einer Positionsvariante
oder einer Baukastenposition (bzw. des jeweiligen Änderungsstands) mit
der Anzahl von Lageinstanzen, die dem Objekt zugeordnet sind und
bestimmte Filterkriterien erfüllen, übereinstimmt.
AL¬e&
ALFür die Konsistenzprüfung muss die Menge am übergeordneten Objekt in der
Mengeneinheit ST (Stück) oder einer anderen zählbaren Einheit angegeben
sein.
ASUm die Konsistenzprüfung zu implementieren, verwenden Sie den
Funktionsbaustein PPEHI_PVINS_CHECK_CONSISTENCY.
AL¬e&
ALDie Eingabeparameter sind in diesem Funktionsbaustein ähnlich wie im
Funktionsbaustein PPEHI_PVINS_GET_INST_BY_OBJECT.
ASEingabeparameter
/:INCLUDE IV_MSG_HANDLING OBJECT DOKU ID TX
IV_MSG_HANDLING
/:INCLUDE IV_COMPONENT_VARIANT_ID OBJECT DOKU ID TX
/:INCLUDE IV_ASSEMBLY_RELATION_ID OBJECT DOKU ID TX
B1IV_CHANGE_NUMBER
ALWenn Sie den Änderungsstand des übergeordneten Objekts kennen, geben Sie
diesen hier an. Andernfalls lassen Sie diesen Parameter leer und geben
das Gültigkeitsdatum im folgenden Parameter an.
B1IV_VALIDITY_DATE

Marc P (X)

German to English

Sample - the use of formatting marks to delimit segments

Aug 15, 2008

Marcos,

I'm not sure whether your sample has reproduced properly by being pasted here (quoting it reveals some additional codes), but I would treat this particular case as plain text. It should be quite easy to write a script (or regex S&R) to delete paragraph breaks except where they occur before B1, AS and AL (and any other paragraph-level formatting codes). Put the breaks back in again after translating, either manually or again by using a script. Certainly much easier than writ...
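For anyone wanting to try the script route, here is a minimal Python sketch of the idea, not a tested tool. It rests on assumptions taken only from the sample pasted above: paragraph codes (U1, AS, AL, B1) sit at the very start of a line, lines starting with "/" are control lines that must stay on their own, and any other line is a continuation of the previous paragraph. The function names `join_wrapped_lines` and `rewrap` and the default width of 74 are hypothetical; the width is only a guess read off the longest sample line.

```python
import textwrap

# Paragraph-level codes seen in the pasted sample. Extend this tuple with
# any other codes the real file uses (assumption: a code is a line prefix).
PARAGRAPH_CODES = ("U1", "AS", "AL", "B1")


def join_wrapped_lines(text, codes=PARAGRAPH_CODES):
    """Delete line breaks except before control lines and paragraph codes."""
    codes = tuple(codes)
    out = []
    in_paragraph = False
    for line in text.splitlines():
        if not line.strip() or line.startswith("/"):
            # Blank lines and control lines (/HTEXT, /:INCLUDE ...) stay alone.
            out.append(line)
            in_paragraph = False
        elif line.startswith(codes) or not in_paragraph:
            # A paragraph code (or an uncoded line right after a control
            # line) opens a new paragraph.
            out.append(line)
            in_paragraph = True
        else:
            # Continuation line: glue it onto the paragraph in progress.
            out[-1] = out[-1] + " " + line.strip()
    return "\n".join(out)


def rewrap(text, width=74):
    """Put line breaks back after translating (the width is a guess)."""
    out = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("/"):
            out.append(line)
        else:
            out.extend(textwrap.wrap(line, width=width,
                                     break_long_words=False,
                                     break_on_hyphens=False))
    return "\n".join(out)


sample = "\n".join([
    "/MTEXT",
    "ASUm die Konsistenzprüfung zu implementieren, verwenden Sie den",
    "Funktionsbaustein PPEHI_PVINS_CHECK_CONSISTENCY.",
    "AL¬e&",
    "/:INCLUDE IV_MSG_HANDLING OBJECT DOKU ID TX",
    "B1IV_CHANGE_NUMBER",
])
joined = join_wrapped_lines(sample)
print(joined)          # one line per paragraph, ready for translation
print(rewrap(joined))  # breaks restored at the assumed width
```

Two caveats. A continuation line that happens to begin with the letters of a code (a line starting with "ASCII", say) would be mistaken for a new paragraph, so the joined file is worth a quick check before translating. And `rewrap` breaks lines by width only, so it will not reproduce the original break positions; whether the target system cares about that is something I cannot tell from the sample.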
