Creating a translation memory from PDF documents
Thread poster: Elisa Fernández Vic
Elisa Fernández Vic
Elisa Fernández Vic  Identity Verified
Spain
Local time: 06:21
Member (2015)
English to Spanish
+ ...
Jul 1, 2015

Hello all!
So, I have the following ingredients:
- A number of PDFs in English and Spanish.
- A Mac computer.
- OmegaT 3.1.8 (updating to 3.1.9 right now).
- No idea what I'm doing.
I want to create a translation memory for this project based on the PDF documents. How can I do this? Any help will be much appreciated.
Thanks in advance!


 
Susan Welsh
Susan Welsh  Identity Verified
United States
Local time: 00:21
Russian to English
+ ...
create TM Jul 1, 2015

First you have to convert the PDFs into .DOCX or .ODT format. I do this with ABBYY FineReader, which is software you have to buy. There are others that do the same thing, but that's what I use. (Maybe someone else will suggest something cheaper.)

Then you have to align the two files. LF Aligner is a good tool, and free: https://sourceforge.net/projects/aligner/
There are many other aligners. The alignment will give you your TM.
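
For anyone curious what the aligner actually produces: a TM in TMX format is just an XML file of paired segments. The following is a minimal sketch, not what LF Aligner does internally. The function names, the tool name in the header and the sample sentence pair are all made up for illustration; the language codes "en" and "es" are assumed from the original question.

```python
import xml.etree.ElementTree as ET

# ElementTree spells the xml:lang attribute with the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="es"):
    """Build a TMX 1.4 tree from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # One translation unit (tu) holds one variant (tuv) per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)


def write_tmx(pairs, path, src_lang="en", tgt_lang="es"):
    """Write the pairs to a UTF-8 TMX file at the given path."""
    tree = build_tmx(pairs, src_lang, tgt_lang)
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

Calling write_tmx([("Hello", "Hola")], "sample.tmx") would write a one-unit TMX file. In practice the aligner writes this file for you; the sketch only shows that there is nothing mysterious inside it.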


 
esperantisto
esperantisto  Identity Verified
Local time: 07:21
Member (2006)
English to Russian
+ ...
SITE LOCALIZER
ABBYY PDF Transformer Jul 1, 2015

Susan Welsh wrote:

I do this with ABBYY FineReader, which is software you have to buy. There are others that do the same thing, but that's what I use. (Maybe someone else will suggest something cheaper.)


If all you need is to extract text from PDF files, ABBYY PDF Transformer may be a solution. It's actually a trimmed-down version of FineReader, so its price is lower.


 
Meta Arkadia
Meta Arkadia
Local time: 11:21
English to Indonesian
+ ...
Cheaper like free Jul 1, 2015

Susan Welsh wrote:
...Maybe someone else will suggest something cheaper.

CasualTextractor should do the trick, especially if you extract to plain text, which is good enough for creating a TMX file. And it's free. It doesn't work for scanned ("dead") PDFs, though.

And yes, nothing can beat LF_Aligner, but I'm afraid it doesn't have a graphical interface in the Mac version, so you'll need the Terminal. The instructions Andras provides are very clear, though.

And then there's YouAlign, a free web service that also processes PDFs. That means you'd only have to upload the PDFs. It is very good, but perhaps not wise to use if you have signed any NDAs.

Cheers,

Hans




 
Dan Lucas
Dan Lucas  Identity Verified
United Kingdom
Local time: 05:21
Member (2014)
Japanese to English
Depends on the PDFs Jul 1, 2015

Elisa Fernández Vic wrote:
- A number of PDFs in English and Spanish.

If the PDFs are image-only PDFs, you will have to OCR them as described by others. OCR is not much fun, whatever software you use. Check the output files very carefully for errors.

However, machine-readable PDFs can usually be saved as plain text files. How do you know if it's a machine-readable file? If you can select text with the mouse, it's machine-readable. Sometimes the file is protected from copying or exporting, in which case you're out of luck.

If it's machine-readable and not protected, you can use the entirely free Sumatra PDF: simply choose "Save As..." from the File menu to save the text only. If the formatting is not too complex, saving to text might be both quicker and less effort than OCR.

[Screenshot: saving a publicly available Japanese document as text in Sumatra PDF]
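
Whichever route produces the text files, a rough first pass over them can point you at the lines most worth rechecking. This is only a sketch under my own assumptions: the two patterns below are heuristics I picked for illustration (a digit wedged between letters, and the Unicode replacement character), not an established test for OCR errors, and they will produce false positives on things like "H2O". It does not replace reading the output.

```python
import re

# Heuristic patterns (assumptions, not a standard) that often signal damage.
SUSPICIOUS = [
    # A digit between two letters, e.g. "c0mputer" or "traducci6n".
    (re.compile(r"[^\W\d_]\d[^\W\d_]"), "digit inside a word"),
    # U+FFFD appears when a character could not be decoded.
    (re.compile("\ufffd"), "replacement character"),
]


def flag_suspicious_lines(text):
    """Return (line_number, reason, line) for lines worth rechecking."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern, reason in SUSPICIOUS:
            if pattern.search(line):
                findings.append((number, reason, line))
                break  # one reason per line is enough for a first pass
    return findings
```

You would read each converted .txt file as UTF-8 and pass its contents to flag_suspicious_lines, then check the reported line numbers against the PDF.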

Regards
Dan



 
Didier Briel
Didier Briel  Identity Verified
France
Local time: 06:21
English to French
+ ...
LF Aligner Jul 1, 2015

Elisa Fernández Vic wrote:
So, I have the following ingredients:
- A number of PDFs in English and Spanish.
- A Mac computer.
- Omega T 3.1.8 (updating to 3.1.9 right now).
- No idea what I'm doing.
I want to create a translation memory for this project based on the PDF documents. How can I do this?

What you need is an aligner. You can use LF Aligner:
https://sourceforge.net/projects/aligner/

If your PDFs contain text (not images), you will be able to align directly from the PDF files.

Didier


 
Milan Condak
Milan Condak  Identity Verified
Local time: 06:21
English to Czech
Editable PDF Jul 1, 2015

Susan Welsh wrote:

First you have to convert the PDFs into .DOCX or .ODT format. (Maybe someone else will suggest something cheaper.)


LF Aligner can extract text from an editable PDF and create a TMX file (tutorial in Czech):

http://www.condak.net/tools/align-sentence/lf-align3-5/cs/02.html


Then you have to align the two files. LF Aligner is a good tool, and free: https://sourceforge.net/projects/aligner/


You can have files in two or more languages.

http://www.condak.net/tools/align-sentence/lf-align3-5/cs/00.html

Previously, the only option was import into an XLS file:

http://www.condak.net/tools/align-sentence/lf-align3-5/cs/04.html

Now it is also possible to use the built-in alignment editor.

Milan


 
Elisa Fernández Vic
Elisa Fernández Vic  Identity Verified
Spain
Local time: 06:21
Member (2015)
English to Spanish
+ ...
TOPIC STARTER
LF aligner issues Jul 1, 2015

Hello all,
Thank you very much for your valuable information.
I have managed to convert the files into .txt with UTF-8 and to download LF Aligner. But when I try to align the two files, there is an error that I don't know how to solve. I will copy it as it shows, only changing the client and file names for privacy reasons:

ERROR: Input file not found (No such file or directory) at line 52066
(file: /Users/elisafernandezvic/Desktop/TRADUCCIÓN/CLIENTES/CLIENT/MATERIAL\ DE\ REFERENCIA\ INGLÉS/\(583153876\)\ 3020\ File\ name\ EN.txt)
Try again!

What can I do to solve it? Thank you very much in advance.


 
Milan Condak
Milan Condak  Identity Verified
Local time: 06:21
English to Czech
Short path and short file name in ASCII Jul 1, 2015

Elisa Fernández Vic wrote:

ERROR: Input file not found (No such file or directory) at line 52066
(file: /Users/elisafernandezvic/Desktop/TRADUCCIÓN/CLIENTES/CLIENT/MATERIAL\ DE\ REFERENCIA\ INGLÉS/\(583153876\)\ 3020\ File\ name\ EN.txt)
Try again!

What can I do to solve it? Thank you very much in advance.


Elisa,

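Try a short path and short file names in plain ASCII, e.g. C:\name\EN.txt + second.txt (on a Mac, the equivalent would be a short folder such as /Users/name/align/).

Possible issues: the accented letters in TRADUCCIÓN/ and INGLÉS/, and the escaped parentheses and spaces in \(583153876\)\ 3020\ File\ name\ EN.txt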
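
If there are many files to rename, the "short path, short ASCII name" advice can be scripted. This is a minimal sketch assuming the problem really is the accented letters, spaces and parentheses in the path, which is only a guess from the error message. The function names and the working folder are made up for illustration; the script copies the files and leaves the originals untouched.

```python
import re
import shutil
import unicodedata
from pathlib import Path


def ascii_name(name):
    """Reduce a file name to ASCII letters, digits, dots and hyphens."""
    # Decompose accented letters (É becomes E plus a combining accent),
    # then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace spaces, parentheses and other characters with underscores.
    plain = re.sub(r"[^A-Za-z0-9.-]+", "_", plain).strip("_")
    return plain or "file"


def copy_to_short_path(source, work_dir):
    """Copy source into work_dir under an ASCII-only name; return the copy."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / ascii_name(Path(source).name)
    shutil.copy2(source, target)
    return target
```

With this, a name like "(583153876) 3020 File name EN.txt" becomes "583153876_3020_File_name_EN.txt", and the copy can live in a short folder of your choice before you point the aligner at it.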

Milan


 
Elisa Fernández Vic
Elisa Fernández Vic  Identity Verified
Spain
Local time: 06:21
Member (2015)
English to Spanish
+ ...
TOPIC STARTER
Success!! And now... how to merge tmx together? Jul 1, 2015

Thank you! I have managed to create my first translation memory and it seems to work properly! Do I get a cookie?
Next on the list: as I said, I have a bunch of texts to align. With this method, I will end up with a bunch of aligned TMX files. Do I just move them all to the TM folder in OmegaT, or do I have to merge them somehow?
Sorry if this is a stupid question - as I said, it's my first time trying to create my own TM from files.


 
Milan Condak
Milan Condak  Identity Verified
Local time: 06:21
English to Czech
Auto sub-folder Jul 1, 2015

Elisa Fernández Vic wrote:


Next on the list: as I said, I have a bunch of texts to align. With this method, I will end up with a bunch of aligned TMX files. Do I just move them all to the TM folder in OmegaT,


Elisa,

Put all your relevant TMs into the folder tm\auto\ (tm/auto/ on a Mac).

Then look at "Project Files"; you will see whether the TMXs are relevant or not. There is no need to merge the TMXs.
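
With a whole batch of aligned TMX files, the copying itself can be scripted. This is only a sketch: the function name and both folder arguments are made up for illustration, and it assumes the usual OmegaT project layout with a tm folder inside the project folder. As far as I know, OmegaT uses TMX files in tm for fuzzy matches and inserts exact matches from tm/auto automatically, so pass a different subfolder if you do not want that.

```python
import shutil
from pathlib import Path


def install_tmx_files(tmx_dir, project_dir, subfolder="auto"):
    """Copy every .tmx file from tmx_dir into <project_dir>/tm/<subfolder>."""
    destination = Path(project_dir) / "tm" / subfolder
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for tmx in sorted(Path(tmx_dir).glob("*.tmx")):
        # Files are copied one by one and kept separate; no merging is needed.
        shutil.copy2(tmx, destination / tmx.name)
        copied.append(tmx.name)
    return copied
```

The function returns the names of the files it copied, so you can compare that list with what OmegaT shows for the project.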

Milan


 
Elisa Fernández Vic
Elisa Fernández Vic  Identity Verified
Spain
Local time: 06:21
Member (2015)
English to Spanish
+ ...
TOPIC STARTER
Thank you very much! Jul 2, 2015

Milan Condak wrote:

Elisa Fernández Vic wrote:


Next on the list: as I said, I have a bunch of texts to align. With this method, I will end up with a bunch of aligned TMX files. Do I just move them all to the TM folder in OmegaT,


Elisa,

Put all your relevant TMs into the folder tm\auto\ (tm/auto/ on a Mac).

Then look at "Project Files"; you will see whether the TMXs are relevant or not. There is no need to merge the TMXs.

Milan


So it was actually this easy! Thank you very much for your help!


 
Vaclav H
Vaclav H
Czech Republic
French to Czech
Thx for the topic Jul 3, 2015

and for the answers!
I'm working on my TM and most of the documents are in PDF.
This made it so much easier and faster. Thank you all!


 

