How to extract terminology from a Word doc
Thread poster: Paula Ribeiro
Paula Ribeiro  Identity Verified
Local time: 11:45
English to Portuguese
+ ...
Dec 4, 2012

Hello everyone.

I am currently deciding which CAT tool to buy for a specific company. I've used Trados Studio and I am also trying Wordfast. The problem is that I want to take advantage of the large body of past translations done by the previous translator, who did not use any CAT tool, and I really want to organize it. What I have is, for example, the original file in PT and the translated file in ES, but these are huge PPTs with pictures and a lot of formatting, and they come to 5177 TUs when I try to align them in WinAlign. Because they are too big, WinAlign simply crashes after supposedly aligning the project. So I have all these past translations but no TM, and I want to create one for each language pair.

What I would like to know is: is it possible to open an original to translate, in either Wordfast or Trados, and then pull the terminology out of the translated file, which can be a DOC or a PPT? I would still have to translate it again, but I would have the Spanish terminology in a TM. Am I making myself clear? Or do I have to create a glossary anyway?

This is especially difficult as I am really a beginner with all these extra functions...


 
Tony M
France
Local time: 12:45
Member
French to English
+ ...
SITE LOCALIZER
Align Dec 4, 2012

I too am a relative novice here, so the following suggestions are only vague ideas.

I think going to all the trouble of re-translating seems like an awful lot of work!

I would set about it by using Werecat to extract all the text from the PPTs into DOC files (hoping and praying that they do stay at least reasonably well aligned, depending on how careful the previous translators were!), then strip out the tags to leave just the wanted text, and then use the PlusTools 'Align' function. I have never had any trouble with that crashing on even pretty large docs, but if necessary, manually split the main doc into smaller chunks; it won't be very difficult to stick them back together again at TM time.

I hope that helps!
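
The "split into smaller chunks" step can also be done on already-extracted text rather than by hand. The following is only a rough sketch in Python, assuming the source and target text have each been read into a list of paragraphs that are roughly parallel; the function name and the default chunk size are made up for illustration and have nothing to do with PlusTools or Werecat.

```python
def split_into_chunks(source_segments, target_segments, chunk_size=500):
    """Yield (source_chunk, target_chunk) pairs cut at the same positions.

    Both lists are cut at identical offsets, so this only makes sense when
    the two files follow the same paragraph order (an assumption).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    longest = max(len(source_segments), len(target_segments))
    for start in range(0, longest, chunk_size):
        end = start + chunk_size
        yield source_segments[start:end], target_segments[start:end]
```

Each pair of chunks could then be saved as two small files and aligned separately; the resulting TMs are merged afterwards.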


 
John Holland  Identity Verified
France
Local time: 12:45
French to English
Save as text Dec 4, 2012

Have you tried saving the PPT files as plain text and then aligning just the text files?

For example, see the last option, "Export all Text in PowerPoint Slide including Text in Text Box," on this page:
http://www.lytebyte.com/2009/08/07/how-to-export-powerpoint-text-contents-to-word/
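
If the files are in the newer .pptx format (not legacy .ppt), the same "text only" idea can be sketched in Python without PowerPoint at all, because a .pptx file is a zip archive whose slides are XML files with the text held in `<a:t>` elements. This is a minimal sketch, not a tested tool: the function names are made up for illustration, slides are read in file-number order (which usually, but not always, matches the order shown in PowerPoint), and speaker notes are ignored.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for paragraphs (<a:p>) and text runs (<a:t>).
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"


def slide_number(name):
    """Return the N in 'ppt/slides/slideN.xml', or None for other zip entries."""
    match = re.fullmatch(r"ppt/slides/slide(\d+)\.xml", name)
    return int(match.group(1)) if match else None


def extract_pptx_text(path):
    """Return a list of (slide_number, [paragraph strings]) for a .pptx file."""
    slides = []
    with zipfile.ZipFile(path) as pptx:
        names = [n for n in pptx.namelist() if slide_number(n) is not None]
        for name in sorted(names, key=slide_number):
            root = ET.fromstring(pptx.read(name))
            paragraphs = []
            for para in root.iter(A_NS + "p"):
                # Join all runs of one paragraph so a sentence is not split
                # wherever the formatting changes.
                text = "".join(run.text or "" for run in para.iter(A_NS + "t"))
                if text.strip():
                    paragraphs.append(text.strip())
            slides.append((slide_number(name), paragraphs))
    return slides
```

Writing each paragraph on its own line of a plain text file, once for the PT file and once for the ES file, gives two small text files that an aligner should handle far more easily than the original presentations.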


 
Tony M
France
Local time: 12:45
Member
French to English
+ ...
SITE LOCALIZER
Maybe simple is better? Dec 4, 2012

John Holland wrote:
For example, see the last option, "Export all Text in PowerPoint Slide including Text in Text Box,"


That seems like an awfully cumbersome way of doing it, John, and knowing the very variable results you can get recovering text from PDFs, I'd be somewhat mistrustful of that.

I think the Werecat solution is much simpler and less prone to problems; basic text formatting will be kept, but of course page layout will not.

NB: I have no idea if Werecat still works in Office 2007 / 2010. I use it very successfully in Office XP, in conjunction with Wordfast Classic, although it functions totally independently, of course.


 
John Holland  Identity Verified
France
Local time: 12:45
French to English
It's all in the tools you have and know... Dec 4, 2012

I've never used Werecat.

I'm a free software person, and I use tools that run on Linux. For this kind of situation, I've used a command line program called catppt to extract text, then LF Aligner to align the files and export as TMX, which I then use with OmegaT.

catppt: http://www.wagner.pp.ru/~vitus/software/catdoc/
LF Aligner: http://sourceforge.net/projects/aligner/
OmegaT: http://www.omegat.org/

For the files I've had, that was a good work flow.

I just mentioned the option of using MS Office to extract text because it uses a tool that Paula presumably has already. She might not have Werecat available, and she most likely does not have any of those Linux-y tools...

Is there a PPT text extraction tool in the SDL universe?
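
For anyone curious what the "export as TMX" step of that workflow produces, here is a rough Python sketch that writes already-aligned pairs to a TMX 1.4 structure using only the standard library. It is not what LF Aligner does internally; the function name, the `creationtool` value and the default language codes are placeholders chosen for illustration, and the pairs are assumed to be correctly aligned already.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace; ElementTree writes it as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="pt-PT", tgt_lang="es-ES"):
    """Build a TMX 1.4 tree from an iterable of (source, target) strings."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",   # placeholder name
        "creationtoolversion": "0.1",         # placeholder version
        "segtype": "paragraph",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)
```

Something like `build_tmx(pairs).write("memory.tmx", encoding="utf-8", xml_declaration=True)` would then save a file that CAT tools with TMX import should be able to read, though whether a given tool accepts it as-is would need to be checked.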


 
Tony M
France
Local time: 12:45
Member
French to English
+ ...
SITE LOCALIZER
Free software Dec 4, 2012

John Holland wrote:
I'm a free software person, ...


Me too!

Werecat is simply a plug-in that works under Word (and with PPT), and is free to download, even though no longer supported by its creator.

It's very basic, but a REALLY powerful little utility that takes a very short time to extract all text from the text boxes in either a DOC or a PPT — and will then neatly put them all back in again for you later if you want!


 
Guillaume Chareyron  Identity Verified
France
Local time: 12:45
German to French
+ ...
MemoQ or AlignFactory Dec 4, 2012

Hi Paula,

Do you know memoQ? It has a feature called LiveDocs, a kind of alignment function for past translations, which could be quite useful in your case.

But my favourite tool for this kind of work is AlignFactory (from Terminotix), which is fast and reliable. From pairs of documents, it produces bitexts (HTML with source and target side by side) or a translation memory (TMX).

It works very well with PPT files too.

If you want, just send me two small PPT files and I will send you the results back, so that you can see whether it suits your needs.

You can probably ask for a demo version too.

Cheers
Guillaume


 
John Holland  Identity Verified
France
Local time: 12:45
French to English
Extract the text before aligning Dec 4, 2012

Tony M wrote:

John Holland wrote:
I'm a free software person, ...


Me too!

Werecat is simply a plug-in that works under Word (and with PPT), and is free to download, even though no longer supported by its creator.


I was imprecise. I meant this kind of free software: https://en.wikipedia.org/wiki/Free_software

In any case, Werecat does sound like a possible alternative, especially if the included text export features of MS Office are not adequate for Paula's PPTs.

The main idea so far here is to extract the text from the PPTs in one way or another and then use Winalign on the extracted text, if that hasn't already been tried.

Guillaume's suggestion of AlignFactory from Terminotix sounds like a good option for PPT files, too.


 
Michael Beijer  Identity Verified
United Kingdom
Local time: 11:45
Member (2009)
Dutch to English
+ ...
I always recommend AlignFactory Light Dec 4, 2012

Hi Paula,

I have tried many aligners, but none of them is as good as AlignFactory Light. I would recommend you email Jean-François Richard of Terminotix ([email protected]) for a free demo and try it yourself. AlignFactory Light supports .ppt and .pptx.

info: http://www.terminotix.com/index.asp?name=AlignFactory_Light&content=item&brand=1&item=11&lang=en

memoQ's aligner (LiveDocs) is also very good and supports PowerPoint files, again without having to extract anything first.

info: http://kilgray.com/memoq/60/help-en/index.html?translation_grid.html

Michael

[Edited at 2012-12-05 00:02 GMT]


 
Gabriel Catalan  Identity Verified
Spain
Local time: 12:45
English to Spanish
ppt size Dec 5, 2012

Paula, WinAlign is most probably crashing because of the size of the PPT files, and that size is caused by the embedded pictures.
Have you tried removing the pictures from the PPT files to reduce their size?
You really only need the text to be aligned.

Regards.


 
Paula Ribeiro  Identity Verified
Local time: 11:45
English to Portuguese
+ ...
TOPIC STARTER
PPTs too big and with too many pictures Dec 6, 2012

Hello everyone,

Thank you so much for your input! I think I'll probably try both methods, Werecat and AlignFactory.

Guillaume, I'd love to be able to send you the files, but as I can't disclose any confidential files, I really cannot send these files anywhere... thank you though!

And Gabriel, the PPTs are 120 slides long, sometimes with pictures layered over pictures... The point here is actually to get around that exact problem. I'm translating while trying to organize things whenever I get spare time, so if I'm looking to save time, that really wouldn't help me...

Again, thank you, people. Let's see how I do... As I need approval to download any app to the company's computer, I'll probably get around to actually doing it next week :/ I'll try at home during the weekend!


 

