Working from PDF converting to Word documents
Thread poster: Will Volny
Will Volny
Will Volny  Identity Verified
United States
Local time: 23:38
Member (2018)
Czech to English
+ ...
Oct 24, 2018

On occasion I work with a PDF that an agency sends, which I convert to Word and then translate using Word. So far, I am not so pleased with this type of format. It very well may be that I have more to learn in this conversion process, because the formatting is 'dodgy' and I have unknowingly sent to an agency a poorly formatted document. They mentioned that some of the lines were 'gibberish,' which, needless to say is horrifying to me because in my own copy of the text, this 'gibberish' was not there. It is difficult or impossible to format in the normal way that an original Word document can be. I often cannot make the text formatting match the original, and there is the mismatching of letters on occasion (such as 'e's becoming 'c's, or, for example, AAST turning into MST).
Has anyone else come across this issue? What might be done about it? What have you found that works? Some projects are last minute and urgent on the part of the agency and work needs to progress rapidly - and all they have is a PDF.
I welcome your thoughts as I cannot bear the thought of submitting such work, most especially when I don't even see it from my end.
regards,
Will


 
Francis Marche
Francis Marche  Identity Verified
France
Local time: 06:38
English to French
+ ...
My Sympathy Oct 25, 2018

I can only say I sympathize. How I would hate to see my American Association for the Surgery of Trauma (AAST), if not (God forbid!) my Arab Academy for Science and Technology, turning into a Maladie Sexuellement Transmissible (MST) before my eyes and in a flash!

OCR apps normally wouldn't do that to straight text (no pics or charts) on a clear day and without advance warning, though. Try removing any fancy lettering in titles and anything that resembles an object (pics, diagrams, etc.) and see what you get. The PDF-to-Word conversion process sometimes yields text in tiny boxes whose borders are not visible. Cut your text from these boxes and paste it onto a fresh Word page. Good luck.


 
Tony M
Tony M
France
Local time: 06:38
Member
French to English
+ ...
SITE LOCALIZER
Be more assertive! Oct 25, 2018

I generally refuse to work from PDFs, which are only a document exchange format, not a word processing format!

This leaves the agency to deal with the problem!

When for whatever reason I am obliged to use OCR to convert the document, I use ABBYY FineReader, which is the only one I have found to give consistently high accuracy.

One of the problems is that it attempts to recreate the formatting of the original document, but uses all sorts of subterfuges to do so, including "random" text boxes, multiple columns, curious styles, etc.

The first thing I do is tell the agency that I can only supply "raw" text in doc format, and they or their client will have to take care of the formatting afterwards — I'm a translator, not a secretary or DTP specialist!

Then when I do the OCR conversion, I choose "plain, unformatted text"; that gets rid of many of the problems mentioned above, which can interfere with CAT tool operation. I then check and if necessary run a search-&-replace to get rid of the spurious returns that often appear at the end of lines; I also spell-check the document (in the correct language!) and read through it carefully to check for blatant errors such as you describe.
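
For anyone who would rather script the search-&-replace step than do it by hand, here is a minimal Python sketch of the idea. It is only an illustration, not something taken from FineReader or any CAT tool: the function name `unwrap_lines` is made up, and it assumes that a blank line marks a real paragraph break while any single return inside a paragraph is spurious. It does not try to re-join words hyphenated across a line end.

```python
import re

def unwrap_lines(text: str) -> str:
    """Join lines split by spurious returns, keeping blank-line paragraph breaks."""
    # Normalise Windows and old Mac line endings to "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines separate real paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Any return left inside a paragraph is treated as spurious
        # and replaced by a single space.
        para = re.sub(r"\s*\n\s*", " ", para).strip()
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)
```

Lists, tables and addresses, where a single return is meaningful, would be flattened by this, so the result still needs a read-through against the PDF.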

Once I have (relatively!) clean text, I can then start my job of translation — but always cross-checking carefully with the PDF as I go along.

Clearing the formatting means that you get rid of attributes like 'white' text you might not have spotted, odd text boxes, etc. etc. I should mention that even with an editable PDF, it is often easier just to OCR it, rather than try to copy the text out of it, as doing that often results in sections being taken out of order, unless you meticulously select each text block and copy-paste them each individually.

If you have had problems reported with certain documents in the past, then it would be well worth your while to go back and check these in order to understand where the problems arose: your incorrect acronym does indeed sound like an OCR error, but you should have spotted that at once as you went through (didn't you ask yourself what it meant, and refer back to the PDF to check?). It also helps if you learn to recognize common OCR fault mechanisms, so you are on the alert for that sort of thing: for example, 'd' being read as 'cl' and vice versa, or in your case AA being read as M. You can anticipate a lot by thinking about the letter-forms of the particular typefaces used.
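
The look-alike pairs mentioned in this thread can also be turned into a rough automated check. The Python sketch below is hypothetical throughout: the `CONFUSIONS` list, `ocr_variants` and `find_suspects` are names invented for illustration, and it assumes you already have a short list of terms you expect in the document (acronyms, names). It generates single-substitution misreadings of each term and reports any that turn up in the OCR output.

```python
import re

# Look-alike pairs mentioned in this thread; extend them for the typeface at hand.
CONFUSIONS = [("AA", "M"), ("d", "cl"), ("e", "c"), ("8", "B")]

def ocr_variants(term):
    """Return plausible misreadings of `term`, one substitution at a time."""
    variants = set()
    for a, b in CONFUSIONS:
        # Try each pair in both directions (e.g. AA read as M, and M read as AA).
        for src, dst in ((a, b), (b, a)):
            start = term.find(src)
            while start != -1:
                variants.add(term[:start] + dst + term[start + len(src):])
                start = term.find(src, start + 1)
    variants.discard(term)
    return variants

def find_suspects(text, known_terms):
    """Map each known term to the misread variants found as whole words in `text`."""
    words = set(re.findall(r"\w+", text))
    hits = {}
    for term in known_terms:
        found = sorted(ocr_variants(term) & words)
        if found:
            hits[term] = found
    return hits
```

With the acronym from the original post, `find_suspects("The MST guidelines were published.", ["AAST"])` reports `MST` as a possible misreading of `AAST`. A hit is only a prompt to look at the PDF again; a variant may also be a legitimate word in its own right.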


 
achisholm
achisholm
United Kingdom
Local time: 05:38
Italian to English
+ ...
I use ABBYY and apply a surcharge, if necessary. Oct 25, 2018

Most of the PDFs I receive are from scanned documents, therefore there is no text in the document at all. So they all have to be processed by OCR. If it’s a lot of work, I charge extra. Or I offer the chance for the client to provide an editable document, and hold them to the quality of the document they provide.

 
Tina Vonhof (X)
Tina Vonhof (X)
Canada
Local time: 22:38
Dutch to English
+ ...
Why convert? Oct 25, 2018

I use ABBYY only to get a word count. Very occasionally, if the conversion contains only well-formatted text and no numbers or special characters, text boxes, columns, or page breaks, conversion may save time, but in my experience 'fixing' a converted document takes more time than simply starting from scratch and saving myself the aggravation.




[Edited at 2018-10-25 17:02 GMT]


 
patyjs
patyjs  Identity Verified
Mexico
Local time: 22:38
Spanish to English
+ ...
I agree with Tina Oct 25, 2018

I have worked, or tried to work, with converted text; sometimes the formatting is so garbled it's practically impossible. I much prefer to start from scratch. I usually look at the properties of the PDF to find out which fonts have been used. Often they are available online although, personally, I would never pay for a new font. If the agency/client isn't willing to fork out, a Google search will come up with alternative, free fonts that are similar. And of course, informing the agency/client assures them you've done your best to match the source.

I've learned a lot about DTP from trial and error, and although it is time-consuming and frustrating at first, it can be done, and knowing that I can deliver a competent rendition of the original makes it enjoyable and satisfying.

Best of luck...


 
Christel Zipfel
Christel Zipfel  Identity Verified
Local time: 06:38
Member (2004)
Italian to German
+ ...
Save the document as .txt Oct 25, 2018

Will Volny wrote:

... the formatting is 'dodgy' and I have unknowingly sent to an agency a poorly formatted document. They mentioned that some of the lines were 'gibberish,' which, needless to say is horrifying to me because in my own copy of the text, this 'gibberish' was not there.


That's what I do because I was sick of having this issue with my customers when on my side the document looked quite OK. The penalty is that you lose all formatting, of course, which can be painful in a long document... But at least I haven't had any complaints since. I wonder whether there are other, simpler solutions. I, too, use ABBYY and always apply a surcharge (which in some cases unfortunately proved to be ridiculous compared to the time involved).


 
Maxi Schwarz
Maxi Schwarz  Identity Verified
Local time: 23:38
German to English
+ ...
My experience Oct 26, 2018

90% of the material that I translate is in PDF form. I subscribe to a conversion program (I think with Adobe) and I convert it to the next-to-newest version since the newest is buggy, choosing the appropriate language. I translate in a clean Word file, typing as I usually do, creating my own formatting, but I will cut and paste things over such as numbers or proper names. I ditch the formatting since that can get weird. It works rather smoothly.

Some checking needs to be done. Numbers like "8" can turn into "B" and similar, but it still saves time.
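
Part of that checking can be roughed out in a few lines of Python. This is a sketch under assumptions, not a feature of any conversion program: the function name `suspicious_numbers` is invented, and apart from B (for 8) the look-alike letters in the pattern (O, I, l, S, Z) are guesses of mine rather than cases reported in this thread. It lists tokens that mix digits with one of those letters.

```python
import re

# A token is suspect if it contains at least one digit AND at least one
# letter that is easily confused with a digit (assumed set: B, O, I, l, S, Z).
SUSPECT = re.compile(r"\b(?=\w*\d)(?=\w*[BOIlSZ])\w+\b")

def suspicious_numbers(text):
    """Return tokens in `text` that mix digits with digit look-alike letters."""
    return SUSPECT.findall(text)
```

Genuine codes such as part numbers or postcodes will be flagged too, so the list is something to compare against the PDF, not a list of confirmed errors.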


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 06:38
Member (2006)
English to Afrikaans
+ ...
@Will Oct 26, 2018

Will Volny wrote:
I have unknowingly sent to an agency a poorly formatted document. They mentioned that some of the lines were 'gibberish,' which, needless to say is horrifying to me because in my own copy of the text, this 'gibberish' was not there.


My first thought is (since this is an OCR'ed document) that your file contains text boxes, and that your word processor is set not to display text boxes. For example, in Word 2003, you have to go Tools > Options > View and make sure that "Drawings" is ticked. It is unticked by default (and Word often unticks it again by itself -- quite annoying).

If you find that there are too many text boxes, you might have to change your OCR settings to create a less well-formatted Word file which does not create text boxes, but then you're going to spend more time re-formatting the file manually.
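
If the converted file is saved as .docx, the presence of text boxes can also be checked without relying on Word's display settings. The following is a minimal Python sketch, assuming a standard .docx package whose main body is `word/document.xml` and which uses the conventional `w:` prefix; the function name `count_text_boxes` is made up, and old binary .doc files are not covered.

```python
import zipfile

def count_text_boxes(docx_file):
    """Count text-box content elements in the main body of a .docx file.

    `docx_file` may be a path or a binary file-like object.
    """
    # A .docx file is a zip archive; the main text lives in word/document.xml.
    with zipfile.ZipFile(docx_file) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # The text of each text box is stored inside a <w:txbxContent> element.
    return xml.count("<w:txbxContent")
```

Any result above zero means there is text sitting in boxes. The exact figure may overstate the number of boxes, since Word can store a second, fallback copy of the same box, and headers and footers are not inspected here.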

It is difficult or impossible to format in the normal way that an original Word document can be. I often cannot make the text formatting match the original...


You're relying too much on the OCR program to do the hard work for you. An OCR program should only be used for converting the text, and you are supposed to take care of the formatting (and layout) manually. The OCR program will do its best, but it is programmed to focus on creating a document that *looks* the same as the original, and does not focus on making a document that is easily editable or translatable.

..., and there is the mismatching of letters on occasion (such as 'e's becoming 'c's, or, for example, AAST turning into MST).


That is normal with OCR. The OCR program is always guessing, and it is up to you (the user) to double-check the conversion to see what needs fixing.

You can sometimes improve the recognition by selecting the correct language. Some OCR programs then use a spell-checker to assist in the guesswork, although they won't fix spelling errors that they find.

Some projects are last minute and urgent on the part of the agency and work needs to progress rapidly - and all they have is a PDF.


Perhaps such agencies believe that you will re-create the document on the fly, i.e. simply type the translation directly into an empty Word document and re-create the formatting and layout as you go.

--

Francis Marche wrote:
OCR apps normally wouldn't do that to straight text (no pics or charts) on a clear day and without advance warning, though.


Yes, some OCR programs have the option to highlight letters that the program is unsure of, but in my personal experience a file either contains very many or very few of these instances -- either too many false positives, or too many false negatives -- and in both such cases highlighting possible errors isn't really useful for me.

Tony M wrote:
I tell the agency that I can only supply "raw" text in doc format, and they or their client will have to take care of the formatting afterwards ... Then when I do the OCR conversion, I choose "plain, unformatted text"...


I don't go quite as far as Tony. I tell the client that the translation's layout will be an approximation of the original, so that the various pieces of text are located more or less in the same locations on the page. This makes it easier for a DTP person who can't speak the language to redo the layout without wondering which piece of text belongs where.

But just like Tony I also usually start with the plain, unformatted text option. I then recreate the layout and paste the text into the layout, before starting on the actual translation. In other words, I first re-create or fix the source text file before I even start on the translation. I don't fix OCR errors (i.e. incorrectly recognised letters) beforehand, though, unless I believe it will seriously affect fuzzy matching.

In cases such as diplomas and certificates, where I want to mimic the layout more precisely, I take a screenshot of the PDF file, then use that image as a background image on the page, then paste the text on the page, and then move the text around so that it more or less lines up with the background. To do this in Word 2003, go to Format > Background > Printed Watermark > Picture watermark. When I'm satisfied that the text is adequately placed, I remove the watermark again.

I should mention that even with an editable PDF, it is often easier just to OCR it, rather than try to copy the text out of it, as doing that often results in sections being taken out of order, unless you meticulously select each text block and copy-paste them each individually.


That is true. Older versions of Acrobat Reader (prior to version 5.5, I think) had an option to select blocks. You'd draw a rectangle around the text that you want to copy, and only the text inside the rectangle would get copied. Later versions of Acrobat Reader don't have this feature. I have yet to find another PDF viewer that can copy text like that.



[Edited at 2018-10-26 09:54 GMT]


 
Will Volny
Will Volny  Identity Verified
United States
Local time: 23:38
Member (2018)
Czech to English
+ ...
TOPIC STARTER
Thank you. Oct 31, 2018

Immense amount of information! I thank you very much. There are so many approaches to this issue and I feel well-armed now.

 
B D Finch
B D Finch  Identity Verified
France
Local time: 06:38
French to English
+ ...
Proofreading? Oct 31, 2018

Will Volny wrote:
I have unknowingly sent to an agency a poorly formatted document. They mentioned that some of the lines were 'gibberish,' which, needless to say is horrifying to me because in my own copy of the text, this 'gibberish' was not there.


I don't understand how you could have unknowingly sent a "poorly formatted document" or "gibberish", unless you failed to proofread the translated document that you sent to your client. Quite often, it's necessary to resize text boxes to reveal text that has overrun. That's best done by altering the text boxes in question so that they automatically fit the text, run on to the next page etc. You can, of course, fix the formatting of your document so that it displays the same way whatever software is used by your client, by making a PDF of your translated document.

You don't say what software you use. I find that Wordfast will deal perfectly adequately with simple, editable PDFs created from Word documents. I use ABBYY FineReader for anything more complicated and I charge an hourly rate for my time spent doing that conversion. I don't generally agree to translate poor-quality PDFs. I always compare the finished document to the original PDF to check that the layout matches up.

The worst ever PDF job I had was a few years ago. It was a very large job with very poor quality, scanned PDF files and I asked the outsourcer whether the original Word files might be available. He told me they weren't. So, I didn't use a CAT tool. When I delivered the first translated file and my invoice, he suddenly wanted to change the terms of payment, so that he only paid me once the end client had paid. I refused this and we ended up agreeing that I work directly for the end client (a large firm of solicitors). As soon as I was in touch with the end client, I mentioned the problem of the poor quality PDF files and they immediately offered me the original Word files!

[Edited at 2018-10-31 13:04 GMT]

[Edited at 2018-10-31 13:07 GMT]


 

