Optical character recognition

From ProZ.com Wiki

Page created by Jared Tabor.

Current revision as of 17:00, 26 November 2010

Note: This article is a joint project of ProZ.com members and guests.

Optical character recognition (OCR) converts handwritten, typewritten, scanned or otherwise non-editable text into an editable format. It is usually performed by dedicated OCR software. Although the recognition itself is automatic, some human intervention is required in most cases, and extensive intervention may be needed depending on the type and quality of the document to be recognized.

Working with non-editable documents

As translators, we often receive PDF documents for translation. A fair number of translators refuse to translate PDFs, usually because they have no OCR software at their disposal or because the document is too long to retype, but an increasing number of translators use OCR to process these documents. In fact, if the PDF is in image format (a scan, for example), OCR is the only way to convert it into an editable format short of retyping it. Image files such as JPEG, BMP and TIFF can also be converted using OCR.

While a straightforward, entirely automated OCR pass may be all it takes to produce a translatable file, it is increasingly common that hitting a button is not enough. One such case is when the original file is not to be translated in its entirety: you first need to define the parts you wish to convert and the parts you wish the OCR process to ignore (typically, you would ignore pictures and convert only the text). Defining these process zones and ignore zones is usually referred to as zoning. Zoning is also needed when the document has complicated formatting, such as a three-column layout, tables or text containing pictograms. Handwritten, photocopied or faxed documents are, in addition, often of poor quality: speckled, too pale or too dark. In such cases, the recognition process cannot produce an error-free output file, and you will have to correct the recognition results manually.

For all these reasons, buying OCR software will not in itself take care of all your OCR needs. You will need to invest time in learning to use the software correctly, and even then, converting each document will most likely take some time. It is therefore a good idea to charge extra for OCR services, that is, to charge a higher price for translating PDFs and other non-editable files than for translating source documents in editable formats.
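As a rough illustration of such a surcharge, the arithmetic could look like the sketch below. The function name quote, the per-word rate and the 25% surcharge are hypothetical figures chosen for the example; the article does not recommend any particular rate or percentage.

```python
def quote(word_count, rate_per_word, ocr_surcharge=0.0):
    """Return the price of a job, rounded to two decimals.

    ocr_surcharge is a fraction of the base price (0.25 means 25% extra)
    and is meant to cover the time spent on zoning and manual correction.
    """
    if word_count < 0 or rate_per_word < 0 or ocr_surcharge < 0:
        raise ValueError("word count, rate and surcharge must not be negative")
    return round(word_count * rate_per_word * (1 + ocr_surcharge), 2)


# Hypothetical job: 2000 words at 0.10 per word.
print(quote(2000, 0.10))        # editable source file
print(quote(2000, 0.10, 0.25))  # same job delivered as a scanned PDF
```

A percentage surcharge is only one possible model; a flat fee per page or an hourly charge for the conversion work would serve the same purpose.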

OCR software

Several OCR software packages are available on the market, ranging in price from a few dozen to a few thousand dollars. Before making a purchase, it is therefore wise to compare the available products and find one that satisfies your OCR needs without being too hefty an investment for the benefits it brings. Keep in mind that a package being more popular than others does not necessarily make it the best suited to you. Most OCR software can be downloaded as a free demo or trial version - make use of these to decide for yourself which one suits you best.

The following is a list of the most widely used OCR software in the translation industry:

Name | License | Operating systems | Notes
ABBYY FineReader | Commercial | Windows | For working with localized interfaces, corresponding language support is required.
OmniPage | Commercial (Nuance EULA) | Windows | Product of Nuance Communications
Readiris | Commercial | Windows, Mac OS | Product of I.R.I.S. Group of Belgium. Asian and Middle Eastern editions.
SimpleOCR | Freeware and commercial versions | Windows |
TextBridge | Commercial | Windows, Mac OS | Product of Nuance Communications
