Count words in PDF

To count the words in a PDF, you first have to convert the PDF to a text format, or extract its text into a text format.

Issues with PDFs

  • A PDF (Portable Document Format) file is an export file. It is meant to be a non-editable file that ensures that a document prints the same way from any computer.
  • Content in a PDF can be either selectable text or embedded images. Most PDFs come with fonts embedded, which means that any text in them displays at the highest quality regardless of the size of the printout or the zoom factor. Embedded images such as photos and scans are typically raster images, which means that if you zoom in or print larger, the quality of the image deteriorates.
  • Sometimes embedded images in a PDF may look poor on screen, but if a PDF is optimised for printing, the printed version will look much better.
  • Text can be included in a PDF file as an image (in other words, it's not really text in the word processing sense, but only a picture of text). Such pieces of text suffer from the same problems as images, namely that zooming in or printing at a custom size results in poorer display.
  • Text in PDFs can be locked so that you can't copy it. Such text displays at high quality (because it is real text), but if you copy and paste it, you end up pasting empty spaces.
  • Text and images in a PDF are not necessarily arranged in the underlying code in the same order as they are arranged on screen. For this reason, if you select and copy all text in a PDF, you may find that sections of the text appear in places where you didn't expect them. The best way to copy text from a PDF, if you're doing it manually, is to use the "block" or "column" select tool, which allows you to select and copy individual blocks of text.
  • Some PDF2DOC programs simply copy the text and reformat it by guessing where the line breaks should be. In such documents, pieces of text may appear out of place because of the above issue.
  • In a PDF, there is no such thing as a "paragraph". What appears to be a paragraph on screen is actually just a bunch of lines grouped together. For this reason, if you copy text from a PDF, text in a "paragraph" will have hard returns at the end of each line.
  • If you require a text with word wrap, you have to copy and paste the text from the PDF, and then remove the unnecessary hard-returns manually.
  • If a PDF2DOC program doesn't perform OCR, it will only be able to copy text from the PDF that is real text, not pictures of text. OCR programs may be able to distinguish between real text and pictures of text, but ultimately that doesn't matter because OCR tries to recognise the text in pictures of text anyway.
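
The manual clean-up of hard returns described above can be partly automated. The following is a minimal sketch, assuming the copied text separates paragraphs with at least one blank line (not every PDF viewer or converter does this). The function name unwrap_hard_returns is made up for illustration, and words hyphenated across line ends are not rejoined.

```python
import re


def unwrap_hard_returns(text: str) -> str:
    """Join the lines of each paragraph into one line.

    Assumption: paragraphs are separated by one or more blank lines.
    Every other line break is treated as an unwanted hard return.
    """
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split into blocks wherever a blank (or whitespace-only) line occurs.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # split() with no argument breaks on any whitespace, including the
        # hard returns, so joining with a space rebuilds one wrapped line.
        words = block.split()
        if words:
            paragraphs.append(" ".join(words))
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    copied = "This is a\nparagraph that\nwas wrapped.\n\nSecond\nparagraph."
    print(unwrap_hard_returns(copied))
```

If the copied text has no blank lines between paragraphs, the whole text will come out as a single paragraph, and the paragraph breaks still have to be restored by hand.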

PDF word counters

Yep, some programs can actually count words straight from a PDF, or so they say.

PDF2TXT converters

There are some paid and free PDF2TXT (or PDF2DOC) converters. Some of them are more intelligent than others -- some try to reformat the text, whereas others simply extract the text.

  • List of PDF2TXT and PDF2DOC converters here
  • You can also use pstotext. It is a bit more difficult to use, so if you aren't very tech savvy, it probably isn't for you. You need to install Ghostscript and Ghostview (both free), then install pstotext, and then run the extract function. It doesn't handle every type of PDF, but it handles many of them. You can find out more about it at: http://www.research.compaq.com/SRC/virtualpaper/pstotext.html
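
For those who prefer to script the extraction, here is a rough sketch of calling a command-line extractor from Python. It assumes that pstotext and Ghostscript are installed and on the PATH, and that the tool writes the extracted text to standard output when given a file name. The function name extract_text and the file name sample.pdf are placeholders.

```python
import shutil
import subprocess


def extract_text(pdf_path: str, tool: str = "pstotext") -> str:
    """Run a command-line extractor on a PDF and return what it prints.

    Assumption: the tool takes the input file as its only argument and
    writes the extracted text to standard output.
    """
    # Fail early with a clear message if the tool is not installed.
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} was not found on the PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(
        [tool, pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    try:
        print(extract_text("sample.pdf"))  # placeholder file name
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        print(f"Extraction failed: {error}")
```

As with any extractor, the output will only contain real text, not pictures of text, and its order may differ from what you see on screen.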

OCR converters

OCR is optical character recognition. An OCR program tries to read and extract text that is not real text but only a picture of text. OCR programs do better with simple fonts than with fancy fonts. Some OCR programs also try to format the text document so that it looks similar to the scanned image or PDF.

Not all OCR programs can take PDFs as input files. In some cases you may have to convert the PDF to an image format before doing the OCR.

  • List of OCR programs here

Word counters

Once you have your text in a text format, such as plain text or a word processing format, you can count the words using your favourite word counting utility.
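
If you have no word counting utility at hand, a few lines of Python will do for plain text. This is a minimal sketch that counts whitespace-separated words, which is only one possible definition of a "word". Other tools may treat hyphenated words, numbers or stray punctuation differently, so their totals can differ slightly. The function name count_words is made up for illustration.

```python
def count_words(text: str) -> int:
    """Count words, where a word is any run of non-whitespace characters."""
    # split() with no argument splits on spaces, tabs and line breaks,
    # and ignores leading, trailing and repeated whitespace.
    return len(text.split())


if __name__ == "__main__":
    sample = "Once you have your text\nin a text format, count it."
    print(count_words(sample))
```

To count the words in a file, read the file into a string first (for example with open(path, encoding="utf-8").read()) and pass that string to the function.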

