
Glossary management

From ProZ.com Wiki



Note: This article is a joint project of ProZ.com members and guests.



To build a glossary of terms (words and phrases), you can either compile it as you go along during a pilot project, or extract terms from the source text and compile them into a glossary.

The advantage of creating it during a pilot project is that the term list is created in context; the disadvantages are that it is a slower method and yields fewer terms (although most of them are useful). The advantages of automatic term extraction are that it is fast and potentially yields a larger number of relevant terms. The disadvantages are that the terms are not viewed in context, and that you first have to sift through many unimportant or useless terms.

The disadvantages of either method can be overcome, though.


Bulk term extractors

Bilingual term extractors

Since some texts are available in bitext format (a TM, a PO file, or similar), it should be possible to work out through smart guessing which source words correspond to which target words. Tools that do this are generally called "word aligners" in computational linguistics; sadly, most "free" ones are free for academic purposes only. Word aligners have great potential because bitexts are usually translated by humans, so the terms are likely to be accurate.
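
To make the "smart guessing" idea concrete, here is a minimal sketch that pairs source and target words by how often they turn up in the same segments, scored with the Dice coefficient. It is a toy illustration and not how any particular word aligner works; the function name `dice_alignments`, the `min_count` parameter, and the sample English/Afrikaans segments are all made up for this example.

```python
from collections import Counter
from itertools import product


def dice_alignments(bitext, min_count=2):
    """Guess source/target word pairs from segment-level co-occurrence.

    bitext is a list of (source_segment, target_segment) string pairs.
    Returns a dict mapping each source word to its best-scoring target word.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        # Count each word once per segment so long segments do not dominate.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    best = {}
    for (s, t), both in pair_freq.items():
        if both < min_count:
            continue  # too little evidence to guess from
        # Dice coefficient: 1.0 when the two words always appear together.
        score = 2 * both / (src_freq[s] + tgt_freq[t])
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


# Made-up sample bitext (English to Afrikaans).
bitext = [
    ("open the file", "maak die lêer oop"),
    ("close the file", "maak die lêer toe"),
    ("open the window", "maak die venster oop"),
]
pairs = dice_alignments(bitext)
print(pairs["open"], pairs["file"])
```

Words seen in only one segment (such as "close" here) are skipped, and very common words tend to tie between several candidates, so the guesses still need a human check.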

A good compromise may be to do monolingual term extraction on a TM -- I know of no such tools, however.

Monolingual term extractors

Monolingual term extractors usually look for words or phrases that occur more than X times in the text, or for words that occur in the text but not in a dictionary or established term list. Such an extraction usually contains a large percentage of useless terms unless steps are taken to remove irrelevant repetitions from the result.
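
The two strategies described above can be sketched in a few lines. This is only an illustration under simple assumptions (words are runs of letters, matching is case-insensitive); the function names, the `min_count` threshold, and the sample text are hypothetical and not taken from any of the tools listed here.

```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware


def frequent_words(text, min_count=3, stoplist=()):
    """Words that occur at least min_count times and are not in stoplist."""
    counts = Counter(WORD.findall(text.lower()))
    stop = {w.lower() for w in stoplist}
    return {w: n for w, n in counts.items() if n >= min_count and w not in stop}


def unknown_words(text, dictionary):
    """Words that occur in the text but not in the given word list."""
    known = {w.lower() for w in dictionary}
    return sorted(set(WORD.findall(text.lower())) - known)


# Made-up sample text.
text = "The torque wrench sets the torque. Check the torque and the wrench."
print(frequent_words(text, min_count=2, stoplist=["the"]))
print(unknown_words(text, ["the", "sets", "check", "and", "wrench"]))
```

Without the stoplist, "the" would head the frequency list, which is the kind of irrelevant repetition that has to be removed from the result.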

ExtPhr32 by Tim Craven

Tim Craven's ExtPhr32 for MS Windows is not GPL, but it is freeware for all purposes. It is very fast. Unfortunately, it converts all terms to uppercase. You can use a stoplist, set the minimum number of occurrences of a term, and set the minimum number of words in a term. The output can be exported as two-column plain text (the second column contains a count).
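
For readers who want to experiment with the same kind of settings, the sketch below counts multi-word phrases with a minimum number of occurrences, a minimum number of words, and a stoplist, and writes two-column plain text. It does not reproduce ExtPhr32's actual algorithm, which is not described here; the function names, the `max_words` limit, and the rule of dropping phrases that start or end with a stoplist word are assumptions made for the example.

```python
import re
from collections import Counter


def extract_phrases(text, min_count=2, min_words=2, max_words=3, stoplist=()):
    """Count word sequences of min_words..max_words words within each sentence."""
    stop = {w.lower() for w in stoplist}
    counts = Counter()
    for sentence in re.split(r"[.!?;:\n]+", text.lower()):
        words = re.findall(r"[^\W\d_]+", sentence)
        for size in range(min_words, max_words + 1):
            for i in range(len(words) - size + 1):
                gram = words[i:i + size]
                # Skip phrases that start or end with a stoplist word.
                if gram[0] in stop or gram[-1] in stop:
                    continue
                counts[" ".join(gram)] += 1
    return [(p, n) for p, n in counts.most_common() if n >= min_count]


def to_two_columns(phrases):
    """Format (phrase, count) pairs as tab-separated plain text."""
    return "\n".join(f"{phrase}\t{count}" for phrase, count in phrases)


# Made-up sample text.
text = "Tighten the torque wrench. Store the torque wrench in its case."
print(to_two_columns(extract_phrases(text, stoplist=["the", "in", "its"])))
```

Unlike ExtPhr32, this sketch lowercases the terms instead of converting them to uppercase.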

PlusTools from Wordfast

PlusTools is a macro that runs within MS Word on MS Windows. It is not GPL, but it is freeware for all purposes. It is slow but potentially useful for smaller texts because it can exclude words that occur in MS Word's spellchecker and/or thesaurus, or words that have fewer than X synonyms in the thesaurus. You can also exclude certain words (similar to a stoplist, but you can add any words to it), words beginning with a certain set of characters, and words shorter than X characters.
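
Exclusion rules of this kind are easy to chain. The sketch below applies a known-word list, an exclusion list, a set of prefixes, and a minimum length to a list of candidate words. It is not PlusTools code and has no access to MS Word's spellchecker or thesaurus; the plain `known_words` list merely stands in for them, and all names are hypothetical.

```python
def filter_candidates(words, known_words=(), excluded=(), prefixes=(), min_length=1):
    """Drop candidate words that match any of the exclusion rules."""
    known = {w.lower() for w in known_words}
    banned = {w.lower() for w in excluded}
    prefixes = tuple(p.lower() for p in prefixes)
    kept = []
    for word in words:
        lower = word.lower()
        if lower in known or lower in banned:
            continue  # already in the word list or explicitly excluded
        if prefixes and lower.startswith(prefixes):
            continue  # begins with one of the unwanted character sequences
        if len(word) < min_length:
            continue  # too short to be a useful term
        kept.append(word)
    return kept


# Made-up candidate list.
candidates = ["flange", "the", "unscrew", "preload", "gasket", "nut", "rod"]
print(filter_candidates(
    candidates,
    known_words=["the", "nut"],
    excluded=["gasket"],
    prefixes=["un"],
    min_length=4,
))
```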

Other tools

GPL collection of corpus tools.

  • SDL MultiTerm 7 Extract Freelance

Extracts terms from Trados-compliant bitexts. Price: EUR 525.00 per single-user licence.

Concordancers

A concordancer is a tool that displays a word in context. An excellent, easy-to-use concordancer is Corpsis (previously Tenka Text), which can display multiword phrases using wildcards. Or, if you're willing to pay some money, try Mike Scott's WordSmith Tools.

http://corpsis.sourceforge.net/
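
To show what a concordancer does, here is a minimal keyword-in-context sketch with a simple wildcard. It is not based on Corpsis or WordSmith Tools; the function name `concordance`, the `width` parameter, and the rule that `*` matches any run of word characters are assumptions made for the example.

```python
import re


def concordance(text, pattern, width=30):
    """Return keyword-in-context lines; '*' in pattern matches word characters."""
    # Escape the pattern, then turn each escaped '*' into "any word characters".
    regex = re.escape(pattern).replace(r"\*", r"\w*")
    lines = []
    for match in re.finditer(rf"\b{regex}\b", text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the hits line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Made-up sample text.
text = "The term list is short. Terminology lists grow over time."
for line in concordance(text, "term* list*"):
    print(line)
```

The pattern "term* list*" finds both "term list" and "Terminology lists", each shown with up to 30 characters of context on either side.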

Glossary editors

Whichever way you look at it, a glossary is a database, and most comprehensive glossaries can be edited in a database editor tool. For simple, three-column glossaries, a spreadsheet program may be all that's necessary, though.
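
As a small illustration of the "glossary is a database" point, the sketch below loads three-column, tab-separated text (as a spreadsheet might export it) into an SQLite table and looks up a term. The column meanings (source term, target term, note), the function names, and the sample entries are assumptions made for the example.

```python
import csv
import io
import sqlite3


def load_glossary(tsv_text):
    """Load three-column, tab-separated glossary text into an in-memory database."""
    conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data
    conn.execute("CREATE TABLE glossary (source TEXT, target TEXT, note TEXT)")
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader if len(row) == 3]  # ignore malformed lines
    conn.executemany("INSERT INTO glossary VALUES (?, ?, ?)", rows)
    return conn


def lookup(conn, source_term):
    """Return (target, note) pairs for an exact source match (ASCII case-insensitive)."""
    cursor = conn.execute(
        "SELECT target, note FROM glossary WHERE source = ? COLLATE NOCASE",
        (source_term,),
    )
    return cursor.fetchall()


# Made-up sample glossary (English to Afrikaans).
sample = "file\tlêer\tcomputing\nwindow\tvenster\tcomputing\n"
conn = load_glossary(sample)
print(lookup(conn, "File"))
```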


Glossary viewers

There are quite a few glossary viewers, but they often require that you convert your glossary to their own format. Examples are StarDict, Jalingo, jDictionary and Pododict. If you're willing to pay for a non-free product, AnyLexic is an excellent glossary tool that supports simple formats too.