Regex for finding abbreviations/acronyms not in abbreviations list
Thread poster: Patricia Martin
Patricia Martin
United States
Japanese to English
May 29, 2013

I know I really need to study regular expressions more so I don't have to ask on a forum for help, but bear with me, I'm still a newbie.

I translate a lot of technical documents and usually at the beginning there is a list of abbreviations.

Something like

API Application Programming Interface
DD Design Document
RC Remote Control

etc

However, throughout the document I'm starting to find lots of abbreviations that are not in this list. Obviously the author should have taken better care of this beforehand and I probably should have looked over the document more carefully before accepting the job (newbie mistake) but it's too late now.

The client has no glossary of these abbreviations either (another newbie mistake by me). Although I've been told what a great idea it would be to have one (you don't say!?!) and to please provide them with one when I'm finished.

I'm not going through an LSP and I guess I could put the blame on the client and ask them to provide a list first, but it looks like I may be able to have a lot of work from them in the future (they need a lot of help) so I want to come off as superwoman. Unfortunately my computer skills are not there yet.

I'd like to be able to find all of these abbreviations beforehand so I can get confirmation from the client before I start translating to avoid a lot of back and forth emails which can take a while to get a response.

The majority of them are 2 to 4 characters long and are all caps, but not always (for example NTrS with a lower case r in the middle)

What would be the regex I can use to find these abbreviations that are not in the list? There may not be a magical one that can find them all, but I could probably use a regex to find capital letters that are between 2-4 characters long.

HELP!


 
RWS Community
United Kingdom
Local time: 18:04
English
You could try... May 29, 2013

... this to find complete abbreviations between 2-4 characters:

\b[A-Z]{2,4}\b

Or this if you think some may have lowercase characters in there as well:

[A-Z]{2,4}

The second is less exact but it may help.
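
For anyone comfortable running a small script, the first pattern can also be combined with the existing abbreviation list so that only the undefined ones are reported. This is a minimal sketch in Python (an assumption; nobody in the thread uses it), and the sample text and the `known` set are hypothetical stand-ins for the real document and its list:

```python
import re

# Hypothetical sample data; replace with the document text and its abbreviation list.
known = {"API", "DD", "RC"}
text = "The API sends the DD to the RC unit, then the ECU and the PLC log it via USB."

# Python's re module is case-sensitive by default, so [A-Z] matches capitals only.
# \b on both sides restricts matches to whole words of 2 to 4 capital letters.
pattern = re.compile(r"\b[A-Z]{2,4}\b")

# Collect every candidate once, then drop the ones already defined in the list.
candidates = set(pattern.findall(text))
missing = sorted(candidates - known)

print(missing)  # ['ECU', 'PLC', 'USB']
```

The resulting list is what could be sent to the client for confirmation before starting the translation.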

Regards

Paul


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 17:04
Member (2009)
Dutch to English
PerfectIt May 29, 2013

Hi Patricia,

Welcome to Proz!

You could also use PerfectIt for this:



ABBREVIATIONS

How to avoid confusion without extra effort

What happens when your clients find abbreviations that they don't understand? Sometimes they can figure it out from the context, but otherwise the work that you put into crafting your message can be wrecked. To prevent distractions, you should ensure that every abbreviation is defined when it's first used and you should provide a Table of Abbreviations with your documents.

The trouble is that it requires a lot of work. First you have to locate each abbreviation and then you need to find out whether that's the first use. If it's the first time it appears, the definition needs to be with that instance and not with the later ones. It's a time-consuming task that requires painstaking attention to detail. However, if your word processor could do it for you, readers would be guaranteed to understand the document better without additional time spent proofreading.

PerfectIt is a downloadable program that works with Microsoft Word. Without any templates, forms or markers, PerfectIt checks that abbreviations are:

Presented consistently
Defined on their first use
Defined only once
Used after they are defined
PerfectIt can also generate a Table of Abbreviations automatically from your document. When you're finished writing, just load PerfectIt. It will undertake all of the checking and generate your Table of Abbreviations. You can learn more about PerfectIt's other features, or to try PerfectIt for free, download it now.

http://www.intelligentediting.com/standardversion.aspx


PS: I don't work for them. Just another happy user.


 
Giles Watson
Giles Watson  Identity Verified
Italy
Local time: 18:04
Italian to English
In memoriam
Don't forget... May 29, 2013

Patricia Martin wrote:

The client has no glossary of these abbreviations either (another newbie mistake by me). Although I've been told what a great idea it would be to have one (you don't say!?!) and to please provide them with one when I'm finished.



... to agree a fee for compiling the glossary before you start. Even Superwoman has to eat


 
Patricia Martin
United States
Japanese to English
TOPIC STARTER
case sensitivity in regex? May 30, 2013

SDL Support wrote:

... this to find complete abbreviations between 2-4 characters:

\b[A-Z]{2,4}\b


This actually found ones with lowercase letters in them! However, it also found every word between 2 and 4 characters long:

for, the, of, with, on, as, any, has, been, this, etc.

Is there a way to make searches case-sensitive and look for only capital letters in regex? Or include a stop word list to filter out these words?

It also failed to find HC-LS but I guess with the hyphen that is 5 characters long.
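
The behaviour described here is what a case-insensitive search produces: with case ignored, [A-Z] also matches a-z, so every short word qualifies. The sketch below reproduces that in Python (an assumption; the thread's tools are Studio and text editors) and shows both remedies asked about, a case-sensitive search and a stop-word filter. The sample sentence and the `stop_words` set are hypothetical:

```python
import re

# Hypothetical sample sentence.
text = "HC-LS feeds the ECU for any of the NTrS units."

# Case-insensitive search: [A-Z] now matches lowercase too, so short words slip in.
loose = re.findall(r"\b[A-Z]{2,4}\b", text, flags=re.IGNORECASE)

# Case-sensitive search (Python's default): only all-caps tokens survive.
# Note that HC-LS is reported as two separate matches, HC and LS.
strict = re.findall(r"\b[A-Z]{2,4}\b", text)

# Alternative when case sensitivity cannot be switched on: filter with a stop-word list.
stop_words = {"for", "the", "of", "with", "on", "as", "any", "has", "been", "this"}
filtered = [w for w in loose if w.lower() not in stop_words]

print(loose)     # ['HC', 'LS', 'the', 'ECU', 'for', 'any', 'of', 'the', 'NTrS']
print(strict)    # ['HC', 'LS', 'ECU']
print(filtered)  # ['HC', 'LS', 'ECU', 'NTrS']
```

The stop-word route keeps mixed-case items such as NTrS, at the cost of maintaining the word list by hand.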

Giles Watson wrote:

Even Superwoman has to eat


Ha. I was going to charge them a fee for the glossary as it is extra work even though it helps us both. Maybe I can use that fee to purchase the software Michael mentioned or a book on regular expressions.


 
Heartsome Support
Local time: 00:04
\b[A-Z]{2,4}\b Works May 30, 2013

\b[A-Z]{2,4}\b works well in EmEditor.

 
Natron
Japan
Local time: 01:04
English to Japanese
Match case May 30, 2013

What software are you using? Is there a "match case" option?

[Edited at 2013-05-30 08:06 GMT]


 
István Hirsch
István Hirsch  Identity Verified
Local time: 18:04
English to Hungarian
Software used May 30, 2013

Please specify where you are searching. Your post suggests it is in Word. However, I can more or less reproduce what you reported in SDL Studio 2011 if I search with 'Use' (Regular Expressions) checked but 'Match Case' unchecked.

This may be important because regex flavours can differ between versions of the same software, let alone between different programs.


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 18:04
English to Hungarian
mixed case acronyms May 30, 2013

Patricia Martin wrote:
The majority of them are 2 to 4 characters long and are all caps, but not always (for example NTrS with a lower case r in the middle)


Something like this should do the trick (untested):
\b[A-Z][A-Za-z]{0,2}[A-Z]\b

What this looks for is: a 2 to 4 letter "word" that starts with a capital letter, ends with a capital letter and may have capital or lower-case letters in the middle.
It will match AB, ABC, ABCD and NTrS, but it won't match ABCd or aBC. You win some, you lose some.
You can use {0,3} to expand the total length to 5 letters, and you can add other characters to the character set in the middle as needed.
E.g.
\b[A-Z][A-Za-z-]{0,3}[A-Z]\b
will match HC-LS*


Enable "Match case", of course (assuming that you're doing this in Studio).

* You may need to use \b[A-Z][A-Za-z\-]{0,3}[A-Z]\b, I can't be bothered to test it now
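
A quick way to check a pattern like this is to run it over a handful of sample tokens. The sketch below does so in Python (an assumption; the post itself targets Studio). It writes the middle character class as `[A-Za-z-]`, because Python's `re` module rejects `a-Z` as a reversed range, and the sample string is made up from the examples in this thread:

```python
import re

# First and last characters must be capitals; up to three letters or
# hyphens are allowed in between (a trailing hyphen in a class is literal).
pattern = re.compile(r"\b[A-Z][A-Za-z-]{0,3}[A-Z]\b")

# Sample tokens taken from the examples discussed in this thread.
text = "AB ABC ABCD NTrS HC-LS ABCd aBC the"
matched = pattern.findall(text)

print(matched)  # ['AB', 'ABC', 'ABCD', 'NTrS', 'HC-LS']
```

Other engines may treat the hyphen inside the character class differently, which is why the escaped form `\-` is the safer spelling when in doubt.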

[Edited at 2013-05-30 09:13 GMT]


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 18:04
Member (2006)
English to Afrikaans
In MS Word... May 30, 2013

Patricia Martin wrote:
The majority of them are 2 to 4 characters long and are all caps, but not always (for example NTrS with a lower case r in the middle).


In MS Word, you can try this:

<[A-Z]@>
and
<[A-Z]@[A-Za-z]@[A-Z]@>


 
Patricia Martin
United States
Japanese to English
TOPIC STARTER
text editor for Unicode block properties? Jun 4, 2013

Sorry, I should have been more specific. I just copied/pasted the text into Notepad++ and searched from there. Thanks to everyone's help here I was able to extract a list of abbreviations much quicker.

Now for my next question!



I've been studying more on regex flavors here

From the site:

Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.

However, it seems these are only supported in programming languages such as PHP or Perl. I'm not a programmer and am feeling a little overwhelmed.

Is there a text editor that allows you to use Unicode block property regexes???


 
RWS Community
United Kingdom
Local time: 18:04
English
EditPadPro Jun 4, 2013

Hi,

I'm surprised if Notepad++ doesn't do this, but I don't have it and can't confirm. However, EditPad Pro does: http://www.editpadpro.com/

It is written by the same person who wrote this: http://www.regular-expressions.info/cookbook.html

Regards

Paul


 

