Regex for finding abbreviations/acronyms not in abbreviations list
Thread poster: Patricia Martin
I know I really need to study regular expressions more so I don't have to ask on a forum for help, but bear with me, I'm still a newbie. I translate a lot of technical documents, and usually at the beginning there is a list of abbreviations. Something like:

API Application Programming Interface
DD Design Document
RC Remote Control
etc.

However, throughout the document I'm starting to find lots of abbreviations that are not in this list. Obviously the author should have taken better care of this beforehand and I probably should have looked over the document more carefully before accepting the job (newbie mistake) but it's too late now. The client has no glossary of these abbreviations either (another newbie mistake by me). Although I've been told what a great idea it would be to have one (you don't say!?!) and to please provide them with one when I'm finished.

I'm not going through an LSP and I guess I could put the blame on the client and ask them to provide a list first, but it looks like I may be able to have a lot of work from them in the future (they need a lot of help) so I want to come off as superwoman. Unfortunately my computer skills are not there yet. I'd like to be able to find all of these abbreviations beforehand so I can get confirmation from the client before I start translating, to avoid a lot of back-and-forth emails which can take a while to get a response.

The majority of them are 2 to 4 characters long and are all caps, but not always (for example NTrS with a lower case r in the middle). What would be the regex I can use to find these abbreviations that are not in the list? There may not be a magical one that can find them all, but I could probably use a regex to find capital letters that are between 2-4 characters long. HELP!

You could try... | May 29, 2013 |
... this to find complete abbreviations between 2-4 characters:

\b[A-Z]{2,4}\b

Or this if you think some may have lowercase characters in there as well:

[A-Z]{2,4}

The second is less exact but it may help.

Regards
Paul

Michael Beijer United Kingdom Local time: 17:04 Member (2009) Dutch to English + ...
Hi Patricia,

Welcome to ProZ! You could also use PerfectIt for this:

ABBREVIATIONS
How to avoid confusion without extra effort

What happens when your clients find abbreviations that they don't understand? Sometimes they can figure it out from the context, but otherwise the work that you put into crafting your message can be wrecked. To prevent distractions, you should ensure that every abbreviation is defined when it's first used and you should provide a Table of Abbreviations with your documents.

The trouble is that it requires a lot of work. First you have to locate each abbreviation and then you need to find out whether that's the first use. If it's the first time it appears, the definition needs to be with that instance and not with the later ones. It's a time-consuming task that requires painstaking attention to detail. However, if your word processor could do it for you, readers would be guaranteed to understand the document better without additional time spent proofreading.

PerfectIt is a downloadable program that works with Microsoft Word. Without any templates, forms or markers, PerfectIt checks that abbreviations are:

Presented consistently
Defined on their first use
Defined only once
Used after they are defined

PerfectIt can also generate a Table of Abbreviations automatically from your document. When you're finished writing, just load PerfectIt. It will undertake all of the checking and generate your Table of Abbreviations. You can learn more about PerfectIt's other features or, to try PerfectIt for free, download it now.

http://www.intelligentediting.com/standardversion.aspx

PS: I don't work for them. Just another happy user.

Giles Watson Italy Local time: 18:04 Italian to English In memoriam
Don't forget... | May 29, 2013 |
Patricia Martin wrote:
The client has no glossary of these abbreviations either (another newbie mistake by me). Although I've been told what a great idea it would be to have one (you don't say!?!) and to please provide them with one when I'm finished.

... to agree a fee for compiling the glossary before you start. Even Superwoman has to eat
case sensitivity in regex? | May 30, 2013 |
SDL Support wrote:
... this to find complete abbreviations between 2-4 characters: \b[A-Z]{2,4}\b

This actually found ones with lowercase letters in them! However, it also found every word between 2 and 4 characters long: for, the, of, with, on, as, any, has, been, this, etc. Is there a way to make searches case-sensitive and look for only capital letters in regex? Or to include a stop word list to filter out these words? It also failed to find HC-LS, but I guess with the hyphen that is 5 characters long.

Giles Watson wrote:
Even Superwoman has to eat

Ha. I was going to charge them a fee for the glossary as it is extra work even though it helps us both. Maybe I can use that fee to purchase the software Michael mentioned or a book on regular expressions.

\b[A-Z]{2,4}\b Works | May 30, 2013 |
\b[A-Z]{2,4}\b works well in EmEditor.

Natron Japan Local time: 01:04 English to Japanese + ...
What software are you using? Is there a "match case" option?
[Edited at 2013-05-30 08:06 GMT]

Software used | May 30, 2013 |
Please specify where you are going to search. Your post suggests that you are searching in Word. However, I can reproduce (more or less) what you have reported in SDL Studio 2011 if I search with 'Use' (Regular Expressions) checked but 'Match Case' unchecked. This may be important because regex flavours can differ between versions of the same software, let alone between different programs.
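The effect described above can be reproduced outside any CAT tool. The following is only a sketch in Python (the sample sentence is invented, and Python's `re` module stands in for whatever engine your editor uses): the same pattern is run once case-sensitively and once with case ignored, which is roughly what a search with "Match Case" unchecked does.

```python
import re

# Invented sample sentence, for illustration only.
text = "The API for the RC unit has been described in the DD, as has HC-LS."

pattern = r"\b[A-Z]{2,4}\b"

# Case-sensitive (Python's default): only runs of 2-4 capital letters match.
case_sensitive = re.findall(pattern, text)

# Case-insensitive: [A-Z] now also matches a-z, so short ordinary words match too.
case_insensitive = re.findall(pattern, text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

In this sketch the hyphen acts as a word boundary, so HC-LS is reported as two separate hits (HC and LS) rather than as one abbreviation.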
mixed case acronyms | May 30, 2013 |
Patricia Martin wrote:
The majority of them are 2 to 4 characters long and are all caps, but not always (for example NTrS with a lower case r in the middle)

Something like this should do the trick (untested):

\b[A-Z][A-Za-z]{0,2}[A-Z]\b

What this looks for is: a 2 to 4 letter "word" that starts with a capital letter, ends with a capital letter and may have capital or lower-case letters in the middle. It will match AB, ABC, ABCD and NTrS, but it won't match ABCd or aBC. You win some, you lose some.

You can use {0,3} to expand the total length to 5 letters, and you can add other characters to the character set in the middle as needed. E.g.

\b[A-Z][A-Za-z-]{0,3}[A-Z]\b

will match HC-LS*

Enable "Match case", of course (assuming that you're doing this in Studio).

* You may need to use \b[A-Z][A-Za-z\-]{0,3}[A-Z]\b, I can't be bothered to test it now
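For the original question (abbreviations that are not in the list), a rough sketch in Python might look like the following. Everything here is an assumption for illustration: the `known` set and the sample text are made up, and the middle character class is spelled out as `A-Za-z-` because Python's `re` module rejects a range running from `a` to `Z`.

```python
import re

# Hypothetical data: abbreviations already defined at the top of the document.
known = {"API", "DD", "RC"}

# Hypothetical body text.
text = ("The API sends NTrS frames to the RC over HC-LS. "
        "See the DD and the QA log. ABCd is ignored.")

# Capital first letter, capital last letter,
# and up to three letters or hyphens in between.
pattern = re.compile(r"\b[A-Z][A-Za-z-]{0,3}[A-Z]\b")

# Collect every candidate once, then drop the ones already in the list.
found = set(pattern.findall(text))
undefined = sorted(found - known)

print(undefined)
```

The set difference does the "not in the list" part; the regex only proposes candidates, so the result still needs a human look before it goes to the client.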
[Edited at 2013-05-30 09:13 GMT]

Samuel Murray Netherlands Local time: 18:04 Member (2006) English to Afrikaans + ...
In MS Word... | May 30, 2013 |
Patricia Martin wrote:
The majority of them are 2 to 4 characters long and are all caps, but not always (for example NTrS with a lower case r in the middle).

In MS Word, you can try this:

<[A-Z]@>

and

<[A-Z]@[A-Z,a-z]@[A-Z]@>

text editor for Unicode block properties? | Jun 4, 2013 |
Sorry, I should have been more specific. I just copied/pasted the text into Notepad++ and searched from there. Thanks to everyone's help here I was able to extract a list of abbreviations much quicker.

Now for my next question! I've been studying more on regex flavors here. From the site:

Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.

However, it seems these are only supported in programming languages such as PHP or Perl. I'm not a programmer and am feeling a little overwhelmed. Is there a text editor that allows you to use Unicode block property regexes???
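Where an engine has no Unicode script or block properties, one workaround is to spell out the code point ranges of the relevant blocks by hand. The sketch below does this in Python, whose standard `re` module has no `\p{...}` support; it is a substitute for the script properties quoted above, not the same thing. The sample text is invented, and only the main kanji block is covered.

```python
import re

# Code point ranges of three Unicode blocks.
HIRAGANA = r"\u3040-\u309F"
KATAKANA = r"\u30A0-\u30FF"
CJK_UNIFIED = r"\u4E00-\u9FFF"  # main kanji block only; extension blocks are not covered

# One or more consecutive characters from any of the three blocks.
japanese_run = re.compile(f"[{HIRAGANA}{KATAKANA}{CJK_UNIFIED}]+")

# Invented mixed-script sample.
text = "API は Application Programming Interface の略語です"

runs = japanese_run.findall(text)
print(runs)
```

The same explicit ranges can be tried in any editor whose regex flavour accepts code point escapes inside a character class, though the escape syntax differs from tool to tool.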