Looking for a regular expression to catch everything between < and >
Thread poster: Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Dec 9, 2011

Hi,

I've imported an SDL TMX and I get lots of tags in the language pairs:



etc.

Can somebody please help me define a (non-greedy) regular expression to replace all tags, i.e. everything from < to >, with nothing?

I've come this far:


 
Michael Beijer
United Kingdom
Local time: 13:58
Member (2009)
Dutch to English
Hi Hans Dec 9, 2011

I will leave the regex to the experts, but have you tried Olifant or the new vTMXEditor (http://www.proz.com/forum/software_applications/212794-vtmxeditor_::_new_free_tmx_editor.html)? These can both remove tags/html from your TMX...

 
Csaba Ban
Hungary
Local time: 14:58
Member (2002)
English to Hungarian
MemoQ 5 does this beautifully Dec 9, 2011

MemoQ 5 offers a great and easy solution for turning such regular expressions into internal tags.
They offer a 45-day free trial. BTW, now the software package is priced at a 40% discount.

good luck,
Csaba


 
jean-marc aubertin
Local time: 14:58
Member (1970)
English
Hi Hans Dec 9, 2011

The regular expression to find tags is: <&>

I think it is a little dangerous to replace all tags with nothing based on this regular expression: it may have side effects if your text contains real < or > characters that actually mean "less than" or "greater than"...

Best regards,

Jean-Marc

[Edited at 2011-12-09 11:31 GMT]


 
Adam Podstawczynski (X)
Local time: 14:58
Polish to English
Your tags don't show :) Dec 9, 2011

Show the examples first, because they are invisible in your posting.

However, off the top of my head such an expression would go as follows:


s/\<.?\>//


This is the Perl-style, non-greedy expression you asked for. If you need a Word-style one, it will look a bit different; please let me know.

[Edited at 2011-12-09 11:16 GMT]


 
Adam Podstawczynski (X)
Local time: 14:58
Polish to English
On second thoughts Dec 9, 2011

This should read

s/<.+?>//

I'm writing from memory, can't check now.
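
For anyone who wants to check the difference between the greedy and the non-greedy form, here is a minimal sketch in Python rather than Perl or Transit. The sample segment is made up for illustration; it is not from the original TMX.

```python
import re

# Hypothetical segment with two inline tags (not taken from the original TMX).
segment = "Press <b>Start</b> to continue"

# Greedy: .+ runs on to the LAST '>' in the line, so the text between
# the two tags is swallowed as well.
greedy = re.sub(r"<.+>", "", segment)

# Non-greedy: .+? stops at the FIRST '>' it can, so each tag is matched
# on its own.
non_greedy = re.sub(r"<.+?>", "", segment)

print(repr(greedy))      # 'Press  to continue'
print(repr(non_greedy))  # 'Press Start to continue'
```

Note that Perl's s/// replaces only the first match unless the g flag is appended, whereas Python's re.sub replaces every match by default.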


 
FarkasAndras
Local time: 14:58
English to Hungarian
Solutions Dec 9, 2011

If Transit does have regex support, I find the idea of switching to MemoQ for this reason... peculiar.

Anyway, this should work:
<[^>]*>

This is essentially the same as the non-greedy expression with the ? above, only I find it a bit more transparent, easier to adapt and more certain to work in more regex engines.
[] denotes a character class, [^] means 'any character except', and * means 'zero or more of'. So the expression translates to: <, then any number of characters that aren't >, then >. If Transit can replace multiple regex matches in the same TU, then it should be enough to run this once.
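
A minimal sketch of this expression in Python, assuming a made-up translation unit (the tag names below are hypothetical, not taken from the SDL TMX in question). Python's re.sub replaces all matches in one pass, which corresponds to running the replacement once in a tool that handles multiple matches per TU.

```python
import re

# '<', then any run of characters that are not '>', then '>'
TAG = re.compile(r"<[^>]*>")

# Hypothetical translation unit containing three tags.
tu = '<cf bold="on">Achtung:<cf/> Maschine <ph x="1"/>ausschalten'

cleaned = TAG.sub("", tu)

print(TAG.findall(tu))  # the three tags that were matched
print(cleaned)          # Achtung: Maschine ausschalten
```

One limitation of this pattern: a > inside an attribute value would end the match early.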

[^] is of course much better for this sort of thing than trying to match every conceivable character positively. For instance, not even characters like éáőúóü are covered by [a-z]. They are covered by \w if the regex engine has \w (which I believe should capture all letters, all numbers and _) but even then, there are a myriad special characters you are never going to remember.
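
To illustrate the point about [a-z] and \w, a short Python check. It shows Python 3 behaviour only, where \w is Unicode-aware for text strings; it says nothing about whether Transit's engine has \w at all.

```python
import re

# The accented letters from the post, written as escapes: é á ő ú ó ü
letters = "\u00e9\u00e1\u0151\u00fa\u00f3\u00fc"

# [a-z] covers the ASCII range only, so none of the letters match.
ascii_hits = re.findall(r"[a-z]", letters)

# \w is Unicode-aware for str patterns in Python 3.
word_match = re.fullmatch(r"\w+", letters) is not None

# The negated class accepts anything that is not '>'.
negated_match = re.fullmatch(r"[^>]+", letters) is not None

print(ascii_hits, word_match, negated_match)  # [] True True
```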

[Edited at 2011-12-09 13:00 GMT]


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Want to try it in Transit first Dec 15, 2011

Michael Beijer wrote:

I will leave the regex to the experts, but have you tried Olifant or the new vTMXEditor (http://www.proz.com/forum/software_applications/212794-vtmxeditor_::_new_free_tmx_editor.html)? These can both remove tags/html from your TMX...


Hi Michael and thanks for the suggestion. I want to try it in Transit first.

Transit NXT your one stop solution.


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Don't compare a Ferrari with a Trabant Dec 15, 2011

Csaba Ban wrote:

MemoQ 5 offers a great and easy solution for turning such regular expressions into internal tags.
They offer a 45-day free trial. BTW, now the software package is priced at a 40% discount.

good luck,
Csaba


Thanks Csaba, you're surely not comparing Transit NXT with MemoQ?

They are nice guys at Kilgray but they'll have a long way to go to offer all the beauties that Transit NXT offers.


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Greater than/less than signs don't show Dec 15, 2011

Adam Podstawczynski wrote:

This should read

s/<.+?>//

I'm writing from memory, can't check now.


I had already written to ProZ Support asking them to fix the display of greater-than/less-than signs ASAP. I had forgotten about that, so the tags I posted don't show up.

Thanks for the suggestion, I'll try it. Hmm, I get "invalid syntax".

Find: «machine123#4!!»

With: «.+» doesn't work either ...

Ah, this is probably what I need (as suggested in another reply here):

«&»

(Where « stands for the less-than sign and » for the greater-than sign)

Hans

[Edited at 2011-12-15 13:08 GMT]


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Yep that is the one Dec 15, 2011

Warlock wrote:

The regular expression to find tags is: <&>

I think it is a little dangerous to replace all tags with nothing based on this regular expression: it may have side effects if your text contains real < or > characters that actually mean "less than" or "greater than"...

Best regards,

Jean-Marc

[Edited at 2011-12-09 11:31 GMT]


Thanks for this one.

Please tell me, how do you get « and » to show up in your message?

Hans


 
FarkasAndras
Local time: 14:58
English to Hungarian
< and > Dec 15, 2011

Hans Lenting wrote:

Adam Podstawczynski wrote:

This should read

s/<.+?>//

I'm writing from memory, can't check now.


I had already written to ProZ Support asking them to fix the display of greater-than/less-than signs ASAP. I had forgotten about that, so the tags I posted don't show up.

Thanks for the suggestion, I'll try it. Hmm, I get "invalid syntax".

Find: «machine123#4!!»

With: «.+» doesn't work either ...

Ah, this is probably what I need (as suggested in another reply here):

«&»

(Where « stands for the less-than sign and » for the greater-than sign)



There's no need to write to technical support as there is nothing for them to fix. The forum software parses tags in < ... > as HTML. If you want them to show up, use &lt; and &gt; as I did in my post above.
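
If you want to prepare a snippet for posting, the escaping can also be done mechanically. A minimal sketch in Python using the standard library's html.escape; this is just one way to do it, and typing the entities by hand works equally well.

```python
from html import escape

# The expression from earlier in the thread, as it should appear to readers.
pattern = "s/<.+?>//"

# escape() turns < and > into &lt; and &gt; (it also escapes & and quotes).
forum_safe = escape(pattern)

print(forum_safe)  # s/&lt;.+?&gt;//
```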

As to your problem, Transit's regex engine probably doesn't know non-greedy ?. Try the solution I suggested above.


 
FarkasAndras
Local time: 14:58
English to Hungarian
Don't think so Dec 15, 2011

Warlock wrote:

I think it is a little dangerous to replace all tags with nothing based on this regular expression: it may have side effects if your text contains real < or > characters that actually mean "less than" or "greater than"...


The source text seems to be some sort of tagged format, which means that "real" < and > characters will be encoded as character entities (&lt; and &gt;) and won't get caught in the crossfire.
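
A small Python sketch of why that holds, assuming the literal signs really are stored as &lt; and &gt; in the file (the sample string is invented). It also shows why the entities should be decoded only after the tags have been stripped.

```python
import re
from html import unescape

TAG = r"<[^>]*>"

# Hypothetical segment: real tags plus entity-encoded comparison signs.
raw = "<b>Wenn x &lt; 5 und y &gt; 2</b>"

# Strip first: the entities contain no literal < or >, so they survive.
stripped = re.sub(TAG, "", raw)
decoded = unescape(stripped)

# Decode first (the wrong order): the literal signs now look like a tag.
wrong_order = re.sub(TAG, "", unescape(raw))

print(stripped)     # Wenn x &lt; 5 und y &gt; 2
print(decoded)      # Wenn x < 5 und y > 2
print(wrong_order)  # text between the two signs is lost
```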


 
jean-marc aubertin
Local time: 14:58
Member (1970)
English
Displaying < and > on the forum Dec 15, 2011

Hans Lenting wrote:

Warlock wrote:

The regular expression to find tags is: <&>

I think it is a little dangerous to replace all tags with nothing based on this regular expression: it may have side effects if your text contains real < or > characters that actually mean "less than" or "greater than"...

Best regards,

Jean-Marc

[Edited at 2011-12-09 11:31 GMT]


Thanks for this one.

Please tell me, how do you get « and » to show up in your message?

Hans


Hans,

To display less-than and greater-than signs you have to write them as HTML entities (&lt; and &gt;).

Regards,

Jean-Marc


 
msoutopico
Ireland
Local time: 13:58
English to Galician
regexp Mar 28, 2012

For the regexp in Transit NXT, I would use

<#([!>]+)0>

and replace with

<#0>

However, I don't see why you would need to do that in Transit.

Cheers, Manuel

[Edited at 2012-03-28 11:00 GMT]

[Edited at 2012-03-28 11:01 GMT]


 

