Obfuscation: A valid way to protect sensitive data?
Thread poster: Hans Lenting

Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
Aug 18

Some CAT tools (like CafeTran Espresso 10 Croissant) already offer a way to mask sensitive data at the segment level.

Following a recent discussion in this forum, I have been musing over a more thorough way to protect sensitive data at the document level. I'd welcome your opinion on the validity of this approach.

Task:

  • Translate a document of 5,000 words with sensitive data, subject field: legal, finance or patents.
  • Make sure that the originating document cannot be reconstructed.
  • Protect sensitive data.


Approach:

  • Mask all names (company names, street names, names of individual persons etc.) and numbers.
  • Merge the 5,000 words with 95,000 words of similar documents.
  • Sort the new 100,000-word document (e.g. alphabetically, one segment per line).
  • Run the document through a public MT system to create a TM.
  • Use this TM to translate the originating document of 5,000 words.
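
For illustration only, the approach above might be sketched roughly as follows. This is a minimal sketch under stated assumptions: `fake_mt` is a hypothetical stand-in for whatever public MT service is used, the masking is deliberately crude (a caller-supplied name list plus a number pattern), and a plain dictionary stands in for the TM.

```python
import re

# Hypothetical stand-in for a public MT service; a real request would go here.
def fake_mt(segment: str) -> str:
    return "[MT] " + segment

def mask(segment: str, names: list[str]) -> str:
    """Replace known names with NAME and digit groups with NUM (crude masking)."""
    for name in names:
        segment = segment.replace(name, "NAME")
    return re.sub(r"\d+(?:[.,]\d+)*", "NUM", segment)

def build_tm(real: list[str], decoys: list[str], names: list[str]) -> dict[str, str]:
    """Mask the real segments, merge with decoys, sort, and 'translate' everything."""
    masked_real = [mask(s, names) for s in real]
    # Sorting the merged set removes the original document order from the upload.
    upload = sorted(set(masked_real) | set(decoys))
    return {segment: fake_mt(segment) for segment in upload}

def translate(real: list[str], tm: dict[str, str], names: list[str]) -> list[str]:
    """Look up each masked real segment in the TM, in the original order."""
    return [tm[mask(s, names)] for s in real]
```

The names and numbers still have to be put back into the target text afterwards; that step is not shown here.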


Question:

  • What means would the public MT system have of retrieving sensitive data from the 100,000-word document?


(I wouldn't be surprised if there actually is a way to reconstruct the originating document.)


 

Gary Evans  Identity Verified
Germany
Local time: 17:40
Member (2007)
German to English
Come again Aug 18

Hi Hans,

Seems quite complicated to me. Are you suggesting that the text becomes mixed up at the sentence level? Beats me how you can put that soup back together again. It also looks like 95% of the translation is pointless work for the MT.

I'd personally stick to not using MT for highly sensitive translations as others have stated elsewhere here.

Regards,
Gary


Véronique Guider
 

DZiW
Ukraine
English to Russian
+ ...
Fizzy fuzziness Aug 18

First, there's no one-to-one equivalence in translation.

Second, most CATs use "segments", significantly downgrading (=separating) the textual to lexico-grammatical level.

Third, even removing WHO-WHAT (names and numbers) is enough to overgeneralize the independent clauses, let alone garbling WHEN-WHERE-WHY-HOW specifications.

Fourth, while shuffled fragments, choppy, run-on, and loose sentences with excessive subordination and non-parallel structure render any text meaningless, such tricks take more time and effort without much gain for the translator. Why a third-party online(?) exotic(?) MT?

Fifth, a secret meta-language might help to some extent, yet if a perpetrator can access your TM, then how about other papers, intermediary works, and correspondence? It's just not worth it.

IMO


 

Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Less effort than one might assume ... Aug 19

Gary Evans wrote:

Seems quite complicated to me. Are you suggesting that the text becomes mixed up at the sentence level?


Actually, nearly all the steps can be automated.

Beats me how you can put that soup back together again.


The beauty is that you don't need to put the soup back together, since you'll be using the TM to translate the originating document.

It also looks like 95% of the translation is pointless work for the MT.


That's right, but the only relevant factor here is the cost of these extra words.

However, I see a real problem: you can soon run out of fresh 'distraction' documents, since you can upload each sentence to the online MT system only once. After that, the MT system will be able to differentiate between new (real) segments and old (distracting) segments, especially in combination with IP logging and other fingerprinting techniques (which, I'm quite sure, they all use).
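
To make that concern concrete, here is a minimal sketch of how a service could tell fresh segments from recycled ones. It assumes the service keeps a log of normalized segment hashes per client; the `SeenLog` class and its methods are hypothetical and not taken from any real MT system.

```python
import hashlib

def fingerprint(segment: str) -> str:
    """Hash a segment after lowercasing and collapsing whitespace."""
    normalized = " ".join(segment.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class SeenLog:
    """Hypothetical server-side log of everything one client has submitted."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def submit(self, segments: list[str]) -> list[str]:
        """Return the segments not seen in earlier batches, then record this batch."""
        fresh = [s for s in segments if fingerprint(s) not in self._seen]
        self._seen.update(fingerprint(s) for s in segments)
        return fresh
```

With recycled decoys, the second upload exposes exactly the new material: whatever `submit` returns is the real text.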


 

Samuel Murray  Identity Verified
Netherlands
Local time: 17:40
Member (2006)
English to Afrikaans
+ ...
@Hans Aug 19

Hans Lenting wrote:
  • Merge the 5,000 words with 95,000 words of similar documents.


1. My annual Google Translate bill is $100. With your method, it'll be $2000.

2. A malicious machine could use something similar to a plagiarism checker to identify which segments in your "text" are likely from public sources and therefore whatever remains are likely the confidential segments.

3. Even if you then also sort all segments alphabetically, a neural system can conceivably exist to calculate probable original orders of segments.

    Hans Lenting wrote:
  • Mask all names (company names, street names, names of individual persons etc.) and numbers.


You could go further, and alter the 1000 most commonly used adverbs and adjectives. If a CAT tool does this, you can ensure that a specific adverb is not always replaced with the same dummy adverb. I wonder if such a thing is feasible.
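
On the source side, at least, this looks feasible. The following is only a sketch under assumptions: the dummy pool `DUMMIES` and the function names are made up for illustration, and each occurrence gets an independently drawn dummy, with a log kept so the originals can be restored.

```python
import random

# Hypothetical pool of dummy words; a real tool would need a far larger one.
DUMMIES = ["green", "quiet", "narrow", "early"]

def scramble(tokens: list[str], targets: set[str], rng: random.Random):
    """Swap every token found in `targets` for a randomly drawn dummy.

    Returns the new token list and a log of (position, original token).
    """
    out, log = [], []
    for i, token in enumerate(tokens):
        if token.lower() in targets:
            out.append(rng.choice(DUMMIES))  # drawn per occurrence, not per word
            log.append((i, token))
        else:
            out.append(token)
    return out, log

def restore(tokens: list[str], log: list[tuple[int, str]]) -> list[str]:
    """Put the original tokens back at their logged positions."""
    restored = list(tokens)
    for i, original in log:
        restored[i] = original
    return restored
```

The hard part is not shown: after MT, the dummies sit at different positions in the target text and in another language, so restoring them there would require some form of word alignment.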


    DZiW
     

    Hans Lenting  Identity Verified
    Netherlands
    Member (2006)
    German to Dutch
    TOPIC STARTER
    Some answers Aug 20

    Samuel Murray wrote:

    Hans Lenting wrote:
  • Merge the 5,000 words with 95,000 words of similar documents.


    1. My annual Google Translate bill is $100. With your method, it'll be $2000.


    The 95,000 words was just an arbitrary number. But indeed, your costs will increase.

    2. A malicious machine could use something similar to a plagiarism checker to identify which segments in your "text" are likely from public sources and therefore whatever remains are likely the confidential segments.


    True. That would indeed be a likely scenario.

    3. Even if you then also sort all segments alphabetically, a neural system can conceivably exist to calculate probable original orders of segments.


    When I posted my first message in this thread, I was indeed aware of this possibility. But how would this work?
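
One naive way to picture it, offered purely as a toy: chain the segments greedily so that each one is followed by the remaining segment sharing the most vocabulary with it. This is not a neural system, and the function names are made up; a real attacker would presumably use a trained coherence model and would not need to be told the first segment.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two segments (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def greedy_chain(segments: list[str], start: int = 0) -> list[str]:
    """Order segments by repeatedly appending the most similar remaining one."""
    remaining = list(segments)
    chain = [remaining.pop(start)]
    while remaining:
        best = max(remaining, key=lambda s: jaccard(chain[-1], s))
        remaining.remove(best)
        chain.append(best)
    return chain
```

Legal and patent texts repeat defined terms from one sentence to the next, which is what makes even such a crude overlap measure informative.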

    Hans Lenting wrote:
  • Mask all names (company names, street names, names of individual persons etc.) and numbers.


    Samuel Murray wrote:
    You could go further, and alter the 1000 most commonly used adverbs and adjectives.


    I cannot see the point of doing that ...

    Anyway, let's wait for someone with a better idea.


     

    DZiW
    Ukraine
    English to Russian
    + ...
    Timestamps/authors/synonyms Aug 21

    Hans, are you talking about local (offline) TMs or shared/online ones?

    If the former, there's no use to get on all fours barefooted, showing one's flexibility to please numerous two-centers and spongers. Provided that a hacker can access the TMs, it's no big deal for them to uncover the original communication, shadow copies, invoices, temporary files, whatever.

    If the latter, it's still about local security policy and individual habits/practices of every participant--including the servers.

    All in all, what about clients' and agencies' practices? How could they prove that their protection is adequate and that no sensitive data can leak on their side, I wonder?


    While a targeted attack or a custom order is still possible, no real malefactor would even consider hacking your PC in particular to get TMs, unless (1) it's far too easy and (2) he knows what big fish you deal with.


     

    Hans Lenting  Identity Verified
    Netherlands
    Member (2006)
    German to Dutch
    TOPIC STARTER
    Late answer Aug 24

    DZiW wrote:

    Hans, are you talking about local (offline) TMs or shared/online ones?


    Sorry for my late answer, but I'm talking about TMs created by MT systems. My CAT tool (CafeTran Espresso 10 Croissant) allows creation of TMs by running all segments of a translation project through MT systems. This is handy, for example, when you want to use MT but know that you won't have internet access later (during a flight, in the bush, etc.).


     

    Samuel Murray  Identity Verified
    Netherlands
    Local time: 17:40
    Member (2006)
    English to Afrikaans
    + ...
    On CAT tools creating MT'd TMs Aug 24

    Hans Lenting wrote:
    My CAT tool (CafeTran Espresso 10 Croissant) allows creation of TMs by running all segments of a translation project through MT systems.


    My CAT tool (WFC) doesn't have that feature but I accomplish it in 5 minutes using a combination of its features plus a little AutoIt script. There is a new feature in Wordfast Pro 5 which allows for MT to be used during pre-translation, so I suppose one could use that as well (then extract the TM from the TXLF file). Trados has an option to "use automation" during pre-translation, but I couldn't get it to work.


     

