DGT translation memories (CAT Tools Technical Help)

Topic starter: Dominique Pivard

Dominique Pivard

Local time: 21:46
Finnish to French

Feb 6, 2013

Old news already, but here is how to create DGT TM's in any language pair (23 EU languages available):

http://wordfast.fi/blog/cat-tools/2013/02/06/how-to-create-dgt-translation-memories/
or
http://youtu.be/GNj07W2ZqhQ?hd=1

The sample Finnish-Slovenian TMX used in the video has more than 2 million translation units (though probably lots of duplicates).
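For anyone who wants to check how many of those units really are duplicates, here is a minimal Python sketch. It assumes a TMX 1.4-style layout (each `<tu>` holding `<tuv xml:lang="...">` variants with one `<seg>` each) and treats two units as duplicates when their languages and segment texts all match. The name `count_duplicate_tus` is made up for illustration and is not part of any CAT tool.

```python
import hashlib
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_duplicate_tus(tmx_path):
    """Return (total, unique) counts of <tu> elements in a TMX file.

    Two units count as the same when every (language, segment text) pair matches.
    """
    seen = set()
    total = 0
    # iterparse streams the file, so a multi-million-unit TMX is not loaded at once.
    for _, elem in ET.iterparse(tmx_path, events=("end",)):
        if elem.tag != "tu":
            continue
        total += 1
        parts = sorted(
            (tuv.get(XML_LANG, ""), "".join(tuv.find("seg").itertext()))
            for tuv in elem.iter("tuv")
            if tuv.find("seg") is not None
        )
        # Store a fixed-size digest instead of the full text to limit memory use.
        seen.add(hashlib.sha1(repr(parts).encode("utf-8")).digest())
        elem.clear()  # release the unit's children once it has been counted
    return total, len(seen)
```

Calling `count_duplicate_tus("fi-sl.tmx")` (a hypothetical file name) returns the total and the unique count; the difference is the number of exact duplicates.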

Tomás Cano Binder, BA, CT

Spain
Local time: 20:46
Member (2005)
English to Spanish
+ ...

2 million in English-Spanish

Feb 6, 2013

Thanks a lot, Dominique! This is great information and I appreciate it. I downloaded the package and made an English-Spanish TM. It contains nearly 2 million segments as well. I also plan to make my other main pair, German-Spanish.

I am keeping this memory as background information for EU-related translations on our memoQ server here.

The extraction took just over an hour on my machine.

Dominique Pivard

Local time: 21:46
Finnish to French

TOPIC STARTER

memoQ

Feb 6, 2013

Tomás Cano Binder, CT wrote:
The extraction took just over an hour on my machine.

You must have a much faster computer than the one I used (a three-year-old laptop with an AMD processor and 4 GB of RAM)!

Let me know how the import into memoQ goes, because when I tried, I wasn't able to complete it. Not a problem for me, because I'm searching the DGT (and other very large TM's) with dtSearch, but I think memoQ (and probably several other tools) may have problems dealing with TM's that big.

Meta Arkadia
Local time: 02:46
English to Indonesian
+ ...

No problems, and problems

Feb 6, 2013

Dominique Pivard wrote:
but I think memoQ (and probably several other tools) may have problems dealing with TM's that big.

I use the DGT for GER>DUT as one of three TMs in CafeTran without problems. I assigned 6 GB of RAM to Java, and DGT "pre-translates" (another strange "Igor term" which means auto-assemble) as a database with a low priority. I also set it to Read Only.
Searching within DGT provides instant results.

Not that I don't have problems, though: http://www.proz.com/forum/apple_mac_operating_systems/242687-automated_search_help_needed.html but they have nothing to do with DGT, and everything with searching in databases.

Cheers,

Hans

Michael Beijer

United Kingdom
Local time: 19:46
Member
Dutch to English
+ ...

re: importing large (DGT) TMXs into memoQ

Feb 6, 2013

In order to get the really big ones into memoQ you need to cut them in half in a text editor and import them in two goes. For really big files, I recommend EmEditor (which can handle sizes that even UltraEdit can't).

Michael

http://www.emeditor.com/

[Edited at 2013-02-06 10:06 GMT]
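If you would rather not cut the file by hand, the same split can be scripted. Below is a rough Python sketch, not a tested recipe for any particular tool: it assumes a well-formed TMX with one `<header>` and a `<body>` of `<tu>` elements, copies the header into every part, and writes at most `tus_per_part` units per file. `split_tmx`, `out_prefix` and `tus_per_part` are names invented for this example.

```python
import xml.etree.ElementTree as ET


def split_tmx(src_path, out_prefix, tus_per_part):
    """Stream src_path and write parts holding at most tus_per_part <tu> each.

    Returns the list of part file paths, in order.
    """
    paths = []
    batch = []
    header_xml = ""
    tmx_version = "1.4"  # fallback if the source <tmx> has no version attribute
    body = None

    def flush():
        # Each part is a complete TMX document: same header, a slice of the body.
        path = f"{out_prefix}_{len(paths) + 1}.tmx"
        with open(path, "w", encoding="utf-8") as out:
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            out.write(f'<tmx version="{tmx_version}">\n{header_xml}\n<body>\n')
            out.writelines(batch)
            out.write("</body>\n</tmx>\n")
        paths.append(path)
        batch.clear()

    for event, elem in ET.iterparse(src_path, events=("start", "end")):
        if event == "start":
            if elem.tag == "tmx":
                tmx_version = elem.get("version", tmx_version)
            elif elem.tag == "body":
                body = elem
        elif elem.tag == "header":
            header_xml = ET.tostring(elem, encoding="unicode").strip()
        elif elem.tag == "tu":
            batch.append(ET.tostring(elem, encoding="unicode").strip() + "\n")
            body.clear()  # drop already-serialized units so memory stays flat
            if len(batch) == tus_per_part:
                flush()
    if batch:
        flush()
    return paths
```

For a file of about 2 million units, `split_tmx("en-es.tmx", "en-es_part", 1_000_000)` (hypothetical file names) would give two halves to import one after the other. Smaller parts work the same way if a tool prefers several short imports.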

Dominique Pivard

Local time: 21:46
Finnish to French

TOPIC STARTER

Other tools?

Feb 6, 2013

Michael Beijer wrote:
In order to get the really big ones into memoQ you need to cut them in half in a text editor and import them in two goes. For really big files, I recommend EmEditor (which can handle sizes that even UltraEdit can't).

Yes, now that you mention it, I remember you talked about splitting the TMX before importing into memoQ. Do you remember how long it took to import each half? Did you import the 2nd half into the same TM as the 1st half, or to a separate memoQ TM? Do you find you get useful LSC hits from the DGT TM's?

Have you tried importing the DGT TMX into other tools, e.g. DVX2 or Studio 2011? If so, how long did it take (for instance compared to memoQ)?

Stanislav Pokorny

Czech Republic
Local time: 20:46
English to Czech
+ ...

Studio positive

Feb 6, 2013

Dominique Pivard wrote:
Have you tried importing the DGT TMX into other tools, e.g. DVX2 or Studio 2011?

No problems with Studio; took about four hours on an older i3 2.5 GHz machine with 3 GB RAM.

FarkasAndras

Local time: 20:46
English to Hungarian
+ ...

Studio struggles above 2M

Feb 6, 2013

Stanislav Pokorny wrote:

Dominique Pivard wrote:
Have you tried importing the DGT TMX into other tools, e.g. DVX2 or Studio 2011?

No problems with Studio; took about four hours on an older i3 2.5 GHz machine with 3 GB RAM.

I've done a couple of tests with Studio (2009 only). It slows down as the number of segments goes up. So it might do 100,000 segments in 2 minutes and 1 million segments in an hour and a half (random figure), but it will take six hours to import two million. In my experience, about two million is the upper limit. I tried to import 6 million TUs once, and killed it after sixteen hours. It was not even halfway done IIRC. Maybe 2011 brought improvements in this regard; I will test it soon.
I'm not sure if lookup performance is better with multiple smaller TMs, but I suspect it might be.
In any case, the size of the DGT-TM is right about where Studio starts to crap out.

I asked about this in a separate thread here: http://www.proz.com/forum/cat_tools_technical_help/237113-very_large_tms_~10_million_tu.html
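To make the slowdown concrete, here is the arithmetic on the figures quoted in this post, as a small Python sketch. The numbers are the poster's illustrative ones (one of them is explicitly a random figure), not measurements, and `segments_per_minute` is just a helper name for this example.

```python
# (segments imported, minutes taken) as quoted in the post; illustrative only.
figures = [(100_000, 2), (1_000_000, 90), (2_000_000, 360)]


def segments_per_minute(segments, minutes):
    """Average import throughput over the whole run."""
    return segments / minutes


rates = [round(segments_per_minute(s, m)) for s, m in figures]
print(rates)  # [50000, 11111, 5556]
```

On those figures the average throughput at two million segments is roughly a ninth of what it is at 100,000, which is the kind of drop that makes splitting a large TMX into smaller imports attractive.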

Michael Beijer

United Kingdom
Local time: 19:46
Member
Dutch to English
+ ...

as far as I can remember...

Feb 6, 2013

Dominique Pivard wrote:

Yes, now that you mention it, I remember you talked about splitting the TMX before importing into memoQ. Do you remember how long it took to import each half? Did you import the 2nd half into the same TM as the 1st half, or to a separate memoQ TM? Do you find you get useful LSC hits from the DGT TM's?

Have you tried importing the DGT TMX into other tools, e.g. DVX2 or Studio 2011? If so, how long did it take (for instance compared to memoQ)?

Hi Dominique,

1. I can't remember exactly how long it took for each half (of 330MB), maybe around 40 minutes or so each (on a 64-bit desktop with a 3.07 GHz i7, 16GB of RAM and an SSD).

2. I imported the 2nd half into the same TM as the 1st half.

3. I have LSC (longest substring concordance) switched off. I find it never has anything useful to report. Incidentally, I also have Predictive Typing & AutoPick (and the Muse) switched off, as I find they just get in my way when translating.

4. I tried importing it into Déjà Vu X2, but gave up after 8 hours.

Michael

[Edited at 2013-02-06 13:54 GMT]

Grzegorz Gryc

Local time: 20:46
French to Polish
+ ...

DVX

Feb 6, 2013

Michael Beijer wrote:

4. I tried importing it into Déjà Vu X2, but gave up after 8 hours.

As a big DVX fan, I can only confirm that a large TMX import in DVX is a PITA.

Now, after some months, I don't remember exactly, but the header of the DGT TMX is/was incorrect and the file can't be imported "as is"; it was necessary to edit it in a decent text editor.
A good practice is to import a smaller TMX, compact the DVMDB, then import another smaller TMX, compact the DVMDB, etc.

Cheers
GG

Meta Arkadia
Local time: 02:46
English to Indonesian
+ ...

A very short screencast of DGT GER-DUT

Feb 7, 2013

in CafeTran. I go to the next segment, DGT (and my other TMs and glossaries) Auto-Assembles. Next, I select a word to search in DGT (and other resources).
The screencast is short because, er, CT doesn't take much time to arrive at the desired results…

http://www.screencast.com/t/E4IDfKcMueF

Cheers,

Hans

trhanslator (X)

No indexing?

Feb 7, 2013

Do you mean that CafeTran with no indexing of the TM is that fast?

How about opening the TMX file, how many hours did that take?

Meta Arkadia
Local time: 02:46
English to Indonesian
+ ...

Seconds

Feb 7, 2013

trhanslator wrote:
Do you mean that CafeTran with no indexing of the TM is that fast?

Well, yes. But it's set to "pre-translate", and that explains the very fast auto-assemble results. However, searching within the DGT is fast as well, as you can see in my miserably short screencast.

How about opening the TMX file, how many hours did that take?

Seconds. With 6 GB of RAM assigned to Java. And CafeTran loads TMs in RAM.

Cheers,

Hans

Michael Beijer

United Kingdom
Local time: 19:46
Member
Dutch to English
+ ...

@Hans (Meta Arkadia):

Feb 7, 2013

And how about the number of TMs that CafeTran can access in a project simultaneously? In memoQ I have around 8,000,000 segments across all of my connected TMs and experience no slowdowns. How does this work in CT?

Michael

[Edited at 2013-02-07 11:06 GMT]

Meta Arkadia
Local time: 02:46
English to Indonesian
+ ...

Eight million? WOW!

Feb 7, 2013

Michael Beijer wrote:
And how about the number of TMs that CafeTran can access in a project simultaneously?

I never tried more than three TMs (.tmx) and two glossaries (tab delimited .txt) at the same time, Michael. And that doesn't present any problems. However, the total number of TUs never came close to 8 million. I don't think I can even try it, because I probably don't have that number of TUs in one language pair. I hope somebody else can answer your question.

Cheers,

Hans
