One very nice feature of a CAT tool is the ability to pretranslate a text and find its most frequent segments. To avoid inconsistent translations, one can export and translate these segments before actually translating the documents themselves. This speeds up the translation process and ensures that frequent segments are translated consistently.

How about something similar for term bases? Adding terms while working through the document is always possible, but it is rather distracting. To do a proper job one has to concentrate on the word alone, maybe check a dictionary or two, or ask friends. That usually means one enters some placeholder at those places and then forgets to tackle them later.

Here's a simple method to avoid this. It involves a text processing program – for example Microsoft Word – and some functionality from Microsoft Excel, specifically its pivot table.

Cutting up the source into single words

What we are ultimately looking for is a list (and, as a consequence, a dictionary) of the words present in the text to be processed. A word of caution: it always pays to make a copy of what you are working on.

The first step is relatively simple: to break the text into single words, one replaces all blanks and tabs with paragraph marks (carriage return/line feed). In Word this is achieved by replacing a blank with ^p. You would do the same replacement for other kinds of separators, such as tabs, commas, colons etc. After the global replace, your original text should have become a series of lines, each consisting of a single word delimited only by carriage returns.
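For readers who prefer a script to Word's replace dialog, the same splitting step might be sketched in Python as follows. This is only an illustration, not part of the Word/Excel method: the function name one_word_per_line and the exact set of separators (whitespace, commas, semicolons, colons) are assumptions. Unlike a plain replace in Word, it collapses runs of separators, so no empty lines are produced.

```python
import re


def one_word_per_line(text: str) -> str:
    """Return the text with one word per line.

    Runs of whitespace, commas, semicolons and colons are treated as
    separators (an assumed list; extend the character class as needed).
    """
    # re.split leaves empty strings at the edges when the text starts or
    # ends with a separator, so those are filtered out.
    words = [w for w in re.split(r"[\s,;:]+", text) if w]
    return "\n".join(words)


# Hypothetical sample input
sample = "It was the best of times, it was the worst of times,"
print(one_word_per_line(sample))
```

Other punctuation, such as a full stop at the end of a sentence, stays attached to its word, just as it does after the replacements in Word.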

Let us take the first paragraph of A Tale of Two Cities:

It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going direct
the other way--in short, the period was so far like the present
period, that some of its noisiest authorities insisted on
its being received, for good or for evil, in the superlative degree
of comparison only.

Applying the suggested change to one word per line, the text is converted to the following (sparing you most of the 119 words):

It
was
the
best
……
of
comparison
only.

There may be some exotic single cases like the combination "way--in" above, which in my Gutenberg version of Dickens' text was missing the blanks. As the rule of this game is "Do less well", don't bother.

Now copy this text to the clipboard (Ctrl+A, then Ctrl+C) and start Excel.

Creating vocabulary and word frequencies

According to Vicipedia, vocabulary est verba et translationem verborum in linguas alias docens – which means it tells you about words and their translations into other languages. We are not that far yet, as we need the words first, and here is where Excel comes in handy: it reduces all repeated words to a single occurrence each and, on top of that, shows us how common they are.

To get this list, you will need the services of a pivot table. I will assume you have some experience with them, so I hope the following description is sufficient, if not superfluous. With Excel open and our one-word-per-line text copied to the clipboard:

select one of the worksheets, make sure it is empty, and enter "words" into A1
activate cell A2
press Ctrl+V to paste in the text you copied to the clipboard earlier
select the complete column A
in the Data menu

select the pivot table command
press "Next" once in the first window
press "Next" to confirm column A as the selected data
press "Layout"

You should now see the layout of the pivot table and, somewhere at 2 o'clock, the "words" field

drag the "words" rectangle to the "Row" area
drag it to "Data" as well – it changes to "Count of words"
press "OK", then "Finish" in the next window

A new worksheet appears, containing the distinct words from your text and their frequencies, i.e. how often they occurred in the original text. In the case of Charles Dickens' A Tale of Two Cities, the top of this list looks like this:

Count of words
words	result
so	1
age	2
all	2
authorities	1
before	2
being	1
belief	1
best	1

The program found 57 different words in the text, so evidently some of them turn up more than once. Ordering the pivot table by "result" (by copying its contents and sorting the copy in decreasing order of "result") shows the following:

the	14
of	12
was	11
It	10
we	4

which is what one would expect, and these words do not need to be translated: typing "es" outright in German, for instance, is of course faster than using a term base to look up "it".
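The counting that the pivot table does here can also be reproduced with a few lines of Python, which may be handy when Excel is not available. The sketch below is an illustration only: the function name word_frequencies is hypothetical, the words are lower-cased so that "It" and "it" are counted together as in the table above, and the separators are assumed to be whitespace, commas, semicolons and colons.

```python
import re
from collections import Counter

PARAGRAPH = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going direct
the other way--in short, the period was so far like the present
period, that some of its noisiest authorities insisted on
its being received, for good or for evil, in the superlative degree
of comparison only."""


def word_frequencies(text: str) -> Counter:
    """Count how often each word occurs, ignoring upper/lower case."""
    # Split on the assumed separators and drop empty strings.
    words = [w.lower() for w in re.split(r"[\s,;:]+", text) if w]
    return Counter(words)


freq = word_frequencies(PARAGRAPH)

# Print the five most frequent words, tab-separated like the pivot table.
for word, count in freq.most_common(5):
    print(f"{word}\t{count}")
```

For this paragraph the five most frequent words and their counts agree with the sorted pivot table above, except that "it" appears in lower case.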

Harvesting – a real case

Here's a real example: 1500+ words of an MSDS (material safety data sheet) text, with the usual suspects at the top:

and	47
to	46
the	39
in	38
of	33
be	33
with	31
or	20

… and then, here and there, some words we will be pleased to add to our term base:

water	17
material	11
reaction	9
diisocyanate	8
respiratory	7
isocyanate	6
heat	6
carbon	6
avoid	5
polyol	5
dioxide	5
container	5
pressure	5

Conclusion

One can of course build a whole machinery around this simple solution, adding for instance:

i) exclusion tables – "ignore the words and, it, the…"

ii) exclusion rules – "ignore words shorter than…"

iii) an automatic search for translations
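As a rough illustration of points i) and ii), an exclusion table and a minimum-length rule could be applied to the word counts like this. Everything named here is hypothetical: the STOP_WORDS set, the MIN_LENGTH threshold of 4 and the function term_candidates are assumptions chosen for the example, and the counts are a handful of entries taken from the MSDS lists above.

```python
from collections import Counter

# Hypothetical exclusion table: words that never belong in a term base.
STOP_WORDS = {"and", "it", "the", "to", "in", "of", "be", "with", "or"}

# Hypothetical exclusion rule: ignore words shorter than this.
MIN_LENGTH = 4


def term_candidates(freq, stop_words=STOP_WORDS, min_length=MIN_LENGTH):
    """Return (word, count) pairs that survive both exclusions,
    ordered from most to least frequent."""
    kept = {
        word: count
        for word, count in freq.items()
        if word.lower() not in stop_words and len(word) >= min_length
    }
    return Counter(kept).most_common()


# A few counts copied from the MSDS example above.
counts = Counter({"and": 47, "to": 46, "the": 39, "or": 20,
                  "water": 17, "material": 11, "heat": 6})

for word, count in term_candidates(counts):
    print(f"{word}\t{count}")
```

With these sample counts only "water", "material" and "heat" remain. The threshold is a trade-off: a higher minimum length removes more noise, but also short technical terms.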

However, just by taking care of the above table (with "water", "material" etc.) we are 95 pretranslated word occurrences richer.

Not bad for a ten-minute job.
