Using Google's Translation Service for Foreign Internet Analysis

by

OSIN

While working on my next big project I got burned out and needed a break. So this project was ideal because it was quick, simple, and yielded some interesting results. The question in my mind was this: could one do Internet reconnaissance against a foreign country's websites, even without speaking their language? I started thinking about this and eventually my mind turned to Google's translation service. This is actually a wonderful tool which will translate your words, in this case English, into the target language. Then you can search Google for pages which contain the foreign language translation of the words you want to search for, and translate some of those pages back into your language. The translation is far from perfect, but you can still glean information if you apply some creative thinking to what is returned. In this project I will be more concerned with translating pages from the target language into English. Currently, Google supports these target languages: English, German, Spanish, French, Italian, Portuguese, Japanese (beta), Korean (beta), and Simplified Chinese (beta). Let's use an example.

China is an up-and-coming Super Power wannabe, so let's see what they've been up to lately. Go to http://translate.google.com/translate_t and in the second text box enter the words "Chinese air force base" (don't use the quotes though), choose "English to Chinese" in the dropdown box below that, then click the "Translate" button. What you will get in the top text area is the closest Chinese translation of your English words. Now click on the "Google Search" button near the top. Google returns those pages in Chinese that contain those Chinese characters. As a side note, Windows users will have to install the Chinese language pack in order to see the correct characters. Linux users running Mozilla shouldn't have any problem, but depending on your distribution, you may have to install it.
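
Under the hood, that translation is just an HTTP GET request, which is what makes the automation later in this article possible. As a preview, here is roughly the same request made with wget (the URL format mirrors the one used in the script below, and zh-CN is Google's code for Simplified Chinese):

/usr/bin/wget --user-agent=Mozilla -O trans.html "http://translate.google.com/translate_t?text=Chinese+air+force+base&langpair=en%7Czh-CN&hl=en&ie=UTF8"

Note that the spaces in the phrase become '+' signs and the pipe character in the langpair value is URL-encoded as %7C.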

Anyway, you'll notice from the pages that Google brings up that there is a "Translate this page" link over to the right. If you click on that link, you'll get Google's English translation of that Chinese page, but I warn you, it's not for the faint of heart. That's why the English to Chinese choice is marked beta. If you go back to the page which displayed the Chinese results, you may notice that the more often those Chinese characters are found in a page, the more relevant that page might be to your actual intended search. Google even highlights the found characters in red, sometimes showing the actual grouping you're searching for. One interesting link I found this way was about satellite photos of Taiwanese Air Force Bases. But this service is not without its problems. One problem is that if you try to click on one of the links on the translated page, you get an error. I could only see them by going to the original page, copying the link, and using Google's "Translate this web page" option at the bottom of their translation page. I also saw pages that weren't translated at all, or only partially, so you're bound to see that as well. But even with those problems this tool can be helpful in many ways.

Automating the Process

Let's take this idea even further and automate the process so that we can run searches automatically while doing other stuff. In this next example I will be using German as the target language. We'll come back to Chinese later, after I explain the special circumstances of using this procedure on the Asian languages. Because I feel comfortable doing certain actions in different programming languages, this automation will consist of a combination of shell, java, and perl. People far more knowledgeable in programming should be able to produce better scripts than I can, but this is only one way to automate the process. I'll also be using other tools such as wget, privoxy, and tor to fetch the web pages. Here is the flow of the script:

1. Since we'll be using the Tor system, we need to create a .wgetrc file and define our proxy settings (a sample .wgetrc is shown after this list).
2. Open a text file of English phrases that we want to translate into German.
3. For each of those phrases the following will be performed:

	a. Use wget to call the Google page responsible for translating that English phrase to German and save the output to an html file.
	b. Open the previously saved html file and parse out the foreign language words.
	c. Use Google to search for web pages containing that foreign phrase and save the search results to an html file.
	d. Open that file and, for each of the results returned, parse out the link for the "Translate this page" item and write those links to a file.
	e. Using wget, visit each of the parsed out links and save each page to an html file for later analysis.
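
A quick note on step 1: with the example command shown later, the .wgetrc the script writes contains just the first line below. If wget ignores the proxy on your system, you may also need the use_proxy setting, which is my addition and not something the script writes:

http_proxy=127.0.0.1:8118
use_proxy = on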

1. The main script is written in perl. It takes three input arguments: the proxy:port setting, your native language, and the target language. Here is the code for the first part of the perl script, which I've named translate.pl:

#!/usr/bin/perl
#Usage: translate.pl proxy:port native_lang target_lang

#define some variables
#The LANG values must correspond to what Google uses as its language definitions
# To date, these are the languages supported:
# English - en
# Chinese (Simplified) - zh-CN
# Korean - ko
# Japanese - ja
# German - de
# Spanish - es
# French - fr
# Italian - it
# Portuguese - pt

use utf8;

$proxy=$ARGV[0];
$NATIVE_LANG=$ARGV[1];
$TARGET_LANG=$ARGV[2];

print "Setting proxy to $proxy\n";

open(PROXY, ">$ENV{HOME}/.wgetrc") or die "Can't write .wgetrc: $!"; #writes to the home dir of whatever user you are running as
print PROXY "http_proxy=$proxy\n";
close(PROXY);

As an example, if I'm using the Tor system and I want to translate from English to German, I would run this command: ./translate.pl 127.0.0.1:8118 en de

2. Now, we will open a file called list.txt and grab our English phrases that we want to translate into German. For this example I will use these:

prison+system
electrical+grid
German+air+force+base

Note the '+' signs that replace the spaces. Those are critical. Now, we open the file and set a variable called 'count' to 1:

open(FILE, "list.txt") or die "Can't open list.txt: $!";
$count=1;

3a. This part uses wget to call the correct translate page and save it to a file called trans.html:
while(<FILE>) {

        chomp(); #strip the trailing newline

        $nativewords=$_;

        $cmd1="/usr/bin/wget --user-agent=Mozilla -O trans\.html http://translate\.google\.com/translate_t?text=$nativewords\\&langpair=$NATIVE_LANG%7C$TARGET_LANG\\&hl=$NATIVE_LANG\\&ie=UTF8";

        $result1=`$cmd1`;

3b. At this point I run a specialized shell script which runs another perl script and a java application to parse the translated text out of trans.html:


$translatedtext=`/bin/sh grabencodedlines.sh`;
print "Translated text is $translatedtext\n";
Here is the code for grabencodedlines.sh:


#!/bin/sh

finaltext=`perl grabencodedlines.pl`;

CMD=`java parsetext 'wrap=PHYSICAL>' '</textarea>' "$finaltext" '1'`

echo "$CMD";

The source code for grabencodedlines.pl is located here. The java source code is here. Because wget doesn't handle utf8 very well, you will not see the correct characters if you just open that file, so grabencodedlines.pl re-encodes the trans.html file to the correct charset. Then the java application pulls out the translated text that falls between two HTML markers. The '1' tells the java application that we want only the first occurrence it finds. We will use this application again later, when we pull out the "Translate this page" links, by passing a zero as the 4th argument.
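
If you would rather not deal with the java class, the parsing logic is simple enough to sketch in perl. This is my own illustration of what parsetext does, based on how it is called above, not the actual source:

sub parsetext {
        my ($start, $end, $text, $first_only) = @_;
        my @matches;
        #grab everything between the start and end markers, non-greedy
        while ($text =~ /\Q$start\E(.*?)\Q$end\E/gs) {
                push @matches, $1;
                last if $first_only; #a 4th argument of '1' means stop at the first occurrence
        }
        return @matches;
}

#first occurrence only, as in grabencodedlines.sh
#$finaltext holds the re-encoded contents of trans.html
($translatedtext) = parsetext('wrap=PHYSICAL>', '</textarea>', $finaltext, 1);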

3c. Continuing with the code for translate.pl, we now call Google again to search for web pages with the translated German text:


$cmd4="/usr/bin/wget --user-agent=Mozilla -O hits\.html http://www\.google\.com/search?q='$translatedtext'\\&hl=$NATIVE_LANG\\&ie=UTF8\\&oe=UTF8\\&num=10";

$result4=`$cmd4`;

3d. Note that in this example I am only grabbing 10 results; the maximum number of search results per page Google allows is 100. At this point we open up the hits.html file that contains Google's search results for the German text, parse out the links, and dump them to a file:


#call the java class again to pull out the Translate this page URLs
        $cmd5="/bin/sh grablinks.sh";
        $result5=`$cmd5`;

        #output the parsed out html links to a file called translatedlinks.txt
        $cmd6="echo '$result5' > translatedlinks.txt";
        $result6=`$cmd6`;

This is the code for grablinks.sh:


#!/bin/sh

finaltext=`cat hits.html`

CMD=`java parsetext ' - [ <a href=' ' class=fl>Translate this page' "$finaltext" '0'`

echo "$CMD";

Keep in mind that the spaces inside those quoted marker strings are critical!

3e. And finally, we will use wget to visit those translated pages. But first, because Google uses frames to display its results, we must parse each of those links and rewrite it to call the translated page directly (translate_c instead of translate):

        open(OUT, ">tmp.txt") or die "Can't write tmp.txt: $!";
        open(SPFILE, "translatedlinks.txt") or die "Can't open translatedlinks.txt: $!";
        while (<SPFILE>){
                chomp();
                #rewrite the framed link to call the translated page directly
                s/translate\?/translate_c\?/g;
                print OUT "$_\n";
        }
        close(SPFILE);
        close(OUT);

        #so now, replace the old links file with the new
        $newcmd2=`/bin/mv tmp.txt translatedlinks$count.txt`;

        #use wget to get and save the pages.
        open(FILE2, "translatedlinks$count.txt") or die "Can't open translatedlinks$count.txt: $!";
        $link=1;
        while(<FILE2>) {
                chomp();
                $cmd8="/usr/bin/wget --user-agent=Mozilla -O finalresults$count$link.html '$_'";
                $result8=`$cmd8`;
                $link++;
        }
        close(FILE2);
$count++;
}
#end of main while loop

This process is by no means foolproof, but it does get results. I have found the biggest problem is missing pages or DNS resolution failing somewhere along the Tor network. You can find the entire code for translate.pl here.

However, the script runs into problems if you try to use it against the three Asian languages that Google now supports. The only way I was able to get the process to work was to manually translate each English phrase via Google's translate tool, then copy and paste the characters into a file called "asian.txt", one phrase per line. There's something in the character encoding returned to wget that does not generate the correct codes, even though it does in Mozilla. This is a minor problem and is easily corrected in the script. An Asian version of the script can be found here.
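
If you go that route, make sure the script reads asian.txt as utf8 so the characters survive the trip into the wget URL. Here is a minimal sketch of that step; the percent-encoding is my own addition and assumes the file was saved as utf8:

use Encode;

open(ASIAN, "<:utf8", "asian.txt") or die "Can't open asian.txt: $!";
while (<ASIAN>) {
        chomp();
        #percent-encode the raw utf8 bytes so they are safe inside the search URL
        $encoded = join '', map { sprintf '%%%02X', $_ } unpack('C*', encode('utf8', $_));
        print "$encoded\n";
}
close(ASIAN);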

Other variations on this project would be to add a random value to the searches so that nothing suspicious shows up in Google's logs. Also, you might want to try combining the wget URL calls with other Google URL values such as "site" and "filetype". This is not meant to be an exhaustive discussion of how Google's URLs are structured, because that in itself would fill a book.
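
As one example of the latter, the search call from step 3c could be narrowed to German .de sites and pdf files like this (site: and filetype: are standard Google query operators; depending on your setup you may need to encode the colons as %3A):

$cmd4="/usr/bin/wget --user-agent=Mozilla -O hits\.html http://www\.google\.com/search?q='$translatedtext+site:de+filetype:pdf'\\&hl=$NATIVE_LANG\\&ie=UTF8\\&oe=UTF8\\&num=10";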