Data Mining with Perl

by Luckycan

The idea of mining the web has been a popular topic of study since the first search engines were designed.

Data miners use specialized programs to download web pages and then extract data from them for later use.  The successes achieved by Google are probably the best example of how data mining can be profitable and productive.  The founders of Google have done extensive research in the field of data mining and have all sorts of neat little tricks to make their search engine work as well as it does.  The algorithms used by Google take advantage of mathematics that require more than a high school education to understand and are probably not suitable for use in personal projects.

If you want to do some data mining, you basically have two options.

One is to write your own custom C program, complete with low-level socket code and code to speak the HTTP protocol.  The other is to use some quick and dirty method that only takes an hour of your time to perfect and fits neatly into your busy schedule of feeding receipt printer paper into the shredder at work and looking at girls on the Internet.

This article will give a brief introduction to the LWP modules for use with Perl.

LWP allows you to program a quick-and-dirty data miner so you can, for example, grab 1200 drink recipes from some bartending site in only five minutes instead of spending two hours clicking "Next -> Copy -> Paste" (I'm glad I spend my time in such a productive manner).

If you are not familiar with Perl, then I suggest you become familiar with it.  Perl is by far one of the coolest languages you will ever learn, and can be used for almost everything.

To download Perl, go to ActiveState (www.activestate.com), and download the LWP modules from www.cpan.org.

This article will not describe how to install this stuff.  I think you can figure it out.  It will also not go into great detail about how to extract the data from the pages you receive.  This article is aimed at being an introduction to using LWP to acquire data from the web.

A brief introduction to the HTTP protocol will make it a little easier to understand what is really going on when you start hacking away.  You have two main methods of requesting data: GET and POST.  In both, you are asking the web server for a specific web page.  If the page is static, the server just returns the content of the page.  But if the page is dynamic, variables need to be passed to the server (sometimes by cookies) so it can dynamically generate the content.

The difference between GET and POST lies in the way variables are passed to the server.  With a GET request the variables are encoded and passed through the query string.

http://www.somesite.com?var1=val1&var2=val2

When you do a POST request, the variables and values cannot be seen in the query string; instead they are sent to the server as the content of the request.  The server-side program can then read that content from STDIN.

It is not important for our purposes to understand all the ins and outs of the server side.  It is important, however, to understand the difference between POST and GET requests when you are trying to figure out what you need to tell a server, and how you need to tell it, in order to get data back.
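To make the difference concrete, here is roughly what the two kinds of requests look like on the wire (the path and variable names are just examples):

GET /search?var1=val1&var2=val2 HTTP/1.0
Host: www.somesite.com

POST /search HTTP/1.0
Host: www.somesite.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 19

var1=val1&var2=val2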

The first thing we need to do in order to start using LWP is set up our basic tools.

Here is a code snippet:

use LWP;
$browser = LWP::UserAgent->new();
$browser->agent("Mozilla/4.76[en] (Windows NT 5.0; U)");

The first line gives you access to LWP's box of tricks.

The second line creates a new LWP browser which you will use to browse the web.

The third line is not absolutely necessary, but if the agent is not set, the User-Agent header (which the server sees as the HTTP_USER_AGENT environment variable) will tell the server that LWP is trying to access the site.  I have found that a lot of sites will deny access if you are not using a popular browser, so it's best just to go ahead and set the agent.

Now that you have a browser object, you can use the GET and POST methods.  So let's look at a simple example that uses GET to... get a web page:

$url = URI->new("http://www.google.com");
$response = $browser->get($url);
if ($response->is_success) {
  print $response->content; 
}
else {
  die "WTF? $response->status_line\n<BR>"; 
}

The first line creates a URI object (note that the http:// is necessary).

The actual URL could just as easily be passed directly to the get method as a string, but the URI object lets you do some cool stuff.  It breaks the URL into all of its individual parts (i.e., scheme, userinfo, hostname, port, path, query) and its use is generally good practice.
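For example, once you have a URI object you can pull those pieces back out individually (a quick sketch using a made-up URL):

$url = URI->new("http://www.somesite.com:8080/search?q=2600");
print $url->scheme, "\n";   # http
print $url->host, "\n";     # www.somesite.com
print $url->port, "\n";     # 8080
print $url->path, "\n";     # /search
print $url->query, "\n";    # q=2600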

The is_success method does just what you think: it returns true on success and false otherwise.

$response->content returns a string containing the content of the page.

If you save the code in a file called foo.cgi and run ./foo.cgi > google.html, opening the newly created google.html in a web browser will show you the Google front page (minus the pics).
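If you would rather skip the shell redirect, here is a minimal sketch that saves the page from inside the script (the filename google.html is just an example):

open(FILE, ">", "google.html") or die "Can't write google.html: $!";
print FILE $response->content;
close(FILE);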

If you run a local web server and put the script there, you don't even need the intermediate file; just request the page from your local server with your favorite web browser.

If the request fails, then the above program will print out $response->status_line, which contains the status returned by the server.

Passing variables through the GET request is just as easy as getting a static page.

For example, to search for "2600" on Google, you would use a URL like the following:

$url = URI->new("http://www.google.com/search?q=2600");

Similarly:

$url = URI->new("http://www.google.com/search");
$url->query("q=2600");

Both examples accomplish the same thing.  The latter takes advantage of the URI object's query method.  When the page is returned, you will have results 1-10 of the Google search for "2600".

If you want to get results 11-20, try this:

$url = URI->new("http://www.google.com/search?q=2600&start=10");

For 21-30 you have start=20 and so on.
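As a side note, the URI object also has a query_form method that does the URL encoding for you, which comes in handy when a search term contains spaces or other special characters (the search term here is just an example):

$url = URI->new("http://www.google.com/search");
$url->query_form(q => "2600 magazine", start => 20);
# The query string becomes q=2600+magazine&start=20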

It is obvious from this example how easy it would be to loop through all the pages of results.  We could, for example, collect all the hyperlinks on each page up to the first 100 results.

Here is some example code:

$url = URI->new("http://www.google.com/search");
foreach $num (0..9) {
  $url->query("q=2600&start=".10 * $num);
  $response = $browser->get($url);
  if ($response->is_success) {
    &ParsePrint($response->content); 
  }
  else {
    die $response->status_line;
  }
}

If there is no error then the ParsePrint subroutine will parse out the data and print it to STDOUT.

This is, of course, not a built-in function and will have to be written by us.  This article is not going to go into detail about extracting the data, but one example should sum up the basic idea.

sub ParsePrint {
  my ($con) = @_;                              # the page of HTML passed in
  # Crude regex: grab whatever sits between href= (optionally quoted) and >
  while ($con =~ /href=\"{0,1}(.*?)\"{0,1}>/g) {
    print $1."<BR>\n";
  }
}

This subroutine takes one argument, which in our case is a page of hypertext, and then uses a regular expression to extract the target of every href attribute.

Regular expressions can be really cool and extremely powerful, but the one above is overly simplified and would not work in every situation.  The above subroutine also outputs hyperlinks that we don't want, such as the links to the pages we used to get our list in the first place.  It is left as an exercise to the reader to figure out how to weed out the undesirables.
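As a starting point, here is one rough way to begin that weeding (a sketch only; the filter rules are assumptions about what you might want to keep):

sub ParsePrint {
  my ($con) = @_;
  while ($con =~ /href=\"{0,1}(.*?)\"{0,1}>/g) {
    my $link = $1;
    next if $link =~ /google\./i;       # skip Google's own navigation links
    next unless $link =~ /^https?:/i;   # keep only absolute http(s) links
    print $link."<BR>\n";
  }
}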

If the example for Google used a POST instead of a GET, then the only difference would be the syntax difference between the POST and GET methods.  The rest of the program would be functionally the same.

Here is an example of POST:

$response = $browser->post($url, ['q'=>'2600', 'start'=>'10'],);

The URI object is created the same way as shown previously.  The names on the left are the variables and the values on the right are what gets assigned to them.
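For instance, the earlier loop would look something like this using post (keep in mind Google's real search form uses GET, so this is only to show the syntax):

$url = URI->new("http://www.google.com/search");
foreach $num (0..9) {
  $response = $browser->post($url, ['q' => '2600', 'start' => 10 * $num]);
  if ($response->is_success) {
    &ParsePrint($response->content);
  }
  else {
    die $response->status_line;
  }
}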

LWP also supports the use of cookies.

In my experience, I have rarely needed cookies that stick around longer than the execution of the program, but some websites use cookies for everything and you can't get the data you want without them.  If you request a site with your browser and get the page you expect, but the same request through LWP does not, it is probably a cookie that needs to be set.

All you need to do is tell your LWP browser to use a cookie jar:

use HTTP::Cookies;
$cookie_jar = HTTP::Cookies->new();
$browser->cookie_jar($cookie_jar);

Now the server can set cookies in your jar and read them back on later requests, and hopefully you won't have any problems with them.
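If you do want cookies to stick around between runs, HTTP::Cookies can also keep the jar in a file (the filename here is just an example):

use HTTP::Cookies;
$cookie_jar = HTTP::Cookies->new(
  file     => "lwp_cookies.txt",   # where to save the cookies
  autosave => 1,                   # write them back to the file automatically
);
$browser->cookie_jar($cookie_jar);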

I hope this introduction was helpful.

LWP has a nice set of tools that is worth being familiar with for quick and simple data extraction projects, and it works just as well for larger, more complex projects.

The examples above illustrate the extreme basics of accessing web pages with LWP.  There is a lot of cool stuff you can do with LWP, and much of it is covered in the book Perl & LWP by Sean M. Burke.

There is also, of course, support on the web.  LWP gives you a lot of control over what is sent to the server while taking care of all the gory details so you don't have to.

Good luck.
