
Volume 40
Mar 2003


PageRank -- Google
 by Google Watch

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at more than the sheer volume of votes, or links, a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
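The voting idea above can be sketched in a few lines of Python. This is only an illustration based on the commonly published power-iteration formulation of PageRank, not Google's production code, which is a trade secret. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the tiny six-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values, chosen for illustration.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page equally "important".
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A and C both vote for B, B votes for D, E votes for F.
links = {
    "A": ["B"],
    "C": ["B"],
    "B": ["D"],
    "E": ["F"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph D and F each receive exactly one link, yet D ends up ranked above F, because D's single vote comes from B, a page that itself collected two votes. That is the "votes from important pages weigh more" effect in miniature.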

Google goes on to admit that other variables are also used, in addition to PageRank, in determining the relevance of a page. While the broad outlines of these additional variables are easily discerned by webmasters who study how to improve the ranking of their websites, the actual details of all algorithms are considered trade secrets by Google, Inc. It is in Google's interest to make it as difficult as possible for webmasters to cheat on their rankings.

[...]

What objective criteria are available?

Ranking criteria fall into three broad categories. The first is link popularity, which is used by a number of search engines to some extent. Google's PageRank is the original form of "link pop," and remains its purest expression. The next category is on-page characteristics. These include font size, title, headings, anchor text, word frequency, word proximity, file name, directory name, and domain name. The last is content analysis. This generally takes the form of on-the-fly clustering of produced results into two or more categories, which allows the searcher to "drill down" into the data in a more specific manner. Each method has its place. Search engines use some combination of the first two, or they use on-page characteristics alone, or perhaps even all three methods.

Content analysis is very difficult, but also very enticing. When it works, it allows for the sort of graphical visualization of results that can give a search engine an overnight reputation for innovation and excellence. But many times it does not work well, because computers are not very good at natural language processing. They cannot understand the nuances within a large stack of prose from disparate sources. Also, most top engines work with dozens of languages, which makes content analysis more difficult, since each language has its own nuances. There are several search engines that have made interesting advances in content analysis and even visualization, but Google is not one of them. The most promising aspect of content analysis is that it can be used in conjunction with link pop, to rank sites within their own areas of specialization. This provides an extra dimension that addresses some of the problems of pure link popularity.

Link popularity, which is "PageRank" to Google, is by far the most significant portion of Google's ranking cocktail. While in some cases the on-page characteristics of one page can trump the superior PageRank of a competing page, it is much more common for a low PageRank to completely bury a page that has perfect on-page relevance by every conceivable measure. To put it another way, it is frequently the case that a page with both search terms in the title, and in a heading, and in numerous internal anchors, will get buried in the rankings because the sponsoring site is not sufficiently popular, and is unable to pass sufficient PageRank to this otherwise perfectly relevant page.

In December 2000, Google came out with a downloadable toolbar attachment that made it possible to see the relative PageRank of any page on the web. Even the dumbed-down resolution of this toolbar, in conjunction with studying the ranking of a page against its competition, allows for considerable insight into the role of PageRank.

Moreover, PageRank drives Google's monthly crawl, such that sites with higher PageRank get crawled earlier, faster, and deeper than sites with low PageRank. For a large site with an average-to-low PageRank, this is a major obstacle. If your pages do not get crawled, they will not get indexed. If they do not get indexed in Google, people will not know about them. If people do not know about them, then there is no point in maintaining a website. Google starts over again on every site in every 28-day cycle, so the missing pages stand an excellent chance of getting missed on the next cycle also. In short, PageRank is the soul and essence of Google, for both the all-important crawl and the all-important rankings. By 2002 Google was universally recognized as the world's most popular search engine.
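The crawl behavior described above can be pictured as a priority queue with a fixed budget. The sketch below is purely hypothetical: Google's crawl scheduler is not public, and the function name `plan_crawl`, the notion of a per-cycle `budget`, and the sample URLs and rank values are all invented for illustration.

```python
import heapq


def plan_crawl(page_ranks, budget):
    """Return the pages fetched in one crawl cycle, highest PageRank first.

    page_ranks maps a URL to its (assumed) PageRank.
    budget is a hypothetical cap on pages fetched per cycle.
    """
    # heapq is a min-heap, so negate each rank to pop the highest rank first.
    heap = [(-rank, url) for url, rank in page_ranks.items()]
    heapq.heapify(heap)

    crawled = []
    while heap and len(crawled) < budget:
        _, url = heapq.heappop(heap)
        crawled.append(url)
    return crawled


# Hypothetical site: two well-linked pages and two deep archive pages.
site_ranks = {
    "/": 0.40,
    "/about": 0.25,
    "/archive/1": 0.05,
    "/archive/2": 0.04,
}
crawled = plan_crawl(site_ranks, budget=2)
missed = sorted(set(site_ranks) - set(crawled))
print("crawled:", crawled)
print("missed:", missed)
```

Because the queue is rebuilt from the same ranks at the start of each cycle, the same low-ranked pages fall outside the budget every time, which mirrors the complaint that missed pages tend to stay missed.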

[....]

Daniel Brandt is founder and president of Public Information Research, Inc., a tax-exempt public charity that sponsors NameBase. He began compiling NameBase in 1982, from material that he started collecting in 1974, and is now the programmer and webmaster for PIR's several sites. He participates in various forums where webmasters share observations about the often-secretive algorithms, bugs, and behavior of various search engines. Brandt has been watching Google's interaction with NameBase ever since Google, in October 2000, became the first search engine to go "deep" on PIR's main site by crawling thousands of dynamic pages.