PageRank relies on the uniquely democratic nature of the web by
using its vast link structure as an indicator of an individual
page's value. In essence, Google interprets a link from page A to
page B as a vote, by page A, for page B. But, Google looks at more
than the sheer volume of votes, or links a page receives; it also
analyzes the page that casts the vote. Votes cast by pages that are
themselves "important" weigh more heavily and help to make other
pages "important."
Google goes on to admit that other variables are also used, in
addition to PageRank, in determining the relevance of a page. While
the broad outlines of these additional variables are easily
discerned by webmasters who study how to improve the ranking of
their websites, the actual details of all algorithms are considered
trade secrets by Google, Inc. It is in Google's interest to make it
as difficult as possible for webmasters to cheat on their rankings.
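
The voting idea in Google's description can be made concrete with a
small sketch. This is not Google's production algorithm, whose details
are not public. It assumes the commonly cited damped "random surfer"
formulation of PageRank, a damping factor of 0.85, and an even
redistribution of rank from pages with no outgoing links. The function
name and the four-page toy web are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the pages it links to.

    Assumptions (not from Google): damping of 0.85, a fixed number of
    iterations, and rank from link-less pages spread over all pages.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Every page starts with an equal share of the total rank.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Baseline share that every page receives regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from a high-ranked page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No outgoing links: spread this page's rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: nobody links to D.
toy_web = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A", "B"],
    "D": ["C"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy web, page D receives no links and ends up with the lowest
score, while the pages inside the link cycle accumulate most of the
rank.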
[...]
What objective criteria are available?
Ranking criteria fall into three broad categories. The first is
link popularity, which is used by a number of search engines to
some extent. Google's PageRank is the original form of "link pop,"
and remains its purest expression. The next category is on-page
characteristics. These include font size, title, headings, anchor
text, word frequency, word proximity, file name, directory name,
and domain name. The last is content analysis. This generally takes
the form of on-the-fly clustering of produced results into two or
more categories, which allows the searcher to "drill down" into the
data in a more specific manner. Each method has its place. Search
engines use some combination of the first two, or they use on-page
characteristics alone, or perhaps even all three methods.
Content analysis is very difficult, but also very enticing. When it
works, it allows for the sort of graphical visualization of results
that can give a search engine an overnight reputation for
innovation and excellence. But many times it does not work well,
because computers are not very good at natural language processing.
They cannot understand the nuances within a large stack of prose
from disparate sources. Also, most top engines work with dozens of
languages, which makes content analysis more difficult, since each
language has its own nuances. There are several search engines that
have made interesting advances in content analysis and even
visualization, but Google is not one of them. The most promising
aspect of content analysis is that it can be used in conjunction
with link pop, to rank sites within their own areas of
specialization. This provides an extra dimension that addresses
some of the problems of pure link popularity.
Link popularity, which is "PageRank" to Google, is by far the most
significant portion of Google's ranking cocktail. While in some
cases the on-page characteristics of one page can trump the
superior PageRank of a competing page, it is much more common for a
low PageRank to completely bury a page that has perfect on-page
relevance by every conceivable measure. To put it another way, it is
frequently the case that a page with both search terms in the
title, and in a heading, and in numerous internal anchors, will get
buried in the rankings because the sponsoring site is not
sufficiently popular, and is unable to pass sufficient PageRank to
this otherwise perfectly relevant page. In December 2000, Google
came out with a downloadable toolbar attachment that made it
possible to see the relative PageRank of any page on the web. Even
the dumbed-down resolution of this toolbar, in conjunction with
studying the ranking of a page against its competition, allows for
considerable insight into the role of PageRank.
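
The "burying" effect described above can be illustrated with a toy
calculation. Google's actual ranking formula is a trade secret, so
everything here is an assumption made for illustration: the
multiplicative blend, the on-page scores between 0 and 1, and the
toolbar-style PageRank values are all hypothetical.

```python
def toy_score(on_page, pagerank):
    """Hypothetical blend: on-page relevance (0..1) scaled by PageRank.

    This is an illustrative assumption, not Google's formula.
    """
    return on_page * pagerank


# Hypothetical competitors for the same two-word query.
candidates = {
    "perfect on-page, obscure site": {"on_page": 1.0, "pagerank": 2},
    "mediocre on-page, popular site": {"on_page": 0.5, "pagerank": 7},
}

ranking = sorted(
    candidates,
    key=lambda name: toy_score(**candidates[name]),
    reverse=True,
)
for name in ranking:
    print(name, toy_score(**candidates[name]))
```

Under these assumed numbers the popular site wins despite its weaker
on-page relevance. If the two PageRank values were close, the better
on-page score would decide the order instead, which matches the less
common case mentioned above.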
Moreover, PageRank drives Google's monthly crawl, such that sites
with higher PageRank get crawled earlier, faster, and deeper than
sites with low PageRank. For a large site with an average-to-low
PageRank, this is a major obstacle. If your pages do not get
crawled, they will not get indexed. If they do not get indexed in
Google, people will not know about them. If people do not know about
them, then there is no point in maintaining a website. Google starts
over again on every site for every 28-day cycle, so the missing
pages stand an excellent chance of getting missed on the next cycle
also. In short, PageRank is the soul and essence of Google, on both
the all-important crawl and the all-important rankings. By 2002
Google was universally recognized as the world's most popular
search engine.
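
The crawl problem for large, low-PageRank sites can be sketched as a
priority queue with a fixed budget. How Google actually schedules its
crawler is not public. The sketch below simply assumes that sites are
visited in descending PageRank order until a page budget for the cycle
runs out. The site names, page counts, PageRank values, and budget are
all hypothetical.

```python
import heapq


def crawl_cycle(site_pagerank, pages_per_site, budget):
    """Return the number of pages fetched per site in one crawl cycle.

    Assumption: sites are crawled strictly in descending PageRank order
    and the crawler stops when its page budget is exhausted.
    """
    # heapq is a min-heap, so negate PageRank to pop the highest first.
    heap = [(-pr, site) for site, pr in site_pagerank.items()]
    heapq.heapify(heap)

    fetched = {site: 0 for site in site_pagerank}
    while heap and budget > 0:
        _, site = heapq.heappop(heap)
        take = min(pages_per_site[site], budget)
        fetched[site] = take
        budget -= take
    return fetched


# Hypothetical sites with toolbar-style PageRank and page counts.
site_pagerank = {"big-portal": 8, "mid-site": 5, "large-archive": 3}
pages_per_site = {"big-portal": 500, "mid-site": 2000, "large-archive": 100000}

fetched = crawl_cycle(site_pagerank, pages_per_site, budget=3000)
for site, count in fetched.items():
    print(site, count, "of", pages_per_site[site])
```

With these made-up numbers the two higher-ranked sites are crawled in
full, while the large archive gets only the 500 pages left in the
budget. Because the same ordering applies on the next cycle, the same
pages are likely to be missed again.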
[....]
Daniel Brandt is founder and president of Public Information
Research, Inc., a tax-exempt public charity that sponsors NameBase.
He began compiling NameBase in 1982, from material that he started
collecting in 1974, and is now the programmer and webmaster for
PIR's several sites. He participates in various forums where
webmasters share observations about the often-secretive algorithms,
bugs, and behavior of various search engines. Brandt has been
watching Google's interaction with NameBase ever since Google, in
October 2000, became the first search engine to go "deep" on PIR's
main site by crawling thousands of dynamic pages.