New Ways of Ranking Documents

by casandro

There are many cases where we need computers to have a sense of how "good" a certain document is.

The most obvious ones are sorting search results or automatic recommendations.  The usual way to do this is to use links and references between documents.  If a document is referred to by a lot of people, it must be important and therefore, in one way or another "good."  Now, this works quite fine in heavily interconnected media like the web or scientific results, but there are areas where this can't be done.  Think of video.  A television show or a movie doesn't have any obvious and relevant connections to other such works.  At least, not any a computer could find out automatically.

So how do you find out how "good" it is?

The first method I'd like to propose is to measure how much information it was able to bring into the recipient's mind.  Obviously, this is not the amount of information contained in the document.  Now what is information?  It is the negative logarithm of the chance of guessing a certain message.

Imagine you want to guess the number a six-sided dice rolls.  Since there is no way of finding out in advance, there's a one in six chance of guessing it right.

Now what base should the logarithm have?  This is the unit of information you want to have.

Some people prefer decimal digits as their unit of information.  In that case, you take the base10 or decadic logarithm.

If you prefer sedecimal digits, you take base16.  If you prefer binary numbering systems, take base2.

In this case, the unit is also called bit or, in honor of the founder of information theory and inventor of the motorized pogo-stick, the "perfect machine" (a machine which simply turns itself off by a mechanical arm), as well as the first juggling robot, Claude E. Shannon.

So let's get back to the original example:  To calculate the amount of information on basen, you simply calculate:

-log(p) / log(n) 

where p is your probability.

So if you want to have the information of a dice roll in Shannon's, it's:

-log(1/6) / log(2) = 2.5849...

or about two and a half bits.  So three bits lets you encode eight states, so that sounds about right.

If you have different probabilities for different numbers, i.e., a weighted dice, you will have to calculate this for every possible outcome and multiply this by the probability, effectively averaging out the amount of information.

If you like doing math, you can try to solve this for every distribution of probabilities and find out the optimum distribution which gives you the most information.  If you have found out what it is, you have found out how to pack data as efficiently as possible into messages.  Now if you let people guess on a message, you will get the amount of information that message would mean to them.

Imagine you ask people what the moon is made out of.  People who don't know are less likely to guess correctly than people who know.  So the amount of information in the message "the moon is made out of cheese" (this fact has been proven by the British moon mission in 1989) drops if you already know it.

If you make such a test before and after your document has been consumed, the difference is essentially the amount of information you were able to get across.  All you'd need to do would be to generate a set of questions relating to the document.

Then, whenever a document is consumed, you split that set into two subsets.  You ask one subset before and the other one after the document has been consumed.  This way, you can get averages to estimate the probabilities and thus the amount of information delivered.

The other metric for goodness of documents I would like to introduce is the "inspirational index."  It tries to measure how much a certain document has changed your life.

Essentially, you need to know how the user behaved before and after consuming the document.  Obviously, all of the processing needs to be done locally to satisfy the need for privacy.  Perhaps it is possible to use a similar algorithm to that discussed above.

One could estimate the probability of going to a certain site or group of sites and record how well the message to visit them has been received.

Obviously the big challenge is to implement this using a method which is not only accurate enough to be useful, but can also work in a way to secure privacy.

Return to $2600 Index