
Using Zipf's Law to forecast website usability


// Advertising

I read with interest Alex Barnett's blog entry "How RSS thickened my long tail", in particular how patterns emerge in website traffic data when RSS feed metrics are mapped onto the 'long tail' shape. The long tail shape essentially describes the pattern we see in website traffic - the homepage receives high traffic volume, with a sharp decline in traffic the deeper into the site we get. Alex uses it to describe the popularity of blog entries, but the pattern is the same.

It's the first time I've come across this theory, and I was very happy to see evidence that, in the case of blogs, RSS adds to website value by increasing the visibility of deep content.

Alex Barnett: The Long Thick Tail, thickened by RSS

The chart above shows the effect of RSS (green) on his blog page views (blue), ordered by rank. It suggests that RSS amplifies the reach of the site by driving awareness of older / deeper / archived content via syndication.

I'm doing plenty of research into how we can effectively track, measure and report on the usage and value of RSS to a brand (and therefore read with interest Alex's comments on Web 2.0), but I don't have my own data to review and corroborate this. RSS is still new for me, so what really got me interested was the comment made by Jakob Nielsen, in which he draws an interesting relationship between Alex's findings and some research he did on Zipf distributions a couple of years ago:

It's quite likely that your pageviews follow a Zipf distribution with classic long tail usage, since most websites have worked this way since at least 1996 (the first time I analyzed such data). 

Basically, if the data shows as a straight line on log-log plots, then you have the expected distribution. If the curve droops on either end, then something else is going on. 
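The straight line mentioned in the comment follows directly from the form of a Zipf-style power law. As a short worked equation (the symbols are my own labels, not taken from the comment: f(r) is the page views of the page at rank r, C is a constant roughly equal to the views of the top-ranked page, and s is the exponent, close to 1 in the classic Zipf case):

```latex
f(r) = \frac{C}{r^{s}}
\quad\Longrightarrow\quad
\log f(r) = \log C - s \,\log r
```

Plotted with log rank on one axis and log page views on the other, this is a line with slope -s and intercept log C. Data that droops at either end is data that departs from this form.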

So, after spending some time researching the Zipf distribution model (see references below), I decided I had my own theories.

I figured that I ought to test a null hypothesis relating to website traffic rather than just looking at blogs - primarily because I spend most of my time trying to determine ways to improve site usability through web analytics. This seemed an interesting way to go: I would test the distribution of page views by rank for some of my clients...

Note - all data is plotted on a log scale and no reference is made to any client
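For anyone wanting to try the same thing on their own logs, here is a minimal sketch of how the rank-ordered data and a straight-line fit on the two log scales might be produced. It is not the tooling used for the charts in this post. The function names, the dictionary of URLs and the example numbers are all hypothetical, and pages with zero views are dropped as an assumption, since their logarithm is undefined.

```python
import math

def rank_page_views(page_views):
    """Return (rank, views) pairs, where rank 1 is the most viewed page.

    Assumption: pages with zero views are dropped, because log(0) is undefined.
    """
    counts = sorted((v for v in page_views.values() if v > 0), reverse=True)
    return list(enumerate(counts, start=1))

def fit_log_log(ranked):
    """Least-squares line through (log10 rank, log10 views).

    Returns (slope, intercept, r_squared). Needs at least two ranked pages.
    """
    xs = [math.log10(rank) for rank, _ in ranked]
    ys = [math.log10(views) for _, views in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r_squared close to 1 means the points sit near a straight line.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared

# Hypothetical data: page URL -> page views over the reporting period.
example = {f"/page-{i}": round(10000 / i) for i in range(1, 51)}
slope, intercept, r_squared = fit_log_log(rank_page_views(example))
print(f"slope={slope:.2f}, r^2={r_squared:.3f}")
```

A slope near -1 with r^2 near 1 would correspond to the classic straight-line case. How far either number may drift before a site counts as "off" is something this sketch does not try to answer.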

...what if I could test a number of sites, knowing some of the things I know about their web infrastructure, marketing activity, acquisition channels, usability, IA etc.

Site Z: Good Visitor Satisfaction
This should give me enough of a basis to test out a theory: that I might be able to read a pattern into the data that will allow me to create a usability or content value model for clients.

I tested four different sites (including my own), over reporting periods ranging from 1 month to 1 year, across a number of different Asian markets and with different known IA or usability issues. The results were astonishing, to say the least.

Site Z is a site that has very high customer satisfaction levels, high repeat usage and a well-constructed IA that allows for easy navigation. The pattern here is as expected - an inversely proportional linear path when plotted on two log scales. I am using visitor satisfaction as a benchmark for quality (though a regression against satisfaction scores for this site does show that predictors of satisfaction include layout and design).

Site X: Poor usability
I tested similar sites to ensure that the results were going to be consistent, and not a one-off anomaly.

Site X, on the other hand, is a site that has known usability problems and lower (and declining) customer satisfaction scores. The pattern we observe in this test is very, very different - no longer do we see the linear plot, but a very steep decline. The reason seems clear - poor navigation and content hierarchy mean that many pages are seen infrequently, and so contribute to the very steep angle of decline in the chart.
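One way the difference between the two shapes could be put into numbers is to compare the log-log slope at the head of the ranking with the slope of the tail. The sketch below is only an illustration under my own assumptions, not the method behind the charts. The function names are hypothetical, and splitting the ranks at the square root of the page count (so that both parts cover a similar width on a log axis) is an arbitrary choice.

```python
import math

def log_log_slope(points):
    """Least-squares slope of log10(views) against log10(rank)."""
    xs = [math.log10(rank) for rank, _ in points]
    ys = [math.log10(views) for _, views in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def head_and_tail_slopes(views):
    """Return (head_slope, tail_slope) for a list of page view counts.

    Assumptions: zero-view pages are dropped, and the ranking is split at
    round(sqrt(number of pages)), with the split rank shared by both parts.
    """
    counts = sorted((v for v in views if v > 0), reverse=True)
    if len(counts) < 4:
        raise ValueError("need at least four pages with views")
    ranked = list(enumerate(counts, start=1))
    split = max(2, round(math.sqrt(len(ranked))))
    head = ranked[:split]
    tail = ranked[split - 1:]
    return log_log_slope(head), log_log_slope(tail)

# Synthetic examples only: a pure power law, and one whose tail falls away faster.
straight = [1000 / r for r in range(1, 101)]
drooping = [1000 / r * math.exp(-r / 20) for r in range(1, 101)]

print(head_and_tail_slopes(straight))
print(head_and_tail_slopes(drooping))
```

For a site like Site Z the two slopes should be close to each other, while a tail slope much steeper than the head slope would correspond to the kind of decline seen for Site X. Both data sets here are synthetic and stand in for real page view exports.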

So, now that I have established a pattern for a site with known poor usability, I need to consider what this means - how can I use this data to build a model that takes into account other factors worthy of consideration, such as exit rate, time spent on page and conversion value, to help refine it? At this stage all I have done is loosely conclude that there is a pattern. It may be a one-off, and it may be inaccurate. I understand this, but I'm going to keep looking to see what I can do with the data. In this example there is a usability and IA review currently happening, so it would be interesting to monitor the pre / post changes and perhaps map them together.

Posted on Wednesday, September 21, 2005 at 03:12AM by James Dutton | 2 Comments | 3 References


References (3)

References allow you to track sources for this article, as well as articles that were written in response to this article.
  • Response
    // Slicecast - The Asian Marketing Blog // Slicecast - Using Zipf's Law to forecast website usability shows how, in a log-log chart, the number of pageviews per page looks like a straight line for good sites, while...
  • Response
    Chris Anderson has posted on the relation between the Long Tail and the Zipfian distribution. It reminded...
  • Response
    An idea to use a plot of page view count on the y axis versus “ranked” page view count on the x axis as a metric for usability or customer/visitor satisfaction. The thesis is that a web site which follows Zipf’s law and whose graph i...

Reader Comments (2)

Hi James - for this study (or anything further u have done on this), have you considered what effect external media has on the same? Meaning, instead of taking the customer satisfaction index on the x axis, if u try taking the media exposure (or rank) a particular page gets from the media buy (one or 2 pages may be the most commonly linked pages from external banners / paid ads etc. whenever the client advertises this site) - do these pages still get disproportionately high traffic? Does the inverse double log relationship still hold?
July 9, 2006 | datta
I have trouble understanding your charts - what is "rank"? And do your charts really say something new, or is the conclusion simply "pages with bad usability have a bad rank"?
June 11, 2007 | Tichy
