Using Zipf's Law to forecast website usability
I read with interest Alex Barnett's blog entry "How RSS thickened my long tail", in particular how patterns emerge in website traffic data when RSS feed metrics are mapped onto the 'long tail' shape. The long tail describes the pattern we typically see in website traffic: the homepage receives high traffic volume, and traffic declines sharply the deeper into the site we get. Alex uses it to describe the popularity of blog entries, but the pattern is the same.
It's the first time I've come across this theory, and I was very happy to see evidence that RSS adds to website value by increasing the visibility of deep content, at least in the case of blogs.
![]()
Alex Barnett: The Long Thick Tail, thickened by RSS
The chart shows the effect of RSS (green) on his blog page views (blue), ordered by rank. It suggests that RSS amplifies the reach of the site by driving awareness of older, deeper, archived content via syndication.
I'm doing plenty of research into how we can effectively track, measure and report on the usage and value of RSS to a brand (and therefore read with interest Alex's comments on Web 2.0), but I don't have my own data to review and corroborate this. RSS is still new for me, so what really got me interested was the comment made by Jakob Nielsen, in which he notes an interesting relationship between Alex's findings and some research he'd done on Zipf distributions a couple of years ago:
It's quite likely that your pageviews follow a Zipf distribution with classic long tail usage, since most websites have worked this way since at least 1996 (the first time I analyzed such data).
Basically, if the data shows as a straight line on log-log plots, then you have the expected distribution. If the curve droops on either end, then something else is going on.
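The straight-line check described in that comment can be sketched in a few lines of Python. This is a minimal illustration, not the method used for the charts in this post: the function name `fit_zipf`, the synthetic page view numbers and the use of R² as a rough measure of "straightness" are all assumptions made for the example.

```python
import numpy as np

def fit_zipf(page_views):
    """Least-squares fit of log10(views) = intercept + slope * log10(rank)."""
    views = np.sort(np.asarray(page_views, dtype=float))[::-1]  # rank 1 = most viewed
    views = views[views > 0]                 # log of zero is undefined, so drop empty pages
    ranks = np.arange(1, len(views) + 1)
    log_r, log_v = np.log10(ranks), np.log10(views)
    slope, intercept = np.polyfit(log_r, log_v, 1)
    residuals = log_v - (intercept + slope * log_r)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((log_v - log_v.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot        # close to 1 means close to a straight line
    return slope, intercept, r_squared, residuals

# Synthetic data (hypothetical): an ideal Zipf curve and one that droops in the tail.
ranks = np.arange(1, 201)
ideal = 10000.0 / ranks
drooped = ideal * np.exp(-ranks / 50.0)

slope, intercept, r2, resid = fit_zipf(ideal)
d_slope, d_intercept, d_r2, d_resid = fit_zipf(drooped)
print(f"ideal:   slope={slope:.2f}, R^2={r2:.3f}")
print(f"drooped: slope={d_slope:.2f}, R^2={d_r2:.3f}")
```

A slope near -1 with R² near 1 matches the "expected distribution" in the quote; a lower R² or a much steeper slope is a hint that "something else is going on", although it says nothing about what.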
So, after spending some time researching the Zipf distribution model (see references below), I developed some theories of my own.
- On a Zipf’s Law Extension to Impact Factors
- The Long Tail and Web 2.0
- Power Laws, Weblogs and Inequality
- http://www.nslij-genetics.org/wli/zipf/index.html
- http://cs-www.bu.edu/faculty/crovella/paper-archive/TR-95-010/paper.html
- Jakob Nielsen's original article
I figured that I ought to test a null hypothesis relating to website traffic rather than just looking at blogs, primarily because I spend most of my time trying to determine ways to improve site usability through web analytics. So this seemed an interesting way to go: I would test the distribution of page views by rank for some of my clients...
Note - all data is plotted on a log scale and no reference is made to any client
...what if I could test a number of sites, knowing some of the things I know about their web infrastructure, marketing activity, acquisition channels, usability, IA etc.?
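Getting from raw analytics data to "page views by rank" is a small step. As a sketch, assuming the data is available as a simple list of requested URLs (the function name `rank_page_views` and the sample URLs are hypothetical, not taken from any client data):

```python
from collections import Counter

def rank_page_views(hits):
    """Return [(rank, url, views), ...] ordered from most to least viewed."""
    counts = Counter(hits)
    # Sort by views descending; break ties by URL so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, url, views) for rank, (url, views) in enumerate(ordered, start=1)]

# Hypothetical sample: one entry per page view.
sample_hits = ["/", "/", "/", "/products", "/products", "/about"]
ranked = rank_page_views(sample_hits)
for rank, url, views in ranked:
    print(rank, url, views)
```

Plotting the `views` column against the `rank` column on two log axes gives the kind of chart used for the sites below.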
![]()
Site Z: Good Visitor Satisfaction
This should give me enough of a basis to test a theory: that I might be able to read a pattern in the data that will allow me to create a usability or content value model for clients.
I tested four different sites (including my own), over reporting periods ranging from 1 month to 1 year, across a number of different Asian markets and with different known IA or usability issues. The results were astonishing, to say the least.
Site Z is a site that has very high customer satisfaction levels, high repeat usage and a well-constructed IA that allows for easy navigation. The pattern here is as expected: an inversely proportional, linear path when page views are plotted against rank on two log scales. I am using visitor satisfaction as a benchmark for quality (a regression against satisfaction scores for this site does show that predictors of satisfaction include layout and design).
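In symbols, a straight line on two log scales corresponds to a power law. As a sketch, writing $v(r)$ for the page views of the page at rank $r$, with $C$ and $s$ as constants introduced here purely for illustration:

```latex
v(r) = \frac{C}{r^{s}}
\quad\Longleftrightarrow\quad
\log v(r) = \log C - s \log r
```

Classic Zipf behaviour has $s$ close to 1, so the second form is a straight line with slope of about $-1$.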
![]()
Site X: Poor usability
I tested similar sites to ensure that the results were going to be consistent, and not a one-off anomaly.
Site X, on the other hand, is a site that has known usability problems and lower (and declining) customer satisfaction scores. The pattern we observe in this test is very different: no longer do we see the linear plot, but a very steep decline. The reason seems obvious: poor navigation and content hierarchy mean that many pages are seen infrequently and so contribute to the very steep angle of decline in the chart.
So, now that I have established a pattern for a site with known poor usability, I need to consider what this means: how can I use this data to build a model that takes into account other factors worth considering (exit rate, time spent on page, conversion value) to help refine it? At this stage all I have done is loosely conclude that there is a pattern, but it may be a one-off, and it may be inaccurate. I understand this, but I'm going to keep looking to see what I can do with the data. In this example there is a usability and IA review currently happening, so it would be interesting to monitor the pre- and post-review changes and perhaps map them together.
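One way to put a number on "many pages are seen infrequently" is to ask what share of all page views goes to pages outside the top few. The sketch below is an assumption for illustration only, not the analysis behind the charts: the function name `tail_share`, the 10% head cut-off and both synthetic data sets are hypothetical.

```python
def tail_share(page_views, head_fraction=0.1):
    """Share of total page views received by pages outside the top `head_fraction` by rank."""
    views = sorted((v for v in page_views if v > 0), reverse=True)
    if not views:
        raise ValueError("need at least one page with views")
    head_size = max(1, int(len(views) * head_fraction))  # always keep at least one head page
    return sum(views[head_size:]) / sum(views)

# Synthetic examples (hypothetical): 100 pages each.
zipf_like = [10000 / r for r in range(1, 101)]     # roughly a straight line of slope -1
steep = [10000 / r ** 2 for r in range(1, 101)]    # a much steeper decline

print(f"Zipf-like tail share: {tail_share(zipf_like):.2f}")
print(f"Steep tail share:     {tail_share(steep):.2f}")
```

A site whose deeper pages are rarely reached shows a smaller tail share, which could sit alongside the chart as a single figure to track before and after a usability review.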



