Archive for the ‘stats’ Category

Valleywag got it all wrong with Alexa comparison

Peter Rip used Alexa traffic charts in one of his blog posts to show that web 2.0 has peaked, and he got hammered by Valleywag. Unfortunately, Valleywag got it all wrong: it compared Alexa Reach (which represents the number of unique visitors per day) with Techcrunch’s sitemeter pageview number. As a matter of fact, Techcrunch’s pageviews from Alexa and its pageviews from sitemeter match quite well.

Here is the pageview chart from Alexa: (chart: Alexa for Techcrunch)

Here is the pageview chart from Techcrunch’s sitemeter web stats: (chart: sitemeter)

This example shows the challenges one faces when comparing numbers from different sources.

Given the recent heated debates on Alexa, I would like to offer a few tips for using Alexa’s stats:

1. Understand sampling biases

Alexa collects data from its free Alexa toolbar, which is used mainly as a webmaster and developer tool. As a result, its sampling panel is heavily biased towards websites targeted at webmasters and developers. There is also a regional bias, as the distribution of Alexa toolbars is not proportional to the number of users in different countries. In general, Alexa is useful only when one compares sites with similar audiences.

2. Know what metrics to look for

Alexa tracks “daily unique visitors” (called Alexa Reach) and “daily pageviews”. Note that many stats packages (like sitemeter or Google Analytics) report “number of visits” instead. A visit is defined as a session; a session ends when a user has been inactive on a website for 30 minutes. On some websites the number of visits can be much higher than the number of unique visitors, because a user can visit many times in a day (e.g., facebook.com, where time-rich people actually spent 500 hours on it in six months).
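To make the difference between visits and unique visitors concrete, here is a minimal sketch in Python. It assumes a hypothetical hit log of (user, timestamp) pairs and treats a gap of more than 30 minutes as the start of a new visit; real stats packages may handle the boundary and other details differently.

```python
from datetime import datetime, timedelta

# Assumption: a visit (session) ends after 30 minutes of inactivity.
SESSION_TIMEOUT = timedelta(minutes=30)

def count_visits_and_uniques(hits):
    """hits: iterable of (user_id, datetime) pairs for one site.

    Returns (number_of_visits, number_of_unique_visitors).
    """
    last_seen = {}  # user_id -> time of that user's most recent hit
    visits = 0
    for user, ts in sorted(hits, key=lambda h: h[1]):
        prev = last_seen.get(user)
        # A first hit, or a hit after a long gap, starts a new visit.
        if prev is None or ts - prev > SESSION_TIMEOUT:
            visits += 1
        last_seen[user] = ts
    return visits, len(last_seen)

# Hypothetical log: user "a" comes back after a 50-minute gap.
hits = [
    ("a", datetime(2007, 6, 1, 9, 0)),
    ("b", datetime(2007, 6, 1, 9, 5)),
    ("a", datetime(2007, 6, 1, 9, 10)),
    ("a", datetime(2007, 6, 1, 10, 0)),
]
visits, uniques = count_visits_and_uniques(hits)
print(visits, "visits from", uniques, "unique visitors")
```

In this made-up log, two people generate three visits, which is why a visits number should not be compared directly with Alexa Reach.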

Note that “Alexa Reach” is expressed as “relative share” (the percentage of all Internet users who visit a given site) rather than absolute number of users. For example, if a site like yahoo.com has a reach of 28%, this means that if you took random samples of one million Internet users, you would on average find that 280,000 of them visit yahoo.com in a day.

One problem with this relative share approach is that the size of the “total pie” is growing over time. In particular, Alexa’s international base is growing much more rapidly than its US base. As a result, US websites’ Alexa numbers tend to increase more slowly, or even show some decline, even when internal stats indicate that the number of visitors is growing.

Here is Alexa’s data on the share of its user base by country. The share of Alexa toolbar users in the US has dropped from 37% to 14% in the last three years.

Country           2007      2004
China             16.44%    18.46%
United States     14.28%    36.91%
Brazil             3.82%    N/A
Japan              3.64%     3.80%
United Kingdom     3.11%     4.49%
Taiwan             2.91%     1.72%
Hong Kong          2.55%     4.59%
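The effect of the shrinking US share can be sketched with a little arithmetic. The snippet below is only an illustration under two simplifying assumptions: the site is visited only by US toolbar users, and the fraction of US users who visit it (10%, a made-up figure) stays flat. The two US share values come from the table above.

```python
# Share of Alexa toolbar users located in the US, from the table above.
US_SHARE_2004 = 0.3691
US_SHARE_2007 = 0.1428

def global_reach(us_penetration, us_share):
    """Reach among all toolbar users for a site visited only by US users."""
    return us_penetration * us_share

# Hypothetical: 10% of US toolbar users visit the site on a given day.
us_penetration = 0.10

reach_2004 = global_reach(us_penetration, US_SHARE_2004)
reach_2007 = global_reach(us_penetration, US_SHARE_2007)

# How much US penetration must grow just to keep the same reach.
required_growth = US_SHARE_2004 / US_SHARE_2007

print(f"2004 reach: {reach_2004:.2%}")
print(f"2007 reach: {reach_2007:.2%}")
print(f"growth needed to hold reach steady: {required_growth:.2f}x")
```

Under these assumptions, such a site would need to grow its share of US users about 2.6 times over just to keep its Alexa Reach flat.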

3. No one can get stats 100% right

Many people are understandably not happy with Alexa’s numbers. But third-party traffic stats are inherently inaccurate, particularly for smaller sites, where the number of visitors in any sample is too small to give a reasonable margin of error. Even for large sites like Youtube, Hitwise and ComScore have very different results.
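A rough way to see why small sites suffer is to estimate the sampling error of a reach figure. The sketch below uses the textbook normal approximation for a sampled proportion; the panel size and the small site's reach are hypothetical numbers chosen for illustration, and the calculation ignores panel bias entirely.

```python
import math

def relative_margin_of_error(reach, panel_size, z=1.96):
    """Approximate 95% margin of error, as a fraction of the reach itself.

    Uses the normal approximation for a proportion, which is crude when
    only a handful of panelists visit the site.
    """
    standard_error = math.sqrt(reach * (1 - reach) / panel_size)
    return z * standard_error / reach

PANEL_SIZE = 100_000  # hypothetical panel size

for label, reach in [("large site", 0.28), ("small site", 0.00001)]:
    expected_panelists = reach * PANEL_SIZE
    rel = relative_margin_of_error(reach, PANEL_SIZE)
    print(f"{label}: about {expected_panelists:.0f} panelists, "
          f"margin of error +/-{rel:.0%} of the estimate")
```

With these numbers the large site's reach is pinned down to within roughly 1%, while the small site is represented by about one panelist and its estimate can be off by more than its own size.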

Any sampling-based panel has its biases too. For example, the way ComScore recruits its user panel does not give me much confidence either, although companies pay hundreds of thousands a year for its services.


Measuring Internet traffic: where are the biases?

There have been quite a few discussions on traffic measurement lately. The general consensus is that all of the measurement services have problems of some sort. It would be an interesting exercise to see where the biases are and how we may be able to compensate for them.

ComScore and Hitwise are two leading paid services. They use two different approaches: ComScore is “Panel based” and Hitwise is “ISP based”.

1. ComScore

ComScore has over 2 million users who have installed ComScore’s data collection software on their computers (although its panel sample is 120K in the US and 500K outside the US). Its users are randomly selected. ComScore recruits them over the web by offering virus protection scanning, web acceleration or sweepstakes prizes under a number of channels (e.g., PermissionResearch, OpinionSquare and Marketscore).

ComScore’s demographic tends to be skewed toward naive Internet users, as more sophisticated users are less likely to install ComScore’s toolbar. Serious security issues have been raised with its software. If you are interested in the details of how ComScore collects user data and the security implications, I would recommend reading the articles by Stanford, Cornell and Forbes.

2. Hitwise

Hitwise gets its user data from ISPs that it has partnered with. According to Hitwise, it has over 10 million US users and 25 million users worldwide.

While Hitwise has a much larger and more diverse pool of sample users, its data partners are mostly small ISPs, and its data set has many more dial-up users.

In general, Hitwise’s data tends to be skewed towards home use and underestimates broadband or work use.

3. Alexa

Alexa offers a free traffic data service and is a subsidiary of Amazon.com. Alexa collects information from over 20 million users who have installed the “Alexa Toolbar”. The Alexa toolbar is available on Internet Explorer and an extension (Status Bar) can be used for Firefox.

The Alexa toolbar is offered as a webmaster tool, and its user panel is biased towards techies/geeks and webmasters in particular. Alexa’s numbers cannot be used to compare two sites with very different demographics.

Another problem with Alexa is that its numbers are relative shares (percentage of the total population). Because Alexa’s international base is growing much more rapidly than its US base, US websites’ Alexa numbers tend to increase more slowly than their internal stats, or even show some decline.

4. Compete and Quantcast

Compete and Quantcast are two smaller free services. Compete.com tries to combine toolbar panel and ISP data, whereas Quantcast requires websites to install a tracking pixel. For some websites they offer good numbers, while for others their numbers can be way off. It is still unproven that their approaches offer better results. You can read detailed discussions from Venturebeat, Matt Cutts and Traffick.