bepress Download Totals: Numbers You Can Count On
While download counts matter more and more in online scholarship, increasingly sophisticated robots and other Internet processes make downloads harder and harder to count accurately. The Berkeley Electronic Press (bepress) recognizes this problem and is the first to implement truly rigorous and reliable filters. Thanks to advanced filtering technology and months of computing time, download counts in the bepress system are now the most accurate in the industry.
Download Counts Matter
Download counts are taking center stage as a measure of academic excellence and site usage. They matter to faculty: authors track them and watch them grow, and many use full-text downloads to support their case for tenure. They matter to institutions: deans, chairs, and center directors measure interest in their specific field, and librarians and development officers measure interest in their institution's overall output.
We understand the importance of download counts, and that is why bepress sends a report on full-text downloads to every author and every site administrator, every month.
Download Counts Are Increasingly Hard to Measure
In a world with proliferating metrics of academic quality, download counts are one of the few tangible measures of success. Yet download counts are increasingly hard to measure accurately. Many artificial forces inflate usage statistics: not just double-clicks, but also Internet robots, automated processes, crawlers, and spam-bots (collectively, RACS). By the end of 2007, bepress predicts that, without filtering, one out of every two logged downloads from academic sites will be made by a machine or by mistake.
For years, bepress has filtered out and rejected millions of downloads from robots and crawlers. But in early 2007, after analyzing thousands of log files and hundreds of gigabytes of data, we discovered that this is no longer enough. RACS are increasingly sophisticated, and they artificially inflate article download counts far more than anyone has realized. Put simply: it is no longer possible to filter out all RACS by searching log entries for known strings or by using other unsophisticated methods. So we completely rewrote our filtering technology.
bepress's Improved Filtering Technology
At bepress, we feel strongly that the same care and rigor should go into download counts as goes into traditional peer review and other quality measurements. Thus far, the only established industry standard for download statistics comes from COUNTER, a major international initiative whose code of practice has significantly helped libraries and publishers report accurate usage. We implemented all of the explicit requirements for COUNTER and could have stopped there, but, in line with the spirit of COUNTER's mission, we went further. Our new filters are not only COUNTER-compliant; they also filter out thousands of known robots. Even more importantly, they apply heuristics to automatically discover unknown, unidentifiable, and disguised robots, automated processes, crawlers, and spam-bots (RACS), and they apply new algorithms to remove other download abuses and anomalies that would otherwise decrease the accuracy of our numbers.
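One of the simplest anomalies mentioned earlier is the double-click, where a reader's repeated clicks on the same paper are logged as several downloads. The Python sketch below shows one way such repeats could be collapsed. It is an illustration only, not bepress's code: the record shape (client, document, timestamp), the function name, and the 30-second window are all assumptions chosen for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical log record: (client_id, document_id, unix_timestamp_seconds).
Request = Tuple[str, str, float]


def collapse_double_clicks(
    requests: Iterable[Request], window_seconds: float = 30.0
) -> List[Request]:
    """Keep one request per burst of repeated clicks by the same client
    on the same document. The window length is an illustrative assumption."""
    kept: List[Request] = []
    last_seen: Dict[Tuple[str, str], float] = {}
    for client, doc, ts in sorted(requests, key=lambda r: r[2]):
        key = (client, doc)
        previous = last_seen.get(key)
        # Count the request if it is the first for this pair, or if it
        # arrives more than window_seconds after the previous one.
        if previous is None or ts - previous > window_seconds:
            kept.append((client, doc, ts))
        # Always remember the latest request, so a long run of rapid
        # clicks is still counted only once.
        last_seen[key] = ts
    return kept


sample = [
    ("10.0.0.1", "paper-17", 0.0),
    ("10.0.0.1", "paper-17", 4.0),    # repeat inside the window
    ("10.0.0.1", "paper-17", 120.0),  # well outside the window
    ("10.0.0.2", "paper-17", 5.0),    # a different client
]
counted = collapse_double_clicks(sample)
print(len(counted), "of", len(sample), "requests counted")
```

A filter like this handles accidental repeats only; it does nothing about robots, which is why the further steps described in this section are needed.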
Typical filtering (if done at all) merely compares log files against a known, and by definition always outdated, list of robots. In the real world, this can't work: new robots come online every day, old robots change their names, and still others masquerade as legitimate browsers. Our filtering goes far beyond existing methods: it performs daily Bayesian analysis. As a result, we can remove additional RACS that have not yet shown up on lists, as well as those that try to hide as legitimate processes. In the 12 months from August 2006 through August 2007, we estimate that our additional filtering removed 11% more RACS than if we had used even a completely up-to-date robot list. This 11% is up from 8% in the previous year, which suggests to us that RACS are growing more sophisticated and that lists, no matter how current, are no longer a feasible way to filter out all RACS.
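To illustrate the difference between matching against a list and scoring behavior with Bayesian analysis, here is a minimal Python sketch. It is not bepress's actual filter, whose details are not given in this document. The robot list, the three behavioral features, their likelihood values, and the prior are all hypothetical, and the features are treated as independent (a naive Bayes simplification).

```python
import math
from typing import Dict, Set, Tuple

# Hypothetical list of known robot user agents (lowercased).
KNOWN_ROBOT_AGENTS: Set[str] = {"examplebot/1.0"}


def list_filter_flags(user_agent: str) -> bool:
    """List-based filtering: flags only agents that are already on the list."""
    return user_agent.lower() in KNOWN_ROBOT_AGENTS


# Hypothetical likelihoods: feature -> (P(feature | robot), P(feature | human)).
LIKELIHOODS: Dict[str, Tuple[float, float]] = {
    "no_referrer": (0.80, 0.30),
    "fetched_robots_txt": (0.60, 0.01),
    "over_100_downloads_per_hour": (0.70, 0.02),
}


def robot_posterior(features: Dict[str, bool], prior_robot: float = 0.2) -> float:
    """Return P(robot | features) under a naive Bayes model.

    Works in log-odds so each feature simply adds its evidence.
    """
    log_odds = math.log(prior_robot / (1.0 - prior_robot))
    for name, (p_robot, p_human) in LIKELIHOODS.items():
        if features.get(name, False):
            log_odds += math.log(p_robot / p_human)
        else:
            log_odds += math.log((1.0 - p_robot) / (1.0 - p_human))
    return 1.0 / (1.0 + math.exp(-log_odds))


# A client that calls itself a browser but behaves like a crawler.
disguised_agent = "Mozilla/5.0"
disguised = {name: True for name in LIKELIHOODS}
# A client showing none of the suspicious behaviors.
ordinary = {name: False for name in LIKELIHOODS}

print("on robot list:", list_filter_flags(disguised_agent))
print("posterior (disguised):", round(robot_posterior(disguised), 4))
print("posterior (ordinary):", round(robot_posterior(ordinary), 4))
```

In this sketch the list check passes the disguised client because its name is not on the list, while the behavioral score rates it as very likely a robot. In practice the likelihoods would have to be estimated from log data and re-estimated regularly rather than fixed by hand.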
The following graph shows the extent of the inflation and how much needs to be corrected. The red line represents the number of full-text downloads that would be reported without any filtering at all. By implementing filters that are COUNTER-compliant, the number of reported downloads decreases to the green line. However, bepress filters go still further, as represented by the blue line.
The next graph shows that subscription-controlled downloads, while better protected than open access, still stand to be corrected even beyond COUNTER requirements.
Once we realized the extent of the inflation, even above bepress's already good filtering practices, we decided to reprocess every paper's download logs for our entire usage history in order to ensure the integrity of all our numbers. We spent months of computing time in 2007 calculating new numbers that reflect, as accurately as possible, the true measure of interest in a paper. When we applied these more rigorous filters on top of our prior filtering, we discovered an additional inflation of about 25% for bepress open access content, and about 7% for bepress subscription-controlled content. Other publishers and sites may see much larger corrections if they implement similarly rigorous filters, especially if they did not filter at all beforehand.
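An inflation figure and the resulting drop in a reported count are not the same percentage. The short Python sketch below works through the arithmetic under one stated reading: that "inflation of 25%" means the previously reported count was 1.25 times the corrected count. That reading, the function names, and the sample count of 1,000 are assumptions made for illustration.

```python
def corrected_count(previously_reported: float, inflation: float) -> float:
    """Corrected count, assuming previously_reported = corrected * (1 + inflation)."""
    return previously_reported / (1.0 + inflation)


def percent_decrease(inflation: float) -> float:
    """Percentage by which a reported count falls once the inflation is removed."""
    return 100.0 * inflation / (1.0 + inflation)


# Inflation figures quoted above: about 25% (open access), about 7% (subscription).
for label, inflation in [("open access", 0.25), ("subscription-controlled", 0.07)]:
    print(
        label,
        "| corrected count per 1,000 reported:",
        round(corrected_count(1000, inflation)),
        "| decrease:",
        round(percent_decrease(inflation), 1),
        "%",
    )
```

Under this reading, a paper previously credited with 1,000 open access downloads would show about 800 after reprocessing, a decrease of 20% rather than 25%.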
In October 2007, bepress published the results of our new counts, and alerted our users to expect decreases. We were willing to be the first to draw attention to the problem of inflated download counts, and the first to correct our own numbers. Though no one likes to be the bearer of bad news, it's our job to give users numbers they can trust. In the long run, this is good and important news: we are confident that download counts in the bepress system are now the most accurate in the industry.
A New Standard for Counting All Downloads
Thanks to these new filters, bepress meets and significantly exceeds COUNTER standards for all bepress content, whether open access or subscription-based. This is especially significant because COUNTER has largely been applied to subscription-based content, while open access content has not received the same attention. At bepress, we're committed to the same rigorous standard for all web-based academic content.
Questions for Other Publishers
We hope other publishers will follow suit by implementing equally rigorous new filters. In the meantime, here are some questions you can ask other publishers and site administrators to determine the accuracy of your non-bepress papers' download counts.
- Do you filter downloads at all?
- Are you COUNTER-compliant?
- What percentage of full-text download hits do you discard?
- What do you do to identify and filter robots?
- What do you do to identify and filter robots that masquerade as legitimate processes?
Once all academic publishers share the same commitment to rigor and accuracy, download counts can become a trusted and powerful tool for measuring academic value and interest.