
Archive for the Data Category

AOL accidentally releases data - provided by Google search

Posted on Tue, Aug 08, 2006 at 12:52 PM by Andrew Chadwick

A couple of weeks ago the AOL research department released this dataset:

"500k User Queries Sampled Over 3 Months. This collection consists of ~20M web queries collected from ~500k users over three months. Where the data is sorted by ananomized user id... The goal of this collection is to provide a real query log based on users. It could be used for personalization, query reformulation or other type of search research."

It was made available as a free download for non-commercial use only. It was quickly withdrawn but is still widely available as a BitTorrent download.

The data are reasonably anonymous. I say 'reasonably' because there are no strict personal identifiers in these data, but personal details like social security numbers, phone numbers, addresses and so on do feature in search requests. And there are plenty of those in here. The data are also uncensored.

This is going to send huge ripples through the regulatory debate, not least because AOL's search technology is provided by Google, the globe's number one search engine. These data are a very good guide to the kind of search queries that run through Google, and Google has kept this kind of information tightly under wraps in the past.

As an academic I'm torn: these data would provide a wonderful snapshot of search activity. But is it ethical to use them if the users have not consented? They were designed for an academic audience, but are these 'public' data? They will undoubtedly be publicly available for many years to come. It's also highly likely that both law enforcement and market research companies will be working with them already. Eszter Hargittai, an expert on the sociology of search, points out some of the problems.


Some Internet Politics Related Links

Posted on Mon, May 22, 2006 at 12:53 PM by Andrew Chadwick


Blogs and News Sites I Read

Posted on Mon, May 22, 2006 at 12:50 PM by Andrew Chadwick

Note: these are news feed links. If you have Firefox I recommend the Feedview extension. Internet Explorer 7 will handle these in much the same way.


Data

Posted on Mon, Mar 06, 2006 at 3:30 PM by Andrew Chadwick


These data form the basis of some of the analysis of the global digital divide in Chapter 4 of Internet Politics. They are all in the public domain but I thought it might be fun to bring them together here to enable you to generate your own analysis and charts.

I've included a sample chart in the Excel spreadsheet to give you an idea of what you can do.

Download the data spreadsheet (MS Excel format).

