AOL Goofup Leads to Google Top Keywords Leak?!


In what must be a stunning development across the web, AOL, in its foolhardiness (depending on how you look at it), has released a research data set of search queries made on AOL by roughly 650,000 users, adding up to nearly 20 million queries.

According to AOL:

500k User Session Collection
———————————————-
This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY.
Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

Brief description:

This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.

The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.

Why is this important? Because AOL uses Google as its search engine, which means anyone who gets their hands on this data has nearly 20 million queries to work with to come up with a list of the top keywords people are searching for!
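To give a rough idea of how little work that takes, here is a minimal Python sketch of a top-keywords count. It assumes the log is a tab-separated text file with a header row and a column named Query; the column names, the sample rows, the "-" placeholder for empty queries and the top_queries helper are all my own assumptions for illustration, not something taken from AOL's description above.

```python
import csv
from collections import Counter

def top_queries(lines, n=10):
    """Return the n most frequent queries from a tab-separated query log.

    `lines` is any iterable of text lines whose first line is a header
    containing a "Query" column (an assumed layout, not a confirmed one).
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Normalize case and whitespace so "MySpace" and "myspace " match.
        query = (row.get("Query") or "").strip().lower()
        # Skip empty queries and the assumed "-" placeholder.
        if query and query != "-":
            counts[query] += 1
    return counts.most_common(n)

# Made-up sample rows in the assumed layout, for illustration only.
SAMPLE_LOG = [
    "AnonID\tQuery\tQueryTime",
    "1\tmyspace\t2006-03-01 10:00:00",
    "1\tGoogle\t2006-03-01 10:05:00",
    "2\tmyspace\t2006-03-02 11:00:00",
    "2\t-\t2006-03-02 11:01:00",
    "3\tgoogle \t2006-03-03 12:00:00",
    "3\tebay\t2006-03-03 12:10:00",
    "4\tMySpace\t2006-03-04 13:00:00",
]

if __name__ == "__main__":
    for query, count in top_queries(SAMPLE_LOG, n=3):
        print(count, query)
```

For the real file you would pass an open file handle instead of the sample list, since the function only needs something that yields lines.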

Imagine the possibilities with this - anyone can mine this data for search terms and build websites targeting those queries. It's obvious that Google is now going to be majorly pissed at AOL for leaking this data. But what's more concerning is the way AOL has released this keyword list, with too little concern for privacy. (Links on this below.)

I will be keeping tabs on this story as it develops and as people come up with more analysis of this data and general views on the subject.

More resources on this huge development:

Forums.Digitalpoint.com is carrying a thread on this topic, and naturally everyone is stunned at being able to access this goldmine.

Plentyoffish.wordpress.com has already begun to analyze the search data and is posting the results of the analysis. Some of the queries are downright disturbing. Specific posts include “Aol data shows users planning to commit murder”, “AOL data showing Myspace growing SEO spam” and “Myspace killing dating sites”.

Head over to http://www.gregsadetsky.com/aol-data/ to find links to download this 500 MB behemoth of a file.

Techcrunch is on top of the story as well, and echoes my sentiments on why this is a huge concern:

The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with “buy ecstasy” and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless.

Adios for now. More updates later.
Update: 8.7.2006

Reuters now reports that AOL is facing a backlash. Why do I feel this will turn out to be a public embarrassment for AOL? Are lawsuits next?

Meanwhile, an AOL spokesperson has posted an apology on http://plentyoffish.wordpress.com

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

Yeah - a good screwup, but a boon for MFAs (made-for-AdSense sites), I would believe. I will post an analysis of this soon.

  1. Ryan 8.7.06 / 6pm

    I saw this and I still can’t believe that they would just release it to everyone in the world. I would be pretty damn pissed if one of my queries were ever released for anyone to see, I can only imagine what all these AOL users are feeling right now.

  2. apexad 8.9.06 / 5am

    AOL is bad mmkay…? people really need to get this through their heads. (that’s a period)

    The only good thing they do is AIM, and in truth AIM, by default is not secure at all. It’s only good because it’s popular.

    Sadly though, bad publicity can always be spun, always be forgotten. We are all tech nerds/geeks. Do you think that the average joe schmo internet user has read this article? My guess is no.

  3. Vivek 8.10.06 / 7pm

    Privacy is an overabused bitch, at least in this case.

    I don’t understand why AOL is being nailed for releasing information which cannot be traced back to its originators. [ If it could be, I say, that would’ve been a serious violation ].

    It was not like AOL released the names and addresses of people, which is exactly what most utility services like telephone, electricity, gas, cable etc., in reality, do! Half of the junk mail I get is courtesy of my electricity service (I figured that out from my name being misspelled the same way as on my bills). Even hospitals and friendly neighborhood doctors do the same, i.e. sell customer contact information for a price.

    In fact, when you enter your phone number to validate card transactions at some stores, they use it to figure out where you’re based (in which city), so that they can work out whether it would be economical to distribute flyers in that particular city or not. Maybe later on, these numbers get sold to telemarketers!

    On a personal level, I don’t know what conclusions can be drawn from a collection of 20 million queries, which seriously lacks context. For example, a query on Murder can mean different things. A search for a movie? Pics of an actress? Or an actor? A thesis study by a student of criminology? Pre-crime? It’s difficult to define.

    Most likely we’ll re-discover what we’ve known for ages. That is, most searches are for porno sites, movie icons and sports stars, and most people search the internet just to kill boredom.

    I don’t know if Google would be affected in business, as this info has been available for free on Yahoo! for ages (most searched movies, actors etc.) Of course if they were hoping to sell the info, it could mean a loss to them.

    Of course people who search Google with their SSNs and credit cards are absolute idiots who live in caves. AOL could’ve filtered them out easily. But I suppose they didn’t believe people could be so idiotic.

  4. Cornflakes 8.22.06 / 9am

    Here’s a *quick* site where you can search the AOL data for yourself:

    http://www.frogspy.com


