Embrace Messy Data (To Reach Internet Scale)

I believe I have discovered an important reason why the government has a hard time making good use of the Internet: people who design systems for the government abhor messy data.  They are used to setups in which every record is complete because the people populating the records are forced by law or regulation to fill them out entirely.

As we are learning on the web, though, the best systems arise from rather messy data.  Let’s say you are a marketplace site.  Should you have a listing process that extracts a ton of data but requires many pages of forms, or should you make it easy for people to list items? The latter approach has consistently won out.  You get far more listings that way, and you can then be smart about gathering additional data on each item after the initial listing.

Once you have a listing you have many options for how and where to get more data. 

  • Do some simple text analysis on the raw listing text.

  • Prompt the seller to provide more complete info later to get better exposure.

  • Possibly: ask the seller to pay to provide additional info in the form of keyword advertising (and generate revenues from that!)

  • Collect information via tags from people browsing listings (enroll the buyers!)

  • Correlate with similar items from this and other sellers.

And probably a bunch more.
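To make the first and fourth options a bit more concrete, here is a minimal sketch in Python of guessing a category from the raw listing text plus any tags that buyers add later. Everything in it is an assumption for illustration: the `CATEGORY_KEYWORDS` table, the `guess_category` function, and the scoring rule are hypothetical, and a real marketplace would learn its keywords from its own data.

```python
import re
from collections import Counter

# Hypothetical keyword lists, made up for this sketch.
CATEGORY_KEYWORDS = {
    "bicycle": {"bike", "bicycle", "schwinn", "trek"},
    "furniture": {"sofa", "couch", "table", "chair"},
}


def guess_category(listing_text, buyer_tags=()):
    """Return (category, confidence) guessed from free text and buyer tags.

    Confidence is the winning category's share of all keyword evidence.
    It is a rough score between 0 and 1, not a calibrated probability.
    """
    words = re.findall(r"[a-z0-9]+", listing_text.lower())
    evidence = Counter()
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Every keyword hit in the seller's own text counts as evidence.
        evidence[category] += sum(1 for w in words if w in keywords)
        # Tags contributed by buyers are added as further evidence.
        evidence[category] += sum(1 for t in buyer_tags if t.lower() in keywords)
    total = sum(evidence.values())
    if total == 0:
        # Nothing is known yet; the listing is still perfectly valid.
        return None, 0.0
    category, hits = evidence.most_common(1)[0]
    return category, hits / total


print(guess_category("Red Trek bike, barely used"))
print(guess_category("old table and a bike", buyer_tags=["chair"]))
```

The point of the sketch is that the listing is accepted no matter what. The category is a guess with a score attached, and it gets better as more evidence trickles in.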

LinkedIn didn’t get to 100 million profiles by asking everyone to enter their entire CV upon sign up!

What all of these have in common, though, is that they are messy.  They don’t create neat, clean, deterministic records but rather probabilities and confidence intervals.  Imagine for a second that Twitter required users to add hashtags such as #EWR to identify what their tweets are about.  The quantity of tweets would decline dramatically. But when some people add hashtags on their own, that adds a tremendous amount of signal without reducing total activity.
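As a rough illustration of the hashtag point, the following Python sketch treats hashtags as optional: every tweet is kept, and whatever tags people volunteered are counted as signal. The `topic_signal` function and the sample tweets are made up for this example and have nothing to do with how Twitter actually processes tweets.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def topic_signal(tweets):
    """Count voluntary hashtags and report the share of tweets that have any."""
    counts = Counter()
    tagged = 0
    for text in tweets:
        # Normalize case so #EWR and #ewr count as the same topic.
        tags = {t.upper() for t in HASHTAG.findall(text)}
        if tags:
            tagged += 1
        counts.update(tags)
    coverage = tagged / len(tweets) if tweets else 0.0
    return counts, coverage


# Made-up sample data: untagged tweets are accepted, not rejected.
tweets = [
    "Stuck on the tarmac #EWR",
    "Delays again #ewr #travel",
    "Nice weather today",
]
counts, coverage = topic_signal(tweets)
print(counts.most_common(), coverage)
```

Coverage here is well below 100%, and that is fine. The tagged minority still tells you what people are talking about, and nobody was turned away for not tagging.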

It would be great to see the government embrace messy data as the way of the Internet instead of fighting it.  Come to think of it, this is not just for the government: it should be a fundamental premise for anyone wanting to reach Internet scale.
