Data problems - whether they be inaccurate data, incomplete
data, data categorization issues, duplicate data, data in need of
enrichment - are age-old.
IT executives consistently agree that data quality and data
consistency are among the biggest roadblocks to getting full value
from their data. In today's information-driven businesses, this
issue is more critical than ever.
Technology, however, has not done much to help us solve the
problem - in fact, it has accelerated the creation of mountains of
"bad data", while doing very little to help organizations deal
with it.
One "technology" holds much promise in helping organizations
mitigate this issue - crowdsourcing. I put the word technology in
quotation marks - as it's really people that solve the problem, but
it's an underlying technology layer that makes it accurate,
scalable, distributed, connectable, elastic and fast. In an article
earlier this week, I referred to it as "Crowd Computing".
Crowd Computing - for Data Problems
The Human "Crowd Computing" model is an ideal approach for newly
entered data that needs to either be validated or enriched in
near-realtime, or for existing data that needs to be cleansed,
validated, de-duplicated and enriched. Typical data issues where
this model is applicable include:
- Verification of correctness
- Data conflict and resolution between different data
sources
- Judgment calls (such as determining relevance, format or
general "moderation")
- "Fuzzy" referential integrity judgment
- Data enrichment or enhancement
- Classification of data based on attributes into categories
- De-duplication of data items
- Image data - correctness, appropriateness, appeal, quality
- Transcription (e.g. hand-written comments, scanned
content)
This approach is ideal in areas such as the data warehouse, master
data management, customer data management, marketing databases,
catalogs, sales force automation data and inventory data - or any
time that business data needs to be enriched as part of a business
process.
Human Crowd Computing is NOT Outsourcing or "Hiring Temps"
Human Crowd Computing is completely different from outsourcing
the problem or hiring a large number of temporary workers.
Human Crowd Computing is instantly scalable - up and down.
Outsourcing is the equivalent of renting some other company's data
center. And "hiring temps" is the equivalent of bringing in a
temporary data center. Both approaches take time to "turn on". They
can't scale "up" very well. And they're not elastic. And you pay
for the resource whether you use its full capacity or not.
CrowdFlower - Scalable, Elastic Human Computing
I'm most familiar with a San Francisco, CA-based firm called
CrowdFlower that fits the description of Human Crowd Computing.
It consists of a software platform that includes a workflow
engine, quality monitoring and "contributor" rating that manages
the distribution of work across a community of 2,000,000 "active
contributors" in dozens of countries across the world.
At each step in the workflow, the judgments of multiple workers (or
"contributors") are algorithmically aggregated into one trusted
answer based on each contributor's individual accuracy.
Individual contributor accuracy ratings are assessed in a
competition-style model. At random points, data are audited by
"gold standard" workers to ensure accuracy and quality.
In a verification study done with a leading digital media
company to verify, correct and enrich business listings, the
CrowdFlower platform was able to raise accuracy levels of data from
typically 75% to over 99%.
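
The aggregation step described above can be illustrated with a minimal sketch. It is not CrowdFlower's actual algorithm, which this article does not detail; it simply assumes an accuracy-weighted vote, with each contributor's accuracy estimated from items whose correct answer is already known. The function names, the 0.5 default weight and the sample data are all hypothetical.

```python
from collections import defaultdict


def estimate_accuracy(answers, gold):
    """Share of a contributor's answers that match known-correct ("gold") items.

    Both arguments map item id -> label. Only items present in gold are
    scored. Returns None when the contributor answered no gold items.
    """
    scored = [item for item in answers if item in gold]
    if not scored:
        return None
    correct = sum(1 for item in scored if answers[item] == gold[item])
    return correct / len(scored)


def aggregate_judgments(judgments, accuracy, default_weight=0.5):
    """Combine (contributor, answer) pairs into one answer.

    Each vote is weighted by the contributor's accuracy, assumed to be a
    number between 0 and 1. Returns (answer, share_of_total_weight).
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    weights = defaultdict(float)
    for contributor, answer in judgments:
        weights[answer] += accuracy.get(contributor, default_weight)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all contributing weights are zero")
    best = max(weights, key=weights.get)
    return best, weights[best] / total


# Hypothetical data: one highly accurate contributor outweighs two weaker ones.
accuracy = {"ann": 0.95, "bob": 0.60, "cy": 0.30}
judgments = [("ann", "Toys"), ("bob", "Games"), ("cy", "Games")]
answer, share = aggregate_judgments(judgments, accuracy)
print(answer, round(share, 2))
```

In this toy case a plain majority vote would pick "Games", while the weighted vote picks "Toys" with only about half of the total weight behind it. A low share like that is a natural signal to request more judgments.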
The CrowdFlower implementation of Human Crowd Computing is
highly effective, and proves out the applicability of this model
for a wide variety of data verification, enrichment, cleansing and
remediation projects.
A Leading Online Marketplace and Human Crowd Computing
A second proof point of this type of technology is an example of
an implementation at a leading online marketplace, which has
hundreds of millions of listings live at any given moment.
This marketplace has an incredible variety of items listed - in
the past, those items have included old gum, entire towns, and even
spouses. The fact that anyone can list almost anything makes this
marketplace the place to go to find rare or outlandish items.
Major Product Categorization Problems
It's no surprise, then, that one of the biggest challenges this
marketplace faces is product categorization. Product categories are
a key way that people search for items.
In its busiest months, this marketplace needs close to 100,000 new
products to be matched to something called a Global Trade Item
Number - a unique 12-14 digit number - using product information
which typically must be gathered from multiple different sources.
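
One part of this work can be checked mechanically: the last digit of a Global Trade Item Number is a check digit computed from the other digits using the standard GS1 modulo-10 scheme. The sketch below is only an illustration of that kind of automated screening. The article does not say whether the marketplace or CrowdFlower runs such a check, and the function names are hypothetical. GS1 also defines an 8-digit form, which the sketch accepts alongside the 12-14 digit forms.

```python
def gtin_check_digit(body):
    """Compute the GS1 check digit from all digits of a GTIN except the last."""
    total = 0
    # Walking right to left, the weights alternate 3, 1, 3, 1, ...
    for position, char in enumerate(reversed(body)):
        weight = 3 if position % 2 == 0 else 1
        total += int(char) * weight
    return (10 - total % 10) % 10


def is_valid_gtin(code):
    """True if code is an 8, 12, 13 or 14 digit string with a correct check digit."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])


print(is_valid_gtin("036000291452"))  # a commonly cited UPC-A example
print(is_valid_gtin("036000291453"))  # same code with the last digit altered
```

A check like this only catches typing and transcription errors. Deciding which product a listing actually describes still needs the human judgment discussed here.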
Depending on the month, the number of products requiring
categorization ranges from below 5,000 to close to 100,000. A
scalable and elastic computing model is required to support the
variations in workload.
Because judgment calls are involved, and data must be retrieved
and compared from potentially many different sources, the
CrowdFlower platform uses multiple humans for each judgment call to
ensure high levels of accuracy. About 60% of categorizations are
completed with 2 or 3 individual responses; however, particularly
complex judgment calls can require 10 or more responses. I've
confirmed that this algorithm is quite tunable - if your data
needed higher levels of certainty, you would simply involve more
human opinions, enabling you to achieve the goals you require.
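
The idea of involving more human opinions only when a judgment is contested can be sketched as a simple stopping rule. This is an assumption-laden illustration, not the platform's real logic: the function name, the agreement threshold and the minimum and maximum response counts are hypothetical tuning knobs, and votes are unweighted for simplicity.

```python
from collections import Counter


def collect_until_confident(responses, threshold=0.75,
                            min_responses=2, max_responses=10):
    """Consume responses one at a time until the leading answer is trusted.

    Stops as soon as the leading answer's share of the votes reaches
    threshold (after at least min_responses), or when max_responses
    have been used. Returns (answer, responses_used, agreement).
    """
    counts = Counter()
    used = 0
    for answer in responses:
        counts[answer] += 1
        used += 1
        if used < min_responses:
            continue
        leader, votes = counts.most_common(1)[0]
        if votes / used >= threshold or used >= max_responses:
            break
    if used == 0:
        raise ValueError("no responses supplied")
    leader, votes = counts.most_common(1)[0]
    return leader, used, votes / used


# An easy item settles after two matching responses.
print(collect_until_confident(["A", "A", "B"]))
# A contested item needs more responses before the leader reaches 75%.
print(collect_until_confident(["A", "B", "A", "A", "B"]))
```

Raising `threshold` makes the rule ask for more responses before it accepts an answer, which is the trade-off between certainty and cost described above. An item that hits `max_responses` with low agreement is a candidate for expert review rather than automatic acceptance.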
Results Delivered
The marketplace formerly outsourced product categorization -
essentially paying for a large staff of contractors who were
alternately overwhelmed and then idle, depending on the day. From
week to week, there could be as much as a 400% difference in
workload.
With a Human Crowd Computing platform, the marketplace more than
tripled its throughput for product categorization - from 300 per
hour to 1,000 per hour. At the same time, the number of improper
classifications was reduced by over 67%. To cap it off, CrowdFlower
claims that this solution reduced the marketplace's costs by some
70%.
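
The quoted rates are easy to sanity-check. Going from 300 to 1,000 categorizations per hour is roughly a 3.3x throughput, or about a 233% increase. The short sketch below works that out and, as a rough illustration, applies both rates to a peak month of close to 100,000 products. It assumes the hourly rate is sustained throughout, which the article does not claim, and the helper names are hypothetical.

```python
def percent_change(before, after):
    """Relative change from before to after, expressed as a percentage."""
    return (after - before) / before * 100


def hours_needed(items, items_per_hour):
    """Hours of processing required at a constant hourly rate."""
    return items / items_per_hour


old_rate, new_rate = 300, 1000   # categorizations per hour, from the article
peak_items = 100_000             # approximate peak monthly volume, from the article

print(round(new_rate / old_rate, 2))                  # 3.33
print(round(percent_change(old_rate, new_rate), 1))   # 233.3
print(round(hours_needed(peak_items, old_rate), 1))   # 333.3
print(round(hours_needed(peak_items, new_rate), 1))   # 100.0
```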
CrowdFlower has published a nice 7-page customer success brief
on one of their larger customers that is worth reading. I can't
link to the report directly, but if you go to their home page and
click on the "get a free report" button, you'll get it via e-mail
within about 30 seconds - after you answer 3 or 4 pesky
questions.
Conclusion
Without question, this model of Human Crowd Computing will
become increasingly mainstream in organizations. It's highly
appropriate for any situation involving large to huge numbers of
small tasks that require human judgment. With the appropriate
software platform, the internet and commonly available
connectivity/interoperability software, this solution may be
exactly what you need for your data problem.
Although I can't personally testify to the reduction in costs
that the marketplace experienced, I have little doubt that there
was a significant reduction in the "cost per categorization"
metric. Furthermore, I am highly confident in the massive
scalability of the elastic computing model they employed, and I am
also highly confident in that model's ability to produce quality
results.
Although I've highlighted CrowdFlower as an example, this pair
of articles isn't meant to be about CrowdFlower - it's about a new
model of leveraging large, distributed opt-in communities of
workers who are all connected over the internet and are managed by
sophisticated workflows and accuracy ranking systems.
But as an innovator in this space with many examples of
successful implementations and over 500 current customers using the
platform, CrowdFlower makes for an excellent example of how
organizations can solve their data problems using this new
approach.
Hollis Tibetts