Big Data.
Along with medical applications and geography-based social
networks, "Big Data" is the hottest topic in tech these days.
Everyone seems to be talking about it - though if you listen
closely you'll realize, as is often the case with new technologies,
that nobody really knows what it is.
In fact, Big Data has been around for decades - but until now it
has been in the hands of a small elite: mostly academic and
government researchers manning supercomputers in secure facilities.
But, as usual, Moore's Law has done its work. . . and now billions
of servers, personal computers, tablets and smartphones around the
world are each generating gigabytes of raw data on an almost daily
basis. And with the presence of the global broadband Internet,
those mountains of data are now merging into vast data stores of
literally unimaginable size.
It is this huge collection of raw information that most people
think of when they mouth the words "Big Data." But that is
incorrect; that is merely A Whole Lot of Data. It only becomes Big
Data when that raw material has been processed in such a way as to
glean valuable meta-data that is invisible in everyday life,
but which emerges when you sort through the behaviour of millions
or billions of data sources.
That brings me to my definition of Big Data: Any body of
information that is so big it cannot be analyzed directly for
profitable use in its raw form.
Let's deconstruct this definition a bit. Note that we begin with
a 'body of information'. I've phrased it this way to recognize that
there have always been these large information caches. William the
Conqueror's Domesday Book of the 11th century, the first great
census of the Medieval world, was for its time a vast body of
information. So was the Oxford English Dictionary eight centuries
later. These giant repositories of knowledge were all but
impenetrable to their owners because of the lack of automatic
search tools. At best, they could be accessed only piecemeal,
the conclusions drawn rarely more than anecdotal.
The modern census, a phenomenon of the 19th century, was at
least initially another largely impenetrable body of information.
Thanks to early adding and computational machines, it was possible
to extract a few pieces of meta-data from these pools - such as
total population. But the amount of data captured from each data
point necessarily had to be small (e.g. a half-dozen questions
about family size, age, etc.) or the data pool quickly grew out of
control once again. Indeed, the censuses of the late 19th century
grew so unmanageable that the U.S. Census Bureau nearly used up an
entire decade (the 1880s) - and had almost begun the next census -
before it finished processing the previous one. It was the desperate
need to solve this problem that led Herman Hollerith to develop his
Tabulator, a forerunner of the modern computer.
Even today's cheapest laptop or tablet is millions of times more
powerful than the Hollerith Tabulator, but the challenge of Big
Data has never really gone away. That's because while each
generation of processor grows ever more powerful at crunching
volumes of data, it also becomes even more efficient at creating
that data. And that's just the start, because most of those
billions of processors are now connected via the World Wide Web,
and that interconnection multiplies the amount of data being
created. The good news is that this interconnection also allows
us to multiply our processing power - and use that expanded power,
with the right application tools, to crunch all of that expanded
data. . . and, with luck, uncover trends and truths heretofore lost
in the noise.
It is these tools that we really mean these days by
"Big Data". And it is their creation and early implementation that
is the subject of all of those magazine articles, blog entries and
conference reports that you have been reading. These Big Data folks
believe two things:
1. That the right tools can be found to crunch all of this raw data
in a fast and efficient way - a challenge that will grow even
greater by the year as billions more processors get embedded into
the natural world and start emitting data; and
2. That there really are valuable nuggets of metadata hidden out
there in those mountains of data, that they can be found, and that
their value will be greater than the cost of their discovery.
Hence my use in the definition of the terms "analyze" and
"profitable". Whatever the breathless coverage by the media (and by
entrepreneurs in the industry) about the importance of Big Data,
those two core beliefs I just listed have not been proven viable.
At present, we have no idea what tools we will ultimately need to
drill down through all of that data out there, how far down we will
have to drill to find anything useful, and once we do find
something useful, whether it can be produced in a form that is useful
to corporate customers, governments, healthcare professionals and
everyday consumers - and finally, even if it is useful, whether it
will show sufficient return on investment to justify continuing the
pursuit.
This is not going to happen overnight, whatever you may have
read. A year from now, we may begin seeing the first public results
of this research. When will you see Big Data as a part of our daily
lives, the way we see past tech revolutions like the Web, social
networks, GPS and smartphone apps? Maybe five years.
And we may need every day of that half-decade. There is a
technology law, first proposed years ago by my friend, technology
journalist and author Mike Malone. It's not as famous as Moore's or
Metcalfe's Laws, but I think it might be particularly useful here.
It says that all technology revolutions arrive more slowly than we
predict, but more quickly than we are prepared for.
My gut tells me that Big Data is real, that we will find the
right tools to explore it (and that they will be available to
almost everyone via the Cloud), that we will find useful results
almost from the beginning - and that they will grow even richer the
deeper we drill. And that a whole second generation of tools will
convert those results into content that will form the basis for a
whole new boom of entrepreneurial start-ups. There is too much
historic precedent - such as the work of epidemiologists in the
eighteenth century looking at illness tables - to argue
otherwise.
But my hunch is also that all of this will take at least five
years, and maybe more, to pull off. And that we will need every bit
of that half-decade, and probably a whole lot more, to deal with
the larger cultural implications of Big Data. Eugenics, racial
theory and a whole host of other misguided, and often murderous,
nonsense emerged from the misuse of the cruder forms of Big Data
two centuries ago. There are a whole bunch of potential problems
waiting in the wings with this newest wave of Big Data - the
biggest one being privacy. Indeed, in my cynical moments, I believe
that the only thing standing between us and our complete loss of
personal privacy is the fact that we are still too dumb to devise
the right Big Data tools. That won't last long - indeed, you'd be
amazed, and shocked, at what is being tracked by Big Data already.
It is time to start worrying about these matters right now.
In my next two blog entries, I plan to look even deeper into
what needs to be done to help Big Data achieve its destiny; and
then, into what needs to be done to restrain Big Data once it gets
there.