When I first started running programs that dealt with big
data -- meaning both a lot of data about something or somebody and a
lot of things or people to have some data about -- big was actually
pretty small.
I once built a system for a modern 300-bed hospital that ran
everything (including patient records for half a million people) on
less than 10GB (yes, you read that right) of high-performance disc
storage.
It's interesting to note that the performance of today's
comparatively larger storage arrays isn't intrinsically much better
than I was getting in 1980 -- maybe twice as fast for data retrieval.
There's just a lot more data being stored, and the cost per stored
bit is way down. Some of the same operational challenges are still
around, too.
- First, data quality remains an issue. The more data you
accumulate, the harder it is to keep everything consistent and
correct. We have invented whole new areas of focus (master data
management) and tools to deal with the garbage in/garbage out
problem, but it's not getting any easier. With really large data
sets accumulated over time (which means that things change -- what was
once correct isn't any more, and vice versa), you have to solve for
garbage in/gold out and prevent gold in/garbage out.
- Second, adequate data characterization (metadata to the geeks)
is critical. How you deal with data -- even how you choose to
organize its storage -- requires you to know how much data there is
going to be and how fast it's likely to grow and change. A query
that runs well to find 100 rows in a million-row table may not run
well on 100 billion rows. It matters how you flag and track errors.
Logging and auditing matter if the data changes frequently -- less so
if the data is essentially static.
- Third, interpretation remains more of an art than a science --
or a science accessible to only a few trained specialists. Software
developers have had to design efficient filters and pattern
recognizers that can sift through mountains of data and find
(perhaps unanticipated) patterns that are relevant to a dimension
of interest.
- Fourth, data visualization -- representing results in an easily
consumable form -- is critical. What good is all that data if you
can't understand what the interpreters -- human or software -- concluded
from their analysis? Data visualization design theory isn't new
but, like many things that involve deep understanding of the range
and vagaries of human cognition, it's hard to do well.
- Fifth, you're generally going to have to choose between a
real-time view of the data (which may mean that you have to
continuously recompute everything whenever the data changes) and a
complete but retrospective view (the most common state of
cube-based analytics), which will always be somewhat out of
date.
- Sixth, how do you know in advance how long the data is relevant
or valuable? Data costs money to acquire, store, analyze and back
up. A retention policy beyond a typical "keep everything forever"
approach is needed, and that policy has to be enforced.
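To make the sixth point concrete, here is a minimal sketch of what an enforced retention policy might look like in code. Everything in it is an assumption made for illustration: the `Record` type, the category names, the retention windows in `RETENTION_DAYS` and the 90-day default are hypothetical, not taken from any real system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Record:
    """Hypothetical stored item: an id, a category and the date it was last used."""
    record_id: str
    category: str
    last_used: date


# Hypothetical retention windows, in days, per data category.
RETENTION_DAYS = {"billing": 7 * 365, "audit_log": 365, "session": 30}
# Hypothetical fallback for categories nobody has classified yet.
DEFAULT_RETENTION_DAYS = 90


def is_expired(record: Record, today: date) -> bool:
    """True if the record has gone unused for longer than its category allows."""
    days = RETENTION_DAYS.get(record.category, DEFAULT_RETENTION_DAYS)
    return today - record.last_used > timedelta(days=days)


def enforce_retention(records: list[Record], today: date) -> tuple[list[Record], list[Record]]:
    """Split records into (kept, purged) according to the policy above."""
    kept = [r for r in records if not is_expired(r, today)]
    purged = [r for r in records if is_expired(r, today)]
    return kept, purged
```

Run on a schedule, `enforce_retention(records, date.today())` would return the records to keep and the ones due for purging. The part that matters is that the policy lives in one explicit table that someone has to own and justify, rather than in a default of keeping everything.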
It's probably best to start from the value end of the equation
and keep only what you are sure you will need. After all, someone
else is probably keeping everything else for you already.
John Parkinson