Lately, we've been hearing a lot about "big data": there
are various new applications built around Hadoop and NoSQL, and
all sorts of new analytics software. I've spent a lot of time
talking to people and thinking about these trends recently, and
have come away convinced that, indeed, we're seeing huge changes in
both the amount of data that is being collected and how, as
individuals, companies, and societies, we're going to deal with
that data.
We're just in the early stages of a complete rethinking of how
organizations deal with data and how they turn raw data into
information they use to make decisions. However, I'm also
convinced that the "big data" terminology is probably more
confusing than useful. As Jeff Bedell, CTO of analytics vendor
MicroStrategy, told me, "big data" is just a buzzword. "The
whole game is to introduce confusing terms."
For instance, Gartner describes big data as being based not only
on the volume of data, but also its variety, velocity, and
complexity. Analyst Mark Beyer spoke at last fall's symposium
about extreme information management and said companies
need to build modern information management systems that include
logical data warehouses.
Instead of just talking about "big data" as if it were one thing,
it might be more useful to think of a variety of changes in how
organizations deal with data.
Sure, there are some cases with truly massive amounts of
data. The Large
Hadron Collider produces 15 petabytes (15,000 terabytes) of
data annually, while the upcoming DOME
radio telescope project is expected to generate more than one
exabyte (one million terabytes) of data per day. But such
projects are relatively rare, and more related to high-performance
computing than to typical business cases.
Instead, most typical organizations are dealing with databases
that are notably smaller but can still measure in the terabytes and
petabytes. (That's still a LOT of data.) This data can come
from a variety of sources: tracking what people are doing on a
website or multiple websites, analyzing social networks, or dealing
with all of the data generated by sensors.
Before I talk about how data issues have changed in recent
times, it might be helpful to recap some of the big trends in the
area up until now.
There have been databases (collections of data) on computers
almost as long as there have been digital computers, notably with
products like IMS running on IBM's mainframe systems. The
early databases were hierarchical systems, but the model that
became and remains the standard is the relational model, which
dates back to a 1970 paper from Edgar F. Codd titled "A
Relational Model of Data for Large Shared Data Banks."
Today, every large organization uses one or more relational
database products to store its transactional data, notably Oracle
Database, IBM DB2, Microsoft SQL Server, and the open-source MySQL
(also now owned by Oracle). All sorts of applications have been built
on top of relational databases including inventory, accounting,
enterprise resource planning (ERP), customer relationship
management (CRM), HR applications, and the thousands of custom
applications large organizations typically
have.
In particular, as transactions have grown more complex and are
often distributed among multiple machines, many firms have
implemented what are known as online transaction processing, or
OLTP, systems.
One big change over the past couple of decades has been the
emergence of business intelligence platforms and data warehouses,
often but not always working together.
A data warehouse typically stores copies of data from
operational systems, but these systems are not themselves used for
the constant transactions to run a business. Instead, they are
used to keep a history of the data, to integrate multiple systems,
and often as a starting point where the data is structured for use
in analytics applications. Teradata is probably the company
best-known for its data warehouse products, but in recent years
Oracle, with its Exadata line (based in part on its Sun acquisition),
and IBM (including its Netezza acquisition) have been getting more
attention, along with pure software players such as Greenplum (now
part of EMC).
There are lots of different kinds of business analytics
applications, but probably the most common is often known as online
analytical processing, or OLAP. The data is configured in a
multidimensional data "cube," in which the data from a relational
database (or a series of databases, or a data warehouse) is brought
together and connected, then analyzed. Often you'll see
business intelligence platforms run as a "semantic layer" on top of
such cubes within a data warehouse.
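To make the cube idea a little more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular OLAP product works: the sales records, the three dimensions (region, product, quarter), and the `build_cube` helper are all hypothetical. It simply pre-computes a revenue total for every combination of dimension values, so that any "slice" can be looked up directly.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales records, as they might be pulled from a relational table.
SALES = [
    {"region": "East", "product": "Widget", "quarter": "Q1", "revenue": 100},
    {"region": "East", "product": "Gadget", "quarter": "Q1", "revenue": 150},
    {"region": "West", "product": "Widget", "quarter": "Q1", "revenue": 200},
    {"region": "West", "product": "Widget", "quarter": "Q2", "revenue": 250},
]
DIMENSIONS = ("region", "product", "quarter")


def build_cube(rows, dimensions, measure):
    """Pre-aggregate the measure for every combination of dimension values."""
    cube = defaultdict(int)
    for row in rows:
        # Each row contributes to the grand total (no dimensions fixed),
        # to every single-dimension total, every pair, and so on.
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row[measure]
    return dict(cube)


cube = build_cube(SALES, DIMENSIONS, "revenue")

print(cube[()])                                           # grand total
print(cube[(("region", "West"),)])                        # all West revenue
print(cube[(("region", "East"), ("product", "Widget"))])  # East Widgets
```

Real systems are far more careful about storage and about which combinations they compute ahead of time, but the trade-off is the same: do the aggregation work once so that analysts can ask questions along any dimension quickly.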
The best-known business intelligence platforms are
from Business Objects (owned by SAP), Cognos (owned by IBM),
Hyperion (owned by Oracle), Microsoft, MicroStrategy, and SAS.
As Bedell describes it, this view comes out of work in the 1990s
on "very large databases" and data warehousing where you had a
separate database for reporting, as opposed to the one you used for
transactions.
Typically, such reporting databases would capture summary data,
not every transaction; the idea was that by analyzing the data you
could have more insight into what is happening in your
business.
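As a rough sketch of what "summary data, not every transaction" can look like, the Python snippet below uses the standard library's `sqlite3` module with an in-memory database. The table names, columns, and sample rows are invented for illustration. The transactional table holds one row per sale, while the reporting table keeps only one row per store per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Transactional side: one row per individual sale (hypothetical data).
conn.execute("CREATE TABLE transactions (sale_date TEXT, store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("2012-03-01", "Boston", 20.0),
        ("2012-03-01", "Boston", 35.0),
        ("2012-03-01", "Denver", 15.0),
        ("2012-03-02", "Boston", 50.0),
    ],
)

# Reporting side: a separate table holding one summary row per store per day.
conn.execute(
    """
    CREATE TABLE daily_sales AS
    SELECT sale_date, store, COUNT(*) AS num_sales, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY sale_date, store
    """
)

summary = conn.execute(
    "SELECT sale_date, store, num_sales, total_amount "
    "FROM daily_sales ORDER BY sale_date, store"
).fetchall()
for row in summary:
    print(row)
```

Four sales collapse into three summary rows here; at real volumes the reduction is what makes a separate reporting database practical, at the cost of losing the individual transactions.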
This kind of business intelligence built a very large market,
and it's what lies behind most of the great examples of BI, like
those described in Moneyball.
Such systems are usually run by professionals and require a fair
amount of setup, though this is changing. Lately, I've been
particularly impressed with a number of tools that let more typical
business analysts (rather than programmers) do quick reporting and
analysis on corporate data. These vendors include Tableau
Software, QlikTech's QlikView, and Tibco Spotfire, all of which
allow quick visualization of data from multiple
sources.
With the growth of both the Web and sensor-based applications,
the amount of data collected has been scaling faster than
traditional databases can allow, resulting in new approaches often
called "NoSQL" and based on tools such as Apache Hadoop. I'll
talk about this more in a later post, but it seems like every
enterprise vendor is now working on Hadoop-based solutions, as are
smaller companies and, crucially, the open-source movement.
In addition, there is a growing emphasis on what is often called
"unstructured" data: content or information that may not fit well
into traditional databases, including everything from webpages to
text documents to media. A whole new set of tools exists for
such content, covering traditional enterprise and document content
management systems like those from Documentum (now EMC), FileNet
(now IBM), Stellent (now Oracle), OpenText, and Microsoft
SharePoint, and newer unstructured search providers such as Autonomy
(now part of HP) and Endeca (now part of Oracle).
In short, there are lots of different data needs, and most large
organizations will end up with multiple solutions, many of them
working with several providers.
In the next few posts, I'm going to talk about a number of these
areas, but it's clear that these are different markets with
different tools aimed at different customers, not some monolithic
new "big data" market. However, it is just as clear to me that
organizations are going to have to rethink how they intend to
collect, store, analyze, and manage data and how they are going to
take this data and turn it into real information.