Courtesy of
Computing.co.uk
If an organisation is sitting on top of 10 databases, each of
which is 100TB in size, it has a big data issue, right?
Not necessarily - it certainly has a problem in that it has a lot
of data to deal with, but federating databases and applying data
cleansing, master data management (MDM) and business analytics can
provide a pretty decent solution to this. Big data introduces
a different set of problems - ones that require a different way of
thinking, which may take many outside their comfort zone.
Let's begin by taking a simple view of information within an
organisation. In the dim, dark past when I got into the ICT
world, a rule of thumb approach was that around 20% of an
organisation's information was in electronic format, the rest on
paper. Of the electronic stuff, about 80% was held within formal
databases. Roll the clock forward by a couple of decades and
this has essentially flipped - around 80% of an organisation's
information is now in electronic format, and only around 20% of
that will be held in a formal database. The rest of the
electronic stuff will be held in various file formats dotted around
on file servers, personal devices and so on.
Any "big data" approach that just deals with the data held within
databases is therefore only using 16% of the available information
- not a good way to reach mission-critical decisions.
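The 16% figure is simply the two rule-of-thumb shares above multiplied together. A minimal sketch of that arithmetic, assuming the article's round numbers (the variable and function names are illustrative only):

```python
# Rule-of-thumb shares quoted in the article (not measured data).
electronic_share = 0.80  # share of all information held in electronic format
database_share = 0.20    # share of that electronic information held in formal databases


def database_fraction(electronic: float, in_database: float) -> float:
    """Fraction of all organisational information that sits in formal databases."""
    return electronic * in_database


today = database_fraction(electronic_share, database_share)
print(f"Share of information in formal databases: {today:.0%}")
```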
This is further complicated by how information usage has
changed. Back at that earlier time, an organisation's data
assets were pretty easy to define - the data was in that database
that was on that server in that datacentre. Now, the
organisation's information assets have to include shared
information across the value chain of customers and suppliers - and
then beyond that into the information held on the internet itself
and across social networking sites.
All of a sudden, the "big data" approach of federating
information across those large databases that the organisation
controls is looking a little measly. Even if it is assumed
that those databases are large - say a total of 10 petabytes (PB),
or close to 1,000 times the amount of information held in the
American Library of Congress - the total size pales into
insignificance against the volume of information held on the
internet, where other potentially useful information can be
found in semi-structured or unstructured formats. The current
information volume of the internet is estimated to be around 2
zettabytes (ZB) - or 2 million PB. Factoring that in reduces
the 16% of available information that you may have thought you
were acting on to a very small fraction of a single per cent.
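To put a rough number on "a very small fraction", the sketch below compares the hypothetical 10PB of databases with the estimated 2 zettabytes on the internet. It assumes decimal (SI) units and, as a simplification, ignores the organisation's own non-database files. The variable names are illustrative only.

```python
# Decimal (SI) units: 1 ZB = 10**21 bytes and 1 PB = 10**15 bytes.
PB_PER_ZB = 10**21 // 10**15  # 1,000,000 PB in one ZB

org_databases_pb = 10  # the article's hypothetical database total
internet_zb = 2        # the article's estimate of the internet's volume

internet_pb = internet_zb * PB_PER_ZB
share_of_internet = org_databases_pb / internet_pb

print(f"Internet volume: {internet_pb:,} PB")
print(f"Databases as a share of that volume: {share_of_internet:.4%}")
```

On these assumptions the databases come to roughly five ten-thousandths of one per cent of the internet's estimated volume.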
Sure, a lot of the available information out there on the
internet is either complete dross or is not germane to the problem
you are dealing with. The problem is that some of it is - the
views of customers being propagated through the social networks;
the performance and activities of competitors; the dynamics of the
markets in which you are operating, whether these are vertical or
geographic. You need the tools to identify that useful stuff,
and then the means to bring it into an environment where it can be
analysed and reported against in a manner that allows intelligence
to be gleaned from a broader set of sources - in other words, a
true big data approach.
A term that is being used around big data sums it all up nicely
- it is about volume, velocity and variety. The volume side
is the one everyone accepts, but is also the one that vendors have
latched on to and focused on. The velocity side is where the
big battles are being played out - how fast can one vendor
provide insights against the large volume of data that is under
focus?
But variety is often glossed over - and yet it is the most
important. Less structured information held in documents and
spreadsheets, information that can be gleaned from less
traditional sources such as voice and video, and the internet
sources alluded to earlier are all potentially relevant.
Those who can use the right technologies in order to bring this
variety of information sources together such that volume and
velocity needs are also met will be the outright winners in a world
of true big data - those who just look at it as a problem with
volumes of structured data under their direct control will face
major problems.
For a bit more on this subject, see Quocirca's argument on why
"big data" should be re-termed "unbounded information".