7 Deadly Sins of Big Data Users

Courtesy of Information Week Software

 

Sloth, negligence, gluttony...and that's just the beginning. Consider these common mistakes organizations make when assessing the meaning of large amounts of data.

We're swimming in a vast sea of data that's rising every year. And according to Josh Williams, president and chief science officer of Kontagent, a social and mobile analytics company, companies that collect, analyze, and interpret data accurately--and act upon it quickly--have a significant competitive advantage.

At the Kontagent Konnect user conference in late May, Williams made a presentation called the "7 Deadly Sins of Data Science," in which he outlined the common mistakes that organizations make when processing large amounts of data. There's a good chance you're familiar with one or more of the Deadly Sins, which include Sloth, Negligence, Gluttony, Polemy, Imprudence, Pride, and, of course, Torpor.

We've summarized each transgression below. If your organization has sinned, now is the time to repent.

1. Sloth: Lazy Data Collection. If your data-collection skills are bad, the data you acquire probably won't help your organization much. "We see a lot of times that faulty measurements lead to faulty management," Williams told InformationWeek. "It's a garbage-in, garbage-out problem."

2. Negligence: Misapplied Analysis. It's easy to make analytical errors as data starts to filter through your organization. "Not everyone is a data expert, and they can draw the wrong conclusions," Williams said. You must analyze the data rigorously to create simple, easy-to-understand reports.

3. Gluttony: Too Many Reports. A glut of information and good visualization tools often lead organizations to produce too many reports, including those with vanity metrics (e.g., a website's number of registered users) that cause you to miss important facts about your business or industry. "Whether you're doing this in-house or with third-party vendors, it's easy to spit out a lot of reports, a lot of data," said Williams. "Too much information can cloud your judgment, and that makes it hard to make decisions."

4. Polemy: Data Definition, Use Disagreements. If the people in your organization don't agree about what a report means and how to act on it, you'll end up in conflict. Unclear definitions, personal interpretations of what the data means, or uncertainty about how to act on data can hamper an organization's ability to make decisions. So make sure that different groups within your company aren't going in different directions based on the same data. "It's shocking how often that happens," Williams said.

5. Imprudence: Jumping To Conclusions. When you dig through data and read reports, it's not uncommon to see things that cause alarm. Companies may jump to conclusions without examining data sufficiently. They may even change their business model for the wrong reasons, such as relying on other people's conclusions, misinterpreting data, or reading an industry benchmark and deciding they need to follow a so-called best practice. "We encourage people to verify, run their own tests, and then decide if something that has become common knowledge really works," said Williams.

6. Pride: Decision-Driven Data Making. Rather than running tests and using data to confirm or deny assumptions, this Deadly Sin is where you dig through data to confirm your preconceived notions. "We see this happen a lot throughout organizations, both at the executive level and within teams," Williams said. "People try to confirm what they believe; they dig through data to find it." But the best data-driven cultures have mantras like: "Data wins arguments," he said. Let the data speak the truth.

7. Torpor: Learning And Acting Slowly. "A critical factor is how quickly you act on data, and how quickly you learn from it. This is where a lot of companies fall short," Williams said. You should interpret data methodically, of course. And you'll want to develop a process to ensure that people aren't jumping to conclusions based on the data, or using the data to confirm what they already believe. But once a decision is made, you must act on it right away.


Posted at 15:18

Big data that's worth big bucks

Courtesy of the Hindu Times

 

Huge amounts of data are being crunched to create meaningful information

 




Last week, an official business meet between the chiefs of social media giant Facebook and retail behemoth Walmart raised a few eyebrows.

While officially it was given to understand that Walmart, a retail major which lags behind others such as Amazon in online retail, was looking to enhance its social media presence, tech forums deliberated on the real purpose of the "relationship meet" - data. With over 800 million users, and needless to say, a lot of intricate and often geo-tagged personal data uploaded by them, Facebook presents a data trove like none other, and Walmart, which has been on the ball as far as technology goes, knows that. Just a few months ago, Walmart's acquisition of 'Social Calendar' - a hugely popular Facebook app that people use to track birthdays - was also, obviously, about getting access to and using data, mostly personal, to make better and more customised business decisions.

Today, companies, both at home and globally, are waking up to the value of data. The growing interest in big data has obviously to do with the fact that it is worth big bucks. Driven by the explosion of social media, the all-pervasive use of mobile networks and cloud storage, data has gotten bigger and bigger, so much so that the term 'big data' - used in tech parlance to refer to data sets that are large and tough to manage - has come to be known as one that has no prescribed upper limit.

As storage capacity, computing power and parallel processing capabilities expand, the value of data is being realised better. That is, huge amounts of data (this could be data generated within the enterprise or data on it generated online or on social media) are being crunched to create insights or meaningful information. And increasingly, this process, which used to take hours and even days, is now being done in real time. While tools such as Hadoop allowed for batch analysis of large data sets, Google's Dremel and other Open Source implementations that are developing in this ecosystem allow for ad-hoc querying of big data in real time.

Around half a decade ago, when analytics was still much in its infancy, a popular and provocative article in wired.com asked if analytics signalled the 'end of theory'. In the petabyte age, the article pondered, will scientific analysis based on hypothesis, modelling and testing be rendered obsolete? Is theory not relevant anymore?

Today, big data enthusiasts agree. An 'analyst' is more of a "tool expert", or someone proficient in using various data analytical tools, and there is a lot of demand in the market for someone who can do this well, says Rahul Kulkarni, senior product manager at Google India.

DATA CRUNCHING

More and more businesses, even in India, are looking to crunch their large data sets to see what works and what doesn't.

"And people are seeing the value in that. Earlier, people were not enthusiastic about storing data, but now they know that data contains insights that can aid crucial decision-making," he explains. Earlier, taking this data and analysing it was a two to three week cycle, but now most of this is possible in real time, and the benefits of that are immense, he says.

However, several obstacles limit their ability to turn this massive amount of unstructured data into profit, points out Mitesh Agarwal, Chief Technology Officer and Director, System Solution Consulting, Oracle India. The most prominent obstacle among them is a lack of understanding on how to add big data capabilities to the overall information architecture to build an all-pervasive big data architecture. "When big data is distilled and analysed in combination with traditional enterprise data, enterprises can develop a more thorough and insightful understanding of their business, which can lead to enhanced productivity, a stronger competitive position and greater innovation - all of which can have a significant impact on the bottom line."

Technology-wise, companies are now focussing on ways to make the analytics and query interface as simple as possible. While internally Google uses Dremel to do this for its own processes, for its clients, Google provides analytics as a service. "What we attempt to deliver is analytics interfaces that are so simple that a marketing officer can use it to pose ad-hoc queries to the data set, and be able to extract information that can be used meaningfully," Mr. Kulkarni explains.

ANALYTICS OUTSOURCED

With analytics an emerging tech field, several Indian companies, big and small, have their eyes set on it. The bigger outsourcers, such as Wipro, TCS and Infosys, are into analytics services; several other larger global companies, across segments ranging from automobile to pharmaceutical, are getting their analytics done here.

Apart from them, many smaller companies and start-ups are into analytics services, and in some sense it is a natural progression from business process outsourcing to knowledge process outsourcing to analytics, says S. Anand, Chief Data Scientist at Gramener, a data visualisation company. His company is into analytics products and specialises in the emerging tech field of data visualisation.

"During the nineties, the services model did well and the products-model in IT did not pick up. That seems to be changing, and in a field like analytics, it now appears we may have the advantage on both," he says.

Posted at 02:36

Big Data is changing the way we look at problems

Courtesy of News Observer

 

Huge groupings of information - "Big Data" - are changing the way we look at countless problems. I'm looking at an unexpected offshoot of the Wikileaks controversy, in which Julian Assange's website released documents galore from all kinds of classified resources. Programmers going to work on this suddenly public information have now extracted dates and locations from 77,000 incident reports involved in the war in Afghanistan, creating a map of the violence. The project took one night, and the remarkable thing is that based solely on the model created here, the researchers could predict ensuing military events with uncanny accuracy.

The method was tested against the events of 2010 and proved accurate even in the relatively quiet northern provinces, where data points were few. What we are seeing is like a foretaste of Isaac Asimov's "psychohistory," described in his "Foundation" novels as a way of analyzing and predicting future events through a combination of history, sociology and statistics. Big data combines our unprecedented and growing ability to store information with increases in computing power. The result: We're tackling problems that have always seemed beyond our reach with statistics and quantitative analysis, and it's even happening on our home PCs.

One early player in all this is Google. The company has already indexed something on the order of 4 percent of all the books ever printed between 1800 and 2000, and has released a database containing every word in this library. You would think a word like "television" probably didn't appear until the first sets were being developed, but Google's database (books.google.com/ngrams) can find instances of the word appearing before 1900, with sustained use beginning in the early 1920s. Play around with the site and you'll find it a source of endless fascination. You can plug in multiple words and chart their usage against each other.

Watch the Big Data trend carefully as you look for business openings. One thing that's bound to happen is the conjunction of the smartphone with ever-increasing storage and onboard camera technology. So-called "lifelogs" are much in the air among futurists. They're the result of next-generation equipment - the kind of thing we'll routinely be carrying in a few years - that records not just where you are but what you saw and what you've heard. Imagine the uses of technology like this in keeping track of your own habits, flagging the places where you're spending too much, and helping you recall places and names you might have forgotten.

Right now, Big Data is being used to produce curious and somewhat unsettling results. A Stanford professor named Jure Leskovec tracks data on Web behavior, using social networks like Facebook not to keep up with friends and family but as goldmines of statistical information. Leskovec has discovered that the right methods can predict which contacts users will add as "friends" on the site - a method that's already accurate in about half the cases he's studied. His study of messaging (using Microsoft Instant Messenger) has uncovered how widely spaced users are (six degrees of separation is just about right), with implications for making the Internet more efficient by learning how to produce the shortest path between any two computers.

 

Game-changing solutions

But if you want to take the trend to where it really gets powerful, consider that other researchers at Stanford have developed the first software simulation of an entire organism. It's only a single-cell bacterium, but modeling it involves 525 genes and the interactions of 28 categories of molecules, taking us down to the fundamental building blocks of cellular life. Computational biology takes Big Data in the direction of computerized experiments that can model and test game-changing solutions to life's worst problems: diseases like Alzheimer's and cancer.

We're only at the beginning of this trend, but when people voluntarily give up their own data - think social networks - they help to generate statistical models that everyone from law enforcement to human resources will consult to predict future behaviors. Next time you send a tweet, remember that you're adding to the data storehouse (Cornell scientists are already studying Twitter usage) and ponder how business will put Big Data to work in the future.

Paul A. Gilster is the author of several books on technology. Reach him at gilster@mindspring.com.

Posted at 02:32

Big Data's big week

Courtesy of ZD Net

 

During a single week in usually hum-drum July, a slew of companies had a Big Data beach party bash. Big Data, BI, Database and Cloud companies went crazy with partnership announcements and a new version of an in-memory database was born.

 

In just one week, the Big Data world saw several major partnership announcements that in aggregate tie together an Internet search powerhouse, two Hadoop all-stars, a decades-old database company, several Business Intelligence players and the maker of a real-time database for Hadoop.  For the kicker, an important Big Data in-memory database saw a new release.

Let's review these announcements and what they mean for the Big Data market.

 

Where's the Data?  Google it! 
Search and advertising giant Google is a counterparty in two of these deals.  The Mountain View, California-based company appears to be very serious about making its cloud platform a serious contender in Big Data.  On Tuesday, the company announced partnerships with several database and BI companies around its BigQuery cloud-based column store.

 

In one deal, Google and database veteran Pervasive Software are teaming to allow Pervasive's RushAnalyzer to provide Extract, Transform and Load (ETL) functionality for BigQuery. Google built a full RESTful API over BigQuery with the clear intent that existing Business Intelligence (BI) and ETL tools would integrate with it, and it seems to be working.

Maybe that's why open source ETL provider Talend announced a similar BigQuery partnership with Google as well. Talend's Open Studio for Big Data is an Eclipse-based graphical add-in for loading and extracting data from Hadoop by automating Hadoop Distributed File System (HDFS), Hive, Pig, HBase and Sqoop. And now Open Studio for Big Data works with BigQuery, too. The other BigQuery partnerships include deals with Informatica and SQLStream for ETL, Jaspersoft for reporting and analytics and QlikView for dashboards.

 

Don't forget Hadoop 
This week's partnerships are not all about BigQuery though. For example, Pentaho's existing partnership with Cloudera got amped up on Wednesday. Cloudera's Distribution Including Apache Hadoop (CDH) has for some time included the Sqoop import/export facility for interfacing Hadoop with SQL-based relational databases, and the Oozie component for creating and scheduling Hadoop workflows. Like many Hadoop components though, neither of these features much in the way of tooling or graphical user interface (GUI). That's where this deal comes in. Pentaho's visual design studio now works with Sqoop and Oozie, providing a point-and-click GUI against both.

Back on Tuesday, San Francisco-based Drawn to Scale announced it will be redistributing MapR's M3 distribution of Hadoop with Spire, Drawn to Scale's real-time database for Hadoop. MapR's Hadoop distro embeds an HDFS-compatible network file system, files in which are readily updateable.

And speaking of real-time Big Data, Terracotta, another San Francisco firm (that is a wholly owned subsidiary of German company Software AG), announced version 3.7 of its BigMemory product on Wednesday.  Much like SAP HANA, another in-memory database from a German company, BigMemory is completely RAM-based, and employs a scale-out architecture. Both databases have the ability to handle transactional and analytical workloads.  Version 3.7 of BigMemory brings enhanced security, powerful data compression and new search capabilities too.

 

Big Data, BigQuery, BigMemory, Big Week 
If it wasn't already clear that everyone wants a piece of the Big Data action, it should be now.  The number of announcements during a mere 2-day period this week was staggering.  Now we just have to get these companies to take some time off in August.

Posted at 13:59

Applying Big Data and Big Analytics to Customer Engagement

Courtesy of Sys-Con Media

 

Practical considerations

 

Customer engagement has long benefited from data and analytics. Knowing more about each of your customers, their attributes, preferences, behaviors and patterns, is essential to fostering meaningful engagement with them. As technologies advance, and more of people's lives are lived online, more and more data about customers is captured and made available. At face value, this is good; more data means better analytics, which means better understanding of customers and therefore more meaningful engagement. However, volumes of data measured in terabytes, petabytes, and beyond are so big they have spawned the terms "Big Data" and "Big Analytics." At this scale, there are practical considerations that must be understood to successfully reap the benefits for customer engagement. This article will explore some of these considerations and provide some suggestions on how to address them.

 

Customer Data Management (CDM), also known as Customer Data Integration (CDI), is foundational for a Customer Intelligence (CI) or Customer Engagement (CE) system. CDM is rooted in the principles of Master Data Management (MDM), which includes the following:

  • Acquisition and ingestion of multiple, disparate sources, both online and offline, of customer and prospect data
  • Change Data Capture (CDC)
  • Data cleansing, parsing, and standardization
  • Entity Modeling
  • Entity relationship and hierarchy management
  • Entity matching, identity resolution, and persistent key management for key individual, household, company/institution/location entities
  • Rules-based attribute mastering, "Survivorship" or "Build the Best Record"
  • Data lineage, version history, audit, aging, and expiration
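As a rough illustration of the rules-based "survivorship" item in the list above, the sketch below builds a best record from several source records. The source names, field names, and the rule itself (prefer the most trusted source, then the most recent update) are assumptions made up for this example, not a prescribed MDM rule set.

```python
from datetime import date

# Hypothetical trust ranking for source systems: lower number wins.
SOURCE_PRIORITY = {"crm": 0, "billing": 1, "web_form": 2}


def build_best_record(records):
    """Pick, per attribute, the first non-empty value from the most
    trusted (and, within a source, most recently updated) record."""
    # Sort so the preferred record comes first: by trust, then newest first.
    ranked = sorted(
        records,
        key=lambda r: (SOURCE_PRIORITY[r["source"]], -r["updated"].toordinal()),
    )
    best = {}
    for record in ranked:
        for field, value in record["attributes"].items():
            # Empty values never survive; earlier (preferred) values are kept.
            if value and field not in best:
                best[field] = value
    return best


records = [
    {"source": "web_form", "updated": date(2012, 7, 1),
     "attributes": {"email": "pat@example.com", "phone": "555-0100"}},
    {"source": "crm", "updated": date(2012, 3, 15),
     "attributes": {"email": "pat.lee@example.com", "phone": ""}},
]

best = build_best_record(records)
print(best)
```

Here the email survives from the more trusted source, while the phone number falls back to the web form because the trusted source left it empty.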

It's useful to first make the distinction between attributive and behavioral data. Attributive data, often referred to as profile data, consists of discrete fields that describe an entity, such as an individual's name, address, age, eye color, and income. Behavioral data is a series of events that describe an entity's behavior over time, such as phone calls, web page visits, and financial transactions. Admittedly, the line between the two is blurry; a customer's current account balance can be treated either as an attribute or as an aggregation of behavioral transactions.
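To make the distinction concrete, here is a minimal sketch in which a profile holds attributive fields, transactions are behavioral events, and the account balance is derived by aggregating those events. The class and field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Profile:
    """Attributive (profile) data: discrete fields describing an entity."""
    customer_id: str
    name: str
    age: int


@dataclass
class Transaction:
    """Behavioral data: one timestamped event in an entity's history."""
    customer_id: str
    occurred_at: datetime
    amount: float  # positive = deposit, negative = withdrawal


def current_balance(transactions, customer_id):
    """Derive an attribute-like value by aggregating behavioral events."""
    return sum(t.amount for t in transactions if t.customer_id == customer_id)


profile = Profile("c-1", "Pat Lee", 42)
events = [
    Transaction("c-1", datetime(2012, 7, 1, 9, 0), 100.0),
    Transaction("c-1", datetime(2012, 7, 2, 14, 30), -30.0),
    Transaction("c-2", datetime(2012, 7, 2, 15, 0), 500.0),
]

balance = current_balance(events, profile.customer_id)
print(balance)
```

Whether the balance is stored on the profile or recomputed from events is a design choice; the sketch only shows that the same number can live on either side of the line.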

MDM typically focuses on attributive data, and because CDM is based on MDM, the same is true for CDM. Personally Identifying Information (PII) such as name, email, address, phone, and username is the primary driver behind identity resolution. Other attributes, such as income, number of children, or gender, are commonly "mastered" for each of the resolved entities (individual, household, company).

Enter Big Data. As more devices are developed - and adopted - that capture and store data, huge quantities of data are generated. Big Data, by definition, is almost always event-oriented and temporal, and the subset of Big Data that is relevant to a CE system is almost always behavioral in nature (clicks, calls, downloads, purchases, emails, texts, tweets, Facebook posts). Behavioral data is critical to understanding customers (and prospects). And, understanding customers is critical for establishing meaningful and welcome engagement with them. Therefore, Big Data is, or should be, viewed as an invaluable asset to any CE system.

Further, this sort of rich, temporal behavioral data is ripe for analytics. In fact, the term Big Analytics has emerged as a result. Big Analytics can be defined as the ability to execute analytics on Big Data. However, there are some real challenges involved in executing analytics on Big Data, challenges that drive the need for specialized technologies such as Hadoop or Netezza (or both). These technologies must support Massively Parallel Processing (MPP) and, just as importantly if not more so, they must bring the analytics to the data instead of bringing the data to the analytics. Having recently completed a course for Hadoop developers (an excellent course that I highly recommend), I have a heightened appreciation for the challenges related to managing and analyzing data "at scale" and the need for specialized technologies that support Big Data and Big Analytics.

 

A few significant points regarding Big Analytics should be considered:

  1. Big Analytics allows models to be built on an entire data set, rather than just a sample or an aggregation. My colleague, Jack McCush, explains: "When building models on a small subset and then validating them against a larger set to make sure the assumptions hold, you can miss the ability to predict rare events. And often those rare events are the ones that drive profit."
  2. Big Analytics allows non-traditional models to be built, for example, social graphs and influencer analytics. Several useful and inherently big sources of data, such as Call Detail Records (CDRs) generated from mobile/smart phones and web clickstream data, lend themselves well to these models.
  3. Big Analytics can take even traditional analytics to the next level. Big Analytics allows the execution of traditional correlation and clustering models in a fraction of the time, even with billions of records and hundreds of variables. As Revolution Analytics points out in Advanced 'Big Data' Analytics with R and Hadoop, "Research suggests that a simple algorithm with a large volume of data is more accurate than a sophisticated algorithm with little data. The algorithm is not the competitive advantage; the ability to apply it to huge amounts of data - without compromising performance - generates the competitive advantage."
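The first point, about samples missing rare events, can be quantified with a small calculation. The sketch below computes the chance that a simple random sample drawn without replacement contains none of the rare records; the record counts are invented for illustration and do not come from the article.

```python
def prob_sample_misses_all(total, rare, sample_size):
    """Probability that a simple random sample (without replacement) of
    `sample_size` records contains none of the `rare` records in `total`.

    Uses the hypergeometric zero-count probability, written as a product
    over the rare records: each one must land outside the sample.
    """
    p = 1.0
    for j in range(rare):
        p *= (total - sample_size - j) / (total - j)
    return p


# Hypothetical numbers: 5 rare events in 1,000,000 records, 1% sample.
p_miss = prob_sample_misses_all(total=1_000_000, rare=5, sample_size=10_000)
print(f"{p_miss:.3f}")
```

With these made-up numbers, a 1% sample contains none of the five rare records roughly 95% of the time, which is the kind of blind spot a model built on the full data set avoids.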

 

Big Data is great for a CE system. It paints a rich behavioral picture of customers and prospects and takes CE-enabling analytics to the next level. But what happens when this massive behavioral data is thrown at a CDM/MDM system that is optimized for attributive data? A "basketball through the garden hose" effect might occur. But this doesn't have to happen; there are ways to gracefully extend CDM to manage Big Data.

The key is data classification. Attributive, or profile, data is classified separately from behavioral data. While both contain a source natural key (e.g., cookie-based visitor ID, cell phone number, device ID, account number), attributive data is always structured. Behavioral data, on the other hand, can be structured or unstructured and contains no PII. Big Data almost always falls under the behavioral category.

Importantly, behavioral data requires different processing than attributive data. Since the processing is different, the two streams can be separated just after ingestion, like a fork in the road, with the attributive data going one way and the behavioral data going the other. This is the key to integrating Big Data into a CDM-MDM system without grinding it to a halt. To be fair, the two streams aren't completely independent. The behavioral stream will typically require two things from the attributive stream: Dimension Tables and Master ID-to-Natural Key Cross-References - both of which can be considered as reference data.
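The "fork in the road" can be sketched as a simple router applied right after ingestion. The classification rule used here (a record is attributive if it carries any profile field) and all field names are assumptions for the sake of the example; a real system would likely classify by source or schema.

```python
# Hypothetical set of profile fields used to classify incoming records.
PROFILE_FIELDS = {"name", "email", "address", "phone"}


def is_attributive(record):
    """Treat a record as attributive if it carries any profile field."""
    return not PROFILE_FIELDS.isdisjoint(record)


def fork(records):
    """Split ingested records into attributive and behavioral streams."""
    attributive, behavioral = [], []
    for record in records:
        target = attributive if is_attributive(record) else behavioral
        target.append(record)
    return attributive, behavioral


ingested = [
    {"source_natural_key": "acct-17", "name": "Pat Lee", "email": "pat@example.com"},
    {"source_natural_key": "acct-17", "event": "purchase", "amount": 19.99},
    {"source_natural_key": "cookie-abc", "event": "click", "url": "/pricing"},
]

attributive, behavioral = fork(ingested)
print(len(attributive), len(behavioral))
```

Both streams keep the source natural key, so records that part ways here can still be related later.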

Dimension Tables
For example, the "subscriber" dimension table may be required in the Big Data world so that it can be joined to the "web clicks" table. This is done in order to aggregate web clicks by subscriber gender, which only exists in the subscriber table.
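A minimal sketch of that join, using plain Python structures in place of real tables. The table contents and column names are hypothetical; at Big Data scale the same join would run inside the processing platform rather than in a single Python process.

```python
from collections import Counter

# Hypothetical "subscriber" dimension table, keyed by subscriber id.
subscribers = {
    "s-1": {"gender": "F"},
    "s-2": {"gender": "M"},
    "s-3": {"gender": "F"},
}

# Hypothetical "web clicks" fact table: one row per click event.
web_clicks = [
    {"subscriber_id": "s-1", "url": "/home"},
    {"subscriber_id": "s-2", "url": "/home"},
    {"subscriber_id": "s-1", "url": "/plans"},
    {"subscriber_id": "s-3", "url": "/support"},
    {"subscriber_id": "s-9", "url": "/home"},  # no matching subscriber
]


def clicks_by_gender(clicks, subscriber_dim):
    """Join clicks to the subscriber dimension and count clicks per gender."""
    counts = Counter()
    for click in clicks:
        subscriber = subscriber_dim.get(click["subscriber_id"])
        gender = subscriber["gender"] if subscriber else "unknown"
        counts[gender] += 1
    return counts


gender_counts = clicks_by_gender(web_clicks, subscribers)
print(dict(gender_counts))
```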

Master ID-to-Natural Key Cross-References
Master IDs are created and managed in the CDM-MDM world, but they are often needed for linkage and aggregation in the Big Data world. Shadowing cross-references that map master IDs, such as master individual id, to "source natural keys" into the Big Data world solves this problem.
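The shadowed cross-reference can be pictured as a lookup from a source natural key to a master ID. In the sketch below, the source names, keys, and master IDs are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical cross-reference shadowed from the CDM-MDM system:
# (source system, source natural key) -> master individual id.
xref = {
    ("web", "cookie-abc"): "M-001",
    ("mobile", "+15550100"): "M-001",
    ("web", "cookie-xyz"): "M-002",
}

events = [
    {"source": "web", "natural_key": "cookie-abc", "event": "click"},
    {"source": "mobile", "natural_key": "+15550100", "event": "call"},
    {"source": "web", "natural_key": "cookie-xyz", "event": "click"},
    {"source": "web", "natural_key": "cookie-new", "event": "click"},  # unresolved
]


def events_per_master_id(events, xref):
    """Aggregate behavioral events by master ID via the cross-reference."""
    totals = defaultdict(int)
    for event in events:
        master_id = xref.get((event["source"], event["natural_key"]))
        if master_id is not None:  # skip keys the MDM side has not resolved
            totals[master_id] += 1
    return dict(totals)


per_master = events_per_master_id(events, xref)
print(per_master)
```

Because the web cookie and the mobile number map to the same master ID, activity from both channels rolls up to one individual.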

The two classifications of data are separated into two streams and processed (mostly) independently. How do they come back together? One way this architecture works is that both streams, attributive and behavioral, contain a "source natural key." This is a unique identifier that relates the two streams. For example, web clickstream data typically has an IP address or a web application-managed, cookie-based visitor ID. Transactional data typically has an account number. Mobile data will have a phone number or device ID. These identifiers don't have to mean anything, per se, but are critical for stitching the two streams back together.

It's not just the dimensionalized, aggregated data that is reunited with the profile data, but also the high-value, behavioral analytics attributes (predictive scores, micro-segmentations, etc.) created courtesy of Big Analytics. The attributive data is now greatly enriched by the output of the Big Data processing stream. And, to get things really crazy, these enriched behavioral analytics profile attributes can be used as part of the next cycle of matching; similar, complex behavior patterns can help tip the scales, causing two entities to match that might not have matched otherwise. In the end, CDM-MDM and Big Data can live together harmoniously; Big Data doesn't replace CDM-MDM, but rather extends it.
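The reunion of the two streams might look like the following sketch, where analytics attributes produced on the behavioral side are merged into profiles by source natural key. The attribute names (a churn score and a segment label) are hypothetical examples of Big Analytics output.

```python
def enrich_profiles(profiles, analytics_by_key):
    """Attach behavioral analytics attributes to each profile, matching on
    the source natural key. Profiles without analytics pass through as-is."""
    enriched = []
    for profile in profiles:
        extra = analytics_by_key.get(profile["source_natural_key"], {})
        enriched.append({**profile, **extra})
    return enriched


profiles = [
    {"source_natural_key": "acct-17", "name": "Pat Lee"},
    {"source_natural_key": "acct-42", "name": "Sam Roy"},
]

# Hypothetical output of the behavioral stream, keyed by source natural key.
analytics_by_key = {
    "acct-17": {"churn_score": 0.82, "segment": "frequent-mobile"},
}

enriched = enrich_profiles(profiles, analytics_by_key)
print(enriched)
```

The enriched attributes could then feed the next matching cycle, as the paragraph above describes.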

Posted at 11:06