We practitioners of the technological arts have a tendency to use specialized jargon. That’s not unusual. Most guilds, priesthoods, and professions have had their own style of communication, either for convenience or to establish a sense of exclusivity. In technology, we also tend to attach very simple buzzwords to very complex topics, and then expect the rest of the world to go along for the ride.
Take, for example, the tag team of “cloud” and “big data.” The term “cloud” came about because we systems engineers used to draw network diagrams of local area networks. Between the LANs, we’d draw a cloud-like jumble meant to refer to, pretty much, “the undefined stuff in between.” Of course, the Internet became the ultimate undefined stuff in between, and the cloud became The Cloud.
To Mom and Dad and Janice in Accounting, “The Cloud” means the place where you store your photos and other stuff. Many people don’t really know that “cloud” is shorthand, or that the reality behind it is the growth of almost unimaginably huge data centers holding vast quantities of information.
Big data is another one of those shorthand terms, but this is one that Janice in Accounting and Jack in Marketing and Bob on the board really do need to understand. Not only can big data answer big questions and open new doors to opportunity, but your competitors are using big data for their own competitive advantage.
That, of course, raises the question: what is big data? The answer, like most in tech, depends on your perspective. Here’s a good way to think of it. Big data is data that’s too big for traditional data management to handle. Big, of course, is also subjective. That’s why we’ll describe it according to three vectors: volume, velocity, and variety, also known as the three Vs.
Volume is the V most associated with big data because, well, volume can be big. What we’re talking about here is quantities of data that reach almost incomprehensible proportions.
Facebook, for example, stores photographs. That statement doesn’t begin to boggle the mind until you start to realize that Facebook has more users than China has people. Each of those users has stored a whole lot of photographs. Facebook is storing roughly 250 billion images.
Can you imagine? Seriously. Go ahead. Try to wrap your head around 250 billion images.
So, in the world of big data, when we start talking about volume, we’re talking about insanely large amounts of data. As we move forward, we’re going to have more and more huge collections. For example, as we add connected sensors to pretty much everything, all that telemetry data will add up.
Or, consider our new world of connected apps. Everyone is carrying a smartphone. Let’s look at a simple example, a to-do list app. More and more vendors are managing app data in the cloud, so users can access their to-do lists across devices. Since many apps use a freemium model, where a free version is used as a loss-leader for a premium version, SaaS-based app vendors tend to have a lot of data to store.
Todoist (the to-do manager I use), for example, has roughly 10 million active installs, according to Google Play. That’s not counting all the installs on the Web and iOS. Each of those users has lists of items, and all that data needs to be stored. Todoist is certainly not Facebook scale, but it still stores vastly more data than almost any application did even a decade ago.
Then, of course, there are all the internal enterprise collections of data, ranging from the energy industry to healthcare to national security. All of these industries are generating and capturing vast amounts of data.
That’s the volume vector.
Remember our Facebook example? 250 billion images may seem like a lot. But if you want your mind blown, consider this: Facebook users upload more than 900 million photos a day. A day. At that rate, the collection grows by roughly 27 billion photos a month, so that 250 billion number from last year will be left far behind within a few months.
Velocity is the measure of how fast the data is coming in. Facebook has to handle a tsunami of photographs every day. It has to ingest them all, process them, file them, and somehow, later, be able to retrieve them.
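To get a feel for what that velocity means, here is a back-of-the-envelope sketch in Python. It uses only the two figures quoted above (250 billion stored photos, 900 million uploads a day); the function and variable names are made up for illustration, and the result assumes, unrealistically, that the upload rate stays flat.

```python
# Figures quoted in the article; everything else here is illustrative.
STORED_PHOTOS = 250_000_000_000   # photos already stored
UPLOADS_PER_DAY = 900_000_000     # new photos arriving each day
SECONDS_PER_DAY = 24 * 60 * 60


def uploads_per_second(per_day):
    """Average arrival rate, assuming uploads are spread evenly over the day."""
    return per_day / SECONDS_PER_DAY


def days_to_add(total, per_day):
    """Days needed to upload `total` photos at a constant daily rate."""
    return total / per_day


per_second = uploads_per_second(UPLOADS_PER_DAY)
days_to_double = days_to_add(STORED_PHOTOS, UPLOADS_PER_DAY)

print(f"{per_second:,.0f} photos per second on average")
print(f"{days_to_double:,.0f} days to add another 250 billion photos")
```

In other words, the system has to absorb more than ten thousand new photos every second, around the clock, and the whole existing collection would be matched by new uploads in well under a year.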
Here’s another example. Let’s say you’re running a presidential campaign and you want to know how the folks “out there” are feeling about your candidate right now. How would you do it? One way would be to license some Twitter data from Gnip (recently acquired by Twitter) to grab a constant stream of tweets, and subject them to sentiment analysis.
That feed of Twitter data is often called “the firehose” because so much data (in the form of tweets) is being produced, it feels like being at the business end of a firehose.
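As a rough illustration of what "subjecting tweets to sentiment analysis" can mean, here is a deliberately tiny Python sketch. It is not how Gnip, Twitter, or any real campaign does it: the word lists, the sample tweets, and the function names are all invented for this example, and real sentiment analysis uses far richer language models. The point is only that each message is scored as it arrives and folded into a running average, rather than being stored first and analyzed later.

```python
from collections import deque

# Toy word lists, invented for this example.
POSITIVE = {"great", "love", "win", "strong", "hope"}
NEGATIVE = {"bad", "hate", "lose", "weak", "fear"}


def score(text):
    """Count positive words minus negative words in one message."""
    words = [w.strip(".,!?#@\"'").lower() for w in text.split()]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def rolling_sentiment(stream, window=3):
    """Yield the average score of the most recent `window` messages."""
    recent = deque(maxlen=window)  # old scores fall off automatically
    for text in stream:
        recent.append(score(text))
        yield sum(recent) / len(recent)


# Made-up sample messages standing in for the firehose.
sample = [
    "Love the speech, great night!",
    "Weak answer. Bad debate.",
    "I hope we win",
    "meh",
]
averages = list(rolling_sentiment(sample))
for text, avg in zip(sample, averages):
    print(f"{avg:+.2f}  {text}")
```

Because the function is a generator, it never needs the whole stream in memory, which is the property that matters when the data never stops coming.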
Here’s another velocity example: packet analysis for cybersecurity. The Internet sends a vast amount of information across the world every second. For an enterprise IT team, a portion of that flood has to travel through firewalls into a corporate network.
Unfortunately, due to the rise in cyberattacks, cybercrime, and cyberespionage, sinister payloads can be hidden in that flow of data passing through the firewall. To prevent compromise, that flow of data has to be investigated and analyzed for anomalies, patterns of behavior that are red flags. This is getting harder as more and more data is protected using encryption. At the very same time, bad guys are hiding their malware payloads inside encrypted packets.
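To make "analyzed for anomalies" a little more concrete, here is a minimal Python sketch of one simple approach: compare each new traffic measurement against the recent past and flag anything wildly out of line. This is only an illustration under stated assumptions. The traffic numbers are invented, the window size and threshold are arbitrary, and real security tools look at far more than raw volume.

```python
from collections import deque
from statistics import mean, pstdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples that deviate sharply from the recent past.

    Each sample is compared with the mean and standard deviation of the
    previous `window` samples; a gap of more than `threshold` standard
    deviations is treated as a red flag.
    """
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(i)
        history.append(value)
    return flagged


# Invented measurements, e.g. kilobytes per second through a firewall.
traffic = [100, 102, 98, 101, 99, 100, 950, 101, 99]
suspicious = find_anomalies(traffic)
print("Suspicious samples at positions:", suspicious)
```

Note that a check like this only sees how much data is flowing, not what is inside it, which is one reason encrypted traffic makes the job harder.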
Or take sensor data. The more the Internet of Things takes off, the more connected sensors will be out in the world, transmitting tiny bits of data at a near constant rate. As the number of units increases, so does the flow.
That flow of data is the velocity vector.
You may have noticed that I’ve talked about photographs, sensor data, tweets, encrypted packets, and so on. Each of these is very different from the others. This data isn’t the old rows and columns and database joins of our forefathers. It’s very different from application to application, and much of it is unstructured. That means it doesn’t easily fit into fields on a spreadsheet or a database application.
Take, for example, email messages. A legal discovery process might require sifting through thousands to millions of email messages in a collection. Not one of those messages is going to be exactly like another. Each one will consist of a sender’s email address, a destination, plus a time stamp. Each message will have human-written text and possibly attachments.
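The split between the tidy parts and the messy parts of an email is easy to see in code. The Python sketch below uses the standard library's email module to pull out the sender, destination, and time stamp, which fit neatly into database fields, and then hands back the body, which does not. The message itself and the addresses in it are made up for the example.

```python
from email import message_from_string

# A made-up message; the addresses and text are purely illustrative.
RAW = """From: alice@example.com
To: bob@example.com
Date: Mon, 02 Mar 2015 09:30:00 -0500
Subject: Contract draft

Bob, the revised draft is attached. Call me if clause 4 looks off.
"""


def split_message(raw):
    """Separate the field-like headers from the free-form body."""
    msg = message_from_string(raw)
    structured = {
        "from": msg["From"],
        "to": msg["To"],
        "date": msg["Date"],
    }
    unstructured = msg.get_payload()  # plain human-written text
    return structured, unstructured


fields, body = split_message(RAW)
print(fields)
print(body)
```

The headers could go straight into three columns of a table. Figuring out what the body actually says, across millions of messages, is where the variety problem begins.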
Photos and videos and audio recordings and email messages and documents and books and presentations and tweets and ECG strips are all data, but they’re generally unstructured, and incredibly varied.
All that data diversity makes up the variety vector of big data.
It would take a library of books to describe all the various methods that big data practitioners use to process the three Vs. For now, though, your big takeaway should be this: once you start talking about data in terms that go beyond basic buckets, once you start talking about epic quantities, insane flow, and wide assortment, you’re talking about big data.
One final thought: there are now ways to sift through all that insanity and glean insights that can be applied to solving problems, discerning patterns, and identifying opportunities. That process is called analytics, and it’s why, when you hear big data discussed, you often hear the term analytics used in the same sentence.
The three Vs describe the data to be analyzed. Analytics is the process of deriving value from that data. Taken together, they offer the potential for amazing insight or worrisome oversight. Like every other great power, big data comes with great promise and great responsibility.