Data collection is at an all-time high, and it will continue to grow as more people, willingly and perhaps unintentionally, offer up their own data via social media and other outlets.
You may ask yourself, “What do you mean, unintentionally?” Consider this: “We know where you are. We know where you've been. We can more or less know what you're thinking about.” – Eric Schmidt, former CEO of Google. Every search you run gives a little more data to be analyzed by the people who own it, and it appears most people do not even realize this. For example, I just opened my Facebook account and checked my profile. I gave them all the basic data, including political leanings and a bunch of “things” I like. Heck, if I just count the music I put in Facebook, there are over 160 albums listed. Think about that for just one second. Through my profile alone, I have freely given Facebook enough data about myself to build a fairly complete picture of me. Wow, I just got a little scared. I will come back to this profile stuff later, but that is what I mean by unintentionally giving data away for free.
The other thing you may have noticed so far is that I talk about “data” and “information” as two completely different things. The word “data” to me represents raw elements: numbers, words, counts, strings, pictures, video, music, and anything else that records something. Data in most cases is uncategorized, unfiltered, and unrelated; it may be held in a well-structured database, but the data itself may not have any real relationships except for the ones imposed on it so that it can be stored logically.
The word “information” means analysis has been done to the “data” and it has been transformed from raw parts into a useful set of results. Think about data as a raw set of links in a data chain: if those links are not put together or there are weak links, then the chain created will be weak as well. There are many examples throughout history where data was assembled into an information chain and the linkage was not correct. However, at the time, the available data appeared to be linked correctly and assumptions were made based on the information that the data generated.
One of the biggest mistakes in linking data to generate information concerned the causes of poliomyelitis (polio) in the early 1940s. The raw data suggested that incidents of polio increased during the warm summer months. Coincidentally, the raw data also showed that ice cream sales increased at almost the same rate as the polio incidents. You have to remember how feared the disease was: in 1952 alone, nearly 58,000 U.S. citizens (many of them children) were diagnosed with polio, and it was considered an epidemic. The public health experts at the time recommended avoiding ice cream, sugary drinks, and other high-sugar foods, and proclaimed that this polio diet would save people. As many scholars have written, the thought that ice cream caused the disease seems funny now, but at the time people would try anything to avoid being afflicted.
This brings us back to today. How do you turn massive amounts of data into usable, actionable information that enables better decisions? We have already established that both data-gathering techniques and the data itself are growing exponentially and that we have more data than ever before, but we still do not have the secret sauce for generating great information. Take, for example, two common tables in a typical organization: a Customer table and a Sales table.
I would imagine these example tables are well known to anybody reading this article. Yes, there could be more columns, and the data actually being collected is most likely much richer than the examples suggest. The point really comes down to what questions you can ask of these two datasets. Again, I'm sure that many readers can develop the classic set of questions from these tables, but most likely they are not asking the strategic questions that need to be asked.
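For illustration, the two tables and one of the “classic” questions could be sketched as follows. This is only a minimal sketch: the column names, the values, and the helper functions are all hypothetical and are not taken from any real schema.

```python
from collections import defaultdict

# Hypothetical Customer table: one row per customer (made-up columns and values).
customers = [
    {"customer_id": 1, "name": "Ann", "zip_code": "11111"},
    {"customer_id": 2, "name": "Bob", "zip_code": "22222"},
    {"customer_id": 3, "name": "Cho", "zip_code": "11111"},
]

# Hypothetical Sales table: one row per sale, linked to a customer by customer_id.
sales = [
    {"sale_id": 100, "customer_id": 1, "amount": 25.00},
    {"sale_id": 101, "customer_id": 1, "amount": 40.00},
    {"sale_id": 102, "customer_id": 2, "amount": 15.50},
    {"sale_id": 103, "customer_id": 3, "amount": 60.00},
]


def total_sales_by_customer(sales_rows):
    """Classic question: how much has each customer spent?"""
    totals = defaultdict(float)
    for row in sales_rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


def top_customer(customer_rows, sales_rows):
    """Classic question: who is our biggest customer, and what did they spend?"""
    totals = total_sales_by_customer(sales_rows)
    names = {c["customer_id"]: c["name"] for c in customer_rows}
    best_id = max(totals, key=totals.get)
    return names[best_id], totals[best_id]


if __name__ == "__main__":
    print(total_sales_by_customer(sales))
    print(top_customer(customers, sales))
```

Notice that everything these functions can answer is already contained in the two tables. Nothing here says who the customers are or why they buy, which is exactly the limitation discussed next.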
They look at a Customer table and a Sales table and the classic questions arise, but in reality, they need to ask questions that the data does not support. You may be asking yourself, “How the heck do I ask questions of data if I don't have the data to answer them?” This is where the concept of Data Augmentation enters the picture. Most likely, the data exists in some form or fashion, but where do you find additional datasets that are not part of the standard data your organization collects?
There are established services, and new services coming online, that provide data to fill the gaps in traditional data collection. The real issue is not finding the missing data, but figuring out what data you are missing in the first place. That is why the “questions” being asked of the data have to include questions that your existing data cannot answer. Again, think about the classic example above. A set of questions that a company may want to ask could include: “What is a typical customer?”, “Where should I build my next store?”, or maybe “Will the new product line sell?” Wow, if a store had that information, they would really have something useful. In the example, can the simple data we have support those questions? The obvious answer is no.
Now we have questions the data cannot support, so what data do we need to answer those questions? Or in today's terms, how can I augment my data to answer these questions?
By augmenting your data with GIS and Census data, you can get a better picture of your customer base. Ask the question, “Who is my typical customer?” and the data can respond with the demographics of the neighborhood, the median income, and so on. What about “Where should I build my next store?” Now that I know what my typical customer looks like, I can compare that location and demographic information across all the store locations and find a new location that has a great opportunity to be successful. The last question is a little harder: “Should we add a new product line in our stores?” The data may not yet support an answer, but through Data Augmentation the question becomes “What data do I need to answer that?” This leads back to the supplier: do they have an idea of their typical customer? If the answer is yes, that profile can be compared to your typical customer to see if the two match. The comparison can give the business a leg up on the competition and a better chance of success with a new product line, whereas in the past the decision was merely a guess, supported by nothing other than the classic “gut” feeling.
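The augmentation idea above can be sketched in a few lines. This is a rough illustration under loud assumptions: `census_by_zip` is a made-up stand-in for Census-style demographics, the customer rows and store locations are hypothetical, and the simple averaging and distance scoring are just one naive way to compare locations, not a recommended method.

```python
from statistics import mean

# Hypothetical internal data: customers with the zip code they live in.
customers = [
    {"customer_id": 1, "zip_code": "11111"},
    {"customer_id": 2, "zip_code": "22222"},
    {"customer_id": 3, "zip_code": "11111"},
]

# Hypothetical external dataset (invented values), keyed by zip code.
census_by_zip = {
    "11111": {"median_income": 52000, "median_age": 34},
    "22222": {"median_income": 78000, "median_age": 45},
    "33333": {"median_income": 55000, "median_age": 36},
    "44444": {"median_income": 95000, "median_age": 51},
}


def augment(customer_rows, census):
    """Attach neighborhood demographics to each customer (a left join on zip_code)."""
    return [{**c, **census.get(c["zip_code"], {})} for c in customer_rows]


def typical_customer(augmented_rows):
    """Naive 'typical customer': average the neighborhood demographics."""
    rows = [r for r in augmented_rows if "median_income" in r]
    return {
        "median_income": mean(r["median_income"] for r in rows),
        "median_age": mean(r["median_age"] for r in rows),
    }


def rank_candidate_zips(profile, census, existing_store_zips):
    """Rank zip codes without a store by how closely they resemble the profile."""
    def distance(demo):
        # Sum of relative differences, so income and age count on the same scale.
        return sum(abs(demo[k] - profile[k]) / profile[k] for k in profile)

    candidates = {z: d for z, d in census.items() if z not in existing_store_zips}
    return sorted(candidates, key=lambda z: distance(candidates[z]))


if __name__ == "__main__":
    profile = typical_customer(augment(customers, census_by_zip))
    print(profile)
    print(rank_candidate_zips(profile, census_by_zip, {"11111", "22222"}))
```

The join itself is trivial; the hard part is realizing that zip code is the key that connects your own tables to an outside dataset, and that the outside dataset is the one that can answer the question.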
The data is out there. It is being collected, it is being analyzed, and it is being summarized. However, before that data can be used, the most important part of the equation is asking the questions that your data cannot answer. This is not easy, and there is no magic wand or even a special line of questions you should be asking. Every dataset is limited in its own way by how it was collected.
So, bringing the article full circle, Facebook has a profile on me outlining the music I listen to, the concerts I go to, and the average ticket price I pay. Do I have disposable income to spend on t-shirts and hats at the concert? Google most likely knows more about me through my searches than I know about myself. Remember that all this collected data is for sale! Are you willing to pay for it? Will the information generated from the data yield results, or will it lead to the classic Correlation vs. Causation situation, as it did with polio? Therein lies the crux of the issue, and why it is so hard to reliably turn raw data into actionable information.