How do we model when we lack data for assumptions?
We have entered the age of big data. It is impossible to search for predictive analytics without being bombarded with news on big data. I think big data as proposed is exciting and is ushering in a whole new category of analytics, but I find the importance of earnest predictive modeling may get lost in the tidal wave of exploratory approaches to big data.
Big Data will eventually become table stakes for all predictive models, but even then companies will find their biggest data may still be missing some of the data needed to complete predictive models.
One of the questions that will continue to nag business managers is what role data plays in making decisions and building predictive models. I would like to propose some considerations when tackling this question.
Exploratory statistics is not predictive modeling
First, it is worth noting that exploratory statistics is not predictive modeling; it is a step in predictive modeling.
While I know big data is being used in predictive models, the most common application I see now is exploratory data visualization of big data sets.
The rapid rise of Tableau is a great example of how data visualization or exploration is the easiest first step in analysis. The ability to visualize huge data sets is exciting and can identify some unanticipated relationships, but data visualization, like exploratory statistics, is only the first step in developing a holistic model.
Without direction, data visualization can be endlessly interesting and ultimately unproductive. So data sets (including the big type) are quite valuable in informing potential hypotheses for a model, but are not a prerequisite for building a model.
Start with the Question, not the Data
Second, all meaningful models I have worked with are focused on answering a question. The modeling process starts with identifying the relevant question to be answered. This is important because every complete model will produce an answer, but there is nothing worse than getting to the end of a complex modeling process and having an answer to a question you do not care about. You do not need data to pose the question. In fact, be careful that you do not allow the wrong question to be defined by the data. While exploratory analytics can lead to questions, typically the questions come from strategic direction.
Two types of data: not all data is quantitative
As the world becomes increasingly measurable, there is a danger of ignoring unquantified or qualitative data. While measurable data is instrumental in accounting for variance in predictive models, the reality is that much of the variance in the best models is still explained by qualitative insight.
Jay Forrester, as part of his writing on System Dynamics, suggests that qualitative data, which is held in the minds of area experts, is one of the most important sources of information for the modeler. Ultimately it is “mental databases” that contain some of the most important information.
So, if you are light on data, can you still start building predictive models? Yes, of course. You just need to find domain experts (or be the domain expert) to help understand the basic dynamics of a relationship. The fine-tuning with data can and will come later.
Finally, models are hypotheses: you can start with no data
All predictive models move from an outcome of interest to a hypothetical set of relationships or variables that impact that outcome. In many cases this hypothetical mapping is, and should be, done separately from the underlying data. Certainly, data can be used to uncover unexpected relationships and is critical in confirming a hypothesis, but it is not a requirement for developing an initial hypothesis.
Key questions in modeling and where the data matters
In an effort to help explore where the data matters in developing decision models, I propose a series of questions on whether data is critical.
Key Modeling Questions that do not require data
- What decisions do you have to make, or where does uncertainty exist?
- What hypotheses do you or others have regarding these decisions?
- What are the underlying assumptions of these hypotheses?
- What information do you have for these assumptions?
- Based on your hypothetical model, which assumptions are most important?
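
To illustrate the last question in this list, here is a minimal sketch of ranking assumptions by importance before any data is collected. Everything in it is an assumption made up for illustration: the `profit` model, the names `visitors`, `conversion_rate`, and `price`, and the low/base/high figures, which stand in for estimates a domain expert might give. It varies one assumption at a time across its expert range while holding the others at their base value, then ranks the assumptions by how much the outcome swings.

```python
# Hypothetical expert estimates as (low, base, high).
# None of these numbers come from data; they are placeholders for illustration.
ASSUMPTIONS = {
    "visitors": (8000, 10000, 12000),
    "conversion_rate": (0.01, 0.02, 0.04),
    "price": (45.0, 50.0, 55.0),
}


def profit(visitors, conversion_rate, price, fixed_cost=5000.0):
    """A made-up hypothetical model: revenue from converted visitors minus a fixed cost."""
    return visitors * conversion_rate * price - fixed_cost


def rank_assumptions(model, assumptions):
    """One-at-a-time sensitivity: swing in the outcome across each assumption's range."""
    base = {name: estimate[1] for name, estimate in assumptions.items()}
    swings = {}
    for name, (low, _, high) in assumptions.items():
        # Move only this assumption; keep every other one at its base value.
        low_outcome = model(**{**base, name: low})
        high_outcome = model(**{**base, name: high})
        swings[name] = abs(high_outcome - low_outcome)
    # Largest swing first: these are the assumptions most worth gathering data on.
    return sorted(swings.items(), key=lambda item: item[1], reverse=True)


ranking = rank_assumptions(profit, ASSUMPTIONS)
for name, swing in ranking:
    print(f"{name}: swing in profit = {swing:.0f}")
```

With these placeholder ranges, the conversion rate produces the largest swing, so that is where I would look for data or a quick experiment first. A one-at-a-time approach ignores interactions between assumptions, so treat the ranking as a starting point rather than a verdict.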
Key Modeling Questions that require data
- How does your model compare to historical experience?
- If lacking data, are there quick experiments you can do to collect data?
- Do you have a system for collecting ongoing data for drivers in your model?
- How accurate is your model for prediction? (Confirmatory Analysis)
- Are there important relationships not yet discovered? (Exploratory Analysis)
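
Once some historical data exists, the comparison and accuracy questions above can be checked with a simple error measure. The sketch below uses mean absolute percentage error, which is one common choice and not the only one. The function name and the `historical` and `predicted` values are hypothetical placeholders, not real results.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|; actual values must be nonzero."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and the same length")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Placeholder values for illustration only.
historical = [100.0, 200.0, 400.0]   # what actually happened
predicted = [110.0, 180.0, 400.0]    # what the hypothetical model said

mape = mean_absolute_percentage_error(historical, predicted)
print(f"MAPE: {mape:.1%}")
```

A lower value means the model tracks historical experience more closely. What counts as accurate enough depends on the decision the model is meant to support.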