In predictive analytics, several steps are necessary before building models. In addition to understanding the needs and goals of the project, you need to prepare the data. Real-world data comes with real-world problems, as I was reminded while reading a blog post recently. The success of a project is often sabotaged by these data issues. So, is your data ready?
Preparing data for analysis is a tricky process. There is no one-size-fits-all way to do things, but there are valid approaches to choose from for each challenge. These challenges include, but are certainly not limited to, missing values, data entry errors, points outside the valid range, outliers, impossible combinations, duplicate records, invariant data, and problems with how the data is structured and organized.
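To make a few of these challenges concrete, here is a minimal sketch in plain Python that builds a small summary table of missing values, out-of-range points, invariant columns, and duplicate records. The sample records, the field names, and the valid range for age are all made up for illustration, and this is not how STATISTICA does it; it is just one simple way to count the problems before deciding what to do about them.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "state": "OK", "source": "web"},
    {"id": 2, "age": None, "state": "TX", "source": "web"},
    {"id": 3, "age": 212, "state": "OK", "source": "web"},
    {"id": 3, "age": 212, "state": "OK", "source": "web"},  # exact duplicate
]

# Assumed valid ranges (inclusive) for numeric fields.
VALID_RANGES = {"age": (0, 120)}


def summarize_issues(rows, valid_ranges):
    """Return per-column issue counts and the number of duplicate rows."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        out_of_range = 0
        if col in valid_ranges:
            low, high = valid_ranges[col]
            out_of_range = sum(1 for v in present if not low <= v <= high)
        summary[col] = {
            "missing": len(values) - len(present),
            "out_of_range": out_of_range,
            # A column with at most one distinct value carries no information.
            "invariant": len(set(present)) <= 1,
        }
    # A row counts as a duplicate each time it repeats an earlier row.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return summary, duplicates


summary, duplicate_rows = summarize_issues(records, VALID_RANGES)
for col, stats in summary.items():
    print(col, stats)
print("duplicate rows:", duplicate_rows)
```

Note that the counts only flag candidates. Whether an age of 212 is a typo for 21 or a record to drop is still a judgment call.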
These data issues can be found with graphs and summary tables, as seen here:
STATISTICA offers easy-to-use tools for dealing with outliers and various other data issues.
These videos are part of a 35-video series on data mining and predictive analytics.
Sometimes we all make things harder than they have to be, hence the title of my blog, Round Hole, Square Peg. My dad uses the acronym KISS: Keep it simple, stupid. The simple fact is that once we start down a path, it's hard to turn around and try a new one. But at times, that is exactly what we need to do to make things easier. This is as true in life as it is in data analysis.