If you are new to analyzing large data sets, specifically those with many variables, you may never have heard of the Curse of Dimensionality. I know the first time I heard the term, I wondered how a low-budget horror film topic was going to make its way into a discussion of predictive analytics. I now have a better understanding of the Curse of Dimensionality and the difficulties that arise when analyzing data with hundreds, if not thousands, of predictors. While it may feel like you are in a horror film when dealing with some of these data sets, it doesn't have to be a film with a bad ending. A better understanding of the term is often the first step in working with this type of data. Hopefully this blog post will give you a bit of insight into the Curse of Dimensionality.
Richard Bellman is usually credited with coining the term Curse of Dimensionality in his work on dynamic programming in the late 1950s and early 1960s. If you break the term into its two components, Dimensionality refers to the number of dimensions, or predictors, in a data set, and the Curse refers to the difficulties that arise in analysis as the number of predictors increases. The Curse of Dimensionality is the exponentially increasing difficulty of finding any discernible patterns in the data, or a global optimum in the parameter space when fitting models, as the number of predictors grows.
It's hard to visualize what this means, but I once read an example that really helped me grasp the meaning of the Curse of Dimensionality conceptually. Imagine a single line 100 yards long, about the length of an American football field. This is a single dimension. Now imagine dropping a coin somewhere along this line and trying to find the coin by walking along the line. Yeah, that doesn't sound too hard.
Now imagine adding another dimension, another line of 100 yards, so that now instead of searching a single line, you are searching the area of a square 100 yards by 100 yards. Trying to find a coin dropped somewhere inside that square would be a considerably more difficult task.
Now add another dimension, so that instead of a 100-yard square you have a 100-yard cube. If a coin were dropped somewhere inside that cube, you might be in there for weeks before finding it!
This is a great visual example of the Curse of Dimensionality. As you add predictors, the volume of the space becomes so large that it is increasingly difficult to discern any meaning from the data.
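The coin search can even be put in rough numbers. As a toy calculation (the one-yard "detection radius" is my own assumption, not part of the original analogy), suppose that from any spot you can only tell whether the coin lies within one yard of you. The chance that a randomly chosen spot is close enough to the coin is the volume of a one-yard ball divided by the volume of the whole search space, and it collapses fast as dimensions are added:

```python
import math

def hit_probability(dim, side=100.0, radius=1.0):
    """Probability that one random guess in a `side`-length hypercube
    lands within `radius` of the coin: volume of a d-ball over the
    volume of the d-cube."""
    ball_volume = (math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)) * radius ** dim
    return ball_volume / side ** dim

for d in (1, 2, 3, 10):
    p = hit_probability(d)
    # 1/p is the expected number of random guesses before a hit
    print(f"{d} dimension(s): P(hit) = {p:.3e}, expected guesses ~ {1 / p:,.0f}")
```

On the 100-yard line, about 1 guess in 50 succeeds; in the square it is roughly 1 in 3,000, in the cube about 1 in 240,000, and by ten dimensions the odds are astronomically small. That is the Curse in miniature: the data you have covers a vanishingly small fraction of the space.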
There are many techniques for dealing with this Curse. You can use feature selection to search through the available predictors and find those that are most important for predicting the variables of interest. You can use Principal Components Analysis to reduce the dimensionality of the data. There are many more, but they are beyond the scope of this blog; I really just want to give you a better idea of what is meant by the term Curse of Dimensionality. I do want you to know that just because you may be facing the Curse of Dimensionality, it doesn't mean you won't be able to organize and analyze the data and get the meaning you need from the analysis. You just need the appropriate tools and techniques.
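To make the dimensionality-reduction idea concrete, here is a minimal sketch of Principal Components Analysis using nothing but NumPy's SVD. The data set is entirely made up for illustration: 50 predictors whose variation actually comes from only 3 underlying factors plus a little noise, which is exactly the situation where PCA shines. This is a teaching sketch, not a production implementation (in practice you would reach for a library such as scikit-learn):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 50 predictors, but almost all
# of the variance lives in a 3-dimensional subspace plus small noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# PCA via singular value decomposition of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values are proportional to the variance each
# principal component explains.
explained = s**2 / np.sum(s**2)

# Project the 50 predictors onto the first 3 principal components.
X_reduced = Xc @ Vt[:3].T
print(f"shape after reduction: {X_reduced.shape}")
print(f"variance kept by 3 components: {explained[:3].sum():.4f}")
```

Here three components retain essentially all of the variance, so a model can be fit on 3 inputs instead of 50, sidestepping much of the Curse. Real data is rarely this tidy, but the same recipe, center, decompose, keep the top components, is the heart of the technique.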