Some of these techniques may be a little out of date, and most of them have evolved so much over the past 10 years that the tools are now completely different and much more efficient to use. Still, here are a few bad techniques in predictive modelling that remain widely used in the industry:
1. Using traditional decision trees: very large decision trees are complex to handle and almost impossible to analyze, even for the most knowledgeable data scientist. They are also prone to over-fitting, which is why they are best avoided. Instead, we recommend combining multiple small decision trees rather than relying on a single large one, to avoid unnecessary complexity.
2. Use of linear regression: linear regression has numerous drawbacks, the biggest being its inability to capture complex non-linear patterns. The method relies on homoscedasticity, normally distributed errors and related assumptions, and its coefficient estimates become unstable when the independent variables are correlated with one another. As a better option, we recommend that you use variable reduction, transform your variables and use constrained regression instead. For example, Lasso or ridge regression can be a good option.
3. Using the K-means clustering technique: this method is commonly used for clustering but tends to produce roughly circular clusters, which is undesirable when the true groups have other shapes. It also does not work well with data points that are not a mixture of Gaussian distributions.
4. Linear discriminant analysis techniques: this technique is used for supervised classification, and it works well only when the classes do not overlap and can be cleanly separated by hyperplanes. In practice, classes almost always overlap to some degree. So we suggest that you use a density estimation technique instead of linear discriminant analysis.
5. Use of neural networks: neural networks can be unstable, are difficult to interpret and are highly prone to over-fitting.
6. Using density estimation in high dimensions: this runs into what is often called the “curse of dimensionality” in the data world, which makes it one of the worst possible techniques in predictive modelling. Instead, we recommend that you use non-parametric kernel density estimators with adaptive bandwidths.
7. Use of Naïve Bayes: commonly used for spam detection, fraud detection and scoring. This technique assumes that the variables are independent, and if they are not, it can fail badly. In fraud or spam detection the variables are usually highly correlated, so the technique often produces many false positives and false negatives.
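To make the suggestion in item 2 more concrete, here is a minimal sketch of how ridge regression can stabilise coefficients when two predictors are almost identical. It uses only the Python standard library; the function name `fit_two_feature_ridge`, the synthetic data and the penalty value `lam=1.0` are assumptions chosen for illustration, not a recommended setup.

```python
import random

def fit_two_feature_ridge(x1, x2, y, lam=0.0):
    """Solve (X^T X + lam * I) b = X^T y for two features, no intercept.

    lam=0.0 gives ordinary least squares; lam>0 gives ridge regression.
    """
    a11 = sum(u * u for u in x1) + lam
    a22 = sum(v * v for v in x2) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    # Cramer's rule for the 2x2 linear system
    b1 = (c1 * a22 - a12 * c2) / det
    b2 = (a11 * c2 - a12 * c1) / det
    return b1, b2

# Synthetic data (illustrative only): x2 is almost a copy of x1.
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [u + rng.gauss(0, 0.01) for u in x1]
y = [u + v + rng.gauss(0, 1) for u, v in zip(x1, x2)]

ols = fit_two_feature_ridge(x1, x2, y, lam=0.0)
ridge = fit_two_feature_ridge(x1, x2, y, lam=1.0)
print("OLS coefficients:  ", ols)
print("ridge coefficients:", ridge)
```

With correlated predictors the unpenalised coefficients can swing far from the true values of 1 and 1, while the penalty pulls the ridge coefficients toward smaller, more stable values. In practice the penalty strength would be chosen by cross-validation rather than fixed by hand.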
Another noteworthy point is that you should always use a sound cross-validation technique when testing your predictive models. This post covers techniques that are over-used, or in other words abused, in the world of data science. We do not mean that they should never be used, but that they should not be applied to every model like a magical recipe, because they are not the best fit for every problem and can lead to catastrophic consequences.
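As a rough illustration of what cross-validation involves, the sketch below splits a data set into k folds, fits a simple one-variable least-squares line on the training folds and measures the error on the held-out fold. The helper names (`k_fold_splits`, `fit_line`, `cross_validated_mse`) and the synthetic data are assumptions made up for this example; a real project would more likely rely on an established library.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def cross_validated_mse(xs, ys, k=5):
    """Average mean squared error on the held-out folds."""
    fold_errors = []
    for train, test in k_fold_splits(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        errs = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k

# Synthetic data (illustrative only): a noisy straight line.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 1) for x in xs]
print("5-fold CV MSE:", cross_validated_mse(xs, ys))
```

The point of the design is that the model never sees the fold it is scored on, so the reported error is an estimate of performance on new data rather than a measure of how well the model memorised the training set.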