Sherlock Holmes Has Been Doing Data Visualization Before Big Data

Investigative-minded people will relate to this story from almost every child’s formative years. The day they get their hands on a magnifying glass, kids pretend to be the most famous detective of all time, Sherlock Holmes. Cap on head, they focus the magnifying glass on an object and try to derive meaning by studying its details closely. This is their first lesson in data visualization. Later, as we learned about Mr. Holmes through the books of Sir Arthur Conan Doyle, many of us may have imagined pursuing a career as a full-fledged detective. A Study in Scarlet offers the most vivid description of Mr. Holmes’s inclination for the sciences.

Come to think of it, the detective has probably evolved, in this technology-driven world, into the modern-day data analyst or experimental scientist. The job of a data analyst or scientist revolves around gathering a pile of disorganized data, using it to build a case through deduction and logic, and reaching a conclusion after analysis.


A quote by the iconic Mr. Holmes himself is very relevant here: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”


A sample case study of data visualization:


Let us imagine for this example that you are the CRO (chief risk officer) of a bank called U-bank. U-bank disbursed a total of 60816 auto loans in the fiscal quarter April–June 2014. While on the job as CRO, you notice a bad rate of around 2.5%: about 1524 of the 60816 disbursed loans have gone bad. You instantly have a hunch that there is a relationship between bad loans and the age of the borrowers. After a thorough analysis you reach the data-backed conclusion that the two are inversely related: the bad rate falls as the age of the borrower rises. So you take the age of the borrower as a strong candidate variable for the credit risk model you are building. Proud of this start on your multivariate model, you want to investigate further, and so you begin the hunt for a few more variables.
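The quoted bad rate is easy to verify from the two counts given above. A minimal check in Python, using only the figures from the case study (the variable names are illustrative):

```python
# Figures quoted in the case study
total_loans = 60816
bad_loans = 1524

# Bad rate = bad loans as a share of all disbursed loans
bad_rate = bad_loans / total_loans
print(f"Bad rate: {bad_rate:.2%}")  # prints "Bad rate: 2.51%"
```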


The experiment continues…


As you Sherlock your way into this model, you suspect that the income of the loan applicants should also have some relationship with bad loans. To test this hypothesis you reuse tools you already understand well: the histogram and the normalized histogram (overlaid with good/bad borrowers). You start by plotting an equal-interval histogram and come to the following observations:
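For readers who want to see what these two tools compute, here is a minimal sketch in plain Python. The function names, the bin edges, and the ten sample incomes (in thousands) are made up for illustration; they are not U-bank's data.

```python
from bisect import bisect_right

def histogram(values, is_bad, edges):
    """Count all loans and bad loans per bin; bin i covers [edges[i], edges[i+1])."""
    n_bins = len(edges) - 1
    totals = [0] * n_bins
    bads = [0] * n_bins
    for value, bad in zip(values, is_bad):
        # Values outside the edges are clamped into the first or last bin,
        # so the last bin also includes its right edge.
        idx = max(0, min(bisect_right(edges, value) - 1, n_bins - 1))
        totals[idx] += 1
        bads[idx] += int(bad)
    return totals, bads

def normalized_histogram(totals, bads):
    """Bad rate per bin: bad loans divided by all loans in that bin."""
    return [b / t if t else None for t, b in zip(totals, bads)]

# Hypothetical sample: income in thousands, 1 = bad loan, 0 = good loan
incomes = [12, 18, 25, 33, 41, 47, 55, 62, 70, 78]
is_bad = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
edges = [0, 20, 40, 60, 80]  # equal intervals of width 20

totals, bads = histogram(incomes, is_bad, edges)
rates = normalized_histogram(totals, bads)
for lo, hi, t, r in zip(edges, edges[1:], totals, rates):
    print(f"{lo:>3}-{hi:<3} loans={t} bad rate={r:.0%}")
```

The first function gives the bar heights of the ordinary histogram; the second divides bad loans by all loans in each bin, which is what makes bins of very different sizes comparable.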


Yikes! This is nothing like the smooth, bell-shaped histogram you obtained earlier for age. The normalized histogram is also completely uninformative, as shown below:


So what is going on here? Unlike age, income has a handful of outliers that are almost invisible in the histogram. There is just one HNI (high net worth individual), with an annual salary of USD 1.47 million, and only a few other outliers in the middle. Surprisingly, the loan given to this one HNI client has turned bad, much to the bank’s inconvenience. The distribution table makes the imbalance apparent: almost 99.8% of the population belongs to the first two income buckets.
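The effect of a single extreme value on equal-interval bins can be reproduced with a toy example. The incomes below (in thousands) are invented, apart from the 1,470 figure that mirrors the one HNI salary mentioned above; the helper name and the 150K cap are assumptions for illustration.

```python
def equal_interval_counts(values, n_bins):
    """Split the range [min, max] into n_bins equal intervals and count values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# 65 ordinary incomes between 20K and 148K, plus one extreme value of 1,470K
incomes = list(range(20, 150, 2)) + [1470]

# With the outlier included, the bins are 145K wide and nearly everyone shares one bar
counts_all = equal_interval_counts(incomes, 10)
print(counts_all)

# Zooming in: keep only incomes up to an assumed cap of 150K and re-bin
cap = 150
core = [v for v in incomes if v <= cap]
counts_core = equal_interval_counts(core, 5)
print(counts_core)
```

In the first histogram, 65 of the 66 values fall into the first bin and the outlier sits alone in the last one. After capping, the same 65 values spread evenly over five bins, so any structure in the bulk of the data becomes visible again.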


Now, as an analyst, you have to make a tough call: should you keep such extreme cases in your data and in your model, or set an income boundary and build a model that applies to the majority of customers? According to the experts at DexLab Analytics, the latter is the wiser choice. Moving forward with the exploratory analysis and data visualization, you zoom in on the region with a large number of data points, i.e. the first two buckets, and re-plot the histogram. The following observations were made:


Income Groups


As is apparent, this histogram is reasonably smooth and does not need any transformation. Below is the corresponding normalized histogram:


Conclusions that were drawn from the above given graphs:


  • A relationship does exist between bad rates and income groups. The higher a borrower’s income, the lower the likelihood of default on the loan, which is a useful insight.
  • For the last bucket, i.e. > 150K, the risk jumps up, breaking the trend. This is attributed to thin data in the bucket: it holds few data points, and they are spread across a very wide interval, from 150K to 1,500K.


Now your model has two candidate variables for explaining borrowers’ bad rates: income and age. However, on analyzing income against age you soon discover that the two variables are highly correlated: 0.76, to be precise.


But you cannot use both of these variables in the model, as that would create problems of multicollinearity: the coefficient estimates become unstable and the model hard to interpret. The correlation between age and income does make sense, since income is a function of a professional’s years of experience, which is directly related to the individual’s age. So you must consider dropping income from the model. That leaves us with a simple question: is there a way to bring income back into our multivariate model?
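A simple way to screen candidate variables is to compute the Pearson correlation between each pair and set aside one variable from any strongly correlated pair. The sketch below is one way to do that, not the method used in the case study: the `pearson` and `is_collinear` helpers, the toy ages and incomes, and the 0.7 cut-off are all assumptions chosen for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def is_collinear(r, threshold=0.7):
    """Flag a pair whose absolute correlation exceeds the (assumed) cut-off."""
    return abs(r) > threshold

# Toy data, invented for illustration: age in years, income in thousands
ages = [23, 27, 31, 36, 42, 48, 55, 60]
incomes = [22, 30, 28, 45, 50, 61, 58, 75]

r_toy = pearson(ages, incomes)
print(f"toy age/income correlation: {r_toy:.2f}, collinear: {is_collinear(r_toy)}")

# The age/income correlation quoted in the case study
print(f"quoted 0.76 flagged: {is_collinear(0.76)}")
```

With this cut-off, the quoted correlation of 0.76 would be flagged, which matches the decision to keep only one of the two variables.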


Fiscal ratios:


When corporate analysts analyze the financials of a company, they often work with several financial ratios. Working with ratios has clear advantages over plain vanilla variables: combined variables usually carry more information than simple ones. Creating variables is also a creative task that requires sound knowledge of the domain. For credit risk analysis, the ratio of the sum of obligations to income is immensely informative, as it gives insight into the percentage of the borrower’s income that is disposable.


Here is an example of FOIR (fixed obligation to income ratio) for the borrowers, plotted as a normalized histogram:


The analysis showed that borrowers who are left with only 50% of their income to cover their other expenses are usually in a tight position and often turn their loans into bad ones.
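As a rough illustration of the ratio, FOIR can be computed as total fixed obligations divided by income over the same period. The function names, the sample figures, and the use of 0.5 as a warning level (taken from the 50% observation above, not from any official standard) are assumptions.

```python
def foir(fixed_obligations, income):
    """Fixed obligation to income ratio; both arguments cover the same period."""
    if income <= 0:
        raise ValueError("income must be positive")
    return sum(fixed_obligations) / income

def is_stretched(ratio, warning_level=0.5):
    """True when obligations consume at least the (assumed) warning share of income."""
    return ratio >= warning_level

# Hypothetical borrower: monthly income and monthly fixed payments
monthly_income = 4000
monthly_obligations = [1200, 500, 300]  # e.g. auto loan, another loan, card minimum

ratio = foir(monthly_obligations, monthly_income)
print(f"FOIR = {ratio:.0%}, stretched: {is_stretched(ratio)}")  # FOIR = 50%, stretched: True
```

Because the ratio already scales obligations by income, two borrowers with very different salaries can be compared on the same footing.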


Thus, the deductions above show a clear relationship between FOIR and bad loan rates. Moreover, FOIR has only a weak correlation with age, just 0.18, so it can sit alongside age as a second variable, making the model multivariate.


So congratulations on successfully following in Sherlock Holmes’ footsteps: in true data-scientist style, you have built a case evidence by evidence, which is how science proceeds.


We hope we have lit a flame of inspiration in you, so that you can pick up your data science magnifying glass and follow the favourite detective of all time in hot pursuit. Only this time, the secrets of the mystery are hidden in strings of data.

