Over the past few decades, banks have collected a great deal of data describing the default behaviour of their customers. Examples include historical data on a person’s date of birth, income, gender and employment status. All of this data has been stored in large databases or data warehouses (e.g. relational ones).
On top of this, banks have accumulated a great deal of business experience with their credit products. For instance, many credit experts have done a good job of discriminating between low-risk and high-risk mortgages using their business expertise alone. The goal of credit scoring is to analyse both sources of information, the data and the expertise, in more detail and to come up with a statistically based decision model that makes it possible to score future credit applications and ultimately decide which ones to accept and which to reject.
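To make the idea of a statistically based decision model concrete, here is a minimal sketch in Python. It assumes a logistic-regression-style scorecard, which is one common choice and not necessarily the model any particular bank uses. The coefficients, the variable names (age_years, income_thousands, is_unemployed) and the cutoff are hypothetical values chosen purely for illustration; in practice they would be estimated from historical default data.

```python
import math

# Hypothetical coefficients, for illustration only. A real model would
# estimate these from historical data on defaults.
INTERCEPT = -2.0
WEIGHTS = {
    "age_years": -0.02,
    "income_thousands": -0.01,
    "is_unemployed": 1.5,
}
# Assumed maximum acceptable probability of default.
CUTOFF = 0.2


def default_probability(applicant):
    """Return the modelled probability of default for one applicant."""
    z = INTERCEPT + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def decide(applicant):
    """Accept the application if the modelled risk is at or below the cutoff."""
    return "accept" if default_probability(applicant) <= CUTOFF else "reject"


if __name__ == "__main__":
    applicant = {"age_years": 40, "income_thousands": 50, "is_unemployed": 0}
    print(decide(applicant), round(default_probability(applicant), 3))
```

The cutoff is where business expertise re-enters the picture: the model only ranks applicants by risk, while the accept or reject threshold reflects how much risk the lender is willing to take on.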
The emergence of Big Data has created both opportunities and challenges for credit scoring. Big Data is often characterised in terms of its four Vs: Volume, Variety, Velocity and Veracity. To illustrate this, let us briefly look at some key sources or processes that generate Big Data.
The traditional sources of Big Data are large-scale transactional enterprise systems such as OLTP (Online Transaction Processing), ERP (Enterprise Resource Planning) and CRM (Customer Relationship Management) applications. Classical credit scoring models are generally built using data extracted from these traditional transactional systems.
The online social graph is a more recent example. Simply think about the major social media networks such as Weibo, WeChat, Facebook and Twitter. Together, these networks capture information about close to two billion people, including their friends, preferences and other behaviours, leaving behind a huge digital footprint in the form of data.
Also think about the IoT (Internet of Things), the emerging sensor-enabled ecosystem that will link various objects (e.g. cars, homes) with each other and with humans. Finally, more and more open or public data is becoming available, such as data about weather, maps, traffic and the macro-economy. All of these new data sources clearly offer tremendous potential for building better credit scoring models.
The data-generating processes mentioned above can all be characterised by the sheer volume of data they create. This poses a serious challenge: setting up a scalable storage architecture, combined with a distributed approach to data manipulation and querying.
Big Data also comes in a great variety of formats. Traditional, structured data such as a customer’s name or birth date is increasingly complemented by unstructured data such as images, tweets, emails, sensor data, Facebook pages and GPS data. While the former can easily be stored in traditional databases, the latter needs appropriate database technology that facilitates the storage, querying and manipulation of each of these types of unstructured data. This requires considerable effort, since it is often claimed that at least 80 percent of all data is unstructured.
Velocity is the speed at which data is generated, and it is at that same speed that the data must be stored and analysed. Think of streaming applications such as online trading platforms, SMS messages, YouTube, credit card swipes and phone calls: these are all examples of high-velocity data and form an important concern.
Veracity, the quality or trustworthiness of the data, is yet another factor that needs to be considered. Unfortunately, more data does not automatically mean better data, so the quality of the data being generated must be closely monitored and guaranteed.
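As a rough illustration of what monitoring data quality can mean in practice, the Python sketch below measures the share of missing values per field in a batch of applicant records and flags the fields that exceed a tolerance. The field names and the 10 percent default threshold are assumptions made for this example; missing values are only one of several possible quality indicators.

```python
def missing_rates(records, fields):
    """Return the fraction of records in which each field is absent or empty."""
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates


def fields_over_threshold(records, fields, max_missing=0.1):
    """List the fields whose missing rate exceeds the assumed tolerance."""
    rates = missing_rates(records, fields)
    return sorted(field for field, rate in rates.items() if rate > max_missing)


if __name__ == "__main__":
    # Hypothetical applicant records, for illustration only.
    batch = [
        {"income": 50, "birth_date": "1980-01-01"},
        {"income": None, "birth_date": "1990-05-05"},
        {"birth_date": ""},
        {"income": 30, "birth_date": "1975-03-03"},
    ]
    print(missing_rates(batch, ["income", "birth_date"]))
    print(fields_over_threshold(batch, ["income", "birth_date"], max_missing=0.3))
```

Running such a check on every incoming batch, rather than once, is what turns a one-off data audit into ongoing monitoring.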
In closing, as the volume, variety, velocity and veracity of data keep growing, so will the opportunities to build better credit scoring models.