One of our clients agreed to have their anonymized data used for credit scoring articles and training. Since not many real credit data sets are available publicly, we took this opportunity to share the data.
Euro Credit Data
The table you can download here, LoansWithPredictors, is a typical simple dataset that you can use to build different scoring models. It contains 10,664 observations, each of them a consumer loan. The last variable, BadFlag, marks 1,060 of these observations as “bad”.
The full list of variables in the dataset is available here: DocLoansWithPredictors.
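As a quick sanity check, the overall bad rate follows directly from the two counts quoted above. The snippet below is only arithmetic on those published figures; the variable names are illustrative.

```python
# Counts quoted in the text: total observations and observations flagged as bad.
n_obs = 10_664
n_bad = 1_060

bad_rate = n_bad / n_obs
print(f"Overall bad rate: {bad_rate:.2%}")  # Overall bad rate: 9.94%
```

A bad rate of roughly 10% means that good loans outnumber bad ones about nine to one, which is worth keeping in mind when evaluating any model built on this table.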
ID variables and basic loan characteristics
The first few variables describe each loan: ID of the loan, ID of the customer, Loan Amount, Loan Type, Loan Date.
It is common and reasonable to model credit risk for new customers and existing customers separately. The credit process usually differs between the two groups, and models for existing customers are usually stronger because behavioral data from earlier credit history provides several useful predictors. You can use the LoanType variable to separate the ‘New’ and ‘Existing’ subsets of the data. As you can see in the data, ‘Existing’ observations have a lower bad rate and more interesting predictor candidates. Lending to existing customers is rightly considered less risky, but there are important systematic risks in managing a series of short-term loans almost as if they were credit cards.
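The split by LoanType and the bad-rate comparison can be sketched as follows. This is a minimal illustration, not the procedure used to prepare the dataset: it assumes the rows have been loaded as dictionaries keyed by column name and that BadFlag is coded 1 for bad and 0 for good. The toy rows are made up for the example and are not taken from the real table.

```python
from collections import defaultdict


def bad_rate_by_segment(rows, segment_col="LoanType", flag_col="BadFlag"):
    """Return {segment: (n_loans, n_bad, bad_rate)} for a list of dict rows."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [n_loans, n_bad]
    for row in rows:
        segment = row[segment_col]
        counts[segment][0] += 1
        # Assumption: BadFlag is coded 1 = bad, 0 = good.
        counts[segment][1] += int(row[flag_col])
    return {seg: (n, bad, bad / n) for seg, (n, bad) in counts.items()}


# Made-up rows for illustration only; they do not come from the real dataset.
toy_rows = [
    {"LoanType": loan_type, "BadFlag": flag}
    for loan_type, flag in [
        ("New", 1), ("New", 0), ("New", 0), ("New", 0),
        ("Existing", 0), ("Existing", 1), ("Existing", 0),
        ("Existing", 0), ("Existing", 0),
    ]
]

for segment, (n, bad, rate) in bad_rate_by_segment(toy_rows).items():
    print(f"{segment}: {n} loans, {bad} bad, bad rate {rate:.1%}")
```

Running the same function on the full table gives the per-segment bad rates, and the two groups of rows can then be passed to separate model-building steps.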
Of course, ID variables should not be used as predictors in scoring. It is also usually a bad idea to use ticket size (LoanAmount). It often turns out to be significant in predicting credit default, but mainly because historical strategies for assigning loan amounts were not random: better customers likely received larger loan amounts. As you can see in the data, loan amounts range between 200 EUR and 6,000 EUR.
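In practice, this exclusion is easiest to enforce once, before any modeling starts. The sketch below shows one way to do it. The ID column names LoanID and CustomerID are hypothetical placeholders (check the variable documentation for the real names); LoanAmount and BadFlag are the names used in this article.

```python
# Hypothetical names for the two ID variables; the real names are listed
# in the variable documentation.
ID_COLUMNS = {"LoanID", "CustomerID"}

# IDs, ticket size and the target itself must never enter the model as inputs.
NON_PREDICTORS = ID_COLUMNS | {"LoanAmount", "BadFlag"}


def candidate_predictors(columns):
    """Keep the original column order, dropping IDs, ticket size and target."""
    return [col for col in columns if col not in NON_PREDICTORS]


# A shortened, illustrative column list (not the full dataset header).
example_columns = [
    "LoanID", "CustomerID", "LoanAmount",
    "Income", "MaritalStatus", "Education", "IncomePrec",
    "BadFlag",
]
print(candidate_predictors(example_columns))
```

Keeping the exclusion list in one place makes it harder for an ID or LoanAmount to slip into a model through an automated variable-selection step.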
Candidate predictors for credit risk scoring
There are 25 typical predictor variables in the Euro Credit Data table:
- 7 application variables (like declared Income, MaritalStatus and Education),
- 3 credit bureau variables (because of strict regulations on the allowed use of bureau data, we could not publish the full set of bureau variables, which is typically very large and very interesting),
- 15 behavioral variables (calculated for those customers who already had a credit history at the time of the loan application).
It is also possible (and a good idea!) to construct additional variables by transforming and combining the data. One example of a transformed variable is already included: IncomePrec is the precision with which income was declared. It is possible that customers who take care to declare their income to the nearest EUR (or even cent) differ in risk performance from those who round their income.
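To make the idea of a precision variable concrete, here is one possible way to derive it from a declared income. This is a sketch under our own assumptions: the exact definition behind IncomePrec in the dataset may differ, and the function name and the set of rounding units are illustrative choices.

```python
from decimal import Decimal

# Rounding units checked from coarsest to finest (an illustrative choice).
UNITS = tuple(Decimal(u) for u in ("1000", "100", "10", "1", "0.1", "0.01"))


def income_precision(income):
    """Return the coarsest unit that divides the declared income exactly.

    Decimal arithmetic avoids floating-point surprises with cents.
    Note that an income of 0 is reported at the coarsest unit.
    """
    value = Decimal(str(income))
    for unit in UNITS:
        if value % unit == 0:
            return unit
    return UNITS[-1]  # finer than one cent: treat as cent precision


for declared in (2000, 1500, 1234, "1234.56"):
    print(declared, "->", income_precision(declared))
```

A customer declaring 1500 is classified as rounding to hundreds, while one declaring 1234.56 is classified as precise to the cent. The resulting categories can then be tested like any other candidate predictor.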