Reject Inference for Credit Risk Modelling

In credit scoring, loan applicants are either rejected or accepted depending on characteristics of the applicant such as age, income and marital status.

Repayment behaviour of the accepted applicants is observed by the creditor, usually leading after some time to a classification as either a good or bad (defaulted) loan.
As repayment behaviour of rejects is for obvious reasons not observed, the outcome of the loan (good or bad) is available only for accepted applicants.
Since the creditor presumably does not accept applicants at random, this constitutes a nonrandom sample from the population of interest (the "through the door" population).
Construction of a new selection rule based on accepted applicants only may therefore lead to incorrect results. In particular, one should be careful in using such a rule to assess the default risk of rejected applicants.
This is, in a nutshell, what is called the reject inference problem in the credit scoring literature.

There are two major issues that determine the extent of the problem:

  1. Is the selection mechanism (the decision whether or not to accept an applicant) deterministic or not?
  2. Is all information relevant to the acceptance decision recorded in the data?

Let's look at each issue in turn:
  1. Nowadays many creditors use a formal (computer-assisted) model to decide whether an applicant should be accepted. If this model is applied in such a way that applicants with the same relevant attributes are treated equally (which does seem fair!), then for some attribute values all applicants are rejected, and for other attribute values all applicants are accepted. For example, the current selection model may say that applicants with income below 2000 and age below 25 are bad risks that should be rejected. If this rule is applied consistently, then all applicants with income below 2000 and age below 25 are rejected, so we don't observe the outcome of the loan (good or bad) for any applicant with those attribute values. In other words, we have no data to estimate the probability of a bad loan in that part of the attribute space.

    We could make some global parametric assumptions (e.g. that the probability of a bad loan depends linearly on the relevant attributes), but we cannot check whether these assumptions hold in the reject region. So at the end of the day, we are still extrapolating from the accept region (where we do observe the outcome) to the reject region. The dangers of extrapolation are well known to people involved in data analysis: our estimates of the default probabilities in the reject region may be way off!

    So what to do about this? The creditor could decide to buy experience: rather than rejecting all people with income below 2000 and age below 25, he could decide to accept some of them at random. In this way the creditor collects some data on outcomes in the reject region, at the probable expense of a higher percentage of defaulted loans. Such an acceptance scheme can be made more sophisticated by letting the acceptance probability depend on the risk assessment of the current selection model.
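    The idea of letting the acceptance probability depend on the current risk assessment can be sketched as follows. This is only an illustration under stated assumptions: the function names, the cutoff of 0.10 and the exploration rate of 0.05 are hypothetical and do not come from any actual selection model.

```python
import random


def accept_probability(estimated_risk, cutoff=0.10, explore_rate=0.05):
    """Hypothetical acceptance probability for one applicant.

    estimated_risk: default probability according to the current model (0..1).
    cutoff:         assumed risk level up to which applicants are always accepted.
    explore_rate:   assumed acceptance probability just above the cutoff.
    """
    if estimated_risk <= cutoff:
        # Accept region of the current model: accept as usual.
        return 1.0
    # Reject region: accept a small random fraction, shrinking linearly
    # to zero as the estimated risk approaches 1.
    return explore_rate * (1.0 - estimated_risk) / (1.0 - cutoff)


def decide(estimated_risk, rng, cutoff=0.10, explore_rate=0.05):
    """Return (accepted, probability) for one applicant."""
    p = accept_probability(estimated_risk, cutoff, explore_rate)
    return rng.random() < p, p


if __name__ == "__main__":
    rng = random.Random(1)
    for risk in (0.05, 0.15, 0.40, 0.90):
        accepted, p = decide(risk, rng)
        print(f"risk={risk:.2f}  acceptance probability={p:.4f}  accepted={accepted}")
```

    Because the acceptance probability is set by design, it is known for every applicant in the reject region, including those who are accepted at random.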

  2. Suppose the loan officer thinks the current model is incomplete, and decides that applicants with income below 2000 and age below 25 who are married should be accepted (if they are not married, they are rejected as the model prescribes). However, marital status is not recorded in the data by the bank, and so is not included in the analysis at all (OK, the example is not entirely realistic...). Now, if the loan officer is right in the sense that married people do have a lower risk of defaulting (married people are of course more reliable!), then our data contains a nasty bias.

    A numerical example illustrates this bias. Suppose that of all applicants with income below 2000 and age below 25, 40% are married and 60% are not. Suppose the married ones have a probability of 0.025 (1/40) to default, and the unmarried ones a probability of 0.15 (3/20). Overall, the applicants with income below 2000 and age below 25 have a probability of 0.4 x 0.025 + 0.6 x 0.15 = 0.10 to default. However, since we only observe the outcome of the married ones, we would estimate this probability to be approximately 0.025.

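    This means we can't use the accepted applicants with income below 2000 and age below 25 to estimate the default probability of the rejected applicants with income below 2000 and age below 25, because the two groups have very different default probabilities (0.025 for the accepted, who are all married, against 0.15 for the rejected, who are all unmarried). Acceptance depends on marital status, which is related to the outcome but is not recorded. In technical terms, we say the outcome (defaulted or not) is not missing at random.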

    Note that the situation would have been entirely different if the loan officer had simply rejected at random 60% (or some other percentage) of all applicants with income below 2000 and age below 25. In that case we could use the accepted applicants in this group to estimate the default probability of the rejected applicants in the same group. In this case we say the outcome is missing at random.

    Another point to note is that the analysis could obviously be improved by including the marital status of the applicant in the data!
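
The numerical example under issue 2 can be checked with a small simulation. The sketch below uses the percentages from the example (40% married, default probabilities 0.025 and 0.15); the sample size, the random seed and all names are assumptions made for illustration. It compares the default rate among accepted applicants under the loan officer's rule (accept only the married) with the rate under random acceptance of 40% of the group.

```python
import random

P_MARRIED = 0.4             # share of married applicants in the group
P_DEFAULT_MARRIED = 0.025   # default probability if married
P_DEFAULT_UNMARRIED = 0.15  # default probability if unmarried


def simulate(n, rng):
    """Draw n applicants as (married, defaulted) pairs."""
    applicants = []
    for _ in range(n):
        married = rng.random() < P_MARRIED
        p_default = P_DEFAULT_MARRIED if married else P_DEFAULT_UNMARRIED
        applicants.append((married, rng.random() < p_default))
    return applicants


def default_rate(applicants):
    """Observed fraction of defaulted loans."""
    return sum(defaulted for _, defaulted in applicants) / len(applicants)


rng = random.Random(0)                 # arbitrary seed
population = simulate(200_000, rng)    # arbitrary sample size

# Loan officer's rule: accept married applicants only. Marital status is
# not recorded, so the analyst sees just the outcomes of this subset.
accepted_by_officer = [a for a in population if a[0]]

# Alternative: accept 40% of the group at random, ignoring marital status.
accepted_at_random = [a for a in population if rng.random() < P_MARRIED]

true_rate = P_MARRIED * P_DEFAULT_MARRIED + (1 - P_MARRIED) * P_DEFAULT_UNMARRIED

print(f"true default rate in the group:   {true_rate:.3f}")
print(f"estimate under officer's rule:    {default_rate(accepted_by_officer):.3f}")
print(f"estimate under random acceptance: {default_rate(accepted_at_random):.3f}")
```

Under the officer's rule the estimate should land near 0.025, far below the group's true rate of 0.10, while under random acceptance it should land near 0.10, up to sampling noise.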

Some of my publications on this subject: