In credit scoring, loan applicants are either rejected or accepted depending on characteristics of the applicant such as age, income and marital status.
Repayment behaviour of the accepted applicants is observed by the creditor,
usually leading after some time to a classification as either a good or bad (defaulted)
loan.
As repayment behaviour of rejects is for obvious reasons not observed,
the outcome of the loan (good or bad) is available only for accepted applicants.
Since the creditor presumably
does not accept applicants at random, the accepted applicants constitute a nonrandom sample
from the population of interest (the "through the door" population).
Construction of a new selection rule based
on accepted applicants only may therefore lead to incorrect results.
In particular, one should be careful in using such a rule to assess the
default risk of rejected applicants.
This is, in a nutshell, what is called
the reject inference problem in the credit scoring literature.
There are two major issues that determine the extent of the problem:
So what to do about this? The creditor could decide to buy experience: rather than rejecting all people with income below 2000 and age below 25, the creditor could accept some of them at random. In this way the creditor collects some data on outcomes in the reject region, at the probable expense of a higher percentage of defaulted loans. Such an acceptance scheme can be made more sophisticated by letting the acceptance probability depend on the risk assessment of the current selection model.
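A minimal sketch of such an acceptance scheme is given below. The function names, the 10% exploration rate and the way the probability is scaled by the risk estimate are all assumptions made for illustration; only the reject region (income below 2000 and age below 25) comes from the text.

```python
import random

def acceptance_probability(income, age, risk=None, explore_prob=0.1):
    """Probability of accepting an applicant (illustrative values only)."""
    in_reject_region = income < 2000 and age < 25
    if not in_reject_region:
        # Outside the reject region the applicant is always accepted.
        return 1.0
    if risk is None:
        # Simple scheme: a fixed share of the reject region is accepted at random.
        return explore_prob
    # Hypothetical refinement: the lower the estimated default risk (0..1),
    # the higher the acceptance probability.
    return explore_prob * (1.0 - risk)

def decide(income, age, rng, risk=None, explore_prob=0.1):
    """Return (accepted, probability) for one applicant."""
    prob = acceptance_probability(income, age, risk, explore_prob)
    return rng.random() < prob, prob

rng = random.Random(0)
print(decide(3000, 40, rng))            # outside the reject region
print(decide(1500, 22, rng))            # reject region, fixed probability
print(decide(1500, 22, rng, risk=0.5))  # reject region, risk-dependent probability
```

Returning the acceptance probability together with the decision keeps a record of how likely each accepted applicant in the reject region was to be selected.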
A numerical example illustrates this bias. Suppose that of all applicants with income below 2000 and age below 25, 40% are married and 60% are not. Suppose the married ones default with probability 0.025 (1/40), and the unmarried ones with probability 0.15 (3/20). Overall, the applicants with income below 2000 and age below 25 then default with probability 0.4 × 0.025 + 0.6 × 0.15 = 0.10. However, since we only observe the outcome of the married ones, we would estimate this probability to be approximately 0.025, a factor of four too low.
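The arithmetic of this example can be checked with a few lines of Python. The variable names are chosen here for illustration; the figures are those of the example.

```python
# Figures taken from the example in the text.
share_married = 0.4            # share of married applicants in the group
p_default_married = 0.025      # default probability, married
p_default_unmarried = 0.15     # default probability, unmarried

# Law of total probability over marital status.
p_default_group = (share_married * p_default_married
                   + (1 - share_married) * p_default_unmarried)

# Outcomes are observed for the married applicants only, so the
# observed default rate reflects that subgroup alone.
p_default_observed = p_default_married

print(f"default probability in the whole group: {p_default_group:.3f}")
print(f"estimate from observed outcomes only:   {p_default_observed:.3f}")
print(f"ratio of true to estimated probability: {p_default_group / p_default_observed:.1f}")
```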
This means we cannot use the accepted applicants with income below 2000 and age below 25 to estimate the default probability of the rejected applicants with income below 2000 and age below 25, because the two groups have very different default probabilities (0.025 for the married applicants whose outcome we observe, against 0.15 for the unmarried ones). In technical terms, we say the outcome (defaulted or not) is not missing at random.
Note that the situation would have been entirely different if the loan officer had simply rejected at random 60% (or some other percentage) of all applicants with income below 2000 and age below 25. In that case we could use the accepted applicants with income below 2000 and age below 25 to estimate the default probability of the rejected applicants with income below 2000 and age below 25. In this case we say the outcome is missing at random.
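The contrast between the two situations can be made concrete with a small simulation. This is only a sketch: the sample size, the seed and the function name are assumptions, and the population is generated from the probabilities of the numerical example (40% married, default probabilities 0.025 and 0.15).

```python
import random

def simulate(n=200_000, seed=0):
    """Compare default rates under selective and random acceptance."""
    rng = random.Random(seed)
    everyone = []          # outcomes of all applicants (never fully observed in practice)
    married_only = []      # outcomes observed when only married applicants are accepted
    random_accept = []     # outcomes observed when 60% are rejected at random
    for _ in range(n):
        married = rng.random() < 0.4
        default = rng.random() < (0.025 if married else 0.15)
        everyone.append(default)
        if married:
            married_only.append(default)
        if rng.random() < 0.4:
            random_accept.append(default)

    def rate(outcomes):
        return sum(outcomes) / len(outcomes)

    return rate(everyone), rate(married_only), rate(random_accept)

true_rate, selective_rate, random_rate = simulate()
print(f"whole group:             {true_rate:.3f}")
print(f"married accepted only:   {selective_rate:.3f}")
print(f"accepted at random:      {random_rate:.3f}")
```

Under random acceptance the observed default rate should be close to that of the whole group, whereas acceptance of married applicants only reproduces the rate of the married subgroup.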
Another point to note is that the analysis could obviously be improved by including the marital status of the applicant in the data!
Some of my publications on this subject: