ISSUES USING LOGISTIC REGRESSION WITH CLASS IMBALANCE, WITH A CASE STUDY FROM CREDIT RISK MODELLING

The class imbalance problem arises in two-class classification problems when the less frequent (minority) class is observed much less often than the majority class. This characteristic is endemic in many problems, such as modelling default or fraud detection. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen's results to show the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance, and that an approach that is able to model underlying structure in the minority class is often superior.


1. Introduction. Class imbalance is a common problem in the field of classification. For example, banks use statistical models to evaluate the credit risk of lending to consumers, typically by constructing a classification rule to distinguish good and bad risk customers. Customers' classes are defined by whether or not they default, that is, fail to satisfy their repayment obligations. This process is known as credit scoring. The most popular approach for consumer credit risk modelling is logistic regression [25, p. 79], which is regarded as a benchmark in the financial industry. There are a number of statistical modelling challenges in credit scoring, such as reject inference [10], handling variation over time and class imbalance. The latter is a common problem in credit scoring, in which the bad class is much less frequent than the good. As an example of high imbalance, we introduce the Freddie Mac mortgage data set, for which the number of default samples is profoundly smaller than the number of non-default samples. Freddie Mac is a U.S. government-sponsored enterprise that purchases mortgage loans for later sale as part of mortgage-backed securities. Freddie Mac provides its "Single Family Loan-Level Data Set", including the fixed-rate mortgages it purchased from 1999 to 2015. Here, we define default as a mortgage that is 180 days or more overdue on a repayment. This is a standard definition of default in U.S. financial institutions [6]. The target variable we use is whether a mortgage moved to default status in the two years immediately following the first repayment date. The upper plot of Figure 1 shows the number of new mortgages booked in each originating quarter from 2003 to 2013; the lower plot shows the default rate from 2003 to 2013. The number of applications fluctuates over this long time frame.
We find a pronounced peak in default rate during the financial crisis period (2007-2008), reaching 6.8% in 2007 Q3; however, the default rate is extremely low in other quarters.
Modelling such imbalanced data is a challenging problem, fraught with difficulties, and it is the main concern of this paper. Specifically, we concentrate on the application of logistic regression and related methods to highly imbalanced data sets. Owen [19] provides a striking asymptotic result which suggests that, in cases of extreme class imbalance, the minority class only contributes to the logistic regression estimation via its sample mean vector. This raises concerns about the utility of such models and their potential unwanted consequences. In particular, if we suspect some underlying cluster structure within the minority class, then this will be lost when the data are reduced to the sample mean vector. Two natural choices to alleviate these problems are penalizing and weighting the likelihood, which are used in the credit risk industry [28, 12]. However, by extending Owen's result, we show that penalizing and weighting the likelihood are insufficient for handling the class imbalance problem. In fact, penalized logistic regression makes matters worse.
We sketch a new procedure that attempts to handle a specific consequence of highly imbalanced logistic regression, namely that the minority class contributes to the estimation process only through its mean vector. Essentially, this procedure seeks to cluster the minority class into two (or more) latent subclasses to improve the predictive performance of the model, with the underlying assumption that such useful subclasses exist within the minority class. This approach is shown to be effective in a simulation study and in modelling default for real mortgage data.
The outline of the paper is as follows. The next section provides background to logistic regression, highly imbalanced data, and the relationship between the two. Section 3 considers methods that extend logistic regression and are attractive proposals for credit risk modelling. New theorems are given for infinitely imbalanced weighted logistic regression and penalized logistic regression. We use a relabelling clustering approach to demonstrate problems arising from imbalanced classes with logistic regression, with a simulation study in Section 4 and using the Freddie Mac data in Section 5.
2. Logistic regression and class imbalance. In this section, we first introduce logistic regression and give a short review of the class imbalance problem, mentioning some sampling methods proposed in the literature to deal with this problem. Then we introduce the boundary behavior of logistic regression in the infinitely imbalanced data set [19] and a result about the existence of the maximum likelihood estimate (MLE) for logistic regression [23] as a preparation for further investigation of highly imbalanced logistic regression.
2.1. Logistic regression. Consumer credit risk classification usually uses binary logistic regression to predict the probability of a customer defaulting on a loan. Denote the binary response as Y ∈ {0, 1}, with Y = 1 indicating a default. Pr(Y = 1|X = x) refers to the conditional probability of belonging to class 1 whenever X = x. The binary logistic regression model has the form:

Pr(Y = 1|X = x) = exp(β0 + x^T β) / (1 + exp(β0 + x^T β)),    (1)

where β0 is an intercept and β is a slope parameter vector, to be estimated. Suppose we have M observations in a credit data set; then the log-likelihood function for independent observations can be written as:

l(β0, β) = Σ_{i=1}^{M} [ y_i (β0 + x_i^T β) − log(1 + exp(β0 + x_i^T β)) ].    (2)

To maximize Equation (2), we set its derivatives to zero and solve for β0 and β. To provide solutions, and hence estimate parameters, various numerical optimization techniques are used.
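The maximization just described can be sketched in a few lines of Newton-Raphson. The data, sample size and true parameters below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximize the log-likelihood in Equation (2) by Newton-Raphson.
    Returns (beta0, beta)."""
    Z = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ b))          # fitted Pr(Y = 1 | x)
        grad = Z.T @ (y - p)                      # score vector
        H = (Z * (p * (1.0 - p))[:, None]).T @ Z  # observed information
        b = b + np.linalg.solve(H, grad)          # Newton step
    return b[0], b[1:]

# Illustrative synthetic data: two features with known true coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
eta = -1.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-eta))).astype(float)
b0, beta = fit_logistic(X, y)
```

At convergence the score vector is numerically zero, which is exactly the "set derivatives to zero" condition above.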

2.2. Classification with imbalanced data. High imbalance means one of the two classes is extremely rare in the binary classification problem. Highly imbalanced credit data is common in the credit risk industry. For example, in mortgage data sets (such as that described in Section 1), the default rate is often less than 0.5%, which means the default class is extremely rare. Various authors have investigated such imbalanced (especially highly imbalanced) problems. One type of proposal attempts to construct a balanced data set [14], and thus various sampling methods have been developed to address highly imbalanced problems. Random undersampling and random oversampling are two natural methods, which respectively remove randomly chosen samples from the majority class or randomly replicate samples from the minority class. Random undersampling obviously leads to information loss in the majority class. Thus, Liu et al. [15] proposed two informed undersampling methods to overcome this drawback. Another widely used oversampling method is SMOTE (Synthetic Minority Over-sampling TEchnique), which shows good results in various applications [4]. In contrast to random oversampling, SMOTE adds artificially generated data that follows the distribution of the minority class, by interpolating between neighbouring minority samples.
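The interpolation idea behind SMOTE can be sketched in a few lines (a minimal illustration of the technique, not the full algorithm of [4]; the data are invented):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples: each one is a random
    convex combination of a minority point and one of its k nearest
    minority-class neighbours."""
    rng = rng or np.random.default_rng(0)
    out = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # k nearest neighbours, excluding self
        j = rng.choice(nbrs)
        lam = rng.random()                     # interpolation weight in [0, 1)
        out[t] = X_min[i] + lam * (X_min[j] - X_min[i])
    return out

rng = np.random.default_rng(1)
X_min = rng.normal(loc=2.0, size=(20, 3))      # a small, rare "bad" class
X_syn = smote(X_min, n_new=50, rng=rng)        # oversample it with synthetic points
```

Because each synthetic point is a convex combination of two existing minority points, the new data stay inside the region already occupied by the minority class, unlike naive replication which only duplicates points.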
At the same time, an investigation also shows that models trained on an imbalanced data set can have the same predictive power as models trained on a balanced data set [2].
Apart from sampling methods, another common technique for handling imbalanced data is reweighting the likelihood function. Reweighting methods pass a weight for each observation directly to the likelihood function. Seiffert et al. [22] investigate the difference between reweighting and sampling methods for imbalanced data in boosting algorithms and find that reweighting performs as well as sampling methods.
In summary, the literature suggests that class imbalance is not well understood and no single method is apparently superior to the others, in general. The exception is logistic regression, for which deeper mathematical insights are available, as we now discuss.

2.3. Silvapulle's results about the existence of the MLE for logistic regression.
Here, and when convenient in the sequel, we use the following notation: consider n p-dimensional feature vectors from class Y = 1 (the minority class), denoted x_11, · · · , x_1n, and N feature vectors from class Y = 0, x_01, · · · , x_0N. In order to accommodate the intercept term in the regression parameters, let z_0i = (1, x_0i) for i = 1, · · · , N and z_1i = (1, x_1i) for i = 1, · · · , n. Let S, F be the two relative interiors of the convex cones generated by x_11, · · · , x_1n and x_01, · · · , x_0N respectively. When S ∩ F ≠ ∅, a unique MLE for logistic regression exists. However, if S ∩ F = ∅ then no MLE exists [23]. Put differently, the MLE for logistic regression only exists if the classes are not linearly separable. This result is required for the arguments of the following sections.
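The non-existence of the MLE under separation is easy to see numerically: on linearly separable data, gradient ascent on the log-likelihood never stops improving, and the coefficient norm diverges. A small illustration with made-up one-dimensional data:

```python
import numpy as np

def loglik(b, Z, y):
    eta = Z @ b
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

# Perfectly separated 1-D data: every class-0 point lies below every class-1
# point, so the convex cones S and F do not overlap and no MLE exists.
x = np.array([-3.0, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
Z = np.column_stack([np.ones(8), x])

b = np.zeros(2)
norms, lls = [], []
for step in range(2000):
    p = 1.0 / (1.0 + np.exp(-Z @ b))
    b = b + 0.1 * (Z.T @ (y - p))      # gradient ascent on the log-likelihood
    if step % 500 == 499:
        norms.append(float(np.linalg.norm(b)))
        lls.append(loglik(b, Z, y))
# The log-likelihood keeps increasing towards 0 while ||b|| grows without bound.
```

With overlapping classes the same iteration would settle at a finite maximizer; here there is nothing to settle at.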

2.4. Owen's results about infinitely imbalanced logistic regression. Here, we introduce the result of [19] about the limiting behavior of logistic regression in infinitely imbalanced problems. Following the notation in [19], we focus on the case where n ≪ N. To demonstrate the result, Owen centers logistic regression around the minority class mean vector x̄ = (1/n) Σ_{i=1}^{n} x_1i. Since in the infinitely imbalanced case N → ∞, Owen also supposes that there is a good approximation for the conditional distribution of X given Y = 0 (the majority class), denoted by F0. Thus, it is shown that the log-likelihood function, Equation (2), can be written as

l(β0, β) = N ∫ log( 1 / (1 + exp(β0 + x^T β)) ) dF0(x) + Σ_{i=1}^{n} [ β0 + x_i^T β − log(1 + exp(β0 + x_i^T β)) ],    (4)

where F0 is a distribution function for X given that Y = 0. It is convenient to separate the intercept and slope terms in the parameter vector; thus β0 denotes the intercept term, β denotes the slope terms, and x_i denotes minority class data only. Note that from here on, in this paper, {x_i, i = 1, · · · , n} denotes only minority class data. Owen proposes the following definition to express the overlap condition (Section 2.3) in the case of infinitely imbalanced logistic regression.

Definition 1. (Surround condition) F0 surrounds the point x̄ if, for every unit vector ω ∈ R^p, there is a δ > 0 such that

∫_{(x − x̄)^T ω ≥ δ} dF0(x) > 0.    (5)
Owen also assumes, for all β ∈ R^p, the tail condition

∫ exp(x^T β) dF0(x) < ∞,    (6)

to ensure that F0 does not have tails heavy enough to make a degenerate logistic regression arise. Then the main result is:

Theorem 1. (Theorem 8 in [19]) Let n ≥ 1, and x_1, · · · , x_n ∈ R^p be fixed. Suppose that F0 satisfies the tail condition (Equation 6) and surrounds the class mean vector x̄ = (1/n) Σ_{i=1}^{n} x_i. Then the maximizer (β̂0, β̂) of l(β0, β) given by Equation (4) satisfies

lim_{N→∞} ∫ exp(x^T β̂)(x − x̄) dF0(x) = 0,

with N exp(β̂0) → n / ∫ exp(x^T β*) dF0(x), where β* denotes the limiting value of β̂.

This theorem may be understood as follows: when N → ∞, logistic regression only depends on the minority class data {x_1, · · · , x_n} through the minority class mean vector x̄; we could replace {x_1, · · · , x_n} by one vector, the mean vector of the minority class, and obtain the same coefficient estimates of β in the limit N → ∞. Theorem 1 is an asymptotic theoretical result, and Owen's simulation [19, p. 763, Table 1] shows that convergence happens rapidly from N/n > 100. In practice, when we replace all the minority class data with their mean vector in the Freddie Mac mortgage data set, we obtain the same coefficient estimates as using all the original minority class data.
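Theorem 1 is easy to check numerically: with a large imbalance ratio, refitting logistic regression after replacing every rare-class observation by copies of the rare-class mean barely changes the estimates. The sketch below uses invented Gaussian data (not the paper's) and a plain Newton solver:

```python
import numpy as np

def fit_logistic(X, y, n_iter=40):
    """Newton-Raphson MLE for (beta0, beta)."""
    Z = np.column_stack([np.ones(len(X)), X])
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ b, -30, 30)))
        H = (Z * (p * (1.0 - p))[:, None]).T @ Z
        b = b + np.linalg.solve(H, Z.T @ (y - p))
    return b

rng = np.random.default_rng(2)
N, n = 50_000, 50                       # imbalance ratio N/n = 1000
X0 = rng.normal(size=(N, 1))            # majority class, F0 = N(0, 1)
X1 = rng.normal(loc=2.0, scale=0.5, size=(n, 1))   # rare class
y = np.concatenate([np.zeros(N), np.ones(n)])

b_full = fit_logistic(np.vstack([X0, X1]), y)
# Replace the whole rare class by n copies of its mean vector.
b_mean = fit_logistic(np.vstack([X0, np.tile(X1.mean(axis=0), (n, 1))]), y)
```

The two coefficient vectors come out very close, even though the second fit has discarded all within-class variation of the rare class.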
Owen [19] notes other issues, particularly related to the presence of cluster structure in the minority class, of which the overall minority class mean vector would be a poor representation. These issues are explored later, in Sections 4 and 5.
3. Highly imbalanced weighted logistic regression and penalized logistic regression. We have briefly described previous work about the existence of the maximum likelihood estimator for logistic regression and the limiting behavior of parameter estimates in highly imbalanced data sets. As mentioned in Section 1, a natural idea to address the imbalance problem would be to reweight the likelihood, leading to weighted logistic regression. Alternatively, penalized logistic regression adds a complexity penalty to the log-likelihood function (Equation 2) and, in the case of the lasso [26], is designed for coefficient shrinkage and variable selection; it is another commonly used alternative to logistic regression in the credit risk industry [28, 8, p. 313]. One might propose these as solutions to alleviate the imbalance problem. However, we extend Owen's results to weighted logistic regression and penalized logistic regression in Sections 3.1 and 3.2 and show that both these extensions still suffer from the limiting behavior with highly imbalanced data.

YAZHE LI, TONY BELLOTTI AND NIALL ADAMS
3.1. Highly imbalanced weighted logistic regression. Zeng [30] proved that the MLE for weighted logistic regression exists and is unique when there is an overlap between the data points of the two classes, which extends the result in [23], described in Section 2.3. Next, we identify the characteristics of infinitely imbalanced weighted logistic regression for a specific weighting strategy.
Consider retaining a weight of one for the majority class, as in Equation (4), and increasing the weight of observations in the minority class, with weights Ω = {ω_i > 1, i = 1, · · · , n}. Thus, the log-likelihood function for weighted logistic regression is

l(β0, β; Ω) = N ∫ log( 1 / (1 + exp(β0 + x^T β)) ) dF0(x) + Σ_{i=1}^{n} ω_i [ β0 + x_i^T β − log(1 + exp(β0 + x_i^T β)) ],    (7)

where x̄_ω denotes the weighted minority class mean vector x̄_ω = Σ_{i=1}^{n} ω_i x_i / Σ_{i=1}^{n} ω_i. We investigate the characteristics of infinite imbalance for this form of weighted logistic regression and sketch the whole proof before expanding on the relevant lemmas and theorems. The numbering of the lemmas corresponds to those in [19]. Lemma 2 and Lemma 4 in [19] are numerical facts which also hold for this weighted logistic regression. Lemma 5 is used to prove the existence of a finite MLE when the surround condition (Definition 1) is satisfied. Lemma 6a and Lemma 7a are used to establish a bound on β̂ when N → ∞.
Lemma 6a. Let β̂0 and β̂ be the maximizer of the weighted likelihood function, let F0 satisfy the surround condition at the weighted minority class mean vector x̄_ω, and let η be the infimum of δ over unit directions. Then for any β, the partial derivative ∂l/∂β0 is negative whenever β0 ≥ log( 2 Σ_{i=1}^{n} ω_i / (ηN) ). For the concave likelihood function, the negative partial derivative means that the maximizer satisfies β̂0 < log( 2 Σ_{i=1}^{n} ω_i / (ηN) ).
Lemma 7a. Under the same conditions, we have lim sup_{N→∞} ∥β̂∥ < ∞.
Proof. Under the surround condition (Equation 5), there exists a γ > 0 such that

∫ ( (x − x̄_ω)^T ψ )_+ dF0(x) ≥ γ

for every unit vector ψ (ψ^T ψ = 1), where ( (x − x̄_ω)^T ψ )_+ denotes the positive part of (x − x̄_ω)^T ψ. We again let exp(β0) = A/N; then, bounding the integral term from below as in Lemma 4 of [19], we obtain

l(β0, β; Ω) − l(β0, 0; Ω) ≤ Σ_{i=1}^{n} ω_i ∥β∥ − N exp(β0) γ (∥β∥ − 1).    (12)

Equation (12) implies that, for large enough N, whenever ∥β∥ > 2/γ we have l(β0, β; Ω) < l(β0, 0; Ω). Thus, maximizing the likelihood function l(β0, β; Ω) forces ∥β̂∥ ≤ 2/γ, with large enough N.

Now, we prove the main theorem for infinitely imbalanced weighted logistic regression:

Theorem 2. Let x_1, · · · , x_n and the weights Ω be fixed. If F0 satisfies the surround condition at x̄_ω and the tail condition, then the maximizer (β̂0, β̂) of l(β0, β; Ω) satisfies

lim_{N→∞} ∫ exp(x^T β̂)(x − x̄_ω) dF0(x) = 0.    (13)

Proof. Setting ∂l/∂β = 0 we have

Σ_{i=1}^{n} ω_i x_i / (1 + exp(β0 + x_i^T β)) = N ∫ x exp(β0 + x^T β) / (1 + exp(β0 + x^T β)) dF0(x).    (14)

Divide Equation (14) by N exp(β0) to yield

(1 / (N exp(β0))) Σ_{i=1}^{n} ω_i x_i / (1 + exp(β0 + x_i^T β)) = ∫ x exp(x^T β) / (1 + exp(β0 + x^T β)) dF0(x).    (15)

Although the introduction of the weights ω_i leads to slower convergence than in unweighted logistic regression, the error made by ignoring the denominator inside the integral on the right side of Equation (15) still vanishes as N → ∞, because β̂ is bounded (Lemma 7a), β̂0 → −∞ (Lemma 6a) and ∫ exp(2 x^T β) dF0(x) < ∞ by the tail condition. Similarly, setting ∂l/∂β0 = 0 and dividing by N exp(β0) gives

(1 / (N exp(β0))) Σ_{i=1}^{n} ω_i / (1 + exp(β0 + x_i^T β)) = ∫ exp(x^T β) / (1 + exp(β0 + x^T β)) dF0(x),

and the minority-class terms satisfy 1/(1 + exp(β0 + x_i^T β)) → 1 as β0 → −∞. Thus, Equation (15) simplifies to

x̄_ω = ∫ x exp(x^T β̂) dF0(x) / ∫ exp(x^T β̂) dF0(x)

in the limit, and Equation (13) holds.
Theorem 2 shows that this specific weighted logistic regression still only depends on the weighted minority mean vector in the class imbalance limit, N → ∞. This finding has interesting implications for methods based on resampling the minority class.
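Theorem 2 can be checked the same way as Theorem 1: keeping the minority weights fixed but replacing the minority data by copies of their weighted mean leaves the weighted fit essentially unchanged at high imbalance. The data and weights below are invented for illustration:

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=40):
    """Newton-Raphson for the weighted log-likelihood: observation i's
    contribution is multiplied by the weight w[i]."""
    Z = np.column_stack([np.ones(len(X)), X])
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ b, -30, 30)))
        H = (Z * (w * p * (1.0 - p))[:, None]).T @ Z
        b = b + np.linalg.solve(H, Z.T @ (w * (y - p)))
    return b

rng = np.random.default_rng(3)
N, n = 50_000, 50
X0 = rng.normal(size=(N, 1))                      # majority class, weight 1
X1 = rng.normal(loc=2.0, scale=0.5, size=(n, 1))  # rare class
omega = rng.uniform(1.0, 5.0, size=n)             # rare-class weights > 1
w = np.concatenate([np.ones(N), omega])
y = np.concatenate([np.zeros(N), np.ones(n)])

xbar_w = (omega[:, None] * X1).sum(axis=0) / omega.sum()  # weighted mean vector
b_full = fit_weighted_logistic(np.vstack([X0, X1]), y, w)
b_mean = fit_weighted_logistic(np.vstack([X0, np.tile(xbar_w, (n, 1))]), y, w)
```

Note that the weighted mean, not the plain mean, is the right summary here: the replacement keeps each copy's original weight ω_i, so the total weighted mass and x̄_ω are preserved.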

3.2. Highly imbalanced penalized logistic regression. It is known that maximum likelihood estimation may fail with high-dimensional data or multiple highly correlated variables [8, p. 303]. In order to perform parameter shrinkage and variable selection, penalized logistic regression adds penalty terms to the likelihood function of logistic regression. We consider the two common forms of penalty: the ridge and the lasso. For data as described in Section 2.3, we can give the objective function for ridge penalized logistic regression [11] as:

l(β0, β) = Σ_{i=1}^{n} [ β0 + x_1i^T β − log(1 + exp(β0 + x_1i^T β)) ] − Σ_{i=1}^{N} log(1 + exp(β0 + x_0i^T β)) − (1/2) λ ∥β∥_2^2.    (17)

Lasso penalized logistic regression has the same form, except that the penalty term at the end of the expression is given as λ ∥β∥_1.

We first describe some simulations to explore the characteristics of penalized logistic regression in highly imbalanced data. We consider samples of different sizes and different levels of imbalance, where X | Y = 0 ∼ N(0, 1), and we have 100 replicates from the Y = 1 class, all with x = 1. The role of the replicates is to handle computational issues. As N (the number of majority class sample points) increases, the problem becomes more imbalanced. The coefficient estimates of standard logistic regression and ridge penalized logistic regression (penalty parameter 0.1) are given in Table 1. The table suggests that, as N → ∞, N exp(β̂0) converges to n and β̂ converges to 0 in ridge penalized logistic regression, with this particular and arbitrary choice of penalty parameter. However, consider X ∼ Uniform(0, 1) when Y = 0. Again, we use n = 100 points for Y = 1; 50 points are x = 0.5 and the others are x = 2. This setting fails the surround condition in Theorem 1. Table 2 shows the coefficient estimates of this simulation. We see that penalized logistic regression demonstrates shrinkage behavior, despite failing to satisfy the surround condition.

3.2.1. Theoretical results. In this section, we give results, following Owen, for penalized logistic regression in the infinitely imbalanced case.
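The Table 1 phenomenon can be reproduced with a short simulation. One assumption to flag: as in common software (e.g. glmnet), the sketch below averages the log-likelihood over all n + N observations before subtracting (λ/2)∥β∥², so the effective unaveraged penalty is (n + N)λ/2, and the intercept is left unpenalized:

```python
import numpy as np

def fit_ridge_logistic(X, y, lam, n_iter=60):
    """Newton-Raphson for the averaged log-likelihood minus (lam/2)*||beta||^2,
    i.e. an unaveraged penalty of lam * (n + N) / 2; intercept unpenalized."""
    Z = np.column_stack([np.ones(len(X)), X])
    b = np.zeros(Z.shape[1])
    P = lam * len(y) * np.eye(Z.shape[1])
    P[0, 0] = 0.0                        # do not penalize the intercept
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ b, -30, 30)))
        grad = Z.T @ (y - p) - P @ b
        H = (Z * (p * (1.0 - p))[:, None]).T @ Z + P
        b = b + np.linalg.solve(H, grad)
    return b

rng = np.random.default_rng(4)
n = 100
slopes, scaled_intercepts = [], []
for N in (1_000, 10_000, 100_000):
    # Majority class from N(0, 1); n = 100 minority replicates at x = 1.
    X = np.concatenate([rng.normal(size=N), np.ones(n)])[:, None]
    y = np.concatenate([np.zeros(N), np.ones(n)])
    b0, b1 = fit_ridge_logistic(X, y, lam=0.1)
    slopes.append(float(b1))
    scaled_intercepts.append(float(N * np.exp(b0)))
# The slope shrinks towards 0 and N * exp(b0) approaches n as N grows.
```

As the imbalance grows, the fixed penalty overwhelms the (bounded) minority-class contribution to the likelihood, which is exactly the shrinkage-to-zero behaviour proved below.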
To consider the existence of a unique solution in penalized logistic regression, we follow the previous argument for logistic regression considered in [23].
Lemma 1. For data as described in Section 2.3, assume that the (n + N) × (p + 1) data matrix (including the constant column of ones to accommodate the intercept) has rank p + 1; then a unique finite ridge or lasso penalized logistic regression solution exists.
Proof. We first consider the case of the ridge penalty, then give adjustments for the lasso. In the proof, we use B to denote the (p + 1)-dimensional vector (β0, β). Other notation is the same as in Section 2.3. We consider the situations S ∩ F ≠ ∅ and S ∩ F = ∅ separately.
If there is no separation between the two convex cones S and F (S ∩ F ≠ ∅), we cannot find a hyperplane which separates S and F properly [21, Theorem 11.3]. Therefore, for any unit (p + 1)-dimensional vector e, z_1i^T e is negative for some 1 ≤ i ≤ n or z_0i^T e is positive for some 1 ≤ i ≤ N. Partition the indices according to the sign of these inner products: A1 = {i : z_1i^T e < 0}, A2 = {i : z_1i^T e ≥ 0}, A3 = {i : z_0i^T e ≤ 0} and A4 = {i : z_0i^T e > 0}, and consider the difference l(B + (k + 1)e) − l(B + ke), grouping its terms accordingly (where ẽ is e excluding the first component). The summation parts over i ∈ A2 and i ∈ A3 vanish as k → ∞. We also notice that, as k goes to infinity, the first term in Equation (18) shrinks to Σ_{i∈A1} z_1i^T e < 0, the second term shrinks to − Σ_{i∈A4} z_0i^T e < 0, and the third term, the penalty increment (1/2) λ ( ∥β + (k + 1)ẽ∥_2^2 − ∥β + kẽ∥_2^2 ), is positive and is subtracted. Thus l(B + (k + 1)e) − l(B + ke) will be negative for large k, so l(B + ke) is a decreasing function of k. Therefore, l(β0, β) for fixed x does not have a direction of recession, and the existence of the solution follows from Theorem 27.1(d) in [21].
Assuming S ∩ F = ∅, we can find a unit (p + 1)-dimensional vector e such that z_1i^T e ≥ 0 for all 1 ≤ i ≤ n and z_0i^T e ≤ 0 for all 1 ≤ i ≤ N. Considering two numerical facts:
• f1(x) = x − log(1 + e^x) is an increasing function and lim_{x→∞} f1(x) = 0,
• f2(x) = log( 1 / (1 + e^x) ) is a decreasing function and lim_{x→−∞} f2(x) = 0,
we know that the summation part in Equation (19) is increasing and approaching 0 as k → ∞. Thus, the term −(1/2) λ ∥β + kẽ∥_2^2 dominates Equation (19) as k → ∞, leading to the result that l(B + ke) is a decreasing function for large enough k. Therefore, l(β0, β) for fixed x does not have a direction of recession, and the existence of the solution still follows from Theorem 27.1(d) in [21]. This proof shows the existence of the solution for ridge penalized logistic regression. Following the lemmas in [27, Lemma 5] and [29, p. 45], we know the solution for ridge penalized logistic regression is unique. Through a similar proof process, Lemma 1 also holds for the lasso penalty. The only change is in handling the penalty term in Equation (18), whose increment becomes λ ( ∥β + (k + 1)ẽ∥_1 − ∥β + kẽ∥_1 ), which is still a positive number.

Table 3. Infinitely imbalanced logistic regression shrinkage law.

               Logistic regression    Ridge    Lasso
N exp(β̂0)     certain value, k1      n        n
β̂             certain value, k2      0        0
By repeating the previous numerical simulation (Table 1 and Table 2) with lasso penalized logistic regression, we obtain the shrinkage law (Table 3) when N → ∞. The simulation results suggest that exp(β̂0) shrinks to n/N and β̂ shrinks to 0, when N → ∞. We now prove this shrinkage law, and further prove that the estimated parameter vector β̂ depends only on the distribution F0, the minority mean vector x̄ and the imbalance level N/n. The lasso penalty is demonstrated in our proof process, because it is the more complex case; the differences in the proof for the ridge penalty are pointed out in Section 3.2.2. We again use the notation of Section 2.4. In order to show the result directly, we center lasso penalized logistic regression around the minority class mean vector x̄ = (1/n) Σ_{i=1}^{n} x_i. Since in the infinitely imbalanced case N → ∞, we also suppose that there is a good approximation for the conditional distribution of X given Y = 0, denoted by F0. Thus, the objective function for lasso penalized logistic regression [26] is written as

l(β0, β) = (1/(n + N)) { N ∫ log( 1 / (1 + exp(β0 + (x − x̄)^T β)) ) dF0(x) + Σ_{i=1}^{n} [ β0 + (x_i − x̄)^T β − log(1 + exp(β0 + (x_i − x̄)^T β)) ] } − λ Σ_{j=1}^{p} |β_j|,    (21)

where β_j is the jth element of β; following the usual convention for lasso-type estimators, the log-likelihood is averaged over all n + N observations. We follow Owen's proof again: Lemma 4 and Lemma 5 still hold for penalized logistic regression. The three changes in the proof process are for Lemma 6, Lemma 7 and the main theorem. Lemma 6b gives exp(β̂0) ≤ n/(N − n) and Lemma 7b gives Σ_{j=1}^{p} |β̂_j| ≤ n/((N − n)λ). Note that in Lemma 6b and Lemma 7b we do not require the surround condition, which makes the proof significantly different from Owen's proof.
The following theorem demonstrates the behavior of β̂0 and β̂ in infinitely imbalanced penalized logistic regression.

Theorem 3. Let n ≥ 1 and the minority class vectors x_1, · · · , x_n be fixed. Then the maximizer (β̂0, β̂) of l given by Equation (21) satisfies N exp(β̂0) → n and β̂ → 0 as N → ∞.
Proof. From Lemma 6b and Lemma 7b, we know exp(β̂0) ≤ n/(N − n) and Σ_{j=1}^{p} |β̂_j| ≤ n/((N − n)λ), so β̂ is bounded and in fact β̂ → 0 as N → ∞. Combining β̂ → 0 with the stationarity condition for β0 then yields N exp(β̂0) → n.
Since N exp(β̂0) → n and β̂ → 0 in penalized logistic regression when N → ∞ (regardless of the data set), the shrinkage properties of penalized methods mean that the data ceases to be involved. In this case, the estimated probabilities for the minority and the majority class simply approach their marginal frequencies, n/(n + N) and N/(n + N). Note that Theorem 3 is a strong result that demonstrates β̂ → 0 in infinitely imbalanced penalized logistic regression. We are also interested in how β̂ shrinks as N approaches infinity. Therefore, we prove another theorem to demonstrate the behavior of β̂ when approaching infinitely imbalanced penalized logistic regression. Here we will utilize exp(β̂0) → n/N as N → ∞ from Theorem 3.
Since the lasso penalized log-likelihood function is nondifferentiable at β = 0, we instead use the subgradient [18, p. 126] of the convex Equation (21) with respect to β. Setting the subgradient of Equation (21) to 0 involves a p-dimensional vector s with components

s_j = sign(β_j) if β_j ≠ 0, and s_j ∈ [−1, 1] if β_j = 0,

for j ∈ {1, · · · , p}.
We take advantage of a specific characteristic of the subgradient method: a convex function f attains its optimal value at a vector v if the zero vector is a subgradient of f at v [18, Theorem 3.1.15]. Thus, because Equation (21) is a convex problem, we have the following main theorem.

Theorem 4. Let n ≥ 1 and the minority class vectors {x_1, · · · , x_n} be fixed, and suppose that F0 surrounds x̄ = (1/n) Σ_{i=1}^{n} x_i as described. Then the maximizer (β̂0, β̂) of l given by Equation (21) satisfies

x̄ − ∫ x exp(x^T β̂) dF0(x) = ((n + N)/n) λ s    (28)

as N → ∞.
Proof. Setting the subgradient of Equation (21) to 0, we have

(1/(n + N)) { Σ_{i=1}^{n} (x_i − x̄) / (1 + exp(β0 + (x_i − x̄)^T β)) − N ∫ (x − x̄) exp(β0 + (x − x̄)^T β) / (1 + exp(β0 + (x − x̄)^T β)) dF0(x) } = λ s.    (29)

As N → ∞, the error made by ignoring the denominator inside the integral on the left side of Equation (29) vanishes, because β̂ is bounded as N → ∞ by Lemma 7b. If we consider N → ∞, we have exp(β̂0) → n/N and β̂ → 0, yielding exp(β̂0 + (x_i − x̄)^T β̂) → (n/N) e^0 → 0, so the minority-class sum approaches Σ_{i=1}^{n} (x_i − x̄) = 0. Thus Equation (29) yields

−(n/(n + N)) ∫ (x − x̄) exp((x − x̄)^T β̂) dF0(x) = λ s + o(1).

After simplification, using β̂ → 0, Equation (28) holds. Notice that s must converge to 0, by the established Equation (28), when N → ∞. Equation (29) is used to obtain analytic approximations; however, the expression so obtained is an asymptotic form rather than a limit, and gives useful results even for finite but large N. Equation (28) shows that the asymptotic solution of β̂ depends only on {x̄, F0(x), N/n} when approaching infinite imbalance. We give some context and illustration of Theorem 4, with respect to the solution of penalized logistic regression with large finite N, in the next section.

3.2.2. Changes in the proof for infinitely imbalanced ridge penalized logistic regression. The proof given above is specifically for the lasso penalty. Here, we briefly demonstrate the changes in the proof process required for infinitely imbalanced ridge penalized logistic regression.
In the proof of Theorem 3, the lasso penalty in Equation (25) is changed to its ridge counterpart −(1/2) λ ∥β∥_2^2, and maximizing l(β̂0, β̂) again leads to β̂ → 0. Then the results still hold.
In the proof of Theorem 4, we do not use the subgradient method because the derivative of the ridge penalty exists. Setting the partial derivative with respect to β of the ridge objective (Equation (17), in the centered and averaged form of Equation (21)) to 0, we have the counterpart of Equation (29) with λ s replaced by λ β.    (33)

After a similar simplification process, the maximizer (β̂0, β̂) satisfies

x̄ − ∫ x exp(x^T β̂) dF0(x) = ((n + N)/n) λ β̂    (34)

as N → ∞.

3.3. Numerical explanations for highly imbalanced penalized logistic regression. For application purposes, we are interested in investigating the solution of penalized logistic regression with large finite N. The following calculation and simulation provide some context and illustration of Theorem 4. Assume the probability density function of the majority class, f0(x), is N(0, I), where 0 is a p-dimensional zero vector and I is the p × p identity covariance matrix. By taking advantage of the independence between variables, we vectorise Equation (28) in dimension r (here x_{·r} refers to the rth column in the observation matrix):

x̄_{·r} − β̂_r exp( ∥β̂∥_2^2 / 2 ) = ((n + N)/n) λ s_r.    (35)

Consider that all β̂_{·r}, r ∈ {1, · · · , p}, are shrunk to 0; thus all s_{·r} ∈ [−1, 1]. For all r ∈ {1, · · · , p}, we have

|x̄_{·r}| ≤ λ (1 + N/n),    (36)

so x̄_{·r} must be located in the interval [ −λ(1 + N/n), λ(1 + N/n) ]. This interval enlarges as the imbalance level N/n increases, such that it will easily include all x̄_{·r}. Alternatively, if λ > (n/(n + N)) max_r {x̄_{·r}}, all β̂_{·r} will be shrunk to 0. When the imbalance level N/n is large, this condition is easily satisfied for a fixed λ.

Table 4. Coefficient estimates of lasso penalized logistic regression with different penalty parameter λ.
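The interval condition is simple arithmetic. For instance, with a hypothetical imbalance of n = 1,000 bads to N = 100,000 goods, invented column means, and µ_x = 0 (as for the standard-normal majority class above), the threshold on λ is tiny:

```python
import numpy as np

# Hypothetical sizes and minority-class column means (illustrative only).
n, N = 1_000, 100_000
xbar = np.array([1.9, 1.7, 1.2, 0.8, 0.5])

# Every coefficient is shrunk to 0 once lambda exceeds n/(n + N) * max_r |xbar_r|.
lam_threshold = n / (n + N) * np.abs(xbar).max()   # about 0.0188 here

def all_coefficients_zero(lam):
    """Interval condition of Equation (36), assuming mu_x = 0:
    each |xbar_r| must lie within lam * (1 + N/n)."""
    return bool(np.all(np.abs(xbar) <= lam * (1 + N / n)))
```

So even a mild-looking λ of 0.02 zeroes every coefficient at this imbalance level, while λ = 0.01 does not.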
The following is a simple simulation to demonstrate the consequence of Equation (36). We generate 100,000 five-dimensional majority class samples (Y = 0) following N(µ0, Σ0), where µ0 = 0, Σ0 = I and I is the identity matrix. 1,000 minority class samples (Y = 1) are generated following N(µ1, Σ1), where µ1 = [1.9, 1.7, 1.…]. Table 4 shows the coefficient estimates of lasso penalized logistic regression for different values of λ. We find that the relationship between x̄_{·r} and λ determines whether the coefficient estimate β̂_{·r} is shrunk to 0 (all coefficient estimates shrink to 0 when λ > (n/(n + N)) max_r {x̄_{·r}}; see Table 4, Line 1).
Without an assumption on the majority class distribution F0(x), if β̂ = 0, Equation (28) simplifies to x̄ − µ_x = ((n + N)/n) λ s, where µ_x = ∫ x dF0(x) is the population mean of the majority class distribution, which is determined by the data set. Thus we need x̄ ∈ [ µ_x − ((n + N)/n) λ, µ_x + ((n + N)/n) λ ] to force all β̂ to shrink to 0, which is easy to satisfy with a highly imbalanced data set.

4. Example: Cluster structure in the minority class and relabelling. In light of the results given so far, if we suppose the minority class has some underlying structure, this could be problematic in the infinitely imbalanced regime, simply because the mean vector of clustered data is unlikely to be a good representation.
To explore this problem we first relabel the minority class into two new (both "bad") subclasses using a clustering algorithm, then use multinomial logistic regression to model the two bad classes along with the majority good class. This can be achieved by using standard clustering methods. Here, it is still possible to reason about the good/bad dichotomy, simply by considering the predictions Pr(Y = 0|X = x). This suggestion, to split the "bads" into subclasses, is familiar in credit scoring where different types of bad debt are considered, such as the "can't pay" and "won't pay" behavioural distinction [3].
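The procedure just described can be sketched end-to-end: cluster the bads, fit a multinomial (softmax) model over {good, bad1, bad2}, and score Pr(bad) as the sum of the two bad-class probabilities. Everything below (data, cluster initialisation, step sizes) is an illustrative assumption rather than the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax_fit(X, y, n_classes, n_iter=3000, lr=0.5):
    """Multinomial logistic regression by gradient ascent on the
    (concave) average multinomial log-likelihood."""
    Z = np.column_stack([np.ones(len(X)), X])
    W = np.zeros((Z.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                      # one-hot targets
    for _ in range(n_iter):
        S = Z @ W
        P = np.exp(S - S.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        W += lr * Z.T @ (Y - P) / len(X)
    return W

def prob_bad(W, X):
    """Pr(Y != good) = sum of the bad-subclass probabilities."""
    Z = np.column_stack([np.ones(len(X)), X])
    S = Z @ W
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return 1.0 - P[:, 0]

def two_means(X, n_iter=50):
    """Plain 2-means; centres initialised at the extreme points along x1."""
    C = X[[int(np.argmin(X[:, 0])), int(np.argmax(X[:, 0]))]].copy()
    for _ in range(n_iter):
        lab = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for k in (0, 1):
            if np.any(lab == k):
                C[k] = X[lab == k].mean(axis=0)
    return lab

def auc(score, y):
    """AUC via the rank-sum (Mann-Whitney) statistic."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n1 = int(np.sum(y)); n0 = len(y) - n1
    return float((ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1))

def simulate(m_good, m_bad):
    """One good cluster at the origin, two tight bad clusters."""
    Xg = rng.normal(size=(m_good, 2))
    Xb = np.vstack([rng.normal([0.0, 2.0], 0.3, size=(m_bad, 2)),
                    rng.normal([2.0, 0.0], 0.3, size=(m_bad, 2))])
    y = np.concatenate([np.zeros(m_good, int), np.ones(2 * m_bad, int)])
    return np.vstack([Xg, Xb]), y

X, y = simulate(10_000, 50)
W_lr = softmax_fit(X, y, 2)               # without relabelling (= logistic regression)
y3 = y.copy()
y3[y == 1] = 1 + two_means(X[y == 1])     # relabel bads into subclasses 1 and 2
W_multi = softmax_fit(X, y3, 3)           # with relabelling

Xt, yt = simulate(10_000, 50)             # fresh test data
auc_lr = auc(prob_bad(W_lr, Xt), yt)
auc_multi = auc(prob_bad(W_multi, Xt), yt)
```

The good/bad dichotomy is recovered at scoring time by summing over the bad subclasses, so relabelling changes the model, not the business question being answered.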
To illustrate the issue of cluster structure in the minority class, we simulate a bivariate normal distribution example. We generate 10,000 sample points following X ∼ N(µ1, Σ1) from the majority class (Y = 0). Then 50 points following X ∼ N(µ2, Σ2) and 50 points following X ∼ N(µ3, Σ3) are generated and combined as the single minority class. Here µ1 = [0, 0], µ2 = [0, 2], µ3 = [2, 0] and Σ1 = (1 0; 0 1). Clearly, this is rather contrived, but it serves for the purposes of illustration. In Figure 2, dark blue points represent the majority class and red points represent the minority class. The two red contour lines indicate that the minority class data lie in two well-separated clusters. We train two models on this data set. The first is a standard logistic regression model; the second is a multinomial logistic regression model which has one majority class (c1, Y = 0) and two separate minority classes (c2 and c3, Y = 1), which are separated using K-means clustering (with K = 2). Table 5 shows the coefficients for the two different models: the first three columns are the logistic regression coefficients, and the right four columns are the multinomial logistic regression coefficients. By construction, the minority class is well-separated. Finally, we generate test data following the distribution described above, to assess out-of-sample predictive performance. The prediction AUC (area under the ROC curve [9]) of logistic regression is 0.918, which is lower than the AUC of multinomial logistic regression (0.954). Although the coefficient table shows that both intercept and slope terms are significant in logistic regression, all of these results suggest that logistic regression under-performs multinomial logistic regression when cluster structure is taken into account.

5. Illustration: Default model for mortgage data. In this section, we conduct an analysis of the Freddie Mac data set introduced in Section 1. Table 8 in the Appendix provides a description of the variables in this data set. The theory in the previous sections shows that penalized or weighted logistic regression cannot avoid the consequences of highly imbalanced data. We apply our relabelling approach to the Freddie Mac data to test if and when a problem emerges in highly imbalanced logistic regression.

5.1. Data preparation. In the financial industry, the log transformation is frequently used to make highly skewed data less skewed and to reduce the influence of outliers [1]. We apply the log transformation to the variable "UPB", since "UPB" is left skewed in the Freddie Mac data. "Seller" and "Servicer" are transformed from categorical variables to numerical variables using the weight of evidence approach [25, p. 25], because "Seller" and "Servicer" contain many category levels. All other categorical variables are dummy coded to binary variables.
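Weight of evidence replaces each category level c with WoE(c) = ln(P(c | good) / P(c | bad)); the sign convention varies between references, and the seller data below are invented for illustration:

```python
import numpy as np

def woe_table(categories, y):
    """Weight-of-evidence value for each category level:
    WoE(c) = ln( P(category = c | good) / P(category = c | bad) )."""
    categories = np.asarray(categories)
    y = np.asarray(y)
    good, bad = (y == 0), (y == 1)
    table = {}
    for c in np.unique(categories):
        p_good = np.sum(categories[good] == c) / np.sum(good)
        p_bad = np.sum(categories[bad] == c) / np.sum(bad)
        table[c] = float(np.log(p_good / p_bad))
    return table

# Toy "Seller" variable: seller A has a 10% bad rate, seller B has 40%.
cats = np.array(["A"] * 100 + ["B"] * 100)
y = np.concatenate([np.zeros(90), np.ones(10),
                    np.zeros(60), np.ones(40)])
table = woe_table(cats, y)
# table["A"] = ln((90/150)/(10/50)) = ln 3; table["B"] = ln((60/150)/(40/50)) = ln 0.5
```

The encoding maps many sparse category levels onto one ordered numerical scale, which is why it suits variables like "Seller" with many levels.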
After dummy coding categorical variables into binary variables, we delete the variables which are constant in the minority class, because otherwise separation appears in some dummy variables. For example, the variable "number of units" is dummy coded into several indicator variables ("number.units2", "number.units3", "number.units4"). In 2005, for example, some of these categories are barely populated. Thus, after dummy coding the variable "number of units" into binary variables, the new dummy variable "number.units4" takes the constant value 0 in the minority (default) class, which makes the coefficient estimate of "number.units4" in logistic regression fail to be finite.
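This screening step can be sketched as below. The helper name `drop_separable_dummies` and the toy data mimicking "number.units4" are our illustration, not the paper's code.

```python
import pandas as pd

def drop_separable_dummies(X, y):
    """Drop dummy columns that are constant within the minority class
    (y == 1). Such columns are separated from the outcome, so their
    logistic regression coefficient estimates diverge to +/- infinity."""
    minority = X[y == 1]
    constant = [c for c in X.columns if minority[c].nunique() <= 1]
    return X.drop(columns=constant), constant

# Hypothetical example mimicking "number.units4": all zeros among defaults.
X = pd.DataFrame({"number.units2": [1, 0, 0, 1, 0],
                  "number.units4": [0, 0, 1, 0, 0]})
y = pd.Series([1, 1, 0, 0, 0])   # rows 0-1 form the minority (default) class
X_clean, dropped = drop_separable_dummies(X, y)
```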

5.2. Model building procedure. Table 6 illustrates the experimental procedure.
After data preparation, we use data from an individual year (e.g. 2000) as a training set to train two different models: "without relabelling" and "with relabelling". Here, "without relabelling" refers to fitting a logistic regression to the prepared data. The model output is the estimated posterior probability of a mortgage account defaulting, given its feature vector.
In the "with relabelling" procedure, we use hierarchical clustering to split the minority class into two clusters. These two clusters provide two new default classes which, along with the non-default class, make up a new relabelled data set. A multinomial logistic regression is then used to model the three classes in this relabelled data set. The model output is still the posterior probability of defaulting, obtained by summing the posterior probabilities of the two new default classes.
Although many kinds of unsupervised clustering methods are available, we choose hierarchical clustering here because it is fast and because the algorithm is deterministic [17]. We manually set a threshold at which the minority class is clustered into two groups. We also considered K-means and K-medians clustering, since we used these in the simulation (Section 4) and they are a common choice for clustering; however, hierarchical clustering gave marginally better results.
A two-year gap (e.g. 2001-2002) is used for collecting default status information. Using contiguous windows for training is common practice in the credit risk industry. There are two main factors which influence the choice of the time gap:
• On one hand, we need to keep the observation time for default long enough to capture adequate default information.
• On the other hand, we also need to keep the model up-to-date, which requires a short observation time. For example, if we increase the two-year gap to four years, the model is unlikely to be up-to-date. This means, for example, a model built in 2000 will be used to predict the data of each quarter in 2005 rather than 2003.
A two-year gap is a reasonable choice to balance this trade-off. Finally, these models (i.e. "without relabelling" and "with relabelling") are used to forecast default events in the four quarters of the test year. The performance assessment metric is AUC, as this is a common industry measure of performance [25]. To explore modelling and performance issues over a long time horizon, this procedure is repeated over an eleven-year period.
Figure 3 shows the test set AUC for each of the methods over the observation period, and Table 9 in the Appendix shows the AUC on the test sets and the corresponding standard deviation calculated by DeLong's method [7], repeated for 200 random bootstrap test sets and 200 stratified bootstrap [20] test sets in each quarter separately. Notice that the relabelling method enhances performance in certain years, such as 2005, 2006 and 2012, but fails to do so in others, such as 2007 and 2010. However, where relabelling performs worse than logistic regression, the difference is small, whereas we observe a large improvement in certain periods.
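The stratified bootstrap assessment of the test-set AUC can be sketched as follows; the function name and the synthetic scores are our illustration (we use 200 replicates as in the paper, but resample with a simple bootstrap rather than DeLong's analytic variance).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_bootstrap_auc(y_true, scores, n_boot=200, seed=0):
    """Estimate the test-set AUC and its spread by resampling defaults and
    non-defaults separately (a stratified bootstrap), so every replicate
    preserves the original class imbalance."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_true == 1)
    neg = np.flatnonzero(y_true == 0)
    aucs = []
    for _ in range(n_boot):
        idx = np.concatenate([rng.choice(pos, len(pos)),
                              rng.choice(neg, len(neg))])
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    return float(np.mean(aucs)), float(np.std(aucs))

# Hypothetical model scores on an imbalanced test set (5% default rate).
rng = np.random.default_rng(1)
y = np.array([1] * 50 + [0] * 950)
s = np.concatenate([rng.normal(1, 1, 50), rng.normal(0, 1, 950)])
mean_auc, sd_auc = stratified_bootstrap_auc(y, s)
```

Stratifying the resampling guarantees that every bootstrap replicate contains both classes, which an unstratified bootstrap cannot guarantee at high imbalance.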
Given the context of the data, there are a number of other issues (besides imbalance) that make a simple performance comparison difficult. Any model's efficacy may vary over time for a number of reasons, and population drift (the distribution changing over time [13]) is a key concern, especially in an experiment over a long time frame. We want to provide some simple evidence regarding population drift, to enable us to better understand the behaviour of logistic regression with imbalance.
Detecting evidence of population drift can be challenging. Here, we use a simple heuristic procedure to suggest which years (training set and corresponding test set) exhibit evidence of drift. Essentially, via the bootstrap, we compare a bias-corrected estimate of the distribution of the "resubstitution" AUC (the measure computed on the training data) with the corresponding distribution of the test set AUC. Differences in these distributions provide a crude indication of the presence of population drift. Specifically, bootstrapped training sets are used for model training and the out-of-bootstrap training samples are used to calculate the debiased AUC of the training set. We repeat this process 100 times to obtain the distribution of the training set AUC. We also bootstrap the test sets 100 times in each test quarter to obtain the distributions of the test set AUC. Figure 7 in the Appendix shows the results: the different lines are the distributions of the training set AUC and the corresponding test set AUC in the four test quarters respectively. This figure suggests that population drift occurs in a number of years.
We seek to quantify the population drift between these distributions. The Kolmogorov-Smirnov (K-S) statistic [16] measures the distance (D) between two empirical distributions. Here, we use the 100 debiased training set AUCs and all four quarters' test set AUCs (4 × 100) as the samples of the two empirical distributions. Table 7 gives D for each year. Generally, we find that population drift happens very often in our experiment (unsurprising, since the observation period spans the financial crash of 2008), but the "without relabelling" and "with relabelling" methods behave differently with respect to population drift. The D statistics of "with relabelling" are consistently less than those of "without relabelling", which may suggest that our approach helps alleviate the influence of population drift.
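The drift heuristic above amounts to one two-sample K-S test per training year. A minimal sketch, in which the two AUC samples are synthetic stand-ins for the 100 out-of-bootstrap training AUCs and the 4 × 100 test-quarter AUCs:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for one year's bootstrap output (values are made up):
# 100 debiased training-set AUCs and 4 quarters x 100 test-set AUCs.
rng = np.random.default_rng(2)
train_aucs = rng.normal(0.80, 0.01, 100)
test_aucs = rng.normal(0.74, 0.02, 4 * 100)

# D is the maximum gap between the two empirical CDFs; a large D for a
# given year is taken as crude evidence of population drift.
D, p_value = ks_2samp(train_aucs, test_aucs)
```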
Table 10 in the Appendix provides information about the default rate and the mean AUC difference ("with relabelling" - "without relabelling") in each year. Figure 4 plots this mean AUC difference in each year against the default rate in the training year and the population drift scenario. For illustration, we arbitrarily define the absence of population drift as D < 0.5 for the "with relabelling" approach. First, note that higher default rates are associated with population drift. Second, for these high default rate scenarios, there is no compelling evidence to prefer either model. However, in the low default rate scenario, i.e. with class imbalance, "with relabelling" performs better than "without relabelling" when population drift does not happen, while logistic regression is preferred when population drift is present. These post-hoc diagnostic results suggest the shortcomings of logistic regression with imbalanced data.
For this application, it is also meaningful to investigate the difference in characteristics between the different default groups arising from relabelling. We look particularly at the training years 2002 and 2004, since these correspond to test years 2005 and 2007 respectively: 2005 was a year in which logistic regression performed poorly relative to the relabelling approach, while in 2007 logistic regression's performance was acceptable, as can be read from Figure 3. The distributions of key variables are shown in Figures 5 and 6, together with a comparison [24] between "Default 1" and "Default 2" for each variable. For 2002, we see clear differences in the distributions of DTI and LTV for the two default clusters: in particular, cluster 1 represents a group with much higher LTV but, interestingly, a similar range of FICO score. For 2004, we also see clear differences in LTV. However, one of the default groups has generally lower LTV than the non-default group, which is contrary to conventional credit-scoring wisdom: expert knowledge [5] of mortgage markets indicates that higher LTV brings higher risk of default. This may therefore not express a genuinely useful clustering for modelling default risk.

6. Conclusion. We have explored the impact of class imbalance on the logistic regression methods used for credit scoring. Owen's results show that, in some limit, unwanted estimation artifacts arise when classes are highly imbalanced. We extend these results to show that some candidate approaches for mitigating these effects, namely weighted and penalized likelihood approaches, suffer from the same problem. This limiting behaviour is a characteristic of logistic regression itself, regardless of any particular imbalanced data set, which tells us that the infinite imbalance problem is fundamental to logistic regression. The challenge of severe class imbalance is profound.
Since logistic regression remains the workhorse of credit scoring and many other applications [12,31], the issue of imbalance certainly demands more attention, and the development of enhanced tools.
To mitigate the problem of highly imbalanced logistic regression, we made use of a relabelling approach. Results for the credit scoring example are promising and indicate that the relabelling method works well in conditions of low population drift and high imbalance. This suggests the relabelling approach as a practical solution to the problems discussed here. Further work is required to determine the correct clustering methodology and to identify more precisely the circumstances in which clustering of the rare class benefits the classification problem. This will be the subject of future research.
Appendix. This section provides the figures and tables mentioned in the paper.

Figure 7. Density plots of the AUC on the training year and the four test quarters respectively; the left side shows the "with relabelling" method and the right side "without relabelling".