December 2019, 1(4): 389-417. doi: 10.3934/fods.2019016

Issues using logistic regression with class imbalance, with a case study from credit risk modelling

Department of Mathematics, Imperial College London, London, SW7 2AZ, UK

* Corresponding author: Yazhe Li

Published December 2019

The class imbalance problem arises in two-class classification problems, when the less frequent (minority) class is observed much less than the majority class. This characteristic is endemic in many problems such as modeling default or fraud detection. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen's results to show the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance and that an approach that is able to model underlying structure in the minority class is often superior.
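The mean-replacement behaviour described in the abstract can be checked numerically. The sketch below is illustrative only and is not the authors' code: the sample sizes, the standard normal majority class, the two minority sub-groups at 0.5 and 1.5, and the helper name `fit_logistic_1d` are all assumptions chosen for the demonstration. It fits a one-feature logistic regression twice, once on all minority points and once with the minority class collapsed to a single weighted point at its mean. Owen's result is asymptotic, so at a finite majority size the two fits should agree only approximately.

```python
import math
import random


def fit_logistic_1d(xs, ys, ws, iters=50, tol=1e-10):
    """Weighted one-feature logistic regression fitted by Newton's method.

    Returns (intercept, slope). Hypothetical helper written for this sketch.
    """
    pos = sum(w for y, w in zip(ys, ws) if y == 1)
    neg = sum(w for y, w in zip(ys, ws) if y == 0)
    b0, b1 = math.log(pos / neg), 0.0  # start from the intercept-only fit
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in zip(xs, ys, ws):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            r = w * (y - p)          # weighted score contribution
            v = w * p * (1.0 - p)    # weighted information contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step, intercept
        d1 = (h00 * g1 - h01 * g0) / det   # Newton step, slope
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


random.seed(0)
N, n = 20000, 100                                  # assumed class sizes
majority = [random.gauss(0.0, 1.0) for _ in range(N)]
minority = [0.5] * (n // 2) + [1.5] * (n // 2)     # two sub-groups, mean 1.0
x_bar = sum(minority) / n

# Fit 1: every minority observation is kept.
b0_full, b_full = fit_logistic_1d(
    majority + minority, [0] * N + [1] * n, [1.0] * (N + n)
)

# Fit 2: the minority class is replaced by its mean, carrying weight n.
b0_mean, b_mean = fit_logistic_1d(
    majority + [x_bar], [0] * N + [1], [1.0] * N + [float(n)]
)

print("slope, full minority data: ", round(b_full, 4))
print("slope, minority mean only: ", round(b_mean, 4))
```

The two slopes should be close even though the minority class has two distinct sub-groups, which is the sense in which structure within the rare class is not captured by the fitted model.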

Citation: Yazhe Li, Tony Bellotti, Niall Adams. Issues using logistic regression with class imbalance, with a case study from credit risk modelling. Foundations of Data Science, 2019, 1 (4) : 389-417. doi: 10.3934/fods.2019016
References:
[1] E. I. Altman and G. Sabato, Modelling credit risk for SMEs: Evidence from the US market, Abacus, 43 (2007), 332-357. doi: 10.1111/j.1467-6281.2007.00234.x.
[2] G. E. Batista, R. C. Prati and M. C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter, 6 (2004), 20-29. doi: 10.1145/1007730.1007735.
[3] C. Bravo, L. C. Thomas and R. Weber, Improving credit scoring by differentiating defaulter behaviour, Journal of the Operational Research Society, 66 (2015), 771-781. doi: 10.1057/jors.2014.50.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16 (2002), 321-357. doi: 10.1613/jair.953.
[5] T. M. Clauretie, A note on mortgage risk: Default vs. loss rates, Real Estate Economics, 18 (1990), 202-206. doi: 10.1111/1540-6229.00517.
[6] Cornell Law School, Definition of default, date of default, and requirement of notice of default, URL https://www.law.cornell.edu/cfr/text/24/203.467.
[7] E. R. DeLong and D. L. Clarke-Pearson, Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach, Biometrics, 44 (1988), 837-845. doi: 10.2307/2531595.
[8] B. Efron and T. Hastie, Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, Institute of Mathematical Statistics (IMS) Monographs, 5. Cambridge University Press, New York, 2016. doi: 10.1017/CBO9781316576533.
[9] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, 27 (2006), 861-874. doi: 10.1016/j.patrec.2005.10.010.
[10] D. J. Hand, Reject inference in credit operations, Credit Risk Modeling: Design and Application, 181-190.
[11] A. E. Hoerl and R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics, 12 (1970), 55-67.
[12] G. King and L. Zeng, Logistic regression in rare events data, Political Analysis, 9 (2001), 137-163.
[13] G. Krempl and V. Hofer, Classification in presence of drift and latency, in Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, IEEE, 2011, 596-603. doi: 10.1109/ICDMW.2011.47.
[14] J. Laurikkala, Improving identification of difficult small classes by balancing class distribution, Artificial Intelligence in Medicine, 2101 (2001), 63-66. doi: 10.1007/3-540-48229-6_9.
[15] X.-Y. Liu, J. Wu and Z.-H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39 (2009), 539-550.
[16] F. J. Massey Jr, The Kolmogorov-Smirnov test for goodness of fit, Journal of the American Statistical Association, 46 (1951), 68-78.
[17] F. Murtagh and P. Contreras, Algorithms for hierarchical clustering: An overview, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2 (2012), 86-97.
[18] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Applied Optimization, 87. Kluwer Academic Publishers, Boston, MA, 2004. doi: 10.1007/978-1-4419-8853-9.
[19] A. B. Owen, Infinitely imbalanced logistic regression, Journal of Machine Learning Research, 8 (2007), 761-773.
[20] O. Pons, Bootstrap of means under stratified sampling, Electronic Journal of Statistics, 1 (2007), 381-391. doi: 10.1214/07-EJS033.
[21] R. Rockafellar, Convex Analysis, Princeton University Press, Princeton, N.J., 1970.
[22] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse and A. Napolitano, Resampling or reweighting: A comparison of boosting implementations, in 2008 20th IEEE International Conference on Tools with Artificial Intelligence, IEEE, 1 (2008), 445-451. doi: 10.1109/ICTAI.2008.59.
[23] M. J. Silvapulle, On the existence of maximum likelihood estimators for the binomial response models, Journal of the Royal Statistical Society. Series B (Methodological), 43 (1981), 310-313. doi: 10.1111/j.2517-6161.1981.tb01676.x.
[24] Student, The probable error of a mean, Biometrika, 6 (1908), 1-25.
[25] L. C. Thomas, Consumer Credit Models: Pricing, Profit and Portfolios, Oxford, 2009.
[26] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological), 58 (1996), 267-288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
[27] R. Tibshirani, The lasso problem and uniqueness, Electronic Journal of Statistics, 7 (2013), 1456-1490. doi: 10.1214/13-EJS815.
[28] H. Wang, Q. Xu and L. Zhou, Large unbalanced credit scoring using lasso-logistic regression ensemble, PLoS ONE, 10 (2015), e0117844. doi: 10.1371/journal.pone.0117844.
[29] W. N. van Wieringen, Lecture notes on ridge regression, arXiv preprint, arXiv: 1509.09169.
[30] G. Zeng, On the existence of maximum likelihood estimates for weighted logistic regression, Communications in Statistics-Theory and Methods, 46 (2017), 11194-11203. doi: 10.1080/03610926.2016.1260742.
[31] M. Zhu, W. Su and H. A. Chipman, Lago: A computationally efficient approach for statistical detection, Technometrics, 48 (2006), 193-205. doi: 10.1198/004017005000000643.

Figure 1. Sample size and default rate from 2003 to 2013 in the Freddie Mac data set
Figure 2. Scatter plot of the simulation samples. Green points represent the majority class and red points represent the minority class
Figure 3. AUC plot of different methods from test year 2003 to 2013
Figure 7. Density plots of AUC for the training year and each of the four test quarters. The left side is the "with relabelling" method and the right side is "without relabelling"
Figure 4. Scatter plot of mean AUC difference in the test year vs. default rate in the training year
Figure 5. Boxplots for Score, DTI, UPB, LTV and OIR in the year 2002. The $ p $-values in the plot are calculated using Student's $ t $-test [24] between "Default 1" and "Default 2" for each variable
Figure 6. Boxplots for Score, DTI, UPB, LTV and OIR in the year 2004. The $ p $-values in the plot are calculated using Student's $ t $-test [24] between "Default 1" and "Default 2" for each variable
Table 1.  Simulation A for infinitely imbalanced penalized logistic regression. $ N $ observations in majority class ($ Y = 0 $) following $ N(0,1) $ and 100 observations in minority class with $ Y = 1, X = 1 $
| $ N $ | Logistic Regression: $ \beta $ | Logistic Regression: $ Ne^{\beta_0} $ | Ridge: $ \beta_0 $ | Ridge: $ e^{\beta_0} $ | Ridge: $ Ne^{\beta_0} $ | Ridge: $ \beta $ | Ridge: $ e^{\beta} $ |
|---|---|---|---|---|---|---|---|
| 100 | 1.1215 | 41.7805 | -0.5247 | 0.5917 | 59.1750 | 0.6879 | 1.9896 |
| 1000 | 0.5656 | 65.3495 | -2.4591 | 0.0855 | 85.5127 | 0.2454 | 1.2782 |
| 10000 | 0.5013 | 68.3830 | -4.6289 | 0.0098 | 97.6581 | 0.0450 | 1.0460 |
| 100000 | 0.5007 | 68.6940 | -6.9102 | 0.0010 | 99.7516 | 0.0049 | 1.0050 |
| 1000000 | 0.5001 | 68.7254 | -9.2106 | 0.0001 | 99.9750 | 0.0005 | 1.0005 |
Table 2.  Simulation B for infinitely imbalanced penalized logistic regression. $ N $ observations in majority class ($ Y = 0 $) following $ \mathrm{Uniform}(0,1) $ and 100 observations in minority class (half of them with $ Y = 1, X = 0.5 $, the others with $ Y = 1, X = 2 $)
| $ N $ | Logistic Regression: $ \beta $ | Logistic Regression: $ Ne^{\beta_0} $ | Ridge: $ \beta_0 $ | Ridge: $ e^{\beta_0} $ | Ridge: $ Ne^{\beta_0} $ | Ridge: $ \beta $ | Ridge: $ e^{\beta} $ |
|---|---|---|---|---|---|---|---|
| 100 | 2.2347 | 16.2756 | -1.0602 | 0.3464 | 34.6374 | 1.2598 | 3.5246 |
| 1000 | 3.2033 | 8.4214 | -3.4516 | 0.0317 | 31.6947 | 1.6478 | 5.1958 |
| 10000 | 4.6591 | 2.8035 | -4.9902 | 0.0068 | 68.0441 | 0.7112 | 2.0364 |
| 100000 | 6.3475 | 0.7238 | -6.9521 | 0.0010 | 95.6659 | 0.0878 | 1.0918 |
| 1000000 | 8.1866 | 0.1524 | -9.2148 | 0.0001 | 99.5517 | 0.0090 | 1.0090 |
Table 3.  Infinitely imbalanced logistic regression shrinkage law
| Fixture | Logistic Regression | Ridge | Lasso |
|---|---|---|---|
| $ \beta_0 $ | $ -\infty $ | $ -\infty $ | $ -\infty $ |
| $ N e^{\beta_0} $ | certain value, $ k_1 $ | $ n $ | $ n $ |
| $ \beta $ | certain value, $ k_2 $ | 0 | 0 |
Table 4.  Coefficient estimates of lasso penalized logistic regression with different penalty parameter $ \lambda $
| $ \lambda $ | $ \beta_{\cdot 1} $ | $ \beta_{\cdot 2} $ | $ \beta_{\cdot 3} $ | $ \beta_{\cdot 4} $ | $ \beta_{\cdot 5} $ |
|---|---|---|---|---|---|
| 0.0190 | 0 | 0 | 0 | 0 | 0 |
| 0.0168 | 0.1650 | 0 | 0 | 0 | 0 |
| 0.0153 | 0.3106 | 0.1148 | 0 | 0 | 0 |
| 0.0139 | 0.4388 | 0.2416 | 0.0377 | 0 | 0 |
| 0.0116 | 0.6435 | 0.4445 | 0.2392 | 0.0471 | 0 |
| 0.0087 | 0.8621 | 0.6581 | 0.4525 | 0.2547 | 0.0516 |
Table 5. Coefficients of logistic regression and of two-cluster multinomial logistic regression. The left three columns are for logistic regression and the right four columns are for multinomial logistic regression

| Logistic Regression: Coefficients | Estimate | $ \text{Pr}(>|z|) $ | Multinomial Logistic Regression: Cluster | Coefficients | Estimate | $ \text{Pr}(>|t|) $ |
|---|---|---|---|---|---|---|
| Intercept | -5.7705 | $< 2\times 10^{-16} $ | $ c2 $ | Intercept | -7.6828 | $< 2\times 10^{-16} $ |
| $ x_1 $ | 1.1384 | $< 2\times 10^{-16} $ | $ c3 $ | Intercept | -7.6828 | $< 2\times 10^{-16} $ |
| $ x_2 $ | 1.1287 | $< 2\times 10^{-16} $ | $ c2 $ | $ x_1 $ | 0.0818 | 0.6045 |
| | | | $ c3 $ | $ x_1 $ | 2.2775 | $< 2\times 10^{-16} $ |
| | | | $ c2 $ | $ x_2 $ | 2.3532 | $< 2\times 10^{-16} $ |
| | | | $ c3 $ | $ x_2 $ | 0.0953 | 0.5412 |
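
To see how the two fitted models in Table 5 score an observation differently, the following sketch plugs the tabulated estimates into the logistic and multinomial formulas. It is a hedged illustration, not the authors' scoring code: it assumes the majority class is the reference category of the multinomial model (linear predictor fixed at zero), which the table does not state, and it assumes the minority probability under relabelling is the sum of the two cluster probabilities. The function names are hypothetical.

```python
import math

# Estimates copied from Table 5.
LOGISTIC = {"intercept": -5.7705, "x1": 1.1384, "x2": 1.1287}
MULTINOMIAL = {
    "c2": {"intercept": -7.6828, "x1": 0.0818, "x2": 2.3532},
    "c3": {"intercept": -7.6828, "x1": 2.2775, "x2": 0.0953},
}


def linear_predictor(coef, x1, x2):
    """Intercept plus the weighted sum of the two features."""
    return coef["intercept"] + coef["x1"] * x1 + coef["x2"] * x2


def p_minority_logistic(x1, x2):
    """Minority-class probability under the single logistic regression."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(LOGISTIC, x1, x2)))


def p_minority_relabelled(x1, x2):
    """Minority-class probability under the two-cluster multinomial model.

    Assumption: the majority class is the reference category, so its
    unnormalised score is exp(0) = 1, and the minority probability is the
    sum of the cluster probabilities for c2 and c3.
    """
    scores = [
        math.exp(linear_predictor(coef, x1, x2))
        for coef in MULTINOMIAL.values()
    ]
    return sum(scores) / (1.0 + sum(scores))


# Illustrative points (chosen for this sketch, not taken from the paper).
for point in [(0.0, 3.0), (3.0, 0.0), (1.5, 1.5)]:
    print(
        point,
        "logistic:", round(p_minority_logistic(*point), 4),
        "relabelled:", round(p_minority_relabelled(*point), 4),
    )
```

Under these assumptions the multinomial model assigns more minority probability than the logistic model along each single axis, where one cluster has a large coefficient, and less at a point midway between the two directions.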
Table 8.  Description of variables in the Freddie Mac data set
| Variable | Type | Description |
|---|---|---|
| Default | Categorical | Dependent variable: 1 if the borrower is more than 180 days past due on monthly installments; 0 otherwise. |
| Score | Continuous | A number, prepared by third parties, summarizing the borrower's creditworthiness, which may be indicative of the likelihood that the borrower will repay future obligations on time. |
| DTI | Continuous | Original Debt-To-Income Ratio. |
| UPB | Continuous | Unpaid Principal Balance. |
| LTV | Continuous | Original Loan-To-Value. |
| OIR | Continuous | Original Interest Rate. |
| Number of Borrowers | Categorical | The number of borrower(s) who are obligated to repay the mortgage note secured by the mortgaged property. 1 = one borrower; 2 = more than one borrower. |
| Seller | Categorical | The entity acting in its capacity as a seller of mortgages to Freddie Mac at the time of acquisition. |
| Servicer | Categorical | The entity acting in its capacity as the servicer of mortgages to Freddie Mac as of the last period for which loan activity is reported in the Dataset. |
| First Time Homebuyer | Categorical | Y = Yes; N = No. |
| Number of Units | Categorical | Denotes whether the mortgage is a one-, two-, three-, or four-unit property. |
| Occupancy Status | Categorical | O = Owner Occupied; I = Investment Property; S = Second Home; Space = Unknown. |
| Channel | Categorical | R = Retail; B = Broker; C = Correspondent; T = TPO Not Specified; Space = Unknown. |
| PPM | Categorical | Denotes whether the mortgage is a Prepayment Penalty Mortgage. Y = PPM; N = Not PPM. |
| Property Type | Categorical | CO = Condo; LH = Leasehold; PU = PUD; MH = Manufactured Housing; SF = 1-4 Fee Simple; CP = Co-op; Space = Unknown. |
| Loan Purpose | Categorical | P = Purchase; C = Cash-out Refinance; N = No Cash-out Refinance; Space = Unknown. |
Table 6.  Experiment Procedure Time Table
| Training set year | 2000 | 2001 | $ \cdots $ |
|---|---|---|---|
| Default collection years | 2001, 2002 | 2002, 2003 | $ \cdots $ |
| Testing set year | 2003 | 2004 | $ \cdots $ |
Table 9.  AUC and standard deviation in Freddie Mac experiment
| Time | With Relabelling: AUC | With: DeLong | With: Bootstrap | With: Stratified | Without Relabelling: AUC | Without: DeLong | Without: Bootstrap | Without: Stratified |
|---|---|---|---|---|---|---|---|---|
| 2003 Q1 | 0.879 | 0.033 | 0.035 | 0.032 | 0.873 | 0.032 | 0.028 | 0.033 |
| 2003 Q2 | 0.880 | 0.025 | 0.024 | 0.024 | 0.878 | 0.026 | 0.026 | 0.025 |
| 2003 Q3 | 0.839 | 0.035 | 0.033 | 0.031 | 0.824 | 0.039 | 0.037 | 0.038 |
| 2003 Q4 | 0.872 | 0.025 | 0.025 | 0.025 | 0.872 | 0.026 | 0.028 | 0.026 |
| 2004 Q1 | 0.808 | 0.042 | 0.041 | 0.041 | 0.804 | 0.043 | 0.041 | 0.039 |
| 2004 Q2 | 0.804 | 0.053 | 0.056 | 0.053 | 0.795 | 0.052 | 0.046 | 0.050 |
| 2004 Q3 | 0.636 | 0.067 | 0.063 | 0.067 | 0.634 | 0.075 | 0.067 | 0.073 |
| 2004 Q4 | 0.806 | 0.046 | 0.045 | 0.046 | 0.796 | 0.054 | 0.056 | 0.051 |
| 2005 Q1 | 0.865 | 0.025 | 0.024 | 0.027 | 0.805 | 0.042 | 0.045 | 0.043 |
| 2005 Q2 | 0.841 | 0.026 | 0.025 | 0.026 | 0.758 | 0.038 | 0.037 | 0.036 |
| 2005 Q3 | 0.849 | 0.021 | 0.020 | 0.022 | 0.799 | 0.033 | 0.032 | 0.033 |
| 2005 Q4 | 0.814 | 0.022 | 0.022 | 0.021 | 0.776 | 0.027 | 0.028 | 0.029 |
| 2006 Q1 | 0.817 | 0.017 | 0.016 | 0.016 | 0.797 | 0.020 | 0.021 | 0.019 |
| 2006 Q2 | 0.803 | 0.015 | 0.016 | 0.016 | 0.795 | 0.017 | 0.017 | 0.017 |
| 2006 Q3 | 0.789 | 0.016 | 0.015 | 0.015 | 0.776 | 0.018 | 0.018 | 0.018 |
| 2006 Q4 | 0.776 | 0.012 | 0.012 | 0.012 | 0.769 | 0.013 | 0.013 | 0.013 |
| 2007 Q1 | 0.697 | 0.013 | 0.013 | 0.014 | 0.713 | 0.013 | 0.012 | 0.012 |
| 2007 Q2 | 0.704 | 0.010 | 0.010 | 0.010 | 0.720 | 0.009 | 0.009 | 0.009 |
| 2007 Q3 | 0.725 | 0.008 | 0.008 | 0.008 | 0.727 | 0.008 | 0.008 | 0.008 |
| 2007 Q4 | 0.720 | 0.006 | 0.006 | 0.007 | 0.738 | 0.006 | 0.006 | 0.005 |
| 2008 Q1 | 0.837 | 0.004 | 0.004 | 0.004 | 0.838 | 0.004 | 0.005 | 0.005 |
| 2008 Q2 | 0.832 | 0.005 | 0.005 | 0.005 | 0.833 | 0.005 | 0.006 | 0.005 |
| 2008 Q3 | 0.830 | 0.006 | 0.006 | 0.007 | 0.831 | 0.006 | 0.006 | 0.007 |
| 2008 Q4 | 0.857 | 0.008 | 0.008 | 0.008 | 0.856 | 0.008 | 0.008 | 0.008 |
| 2009 Q1 | 0.804 | 0.024 | 0.023 | 0.022 | 0.805 | 0.024 | 0.023 | 0.023 |
| 2009 Q2 | 0.811 | 0.018 | 0.019 | 0.017 | 0.807 | 0.018 | 0.017 | 0.018 |
| 2009 Q3 | 0.757 | 0.013 | 0.013 | 0.013 | 0.758 | 0.013 | 0.012 | 0.013 |
| 2009 Q4 | 0.738 | 0.023 | 0.025 | 0.022 | 0.742 | 0.023 | 0.022 | 0.023 |
| 2010 Q1 | 0.825 | 0.033 | 0.034 | 0.032 | 0.829 | 0.032 | 0.029 | 0.031 |
| 2010 Q2 | 0.793 | 0.038 | 0.039 | 0.037 | 0.798 | 0.037 | 0.034 | 0.039 |
| 2010 Q3 | 0.826 | 0.034 | 0.031 | 0.034 | 0.830 | 0.033 | 0.029 | 0.033 |
| 2010 Q4 | 0.769 | 0.036 | 0.038 | 0.034 | 0.779 | 0.037 | 0.035 | 0.037 |
| 2011 Q1 | 0.789 | 0.039 | 0.037 | 0.035 | 0.780 | 0.039 | 0.043 | 0.039 |
| 2011 Q2 | 0.780 | 0.042 | 0.041 | 0.039 | 0.773 | 0.043 | 0.041 | 0.042 |
| 2011 Q3 | 0.740 | 0.048 | 0.048 | 0.044 | 0.733 | 0.049 | 0.048 | 0.046 |
| 2011 Q4 | 0.782 | 0.050 | 0.043 | 0.047 | 0.783 | 0.049 | 0.050 | 0.046 |
| 2012 Q1 | 0.861 | 0.034 | 0.032 | 0.033 | 0.868 | 0.031 | 0.031 | 0.031 |
| 2012 Q2 | 0.776 | 0.043 | 0.045 | 0.038 | 0.778 | 0.042 | 0.046 | 0.039 |
| 2012 Q3 | 0.771 | 0.045 | 0.043 | 0.045 | 0.784 | 0.045 | 0.046 | 0.043 |
| 2012 Q4 | 0.771 | 0.038 | 0.036 | 0.034 | 0.766 | 0.039 | 0.038 | 0.040 |
| 2013 Q1 | 0.769 | 0.039 | 0.037 | 0.039 | 0.772 | 0.040 | 0.039 | 0.041 |
| 2013 Q2 | 0.738 | 0.029 | 0.028 | 0.029 | 0.739 | 0.030 | 0.028 | 0.026 |
| 2013 Q3 | 0.730 | 0.040 | 0.039 | 0.041 | 0.735 | 0.042 | 0.043 | 0.041 |
| 2013 Q4 | 0.754 | 0.033 | 0.031 | 0.032 | 0.750 | 0.033 | 0.032 | 0.032 |
Table 7.  $ D $ statistics from KS-test between training and test bootstrapped AUC
| Train year | Test year | Without relabelling | With relabelling |
|---|---|---|---|
| 2000 | 2003 | 0.435 | 0.420 |
| 2001 | 2004 | 0.885 | 0.679 |
| 2002 | 2005 | 0.980 | 0.398 |
| 2003 | 2006 | 1.000 | 0.842 |
| 2004 | 2007 | 1.000 | 0.855 |
| 2005 | 2008 | 0.800 | 0.289 |
| 2006 | 2009 | 0.993 | 0.930 |
| 2007 | 2010 | 0.900 | 0.930 |
| 2008 | 2011 | 0.985 | 0.983 |
| 2009 | 2012 | 0.890 | 0.827 |
| 2010 | 2013 | 0.990 | 0.795 |
Table 10.  Mean AUC difference (Hierarchical - Logistic) in each year
| Train year | Training default rate | Test year | Test default rate | AUC difference |
|---|---|---|---|---|
| 2000 | 0.41% | 2003 | 0.06% | 0.0057 |
| 2001 | 0.20% | 2004 | 0.07% | 0.0063 |
| 2002 | 0.10% | 2005 | 0.18% | 0.0578 |
| 2003 | 0.06% | 2006 | 0.89% | 0.0119 |
| 2004 | 0.07% | 2007 | 4.26% | -0.0133 |
| 2005 | 0.18% | 2008 | 3.15% | -0.0005 |
| 2006 | 0.89% | 2009 | 0.30% | -0.0003 |
| 2007 | 4.26% | 2010 | 0.09% | -0.0055 |
| 2008 | 3.15% | 2011 | 0.08% | 0.0055 |
| 2009 | 0.30% | 2012 | 0.06% | -0.0041 |
| 2010 | 0.09% | 2013 | 0.10% | -0.0011 |