doi: 10.3934/jimo.2020128

Two penalized mixed–integer nonlinear programming approaches to tackle multicollinearity and outliers effects in linear regression models

 Faculty of Mathematics, Statistics and Computer Science, Semnan University, P.O. Box 35195–363, Semnan, Iran

* Corresponding author: Mahdi Roozbeh

Received September 2019; Revised May 2020; Published August 2020

In classical regression analysis, ordinary least–squares estimation is the best strategy when the essential assumptions are met, namely normality and independence of the error terms and negligible multicollinearity among the covariates. If any of these assumptions is violated, however, the results may be misleading. In particular, outliers violate the assumption of normally distributed residuals in least–squares regression. In this situation, robust estimators are widely used because of their insensitivity to outlying data points. Multicollinearity is another common problem in multiple regression models, with adverse effects on the least–squares estimators. It is therefore important to use estimation methods designed to tackle these problems. Robust regression is among the popular methods for analyzing data contaminated with outliers. Along these lines, we suggest two mixed–integer nonlinear optimization models whose solutions can be considered appropriate estimators when outliers and multicollinearity appear simultaneously in the data set. The models are based on penalization schemes able to down–weight or ignore unusual data and multicollinearity effects, and they can be solved effectively by metaheuristic algorithms. We establish that our models are computationally advantageous in terms of the flop count. We also deal with a robust ridge methodology. Finally, three real data sets are analyzed to examine the performance of the proposed methods.
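To make the idea of a penalized mixed–integer formulation concrete, the following is a minimal sketch and not the models of the paper. It assumes a ridge-penalized least trimmed squares objective, $\sum_i z_i (y_i - x_i^\top\beta)^2 + k\|\beta\|^2$ with binary $z_i$ and $\sum_i z_i = h$, and it replaces the metaheuristic solvers mentioned above with simple concentration steps from random starts. The names `ridge_fit`, `trimmed_ridge`, `h`, and `k` are hypothetical, and the data are synthetic.

```python
import numpy as np

def ridge_fit(X, y, k):
    """Ridge estimate (X'X + kI)^{-1} X'y for a fixed penalty k >= 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

def trimmed_ridge(X, y, h, k, n_starts=20, n_steps=50, seed=0):
    """Heuristic search over binary z (sum z = h) by concentration steps.

    Assumed objective: sum_i z_i * r_i(beta)^2 + k * ||beta||^2.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    best_obj, best_beta, best_z = np.inf, None, None
    for _ in range(n_starts):
        keep = rng.choice(n, size=h, replace=False)  # random initial subset
        for _ in range(n_steps):
            beta = ridge_fit(X[keep], y[keep], k)
            resid2 = (y - X @ beta) ** 2
            new_keep = np.argsort(resid2)[:h]  # h smallest squared residuals
            if set(new_keep) == set(keep):
                break  # subset is stable, stop this start
            keep = new_keep
        beta = ridge_fit(X[keep], y[keep], k)
        obj = np.sum((y[keep] - X[keep] @ beta) ** 2) + k * (beta @ beta)
        if obj < best_obj:
            z = np.zeros(n, dtype=int)
            z[keep] = 1
            best_obj, best_beta, best_z = obj, beta, z
    return best_beta, best_z, best_obj

# Synthetic data: two nearly collinear covariates and six gross outliers.
rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x1 + 0.01 * rng.normal(size=n)])
y = 1.0 + 2.0 * x1 + 0.1 * rng.normal(size=n)
y[:6] += 50.0  # contaminate the first six responses

beta, z, obj = trimmed_ridge(X, y, h=48, k=0.1)
print("coefficients:", beta)
print("observations ignored:", np.where(z == 0)[0])
```

In this sketch the binary vector `z` plays the role of the integer variables that switch observations on or off, while `k` controls the ridge shrinkage that stabilizes the fit under collinearity. Each concentration step cannot increase the assumed objective, but the search is only a local heuristic.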

Citation: Mahdi Roozbeh, Saman Babaie–Kafaki, Zohre Aminifard. Two penalized mixed–integer nonlinear programming approaches to tackle multicollinearity and outliers effects in linear regression models. Journal of Industrial & Management Optimization, doi: 10.3934/jimo.2020128
The diagnostic plots of the model (18)
The diagram of ${\rm GCV}(k,z)$ versus the ridge parameter for the bridge projects data set
The diagnostic plots for the model (20)
The diagram of ${\rm GCV}(k,z)$ versus the ridge parameter for the electricity data
The diagnostic plots for the model (21)
The diagram of ${\rm GCV}(k,z)$ versus the ridge parameter for the CPS data
Evaluation of the proposed estimators for the bridge projects data set
| Coefficients | OLS | RLTS | MLTSCM | UBDMLTSCM1 | UBDMLTSCM2 | LSVR | NSVR | NNR |
|---|---|---|---|---|---|---|---|---|
| $Intercept$ | 2.3317 | 1.91363 | 2.0304 | 1.8278 | 1.9140 | -0.0125 | - | -7.8431 |
| $\log(CCost)$ | 0.1483 | 0.33718 | 0.3056 | 0.2923 | 0.2360 | 0.4152 | - | 0.4236 |
| $\log(Dwgs)$ | 0.8356 | 0.58002 | 0.6210 | 0.7829 | 0.8914 | 0.3933 | - | 2.8061 |
| $\log(Spans)$ | 0.1963 | 0.06662 | 0.0657 | 0.0241 | 0.0467 | 0.1176 | - | 0.5110 |
| ${\rm SSE}$ | 3.8692 | 1.9788 | 1.9778 | 1.0577 | 1.1504 | 4.0131 | 2.7834 | 1.7108 |
| ${\rm R}^2$ | 0.7747 | 0.8579 | 0.8600 | 0.9147 | 0.9020 | 0.7663 | 0.8379 | 0.9004 |
The most effective subgroup of predictor variables based on the ${\rm R}^2_{adj}$ and AIC criteria for the electricity data set
| Subset size | Predictor variables | ${\rm R}^2_{adj}$ | AIC |
|---|---|---|---|
| 1 | $Temp$ | 0.5523 | -1067.814 |
| 2 | $Temp,LREG$ | 0.5781 | -1077.339 |
| 3 | ${\bf Temp,LREG,LI}$ | 0.5892 | -1081.063 |
| 4 | $Temp,LREG,LI,x_{9}$ | 0.5891 | -1080.057 |
| 5 | $Temp,LREG,LI,x_{9},x_{10}$ | 0.5882 | -1078.709 |
| 6 | $Temp,LREG,LI,x_{9},x_{10},x_{11}$ | 0.5875 | -1077.427 |
| 7 | $Temp,LREG,LI,x_{9},x_{10},x_{11},x_{1}$ | 0.5858 | -1075.734 |
| 8 | $Temp,LREG,LI,x_{9},x_{10},x_{11},x_{1},x_{3}$ | 0.5837 | -1073.897 |
| 9 | $Temp,LREG,LI,x_{9},x_{10},x_{11},x_{1},x_{3},x_{5}$ | 0.5812 | -1071.907 |
| 10 | $Temp,LREG,LI,x_{9},x_{10},x_{11},x_{1},x_{3},x_{5},x_{4}$ | 0.5789 | -1069.987 |
| 11 | $Temp,LREG,LI,x_{9},x_{10},x_{11},x_{1},x_{3},x_{5},x_{4},x_{7}$ | 0.5764 | -1067.997 |
| 12 | $Temp,LREG,LI,x_{9},x_{10},x_{11},x_{1},x_{3},x_{5},x_{4},x_{7},x_{2}$ | 0.5740 | -1064.098 |
| 13 | $Temp,LREG,LI,x_{9},x_{10},x_{11},x_{1},x_{3},x_{5},x_{4},x_{7},x_{2},x_{6}$ | 0.5718 | -1064.281 |
| 14 | $Temp,LREG,LI,x_{9},x_{10},x_{11},x_{1},x_{3},x_{5},x_{4},x_{7},x_{2},x_{6},x_{8}$ | 0.5709 | -1063.014 |
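As an illustration of how criteria like those in the subset table can be computed, the sketch below evaluates the adjusted ${\rm R}^2$ and an AIC value for nested subsets of predictors on synthetic data. It is an assumption-laden example and not the authors' procedure: the function name `subset_criteria` is hypothetical, and the AIC is taken in the common Gaussian form $n\log({\rm SSE}/n) + 2p$, which may differ from the paper's definition by an additive constant.

```python
import numpy as np

def subset_criteria(X, y, cols):
    """Adjusted R^2 and AIC of an OLS fit on the chosen columns plus intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    p = A.shape[1]  # number of fitted coefficients, intercept included
    r2_adj = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    aic = n * np.log(sse / n) + 2 * p  # assumed Gaussian AIC, up to a constant
    return r2_adj, aic

# Synthetic data: only the first two of five predictors matter.
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 5))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 1.5 * X_demo[:, 1] + 0.5 * rng.normal(size=n)

# Evaluate nested subsets in a fixed (assumed) order of entry.
results = []
for size in range(1, 6):
    cols = list(range(size))
    results.append((size, *subset_criteria(X_demo, y_demo, cols)))
    print(size, results[-1][1], results[-1][2])
```

A subset is preferred when its adjusted ${\rm R}^2$ is larger and its AIC is smaller; in the table above both criteria select the three-variable subset shown in bold.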
Evaluation of the proposed estimators for the electricity data set
| Coefficients | OLS | RLTS | MLTSCM | UBDMLTSCM1 | UBDMLTSCM2 | LSVR | NSVR | NNR |
|---|---|---|---|---|---|---|---|---|
| $Intercept$ | 4.4069 | 5.1693 | 4.9881 | 5.2039 | 4.0907 | 0.0881 | - | 2.6215 |
| $LI$ | 0.1925 | 0.0989 | 0.1146 | 0.0956 | 0.2225 | 0.1545 | - | 1.2806 |
| $LREG$ | -0.0778 | -0.0939 | -0.1054 | -0.0956 | -0.0940 | -0.1322 | - | -3.7418 |
| $Temp$ | -0.0002 | -0.0002 | -0.0003 | -0.0003 | -0.0003 | -0.7508 | - | -0.8067 |
| ${\rm SSE}$ | 0.3765 | 0.2637 | 0.1982 | 0.1296 | 0.1413 | 0.3881 | 0.2629 | 0.4240 |
| ${\rm R}^2$ | 0.5962 | 0.6742 | 0.7399 | 0.7559 | 0.7468 | 0.5838 | 0.7181 | 0.5452 |
Evaluation of the proposed estimators for the CPS data
| Coefficients | OLS | RLTS | MLTSCM | UBDMLTSCM1 | UBDMLTSCM2 | LSVR | NSVR | NNR |
|---|---|---|---|---|---|---|---|---|
| $Intercept$ | 1.0786 | 0.7498 | 1.1963 | 0.9257 | 0.9038 | 0.0054 | - | -5.5913 |
| $education$ | 0.1794 | 0.1482 | 0.2576 | 0.2018 | 0.1974 | 0.4997 | - | 0.6978 |
| $south$ | -0.1024 | -0.1208 | -0.1109 | -0.1174 | -0.0916 | -0.1141 | - | -0.4331 |
| $sex$ | -0.2220 | -0.2851 | -0.2776 | -0.2665 | -0.2416 | -0.2638 | - | -0.9731 |
| $experience$ | 0.0958 | 0.0613 | 0.1630 | 0.1090 | 0.1011 | 0.2573 | - | 0.2991 |
| $union$ | 0.2005 | 0.1939 | 0.1987 | 0.1427 | 0.1791 | 0.1511 | - | 1.0483 |
| $age$ | -0.0854 | -0.0473 | -0.1510 | -0.0960 | -0.0888 | 0.0420 | - | -0.2590 |
| $race$ | 0.0504 | 0.0674 | 0.0482 | 0.0749 | 0.0515 | 0.0930 | - | 0.2437 |
| $occupation$ | -0.0074 | -0.0122 | 0.0072 | -0.0126 | -0.0140 | -0.0526 | - | 0.0004 |
| $sector$ | 0.0915 | 0.0614 | 0.0411 | 0.0965 | 0.0810 | 0.0918 | - | 0.3258 |
| $married$ | 0.0766 | 0.0590 | 0.1937 | 0.0924 | 0.1216 | 0.0524 | - | 0.4156 |
| ${\rm SSE}$ | 101.17 | 76.3827 | 50.5810 | 49.8101 | 49.2827 | 102.5847 | 79.0911 | 84.2234 |
| ${\rm R}^2$ | 0.3185 | 0.4049 | 0.4146 | 0.4123 | 0.4279 | 0.3089 | 0.4672 | 0.4326 |
