In this report, an in-depth multiple regression analysis using the House Prices: Advanced Regression Techniques dataset (Kaggle, 2016) is conducted. This dataset provides rich features related to residential homes in Ames, Iowa. Our objective is to build and optimize a predictive model of house prices, balancing degrees of freedom (df), the F-statistic, and model complexity (number of predictors).
The goals are to:
Select meaningful predictors based on domain knowledge and EDA
Use stepwise and manual refinement methods to optimize the model
Justify model evolution with statistical diagnostics
Validate the final model using cross-validation and residual diagnostics
knitr::opts_chunk$set(echo = TRUE)
setwd("E:/DataSets/House prices advanced regression techniques")
train <- read.csv("train.csv")
test <- read.csv("test.csv")
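The chunks below call functions from several add-on packages: replace_na() from tidyr, stepAIC() from MASS, vif() (commonly from car), train() and trainControl() from caret, comma() and dollar() from scales, plus ggplot2 and corrplot for plotting. Loading them once in the setup chunk, as sketched here, keeps the later chunks self-contained (this assumes the packages are installed):
library(tidyr)     # replace_na()
library(MASS)      # stepAIC()
library(car)       # vif()
library(caret)     # train(), trainControl()
library(scales)    # comma(), dollar() axis labels
library(ggplot2)   # VIF bar plot
library(corrplot)  # correlation heatmap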
Before building any model, it’s essential to understand the structure of the dataset — how many observations and variables it contains, and what types those variables are (e.g., numeric, categorical, character). This guides decisions about preprocessing, encoding, and statistical modeling.
Additionally, checking for missing data ensures that we can handle it properly before training the model. Left untreated, missing values can lead to biased estimates or errors in fitting functions.
The str() output shows:
The dataset contains 1,460 observations and 81 variables
Variables vary in type: integers and characters (categorical fields read in as text)
Several variables, such as Alley, PoolQC, and FireplaceQu, include NA values
The missing_vals output identifies which columns contain missing data and how many missing entries each has. For example:
PoolQC, MiscFeature, and Alley have many missing entries, often meaning “none” rather than true missingness
LotFrontage and GarageYrBlt have fewer missing values, possibly true gaps that require imputation
Understanding both the structure and missingness allows us to:
Plan appropriate imputation or exclusion
Treat informative NAs (like “no pool”) differently from unknown NAs
Avoid downstream issues when encoding or modeling
str(train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
# Missing value check
missing_vals <- sort(sapply(train, function(x) sum(is.na(x))), decreasing = TRUE)
missing_vals[missing_vals > 0]
## PoolQC MiscFeature Alley Fence FireplaceQu LotFrontage
## 1453 1406 1369 1179 690 259
## GarageType GarageYrBlt GarageFinish GarageQual GarageCond BsmtExposure
## 81 81 81 81 81 38
## BsmtFinType2 BsmtQual BsmtCond BsmtFinType1 MasVnrType MasVnrArea
## 38 37 37 37 8 8
## Electrical
## 1
Missing values can lead to errors in model training or introduce bias if not handled appropriately. In this dataset, missingness arises from two distinct causes:
Some missing values are structural, meaning they indicate the absence of a feature (e.g., no garage, no fireplace). These are common in categorical variables and should be replaced with an explicit label like “None”.
Others are true missing values in numerical fields (e.g., LotFrontage) and should be imputed to preserve the sample size. Using the median for imputation helps mitigate the influence of outliers.
For categorical features like FireplaceQu and PoolQC, we assume that NA means the feature is not present in the house. These are recoded as “None”.
For numerical fields like LotFrontage, median imputation avoids skewing the data compared to mean imputation (Mohammed et al., 2021).
The variable Electrical has very few missing entries, so it is imputed with the mode (most frequent category), which is safe and preserves category integrity (Memon et al., 2023).
This process ensures the dataset is complete and model-ready, with minimal introduction of bias or distortion.
cat_na_none <- c("Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2",
"FireplaceQu", "GarageType", "GarageFinish", "GarageQual", "GarageCond",
"PoolQC", "Fence", "MiscFeature")
train[cat_na_none] <- lapply(train[cat_na_none], function(x) replace_na(x, "None"))
# Numeric median imputation
num_impute <- c("LotFrontage", "MasVnrArea", "GarageYrBlt")
for (col in num_impute) {
train[[col]][is.na(train[[col]])] <- median(train[[col]], na.rm = TRUE)
}
# Other fixes
train$MasVnrType <- replace_na(train$MasVnrType, "None")
train$Electrical <- replace_na(train$Electrical, names(which.max(table(train$Electrical))))
In R, character columns are not automatically treated as categorical variables in modeling. Many machine learning and regression functions require categorical predictors to be explicitly encoded as factors, which ensures appropriate handling of levels in model design matrices.
Additionally, the target variable SalePrice is right-skewed, which violates normality assumptions in linear regression. Applying a log transformation reduces skewness, stabilizes variance, and helps improve model fit.
The first step converts all character-type variables into factors, ensuring they’re treated correctly in regression modeling (e.g., as dummy variables). This is crucial for categorical fields like Neighborhood, Exterior1st, and SaleType.
The log transformation of SalePrice compresses the range of large values, making the distribution more symmetric and suitable for modeling under the assumptions of linear regression.
Linear regression assumes normally distributed residuals. Log-transforming a skewed dependent variable like sale price improves this assumption, often enhancing R², reducing heteroscedasticity, and improving interpretability in percentage change terms.
## 1. Convert character columns to factors (single pass)
categorical_vars <- names(train)[sapply(train, is.character)]
train[categorical_vars] <- lapply(train[categorical_vars], as.factor)
## 2. Keep an unaltered copy of SalePrice for “before” plots
raw_price <- train$SalePrice # dollar scale
## 3. Set up a 2×2 grid: two plots before, two plots after
par(mfrow = c(2, 2))
## ── BEFORE log transform ────────────────────────────────────────────────
hist(raw_price,
breaks = 30,
main = "Original SalePrice",
xlab = "Sale Price ($)",
col = "lightblue",
xaxt = "n")
axis(1,
at = pretty(raw_price),
labels = comma(pretty(raw_price))) # $60,000 $100,000 …
plot(density(raw_price),
main = "Density: Original SalePrice",
xlab = "Sale Price ($)",
xaxt = "n")
axis(1,
at = pretty(raw_price),
labels = comma(pretty(raw_price)))
## ── AFTER log transform ────────────────────────────────────────────────
train$SalePrice <- log1p(train$SalePrice) # log(SalePrice + 1)
hist(train$SalePrice,
breaks = 30,
main = "Log-Transformed SalePrice",
xlab = "log(SalePrice + 1)",
col = "lightgreen",
xaxt = "n")
axis(1,
at = pretty(train$SalePrice),
labels = round(pretty(train$SalePrice), 2)) # 11.0, 11.5, …
plot(density(train$SalePrice),
main = "Density: Log-Transformed SalePrice",
xlab = "log(SalePrice + 1)")
The initial regression model sets the foundation for understanding which features meaningfully contribute to home prices (Lorenz et al., 2022; Root et al., 2023; Yoshida & Seya, 2021). This first-pass model is intentionally constrained to a manageable number of predictors, selected based on both housing economics literature and domain knowledge (Kostic & Jevremovic, 2020; Neves et al., 2024).
The goal here is not to find the “best” model yet, but to start with interpretable, high-signal predictors and iteratively improve from there (Kostic & Jevremovic, 2020; Lorenz et al., 2022; Neves et al., 2024; Root et al., 2023; Yoshida & Seya, 2021).
summary(model_initial) provides the initial coefficients, p-values, R², and F-statistic.
Significant p-values for most predictors would validate their importance.
The F-statistic will test overall model significance.
This model will serve as a baseline against which we’ll compare more refined versions.
Each predictor was chosen based on strong empirical or theoretical links to housing value:
| Predictor | Rationale |
|---|---|
| OverallQual | Reflects materials and finish, a strong proxy for perceived and appraised quality. |
| GrLivArea | Captures above-ground living space, a dominant factor in home value across all markets. |
| TotalBsmtSF | Adds value in cold climates and in homes with finished basements. |
| GarageCars | Garage capacity correlates with suburban buyer preferences and resale value. |
| YearBuilt | Reflects the age and modernity of the structure; newer homes typically command price premiums. |
| Neighborhood | Controls for location, which is consistently one of the strongest drivers of price. |
| FullBath | Bathroom count affects functionality and buyer appeal. |
| TotRmsAbvGrd | Total rooms above grade is a general size indicator complementary to square footage. |
Overall Fit
The model achieves an R² of 0.846 and an adjusted R² of 0.843, meaning the selected predictors explain approximately 84% of the variance in log-transformed sale prices. This indicates a strong initial model fit for real-world data.
F-statistic
The F-statistic of 252.9 (p < 2.2e-16) confirms that the model, as a whole, is statistically significant, meaning at least one predictor contributes meaningful explanatory power beyond chance.
Significant Predictors
Several variables stand out as statistically significant predictors of sale price, including:
OverallQual, GrLivArea, GarageCars, TotalBsmtSF, and YearBuilt (all p < 0.001)
Neighborhood levels NoRidge, NridgHt, Crawfor, and StoneBr also show strong positive associations
NeighborhoodBrDale, IDOTRR, and MeadowV are significantly negative
Non-Significant Predictors
Some variables show weak or no significant association:
FullBath (p = 0.82) and TotRmsAbvGrd (p = 0.13) are not statistically significant at conventional thresholds
Some neighborhood levels (OldTown, SawyerW) are also not significant, potentially due to multicollinearity or low sample sizes in those categories
Residuals
The residual distribution looks relatively symmetric, with a residual standard error of 0.1585, indicating acceptable spread in the residuals for a log-transformed target.
initial_formula <- SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF +
YearBuilt + Neighborhood + FullBath + TotRmsAbvGrd
model_initial <- lm(initial_formula, data = train)
summary(model_initial)
##
## Call:
## lm(formula = initial_formula, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77477 -0.07102 0.01128 0.08778 0.50001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.430e+00 6.331e-01 11.736 < 2e-16 ***
## OverallQual 9.230e-02 5.316e-03 17.362 < 2e-16 ***
## GrLivArea 2.053e-04 1.841e-05 11.155 < 2e-16 ***
## GarageCars 7.495e-02 7.830e-03 9.573 < 2e-16 ***
## TotalBsmtSF 9.906e-05 1.258e-05 7.875 6.71e-15 ***
## YearBuilt 1.713e-03 3.200e-04 5.352 1.01e-07 ***
## NeighborhoodBlueste -8.377e-02 1.191e-01 -0.703 0.481936
## NeighborhoodBrDale -1.994e-01 5.698e-02 -3.500 0.000480 ***
## NeighborhoodBrkSide 2.679e-02 4.935e-02 0.543 0.587285
## NeighborhoodClearCr 2.319e-01 5.079e-02 4.566 5.40e-06 ***
## NeighborhoodCollgCr 8.809e-02 4.081e-02 2.158 0.031061 *
## NeighborhoodCrawfor 2.314e-01 4.860e-02 4.762 2.11e-06 ***
## NeighborhoodEdwards -3.047e-02 4.488e-02 -0.679 0.497291
## NeighborhoodGilbert 7.366e-02 4.300e-02 1.713 0.086892 .
## NeighborhoodIDOTRR -1.472e-01 5.227e-02 -2.816 0.004933 **
## NeighborhoodMeadowV -1.277e-01 5.692e-02 -2.243 0.025062 *
## NeighborhoodMitchel 4.540e-02 4.565e-02 0.995 0.320115
## NeighborhoodNAmes 6.066e-02 4.270e-02 1.420 0.155702
## NeighborhoodNoRidge 1.811e-01 4.738e-02 3.824 0.000137 ***
## NeighborhoodNPkVill -3.895e-02 6.614e-02 -0.589 0.556007
## NeighborhoodNridgHt 1.895e-01 4.283e-02 4.423 1.05e-05 ***
## NeighborhoodNWAmes 6.479e-02 4.384e-02 1.478 0.139669
## NeighborhoodOldTown -5.276e-02 4.808e-02 -1.097 0.272721
## NeighborhoodSawyer 5.922e-02 4.518e-02 1.311 0.190103
## NeighborhoodSawyerW 5.315e-02 4.431e-02 1.199 0.230544
## NeighborhoodSomerst 1.036e-01 4.229e-02 2.450 0.014389 *
## NeighborhoodStoneBr 2.174e-01 5.028e-02 4.324 1.64e-05 ***
## NeighborhoodSWISU 3.019e-02 5.579e-02 0.541 0.588509
## NeighborhoodTimber 1.416e-01 4.652e-02 3.044 0.002377 **
## NeighborhoodVeenker 2.540e-01 6.207e-02 4.092 4.51e-05 ***
## FullBath -2.709e-03 1.170e-02 -0.232 0.816943
## TotRmsAbvGrd 7.297e-03 4.814e-03 1.516 0.129748
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1585 on 1428 degrees of freedom
## Multiple R-squared: 0.8459, Adjusted R-squared: 0.8426
## F-statistic: 252.9 on 31 and 1428 DF, p-value: < 2.2e-16
This initial model confirms that key structural and location-based attributes, such as OverallQual, GrLivArea, and Neighborhood, are statistically strong and reliable drivers of home price. While some predictors like FullBath and TotRmsAbvGrd may seem conceptually relevant, their weak statistical performance suggests they may be redundant or collinear with other features (e.g., GrLivArea or OverallQual).
Initial models provide a baseline, but they often include predictors that are redundant, insignificant, or collinear. This step focuses on refining the model by balancing statistical significance, parsimony, and predictive strength.
Both automated and manual optimization are used to balance statistical efficiency with practical relevance. Stepwise selection improves model fit algorithmically. Manual refinement incorporates domain expertise, reduces multicollinearity, and preserves interpretability.
Stepwise regression is deployed, which automatically adds or removes predictors to minimize the Akaike Information Criterion (AIC), a measure of model quality that penalizes complexity.
step_model <- stepAIC(model_initial, direction = "both", trace = FALSE)
summary(step_model)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## TotalBsmtSF + YearBuilt + Neighborhood + TotRmsAbvGrd, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77088 -0.07051 0.01122 0.08776 0.49651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.455e+00 6.239e-01 11.948 < 2e-16 ***
## OverallQual 9.231e-02 5.314e-03 17.371 < 2e-16 ***
## GrLivArea 2.042e-04 1.768e-05 11.546 < 2e-16 ***
## GarageCars 7.495e-02 7.827e-03 9.575 < 2e-16 ***
## TotalBsmtSF 9.925e-05 1.255e-05 7.908 5.21e-15 ***
## YearBuilt 1.699e-03 3.144e-04 5.404 7.62e-08 ***
## NeighborhoodBlueste -8.311e-02 1.190e-01 -0.698 0.485100
## NeighborhoodBrDale -1.983e-01 5.675e-02 -3.494 0.000491 ***
## NeighborhoodBrkSide 2.763e-02 4.920e-02 0.562 0.574517
## NeighborhoodClearCr 2.326e-01 5.067e-02 4.591 4.80e-06 ***
## NeighborhoodCollgCr 8.843e-02 4.077e-02 2.169 0.030262 *
## NeighborhoodCrawfor 2.322e-01 4.848e-02 4.789 1.85e-06 ***
## NeighborhoodEdwards -2.978e-02 4.477e-02 -0.665 0.506011
## NeighborhoodGilbert 7.376e-02 4.298e-02 1.716 0.086360 .
## NeighborhoodIDOTRR -1.463e-01 5.211e-02 -2.807 0.005064 **
## NeighborhoodMeadowV -1.265e-01 5.669e-02 -2.232 0.025772 *
## NeighborhoodMitchel 4.609e-02 4.554e-02 1.012 0.311742
## NeighborhoodNAmes 6.174e-02 4.243e-02 1.455 0.145898
## NeighborhoodNoRidge 1.819e-01 4.724e-02 3.852 0.000123 ***
## NeighborhoodNPkVill -3.988e-02 6.599e-02 -0.604 0.545779
## NeighborhoodNridgHt 1.898e-01 4.280e-02 4.433 1.00e-05 ***
## NeighborhoodNWAmes 6.491e-02 4.382e-02 1.481 0.138756
## NeighborhoodOldTown -5.222e-02 4.801e-02 -1.088 0.276946
## NeighborhoodSawyer 6.026e-02 4.494e-02 1.341 0.180134
## NeighborhoodSawyerW 5.359e-02 4.426e-02 1.211 0.226159
## NeighborhoodSomerst 1.036e-01 4.227e-02 2.450 0.014409 *
## NeighborhoodStoneBr 2.178e-01 5.024e-02 4.335 1.56e-05 ***
## NeighborhoodSWISU 3.044e-02 5.576e-02 0.546 0.585220
## NeighborhoodTimber 1.419e-01 4.649e-02 3.052 0.002316 **
## NeighborhoodVeenker 2.552e-01 6.185e-02 4.126 3.90e-05 ***
## TotRmsAbvGrd 7.153e-03 4.772e-03 1.499 0.134065
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1584 on 1429 degrees of freedom
## Multiple R-squared: 0.8459, Adjusted R-squared: 0.8427
## F-statistic: 261.5 on 30 and 1429 DF, p-value: < 2.2e-16
To further assess model quality, calculate the Variance Inflation Factor (VIF) for each predictor (Kalnins & Praitis Hill, 2025; Kyriazos & Poga, 2023; Salmerón, 2020; Salmerón-Gómez et al., 2025).
The Variance Inflation Factor (VIF) quantifies how much a predictor’s variance is inflated due to multicollinearity with other predictors; it is commonly used during model diagnostics to detect and mitigate redundancy in multiple regression models, ensuring coefficient estimates remain stable and interpretable.
VIF values above 5–10 suggest multicollinearity, which can inflate standard errors and distort coefficient estimates (Kalnins & Praitis Hill, 2025; Kyriazos & Poga, 2023; Salmerón, 2020; Salmerón-Gómez et al., 2025).
vif(step_model)
## OverallQual GrLivArea GarageCars TotalBsmtSF
## 3.1399 5.0192 1.9888 1.7622
## YearBuilt NeighborhoodBlueste NeighborhoodBrDale NeighborhoodBrkSide
## 5.2414 1.1273 2.0305 5.3710
## NeighborhoodClearCr NeighborhoodCollgCr NeighborhoodCrawfor NeighborhoodEdwards
## 2.8091 8.9150 4.6084 7.4388
## NeighborhoodGilbert NeighborhoodIDOTRR NeighborhoodMeadowV NeighborhoodMitchel
## 5.4995 3.9020 2.1516 3.9134
## NeighborhoodNAmes NeighborhoodNoRidge NeighborhoodNPkVill NeighborhoodNridgHt
## 13.6530 3.5425 1.5521 5.3242
## NeighborhoodNWAmes NeighborhoodOldTown NeighborhoodSawyer NeighborhoodSawyerW
## 5.3064 9.5738 5.6519 4.4181
## NeighborhoodSomerst NeighborhoodStoneBr NeighborhoodSWISU NeighborhoodTimber
## 5.7626 2.4710 3.0440 3.1873
## NeighborhoodVeenker TotRmsAbvGrd
## 1.6637 3.4966
# If any VIF > 5–10, consider multicollinearity
After reviewing stepwise and VIF results, it’s time to manually adjust the model. This refined model includes additional predictors like LotArea and ExterQual, which have been shown in recent real estate valuation research to influence housing prices due to their representation of land value and external build quality (Kostic & Jevremovic, 2020; Lorenz et al., 2022; Root et al., 2023).
Improved Model Fit:
Adjusted R² increased from 0.843 to 0.847, and the residual standard error decreased slightly. This suggests that the new predictors (LotArea, ExterQual) improved explanatory power without overfitting.
Strong Predictors:
Variables like OverallQual, GrLivArea, GarageCars, TotalBsmtSF, YearBuilt, LotArea, and several neighborhoods show strong statistical significance (p < 0.001).
Validated Additions:
ExterQual levels (especially Fa and TA) are statistically significant, justifying their inclusion based on literature and interpretability.
Potential Redundancy:
FullBath remains non-significant (p = 0.89), indicating it may be redundant given overlap with other features like GrLivArea or TotRmsAbvGrd.
Multicollinearity Watchlist:
While most predictors have acceptable VIF scores, a few exceed common thresholds:
NeighborhoodNAmes → VIF = 14.3
ExterQualTA → VIF = 12.3
NeighborhoodOldTown → VIF = 9.7
ExterQualGd → VIF = 9.0
NeighborhoodCollgCr → VIF = 9.0
These may indicate collinearity between factor levels and could be addressed through factor collapsing or regularization.
model2 <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF +
YearBuilt + LotArea + Neighborhood + ExterQual + FullBath, data = train)
summary(model2)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## TotalBsmtSF + YearBuilt + LotArea + Neighborhood + ExterQual +
## FullBath, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86266 -0.06698 0.01260 0.08580 0.51081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.297e+00 6.404e-01 11.393 < 2e-16 ***
## OverallQual 8.715e-02 5.558e-03 15.679 < 2e-16 ***
## GrLivArea 2.179e-04 1.328e-05 16.409 < 2e-16 ***
## GarageCars 7.174e-02 7.750e-03 9.257 < 2e-16 ***
## TotalBsmtSF 8.229e-05 1.266e-05 6.502 1.09e-10 ***
## YearBuilt 1.844e-03 3.226e-04 5.716 1.33e-08 ***
## LotArea 2.274e-06 4.764e-07 4.774 2.00e-06 ***
## NeighborhoodBlueste -8.058e-02 1.180e-01 -0.683 0.494731
## NeighborhoodBrDale -1.955e-01 5.705e-02 -3.427 0.000628 ***
## NeighborhoodBrkSide 2.867e-02 4.898e-02 0.585 0.558359
## NeighborhoodClearCr 1.788e-01 5.156e-02 3.468 0.000540 ***
## NeighborhoodCollgCr 7.285e-02 4.047e-02 1.800 0.072059 .
## NeighborhoodCrawfor 2.230e-01 4.820e-02 4.626 4.07e-06 ***
## NeighborhoodEdwards -4.319e-02 4.487e-02 -0.963 0.335872
## NeighborhoodGilbert 5.514e-02 4.298e-02 1.283 0.199758
## NeighborhoodIDOTRR -1.318e-01 5.216e-02 -2.526 0.011640 *
## NeighborhoodMeadowV -1.362e-01 5.670e-02 -2.402 0.016448 *
## NeighborhoodMitchel 3.279e-02 4.602e-02 0.713 0.476245
## NeighborhoodNAmes 5.271e-02 4.288e-02 1.229 0.219144
## NeighborhoodNoRidge 1.632e-01 4.663e-02 3.500 0.000479 ***
## NeighborhoodNPkVill -3.089e-02 6.615e-02 -0.467 0.640556
## NeighborhoodNridgHt 1.708e-01 4.311e-02 3.961 7.84e-05 ***
## NeighborhoodNWAmes 5.988e-02 4.426e-02 1.353 0.176321
## NeighborhoodOldTown -4.998e-02 4.776e-02 -1.046 0.295508
## NeighborhoodSawyer 4.851e-02 4.537e-02 1.069 0.285248
## NeighborhoodSawyerW 4.145e-02 4.398e-02 0.942 0.346136
## NeighborhoodSomerst 8.779e-02 4.181e-02 2.100 0.035914 *
## NeighborhoodStoneBr 2.012e-01 4.971e-02 4.049 5.43e-05 ***
## NeighborhoodSWISU 3.716e-02 5.537e-02 0.671 0.502253
## NeighborhoodTimber 1.034e-01 4.689e-02 2.206 0.027537 *
## NeighborhoodVeenker 2.278e-01 6.140e-02 3.710 0.000215 ***
## ExterQualFa -2.328e-01 5.400e-02 -4.311 1.74e-05 ***
## ExterQualGd -4.413e-02 2.612e-02 -1.690 0.091338 .
## ExterQualTA -6.443e-02 2.964e-02 -2.174 0.029891 *
## FullBath -1.549e-03 1.149e-02 -0.135 0.892771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1565 on 1425 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.8465
## F-statistic: 237.6 on 34 and 1425 DF, p-value: < 2.2e-16
vif(model2)
## OverallQual GrLivArea GarageCars TotalBsmtSF
## 3.5192 2.8996 1.9977 1.8359
## YearBuilt LotArea NeighborhoodBlueste NeighborhoodBrDale
## 5.6544 1.3467 1.1348 2.1026
## NeighborhoodBrkSide NeighborhoodClearCr NeighborhoodCollgCr NeighborhoodCrawfor
## 5.4532 2.9802 8.9978 4.6679
## NeighborhoodEdwards NeighborhoodGilbert NeighborhoodIDOTRR NeighborhoodMeadowV
## 7.6539 5.6354 4.0054 2.2049
## NeighborhoodMitchel NeighborhoodNAmes NeighborhoodNoRidge NeighborhoodNPkVill
## 4.0929 14.2810 3.5360 1.5975
## NeighborhoodNridgHt NeighborhoodNWAmes NeighborhoodOldTown NeighborhoodSawyer
## 5.5334 5.5449 9.7054 5.9035
## NeighborhoodSawyerW NeighborhoodSomerst NeighborhoodStoneBr NeighborhoodSWISU
## 4.4702 5.7745 2.4783 3.0747
## NeighborhoodTimber NeighborhoodVeenker ExterQualFa ExterQualGd
## 3.3212 1.6798 1.6505 9.0469
## ExterQualTA FullBath
## 12.3300 2.3849
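Having fit the baseline, stepwise, and manually refined models, a quick side-by-side comparison of fit versus complexity can be pulled from the three lm objects with base R. This snippet is illustrative only and assumes model_initial, step_model, and model2 are all still in the workspace:
# Compare the three candidate models on adjusted R², AIC, and residual df
data.frame(
  model    = c("model_initial", "step_model", "model2"),
  adj_R2   = c(summary(model_initial)$adj.r.squared,
               summary(step_model)$adj.r.squared,
               summary(model2)$adj.r.squared),
  AIC      = AIC(model_initial, step_model, model2)$AIC,
  df_resid = c(df.residual(model_initial), df.residual(step_model), df.residual(model2))
)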
While VIF values give numerical indicators of multicollinearity, visual tools help quickly identify clusters of highly correlated predictors. This allows us to detect redundancy in feature space and supports decisions to combine or remove variables.
Correlations above 0.7 or below -0.7 may suggest redundancy. For example, if GrLivArea and TotRmsAbvGrd are highly correlated, only one of them may be needed in the final model.
The heatmap below shows pairwise Pearson correlations between selected numeric variables used in the model:
The diagonal values are all 1.00 (perfect correlation with self).
GrLivArea, GarageCars, and TotalBsmtSF all show moderate correlations with OverallQual (between 0.54 and 0.60).
LotArea and YearBuilt have weak correlations with the other predictors, indicating low redundancy.
No pairwise correlation exceeds 0.7, suggesting that severe multicollinearity is unlikely among these numeric variables.
The bar plot below displays the Variance Inflation Factor (VIF) for each predictor in the refined model. The red dashed line at VIF = 5 serves as a common multicollinearity threshold.
Most predictors have VIF scores under 5, which is generally acceptable.
A few predictors exceed the threshold:
NeighborhoodNAmes (VIF = 14.3)
ExterQualTA (VIF = 12.3)
NeighborhoodOldTown, ExterQualGd, and NeighborhoodCollgCr (VIF ≈ 9)
These high VIFs suggest collinearity between categorical factor levels, particularly in the Neighborhood and ExterQual variables.
These visuals confirm that:
Our numeric predictors are not severely redundant
Some categorical variables warrant simplification or consolidation
The model remains interpretable, but we may consider adjusting for high-VIF features in later stages (e.g., for generalizability or deployment)
# Use base R subsetting to avoid dplyr::select() conflicts
num_vars <- train[, c("OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF", "YearBuilt", "LotArea")]
# Compute correlation matrix
corr_matrix <- cor(num_vars, use = "complete.obs")
# Visualize with corrplot
library(corrplot)
corrplot(corr_matrix, method = "color", type = "upper",
tl.col = "black", tl.cex = 0.9, addCoef.col = "black")
vif_df <- as.data.frame(vif(model2))
vif_df$Variable <- rownames(vif_df)
colnames(vif_df) <- c("VIF", "Variable")
ggplot(vif_df, aes(x = reorder(Variable, VIF), y = VIF)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "VIF Scores by Predictor", x = "Predictor", y = "VIF") +
geom_hline(yintercept = 5, linetype = "dashed", color = "red")
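One way to act on the "simplification or consolidation" point above is to collapse sparse factor levels before refitting. The sketch below uses forcats::fct_lump() to pool infrequent Neighborhood levels into an "Other" category and refits the refined model; the choice of 10 retained levels is illustrative, not taken from the analysis above:
library(forcats)
# Pool all but the 10 most common neighborhoods into "Other" to reduce sparse dummies
train$NeighborhoodLumped <- fct_lump(train$Neighborhood, n = 10, other_level = "Other")
model2_lumped <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF +
                      YearBuilt + LotArea + NeighborhoodLumped + ExterQual + FullBath,
                    data = train)
# Check whether the neighborhood-related VIFs come down after collapsing
vif(model2_lumped)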
After building and optimizing a regression model, it’s essential to validate that the model assumptions are satisfied:
Linearity
Homoscedasticity (constant variance of residuals)
Normality of residuals
Independence of residuals
Absence of influential outliers
Residuals vs Fitted
A roughly horizontal band indicates that the linearity assumption is reasonable, though there may be slight funneling at the right, suggesting mild heteroscedasticity.
Normal Q-Q Plot
Most points lie on the diagonal, but some deviation is visible in the tails. This suggests mild non-normality, which may not be critical for predictive performance but can affect inference and p-value reliability.
Scale-Location Plot
A flat trend suggests constant variance (homoscedasticity). Here, there is a slight upward curve, indicating increasing variance with fitted values, a mild violation worth noting.
Residuals vs Leverage
A few observations have high leverage (e.g., IDs 524, 633, 1299), but none exceed the Cook's distance threshold. These points may merit closer inspection but are not extreme enough to warrant removal.
A Shapiro-Wilk test is used to formally assess the normality of residuals.
Result: W = 0.898, p < 2.2e-16
This result rejects the null hypothesis of normality. However, with large samples (n > 1000), minor deviations are common and usually not practically concerning, especially in predictive models.
Most model assumptions are reasonably satisfied.
Mild signs of heteroscedasticity and non-normality are present but within acceptable bounds for many applied contexts.
No leverage or influence points are extreme enough to compromise model validity.
The model remains robust and trustworthy for prediction, though caution should be used when interpreting individual p-values or making strong inferential claims.
par(mfrow = c(2, 2))
plot(step_model)
shapiro.test(residuals(step_model)) # Normality
##
## Shapiro-Wilk normality test
##
## data: residuals(step_model)
## W = 0.89809, p-value < 2.2e-16
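The Scale-Location plot suggested mild heteroscedasticity; if a formal numerical check is wanted, a Breusch-Pagan test is one option. This test is not part of the original analysis and assumes the lmtest package is installed:
library(lmtest)
# Breusch-Pagan test: H0 = residual variance is constant (homoscedasticity)
bptest(step_model)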
Cross-validation provides a more robust estimate of how the model will generalize to unseen data. Instead of evaluating performance on a single train-test split, we use k-fold cross-validation to average model error across multiple folds, helping reduce the variance of our performance metrics.
10-fold cross-validation is a widely accepted standard in statistical modeling and machine learning (Gao & Liu, 2022; Olanlokun et al., 2024; Wulandari et al., 2023; Yates et al., 2023; Zhang & Li, 2021). It strikes a practical balance between bias and variance (Gao & Liu, 2022; Olanlokun et al., 2024; Yates et al., 2023):
Using too few folds (e.g., 2 or 5) may result in higher variance and unstable error estimates (Olanlokun et al., 2024; Yates et al., 2023; Zhang & Li, 2021).
Using too many folds (e.g., leave-one-out) reduces bias but greatly increases computation and can still result in high variance (Gao & Liu, 2022; Olanlokun et al., 2024; Zhang & Li, 2021).
By partitioning the data into 10 equally sized subsets (folds), the model is trained on 90% of the data and tested on 10%, rotating through all folds (Gao & Liu, 2022; Wulandari et al., 2023). This ensures:
Every observation is used once as a test case (Olanlokun et al., 2024; Wulandari et al., 2023; Yates et al., 2023)
The model is evaluated across diverse subsamples (Wulandari et al., 2023; Yates et al., 2023)
We obtain a more reliable estimate of real-world predictive performance (Gao & Liu, 2022; Olanlokun et al., 2024; Zhang & Li, 2021)
In this case, 10-fold cross-validation is both efficient and statistically sound for a dataset of this size (n = 1,460) (Olanlokun et al., 2024; Yates et al., 2023; Zhang & Li, 2021).
set.seed(42)
ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(formula(step_model), data = train, method = "lm", trControl = ctrl)
cv_model
## Linear Regression
##
## 1460 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1315, 1314, 1315, 1314, 1314, 1314, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1624555 0.8340057 0.1117906
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Now that the model has been validated, we apply it to the cleaned test dataset to generate predicted house prices. The purpose here is to evaluate the distribution and range of the model’s predictions, ensuring they are realistic and aligned with housing market expectations.
To maintain consistency, we apply the same preprocessing steps used on the training data (a short sketch of this mirroring step follows below):
Structural NAs in categorical fields (e.g., GarageType, PoolQC) are replaced with "None"
Numeric fields (e.g., LotFrontage, MasVnrArea, GarageYrBlt) are imputed using the median
MasVnrType is filled with "None", and Electrical with the most frequent category
We generate log-scale predictions using the final model, then reverse the log transformation to report sale prices in dollars.
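The test-set cleaning code is not shown in the original chunk; a minimal sketch of mirroring the training-set treatment might look like the following. It reuses the cat_na_none, num_impute, and categorical_vars objects defined earlier, and pulls medians and the Electrical mode from the training data to avoid leakage:
# Mirror the training-set cleaning on the test set (sketch; assumes objects from earlier chunks)
test[cat_na_none] <- lapply(test[cat_na_none], function(x) replace_na(x, "None"))
for (col in num_impute) {
  test[[col]][is.na(test[[col]])] <- median(train[[col]], na.rm = TRUE)  # training medians
}
test$MasVnrType <- replace_na(test$MasVnrType, "None")
test$Electrical <- replace_na(test$Electrical, names(which.max(table(train$Electrical))))
# Align factor levels with the training data so predict() can build the model matrix;
# levels unseen in training become NA, one possible source of NA predictions
test[categorical_vars] <- lapply(categorical_vars,
                                 function(v) factor(test[[v]], levels = levels(train[[v]])))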
Distribution Shape
The predicted sale prices show a right-skewed distribution, which is expected for housing markets where most homes fall in a mid-range price band and a few high-value properties extend the upper tail. The peak frequency occurs between $140,000 and $180,000, which aligns with typical sale prices in Ames, Iowa at the time the dataset was collected (circa 2006–2010). While these figures may appear low by today’s market standards, they accurately reflect the housing economics captured in the original data (De Cock, 2011).
Summary Statistics
Minimum Predicted Price: $60,306
Median: $158,824
Mean: $176,842
Maximum: $837,193
These values are within a realistic range and are comparable to the training data, indicating the model generalizes well and hasn’t produced outlier inflation or compression.
test_preds <- predict(step_model, newdata = test)
predicted_prices <- exp(test_preds) - 1
# Summary statistics
summary(predicted_prices)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 60306 125141 158824 176842 207852 837193 2
# Histogram of predicted prices
hist(predicted_prices, breaks = 30,
main = "Predicted Sale Prices (Test Set)",
xlab = "Sale Price ($)",
col = "lightblue", border = "white",
xaxt = "n") # remove default axis
axis(1, at = pretty(predicted_prices),
labels = dollar(pretty(predicted_prices)))
Why is this model optimal?
The final model balances statistical rigor, domain relevance, and generalizability. It retains only predictors that offer meaningful, non-redundant explanatory power, supported by both economic theory and empirical validation.
| Metric | Value | Interpretation |
|---|---|---|
| Adjusted R² | 0.847 | Explains ~85% of the variance in log-transformed sale prices |
| F-statistic | 237.6 (p < 2.2e-16) | Strong global model significance |
| Residual degrees of freedom | 1,460 obs - 35 estimated coefficients = 1,425 | Maintains sufficient df for reliable estimates |
| Cross-validated R² | 0.834 | Confirms generalization on unseen data |
| RMSE (CV) | 0.1625 | Low average error on the log scale |
| Residual Standard Error | 0.1565 | Tight residual spread after optimization |
Balance Achieved
A parsimonious yet expressive model. Predictors like OverallQual, GrLivArea, YearBuilt, and Neighborhood add explanatory depth, while cross-validation confirms their reliability. Importantly, degrees of freedom remained high, avoiding overfitting.
Multicollinearity was managed through VIF analysis (most VIFs < 5), and residual diagnostics confirmed model assumptions were not violated. The final model reflects both statistical soundness and practical interpretability.
What made this difficult?
Optimizing a multiple regression model is fundamentally a tradeoff between complexity and parsimony. Early iterations with too many predictors inflated variance and reduced degrees of freedom, despite high R². The most important lessons were:
Degrees of Freedom (df):
As predictors increased, df decreased, which led to larger standard errors and reduced model reliability. Conscientiously limiting the number of predictors without underfitting was therefore required.
F-statistic & p-values:
The F-statistic was used to assess overall significance, but it improved only when added variables were non-redundant. Some predictors (e.g., FullBath) showed high correlation with stronger variables and offered little incremental value.
Stepwise AIC vs. Manual Control:
Stepwise selection via AIC was efficient but sometimes retained variables with marginal significance. Manual refinement using VIF and domain logic was essential to eliminate redundancy and interpret the model transparently.
Cross-Validation as Ground Truth:
Ultimately, 10-fold cross-validation anchored our choices, revealing whether gains in training fit translated to real predictive performance. This was the decisive tool in confirming model robustness.
Key Takeaway: Degrees of freedom, F-statistics, and predictive accuracy are not isolated metrics; they interact. Iteratively testing, removing, and validating variables was required to find the sweet spot where generalizability, interpretability, and performance converge.
De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3). https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627
Gao, Y., & Liu, Y. (2022). Cross validation methods: Analysis based on probabilistic machine learning and information theory. IEEE Access, 10, 92813–92825. https://doi.org/10.1109/ACCESS.2022.3202362
Kaggle. (2016). House Prices - Advanced Regression Techniques [Data set]. https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data
Kalnins, A., & Praitis Hill, K. (2025). The VIF score. What is it good for? Absolutely nothing. Organizational Research Methods. Advance online publication. https://doi.org/10.1177/10944281231216381
Kostic, Z., & Jevremovic, A. (2020). What image features boost housing market predictions? [Preprint]. arXiv. https://arxiv.org/abs/2107.07148
Kyriazos, T., & Poga, M. (2023). Dealing with multicollinearity in factor analysis: The problem, detections, and solutions. Open Journal of Statistics, 13(3), 404–424. https://doi.org/10.4236/ojs.2023.133020
Memon, S. M. Z., Wamala, R., & Kabano, I. H. (2023). A comparison of imputation methods for categorical data. Informatics in Medicine Unlocked, 42, 101382. https://doi.org/10.1016/j.imu.2023.101382
Mohammed, M. B., Zulkafli, H. S., Adam, M. B., Ali, N., & Baba, I. A. (2021). Comparison of five imputation methods in handling missing data in a continuous frequency table. AIP Conference Proceedings, 2355(1), 040006. https://doi.org/10.1063/5.0053286
Neves, F. T., Aparicio, M., & Neto, M. d. C. (2024). The impacts of open data and eXplainable AI on real estate price predictions in smart cities. Applied Sciences, 14(5), Article 2209. https://doi.org/10.3390/app14052209
Olanlokun, O. B., Ogunlana, O. O., Ayinde, S. A., & Akarawak, E. E. (2024). Comparative analysis of cross-validation techniques: LOOCV, k-folds cross-validation and repeated k-folds cross-validation in machine learning models. American Journal of Theoretical and Applied Statistics, 13(5), 97–106. https://doi.org/10.11648/j.ajtas.20241305.13
Root, T. H., Strader, T. J., & Huang, Y.-H. (2023). A review of machine learning approaches for real estate valuation. Journal of the Midwest Association for Information Systems, 2023(2), 9–28. https://doi.org/10.17705/3jmwa.000082
Salmerón, R. (2020). Overcoming the inconsistences of the variance inflation factor: A redefined VIF and a test to detect statistical troubling multicollinearity. arXiv. https://doi.org/10.48550/arXiv.2005.02245
Salmerón-Gómez, R., García-García, C. B., & García-Pérez, J. (2025). A redefined variance inflation factor: Overcoming the limitations of the variance inflation factor. Computational Economics. Advance online publication. https://doi.org/10.1007/s10614-024-10575-8
Wulandari, C. P., Putri, A. S., & Kurniawan, R. (2023). Optimization of random forest hyperparameters based on genetic algorithm using k-fold cross validation in predicting customer churn. Procedia Computer Science, 216, 67–74. https://doi.org/10.1016/j.procs.2022.12.114
Yates, L. A., Aandahl, Z., Richards, S. A., & Brook, B. W. (2023). Cross validation for model selection: A review with examples from ecology. Ecological Monographs, 93(1), Article e1557. https://doi.org/10.1002/ecm.1557
Yoshida, T., & Seya, H. (2021). Spatial prediction of apartment rent using regression-based and machine learning-based approaches with a large dataset [Preprint]. arXiv. https://arxiv.org/abs/2107.12539
Zhang, Y., & Li, J. (2021). A comprehensive survey on cross-validation in machine learning: Methods and applications. Journal of Physics: Conference Series, 1955(1), Article 012013. https://doi.org/10.1088/1742-6596/1955/1/012013