In this report, an in-depth multiple regression analysis using the House Prices: Advanced Regression Techniques dataset (Kaggle, 2016) is conducted. This dataset provides rich features related to residential homes in Ames, Iowa. Our objective is to build and optimize a predictive model of house prices, balancing degrees of freedom (df), the F-statistic, and model complexity (number of predictors).
The goals are to:
Select meaningful predictors based on domain knowledge and EDA
Use stepwise and manual refinement methods to optimize the model
Justify model evolution with statistical diagnostics
Validate the final model using cross-validation and residual diagnostics
knitr::opts_chunk$set(echo = TRUE)
setwd("E:/DataSets/House prices advanced regression techniques")
train <- read.csv("train.csv")
test <- read.csv("test.csv")
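The chunks below call functions from several add-on packages: replace_na() from tidyr, stepAIC() from MASS, vif() (commonly from car), train() and trainControl() from caret, comma() and dollar() from scales, plus ggplot2 and corrplot for plotting. Loading them once in the setup chunk, as sketched here, keeps the later chunks self-contained (this assumes the packages are installed):
library(tidyr)     # replace_na()
library(MASS)      # stepAIC()
library(car)       # vif()
library(caret)     # train(), trainControl()
library(scales)    # comma(), dollar() axis labels
library(ggplot2)   # VIF bar plot
library(corrplot)  # correlation heatmap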
Before building any model, it’s essential to understand the structure of the dataset — how many observations and variables it contains, and what types those variables are (e.g., numeric, categorical, character). This guides decisions about preprocessing, encoding, and statistical modeling.
Additionally, checking for missing data ensures that we can handle it properly before training the model. Left untreated, missing values can lead to biased estimates or errors in fitting functions.
The str() output shows:
The dataset contains 1,460 observations and 81 variables
Variables vary in type: integers and characters (categorical fields read in as text)
Several variables, such as Alley, PoolQC, and FireplaceQu, include NA values
The missing_vals output identifies which columns contain missing data and how many missing entries each has. For example:
PoolQC, MiscFeature, and Alley have many missing entries, often meaning “none” rather than true missingness
LotFrontage and GarageYrBlt have fewer missing values, possibly true gaps that require imputation
Understanding both the structure and missingness allows us to:
Plan appropriate imputation or exclusion
Treat informative NAs (like “no pool”) differently from unknown NAs
Avoid downstream issues when encoding or modeling
str(train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
# Missing value check
missing_vals <- sort(sapply(train, function(x) sum(is.na(x))), decreasing = TRUE)
missing_vals[missing_vals > 0]
## PoolQC MiscFeature Alley Fence FireplaceQu LotFrontage
## 1453 1406 1369 1179 690 259
## GarageType GarageYrBlt GarageFinish GarageQual GarageCond BsmtExposure
## 81 81 81 81 81 38
## BsmtFinType2 BsmtQual BsmtCond BsmtFinType1 MasVnrType MasVnrArea
## 38 37 37 37 8 8
## Electrical
## 1
Missing values can lead to errors in model training or introduce bias if not handled appropriately. In this dataset, missingness arises from two distinct causes:
Some missing values are structural, meaning they indicate the absence of a feature (e.g., no garage, no fireplace). These are common in categorical variables and should be replaced with an explicit label like “None”.
Others are true missing values in numerical fields (e.g., LotFrontage) and should be imputed to preserve the sample size. Using the median for imputation helps mitigate the influence of outliers.
For categorical features like FireplaceQu and PoolQC, we assume that NA means the feature is not present in the house. These are recoded as “None”.
For numerical fields like LotFrontage, median imputation avoids skewing the data compared to mean imputation (Mohammed et al., 2021).
The variable Electrical has very few missing entries, so it is imputed with the mode (most frequent category), which is safe and preserves category integrity (Memon et al., 2023).
This process ensures the dataset is complete and model-ready, with minimal introduction of bias or distortion.
cat_na_none <- c("Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2",
"FireplaceQu", "GarageType", "GarageFinish", "GarageQual", "GarageCond",
"PoolQC", "Fence", "MiscFeature")
train[cat_na_none] <- lapply(train[cat_na_none], function(x) replace_na(x, "None"))
# Numeric median imputation
num_impute <- c("LotFrontage", "MasVnrArea", "GarageYrBlt")
for (col in num_impute) {
train[[col]][is.na(train[[col]])] <- median(train[[col]], na.rm = TRUE)
}
# Other fixes
train$MasVnrType <- replace_na(train$MasVnrType, "None")
train$Electrical <- replace_na(train$Electrical, names(which.max(table(train$Electrical))))
In R, character columns are not automatically treated as categorical variables in modeling. Many machine learning and regression functions require categorical predictors to be explicitly encoded as factors, which ensures appropriate handling of levels in model design matrices.
Additionally, the target variable SalePrice is right-skewed, which violates normality assumptions in linear regression. Applying a log transformation reduces skewness, stabilizes variance, and helps improve model fit.
The first step converts all character-type variables into factors, ensuring they’re treated correctly in regression modeling (e.g., as dummy variables). This is crucial for categorical fields like Neighborhood, Exterior1st, and SaleType.
The log transformation of SalePrice compresses the range of large values, making the distribution more symmetric and suitable for modeling under the assumptions of linear regression.
Linear regression assumes normally distributed residuals. Log-transforming a skewed dependent variable like sale price improves this assumption, often enhancing R², reducing heteroscedasticity, and improving interpretability in percentage change terms.
## 1. Convert character columns to factors (single pass)
categorical_vars <- names(train)[sapply(train, is.character)]
train[categorical_vars] <- lapply(train[categorical_vars], as.factor)
## 2. Keep an unaltered copy of SalePrice for “before” plots
raw_price <- train$SalePrice # dollar scale
## 3. Set up a 2×2 grid: two plots before, two plots after
par(mfrow = c(2, 2))
## ── BEFORE log transform ────────────────────────────────────────────────
hist(raw_price,
breaks = 30,
main = "Original SalePrice",
xlab = "Sale Price ($)",
col = "lightblue",
xaxt = "n")
axis(1,
at = pretty(raw_price),
labels = comma(pretty(raw_price))) # $60,000 $100,000 …
plot(density(raw_price),
main = "Density: Original SalePrice",
xlab = "Sale Price ($)",
xaxt = "n")
axis(1,
at = pretty(raw_price),
labels = comma(pretty(raw_price)))
## ── AFTER log transform ────────────────────────────────────────────────
train$SalePrice <- log1p(train$SalePrice) # log(SalePrice + 1)
hist(train$SalePrice,
breaks = 30,
main = "Log-Transformed SalePrice",
xlab = "log(SalePrice + 1)",
col = "lightgreen",
xaxt = "n")
axis(1,
at = pretty(train$SalePrice),
labels = round(pretty(train$SalePrice), 2)) # 11.0, 11.5, …
plot(density(train$SalePrice),
main = "Density: Log-Transformed SalePrice",
xlab = "log(SalePrice + 1)")
The initial regression model sets the foundation for understanding which features meaningfully contribute to home prices (Lorenz et al., 2022; Root et al., 2023; Yoshida & Seya, 2021). This first-pass model is intentionally constrained to a manageable number of predictors, selected based on both housing economics literature and domain knowledge (Kostic & Jevremovic, 2020; Neves et al., 2024).
The goal here is not to find the “best” model yet, but to start with interpretable, high-signal predictors and iteratively improve from there (Kostic & Jevremovic, 2020; Lorenz et al., 2022; Neves et al., 2024; Root et al., 2023; Yoshida & Seya, 2021).
summary(model_initial) provides the initial coefficients, p-values, R², and F-statistic.
Significant p-values for most predictors would validate their importance.
The F-statistic will test overall model significance.
This model will serve as a baseline against which we’ll compare more refined versions.
Each predictor was chosen based on strong empirical or theoretical links to housing value:
| Predictor | Rationale |
|---|---|
| OverallQual | Reflects materials and finish, a strong proxy for perceived and appraised quality. |
| GrLivArea | Captures above-ground living space, a dominant factor in home value across all markets. |
| TotalBsmtSF | Adds value in cold climates and in homes with finished basements. |
| GarageCars | Garage capacity correlates with suburban buyer preferences and resale value. |
| YearBuilt | Reflects the age and modernity of the structure; newer homes typically command price premiums. |
| Neighborhood | Controls for location, which is consistently one of the strongest drivers of price. |
| FullBath | Bathroom count affects functionality and buyer appeal. |
| TotRmsAbvGrd | Total rooms above grade is a general size indicator complementary to square footage. |
Overall Fit
The model achieves an R² of 0.846 and an adjusted R² of 0.843, meaning the selected predictors explain approximately 84% of the variance in log-transformed sale prices. This indicates a strong initial model fit for real-world data.
F-statistic
The F-statistic of 252.9 (p < 2.2e-16) confirms that the model, as a whole, is statistically significant, meaning at least one predictor contributes meaningful explanatory power beyond chance.
Significant Predictors
Several variables stand out as statistically significant predictors of sale price, including:
OverallQual, GrLivArea, GarageCars, TotalBsmtSF, and YearBuilt (all p < 0.001)
Neighborhood levels NoRidge, NridgHt, Crawfor, and StoneBr also show strong positive associations
NeighborhoodBrDale, IDOTRR, and MeadowV are significantly negative
Non-Significant Predictors
Some variables show weak or no significant association:
FullBath (p = 0.82) and TotRmsAbvGrd (p = 0.13) are not statistically significant at conventional thresholds
Some neighborhood levels (OldTown, SawyerW) are also not significant, potentially due to multicollinearity or low sample sizes in those categories
Residuals
The residual distribution looks relatively symmetric, with a residual standard error of 0.1585, indicating acceptable spread in the residuals for a log-transformed target.
initial_formula <- SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF +
YearBuilt + Neighborhood + FullBath + TotRmsAbvGrd
model_initial <- lm(initial_formula, data = train)
summary(model_initial)
##
## Call:
## lm(formula = initial_formula, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77477 -0.07102 0.01128 0.08778 0.50001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.430e+00 6.331e-01 11.736 < 2e-16 ***
## OverallQual 9.230e-02 5.316e-03 17.362 < 2e-16 ***
## GrLivArea 2.053e-04 1.841e-05 11.155 < 2e-16 ***
## GarageCars 7.495e-02 7.830e-03 9.573 < 2e-16 ***
## TotalBsmtSF 9.906e-05 1.258e-05 7.875 6.71e-15 ***
## YearBuilt 1.713e-03 3.200e-04 5.352 1.01e-07 ***
## NeighborhoodBlueste -8.377e-02 1.191e-01 -0.703 0.481936
## NeighborhoodBrDale -1.994e-01 5.698e-02 -3.500 0.000480 ***
## NeighborhoodBrkSide 2.679e-02 4.935e-02 0.543 0.587285
## NeighborhoodClearCr 2.319e-01 5.079e-02 4.566 5.40e-06 ***
## NeighborhoodCollgCr 8.809e-02 4.081e-02 2.158 0.031061 *
## NeighborhoodCrawfor 2.314e-01 4.860e-02 4.762 2.11e-06 ***
## NeighborhoodEdwards -3.047e-02 4.488e-02 -0.679 0.497291
## NeighborhoodGilbert 7.366e-02 4.300e-02 1.713 0.086892 .
## NeighborhoodIDOTRR -1.472e-01 5.227e-02 -2.816 0.004933 **
## NeighborhoodMeadowV -1.277e-01 5.692e-02 -2.243 0.025062 *
## NeighborhoodMitchel 4.540e-02 4.565e-02 0.995 0.320115
## NeighborhoodNAmes 6.066e-02 4.270e-02 1.420 0.155702
## NeighborhoodNoRidge 1.811e-01 4.738e-02 3.824 0.000137 ***
## NeighborhoodNPkVill -3.895e-02 6.614e-02 -0.589 0.556007
## NeighborhoodNridgHt 1.895e-01 4.283e-02 4.423 1.05e-05 ***
## NeighborhoodNWAmes 6.479e-02 4.384e-02 1.478 0.139669
## NeighborhoodOldTown -5.276e-02 4.808e-02 -1.097 0.272721
## NeighborhoodSawyer 5.922e-02 4.518e-02 1.311 0.190103
## NeighborhoodSawyerW 5.315e-02 4.431e-02 1.199 0.230544
## NeighborhoodSomerst 1.036e-01 4.229e-02 2.450 0.014389 *
## NeighborhoodStoneBr 2.174e-01 5.028e-02 4.324 1.64e-05 ***
## NeighborhoodSWISU 3.019e-02 5.579e-02 0.541 0.588509
## NeighborhoodTimber 1.416e-01 4.652e-02 3.044 0.002377 **
## NeighborhoodVeenker 2.540e-01 6.207e-02 4.092 4.51e-05 ***
## FullBath -2.709e-03 1.170e-02 -0.232 0.816943
## TotRmsAbvGrd 7.297e-03 4.814e-03 1.516 0.129748
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1585 on 1428 degrees of freedom
## Multiple R-squared: 0.8459, Adjusted R-squared: 0.8426
## F-statistic: 252.9 on 31 and 1428 DF, p-value: < 2.2e-16
This initial model confirms that key structural and location-based attributes, such as OverallQual, GrLivArea, and Neighborhood, are statistically strong and reliable drivers of home price. While some predictors like FullBath and TotRmsAbvGrd may seem conceptually relevant, their weak statistical performance suggests they may be redundant or collinear with other features (e.g., GrLivArea or OverallQual).
Initial models provide a baseline, but they often include predictors that are redundant, insignificant, or collinear. This step focuses on refining the model by balancing statistical significance, parsimony, and predictive strength.
Both automated and manual optimization are used to balance statistical efficiency with practical relevance. Stepwise selection improves model fit algorithmically. Manual refinement incorporates domain expertise, reduces multicollinearity, and preserves interpretability.
Stepwise regression is deployed, which automatically adds or removes predictors to minimize the Akaike Information Criterion (AIC), a measure of model quality that penalizes complexity.
step_model <- stepAIC(model_initial, direction = "both", trace = FALSE)
summary(step_model)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## TotalBsmtSF + YearBuilt + Neighborhood + TotRmsAbvGrd, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77088 -0.07051 0.01122 0.08776 0.49651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.455e+00 6.239e-01 11.948 < 2e-16 ***
## OverallQual 9.231e-02 5.314e-03 17.371 < 2e-16 ***
## GrLivArea 2.042e-04 1.768e-05 11.546 < 2e-16 ***
## GarageCars 7.495e-02 7.827e-03 9.575 < 2e-16 ***
## TotalBsmtSF 9.925e-05 1.255e-05 7.908 5.21e-15 ***
## YearBuilt 1.699e-03 3.144e-04 5.404 7.62e-08 ***
## NeighborhoodBlueste -8.311e-02 1.190e-01 -0.698 0.485100
## NeighborhoodBrDale -1.983e-01 5.675e-02 -3.494 0.000491 ***
## NeighborhoodBrkSide 2.763e-02 4.920e-02 0.562 0.574517
## NeighborhoodClearCr 2.326e-01 5.067e-02 4.591 4.80e-06 ***
## NeighborhoodCollgCr 8.843e-02 4.077e-02 2.169 0.030262 *
## NeighborhoodCrawfor 2.322e-01 4.848e-02 4.789 1.85e-06 ***
## NeighborhoodEdwards -2.978e-02 4.477e-02 -0.665 0.506011
## NeighborhoodGilbert 7.376e-02 4.298e-02 1.716 0.086360 .
## NeighborhoodIDOTRR -1.463e-01 5.211e-02 -2.807 0.005064 **
## NeighborhoodMeadowV -1.265e-01 5.669e-02 -2.232 0.025772 *
## NeighborhoodMitchel 4.609e-02 4.554e-02 1.012 0.311742
## NeighborhoodNAmes 6.174e-02 4.243e-02 1.455 0.145898
## NeighborhoodNoRidge 1.819e-01 4.724e-02 3.852 0.000123 ***
## NeighborhoodNPkVill -3.988e-02 6.599e-02 -0.604 0.545779
## NeighborhoodNridgHt 1.898e-01 4.280e-02 4.433 1.00e-05 ***
## NeighborhoodNWAmes 6.491e-02 4.382e-02 1.481 0.138756
## NeighborhoodOldTown -5.222e-02 4.801e-02 -1.088 0.276946
## NeighborhoodSawyer 6.026e-02 4.494e-02 1.341 0.180134
## NeighborhoodSawyerW 5.359e-02 4.426e-02 1.211 0.226159
## NeighborhoodSomerst 1.036e-01 4.227e-02 2.450 0.014409 *
## NeighborhoodStoneBr 2.178e-01 5.024e-02 4.335 1.56e-05 ***
## NeighborhoodSWISU 3.044e-02 5.576e-02 0.546 0.585220
## NeighborhoodTimber 1.419e-01 4.649e-02 3.052 0.002316 **
## NeighborhoodVeenker 2.552e-01 6.185e-02 4.126 3.90e-05 ***
## TotRmsAbvGrd 7.153e-03 4.772e-03 1.499 0.134065
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1584 on 1429 degrees of freedom
## Multiple R-squared: 0.8459, Adjusted R-squared: 0.8427
## F-statistic: 261.5 on 30 and 1429 DF, p-value: < 2.2e-16
To further assess model quality, calculate the Variance Inflation Factor (VIF) for each predictor (Kalnins & Praitis Hill, 2025; Kyriazos & Poga, 2023; Salmerón, 2020; Salmerón-Gómez et al., 2025).
The Variance Inflation Factor (VIF) quantifies how much a predictor’s variance is inflated due to multicollinearity with other predictors; it is commonly used during model diagnostics to detect and mitigate redundancy in multiple regression models, ensuring coefficient estimates remain stable and interpretable.
VIF values above 5–10 suggest multicollinearity, which can inflate standard errors and distort coefficient estimates (Kalnins & Praitis Hill, 2025; Kyriazos & Poga, 2023; Salmerón, 2020; Salmerón-Gómez et al., 2025).
vif(step_model)
## OverallQual GrLivArea GarageCars TotalBsmtSF
## 3.1399 5.0192 1.9888 1.7622
## YearBuilt NeighborhoodBlueste NeighborhoodBrDale NeighborhoodBrkSide
## 5.2414 1.1273 2.0305 5.3710
## NeighborhoodClearCr NeighborhoodCollgCr NeighborhoodCrawfor NeighborhoodEdwards
## 2.8091 8.9150 4.6084 7.4388
## NeighborhoodGilbert NeighborhoodIDOTRR NeighborhoodMeadowV NeighborhoodMitchel
## 5.4995 3.9020 2.1516 3.9134
## NeighborhoodNAmes NeighborhoodNoRidge NeighborhoodNPkVill NeighborhoodNridgHt
## 13.6530 3.5425 1.5521 5.3242
## NeighborhoodNWAmes NeighborhoodOldTown NeighborhoodSawyer NeighborhoodSawyerW
## 5.3064 9.5738 5.6519 4.4181
## NeighborhoodSomerst NeighborhoodStoneBr NeighborhoodSWISU NeighborhoodTimber
## 5.7626 2.4710 3.0440 3.1873
## NeighborhoodVeenker TotRmsAbvGrd
## 1.6637 3.4966
# If any VIF > 5–10, consider multicollinearity
After reviewing stepwise and VIF results, it’s time to manually adjust the model. This refined model includes additional predictors like LotArea and ExterQual, which have been shown in recent real estate valuation research to influence housing prices due to their representation of land value and external build quality (Kostic & Jevremovic, 2020; Lorenz et al., 2022; Root et al., 2023).
Improved Model Fit:
Adjusted R² increased from 0.843 to 0.847, and the residual standard error decreased slightly. This suggests that the new predictors (LotArea, ExterQual) improved explanatory power without overfitting.
Strong Predictors:
Variables like OverallQual, GrLivArea, GarageCars, TotalBsmtSF, YearBuilt, LotArea, and several neighborhoods show strong statistical significance (p < 0.001).
Validated Additions:
ExterQual levels (especially Fa and TA) are statistically significant, justifying their inclusion based on literature and interpretability.
Potential Redundancy:
FullBath remains non-significant (p = 0.89), indicating it may be redundant given overlap with other features like GrLivArea or TotRmsAbvGrd.
Multicollinearity Watchlist:
While most predictors have acceptable VIF scores, a few exceed common thresholds:
NeighborhoodNAmes → VIF = 14.3
ExterQualTA → VIF = 12.3
NeighborhoodOldTown → VIF = 9.7
ExterQualGd → VIF = 9.0
NeighborhoodCollgCr → VIF = 9.0
These may indicate collinearity between factor levels and could be addressed through factor collapsing or regularization.
model2 <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF +
YearBuilt + LotArea + Neighborhood + ExterQual + FullBath, data = train)
summary(model2)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## TotalBsmtSF + YearBuilt + LotArea + Neighborhood + ExterQual +
## FullBath, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86266 -0.06698 0.01260 0.08580 0.51081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.297e+00 6.404e-01 11.393 < 2e-16 ***
## OverallQual 8.715e-02 5.558e-03 15.679 < 2e-16 ***
## GrLivArea 2.179e-04 1.328e-05 16.409 < 2e-16 ***
## GarageCars 7.174e-02 7.750e-03 9.257 < 2e-16 ***
## TotalBsmtSF 8.229e-05 1.266e-05 6.502 1.09e-10 ***
## YearBuilt 1.844e-03 3.226e-04 5.716 1.33e-08 ***
## LotArea 2.274e-06 4.764e-07 4.774 2.00e-06 ***
## NeighborhoodBlueste -8.058e-02 1.180e-01 -0.683 0.494731
## NeighborhoodBrDale -1.955e-01 5.705e-02 -3.427 0.000628 ***
## NeighborhoodBrkSide 2.867e-02 4.898e-02 0.585 0.558359
## NeighborhoodClearCr 1.788e-01 5.156e-02 3.468 0.000540 ***
## NeighborhoodCollgCr 7.285e-02 4.047e-02 1.800 0.072059 .
## NeighborhoodCrawfor 2.230e-01 4.820e-02 4.626 4.07e-06 ***
## NeighborhoodEdwards -4.319e-02 4.487e-02 -0.963 0.335872
## NeighborhoodGilbert 5.514e-02 4.298e-02 1.283 0.199758
## NeighborhoodIDOTRR -1.318e-01 5.216e-02 -2.526 0.011640 *
## NeighborhoodMeadowV -1.362e-01 5.670e-02 -2.402 0.016448 *
## NeighborhoodMitchel 3.279e-02 4.602e-02 0.713 0.476245
## NeighborhoodNAmes 5.271e-02 4.288e-02 1.229 0.219144
## NeighborhoodNoRidge 1.632e-01 4.663e-02 3.500 0.000479 ***
## NeighborhoodNPkVill -3.089e-02 6.615e-02 -0.467 0.640556
## NeighborhoodNridgHt 1.708e-01 4.311e-02 3.961 7.84e-05 ***
## NeighborhoodNWAmes 5.988e-02 4.426e-02 1.353 0.176321
## NeighborhoodOldTown -4.998e-02 4.776e-02 -1.046 0.295508
## NeighborhoodSawyer 4.851e-02 4.537e-02 1.069 0.285248
## NeighborhoodSawyerW 4.145e-02 4.398e-02 0.942 0.346136
## NeighborhoodSomerst 8.779e-02 4.181e-02 2.100 0.035914 *
## NeighborhoodStoneBr 2.012e-01 4.971e-02 4.049 5.43e-05 ***
## NeighborhoodSWISU 3.716e-02 5.537e-02 0.671 0.502253
## NeighborhoodTimber 1.034e-01 4.689e-02 2.206 0.027537 *
## NeighborhoodVeenker 2.278e-01 6.140e-02 3.710 0.000215 ***
## ExterQualFa -2.328e-01 5.400e-02 -4.311 1.74e-05 ***
## ExterQualGd -4.413e-02 2.612e-02 -1.690 0.091338 .
## ExterQualTA -6.443e-02 2.964e-02 -2.174 0.029891 *
## FullBath -1.549e-03 1.149e-02 -0.135 0.892771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1565 on 1425 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.8465
## F-statistic: 237.6 on 34 and 1425 DF, p-value: < 2.2e-16
vif(model2)
## OverallQual GrLivArea GarageCars TotalBsmtSF
## 3.5192 2.8996 1.9977 1.8359
## YearBuilt LotArea NeighborhoodBlueste NeighborhoodBrDale
## 5.6544 1.3467 1.1348 2.1026
## NeighborhoodBrkSide NeighborhoodClearCr NeighborhoodCollgCr NeighborhoodCrawfor
## 5.4532 2.9802 8.9978 4.6679
## NeighborhoodEdwards NeighborhoodGilbert NeighborhoodIDOTRR NeighborhoodMeadowV
## 7.6539 5.6354 4.0054 2.2049
## NeighborhoodMitchel NeighborhoodNAmes NeighborhoodNoRidge NeighborhoodNPkVill
## 4.0929 14.2810 3.5360 1.5975
## NeighborhoodNridgHt NeighborhoodNWAmes NeighborhoodOldTown NeighborhoodSawyer
## 5.5334 5.5449 9.7054 5.9035
## NeighborhoodSawyerW NeighborhoodSomerst NeighborhoodStoneBr NeighborhoodSWISU
## 4.4702 5.7745 2.4783 3.0747
## NeighborhoodTimber NeighborhoodVeenker ExterQualFa ExterQualGd
## 3.3212 1.6798 1.6505 9.0469
## ExterQualTA FullBath
## 12.3300 2.3849
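Having fit the baseline, stepwise, and manually refined models, a quick side-by-side comparison of fit versus complexity can be pulled from the three lm objects with base R. This snippet is illustrative only and assumes model_initial, step_model, and model2 are all still in the workspace:
# Compare the three candidate models on adjusted R², AIC, and residual df
data.frame(
  model    = c("model_initial", "step_model", "model2"),
  adj_R2   = c(summary(model_initial)$adj.r.squared,
               summary(step_model)$adj.r.squared,
               summary(model2)$adj.r.squared),
  AIC      = AIC(model_initial, step_model, model2)$AIC,
  df_resid = c(df.residual(model_initial), df.residual(step_model), df.residual(model2))
)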
While VIF values give numerical indicators of multicollinearity, visual tools help quickly identify clusters of highly correlated predictors. This allows us to detect redundancy in feature space and supports decisions to combine or remove variables.
Correlations above 0.7 or below -0.7 may suggest redundancy. For example, if GrLivArea and TotRmsAbvGrd are highly correlated, only one of them may be needed in the final model.
The heatmap below shows pairwise Pearson correlations between selected numeric variables used in the model:
The diagonal values are all 1.00 (perfect correlation with self).
GrLivArea, GarageCars, and TotalBsmtSF all show moderate correlations with OverallQual (between 0.54 and 0.60).
LotArea and YearBuilt have weak correlations with the other predictors, indicating low redundancy.
No pairwise correlation exceeds 0.7, suggesting that severe multicollinearity is unlikely among these numeric variables.
The bar plot below displays the Variance Inflation Factor (VIF) for each predictor in the refined model. The red dashed line at VIF = 5 serves as a common multicollinearity threshold.
Most predictors have VIF scores under 5, which is generally acceptable.
A few predictors exceed the threshold:
NeighborhoodNAmes (VIF = 14.3)
ExterQualTA (VIF = 12.3)
NeighborhoodOldTown, ExterQualGd, and NeighborhoodCollgCr (VIF ≈ 9)
These high VIFs suggest collinearity between categorical factor levels, particularly in the Neighborhood and ExterQual variables.
These visuals confirm that:
Our numeric predictors are not severely redundant
Some categorical variables warrant simplification or consolidation
The model remains interpretable, but we may consider adjusting for high-VIF features in later stages (e.g., for generalizability or deployment)
# Use base R subsetting to avoid dplyr::select() conflicts
num_vars <- train[, c("OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF", "YearBuilt", "LotArea")]
# Compute correlation matrix
corr_matrix <- cor(num_vars, use = "complete.obs")
# Visualize with corrplot
library(corrplot)
corrplot(corr_matrix, method = "color", type = "upper",
tl.col = "black", tl.cex = 0.9, addCoef.col = "black")
vif_df <- as.data.frame(vif(model2))
vif_df$Variable <- rownames(vif_df)
colnames(vif_df) <- c("VIF", "Variable")
ggplot(vif_df, aes(x = reorder(Variable, VIF), y = VIF)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "VIF Scores by Predictor", x = "Predictor", y = "VIF") +
geom_hline(yintercept = 5, linetype = "dashed", color = "red")
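One way to act on the "simplification or consolidation" point above is to collapse sparse factor levels before refitting. The sketch below uses forcats::fct_lump() to pool infrequent Neighborhood levels into an "Other" category and refits the refined model; the choice of 10 retained levels is illustrative, not taken from the analysis above:
library(forcats)
# Pool all but the 10 most common neighborhoods into "Other" to reduce sparse dummies
train$NeighborhoodLumped <- fct_lump(train$Neighborhood, n = 10, other_level = "Other")
model2_lumped <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF +
                      YearBuilt + LotArea + NeighborhoodLumped + ExterQual + FullBath,
                    data = train)
# Check whether the neighborhood-related VIFs come down after collapsing
vif(model2_lumped)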
After building and optimizing a regression model, it’s essential to validate that the model assumptions are satisfied:
Linearity
Homoscedasticity (constant variance of residuals)
Normality of residuals
Independence of residuals
Absence of influential outliers
Residuals vs Fitted
A roughly horizontal band indicates that the linearity assumption is reasonable, though there may be slight funneling at the right, suggesting mild heteroscedasticity.
Normal Q-Q Plot
Most points lie on the diagonal, but some deviation is visible in the tails. This suggests mild non-normality, which may not be critical for predictive performance but can affect inference and p-value reliability.
Scale-Location Plot
A flat trend suggests constant variance (homoscedasticity). Here, there is a slight upward curve, indicating increasing variance with fitted values, a mild violation worth noting.
Residuals vs Leverage
A few observations have high leverage (e.g., IDs 524, 633, 1299), but none exceed the Cook's distance threshold. These points may merit closer inspection but are not extreme enough to warrant removal.
A Shapiro-Wilk test is used to formally assess the normality of residuals.
Result: W = 0.898, p < 2.2e-16
This result rejects the null hypothesis of normality. However, with large samples (n > 1000), minor deviations are common and usually not practically concerning, especially in predictive models.
Most model assumptions are reasonably satisfied.
Mild signs of heteroscedasticity and non-normality are present but within acceptable bounds for many applied contexts.
No leverage or influence points are extreme enough to compromise model validity.
The model remains robust and trustworthy for prediction, though caution should be used when interpreting individual p-values or making strong inferential claims.
par(mfrow = c(2, 2))
plot(step_model)
shapiro.test(residuals(step_model)) # Normality
##
## Shapiro-Wilk normality test
##
## data: residuals(step_model)
## W = 0.89809, p-value < 2.2e-16
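The Scale-Location plot suggested mild heteroscedasticity; if a formal numerical check is wanted, a Breusch-Pagan test is one option. This test is not part of the original analysis and assumes the lmtest package is installed:
library(lmtest)
# Breusch-Pagan test: H0 = residual variance is constant (homoscedasticity)
bptest(step_model)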
Cross-validation provides a more robust estimate of how the model will generalize to unseen data. Instead of evaluating performance on a single train-test split, we use k-fold cross-validation to average model error across multiple folds, helping reduce the variance of our performance metrics.
10-fold cross-validation is a widely accepted standard in statistical modeling and machine learning (Gao & Liu, 2022; Olanlokun et al., 2024; Wulandari et al., 2023; Yates et al., 2023; Zhang & Li, 2021). It strikes a practical balance between bias and variance (Gao & Liu, 2022; Olanlokun et al., 2024; Yates et al., 2023):
Using too few folds (e.g., 2 or 5) may result in higher variance and unstable error estimates (Olanlokun et al., 2024; Yates et al., 2023; Zhang & Li, 2021).
Using too many folds (e.g., leave-one-out) reduces bias but greatly increases computation and can still result in high variance (Gao & Liu, 2022; Olanlokun et al., 2024; Zhang & Li, 2021).
By partitioning the data into 10 equally sized subsets (folds), the model is trained on 90% of the data and tested on 10%, rotating through all folds (Gao & Liu, 2022; Wulandari et al., 2023). This ensures:
Every observation is used once as a test case (Olanlokun et al., 2024; Wulandari et al., 2023; Yates et al., 2023)
The model is evaluated across diverse subsamples (Wulandari et al., 2023; Yates et al., 2023)
We obtain a more reliable estimate of real-world predictive performance (Gao & Liu, 2022; Olanlokun et al., 2024; Zhang & Li, 2021)
In this case, 10-fold cross-validation is both efficient and statistically sound for a dataset of this size (n = 1,460) (Olanlokun et al., 2024; Yates et al., 2023; Zhang & Li, 2021).
set.seed(42)
ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(formula(step_model), data = train, method = "lm", trControl = ctrl)
cv_model
## Linear Regression
##
## 1460 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1315, 1314, 1315, 1314, 1314, 1314, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1624555 0.8340057 0.1117906
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Now that the model has been validated, we apply it to the cleaned test dataset to generate predicted house prices. The purpose here is to evaluate the distribution and range of the model’s predictions, ensuring they are realistic and aligned with housing market expectations.
To maintain consistency, we apply the same preprocessing steps used on the training data (a short sketch of this mirroring step follows below):
Structural NAs in categorical fields (e.g., GarageType, PoolQC) are replaced with "None"
Numeric fields (e.g., LotFrontage, MasVnrArea, GarageYrBlt) are imputed using the median
MasVnrType is filled with "None", and Electrical with the most frequent category
We generate log-scale predictions using the final model, then reverse the log transformation to report sale prices in dollars.
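The test-set cleaning code is not shown in the original chunk; a minimal sketch of mirroring the training-set treatment might look like the following. It reuses the cat_na_none, num_impute, and categorical_vars objects defined earlier, and pulls medians and the Electrical mode from the training data to avoid leakage:
# Mirror the training-set cleaning on the test set (sketch; assumes objects from earlier chunks)
test[cat_na_none] <- lapply(test[cat_na_none], function(x) replace_na(x, "None"))
for (col in num_impute) {
  test[[col]][is.na(test[[col]])] <- median(train[[col]], na.rm = TRUE)  # training medians
}
test$MasVnrType <- replace_na(test$MasVnrType, "None")
test$Electrical <- replace_na(test$Electrical, names(which.max(table(train$Electrical))))
# Align factor levels with the training data so predict() can build the model matrix;
# levels unseen in training become NA, one possible source of NA predictions
test[categorical_vars] <- lapply(categorical_vars,
                                 function(v) factor(test[[v]], levels = levels(train[[v]])))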
Distribution Shape
The predicted sale prices show a right-skewed distribution, which is expected for housing markets where most homes fall in a mid-range price band and a few high-value properties extend the upper tail. The peak frequency occurs between $140,000 and $180,000, which aligns with typical sale prices in Ames, Iowa at the time the dataset was collected (circa 2006–2010). While these figures may appear low by today’s market standards, they accurately reflect the housing economics captured in the original data (De Cock, 2011).
Summary Statistics
Minimum Predicted Price: $60,306
Median: $158,824
Mean: $176,842
Maximum: $837,193
These values are within a realistic range and are comparable to the training data, indicating the model generalizes well and hasn’t produced outlier inflation or compression.
test_preds <- predict(step_model, newdata = test)
predicted_prices <- exp(test_preds) - 1
# Summary statistics
summary(predicted_prices)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 60306 125141 158824 176842 207852 837193 2
# Histogram of predicted prices
hist(predicted_prices, breaks = 30,
main = "Predicted Sale Prices (Test Set)",
xlab = "Sale Price ($)",
col = "lightblue", border = "white",
xaxt = "n") # remove default axis
axis(1, at = pretty(predicted_prices),
labels = dollar(pretty(predicted_prices)))
Why is this model optimal?
The final model balances statistical rigor, domain relevance, and generalizability. It retains only predictors that offer meaningful, non-redundant explanatory power, supported by both economic theory and empirical validation.
| Metric | Value | Interpretation |
|---|---|---|
| Adjusted R² | 0.847 | Explains ~85% of the variance in log-transformed sale prices |
| F-statistic | 237.6 (p < 2.2e-16) | Strong global model significance |
| Residual degrees of freedom | 1,460 obs - 35 estimated coefficients = 1,425 | Maintains sufficient df for reliable estimates |
| Cross-validated R² | 0.834 | Confirms generalization on unseen data |
| RMSE (CV) | 0.1625 | Low average error on the log scale |
| Residual Standard Error | 0.1565 | Tight residual spread after optimization |
Balance Achieved
A parsimonious yet expressive model. Predictors like OverallQual, GrLivArea, YearBuilt, and Neighborhood add explanatory depth, while cross-validation confirms their reliability. Importantly, degrees of freedom remained high, avoiding overfitting.
Multicollinearity was managed through VIF analysis (most VIFs < 5), and residual diagnostics confirmed model assumptions were not violated. The final model reflects both statistical soundness and practical interpretability.
What made this difficult?
Optimizing a multiple regression model is fundamentally a tradeoff between complexity and parsimony. Early iterations with too many predictors inflated variance and reduced degrees of freedom, despite high R². The most important lessons were:
Degrees of Freedom (df):
As predictors increased, df decreased, which led to larger standard errors and reduced model reliability. Conscientiously limiting the number of predictors without underfitting was therefore required.
F-statistic & p-values:
The F-statistic was used to assess overall significance, but it improved only when added variables were non-redundant. Some predictors (e.g., FullBath) showed high correlation with stronger variables and offered little incremental value.
Stepwise AIC vs. Manual Control:
Stepwise selection via AIC was efficient but sometimes retained variables with marginal significance. Manual refinement using VIF and domain logic was essential to eliminate redundancy and interpret the model transparently.
Cross-Validation as Ground Truth:
Ultimately, 10-fold cross-validation anchored our choices, revealing whether gains in training fit translated to real predictive performance. This was the decisive tool in confirming model robustness.
Key Takeaway: Degrees of freedom, F-statistics, and predictive accuracy are not isolated metrics; they interact. Iteratively testing, removing, and validating variables was required to find the sweet spot where generalizability, interpretability, and performance converge.
De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3). https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627
Gao, Y., & Liu, Y. (2022). Cross validation methods: Analysis based on probabilistic machine learning and information theory. IEEE Access, 10, 92813–92825. https://doi.org/10.1109/ACCESS.2022.3202362
Kaggle. (2016). House Prices - Advanced Regression Techniques [Data set]. https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data
Kalnins, A., & Praitis Hill, K. (2025). The VIF score. What is it good for? Absolutely nothing. Organizational Research Methods. Advance online publication. https://doi.org/10.1177/10944281231216381
Kostic, Z., & Jevremovic, A. (2020). What image features boost housing market predictions? [Preprint]. arXiv. https://arxiv.org/abs/2107.07148
Kyriazos, T., & Poga, M. (2023). Dealing with multicollinearity in factor analysis: The problem, detections, and solutions. Open Journal of Statistics, 13(3), 404–424. https://doi.org/10.4236/ojs.2023.133020
Memon, S. M. Z., Wamala, R., & Kabano, I. H. (2023). A comparison of imputation methods for categorical data. Informatics in Medicine Unlocked, 42, 101382. https://doi.org/10.1016/j.imu.2023.101382
Mohammed, M. B., Zulkafli, H. S., Adam, M. B., Ali, N., & Baba, I. A. (2021). Comparison of five imputation methods in handling missing data in a continuous frequency table. AIP Conference Proceedings, 2355(1), 040006. https://doi.org/10.1063/5.0053286
Neves, F. T., Aparicio, M., & Neto, M. d. C. (2024). The impacts of open data and eXplainable AI on real estate price predictions in smart cities. Applied Sciences, 14(5), Article 2209. https://doi.org/10.3390/app14052209
Olanlokun, O. B., Ogunlana, O. O., Ayinde, S. A., & Akarawak, E. E. (2024). Comparative analysis of cross-validation techniques: LOOCV, k-folds cross-validation and repeated k-folds cross-validation in machine learning models. American Journal of Theoretical and Applied Statistics, 13(5), 97–106. https://doi.org/10.11648/j.ajtas.20241305.13
Root, T. H., Strader, T. J., & Huang, Y.-H. (2023). A review of machine learning approaches for real estate valuation. Journal of the Midwest Association for Information Systems, 2023(2), 9–28. https://doi.org/10.17705/3jmwa.000082
Salmerón, R. (2020). Overcoming the inconsistences of the variance inflation factor: A redefined VIF and a test to detect statistical troubling multicollinearity. arXiv. https://doi.org/10.48550/arXiv.2005.02245
Salmerón-Gómez, R., García-García, C. B., & García-Pérez, J. (2025). A redefined variance inflation factor: Overcoming the limitations of the variance inflation factor. Computational Economics. Advance online publication. https://doi.org/10.1007/s10614-024-10575-8
Wulandari, C. P., Putri, A. S., & Kurniawan, R. (2023). Optimization of random forest hyperparameters based on genetic algorithm using k-fold cross validation in predicting customer churn. Procedia Computer Science, 216, 67–74. https://doi.org/10.1016/j.procs.2022.12.114
Yates, L. A., Aandahl, Z., Richards, S. A., & Brook, B. W. (2023). Cross validation for model selection: A review with examples from ecology. Ecological Monographs, 93(1), Article e1557. https://doi.org/10.1002/ecm.1557
Yoshida, T., & Seya, H. (2021). Spatial prediction of apartment rent using regression-based and machine learning-based approaches with a large dataset [Preprint]. arXiv. https://arxiv.org/abs/2107.12539
Zhang, Y., & Li, J. (2021). A comprehensive survey on cross-validation in machine learning: Methods and applications. Journal of Physics: Conference Series, 1955(1), Article 012013. https://doi.org/10.1088/1742-6596/1955/1/012013