We have created a large dataset with 89 attributes. Not all of these will be useful predictors. In this section we narrow down the dataset by selecting only the features that are useful for predicting the inspection outcome. Some of the model types we’ll experiment with have built-in feature selection while others don’t. Either way, removing variables that are redundant or simply random noise will lead to better results. There are several methods for variable selection, but here we use a feature selection algorithm called “Boruta” from the R package of the same name.

Boruta

The Boruta algorithm relies on variable importance values that it repeatedly calculates by running Random Forest models on a predefined set of candidate predictors. For each attribute it creates a random noise variable by shuffling that attribute’s values across rows. After training a Random Forest model, it compares each real feature’s importance against the distribution of importances for the corresponding noise variables, using t-tests to determine whether the real predictor is significantly more important than the noise. If so, it is confirmed as important; if not, it is rejected. The algorithm runs until it has classified all of the variables or it hits a user-specified limit on the number of iterations. This process is expensive and may take hours to run, but it neatly separates your variables into useful and non-useful subsets.
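
As a quick conceptual sketch (not the package’s internal code), a noise variable is simply a copy of a predictor with its rows randomly shuffled, which destroys any real relationship with the outcome:

# conceptual sketch only: a "shadow" copy of a predictor is the same values
# in a random row order, so any genuine signal is destroyed
x <- c(5, 3, 8, 1, 9, 2)
shadow_x <- sample(x)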

# load required packages
library(tidyverse)
library(caret)
library(Boruta)

# extract just the predictor variable candidates
dd <- ds %>%
  select(one_of(mod_vars))

# partition dataset for boruta
set.seed(12345)
inTrain <- createDataPartition(dd$o.failed.n, p = 0.6, list = FALSE)
boruta.train.set <- dd[inTrain, -c(1,2)]

# run iterative random forest variable importance test
boruta.train <- Boruta(o.failed ~ ., data = boruta.train.set, doTrace = 2)
# get the final decision for each variable and list everything not confirmed
# (both rejected and still-tentative variables are dropped)
fd <- boruta.train$finalDecision
rejected_vars <- fd[which(fd != "Confirmed")] %>% names

The algorithm flags the variables whose importance is no better than that of their randomized noise counterparts. We can look at these rejected variables:

# print the list of non-useful variables
print(rejected_vars)
##  [1] "e.avgDists.repOfAll.n10" "bere.4"                 
##  [3] "bere.5"                  "bere.6"                 
##  [5] "bere.7"                  "priorityde.1"           
##  [7] "recStatus.3"             "numFailTypes.4"         
##  [9] "numFailTypes.5"          "numViol.3"              
## [11] "numViol.4"               "numViol.5"              
## [13] "prevFail.5"              "prevFail.6"

and then the variables that were accepted:

##  [1] "o.failed.n"             "s.sinceLastInsp"       
##  [3] "s.sinceViolation"       "s.sinceAdded"          
##  [5] "e.avgDists.all.n5"      "e.avgDists.nr.n5"      
##  [7] "e.avgDists.rep.n5"      "e.avgDists.repOfAll.n5"
##  [9] "e.avgDists.all.n10"     "e.avgDists.nr.n10"     
## [11] "e.avgDists.rep.n10"     "s.ownerTaxBalance"     
## [13] "bere.1"                 "bere.2"                
## [15] "bere.3"                 "priorityde.2"          
## [17] "priorityde.3"           "priorityde.4"          
## [19] "priorityde.5"           "priorityde.6"          
## [21] "recStatus.1"            "recStatus.2"           
## [23] "inspDesc.1"             "inspDesc.2"            
## [25] "inspDesc.3"             "inspDesc.4"            
## [27] "inspDesc.5"             "inspDesc.6"            
## [29] "inspDesc.7"             "vpi.isBldgAddkey.1"    
## [31] "vpi.isBldgAddkey.2"     "vpi.isOwner.1"         
## [33] "vpi.isOwner.2"          "ownerDelinquent.1"     
## [35] "ownerDelinquent.2"      "numFailTypes.1"        
## [37] "numFailTypes.2"         "numFailTypes.3"        
## [39] "numFailTypes.6"         "numViol.1"             
## [41] "numViol.2"              "numViol.6"             
## [43] "prevFail.1"             "prevFail.2"            
## [45] "prevFail.3"             "prevFail.4"            
## [47] "prevFail.7"             "violTpe.1"             
## [49] "violTpe.2"              "violTpe.3"             
## [51] "violTpe.4"              "violTpe.5"             
## [53] "violTpe.6"              "violTpe.7"             
## [55] "violTpe.8"

Now we have a list of variables that we want to include as predictors when building the model.
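
One way to apply this result (a sketch using the object names from above) is simply to drop the rejected columns before modeling:

# drop the variables Boruta could not confirm as important (sketch; assumes
# dd still holds all candidate predictors and rejected_vars comes from above)
dd_selected <- dd %>%
  select(-one_of(rejected_vars))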

Variable Importance

We are jumping ahead a little bit here, but let’s look at the variable importance from the Random Forest and Gradient Boosting Machine models we will create in the next section.
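
As a sketch of how that comparison can be made, assuming the two models in the next section are fit with caret::train and stored in objects named rf_fit and gbm_fit (hypothetical names):

# pull and plot variable importance from the two caret model fits
# (rf_fit and gbm_fit are placeholder names for the models built later)
rf_imp  <- varImp(rf_fit)
gbm_imp <- varImp(gbm_fit)
plot(rf_imp,  top = 20)
plot(gbm_imp, top = 20)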

Interestingly, the importance of some variables varied slightly between the RF and GBM models. For the most part, however, the most powerful predictors carried over from model to model. The categorical inspection description variables, as well as our time series predictors, proved to be the most important.

In the next section we experiment with different machine learning algorithms and select a final model.