Data Preprocessing

This model is meant to act as a tool for inspectors to prioritize which buildings to investigate. Specifically, we wanted to figure out which locations would be most likely to fail or pass a follow-up inspection after already failing one or more times before. You can read more about our motivation and reasoning for choosing this problem in our post about it on the Azavea Blog.

Datasets

For this model we used data from The City of Philadelphia’s Department of Licenses and Inspections. Philadelphia makes all of it’s L&I data available through the City’s Open Data Site, making it an excellent city to use to train the model. To keep the data manageable, we limited it to inspection outcomes from the past 3+ years (2014-2017). The L&I data is stored in a relational database comprised of several datasets that record inspection outcomes at different levels. These levels include:

locations (every inspection and outcome that occurs at a given address)
cases (a whole cycle of inspections at a given location starting with the initial inspection and ending with compliance)
inspections (each of the individual times that an inspector visited a case)
violations (itemized descriptions of each of the ways that a building doesn’t comply with the code)

This chart shows the breakdown of the total number of each type of record over the study period.

These numbers indicate the count of each type after we had cleaned the data and removed records in which the inspector wasn’t able to access the building.

Data Wrangling

Note: Many of these code snippets call helper functions (e.g. filterData()) that I wrote for this project. You can find these functions here.

# packages
packages(c("tidyverse", "data.table", "sp", "rgdal", "lubridate", "plyr",
  "forcats", "caret", "spatstat", "RCurl", "jsonlite"))

# load spatial datasets
load("data/lni_data.Rdata")

# get coordinates as attributes
insp <- getCoords(insp)
viol <- getCoords(viol)

# join violations and inspections dataset
full_dat <- joinSPdfs(viol, insp, "casenumber")

# clean full data frame:
#   rename fields, creteUID, add a violation year field   
full_dat <- cleanData(full_dat)

# filter the data frame
#   remove some NA fields, closed inspections
filtered_dat <- filterData(full_dat)

Compiling these data and summarizing them to the inspection-level required some significant data wrangling. We first joined the violation and inspection datasets on a unique ID. After that we needed to create variables that measure the relationships among each inspection and the other inspections in it’s L&I case. Though the dataset does come with a unique case identifier, it describes each inspection independently of other inspections in that case. Therefore, we needed to develop a process for iterating through the entire dataset, sequencing the inspection events in each case, measuring the time between events, and extrapolating them back to the main dataset. This data pre-processing step may seem tedious but it was crucial to our modeling process.

measureFails <- function(dat, row){
  # Given a row, loop through the dataset measuring the outcomes before
  # and after this point
  #
  # Args:
  #   dat: data.table object containing the measurment row
  #   row: the index of the row
  #
  # Returns:
  #   a data.table of pre- and post- inspection outcomes
  #

  dat <- data.table(dat)

  r.t <- dat[row, c('uid', 'casenumber', 'apfailkey',
                    'inspCompl', 'inspStatus', 'violDte')]

  subdat <- dat[casenumber == r.t$casenumber & apfailkey == r.t$apfailkey,
                c('uid', 'casenumber', 'apfailkey',
                  'inspCompl', 'inspStatus', 'violDte')]

  fails <- data.table(uid = r.t$uid, before = 0, after = 0,
                      rec.inspCompl = 'none', rec.inspStatus = 'none',
                      rec.uid = "none")

  for (r in 1:nrow(subdat)) {

    r.m <- subdat[r, ]

    # if the test insp date is after the reference insp. date
    if (!is.na(as.Date(r.m$inspCompl)) && !is.na(as.Date(r.t$inspCompl))) {

      if (r.t$inspCompl > r.m$inspCompl) {

        if (class(fails$rec.inspCompl) != "Date") {

          # update the "recent" variables
          fails$rec.uid <- r.m$uid
          fails$rec.inspStatus <- r.m$inspStatus
          fails$rec.inspCompl <- r.m$inspCompl

          if (r.m$inspStatus == "Failed") {

            fails$before <- fails$before + 1

          }
        } else if (fails$rec.inspCompl < r.m$inspCompl) {

          # update the "recent" variables
          fails$rec.uid <- r.m$uid
          fails$rec.inspStatus <- r.m$inspStatus
          fails$rec.inspCompl <- r.m$inspCompl

          if (r.m$inspStatus == "Failed") {
            fails$before <- fails$before + 1
          }
        }
      } else if (r.t$inspCompl < r.m$inspCompl && r.m$inspStatus == "Failed") {

        fails$after <- fails$after + 1

      }
    }
  }
  return(fails)
}

measureAllFails <- function(dat) {
  # Apply the measurefails function over every row, combining infot a
  # data frame and binding to the original df
  #
  # Args:
  #   dat: the data frame to measure
  #
  # Returns:
  #   a new data frame with recent inspection data merged to it
  #
  # Side effects:
  #    Prints a progress text bar
  #    saves an Rdata file of the resulting dataet to the outpur folder
  #

  # loop over whole dataset
  rec <- ldply(c(1:nrow(dat)), measureFails, dat = dat, .progress = "text")

  # merge recents with original df
  df_rec <- merge(dat, rec, by = "uid")

  # save the dataset to output file
  saveTimeStampedFile(df_rec, file = "lni_15to17_withRecent", time = FALSE)

  return(df_rec)
}

# call functions on entire dataset
filtered_dat <- measureAllFails(filtered_dat)

This data pre-processing step may seem tedious but the resulting variables would prove to be some our most important predictors.

We continued to filter our dataset down to a more manageable size by removing the each row that corresponded to the initial inspection in each case. Predicting outcomes for these inspections was outside of the scope of this model.

# remove the initial inspection instances
df <- filtered_dat %>%
  filter(rec.inspStatus != "none") %>%
  filter(as.Date(violDte) < as.Date(inspCompl)) %>%
  data.table() %>%
  .[, c("organizati", "unit", "apdesc", "aptype") := NULL] %>%
  na.omit()

Feature Engineering

Feature Engineering proved to be an important aspect of this model. In some cases we used domain knowledge and intuition to manually create new attributes that we believed would be predictive of inspection outcome. We used the dates that we obtained in the previous section to create new “duration” variables that told us about the gaps of time between the date of the inspection and previous events in that case:

# create additional variables
df$zipShort <- substr(df$zip, 1, 5) %>% as.factor()
df$failed <- ifelse(df$inspStatus == "Failed", 1, 0)
df$before <- as.factor(df$before)

# add 3 variables indicating durations from previous events in lifecycle of
#   building case
df <- durationVars(df)

The following plot gives us a good indication of the value of the resulting variable.

This plot hints at the value of the resulting variables for predicting follow-up inspection outcomes. It shows the relative distributions of two different time-based dependent variables for inspections that resulted in a failure and for those that didn’t. The areas in which the density curves diverge indicate where we are able to get predictive power from. For example cases that had gone fewer that 25 or more than ~65 days since an inspection were more likely to end passing their inspections.

At this point we’re working with a dataset at the violation level but we ultimately needed to conducted our analysis at the building inspection level. This means that we needed a dataset in which each row in our describes an individual inspection of one specific building. Our model tries to predict the outcome of each of these inspections. Predictor variables provide information about different levels of the dataset. For example, address-level variables described the location and ownership of the building. By contrast, violation-level variables tell us the total number of violations in a given case and the descriptions of each. Some of the most powerful predictors came from the case level. These variables measured the numbers of previous failures in a case or the duration between individual inspections. Before condensing the dataset to the inspection level, we needed to make sure we saved the information about which individual violations were found upon each inspection:

# extract 'violation type' variable and remove levels occurring < 0.02%
#   of the time
df_violTpe  <- df %>%
  select(apinspkey, violDesc) %>%
  mutate(violTpe = fct_lump(violDesc, p = 0.02)) %>%
  dplyr::rename(id.apinspkey = apinspkey)

# generate a lookup table to get the names of violation types (for later
#   use)
violType_lookup <- getFactorLevelLookup(df_violTpe$violTpe, "violTpe")

# encode violation-type categorical feature as a series of binary
#   numeric dummy variables
df_violTpe <- castAndSumm(df_violTpe, "id.apinspkey", "violTpe", binary = FALSE)

Then group by a unique inspection identifier and summarize our variables:

# collapse dataset to the inspection level, summarizing
ds <- summariseInspections(df)

We also created additional variables that would account for spatial autocorrelation among the independent variable. These variables measured the average distances to the n nearest instances of repeat or non-repeat failure buildings. However, we did some independent tests that did not find any spatial autocorrelation among the dependent variable. When we finally built models we found that these distance variables were only marginally important predictors of inspection outcomes.

External Data Sources

Up until this point, all of our variables we created were either included in one of the two Licenses and Inspections datasets or we derived them from out-of-the-box L&I features. We also tried to extract additional signal from external city datasets.

Building inspection failure is often linked to vacancy so we tried gathering information from the City’s Vacant Property Indicators dataset. We created two different dummy variables that indicated whether or not we found the building’s address and/or owner in the vacancy database.

# read dataset
vpi <- read.csv("http://data.phl.opendata.arcgis.com/datasets/f7ed68293c5\
                e40d58f1de9c8435c3e84_0.csv", stringsAsFactors = TRUE)

# look for address matches
ds$f.vpi.isBldgAddkey <- ifelse(
  ds$l.addresskey %in% vpi$LNIADDRESSKEY, 1, 0
  ) %>% as.factor()

# look for owner name matches
ds$f.vpi.isOwner <- ifelse(
  ds$l.owner %in% vpi$OWNER1, 1, ifelse(ds$l.owner %in% vpi$OWNER2, 1, 0)
  ) %>%
  as.factor()

The other external data that we included were Property Tax Balances from the Philadelphia Department of Revenue. Much like vacancy, tax delinquency is often associated with building inspection so we believed that this would be a useful dataset to include. We pulled in data from the tax balance api and created new variables by joining to our main dataset:

# encode and format api call
url <- URLencode("https://data.phila.gov/carto/api/v2/sql?q=SELECT owner,\
                 total FROM real_estate_tax_balances WHERE total > 0 AND \
                 tax_period > 2015 AND owner != ''")
url <- gsub("22", "27", url)

# get tax delinquency data
delinquent <- getURL(url)

# parse delinquency data from JSON
del <- fromJSON(delinquent)$rows %>%
  group_by(owner) %>%
  dplyr::summarise(s.ownerTaxBalance = mean(total)) %>%
  dplyr::rename(l.owner = owner)

# extract a dataset of tax balance data for each property
ds.temp <- ds %>%
  mutate(l.owner = as.character(l.owner)) %>%
  left_join(del) %>%
  select(s.ownerTaxBalance) %>%
  mutate(
    s.ownerTaxBalance = ifelse(
      is.na(s.ownerTaxBalance), 0, s.ownerTaxBalance)) %>%
  mutate(f.ownerDelinquent = as.factor(ifelse(s.ownerTaxBalance > 5, 1, 0)))

# and bind it to the master dataset
ds <- cbind(ds, ds.temp)

Dealing with categorical variables

Some of the machine learning models that we used in the next section required that all of the predictor variables be numeric. Of course, some of our variables were categorical (e.g. the type of inspection or the binary distinction of whether or not the building was tax delinquent). We got around this issue by using one-hot-encoding. Chris Albon helps illustrate this concept in one of his terrific machine learning flashcards:

One-Hot Encoding #machinelearningflashcards pic.twitter.com/n3gUivTxLc
— Chris Albon (@chrisalbon) August 10, 2017

For each category in each categorical variable, we created a new binary numeric attribute indicating whether it applied to each row:

# get list of factor variable names
names <- ds %>%
  select(starts_with("f.")) %>%
  names %>%
  as.list

# encode each of these as  a series of binary dummy variables
factor_variable_list <- llply(names, castFactorVars, dat = ds,
                              .progress = "text")

# join all categorical data variable dfs together
ds <- plyr::join_all(factor_variable_list, by = "id.apinspkey",
                     type = "left", match = "all") %>%
  left_join(df_violTpe, "id.apinspkey") %>%
  select(-id.apinspkey) %>%
  cbind(ds,. )

# generate two different response variables, factor and numeric
#   for different model types
ds <- ds %>%
  dplyr::rename(o.failed.n = o.failed) %>%
  mutate(o.failed.f = as.factor(o.failed.n)) %>%
  select(o.failed.n, o.failed.f, 2:ncol(.))

# create a vector of variable names that will be used in the models
# and one that won't
mod_vars <- ds %>% select(
  -o.numFails, -starts_with("i."),
  -starts_with("l."), -starts_with("f.")) %>%
  names
non_mod_vars <- setdiff(names(ds), mod_vars)

This process dramatically increased the number of features in our dataset but it gave us the option to include all of our variables into any model. In the next section we will walk through our process for separating predictively important features from non-important ones.