Abstract:
As the number of government algorithms grows, so does the need to evaluate algorithmic fairness. This paper has three goals. First, we ground the notion of algorithmic fairness in the context of disparate impact, arguing that for an algorithm to be fair, its predictions must generalize across different protected groups. Next, two algorithmic use cases are presented with code examples for how to evaluate fairness. Finally, we promote the concept of an open source repository of government algorithmic “scorecards,” allowing stakeholders to compare across algorithms and use cases.


Our sincerest appreciation to Dr. Dyann Daley and Predict Align Prevent for their generous support of this work. We are also extremely grateful to Matt Harris for ensuring our code is, in fact, replicable. Finally, we are immensely appreciative of the time and expertise of several reviewers including Drs. Dennis Culhane, Dyann Daley, John Landis, and Tony Smith.

1. Introduction

Algorithms are increasingly making decisions in place of humans. Algorithms affect the products we buy, like insurance, credit cards, and bank loans, and the information we are exposed to, like shopping recommendations and news articles. Between bestsellers like Cathy O’Neil’s “Weapons of Math Destruction” and relentless news coverage of tech company data mining, society is beginning to understand that algorithms can bring as much peril as they do promise.

Government is still wrestling with how to regulate private sector algorithms - which are closed, their inner-workings cast as intellectual property. In the public sector, where transparency is expected, the calculus is different. Governments today use algorithms to dole out subsidies, determine program eligibility, and prioritize the allocation of limited taxpayer-funded resources. While fairness should be at the heart of government algorithms, it’s still unclear how best to referee public algorithms. The fact is that if human decisions are biased, implicit or otherwise, then the algorithms we train from those decisions will also be biased.

Recently, new open source tools have emerged to help governments evaluate algorithms for fairness. Examples include the Ethics & Algorithms Toolkit, created by a consortium of authors in government and academia, as well as the University of Chicago’s Aequitas “open source bias toolkit.” The Ethics & Algorithms Toolkit is excellent for developing governance and policy around algorithms, while Aequitas is aimed at more experienced data scientists. Both are invaluable tools, and stakeholders would be well advised to integrate them into their technology workflow.

The main goal of this paper is to bridge the gap between these two stakeholder groups by providing code examples that introduce the novice public-sector data scientist to algorithmic fairness. Our second goal is to present an open source standard by which governments can compare their algorithms to those of their peers. Our motivation is informed by the following observations:

  1. All jurisdictions at the local, state, and Federal levels collect comparable administrative data to determine program eligibility as well as to develop budgets and strategic plans.

  2. All jurisdictions share a comparable set of algorithmic use cases, each of which can be realized by leveraging these administrative datasets.

  3. Thus, in the future, all jurisdictions will be developing comparable algorithms that predict comparable outcomes of interest towards the fulfillment of these shared use cases.

In response to public pressure, each jurisdiction will need to evaluate the fairness of their algorithms. One solution is a standard, open source “scorecard,” which we call the Open Algorithmic Scorecard (OAS). Each jurisdiction would have a separate scorecard for every algorithm they deploy, featuring a set of simple metrics describing accuracy and bias. These scorecards would live in an open repository, allowing one to compare prototype model results to models created elsewhere; promote transparency by filing finished scorecards; and provide an arena where policy-makers, data scientists, academics, civic technologists, and other stakeholders could observe best practices.

This report is a proof-of-concept, demonstrating two use cases. The first use case is a place-based machine learning algorithm to predict home prices, which many jurisdictions now use for tax assessment purposes. The second use case is a person-level machine learning model for predicting prisoner recidivism. Along the way, two sets of analytics are presented (one for place, one for people) that describe model accuracy and model “generalizability.” English narrative, replicable code, and data visualizations are provided for each. We hope novice and aspiring government data scientists will sharpen their skills by replicating the code found in the appendix below.

The term “algorithm” can pertain to a wide array of decision-making tools. This report focuses exclusively on supervised machine learning algorithms, which we define as a class of machine learning models that learn from a set of observed experiences to predict an experience that has yet to be observed.

Section 2 provides some background on the accuracy/generalizability metrics used to assess fairness in this report. Section 3 and Section 4 present the real estate tax assessment and recidivism models, respectively, as well as the scorecards for each. Section 5 concludes.

2. The Relationship Between Accuracy and Generalizability

2.1 The costs of getting it wrong

Governments use machine learning algorithms to allocate limited taxpayer-funded resources, and a biased model may mean that these resources are misallocated. Resources may be wasted on a population that does not need them or allocated in a way that ultimately proves harmful. A biased algorithm may leave policy makers wondering whether a data-driven approach is any more useful than existing institutional knowledge.

For some use cases, bias may have real social costs. A biased tax assessment model may systematically under- or over-assess the value of certain homes. In the former case, City tax coffers lose out on revenue, while gentrifiers freeride on new amenities and services. In the latter case, excessive tax burden in poor communities may lead to greater housing instability and inequality.

Biased algorithms may have more dire consequences for people-oriented use cases, like recidivism. As we discuss below, one example of bias is higher false positive rates for African American ex-offenders compared to Whites. A false positive in this context means that the model predicted an ex-offender would recidivate when in fact they did not. When false positives are disproportionately predicted for a protected class, decisions made from that algorithm may come with significant social costs.

2.2 The spectre of disparate impact

Social scientists are well-versed in the issues of fairness and discrimination. Identifying discrimination in an algorithm, however, is just as nuanced and complicated as it is in housing and labor markets. It is unlikely that a jurisdiction would create an algorithm with the express intent of discriminating against a protected class. Given the black box nature of these models, it is far more likely that they would create an algorithm whose decisions have a “disparate impact” on members of a protected class. Disparate impact is a legal theory positing that although a policy or program may not be discriminatory prima facie, it still may have an adverse discriminatory effect, even if unintended.

Disparate impact may play a role when machine learning algorithms are poorly fit. There are two general conditions that lead to poor predictions. An “underfit” model does not exhibit a high degree of predictive accuracy, likely because not enough effective predictive variables or “features” are included. An “overfit” model, traditionally, is one that may predict well on training data but fails when used to predict for new observations. Models may also be overfit to one type of experience, predicting differently from one group to the next.

If an algorithm does not generalize to one protected class, its use for resource allocation may have a disparate impact on that group. For example, the recidivism algorithm created below predicts a higher rate of false positives for African Americans relative to Caucasians. This may occur because the algorithm is underfit to the African American “experience.” It may also be that the training data itself is biased, a common critique of prediction in the criminal justice domain. Critics have argued that systematic over-policing of historically disenfranchised communities creates a feedback loop where more reported crime leads to more predicted crime, which leads to more cops on patrol and thus, more reported crimes. It could be that police bias leads to the over-policing of certain communities. It could also be that people with higher propensity to commit crimes sort into these communities. In reality, both likely play a role, but if the relationship goes unobserved, then like any statistical model, systematic error will lead to bias.

It is impossible to identify the effect of unobserved variables. As an alternative, researchers are actively developing a series of fairness metrics. If bias cannot be judged by the input features, perhaps it can be judged by opening the black box and looking for bias in the predictions. We find this review of fairness metrics to be particularly relevant for policy-makers. In the case studies below, the fairness criteria we present hinge on an algorithm’s ability to generalize across different group typologies - like rich and poor neighborhoods or Caucasian and African American ex-offenders.

3. Home Price Prediction and Real Estate Tax Assessment

3.1 Tax assessment and the hedonic model

It is difficult to participate in the real estate market and not interact with machine learning models. Airbnb’s algorithms recommend rental prices to its hosts. Trulia’s computer vision algorithms convert house photos to home features. Perhaps the most ubiquitous real estate algorithm is Zillow’s Zestimate, which predicts the current market value of a property.

The Zestimate algorithm is very similar to the methodology counties use to assess home values and calculate property tax liability. These methods are rooted in the “hedonic model” - an econometric approach for deconstructing the market price of a good to the value of each constituent part. The hedonic model can estimate the “capitalization effect” or price premium associated with an extra bedroom or the presence of a garage. It can also be used, as is the case with tax assessment, for prediction. Typically, these algorithms are trained on recent transactions, then used to predict value for all houses citywide. The hedonic model relies on several different feature types, each explained below.

The first is parcel/internal characteristics like the size of the lot and the number of bedrooms. Next is neighborhood characteristics including exposure to crime or access to transit. A third is the “spatial component” which hypothesizes that house price is a function, in part, of neighboring prices. These features take their motivation from real estate appraisers who compare similar homes in close proximity (i.e. “comparables”). Properly controlling for comparables requires features that capture the unique spatial scale of prices in a neighborhood. For over two decades, Urban Spatial has developed a host of specialized approaches to account for this spatial component. Interested readers can refer to our work here, although for simplicity, we omit these more complicated features from our model below.
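To make the idea concrete, the hedonic model can be sketched as a regression of price on a handful of these feature types. The snippet below is illustrative only; it assumes a sales data frame whose columns follow the variable table in Section 3.3 and is not the model estimated in Section 3.4.

hedonic <- lm(log(price) ~ total_livable_area + numBed + numBath + year_built +
                assaultsDist + parkDist + nhood,
              data = sales)

# Each coefficient approximates the capitalization effect of its feature,
# holding the other features constant.
summary(hedonic)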

These algorithms are not without inherent biases. There are some key assumptions including 1) that all buyers and sellers have access to the same market information; 2) that neighborhood crime can be measured with the same accuracy as say, the number of bedrooms in a house; and 3) that buyers exhibit homogeneous preferences for amenities, like schools. A final source of bias is an assumption that neighborhoods are in “equilibrium”. This is almost never the case, particularly in gentrifying communities. In these communities, buyers and sellers capitalize future expectations into prices (i.e., “what will this house be worth if a new subway station is opened nearby?”). Simply put, buyers and sellers will disagree on the future value of gentrified housing in a neighborhood, which may make it difficult to predict variation in prices at the neighborhood level.

3.2 Accuracy and generalizability in real estate assessment algorithms

Accuracy is simply the difference between the observed price of the home and the predicted price. This difference is often referred to as “error.” Generalizability is a bit more complex. The general approach for assessing bias in these algorithms is to investigate how errors cluster in space. The steps are: 1) train the model; 2) use the trained model to predict for out-of-sample sales; then 3) calculate and map errors. For an assessment algorithm to generalize well, it must exhibit comparable error rates across different neighborhoods and neighborhood contexts. If the model predicts better for rich versus poor neighborhoods or White versus African American neighborhoods then the model may be biased. The spatial arrangement of errors is explored further below.
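A minimal sketch of step 3, with the tidyverse loaded: assume a testSet data frame holding observed prices, model predictions (predictedPrice), and a neighborhood-context flag such as majorityWhite (these names are illustrative, not part of the appendix code).

testSet %>%
  mutate(absError = abs(predictedPrice - price),       # dollar error
         absPctError = absError / price) %>%           # percent error
  group_by(majorityWhite) %>%
  summarize(MAE  = mean(absError),
            MAPE = mean(absPctError))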

3.3 Data and exploratory analysis

Our data come from the Philadelphia Office of Property Assessment (OPA). Those interested in replicating the analysis can download the data here or access assessment data directly on OpenDataPhilly. The dataset comprises market transactions of single-family homes sold from July 2017 to July 2018. Sales less than $3,000 and greater than $1,000,000 are removed, as well as observations with missing data. The final dataset includes 21,964 transactions with a mean and standard deviation sale price of $185,950 and $162,873, respectively. The table below provides a description of the variables developed for the model.

Variable Description
nhood What neighborhood the home is in
Season The season the home was sold in
year_built The year the home was built. Split into six categories: prior to 1970, 1970 to 1980, 1980 to 1990, 1990 to 2000, 2000 to 2010, and 2010 or later
own_occ A dummy variable indicating if the home is owner occupied (1) or not (0)
homestead A dummy variable indicating whether the property is participating in the homestead exemption (1) or not (0)
AC A dummy variable indicating whether the property has AC (1) or not (0)
numBed The number of beds in the home
numBath The number of bathrooms in the home
numStories The number of stories of the property
garage_spaces The number of garage spaces the property has
fireplaces The number of fireplaces the property has
total_livable_area The livable area of the home
univDist The nearest neighbor distance between each property and its closest university
parkDist The nearest neighbor distance between each property and its closest park (centroid)
ccDist The nearest neighbor distance between each property and Center City (City Hall)
septaDist The nearest neighbor distance between each property and its closest MFL/BSL stop
permitDist The average nearest neighbor distance between the property and its five closest new construction permits within the last two years
assaultsDist The average nearest neighbor distance between the property and the location of the five closest aggravated assaults
cleanDist The average nearest neighbor distance between the property and its two closest dry cleaners
graffitiDist The average nearest neighbor distance between the property and the location of the five closest 311 calls for graffiti
abandVDist The average nearest neighbor distance between the property and the location of the five closest 311 calls for abandoned vehicles

Philadelphia is missing a surprising amount of data on the internal characteristics of homes. 17% of transactions in our data list zero bedrooms. Perhaps OPA imputes missing values, but for this example, we use a fixed effect to denote when number of bedrooms equals zero. All told, we employ eight house/parcel specific features in the model below.

Figure 3.1 visualizes the mean and standard deviation (as a percentage of the mean) of single-family home prices by neighborhood in Philadelphia. Not surprisingly, high and low price neighborhoods cluster. Perhaps a bit more surprising is that low-priced neighborhoods also exhibit relatively high price variance. This may relate to the gentrification disequilibrium described above.

Not only do prices vary by neighborhood, but they vary by neighborhood type as well. We split neighborhoods into “high” and “low” designations across three different typologies and visualize differences in Figure 3.2. The first is Qualified Census Tracts (QCT), a poverty designation HUD uses to allocate housing tax credits. QCT designations provide a deliberate and policy-relevant threshold for judging generalizability. Neighborhoods that qualify for tax credits exhibit mean single-family home prices that are nearly half that of those that do not. Next, we ask whether the algorithm generalizes to gentrifying neighborhoods. Tracts are designated as “gentrifying” and “non-gentrifying” using metrics from the Federal Reserve Bank of Philadelphia. Mean prices differ across gentrifying and non-gentrifying neighborhoods. A third typology is race-related. To determine whether the model generalizes with respect to race, the city is grouped into “majority White” and “majority non-White” census tracts. Mean prices are clearly higher in the former group. A well-generalized algorithm should exhibit comparable error rates across each group.

3.4 Modeling

Figure 3.3 visualizes the neighborhood amenity features developed for the model. The goal is to quantify each home sale’s “exposure” to amenities and disamenities citywide. To quantify exposure to aggravated assaults, for example, we measure the distance from each home sale to its k nearest assaults and take the mean.
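This exposure measure reduces to a single line with the FNN package (the appendix wraps the same logic in a helper called nn_function). Here saleXY and assaultXY are assumed coordinate matrices for home sales and aggravated assaults.

# Mean distance from each sale to its k = 5 nearest aggravated assaults
assaultExposure <- rowMeans(FNN::get.knnx(assaultXY, saleXY, k = 5)$nn.dist)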

Normally a host of features are employed to model the spatial structure, but in this simple example, we include just one set of features - a fixed effect for each neighborhood. Our hypothesis is that explicitly accounting for neighborhood variation helps to control for local comparables as well as any equilibrium effects.

For demonstration purposes, the model is simpler than it would be in reality. 10-fold cross validation is performed on a 60% training set to tune the hyperparameters of a Random Forest algorithm. All goodness of fit metrics are reported either from cross-validation or from the 40% test set. Figure 3.4 shows the feature importance associated with the final model.
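A hedged sketch of that tuning step using caret and ranger, assuming a training sf object that contains only the sale price and the features in the variable table above (the appendix walks through the full data preparation):

library(caret)

fitControl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

set.seed(825)
rf.fit <- train(price ~ .,
                data = st_drop_geometry(training),
                method = "ranger",
                trControl = fitControl,
                importance = "impurity")

rf.fit$bestTune     # hyperparameters selected by cross-validation
varImp(rf.fit)      # feature importance, as in Figure 3.4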

3.5 Accuracy and generalizability

Accuracy and generalizability are assessed in a variety of ways. First, Figure 3.5 visualizes a scatterplot of observed prices as a function of predicted prices. The pink line represents a hypothetical perfect prediction. The plot suggests that the model fits reasonably well, with reduced accuracy for higher prices. Figure 3.6 echoes this finding by plotting Mean Absolute Percent Error by decile.

Next, the model is used to predict for the withheld 40% test set. Mean Absolute Error (MAE) is the mean absolute value difference between observed and predicted sale prices for the test set. The MAE is $44,058. For context, the average single-family home price in our sample is $186,961. The Root Mean Square Error (RMSE) is similar to the MAE, but errors are squared, averaged, and square rooted. The RMSE, which penalizes larger errors, is $72,582. The Mean Absolute Percent Error (MAPE) is the mean absolute value difference between observed and predicted sale prices on a percentage basis. The MAPE is 49.7%. 100-fold cross-validation without hyperparameter tuning is also performed. This test provides some intuition about how the model would predict for data it has yet to see. The mean MAE across all holdouts is $43,587 and the standard deviation is $1,232, suggesting a model that would generalize well to new data.
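The three metrics reduce to a few lines of arithmetic on the test set. Here testing and rf.fit are assumed names for the 40% hold-out and the tuned model from the sketch above.

preds <- predict(rf.fit, newdata = st_drop_geometry(testing))

MAE  <- mean(abs(preds - testing$price))                  # mean absolute error
RMSE <- sqrt(mean((preds - testing$price)^2))             # root mean square error
MAPE <- mean(abs(preds - testing$price) / testing$price)  # mean absolute percent error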

Figure 3.7 maps the absolute error for test set sales both on a dollar and percentage basis. Clear differences can be observed. The arrangement of errors provides additional intuition. Ideally, the algorithm would account for enough variation in price such that the remaining variation (the error) were randomly distributed across the city. Figure 3.7 clearly illustrates that this is not the case. Different communities exhibit different levels of error. A Global Moran’s I test confirms that the spatial configuration of errors exhibits statistically significant clustering (p-value < 0.001).
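A sketch of that test with spdep, using a k-nearest neighbor spatial weights matrix (k = 5 is an illustrative choice) and the test-set predictions computed above:

library(spdep)

testing$error <- preds - testing$price                       # test-set errors from above
coords  <- st_coordinates(testing)                           # sale locations
weights <- nb2listw(knn2nb(knearneigh(coords, k = 5)), style = "W")

moran.test(testing$error, weights)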

Correlated errors suggest that spatial bias exists in the model. This is further explored in Figure 3.8 which is generated from “Leave One Group Out Cross Validation” (LOGOCV). Instead of holding out and testing for a random subset, LOGOCV reveals how well the model generalizes to a given neighborhood by training on all but one neighborhood and validating on the hold out. Each neighborhood takes a turn acting as the hold out. The map visualizes average error rates by hold out neighborhood and reaffirms the fact that the model works better in some parts of Philadelphia than others.
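LOGOCV is a short loop: each neighborhood takes a turn as the hold-out while the model is re-fit on the rest. The sketch below, with the tidyverse loaded, substitutes a plain lm on a reduced, numeric-only feature set for the tuned random forest and assumes a sales data frame with price, nhood, and the distance features from the variable table.

logocv <- purrr::map_dfr(unique(sales$nhood), function(hood) {
  trainSet <- filter(sales, nhood != hood)
  holdOut  <- filter(sales, nhood == hood)
  # The neighborhood fixed effect is necessarily excluded from the formula,
  # since the hold-out neighborhood is unseen during training
  fit <- lm(price ~ total_livable_area + univDist + parkDist + ccDist + septaDist +
              permitDist + assaultsDist + graffitiDist + abandVDist, data = trainSet)
  data.frame(nhood = hood,
             MAPE  = mean(abs(predict(fit, holdOut) - holdOut$price) / holdOut$price))
})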

Finally, Figure 3.9 tests how well the algorithm generalizes to the various neighborhood typologies. Interestingly, error rates appear relatively comparable across gentrifying (42.8%) and non-gentrifying tracts (51.3%). However, error rates differ sharply between low- and high-poverty neighborhoods (25.6% versus 70.6%) and between White and non-White neighborhoods (26% versus 69.9%).

3.6 Is this model biased? Should it be used?

The model exhibits significant differences in error rates across neighborhood contexts. Deployment of an algorithm like this would be problematic. Sales with the lowest (1st quintile) error rates have an average observed price and error rate of $230,696 and 3.6%, respectively. Sales with the highest (5th quintile) error rates have an average observed price and error rate of $87,644 and 174%, respectively. A model biased this way places a disproportionately higher property tax burden on lower-valued homes. In such an instance, an argument could be made that the algorithm has a disparate impact on the low-income families who likely live in these communities. For them, the algorithm may lead to more economic hardship and exacerbate housing instability.

Improvements to the model could be made by adding new features to equitably reduce error across space. If new features do not help, there are two potential remedies. The first is to come up with a set of rules to make general corrections in instances where, for example, the predicted price deviates from the neighborhood mean by more than 20%. These corrections are common in predictive algorithms, but as the number of rules increases, the need for a supervised algorithm decreases.

The other approach is to enact policies that mitigate the negative effect of property tax reassessments after they occur. In 2014, the City of Philadelphia enacted the Longtime Owner Occupants Program (LOOP) which freezes assessments that increase 200% or more (triple the base amount) from year to year for homeowners living in the residence for ten years or more. Recently, Philadelphia City Council President Darrell Clarke, who has been highly critical of the tax assessment system, proposed new legislation that would lower the LOOP threshold from 200% to 50%. From the Philadelphia Inquirer (emphasis added):

Clarke’s office analyzed the most recent assessments and found that about 75 percent of households that had assessment increases (from 2017 & 2018) between 50 percent and 200 percent are in census tracts with low to moderate income, meaning their income levels would likely qualify.

Philadelphia and cities like it are using property tax freezes as a way to offset gentrification-induced displacement. While it is likely that gentrification causes increased property taxes, it may also be true that poorly calibrated tax assessment models are partially to blame. Figure 3.10 displays a mock scorecard for the Philadelphia tax assessment algorithm. Note the fairness score, which is entirely fabricated, simply to show that once many algorithms can be compared in a repository setting, it is possible to rank them accordingly.

4. Recidivism Prediction

4.1 Recidivism and predictive parity

In 2016, ProPublica released an evaluation of the COMPAS recidivism prediction algorithm built by a company called Northpointe and currently in use by Florida and other states around the country. ProPublica found that while the algorithm had comparable accuracy rates across different racial groups (what is known as “predictive parity”), there were clear racial differences for errors that had high social costs. This paradox led ProPublica to ask a fascinating question - “how could an algorithm simultaneously be fair and unfair?”

In the criminal justice system, as in life, decisions are made by weighing risks. Among a host of Federal sentencing guidelines, judges are to “protect the public from further crimes of the defendant.” Rhetorically, this sounds straightforward - identify the risk that an individual will cause the public harm and impose a sentence that will reduce this risk. However, bias always plays a role in decision-making. We’d never ask the average citizen to weigh risks and punish accordingly because we don’t believe the average citizen could act with impartiality. Although this is the standard we impose on judges, even they make systematic mistakes.

Can an algorithm help judges make better decisions? A recent paper found that, even with far less data, people predict recidivism about as accurately as COMPAS. On the other hand, studies have shown that introducing prediction into the decision-making process can reduce the odds of re-arrest.

The use of data-driven risk models in the criminal justice system has increased in recent years. These algorithms predict risk for a host of outcomes for use in bail hearings, to determine whether an inmate should be granted parole, and to support sentencing decisions by assessing future criminal behavior. In the case below, the focus is on recidivism. Unlike tax assessment which has fairness implications at the community or household level, recidivism prediction can have a disparate impact on individuals.

Below, a recidivism algorithm is developed using the COMPAS data provided by ProPublica. Accuracy and generalizability are discussed, and recent research is presented showing how it may not be possible to rid these algorithms of their bias. We conclude by providing some useful context for deploying these algorithms in the face of bias.

4.2 Accuracy and generalizability in recidivism algorithms

A recidivism “classifier” algorithm has two “binary” outcomes - “Recidivate” and “Non-recidivate.” While the “percent of correct predictions” is a simple measure of accuracy, it lacks nuance, particularly given the social costs associated with different types of errors. Working through the many different approaches for measuring accuracy in binary predictive models builds an understanding of generalizability.

The basic premise is to learn from the recidivism experience of ex-offenders in the recent past and apply those patterns to a population whose propensity to recidivate is unknown. The prediction from the model is a number - a “risk score” - running from 0 to 1 and interpreted as “the probability person i will recidivate.” The analyst chooses a risk score threshold above which an ex-offender is classified as predicted to recidivate. The model can then be validated by comparing predicted classifications to observed outcomes, giving a host of more nuanced error types (a short sketch of how these are tallied follows the list below):

True Positive (“Sensitivity”) - “The person was predicted to recidivate and actually recidivated.”
True Negative (“Specificity”) - “The person was predicted not to recidivate and actually did not recidivate.”
False Positive - “The person was predicted to recidivate and actually did not recidivate.”
False Negative - “The person was predicted not to recidivate and actually did recidivate.”
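A minimal sketch of that tally, with the tidyverse loaded, assuming a testProbs data frame with the observed outcome (class, where 1 = recidivated) and the predicted probability (probs); the 50% threshold is illustrative.

testProbs %>%
  mutate(predClass = ifelse(probs >= 0.50, 1, 0),
         outcome = case_when(predClass == 1 & class == 1 ~ "True Positive",
                             predClass == 0 & class == 0 ~ "True Negative",
                             predClass == 1 & class == 0 ~ "False Positive",
                             predClass == 0 & class == 1 ~ "False Negative")) %>%
  count(outcome)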

The severity of classification errors and their associated social costs can only be judged in the context of the use case. The focus here, as it was in the case with tax assessment, is to understand whether these error types generalize well across race. What makes this so difficult is that the observed base rates of recidivism vary significantly across race.

4.3 Data and exploratory analysis

The data for this analysis were acquired by ProPublica as part of a public records request and are currently hosted on their GitHub page. At the time of analysis, we were unable to secure a data dictionary, thus many of the feature engineering routines employed in our code below were copied directly from ProPublica’s IPython Notebook.1 While this is not ideal, it is, at times, the nature of working with open data.

After cleaning, the data describes the 6,163 ex-offenders screened by COMPAS in 2013 and 2014. There are 53 columns in the original data describing length of jail stays, type of charges, the degree of crimes committed, and criminal history. Many of these variables were added by Northpointe, the original author of the COMPAS algorithm, and are not relevant to the model building process. There are also a host of variables that Northpointe collects from survey data that do not seem to be present in the dataset. Also noticeably absent are data describing economic and educational backgrounds. We return to this shortcoming in Section 4.5. The model developed below is simplistic - it is not a replication of the existing Northpointe algorithm, which is proprietary. The table below describes the features.

Variable Description
sex Categorical variable indicating whether the ex-offender is male or female
age The age of the person
age_cat Variable that categorizes ex-offenders into three age groups: Less than 25, 25 to 45, Greater than 45
race The race of the person
priors_count The number of prior crimes committed
two_year_recid Numeric binary variable indicating whether the person recidivated (1) or not (0)
r_charge_desc Description of the charge upon recidivating
c_charge_desc Description of the original criminal charge
c_charge_degree Degree of the original charge
r_charge_degree Degree of the charge upon recidivating
juv_other_count The number of prior juvenile convictions that are considered neither felonies nor misdemeanors
length_of_stay How long the person stayed in jail
Recidivated Character binary variable of whether the person recidivated (Recidivate) or not (notRecidivate)

Figure 4.1 illustrates the most frequent initial charge. Crimes of varying severity are included in the dataset. Figure 4.2 visualizes that for repeat offenders, the recidivism event tends to be a lesser crime than the initial offense.

Figure 4.3 visualizes the rate of recidivism outcomes by race. Note that the recidivism rate for African Americans (59%) is twice that of Caucasians (29%). This has important implications for generalizability as described below.

4.4 Modeling

A logistic regression model is estimated. Again, a more advanced model would be employed in reality, but for demonstration purposes the model is kept simple. Minimal feature selection tests are undertaken and, to keep the model simple, only seven features are employed. There is a naive belief among some that algorithmic discrimination can be prevented by omitting controls for protected groups from the model. Although this is not the case, race predictors are omitted from the algorithm. Figure 4.4 plots the variable importance for the model, defined as standardized regression coefficients. The greatest predictive power comes from features describing the total number of prior convictions for a given ex-offender. Because these features skew Figure 4.4, they are omitted from the plot.

The data are split into 75/25 training/test sets. 100-fold cross validation is performed on the 75% training set to test how well the model would generalize to new data. Ex-offenders with predicted probabilities greater than or equal to 50% are classified as “will recidivate,” while those less than 50% are classified as “will not recidivate”. Other approaches to finding appropriate cut-offs are described below. Here, we err on the side of simplicity.
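A hedged sketch of that workflow is below; the data frame name (recid), the use of caret’s createDataPartition for the split, and the particular seven-feature selection are assumptions for illustration, with the tidyverse loaded.

library(caret)

set.seed(1234)
trainIndex <- createDataPartition(recid$two_year_recid, p = 0.75, list = FALSE)
recidTrain <- recid[trainIndex, ]
recidTest  <- recid[-trainIndex, ]

# Logistic regression on seven features; race is deliberately excluded
reg <- glm(two_year_recid ~ sex + age + age_cat + priors_count +
             juv_other_count + length_of_stay + c_charge_degree,
           data = recidTrain, family = "binomial")

testProbs <- data.frame(class = recidTest$two_year_recid,
                        probs = predict(reg, recidTest, type = "response"),
                        race  = recidTest$race) %>%
  mutate(predClass = ifelse(probs >= 0.50, 1, 0))      # 50% classification threshold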

4.5 Accuracy and generalizability

The simplest accuracy metric is defined as (# of True Positives + # of True Negatives) / Total Observations. Model Accuracy for Caucasians, African Americans, and Hispanics is 67%, 68.2%, and 67.8%, respectively. By this metric alone, we could conclude that the model is fair. There are some additional visual representations worth examining as well. Figure 4.5 presents a Receiver Operator Characteristic (ROC) Curve - a traditional measure of accuracy for binary classifiers. The ROC curve illustrates trade-offs in True Positive and False Positive rates for a given threshold.

The diagonal line in Figure 4.5 is the “random guess line.” Along this line, if the model correctly predicts recidivism, say, 40% of the time, it also mis-classifies recidivism 40% of the time. If the ROC curve dips below the random guess line, then the model is less effective than a random coin flip. By definition, this is an underfit model. For contrast, an ROC curve with a perfect right angle and its vertex at the top left of the plot is perfectly overfit. In such an instance, the interpretation would be that the model correctly classifies recidivism 100% of the time and incorrectly classifies it 0% of the time.

Here, the model suggests that if recidivism is classified correctly 75% of the time, it is mis-classified ~45% of the time. The Area Under the Curve (AUC) simply measures the plot space below the ROC curve. An ROC curve along the coin flip line has an AUC of 0.50, or 50%. The overfit ROC curve described above would have an AUC of 1, or 100%. The AUC for this model is 0.73, which represents marginal accuracy.

AUC is useful for visualizing accuracy across race. Three ROC curves and corresponding AUC metrics are presented in Figure 4.6 - one for each of the three predominant race groups included in the data. The results confirm the finding of predictive parity. These goodness of fit metrics are robust to cross-validation as well. Figure 4.7 shows mean and standard deviation across 100 random test set holdouts for three key goodness of fit metrics. The mean True Negative Rate (Specificity) is lower than the mean True Positive rate (Sensitivity), suggesting that the model is more error-prone in cases where ex-offenders are predicted not to recidivate.
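Both the across-race accuracy rates and the AUCs behind Figure 4.6 reduce to a short grouped summary of the assumed testProbs data frame (ModelMetrics is among the packages loaded in the appendix).

testProbs %>%
  group_by(race) %>%
  summarize(accuracy = mean(predClass == class),            # (TP + TN) / n
            AUC      = ModelMetrics::auc(class, probs))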

Accuracy is too simplistic a measure, given the social costs associated with certain types of errors. Generalizing to random holdouts is important, but generalizing across race is more so. Figure 4.8 contrasts observed and predicted recidivism rates. 44.5% of ex-offenders are observed to recidivate across all races, but only 39.6% are predicted to do so. This underprediction is readily apparent for Caucasians and Hispanics, who are predicted to recidivate at far lower rates.

Figure 4.9 compares error rates by race. Again, the model appears equally accurate with respect to race. True Negatives correctly predict that an ex-offender will not recidivate and are far more common for Caucasians and Hispanics. False Negatives predict that an ex-offender will not recidivate when the person in fact does; these errors are far more common for Caucasians and Hispanics. True Positives correctly predict that an ex-offender will recidivate, and this rate is far higher for African Americans. False Positives predict that an ex-offender will recidivate when the person in fact does not; African Americans exhibit higher false positive rates.
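The rates underlying Figure 4.9 can be computed the same way; the False Positive and False Negative rates are shown here as an example, again using the assumed testProbs data frame.

testProbs %>%
  group_by(race) %>%
  summarize(falsePositiveRate = sum(predClass == 1 & class == 0) / sum(class == 0),
            falseNegativeRate = sum(predClass == 0 & class == 1) / sum(class == 1))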

Finally, the tradeoff between False Negatives and False Positives is visualized for Caucasian and African American ex-offenders. The Detection Error Tradeoff (DET) curve below shows, in general, that a threshold yielding a lower False Positive rate will yield higher False Negative rates, with the False Negative rate higher for Caucasians. Conversely, a threshold that yields lower False Negative rates will yield very high False Positive rates, but there will be less bias across the races. The DET curve provides an effective indicator for understanding the social costs associated with the model.

4.6 Is this model biased? Should it be used?

Taken together, these metrics describe an algorithm that if used, could have disparate impact on African Americans. Although the model exhibits predictive parity with comparable, across-race accuracy rates, different races exhibit different error rates. These results illustrate how an algorithm can simultaneously be fair and biased. Figure 4.11 presents the scorecard for this algorithm. Once again, the fairness score is shown only for demonstration purposes.



Imagine this algorithm was put into production to help a judge make parole decisions. The greater rates of False Negatives for Caucasians and Hispanics may have critical public safety implications. Although the predicted and observed recidivism rates are comparable overall for African Americans, differences in False Positive rates suggest a model that generates potentially higher social costs for this group.

In a research rebuttal prepared by Northpointe, the authors point out that equal across-race Accuracy and AUC (i.e., “predictive parity”) is a better fairness standard compared to non-generalized False Positive and False Negative rates. Researchers are beginning to investigate whether it’s possible for an algorithm to balance these metrics when base rates differ across race. Kleinberg et al. (2016) and Chouldechova (2017) show that it may not be mathematically feasible.

Recidivism base rates vary either because African Americans and Caucasians exhibit a varying propensity for recidivism or because of structural discrimination and over-policing. While it may not be possible to fulfill both fairness criteria at once, this does not mean these algorithms should be abandoned entirely. It may be possible to find classification thresholds that minimize error types with the greatest social cost. These choices must be well communicated to policy makers and stakeholders must have a robust understanding about how these issues inform the use of the algorithm. In addition, developers should consider serving risk scores to decision-makers in a dashboard setting alongside other background indicators from criminal justice, education, and employment data.

Underfitting may be an alternative explanation for varying, across-group base rates. If so, the analyst could add features to better control for across-group differences. The literature, which is still very new, has focused on the issue of base rates, but it is still unclear how this issue may conflate with that of underfitting.

5. Conclusion

The purpose of a tax assessment algorithm is automation. It is far more efficient to predict market values from home sale data than to hire an army of inspectors to visit hundreds of thousands of properties annually. The purpose of a recidivism algorithm is decision-support, not automation. It is unlikely that any jurisdiction would replace a judge, social worker, or program manager with an algorithm alone.

This suggests that it’s on governments to employ skilled data scientist teams that can design, build, and interpret these algorithms in the context of the use case. These teams should not be lead by engineers, but policy makers - those who have the requisite institutional and domain knowledge.

Data science has exploded in the private sector because it encourages decision-making with better marginal returns on investment. While government does not profit in the same way, it too has an incentive to provide its citizens with value. Value may be defined as economic vitality; it may be safer streets or healthier children. If data can help policy makers better allocate limited resources, then we should continue down a path that explores how data can have an impact.

This should be done with great responsibility however. The private sector has demonstrated a willingness to trade-off privacy, equity, and transparency for utility. This is not an option in the public sector. We hope someday that the Open Algorithmic Scorecard or an equivalent repository can be created to help share best practices and inform citizens about how public algorithms may be influencing their lives. We also hope that other researchers and governments continue to publish tutorials, like those presented here, to further disseminate knowledge and share best practices.

6. Appendix: Code for replication

This appendix provides the reader with the necessary code to replicate the analyses above. Begin with Section 6.1 to install and load all required packages. Then set up visualization themes and color palettes in Section 6.2. Code to replicate the home price and recidivism predictions can be found in Sections 6.3 and 6.4, respectively. The code walks through data collection, exploratory analysis, the modeling process, and assessing for accuracy and generalizability. To make the code as accessible as possible, this appendix was written for the beginner or intermediate R user.

6.1 Required Packages

The code was executed using functions from the following packages. Please install and load all packages before proceeding.

  • tidyverse
  • sf
  • jsonlite
  • esri2sf
  • tidycensus
  • scales
  • quantmod
  • FNN
  • spdep
  • parallel
  • doParallel
  • ranger
  • caret
  • mltools
  • ggalluvial
  • ModelMetrics
  • plotROC
  • pdp
  • grid
  • gridExtra
  • QuantPsyc
  • cowplot
  • osmdata

6.2 Visualization Themes

The data visualizations in the following sections are created using these custom themes:

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 14, face = "plain", family = "sans", hjust = 0),
    plot.subtitle = element_text(size = 11, face="italic", family = "sans", hjust = 0),
    plot.caption = element_text(size = 10, hjust=0, face = "plain", family = "sans"),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey85", size = 0.1),
    panel.grid.minor = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    axis.title = element_text(size=10),
    axis.text = element_text(size=8),
    axis.title.x = element_text(hjust= 1),
    axis.title.y = element_text(hjust= 1),
    legend.text = element_text(size = 9, face = "plain", family = "sans"),
    legend.title = element_text(size = 10, face = "plain", family = "sans")
  )
}

mapTheme <- function(base_size = 11) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14, face = "plain", family = "sans", hjust = 0),
    plot.subtitle = element_text(size = 11, face="italic", family = "sans", hjust = 0),
    plot.caption = element_text(size = 10, hjust=0, face = "plain", family = "sans"),
    legend.text = element_text(size = 9, face = "plain", family = "sans"),
    legend.title = element_text(size = 10, face = "plain", family = "sans"),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.line = element_blank(),
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "white")
  )
}

and these color palettes:

palette_9_colors <- c("#FF2AD4","#E53AD8","#CC4ADC","#996AE5","#7F7BE9",
                      "#668BED","#33ABF6","#19BBFA","#00CCFF")
palette_8_colors <- c("#FF2AD4","#E53AD8","#CC4ADC","#996AE5","#7F7BE9",
                      "#668BED","#33ABF6","#00CCFF")
palette_7_colors <- c("#FF2AD4","#E53AD8","#CC4ADC","#996AE5","#668BED","#33ABF6","#00CCFF")
palette_6_colors <- c("#00CCFF","#33ABF6","#668BED","#996AE5","#CC4ADC","#FF2AD4")
palette_5_colors <- c("#FF2AD4","#CC4ADC","#996AE5","#4C9BF2","#00CCFF")
palette_3_colors <- c("#FF2AD4","#7F7BE9","#00CCFF")
palette_2_colors <- c("#FF2AD4", "#00CCFF")
palette_1_colors <- c("#00CCFF")

A basemap is also used for all maps below.

hydro <- 
  esri2sf("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/Hydrographic_Features_Poly/FeatureServer/0") %>% 
  st_transform(2272) %>% 
  filter(CREEK_NAME %in% c("Delaware River", "Schuylkill River") & MUNI == "Philadelphia") %>% 
  rename(geometry = geoms) %>% 
  dplyr::select(geometry)

streets <- esri2sf("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/Street_Centerline/FeatureServer/0") %>% 
  st_transform(2272) %>% 
  rename(geometry = geoms) %>% 
  filter(ST_NAME %in% c("BROAD", "MARKET")) %>% 
  dplyr::select(geometry)

boundary <-
  esri2sf("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/City_Plan_Boundary/FeatureServer/0") %>% 
  st_transform(2272) %>% 
  rename(geometry = geoms) %>% 
  dplyr::select(geometry) %>% 
  st_union()
  
baseMap <- ggplot() +
  geom_sf(data = boundary, aes(), fill = "grey85", color = NA) +
  geom_sf(data = streets, aes(), fill = NA, color = "white") +
  geom_sf(data = hydro, aes(), fill = "slategray2", color = NA) +
  mapTheme()

6.3 Predicting home price

Data and Exploratory Analysis

Data collection

We begin by acquiring data for single family properties sold between July 2017 and July 2018 from OpenDataPhilly through the API. The data can also be downloaded here. We will want to adjust sale prices for inflation for this analysis. CPI data are acquired using the getSymbols function from the quantmod package, and an annual conversion factor is calculated by which we multiply the sale price. Please note that conversion factors were updated with final CPI data in early 2019 and may differ from the estimates originally used, resulting in slightly different adjusted sale prices.

salesJSON <- fromJSON("https://phl.carto.com/api/v2/sql?q=SELECT%20*,%20ST_X(the_geom)%20AS%20lat,%20ST_Y(the_geom)%20AS%20lng%20FROM%20opa_properties_public%20WHERE%20(sale_date%20%3E=%20%272017-07-17%27%20AND%20sale_date%20%3C=%20%272018-07-17%27%20and%20category_code_description%20=%20%27Single%20Family%27)")

getSymbols("CPIAUCSL", src='FRED')
avg.cpi <- apply.yearly(CPIAUCSL, mean)
cf <- avg.cpi/as.numeric(avg.cpi['2018']) #using 2018 as the base year
cf <- cf %>% 
  as.data.frame() %>% 
  mutate(date = rownames(.), 
         year = substring(date, 1, 4)) %>% 
  dplyr::select(-date) %>% 
  spread(year, CPIAUCSL)

Next, we convert the JSON into a dataframe. Several variables are created, such as the season the home was sold, while others are reclassified as categorical variables, like year_built. We create an sf object and transform it into the Pennsylvania South State Plane projection.

sales <- as.data.frame(salesJSON$rows) %>%
  dplyr::select(parcel_number, lat, lng, sale_date, sale_price, 
         year_built, homestead_exemption, total_livable_area, 
         central_air, garage_spaces, taxable_land, fireplaces, number_of_bathrooms,
         number_of_bedrooms, number_stories, mailing_city_state) %>% 
  rename(homestead = homestead_exemption,
         AC = central_air,
         numBed = number_of_bedrooms,
         numBath = number_of_bathrooms,
         numStories = number_stories) %>% 
  mutate(sale_year = substring(sale_date, 1, 4),
         month = substring(sale_date, 6, 7),
         garage_spaces = as.factor(garage_spaces),
         fireplaces = as.factor(fireplaces),
         numBed = as.factor(numBed),
         numBath = as.factor(numBath),
         numStories = as.factor(numStories),
         AC = ifelse(AC == "Y", 1, 0),
         own_occ = ifelse(str_detect(mailing_city_state, '\\b(PHILADELPHIA)\\b'), 1, 0),
         price = round(ifelse(sale_year == 2017, sale_price * cf$`2017`, sale_price), 2),
         homestead = ifelse(homestead == 0, 0, 1),
         season = case_when(month == "01" | month == "02" | month == "12" ~ "Winter",
                            month == "03" | month == "04" | month == "05" ~ "Spring", 
                            month == "06" | month == "07" | month == "08" ~ "Summer",
                            month == "09" | month == "10" | month == "11" ~ "Fall"),
         year_built = case_when(year_built < 1970 ~ "1970_earlier",
                                year_built >= 1970 & year_built < 1980  ~ "1970s",
                                year_built >= 1980 & year_built < 1990  ~ "1980s",
                                year_built >= 1990 & year_built < 2000  ~ "1990s",
                                year_built >= 2000 & year_built < 2010  ~ "2000s",
                                year_built >= 2010 & year_built < 2020  ~ "2010s")) %>%
  na.omit() %>% 
  filter(price >= 3000,
         price < 1000000) %>% 
  st_as_sf(coords = c("lat", "lng"), crs = st_crs("+proj=longlat +datum=NAD83")) %>% 
  dplyr::select(-mailing_city_state, -sale_date, -sale_price, -month) %>% 
  st_transform(2272) %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1],
         homestead = as.factor(homestead),
         AC = as.factor(AC),
         own_occ = as.factor(own_occ),
         season = as.factor(season),
         year_built = as.factor(year_built))

Next, we use the package esri2sf to acquire a shapefile of Philadelphia’s neighborhoods and intersect it with the sales data.

nhoods <- 
  esri2sf("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/Philly_Neighborhoods/FeatureServer/0") %>% 
  st_transform(2272) %>% 
  dplyr::select(MAPNAME)

sales <- st_intersection(sales, nhoods)

We include several neighborhood amenities and disamenities in the dataset by calculating nearest neighbor and average nearest neighbor distances with the function nn_function. Data for many of these variables can be found on OpenDataPhilly, while others are retrieved from OpenStreetMap using the package osmdata.

nn_function <- function(measureFrom,measureTo,k) {
  
  nn <-   
    FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
  
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint)
  
  return(output)  
}

singlefamXY <-
  sales %>%
  as.data.frame() %>% 
  dplyr::select(lat,lng) %>%
  as.matrix()

#UNIVERSITIES
univXY <- 
  esri2sf("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/Universities_Colleges/FeatureServer/0") %>% 
  st_transform(2272) %>% 
  dplyr::select(NAME, geoms) %>% 
  rename(geometry = geoms) %>% 
  st_sf() %>% 
  st_centroid() %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1]) %>%
  as.data.frame() %>% 
  dplyr::select(lat,lng) %>%
  as.matrix()

univDist <- nn_function(singlefamXY, univXY, 1)
sales <- cbind(as.data.frame(sales), as.data.frame(univDist)) %>% 
  rename(univDist = pointDistance)

#PARKS CENTROID
parksXY <-
  esri2sf("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/PPR_Assets/FeatureServer/0") %>% 
  st_transform(2272) %>% 
  st_centroid() %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1]) %>%
  as.data.frame() %>% 
  dplyr::select(lat,lng) %>%
  as.matrix()

parksDist <- nn_function(singlefamXY, parksXY, 1)
sales <- cbind(as.data.frame(sales), as.data.frame(parksDist)) %>% 
  rename(parkDist = pointDistance)

#CITY HALL CENTROID
cityHallXY <- 
  esri2sf("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/CITY_LANDMARKS/FeatureServer/0") %>% 
  filter(NAME == "City Hall",
         FEAT_TYPE == "Municipal Building") %>% 
  st_transform(2272) %>% 
  st_centroid() %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1]) %>%
  as.data.frame() %>% 
  dplyr::select(lat,lng) %>%
  as.matrix()

cityHallDist <- nn_function(singlefamXY, cityHallXY, 1)
sales <- cbind(as.data.frame(sales), as.data.frame(cityHallDist)) %>% 
  rename(ccDist = pointDistance)

#MFL/BSL STOPS
boundary <-
  esri2sf("https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/City_Plan_Boundary/FeatureServer/0") %>% 
  st_transform(2272)

box <- opq(bbox = c(-75.338789, 39.846501, -74.955763, 40.166279)) 

septaXY <- add_osm_feature(opq = box, key = 'station', value = "subway")
septaXY <- osmdata_sf(septaXY)
septaXY <- septaXY$osm_points %>% 
  st_sf() %>% 
  st_transform(2272) %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1]) %>% 
  st_intersection(boundary) %>%
  as.data.frame() %>% 
  dplyr::select(lat,lng) %>%
  as.matrix()

septaDist <- nn_function(singlefamXY, septaXY, 1)
sales <- cbind(as.data.frame(sales), as.data.frame(septaDist)) %>% 
  rename(septaDist = pointDistance)

#NEW CONSTRUCTION PERMITS
permitsJSON <- fromJSON("https://phl.carto.com/api/v2/sql?q=SELECT%20*,%20ST_X(the_geom)%20AS%20lat,%20ST_Y(the_geom)%20AS%20lng%20FROM%20li_permits%20WHERE%20(permitissuedate%20%3E=%20%272016-07-17%27%20and%20permitissuedate%20%3C=%20%272018-07-17%27)")
permitsXY <- as.data.frame(permitsJSON$rows) %>%
  dplyr::select(permitissuedate, lat, lng, permitdescription) %>% 
  filter(permitdescription == "NEW CONSTRUCTION PERMIT") %>% 
  na.omit() %>% 
  st_as_sf(coords = c("lat", "lng"), crs = st_crs("+proj=longlat +datum=NAD83")) %>% 
  st_transform(2272) %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1]) %>%
  as.data.frame() %>% 
  dplyr::select(lat,lng) %>%
  as.matrix()

permitDist <- nn_function(singlefamXY, permitsXY, 5)
sales <- cbind(as.data.frame(sales), as.data.frame(permitDist)) %>% 
  rename(permitDist = pointDistance)

#AGGRAVATED ASSAULTS
aggAssaultsJSON <- fromJSON("https://phl.carto.com/api/v2/sql?q=SELECT%20*,%20ST_X(the_geom)%20AS%20lat,%20ST_Y(the_geom)%20AS%20lng%20FROM%20incidents_part1_part2%20WHERE%20(dispatch_date%20%3E=%20%272016-07-17%27%20and%20dispatch_date%20%3C=%20%272018-07-17%27%20and%20Text_General_Code%20in%20(%27Aggravated%20Assault%20No%20Firearm%27,%27Aggravated%20Assault%20Firearm%27))") 
aggAssaultsXY <- as.data.frame(aggAssaultsJSON$rows) %>%
  dplyr::select(lat, lng, text_general_code) %>% 
  na.omit() %>% 
  st_as_sf(coords = c("lat", "lng"), crs = st_crs("+proj=longlat +datum=NAD83")) %>% 
  st_transform(2272) %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1]) %>%
  as.data.frame() %>% 
  dplyr::select(lat,lng) %>%
  as.matrix()

aggAssaultsDist <- nn_function(singlefamXY, aggAssaultsXY, 5)
sales <- cbind(as.data.frame(sales), as.data.frame(aggAssaultsDist)) %>% 
  rename(aggAssaultsDist = pointDistance)

#DRY CLEANERS
drycleanXY <- add_osm_feature(opq = box, key = 'shop', value = "dry_cleaning")
drycleanXY <- osmdata_sf(drycleanXY)
drycleanXY <- st_geometry(drycleanXY$osm_points) %>% 
  st_sf() %>% 
  st_transform(2272) %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1]) %>% 
  st_intersection(boundary) %>% 
  as.data.frame() %>% 
  dplyr::select(lat, lng) %>% 
  as.matrix()

drycleanDist <- nn_function(singlefamXY, drycleanXY, 2)
sales <- cbind(as.data.frame(sales), as.data.frame(drycleanDist)) %>% 
  rename(drycleanDist = pointDistance)

#GRAFFITI 311 CALLS
graffitiJSON <- fromJSON("https://phl.carto.com/api/v2/sql?q=SELECT%20*,%20ST_X(the_geom)%20AS%20lat,%20ST_Y(the_geom)%20AS%20lng%20FROM%20public_cases_fc%20WHERE(service_name%20=%20%27Graffiti%20Removal%27)")
graffitiXY <- as.data.frame(graffitiJSON$rows) %>%
  dplyr::select(lat, lng, service_name) %>% 
  na.omit() %>% 
  st_as_sf(coords = c("lat", "lng"), crs = st_crs("+proj=longlat +datum=NAD83")) %>% 
  st_transform(2272) %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1]) %>%
  as.data.frame() %>% 
  dplyr::select(lat,lng) %>%
  as.matrix()

graffitiDist <- nn_function(singlefamXY, graffitiXY, 5)
sales <- cbind(as.data.frame(sales), as.data.frame(graffitiDist)) %>% 
  rename(graffitiDist = pointDistance)

#ABANDONED VEHICLE 311 CALLS
abandVJSON <- fromJSON("https://phl.carto.com/api/v2/sql?q=SELECT%20*,%20ST_X(the_geom)%20AS%20lat,%20ST_Y(the_geom)%20AS%20lng%20FROM%20public_cases_fc%20WHERE(service_name%20=%20%27Abandoned%20Vehicle%27)") 
abandVXY <- as.data.frame(abandVJSON$rows) %>%
  dplyr::select(lat, lng, service_name) %>% 
  na.omit() %>% 
  st_as_sf(coords = c("lat", "lng"), crs = st_crs("+proj=longlat +datum=NAD83")) %>% 
  st_transform(2272) %>% 
  mutate(lat = st_coordinates(.)[,2],
         lng = st_coordinates(.)[,1]) %>%
  as.data.frame() %>% 
  dplyr::select(lat,lng) %>%
  as.matrix()

abandVDist <- nn_function(singlefamXY, abandVXY, 5)
sales <- cbind(as.data.frame(sales), as.data.frame(abandVDist)) %>% 
  rename(abandVDist = pointDistance)

At this point, the data is formatted as a dataframe. We transform it back into an sf object.

sales <- sales %>% 
  st_sf() %>% 
  st_transform(2272)
Exploratory analysis

First, we look at the average sale price and the standard deviation as a percentage of mean price (the coefficient of variation) by neighborhood. When mapping our results, we use the baseMap and palette_5_colors objects created above and format the legends using the scales package.

salePrice_nhood <- sales %>% 
  group_by(MAPNAME) %>% 
  summarize(sd_price = sd(price),
            avg_price = mean(price),
            count = n(),
            sd_pct = sd_price/avg_price) %>% 
  as.data.frame() %>% 
  dplyr::select(-geometry) %>% 
  left_join(nhoods %>% as.data.frame(), . , by = "MAPNAME") %>% 
  rename(geometry = geoms) %>% 
  st_sf() %>% 
  st_transform(2272)

#average price
baseMap +
  geom_sf(data = salePrice_nhood %>% na.omit(), 
          aes(fill = factor(ntile(avg_price, 5))), colour = "white", alpha = 0.7, lwd = 0.3) +
  scale_fill_manual(values = palette_5_colors,
                     labels=as.character(dollar_format()(round(quantile(salePrice_nhood$avg_price,
                                                        c(.1,.2,.4,.6,.8), na.rm=T)))),
                     name="Price") +
  labs(title = "Average sale price\nby neighborhood")

#standard deviation of price as percent of mean sale price
baseMap +
  geom_sf(data = salePrice_nhood %>% na.omit(), 
          aes(fill = factor(ntile(sd_pct, 5))), colour = "white", alpha = 0.7, lwd = 0.3) +
  scale_fill_manual(values = palette_5_colors,
                     labels=as.character(percent_format()(quantile(salePrice_nhood$sd_pct,
                                                        c(.1,.2,.4,.6,.8), na.rm=T))),
                     name="Percent\n(Quintile Breaks)") +
  labs(title = "Standard deviation as a percentage of\nmean sale price by neighborhood")

Next, we examine the average sale price across three different typologies:

  1. Non-gentrifying versus gentrifying census tracts.
#2000 census tracts
CT00 <- get_decennial(geography = "tract", state = 42, county = 101, geometry = T, year = 2000,
                      variables = "H002006") %>% 
  dplyr::select(GEOID) %>%
  st_sf() %>% 
  st_transform(2272)

gentrifyingCTs <- st_read("gentrifyingCT.shp") %>% 
  st_transform(2272) %>% 
  rename(GEOID = CTIDFP00) %>% 
  dplyr::select(GEOID, geometry) %>% 
  mutate(gentrify = 0) %>% 
  as.data.frame() %>% 
  left_join(CT00 %>% as.data.frame(), ., by = "GEOID") %>% 
  dplyr::select(-geometry.y) %>% 
  rename(geometry = geometry.x) %>% 
  mutate(gentrify = ifelse(is.na(gentrify), 1, 0)) %>% 
  st_sf() %>% 
  st_transform(2272)

gentrifyTypo <- st_join(sales, gentrifyingCTs, join = st_within, largest = TRUE) %>% 
  na.omit() %>% 
  group_by(gentrify) %>% 
  summarize(meanPrice = mean(price)) %>% 
  as.data.frame() %>% 
  dplyr::select(-geometry) %>% 
  mutate(Category = c("Gentrifying v. Non-gentrifying")) %>% 
  rename(typo = gentrify)
  2. Majority non-White versus majority White census tracts.
#2010 census tracts
CTvar <- get_acs(geography = "tract", variables = c("B02001_002E", 
                                                    "B01001_001E"), 
                 year = 2016, state = 42, county = 101, geometry = T) %>%
  dplyr::select(variable,estimate, GEOID) %>%
  as.data.frame() %>%
  group_by(variable) %>%
  spread(variable, estimate) %>%
  rename(TotalWhite = B02001_002, 
         TotPop = B01001_001) %>%
  mutate(pctW = TotalWhite/TotPop,
         race = case_when(pctW >= 0.5 ~ 0,
                          pctW < 0.5 ~ 1)) %>% 
  st_sf() %>% 
  st_transform(2272)

minorityTypo <- st_join(sales, CTvar, join = st_within, largest = TRUE) %>% 
  na.omit() %>% 
  group_by(race) %>% 
  summarize(meanPrice = mean(price)) %>% 
  as.data.frame() %>% 
  dplyr::select(race, meanPrice) %>% 
  mutate(Category = c("White v. Non-white")) %>% 
  rename(typo = race)
  3. Qualified Census Tracts (QCT) versus not QCT.
qualifiedCTs <- st_read("phillyQCT.shp") %>% 
  st_transform(2272) %>% 
  dplyr::select(GEOID) %>% 
  mutate(Qualified = 1) %>% 
  as.data.frame() %>% 
  left_join(CTvar %>% as.data.frame(), ., by = "GEOID") %>% 
  dplyr::select(-geometry.y) %>% 
  rename(geometry = geometry.x) %>% 
  dplyr::select(GEOID, Qualified, geometry) %>% 
  mutate(Qualified = case_when(Qualified == 1 ~ 1,
                               is.na(Qualified) == TRUE ~ 0)) %>% 
  st_sf() %>% 
  st_transform(2272)

qualifiedTypo <- st_join(sales, qualifiedCTs, join = st_within, largest = TRUE) %>% 
  na.omit() %>% 
  group_by(Qualified) %>% 
  summarize(meanPrice = mean(price)) %>%
  as.data.frame() %>% 
  dplyr::select(-geometry) %>% 
  mutate(Category = c("Non QCTs v. QCTs")) %>% 
  rename(typo = Qualified)

To compare average sale price in these typologies, we plot the data as a grouped barplot.

groupedBar <- rbind(minorityTypo, qualifiedTypo, gentrifyTypo) %>% 
  mutate(hj = c(-4.5, -0.5, -2, -1.25, -2, -0.5),
         vj = c(-1.5, 2.5, -1.5, 2.5, -1.5, 2.5),
         type = c("White", "Non-white", "Non QCT", "QCT","Gentrifying","Non-gentrifying"))

ggplot(groupedBar, aes(factor(Category), meanPrice, fill = as.factor(typo))) + 
  geom_bar(stat="identity", position = "dodge", width = 0.5) +
  scale_fill_manual(values =  palette_2_colors) +
  scale_y_continuous(label = dollar) +
  labs(x = "Typology",
       y = "Average sale price",
       title = "Average sale price across typologies",
       caption = "Figure 3.2") +
  geom_text(aes(label = type), size = 3.5, vjust = groupedBar$vj, y = 0, hjust = groupedBar$hj, 
              fontface = "bold", 
              color = "white", angle = 90) +
  plotTheme() +
  theme(legend.position = "none")

Modeling

Prior to modeling, we look at the spatial distribution of the average nearest neighbor distances created above. For example, we map the average distance to new construction permits.

baseMap + 
  geom_point(data = sales, aes(x = lng, y = lat, colour = factor(ntile(permitDist, 5))), size = 0.1) +
  scale_color_manual(values = palette_5_colors,
                     labels=as.character(round(quantile(sales$permitDist, na.rm=T))),
                     name="Average\nnearest\nneighbor\ndistance\n(feet)") +
  labs(title = "Distance to\nnew construction permits",
       subtitle = "k = 5") + 
  guides(colour = guide_legend(override.aes = list(size=1.5)))

Next, we run preliminary linear regressions. Based on those results, we log-transform some of the variables to increase predictive power. Our inclusion and rejection of variables for the final machine learning algorithm are also informed by these results.

sales <- sales %>% 
  as.data.frame() %>% 
  mutate(logccDist = log(ccDist),
         logUniv = log(univDist),
         MAPNAME = as.factor(MAPNAME),
         logAssault = log(aggAssaultsDist),
         logArea = log(total_livable_area),
         test = is.infinite(logArea)) %>% 
  rename(nhood = MAPNAME) %>% 
  filter(test != "TRUE") %>% 
  dplyr::select(-test)

When we assess accuracy and generalizability below, we pull out the final model we train in this section and use it to run 100-fold cross-validation and spatial cross-validation. However, we encountered an error in which the train function in the caret package returned an mtry parameter higher than the number of predictor variables. As a workaround, we one-hot encode the factor variables.

onehotdata <- sales %>% dplyr::select(nhood, numBed, numBath, numStories, 
                                      year_built, season, garage_spaces)
onehotdata <- data.table::data.table(onehotdata)

onehot <- one_hot(onehotdata, dropUnusedLevels = TRUE) %>% 
  as.data.frame() %>% 
  mutate_if(is.integer,as.factor)

salesForMod <- cbind(sales, onehot) %>% dplyr::select(-numBed, -numBath, -numStories, 
                                                -year_built, -season, -garage_spaces, 
                                                -total_livable_area,
                                                -taxable_land, -fireplaces, -univDist, 
                                                -drycleanDist, -ccDist, -septaDist, 
                                                -permitDist, -aggAssaultsDist,
                                                -graffitiDist, -abandVDist, -parkDist)
salesForMod <- data.frame(salesForMod, check.names = TRUE)
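
To make the one-hot encoding step concrete, here is a minimal toy sketch (hypothetical data, not part of the modeling workflow) showing how one_hot from the mltools package expands a factor into indicator columns.

#toy example: a single factor column becomes one 0/1 column per level
toy <- data.table::data.table(season = factor(c("Winter", "Summer", "Summer")))
one_hot(toy)
#expected output (approximately): columns season_Summer and season_Winter holding 0/1 indicators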

We then split our data into 60/40 training/test sets, ensuring that categorical levels with very few observations are put into the training set.

set.seed(5224)
TrainIndex <- createDataPartition(paste(salesForMod$numBed_13, salesForMod$numBed_31,
                                        salesForMod$numBath_6, salesForMod$numBath_10,
                                        salesForMod$numBath_21, salesForMod$numStories_11,
                                        salesForMod$numStories_24, salesForMod$numStories_27,
                                        salesForMod$numStories_30, salesForMod$numStories_32,
                                        salesForMod$numStories_34, salesForMod$numStories_39,
                                        salesForMod$numStories_45, salesForMod$garage_spaces_13,
                                        salesForMod$nhood_East.Park, 
                                        salesForMod$nhood_Mechanicsville,
                                        salesForMod$nhood_University.City,
                                        salesForMod$nhood_Woodland.Terrace), 
                                  p = .6,
                                  list = FALSE,
                                  times = 1)
train <- salesForMod[TrainIndex,] 
test <- salesForMod[-TrainIndex,]

Then we train the Random Forest model, running it in parallel to take advantage of extra cores on the machine. Note that because of the randomness inherent in cross-validation and the Random Forest algorithm, your exact results will differ from ours.

cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

mod.rf_woLag <- caret::train(log(price) ~ ., 
                             data = train %>% 
                               as.data.frame() %>% 
                               dplyr::select(-lat, -lng, -geometry, -nhood, -parcel_number, 
                                             -sale_year),
                             trControl = caret::trainControl(method = "cv", number = 10, 
                                                             allowParallel = TRUE),
                             na.action = na.omit,
                             method = "rf")

stopCluster(cluster)
registerDoSEQ()
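
A quick look at the tuning results that caret stores on the train object can also be helpful; this inspection is our suggestion rather than part of the original workflow.

#mtry value selected by cross-validation and the resampled error metrics for each mtry tried
mod.rf_woLag$bestTune
mod.rf_woLag$results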

Finally, we predict back onto the test set and calculate error.

pred.rf_woLag <- predict(mod.rf_woLag, test)

predValues.rf <- 
  data.frame(observedPrice = test$price,
             lng = test$lat,
             lat = test$lng,
             nhood = test$nhood,
             predictedPrice = exp(pred.rf_woLag)) %>% 
  mutate(error = predictedPrice - observedPrice,
         absError = abs(predictedPrice - observedPrice),
         percentAbsError = abs((predictedPrice - observedPrice) / observedPrice)) %>% 
  st_as_sf(coords = c("lat", "lng"), crs = st_crs(2272)) %>% 
  mutate(lat = st_coordinates(.)[,1],
         lng = st_coordinates(.)[,2])
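
As a quick aggregate check (a suggested addition, not part of the original workflow), we can also summarize overall test-set error before mapping it below.

#overall mean absolute error and mean absolute percent error on the test set
predValues.rf %>% 
  st_drop_geometry() %>% 
  summarize(MAE = mean(absError), MAPE = mean(percentAbsError))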

We also extract the variable importance from the final model created above and plot it to see what variables had the greatest predictive power.

varImp <- as.data.frame(mod.rf_woLag$finalModel$importance) %>% 
  mutate(Variable = row.names(.)) %>% 
  arrange(desc(IncNodePurity)) %>% 
  rename(Importance = IncNodePurity) %>% 
  top_n(20, Importance) %>% 
  mutate(color = sample(palette_9_colors, 20, replace = TRUE))

ggplot(varImp, aes(x = reorder(Variable, Importance), y = Importance, fill = Variable)) +
  geom_bar(stat = "identity", aes(fill=color)) +
  coord_flip() +
  labs(x = " ",
       title = "Variable importance",
       subtitle = "Random Forest model") +
  scale_fill_identity() +
  plotTheme() + 
  theme(legend.position = "none")

Accuracy and Generalizability

Accuracy

To assess the accuracy of the model, we plot observed prices as a function of predicted prices.

ggplot(data = predValues.rf, aes(x=predictedPrice, y=observedPrice)) +
  geom_point(size = 0.3) +
  geom_abline(color = "#FF2AD4", size = 0.75) +
  labs(x = "Predicted price",
       y = "Observed price",
       title = "Predicted vs. observed sale price",
       subtitle = "Test set points") +
  scale_x_continuous(label = dollar) +
  scale_y_continuous(label = dollar) +
  plotTheme()

Next, we look at error by decile. Here we use two functions: quantile_error and mape. The first subsets predictions whose observed price falls below a given quantile of the observed price range, and the second calculates the mean absolute percent error for that subset.

quantile_error <- function(pred,obs,quant){
  preds <- data.frame(pred = pred, obs = obs) %>%
    filter(quantile(seq(0,max(obs)), quant)>obs)
  return(preds)
}

mape <- function(pred,obs){
  mean(abs((pred-obs)/obs))
}
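
As a sanity check on mape, consider a tiny worked example with hypothetical numbers: predictions of $110,000 and $90,000 against observed prices of $100,000 each produce absolute percent errors of 10%, so the MAPE is 0.10.

#worked example with hypothetical values
mape(c(110000, 90000), c(100000, 100000))
#returns 0.1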

quantile_errors <- predValues.rf %>%
  as_tibble() %>%
  mutate(mdl_name = "Random Forest") %>% 
  nest(-mdl_name) %>% 
  mutate(q = list(seq(0,1,0.01)),
         pred   = map(data, "predictedPrice"),
         test_y = map(data, "observedPrice")) %>%
  dplyr::select(-data) %>%
  unnest(q, .preserve = c(pred, test_y)) %>%
  filter(q != 0) %>% 
  mutate(q_dat  = pmap(list(pred, test_y, q), quantile_error),
         q_pred = map(q_dat, "pred"),
         q_obs  = map(q_dat, "obs"),
         q_MAPE  = map2_dbl(q_pred, q_obs, mape),
         y_max  = quantile(seq(0,max(predValues.rf$observedPrice)), q),
         q_cnt  = nrow(predValues.rf) - map_int(q_dat, nrow))

We then create a plot for mean absolute percent error by decile and count of observations per decile.

#mean absolute percent error by decile
ggplot(data = quantile_errors, aes(x=q, y=q_MAPE)) +
  geom_line(size = 0.75) +
  scale_x_continuous(breaks=seq(0,1,0.1), labels = seq(0,1,0.1)) +
  scale_y_continuous(limits = c(0,max(quantile_errors$q_MAPE)), label = percent)+
  labs(y = "MAPE",
       x = "Decile",
       title = "Mean absolute percent error by decile",
       subtitle = "Test set points") +
  plotTheme()

#count of observations for each decile
ggplot(data = quantile_errors, aes(x=q, y=q_cnt)) +
  geom_line(size = 0.75) +
  scale_x_continuous(breaks=seq(0,1,0.1), labels = seq(0,1,0.1)) +
  labs(y = "Number of Predictions",
       x = "Decile",
       title = "Number of Predictions per Decile",
       subtitle = "Test set points") +
  plotTheme()

Finally, we calculate mean absolute error and root mean square error across 100-fold cross-validation, which we perform with a for loop that populates the data frame cv_results. As we did above, we split the data so that categorical levels with few observations are placed in the training set. We use the ranger package to run the Random Forest model with the hyperparameters of the final model trained above.

cv_results <- data.frame()
mtry_best <- mod.rf_woLag$bestTune$mtry
ntree_best <- mod.rf_woLag$finalModel$ntree

for(i in 1:100) {
  #print what cycle the for loop is running
  print(i)
  
  #create a data partition - taking factor variables into consideration
  #60/40 split for training and test set
  thisIndex <- createDataPartition(paste(salesForMod$numBed_13, salesForMod$numBed_31,
                                        salesForMod$numBath_6, salesForMod$numBath_10,
                                        salesForMod$numBath_21, salesForMod$numStories_11,
                                        salesForMod$numStories_24, salesForMod$numStories_27,
                                        salesForMod$numStories_30, salesForMod$numStories_32,
                                        salesForMod$numStories_34, salesForMod$numStories_39,
                                        salesForMod$numStories_45, salesForMod$garage_spaces_13,
                                        salesForMod$nhood_East.Park, 
                                        salesForMod$nhood_Mechanicsville,
                                        salesForMod$nhood_University.City,
                                        salesForMod$nhood_Woodland.Terrace), 
                                   p = .6,
                                   list = FALSE,
                                   times = 1)
  
  #use the partition to separate the data
  trainData <- salesForMod[thisIndex,] 
  testData <- salesForMod[-thisIndex,]
  
  #run the model
  #hyperparameters (ntrees and mtry) come from the caret::train final model run above
  thisMod <- ranger(log(price) ~ ., 
                    data = trainData %>% 
                      dplyr::select(-lat, -lng, -geometry, -nhood, -parcel_number, -sale_year),
                    num.trees = ntree_best,
                    mtry = mtry_best)
  
  #use this model to predict using the test set
  pred <- predict(thisMod, testData)
  pred <- data.frame(pred$predictions)
  thisPrediction <- pred %>% 
    mutate(prediction = exp(pred.predictions)) %>% 
    cbind(testData$price, .) %>% 
    mutate(error = testData$price - prediction,
           absError = abs(error),
           rmse = sqrt(mean(error^2))) %>% 
    dplyr::select(absError, rmse)
  
  #calculate MAE
  thisMAE <- mean(thisPrediction$absError) %>% as.data.frame()
  thisRMSE <- mean(thisPrediction$rmse) %>% as.data.frame()
  cvList <- cbind(thisMAE, thisRMSE)
  colnames(cvList) <- c("MAE", "RMSE")
  
  #add results to list
  cv_results <- rbind(cv_results, cvList)
}
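
To report a single figure for each metric, a short summary (a suggested addition, not in the original text) averages across the 100 folds.

#mean and standard deviation of MAE and RMSE across the 100 folds
cv_results %>% 
  summarize(mean_MAE = mean(MAE), sd_MAE = sd(MAE),
            mean_RMSE = mean(RMSE), sd_RMSE = sd(RMSE))
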
Generalizability

To assess generalizability, we start by mapping the absolute error and the absolute percent error on the test set.

#absolute error on test set
baseMap +
  geom_point(data = predValues.rf, aes(x = lat, y = lng, colour = factor(ntile(absError, 5))), 
          size = 0.5, alpha = 0.7) +
  scale_color_manual(values= palette_5_colors,
                     labels=as.character(dollar_format()(round(quantile(predValues.rf$absError,
                                                        c(.1,.2,.4,.6,.8), na.rm=T)))),
                     name="absolute\nerror") +
  labs(title = "Absolute error on test set points") + 
  guides(colour = guide_legend(override.aes = list(size=2)))

#absolute percent error on test set
baseMap +
  geom_point(data = predValues.rf, aes(x = lat, y = lng, colour = factor(ntile(percentAbsError, 5))), 
          size = 0.5, alpha = 0.7) +
 scale_color_manual(values= palette_5_colors,
                    labels=as.character(percent_format()(quantile(predValues.rf$percentAbsError,
                                                        c(.1,.2,.4,.6,.8), na.rm=T))),
                     name="absolute\npercent\nerror") +
  labs(title = "Absolute percent error on test set points") + 
  guides(colour = guide_legend(override.aes = list(size=2)))

We also calculate Global Moran’s I to determine if errors exhibit statistically significant clustering.

predValues.rf <- predValues.rf %>% 
  as.data.frame()

testCoords <- predValues.rf %>% 
  dplyr::select(lat, lng) %>% 
  as.matrix()

neighborPointsTest <- knn2nb(knearneigh(testCoords, 5))
spatialWeights <- nb2listw(neighborPointsTest, style="W")

moransTest_testPoints <- moran.mc(predValues.rf$absError, spatialWeights, nsim = 999)
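
To interpret the test, we can print the result and compare the observed statistic to the permuted distribution; this inspection is a suggested check rather than part of the original text.

#print the permutation test and visualize the observed statistic against the 999 permutations
moransTest_testPoints
hist(moransTest_testPoints$res, breaks = 30,
     main = "Permutations of Moran's I", xlab = "Moran's I")
abline(v = moransTest_testPoints$statistic, col = "red", lwd = 2)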

We then perform “Leave One Group Out Cross Validation” or LOGOCV to determine how well the model performs across neighborhoods. We use the spatialCV function to pull out a neighborhood, train the model on all other neighborhoods, test on the holdout, and then calculate mean absolute error and mean absolute percent error.

spatialCV <- function(dataFrame, uniqueID) {
  
  #initialize variables  
  training <- 0
  testing <- 0
  tuner <- 0
  currentUniqueID <- 0
  uniqueID_List <- 0
  y <- 0
  endList <- list()
  
  #create a list that is all the neighborhood unique ids in the data frame
  uniqueID_List <- unique(dataFrame[[uniqueID]])  
  x <- 1
  y <- length(uniqueID_List)
  
  #create a counter and while it is less than the number of neighborhoods...  
  while(x <= y) 
  {
    #print what cycle the loop is one
    print(x)
    
    #call a current neighborhood   
    currentUniqueID <- uniqueID_List[x]
    
    #create a training set of units not in that neighborhood and a test set of units that are
    training <- dataFrame[ which(dataFrame[[uniqueID]] != uniqueID_List[x]),]
    testing <- dataFrame[ which(dataFrame[[uniqueID]] == uniqueID_List[x]),]
    
    #save your model as a tuner
    tuner <- ranger(log(price) ~ ., 
                    data = training %>% 
                      dplyr::select(-lat, -lng, -geometry, -nhood, -parcel_number, -sale_year),
                    num.trees = ntree_best,
                    mtry = mtry_best)
    
    #come up with predictions on the test neighborhood
    pred_spatialCV <- predict(tuner, testing)
    pred_spatialCV <- data.frame(pred_spatialCV$predictions)
    thisPred <- pred_spatialCV %>% 
      mutate(prediction = exp(pred_spatialCV.predictions)) %>% 
      cbind(testing$price, .) %>% 
      mutate(absError = abs(testing$price - prediction),
             percentAbsError = abs((prediction - testing$price) / testing$price)) %>% 
      dplyr::select(absError, percentAbsError)
    
    #calculate MAE and count the number of observations
    spatialCV_MAE <- mean(thisPred$absError) %>% as.data.frame()
    spatialCV_MAPE <- mean(thisPred$percentAbsError) %>% as.data.frame()
    countTestObs <- nrow(testing)
    
    #combine the current results
    thisList <- cbind(currentUniqueID, spatialCV_MAE, spatialCV_MAPE, countTestObs)
    colnames(thisList) <- c("uniqueID", "MAE", "MAPE", "numObs")
    #add the current results to the final list
    endList <- rbind(endList, thisList)
    #iterate counter    
    x <- x + 1 
    
  } 
  #return the final list
  return (as.data.frame(endList))
}

We input the dataframe, test, and the neighborhood variable, nhood, into the function and then join the data to the neighborhood sf object to map the results.

spatialCV_results <- spatialCV(test, "nhood")

spatialCV_forMapping <- left_join(nhoods, spatialCV_results, 
                                           by = c("MAPNAME" = "uniqueID"))
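
Before mapping, note that MAPE estimates for neighborhoods with only a handful of test-set sales are noisy. A quick check (our suggestion, not part of the original workflow) sorts the results by the number of holdout observations.

#neighborhoods with the fewest test-set sales have the least stable error estimates
spatialCV_results %>% 
  arrange(numObs) %>% 
  head(10)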

baseMap +
  geom_sf(data = spatialCV_forMapping %>% na.omit(), 
          aes(fill=factor(ntile(MAPE, 5))), colour = "white", alpha = 0.7, lwd = 0.3) +
  labs(title= "Accuracy across Philadelphia neighborhoods",
       subtitle = "Leave one group out cross validation (test set)") +
  scale_fill_manual(values= palette_5_colors,
                    labels=as.character(percent_format()(quantile(spatialCV_forMapping$MAPE,
                                                        c(.1,.2,.4,.6,.8), na.rm=T))),
                    name="Mean\nabsolute\npercent\nerror") +
  mapTheme()

Lastly, we compare mean absolute percent error across our three typologies.

predValues.rf <- predValues.rf %>% 
  st_sf() %>% 
  st_transform(2272)

minorityError <- st_join(predValues.rf, CTvar, join = st_within, largest = TRUE) %>%
  na.omit() %>%
  group_by(race) %>% 
  summarize(mean_MAPE = mean(percentAbsError)) %>% 
  as.data.frame() %>% 
  dplyr::select(-geometry) %>% 
  mutate(Typology = c("White v. Non-White")) %>% 
  rename(typo = race)

qualifiedError <- st_join(predValues.rf, qualifiedCTs, join = st_within, largest = TRUE) %>% 
  na.omit() %>% 
  group_by(Qualified) %>% 
  summarize(mean_MAPE = mean(percentAbsError)) %>% 
  as.data.frame() %>% 
  dplyr::select(-geometry) %>% 
  mutate(Typology = c("Non QCTs v. QCTs")) %>% 
  rename(typo = Qualified)

gentrifyingError <- st_join(predValues.rf, gentrifyingCTs, join = st_within, largest = TRUE) %>% 
  na.omit() %>% 
  group_by(gentrify) %>% 
  summarize(mean_MAPE = mean(percentAbsError)) %>% 
  as.data.frame() %>% 
  dplyr::select(-geometry) %>% 
  mutate(Typology = c("Gentrifying v. Non-gentrifying")) %>% 
  rename(typo = gentrify)

typoError <- rbind(minorityError, qualifiedError, gentrifyingError) %>% 
  mutate(hj = c(-0.55, -1, -0.1, -2.5, -0.25, -0.25),
         vj = c(-1.5, 2.5, -1.5, 2.5, -1.5, 2.5),
         type = c("White", "Non-white", "Non QCTs", "QCTs","Gentrifying","Non-gentrifying"))

ggplot(typoError, aes(factor(Typology), mean_MAPE, fill = as.factor(typo))) + 
    geom_bar(stat="identity", position = "dodge", width = 0.5) +
    scale_fill_manual(values =  palette_2_colors) +
    labs(x = "Typology",
         y = "Mean MAPE",
         title = "Average error across typologies",
         subtitle = "Test set points") +
    scale_y_continuous(label = percent, limits = c(0,1)) +
    geom_text(aes(label = type), size = 3.5, vjust = typoError$vj, y = 0, hjust = typoError$hj, 
              fontface = "bold", 
              color = "white", angle = 90) +
    plotTheme() +
    theme(legend.position = "none")

6.4 Predicting recidivism

Data and Exploratory Analysis

Data collection

The data for this case study was retrieved from ProPublica’s GitHub page. Unfortunately, there is no data dictionary associated with the data, so we follow the steps taken in ProPublica’s IPython Notebook to clean the data.

raw_data <- read.csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv")

df <- 
  raw_data %>%
  filter(days_b_screening_arrest <= 30) %>%
  filter(days_b_screening_arrest >= -30) %>%
  filter(is_recid != -1) %>%
  filter(c_charge_degree != "O") %>%
  filter(priors_count != "36") %>%
  filter(priors_count != "25") %>%
  mutate(length_of_stay = as.numeric(as.Date(c_jail_out) - as.Date(c_jail_in)),
         priors_count = as.factor(priors_count),
         Recidivated = as.factor(ifelse(two_year_recid == 1,"Recidivate","notRecidivate"))
         ) %>%
  dplyr::select(sex,age,age_cat,race,priors_count,two_year_recid,r_charge_desc,
         c_charge_desc,c_charge_degree,r_charge_degree,juv_other_count,
         length_of_stay,priors_count,Recidivated) 
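
A quick sanity check (our addition, not part of ProPublica's cleaning steps) confirms the size of the cleaned sample and the balance of the outcome variable.

#number of observations and outcome balance after filtering
nrow(df)
table(df$Recidivated)
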
Exploratory Analysis

We begin examining the data by looking at the nine most frequent initial charges in the dataset.

df %>%
  group_by(c_charge_desc) %>%
  summarize(count = n()) %>%
  mutate(rate = count/sum(count)) %>%
  arrange(-rate) %>%
  head(9) %>%
  ggplot(aes(reorder(c_charge_desc, rate, FUN = max), 
             rate, fill = c_charge_desc)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values=palette_9_colors) +
  theme(legend.position = "none") +
  labs(x = "Charge", y = "Rate",
       title="Most frequent initial charges") +
  plotTheme()

We then look at the difference between the crimes initially committed and the crimes committed upon recidivating, creating an alluvial diagram to present this information with the ggalluvial package.

df %>%
   filter(r_charge_degree != "") %>%
   mutate(Recidivism_crime = substr(r_charge_degree,2,2),
          Initial_crime = c_charge_degree) %>%
   filter(Recidivism_crime == "M" | Recidivism_crime == "F") %>%
   mutate(Recidivism_crime = ifelse(Recidivism_crime == "M", "Misdemeanor", "Felony"),
          Initial_crime = ifelse(Initial_crime == "M", "Misdemeanor", "Felony")) %>%
   group_by(Recidivism_crime,Initial_crime) %>%
   summarize(n=n()) %>%
   mutate(crime_combo = paste(Recidivism_crime,Initial_crime,sep=" ")) %>%
     ggplot(.,aes(weight = n, axis1 = Initial_crime, axis2 = Recidivism_crime)) +
       geom_alluvium(width = 1/12) +
       geom_stratum(width = 1/12, 
                    fill = c("#FF2AD4", "#00CCFF", "#FF2AD4", "#00CCFF"), 
                    color = NA) +
       geom_label(stat = "stratum", label.strata = TRUE) +
       scale_x_continuous(breaks = 1:2, 
                          labels = c("Initial Crime", "Recidivism Crime")) +
       plotTheme() +
       theme(axis.text.x = element_text(size=15),
             axis.text.y=element_blank(),
             axis.ticks.y=element_blank()) +
       theme(plot.margin=unit(c(1,1,1.5,1.2),"cm")) +
       labs(title="Repeat offender's next crime more likely to be lower degree")

Lastly, we explore the rate of recidivism by race.

df %>%
  group_by(Recidivated,race) %>%
  summarize(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  arrange(race) %>%
  ggplot(., aes(reorder(race,-freq),freq, fill=factor(Recidivated))) +
  geom_bar(stat="identity", position = "dodge") +
  labs(title="Recidivism rate by race",
       y = "Rate",
       x = "Race") +
  scale_fill_manual(values=palette_2_colors,
                    name = "Recidivism") +
  plotTheme()

Modeling

We prepare the data for modeling by splitting the dataset into 75/25 training/test sets.

set.seed(336)
trainIndex <- createDataPartition(df$priors_count, p = .75,
                                    list = FALSE,
                                    times = 1)
train <- df[ trainIndex,]
test  <- df[-trainIndex,]

We then train the model using the train function in the caret package. Once again, because of the randomness inherent in cross-validation, your exact results will differ from ours.

set.seed(125)
reg.noRace.cv <- train(Recidivated ~ ., 
                       data = train %>%
                         dplyr::select(-race,
                                       -r_charge_degree,
                                       -r_charge_desc,
                                       -c_charge_degree,
                                       -c_charge_desc,
                                       -two_year_recid),
                       family = binomial(link = "logit"), 
                       method = "glm", 
                       metric = "ROC",
                       trControl = trainControl(method = "cv", number = 100, 
                                                classProbs = TRUE, 
                                                summaryFunction = twoClassSummary))
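
Before evaluating on the test set, it can be useful to glance at the cross-validated performance that caret stores on the fitted object; this inspection is our suggestion rather than part of the original workflow.

#cross-validated ROC (AUC), sensitivity, and specificity averaged across the 100 folds
reg.noRace.cv$results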

We predict on the test set and use a 50% threshold to create the variable predClass, which indicates whether an ex-offender is predicted to recidivate.

testProbs <- 
  data.frame(class = test$two_year_recid,
               probs = predict(reg.noRace.cv, test, type="prob")$Recidivate,
               predClass = ifelse(predict(reg.noRace.cv, test, type="prob")$Recidivate > .5 ,1,0),
               regression = "No Race",
               Race = test$race)
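
As a quick check of the 50% threshold (a suggested addition, not in the original text), we can tabulate predicted against observed classes before examining the standardized coefficients.

#confusion counts on the test set at the 50% threshold
table(Observed = testProbs$class, Predicted = testProbs$predClass)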

Lastly, we standardize the coefficients of the final model using lm.beta from the QuantPsyc package to plot feature importance. Here, we filter out the variable priors_count because its inclusion skewed the plot.

recidVarImp <- as.data.frame(lm.beta(reg.noRace.cv$finalModel)) %>% 
  mutate(Variable = row.names(.)) %>% 
  rename(Importance = "lm.beta(reg.noRace.cv$finalModel)") %>% 
  filter(!str_detect(Variable,'priors_count')) %>% 
  mutate(Importance = abs(Importance), 
         color = palette_6_colors)

ggplot(recidVarImp, aes(x = reorder(Variable, Importance), y = Importance, fill = Variable)) +
  geom_bar(stat = "identity", aes(fill = color)) +
  coord_flip() +
  labs(x = " ",
       title = "Variable importance") +
  scale_fill_identity() +
  plotTheme() + 
  theme(legend.position = "none")

Accuracy and Generalizability

Accuracy

We begin by assessing the model’s accuracy. We use the plotROC package to plot the ROC curve, which illustrates the tradeoff between the true positive rate and the false positive rate.

ggplot(testProbs %>% filter(regression == "No Race"), 
         aes(d = as.numeric(class), m = probs)) + 
  geom_roc(n.cuts = 50, labels = FALSE) + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "Goodness of fit: ROC Curve",
       subtitle=paste("AUC =", 
                    testProbs %>% 
                     group_by(regression) %>% 
                     summarize(AUC = auc(class,probs)) %>% 
                     filter(regression == "No Race") %>% 
                     pull(2) %>% as.character() %>% substr(.,1,4))) +
  plotTheme()

Then, we pull out the results of the 100-fold cross-validation performed by the train function above. We calculate the mean and standard deviation of AUC, the True Positive Rate, and the True Negative Rate, plotting the mean values as a barplot with standard deviations as error bars.

cv.results <- 
  data.frame(AUC = reg.noRace.cv$resample[,1],
             Sensitivity = reg.noRace.cv$resample[,2],
             Specificity = reg.noRace.cv$resample[,3],
             Regression ="No Race")

cv.means <- cv.results %>%
  group_by(Regression) %>%
  summarize(mean.AUC = mean(AUC),
            mean.TruePositive.Rate = mean(Sensitivity),
            mean.TrueNegative.Rate = mean(Specificity)) %>%
  gather(Statistic, value, c(mean.AUC,mean.TruePositive.Rate, mean.TrueNegative.Rate)) %>%
  dplyr::select(Statistic, value)

cv.sd <- cv.results %>% 
  group_by(Regression) %>% 
  summarize(mean.AUC = sd(AUC),
            mean.TruePositive.Rate = sd(Sensitivity),
            mean.TrueNegative.Rate = sd(Specificity)) %>% 
  gather(Statistic, sd, c(mean.AUC,mean.TruePositive.Rate, mean.TrueNegative.Rate)) %>% 
  left_join(., cv.means, by = "Statistic")

ggplot(cv.sd,aes(Statistic,value)) +
  geom_bar(aes(fill=Statistic), position="dodge", stat="identity", width = .5) +
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.2,
                 position="dodge") +
  scale_fill_manual(values = palette_3_colors) +
  plotTheme() +
  labs(title="Goodness of fit metrics from 100-fold cross validation",
       x="Statistic",
       y="Rate",
       subtitle = "Error bars represent standard deviation") +
  theme(legend.position = "none")
Generalizability

Next, we look at model generalizability. We start by plotting ROC curves by race and creating a table of AUC by race. Due to plotting limitations, we make the table a tableGrob object using the grid and gridExtra packages. We change the color of each row to coordinate with our color scheme.

auc <- testProbs %>% 
  group_by(Race) %>% 
  summarize(AUC = auc(class,probs)) %>% 
  mutate(AUC = as.character(round(AUC, 3))) %>% 
  filter(Race == "African-American" | Race == "Caucasian" | Race == "Hispanic") %>%
  arrange(Race) 

aucTableData <- rbind(data.frame(Race = "Race", AUC = "AUC"), auc)

cols <- rep("white", nrow(aucTableData))
rowsToChange <- 2
cols[rowsToChange] <- "#FF2AD4"
rowsToChange <- 3
cols[rowsToChange] <- "#7F7BE9"
rowsToChange <- 4
cols[rowsToChange] <- "#00CCFF"

fonts <- rep("black", nrow(aucTableData))
rowsToChange <- c(2,3,4)
fonts[rowsToChange] <- "white"

hj <- matrix(c(0, 0.5), ncol=2, nrow=nrow(aucTableData), byrow=TRUE)
x <- matrix(c(0.02, 0.5), ncol=2, nrow=nrow(aucTableData), byrow=TRUE)

#theme for table
t1 <- ttheme_default(core=list(bg_params = list(fill=cols),
                               fg_params = list(hjust=as.vector(hj), x = as.vector(x), col = fonts)
                               ))

#make table
aucTable <-tableGrob(aucTableData, theme = t1, rows = NULL, cols = NULL)

rocPlots <- ggplot(testProbs %>% filter(Race == "Caucasian" | Race == "African-American" | Race == "Hispanic"), 
       aes(d = as.numeric(class), m = probs, group=Race, colour=Race)) + 
  geom_roc(n.cuts = 50, labels = FALSE) +
  scale_color_manual(values = palette_3_colors) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title="ROC Curves by race") +
  plotTheme()

ggdraw() +
  draw_plot(rocPlots) +
  draw_plot(aucTable, x = 0.5, y = -.1, width = .2, height = .75)

Then we create a Detection Error Tradeoff (DET) curve by race to explore the tradeoff between False Negative Rates and False Positive Rates.

iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
    this_prediction <- data.frame()
    
    this_prediction <-
      data %>%
      mutate(predclass = ifelse(probs > x, 1,0)) %>%
      count(predclass,class) %>%
      summarize(True_Negative = sum(n[predclass==0 & class==0]),
                True_Positive = sum(n[predclass==1 & class==1]),
                False_Negative = sum(n[predclass==0 & class==1]),
                False_Positive = sum(n[predclass==1 & class==0]),
                FPR = False_Positive / sum(n[class == 0]),
                FNR = False_Negative / sum(n[class == 1])) %>%
      mutate(Threshold = x)
    
    all_prediction <- rbind(all_prediction,this_prediction)
    x <- x + .01
  }
  return(all_prediction)
}

detcurve_byrace <- iterateThresholds(testProbs %>% group_by(Race) %>% dplyr::select(class, probs))

detcurve_byrace %>%
  group_by(Race) %>% 
  filter(Race == "African-American" | Race == "Caucasian") %>% 
  ggplot(.,aes(FPR, FNR, color = Race)) +
  geom_line(size = 1) +
  geom_abline(slope = -1, intercept = 1, size = 1.5, color = 'grey') +
  scale_color_manual(values = c("#FF2AD4", "#00CCFF")) +
  labs(title = "Detection Error Tradeoff (DET) curve",
       x = "False Positive Rate",
       y = "False Negative Rate",
       subtitle = "By race") +
  plotTheme()

We also compare the rate of observed recidivism to the rate of predicted recidivism by race.

testProbs %>%
  filter(Race == "Caucasian" | Race == "African-American" | Race == "Hispanic") %>%
  group_by(Race) %>%
  summarize(Observed.recidivism = sum(class) / n(),
            Predicted.recidivism = sum(predClass) / n()) %>%
  ungroup() %>%
  gather(key,value,Observed.recidivism:Predicted.recidivism) %>%
  mutate(key = case_when(key == "Observed.recidivism" ~ "Observed Recidivism",
                         key == "Predicted.recidivism" ~ "Predicted Recidivism")) %>% 
  ggplot(.,aes(Race,value)) +
    geom_bar(aes(fill=Race),position="dodge",stat="identity") +
    scale_fill_manual(values = palette_3_colors) +
    facet_wrap(~key) +
    plotTheme() +
   labs(title="Observed and predicted recidivism",
       x="Race",y="Rate") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.background = element_rect(fill = "white"),
        strip.text = element_text(hjust = 0, size = 11))

Lastly, we compute confusion matrix rates by race at the 50% threshold: the shares of all predictions that are true positives, true negatives, false positives, and false negatives, along with overall Accuracy.

testProbs %>%
  filter(Race == "Caucasian" | Race == "African-American" | Race == "Hispanic") %>%
  filter(regression == "No Race") %>%
  group_by(Race) %>%
  count(predClass,class) %>%
  summarize(sum_trueNeg = sum(n[predClass==0 & class==0]),
            sum_truePos = sum(n[predClass==1 & class==1]),
            sum_falseNeg = sum(n[predClass==0 & class==1]),
            sum_falsePos = sum(n[predClass==1 & class==0]),
            total=sum(n)) %>%
  mutate(True_Positive = sum_truePos / total,
         True_Negative = sum_trueNeg / total,
         False_Negative = sum_falseNeg / total,
         False_Positive = sum_falsePos / total,
         Accuracy = (sum_truePos + sum_trueNeg) / total) %>%
  dplyr::select(Race, True_Positive, True_Negative, False_Negative, False_Positive,Accuracy) %>%
  gather(key,value,True_Positive:Accuracy) %>%
    ggplot(.,aes(key,value,fill=Race)) +
    geom_bar(aes(fill=Race),position="dodge",stat="identity") +
    scale_fill_manual(values = palette_3_colors) +
    labs(title="Confusion matrix rates by race",
         subtitle = "50% threshold",
         x="Outcome",y="Rate") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    plotTheme()

  1. We reached out to ProPublica via email to request a data dictionary; however, they responded that they did not have one.