Abstract:
As the number of government algorithms grows, so does the need to evaluate algorithmic fairness. This paper has three goals. First, we ground the notion of algorithmic fairness in the context of disparate impact, arguing that for an algorithm to be fair, its predictions must generalize across different protected groups. Next, two algorithmic use cases are presented with code examples for how to evaluate fairness. Finally, we promote the concept of an open source repository of government algorithmic “scorecards,” allowing stakeholders to compare across algorithms and use cases.
Our sincerest appreciation to Dr. Dyann Daley and Predict Align Prevent for their generous support of this work. We are also extremely grateful to Matt Harris for ensuring our code is in fact, replicable. Finally, we are immensely appreciative of the time and expertise of several reviewers including Drs. Dennis Culhane, Dyann Daley, John Landis, and Tony Smith.
Algorithms are increasingly making decisions in place of humans. Algorithms affect the products we buy like insurance, credit cards, and bank loans, and the information we are exposed to like shopping recommendations and news articles. Between bestsellers like Cathy O’Neil’s “Weapons of Math Destruction” and relentless news coverage of tech company data mining, society is beginning to understand that algorithms can bring as much peril as they do promise.
Government is still wrestling with how to regulate private sector algorithms - which are closed, their inner workings cast as intellectual property. In the public sector, where transparency is expected, the calculus is different. Governments today use algorithms to dole out subsidies, determine program eligibility, and prioritize the allocation of limited taxpayer-funded resources. While fairness should be at the heart of government algorithms, it is still unclear how best to referee public algorithms. The fact is that if human decisions are biased, implicit or otherwise, then the algorithms we train from those decisions will also be biased.
Recently, new open source tools have emerged to help governments evaluate algorithms for fairness. Examples include the Ethics & Algorithms Toolkit created by a consortium of authors in government and academia, as well as the University of Chicago’s Aequitas “open source bias toolkit.” The Ethics & Algorithms Toolkit is excellent for developing governance and policy around algorithms, while Aequitas is aimed at more experienced data scientists. Both are invaluable tools, and stakeholders would be well advised to integrate them into their technology workflow.
The main goal of this paper is to bridge the gap between these two stakeholder groups by providing code examples that introduce the novice public-sector data scientist to algorithmic fairness. Our second goal is to present an open source standard by which governments can compare their algorithms to those of their peers. Our motivation is informed by the following observations:
All jurisdictions at the local, state, and Federal levels collect comparable administrative data to determine program eligibility as well as to develop budgets and strategic plans.
All jurisdictions share a comparable set of algorithmic use cases, each of which can be realized by leveraging these administrative datasets.
Thus, in the future, all jurisdictions will be developing comparable algorithms that predict comparable outcomes of interest towards the fulfillment of these shared use cases.
In response to public pressure, each jurisdiction will need to evaluate the fairness of its algorithms. One solution is a standard, open source “scorecard,” which we call the Open Algorithmic Scorecard (OAS). Each jurisdiction would have a separate scorecard for every algorithm it deploys, featuring a set of simple metrics describing accuracy and bias. These scorecards would live in an open repository that would allow stakeholders to compare prototype model results to models created elsewhere; promote transparency by filing finished scorecards; and provide an arena where policy-makers, data scientists, academics, civic technologists, and other stakeholders could observe best practices.
This report is a proof-of-concept, demonstrating two use cases. The first use case is a place-based machine learning algorithm to predict home prices, which many jurisdictions now use for tax assessment purposes. The second use case is a person-level machine learning model for predicting prisoner recidivism. Along the way, two sets of analytics are presented (one for place, one for people) that describe model accuracy and model “generalizability.” English narrative, replicable code, and data visualizations are provided for each. We hope novice and aspiring government data scientists will sharpen their skills by replicating the code found in the appendix below.
The term “algorithm” can pertain to a wide array of decision-making tools. This report focuses exclusively on supervised machine learning algorithms, which we define as a class of machine learning models that learn from a set of observed experiences to predict an experience that has yet to be observed.
Section 2 provides some background on the accuracy/generalizability metrics used to assess fairness in this report. Section 3 and Section 4 present the real estate tax assessment and recidivism models, respectively, as well as the scorecards for each. Section 5 concludes.
Governments use machine learning algorithms to allocate limited taxpayer-funded resources, and a biased model may mean that these resources are misallocated. Resources may be wasted on a population that does not need them or allocated in a way that ultimately proves harmful. A biased algorithm may leave policy makers wondering whether a data-driven approach is any more useful than existing institutional knowledge.
For some use cases, bias may have real social costs. A biased tax assessment model may systematically under- or over-assess the value of certain homes. In the former case, city tax coffers lose out on revenue, while gentrifiers free-ride on new amenities and services. In the latter case, excessive tax burden in poor communities may lead to greater housing instability and inequality.
Biased algorithms may have more dire consequences for people-oriented use cases, like recidivism. As we discuss below, one example of bias is higher false positive rates for African American ex-offenders compared to Whites. A false positive in this context means that the model predicted an ex-offender would recidivate when, in fact, they did not. When false positives are disproportionately predicted for a protected class, decisions made from that algorithm may come with significant social costs.
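To make this concrete, a false positive rate can be computed per group directly from observed labels and model predictions. Below is a minimal Python sketch using entirely hypothetical labels and groups; `false_positive_rate` and the group names are our own illustrative choices, not part of any published recidivism model.

```python
import numpy as np

def false_positive_rate(y_true, y_pred):
    """Share of actual negatives (non-recidivists) predicted positive."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    negatives = (y_true == 0)
    return (y_pred[negatives] == 1).mean()

# Hypothetical labels (1 = recidivated) and predictions for two groups
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 1, 1, 0])
group  = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

# A large gap between groups is one warning sign of disparate impact
fpr_by_group = {g: false_positive_rate(y_true[group == g], y_pred[group == g])
                for g in np.unique(group)}
```

A fairness audit would compare these rates across protected classes rather than arbitrary labels.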
Social scientists are well-versed in the issues of fairness and discrimination. Identifying discrimination in an algorithm, however, is just as nuanced and complicated as it is in housing and labor markets. It is unlikely that a jurisdiction would create an algorithm with the express intent of discriminating against a protected class. Given the black box nature of these models, it is far more likely that a jurisdiction would create an algorithm whose decisions have a “disparate impact” on members of a protected class. Disparate impact is a legal theory positing that although a policy or program may not be discriminatory prima facie, it may still have an adverse discriminatory effect, even if unintended.
Disparate impact may play a role when machine learning algorithms are poorly fit. There are two general conditions that lead to poor predictions. An “underfit” model does not exhibit a high degree of predictive accuracy, likely because not enough effective predictive variables or “features” are included. An “overfit” model, traditionally, is one that may predict well on training data but fails when used to predict for new observations. Models may also be overfit to one type of experience, predicting differently from one group to the next.
If an algorithm does not generalize to one protected class, its use for resource allocation may have a disparate impact on that group. For example, the recidivism algorithm created below predicts a higher rate of false positives for African Americans relative to Caucasians. This may occur because the algorithm is underfit to the African American “experience.” It may also be that the training data itself is biased, a common critique of prediction in the criminal justice domain. Critics have argued that systematic over-policing of historically disenfranchised communities creates a feedback loop where more reported crime leads to more predicted crime, which leads to more cops on patrol and thus, more reported crimes. It could be that police bias leads to the over-policing of certain communities. It could also be that people with higher propensity to commit crimes sort into these communities. In reality, both likely play a role, but if the relationship goes unobserved, then like any statistical model, systematic error will lead to bias.
It is impossible to identify the effect of unobserved variables. As an alternative, researchers are actively developing a series of fairness metrics. If bias cannot be judged by the input features, perhaps it can be judged by opening the black box and looking for bias in the predictions. We find this review of fairness metrics to be particularly relevant for policy-makers. In the case studies below, the fairness criteria we present hinge on an algorithm's ability to generalize across different group typologies - like rich and poor neighborhoods or Caucasian and African American ex-offenders.
It is difficult to participate in the real estate market and not interact with machine learning models. Airbnb’s algorithms recommend rental prices to its hosts. Trulia’s computer vision algorithms convert house photos to home features. Perhaps the most ubiquitous real estate algorithm is Zillow’s Zestimate, which predicts the current market value of a property.
The Zestimate algorithm is very similar to the methodology counties use to assess home values and calculate property tax liability. These methods are rooted in the “hedonic model” - an econometric approach for deconstructing the market price of a good to the value of each constituent part. The hedonic model can estimate the “capitalization effect” or price premium associated with an extra bedroom or the presence of a garage. It can also be used, as is the case with tax assessment, for prediction. Typically, these algorithms are trained on recent transactions, then used to predict value for all houses citywide. The hedonic model relies on several different feature types, each explained below.
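As a sketch of the idea, a hedonic model can be fit as an ordinary least squares regression of price on a home's constituent parts. The sales, feature names, and prices below are entirely hypothetical, and we use a plain numpy least-squares solve rather than any particular assessment office's method.

```python
import numpy as np

# Hypothetical sales: [lot_size_sqft, bedrooms, has_garage] and observed prices
X = np.array([
    [1200, 2, 0],
    [1500, 3, 0],
    [2000, 3, 1],
    [2400, 4, 1],
    [1800, 2, 1],
    [1000, 1, 0],
], dtype=float)
y = np.array([150_000, 185_000, 240_000, 290_000, 215_000, 120_000], dtype=float)

# Ordinary least squares: price = intercept + b1*lot + b2*beds + b3*garage
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Each slope approximates a "capitalization effect," e.g. the price premium
# of one extra bedroom, holding the other features constant
premiums = dict(zip(["intercept", "lot_size", "bedrooms", "garage"], coef))
predicted_price = A @ coef  # in-sample predictions for all six sales
```

For tax assessment, the same fitted coefficients would be applied to every parcel citywide, not just recent sales.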
The first is parcel/internal characteristics like the size of the lot and the number of bedrooms. Next is neighborhood characteristics including exposure to crime or access to transit. A third is the “spatial component” which hypothesizes that house price is a function, in part, of neighboring prices. These features take their motivation from real estate appraisers who compare similar homes in close proximity (i.e. “comparables”). Properly controlling for comparables requires features that capture the unique spatial scale of prices in a neighborhood. For over two decades, Urban Spatial has developed a host of specialized approaches to account for this spatial component. Interested readers can refer to our work here, although for simplicity, we omit these more complicated features from our model below.
These algorithms are not without inherent biases. There are some key assumptions, including 1) that all buyers and sellers have access to the same market information; 2) that neighborhood crime can be measured with the same accuracy as, say, the number of bedrooms in a house; and 3) that buyers exhibit homogeneous preferences for amenities, like schools. A final source of bias is the assumption that neighborhoods are in “equilibrium.” This is almost never the case, particularly in gentrifying communities. In these communities, buyers and sellers capitalize future expectations into prices (i.e., “what will this house be worth if a new subway station is opened nearby?”). Simply put, buyers and sellers will disagree on the future value of gentrified housing in a neighborhood, which may make it difficult to predict variation in prices at the neighborhood level.
Accuracy is simply the difference between the observed price of the home and the predicted price. This difference is often referred to as “error.” Generalizability is a bit more complex. The general approach for assessing bias in these algorithms is to investigate how errors cluster in space. The steps are: 1) train the model; 2) use the trained model to predict for out-of-sample sales; then 3) calculate and map errors. For an assessment algorithm to generalize well, it must exhibit comparable error rates across different neighborhoods and neighborhood contexts. If the model predicts better for rich versus poor neighborhoods or White versus African American neighborhoods then the model may be biased. The spatial arrangement of errors is explored further below.
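The three steps can be sketched end to end with simulated data: train a model, predict for held-out sales, and summarize errors by neighborhood (a real analysis would then map them). Everything below (the data, the single-predictor model, the neighborhood labels) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sales: one noisy price predictor plus a neighborhood label
n = 200
sqft = rng.uniform(800, 2500, n)
price = 120 * sqft + rng.normal(0, 20_000, n)
nhood = rng.choice(["A", "B", "C"], n)

# 1) train the model on a random 60% split (simple univariate OLS here)
train = rng.random(n) < 0.6
slope, intercept = np.polyfit(sqft[train], price[train], 1)

# 2) use the trained model to predict for out-of-sample sales
test = ~train
pred = slope * sqft[test] + intercept

# 3) calculate errors; mapping these by neighborhood reveals spatial bias
errors = np.abs(price[test] - pred)
error_by_nhood = {g: errors[nhood[test] == g].mean() for g in ["A", "B", "C"]}
```

Comparable mean errors across neighborhoods would suggest the model generalizes; large gaps would not.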
Our data come from the Philadelphia Office of Property Assessment (OPA). Those interested in replicating the analysis can download the data here or access assessment data directly on OpenDataPhilly. The dataset is comprised of market transactions of single family home sales from July 2017 to July 2018. Sales less than $3,000 and greater than $1,000,000 are removed, as well as observations with missing data. The final dataset includes 21,964 transactions with a mean and standard deviation sale price of $185,950 and $162,873, respectively. The table below provides a description of the variables developed for the model.
| Variable | Description |
|---|---|
| nhood | What neighborhood the home is in |
| Season | The season the home was sold in |
| year_built | The year the home was built. Split into six categories: prior to 1970, 1970 to 1980, 1980 to 1990, 1990 to 2000, 2000 to 2010, and 2010 or later |
| own_occ | A dummy variable indicating if the home is owner occupied (1) or not (0) |
| homestead | A dummy variable indicating whether the property is participating in the homestead exemption (1) or not (0) |
| AC | A dummy variable indicating whether the property has AC (1) or not (0) |
| numBed | The number of beds in the home |
| numBath | The number of bathrooms in the home |
| numStories | The number of stories of the property |
| garage_spaces | The number of garage spaces the property has |
| fireplaces | The number of fireplaces the property has |
| total_livable_area | The livable area of the home |
| univDist | The nearest neighbor distance between each property and its closest university |
| parkDist | The nearest neighbor distance between each property and its closest park (centroid) |
| ccDist | The nearest neighbor distance between each property and Center City (City Hall) |
| septaDist | The nearest neighbor distance between each property and its closest MFL/BSL stop |
| permitDist | The average nearest neighbor distance between the property and its five closest new construction permits within the last two years |
| assaultsDist | The average nearest neighbor distance between the property and the location of the five closest aggravated assaults |
| cleanDist | The average nearest neighbor distance between the property and its two closest dry cleaners |
| graffitiDist | The average nearest neighbor distance between the property and the location of the five closest 311 calls for graffiti |
| abandVDist | The average nearest neighbor distance between the property and the location of the five closest 311 calls for abandoned vehicles |
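The distance features in the table above all share one recipe: for each property, find the k nearest instances of an amenity (or disamenity) and average the distances. A minimal sketch, assuming projected (planar) coordinates so Euclidean distance is meaningful; the toy coordinates are invented.

```python
import numpy as np

def avg_knn_distance(homes, amenities, k):
    """Mean distance from each home to its k nearest amenity locations."""
    homes = np.asarray(homes, dtype=float)
    amenities = np.asarray(amenities, dtype=float)
    # Pairwise Euclidean distances, shape (n_homes, n_amenities)
    d = np.linalg.norm(homes[:, None, :] - amenities[None, :, :], axis=2)
    # Average the k smallest distances in each row
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

# Toy coordinates: two homes, five aggravated assault locations
homes = [[0, 0], [10, 10]]
assaults = [[0, 1], [0, 2], [0, 3], [5, 5], [9, 9]]
assaultsDist = avg_knn_distance(homes, assaults, k=3)  # one value per home
```

The same function, pointed at parks, permits, or 311 calls with the appropriate k, would reproduce each row of the table.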
Philadelphia is missing a surprising amount of data on the internal characteristics of homes. 17% of transactions in our data list zero bedrooms. Perhaps OPA imputes missing values, but for this example, we use a fixed effect to denote when number of bedrooms equals zero. All told, we employ eight house/parcel specific features in the model below.
Figure 3.1 visualizes the mean and standard deviation (as a percentage of the mean) of single-family home prices by neighborhood in Philadelphia. Not surprisingly, high and low price neighborhoods cluster. Perhaps a bit more surprising, is that low priced neighborhoods also exhibit relatively high price variance. This may relate to the gentrification disequilibrium described above.
Not only do prices vary by neighborhood, but they vary by neighborhood type as well. We split neighborhoods into “high” and “low” designations across three different typologies and visualize differences in Figure 3.2. The first is Qualified Census Tracts (QCT), a poverty designation HUD uses to allocate housing tax credits. QCT designations provide a deliberate and policy-relevant threshold for judging generalizability. Neighborhoods that qualify for tax credits exhibit mean single-family home prices that are nearly half those of neighborhoods that do not. Next, we ask whether the algorithm generalizes to gentrifying neighborhoods. Tracts are designated as “gentrifying” and “non-gentrifying” using metrics from the Federal Reserve Bank of Philadelphia. Mean prices differ across gentrifying and non-gentrifying neighborhoods. A third typology is race-related. To determine whether the model generalizes with respect to race, the city is grouped into “majority White” and “majority non-White” census tracts. Mean prices are clearly higher in the former group. A well-generalized algorithm should exhibit comparable error rates across each group.
Figure 3.3 visualizes the neighborhood amenity features developed for the model. The goal is to quantify the level of amenity and disamenity “exposure” for each home sale citywide. To quantify exposure to aggravated assaults, for example, we measure the distance from each home sale to its k nearest assaults and take the mean.
Normally a host of features are employed to model the spatial structure, but in this simple example, we include just one set of features - a fixed effect for each neighborhood. Our hypothesis is that explicitly accounting for neighborhood variation helps to control for local comparables as well as any equilibrium effects.
For demonstration purposes, the model is more simplistic than it would be in reality. 10-fold cross validation is performed on a 60% training set to tune the hyperparameters of a Random Forest algorithm. All goodness of fit metrics are reported either from cross-validation or from the 40% test set. Figure 3.4 shows the feature importance associated with the final model.
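The tuning setup described here (a 60/40 train/test split with 10-fold cross-validation over a hyperparameter grid) might be sketched as follows. The features, target, and grid are hypothetical, and we assume scikit-learn's random forest; the actual model specification in the appendix may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical features and prices standing in for the real training data
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (300, 4))
y = 200_000 * X[:, 0] + 50_000 * X[:, 1] + rng.normal(0, 10_000, 300)

# 60% training set, 40% withheld for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42)

# 10-fold cross-validation on the training set to tune a hyperparameter
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid={"max_features": [1, 2, 4]},
    cv=10,
    scoring="neg_mean_absolute_error",
)
grid.fit(X_train, y_train)

# Goodness of fit is then reported from the withheld 40% test set
test_mae = np.abs(y_test - grid.predict(X_test)).mean()
```

In practice the grid would cover more hyperparameters (tree depth, number of trees), at the cost of longer tuning time.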
Accuracy and generalizability are assessed in a variety of ways. First, Figure 3.5 visualizes a scatterplot of observed prices as a function of predicted prices. The pink line represents a hypothetical perfect prediction. The plot suggests that the model fits reasonably well, with reduced accuracy for higher prices. Figure 3.6 echoes this finding by plotting Mean Absolute Percent Error by decile.
Next, the model is used to predict for the withheld 40% test set. Mean Absolute Error (MAE) is the mean absolute value difference between observed and predicted sale prices for the test set. The MAE is $44,058. For context, the average single-family home price in our sample is $186,961. The Root Mean Square Error (RMSE) is similar to the MAE, but errors are squared, averaged, and then square-rooted. The RMSE, which penalizes higher errors, is $72,582. The Mean Absolute Percent Error (MAPE) is the mean absolute value difference between observed and predicted sale prices on a percentage basis. The MAPE is 49.7%. Finally, 100-fold cross-validation without hyperparameter tuning is performed. This test provides some intuition about how the model would predict for data it has yet to see. The mean MAE across all holdouts is $43,587 and the standard deviation is $1,232, suggesting a model that would generalize well to new data.
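All three metrics are simple to compute directly from observed and predicted prices; the sketch below uses made-up values.

```python
import numpy as np

def mae(y, yhat):
    """Mean Absolute Error: average dollar error."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(yhat)))

def rmse(y, yhat):
    """Root Mean Square Error: squares errors, so large misses count more."""
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2))

def mape(y, yhat):
    """Mean Absolute Percent Error, relative to observed prices."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return np.mean(np.abs(y - yhat) / y)

# Hypothetical observed and predicted sale prices
observed  = np.array([100_000, 200_000, 300_000])
predicted = np.array([110_000, 180_000, 330_000])
```

Note that RMSE is always at least as large as MAE on the same errors, which is why it is the preferred metric when large misses are especially costly.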
Figure 3.7 maps the absolute error for test set sales both on a dollar and percentage basis. Clear differences can be observed. The arrangement of errors provides additional intuition. Ideally, the algorithm would account for enough variation in price such that the remaining variation (the error), were randomly distributed across the city. Figure 3.7 clearly illustrates that this is not the case. Different communities exhibit different levels of error. We can use the results of a Global Moran’s I test to find that the spatial configuration of errors exhibits statistically significant clustering (p-value < 0.001).
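Global Moran's I can be computed from a vector of errors and a spatial weights matrix; positive values indicate that similar errors cluster. The sketch below implements only the statistic itself, on a toy four-neighborhood strip. A real analysis would also derive the p-value, typically via permutation, and would likely use a dedicated library such as PySAL.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I: > 0 means similar values cluster in space."""
    x = np.asarray(values, dtype=float)
    w = np.array(weights, dtype=float)  # copy, so we can zero the diagonal
    np.fill_diagonal(w, 0)              # no self-neighbors
    z = x - x.mean()                    # deviations from the mean
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

# Toy example: four neighborhoods on a strip; adjacent ones are neighbors.
# Large errors sit beside large errors, so I should come out positive.
errors = [10.0, 12.0, 2.0, 1.0]
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
```

With a permutation test, one would shuffle `errors` many times and compare the observed I against the shuffled distribution to obtain the p-value.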
Correlated errors suggest that spatial bias exists in the model. This is further explored in Figure 3.8, which is generated from “Leave One Group Out Cross Validation” (LOGOCV). Instead of holding out and testing on a random subset, LOGOCV reveals how well the model generalizes to a given neighborhood by training on all but one neighborhood and validating on the holdout. Each neighborhood takes a turn acting as the holdout. The map visualizes average error rates by holdout neighborhood and reaffirms that the model works better in some parts of Philadelphia than others.
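LOGOCV is available off the shelf in scikit-learn as `LeaveOneGroupOut`, with neighborhoods as the grouping variable. The sketch below uses simulated sales, invented neighborhood names, and a plain linear model in place of the report's actual specification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut

# Simulated sales with a hypothetical neighborhood label for each one
rng = np.random.default_rng(2)
n = 120
X = rng.uniform(0, 1, (n, 2))
y = 100_000 * X[:, 0] + rng.normal(0, 5_000, n)
nhood = rng.choice(["Fishtown", "Overbrook", "Mt. Airy"], n)

# Each neighborhood takes a turn as the hold-out set
mae_by_nhood = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=nhood):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors = np.abs(y[test_idx] - model.predict(X[test_idx]))
    mae_by_nhood[nhood[test_idx][0]] = errors.mean()
```

Mapping `mae_by_nhood` by neighborhood reproduces the style of Figure 3.8.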
Finally, Figure 3.9 tests how well the algorithm generalizes to the various neighborhood typologies. Interestingly, error rates appear relatively comparable across gentrifying (42.8%) and non-gentrifying tracts (51.3%). However, error rates are far higher in high-poverty neighborhoods than in low-poverty neighborhoods (70.6% versus 25.6%), and in majority non-White neighborhoods relative to majority White neighborhoods (69.9% versus 26%).
The model exhibits significant differences in error rates across neighborhood contexts. Deploying an algorithm like this would be problematic. Sales with the lowest (1st quintile) error rates have an average observed price of $230,696 and an average error rate of 3.6%. Sales with the highest (5th quintile) error rates have an average observed price of $87,644 and an average error rate of 174%. A model biased this way places a disproportionately higher property tax burden on lower-valued homes. In such an instance, an argument could be made that the algorithm has a disparate impact on the low-income families who likely live in these communities. For them, the algorithm may lead to more economic hardship and exacerbate housing instability.
Improvements to the model could be made by adding new features to equitably reduce error across space. If new features do not help, there are two potential remedies. The first is to come up with a set of rules to make general corrections in instances where, for example, the predicted price deviates from the neighborhood mean by more than 20%. These corrections are common in predictive algorithms, but as the number of rules increases, the need for a supervised algorithm decreases.
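A correction rule of that kind might be sketched as a simple clipping function. The 20% tolerance, function name, and toy prices below are all illustrative.

```python
import numpy as np

def cap_to_neighborhood(pred, nhood_mean, tolerance=0.20):
    """Clip predictions that stray more than `tolerance` (here 20%)
    from the neighborhood mean price back to that bound."""
    pred = np.asarray(pred, dtype=float)
    nhood_mean = np.asarray(nhood_mean, dtype=float)
    lo = nhood_mean * (1 - tolerance)
    hi = nhood_mean * (1 + tolerance)
    return np.clip(pred, lo, hi)

# Hypothetical predictions and the mean price of each sale's neighborhood
preds = [100_000, 260_000, 180_000]
means = [150_000, 200_000, 190_000]
corrected = cap_to_neighborhood(preds, means)
```

The trade-off noted above applies: each additional rule like this moves the system further from a learned model and closer to a hand-coded one.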
The other approach is to enact policies that mitigate the negative effect of property tax reassessments after they occur. In 2014, the City of Philadelphia enacted the Longtime Owner Occupants Program (LOOP) which freezes assessments that increase 200% or more (triple the base amount) from year to year for homeowners living in the residence for ten years or more. Recently, Philadelphia City Council President Darrell Clarke, who has been highly critical of the tax assessment system, proposed new legislation that would lower the LOOP threshold from 200% to 50%. From the Philadelphia Inquirer (emphasis added):
Clarke’s office analyzed the most recent assessments and found that about 75 percent of households that had assessment increases (from 2017 & 2018) between 50 percent and 200 percent are in census tracts with low to moderate income, meaning their income levels would likely qualify.
Philadelphia and cities like it are using property tax freezes as a way to offset gentrification-induced displacement. While it is likely that gentrification causes increased property taxes, it may also be true that poorly calibrated tax assessment models are partially to blame. Figure 3.10 displays a mock scorecard for the Philadelphia tax assessment algorithm. Note the fairness score, which is entirely fabricated; it is included simply to show that once many algorithms can be compared in a repository setting, it is possible to rank them accordingly.