Introduction

Many jurisdictions outsource the provision of critical social services to local nonprofits, which in turn provide client-level data back to government agencies for planning and evaluation. These data are some of the most important measures cities have for tracking poverty, health, and economic well-being.

Yet, anecdotally, many of these data go unused, either because they are stored in opaque legacy systems or because agencies lack the data science resources needed to convert administrative data into actionable insight.

Recently, the Human Services Department of Seattle, Washington did something unprecedented: it released seven years of client-level service provision data as open data. Initially prepared for a city hackathon, these data are, to our knowledge, the first of their kind released as open data by any government in the U.S. These are exactly the kind of data that grab our attention at Urban Spatial, and we see an opportunity to develop proof-of-concept ideas that help other jurisdictions think about unlocking new value in their own client-level data.

These data come from Seattle’s Promoting Healthy Aging initiative, but the approach outlined below should generalize to other Service Delivery Systems, including those for homelessness, substance abuse treatment, mental health, and many other critical services.

Each row of these data represents the provision of a single service and includes information such as the type of service, the date, the Service Provider, the client, and more. The individuals responsible for making these data public spent a great deal of effort perturbing the data to prevent re-identification, but critically, a unique ID is assigned to each client that does not vary by year. This allows us to track client behaviors across services and across time.

Much of what Planners do is ensure that the supply of some public resource, in this case elderly and disability services, is aligned with the demand for those services. This is incredibly difficult, and there are a host of reasons why supply may not meet demand.

It could be that legislatures do not appropriate enough resources to meet the demand; it may not be possible to observe the true demand for resources; insufficient audit controls may allow Service Providers to inefficiently match supply and demand; and finally, it could be that the degree to which supply and demand align varies significantly across certain neighborhoods and populations.

The framework we present below allows us to address these limitations, at least to some degree. The goal is to create a tool that addresses four key use cases:

  1. Where is the actual and forecasted demand for services across space?
  2. Where does the provision of services currently meet that demand?
  3. How are Service Providers collaborating or competing in shared Service Areas?
  4. What can Service Provider audits tell us about the efficacy of these programs?

These use cases are addressed in three sections below. The next section demonstrates, with code examples, how these data can help us understand the provision of services across populations and over time and space. Section 3 mines these data for predictive relationships to help forecast demand in the near future. Section 4 presents wireframes for an application explicitly designed to help balance service demand with service supply.

Client Level Indicators

The City of Seattle’s open data portal currently hosts seven years of Aging and Disability Services data. The data include demographic and household information, general diagnosis codes, and service consumption. Over this time, the number of clients receiving Aging and Disability Services has increased from 32,036 in 2010 to 48,443 in 2016 (Figure 1).

Clients included in this dataset reside in neighborhoods throughout the Seattle-King County area. Figure 2 identifies neighborhoods in southern Seattle as having the highest number of services per acre as well as services per client compared to other neighborhoods.

Who comprises the client population?

Although these services are intended for an aging population, the clientele receiving services has gotten younger over the seven years. Figure 3 shows that the percent of clients aged 65 and above has fallen from 81.4% in 2010 to 72.4% in 2016, while the percent of clients aged 40 to 64 has increased from 13.7% to 21.0%.

Figure 4 shows that, overall, the racial breakdown of clients has remained relatively consistent over the seven years.

However, the percent of lower income clients has increased. In Figure 5 below, ‘Very Low’ income refers to clients who earn less than 30% of the Area Median Income (AMI). Clients who are ‘Low’ income earn less than 50% of the AMI, while those classified as ‘Moderate’ income earn less than 80% of the AMI. ‘Above Moderate’ refers to clients who earn more than 80% of the AMI.
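To make these definitions concrete, here is how the buckets might be derived from raw figures; the clients data frame and its income and ami columns below are hypothetical, since the released data only contain a pre-coded incomecode.

#a sketch of mapping income to AMI buckets (hypothetical 'clients' data frame with
#'income' and 'ami' columns; the released data only contain a pre-coded incomecode)
library(tidyverse)

clients %>%
  mutate(income_group = case_when(income < 0.3 * ami ~ "Very Low",
                                  income < 0.5 * ami ~ "Low",
                                  income < 0.8 * ami ~ "Moderate",
                                  TRUE ~ "Above Moderate"))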

The percent of clients considered Very Low income has risen from 68.3% in 2010 to 73.7% in 2016.

Clients who are homeless are particularly vulnerable and may rely on these services to a greater degree. They comprise a small portion of the dataset each year, but Figure 6 shows that the percent of homeless clients has increased from 1.20% to 1.80%.

In contrast, the percent of clients who are able to live alone and the percent of clients who are disabled have remained relatively constant (Figure 7).

What do these clients need services for?

Clients in the Aging and Disability Services dataset can receive services for a wide variety of reasons. There are eleven ‘Service Areas’ for which data are collected. Figure 8 illustrates the total number of services provided for each Service Area from 2010 to 2016.

Figure 9 explores these Service Areas in more detail, illustrating how the number of services provided changed over the seven years. The left side shows Service Areas that provided more than 20,000 services, and the right side plots Service Areas that provided fewer than 20,000. Several Service Areas, such as Transportation, have not been tracked across all seven years.
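For reference, the tallies behind Figure 8 can be reproduced from the cleaned data built in the Code Appendix below; a minimal sketch using the serviceareaid field:

#count services by Service Area across all seven years (uses cleanDat from the
#Code Appendix below)
cleanDat %>% 
  group_by(serviceareaid) %>% 
  summarise(services = n()) %>% 
  arrange(desc(services))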

Multiple variables in the dataset also fall into what are known as Activities of Daily Living (ADLs) and Instrumental Activities of Daily Living (IADLs). ADLs are basic self-care tasks such as eating, bathing, dressing, toileting, mobility, and grooming. IADLs are slightly more complex skills. They include managing finances, handling transportation, shopping, preparing meals, using the telephone or other communication devices, managing medications, doing laundry, housework, and basic home maintenance.

Not all clients within the dataset will need assistance with these activities, as these tasks are most difficult to manage for patients experiencing early stages of dementia. However, as Figure 10 shows, there is a sizeable population of clients who need help with ADLs and IADLs.
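A similar sketch tallies the shares behind Figure 10 from the cleaned data, shown here for a handful of the ADL/IADL fields:

#share of service records where the client needs help with selected ADLs/IADLs,
#by year (uses cleanDat from the Code Appendix below)
cleanDat %>% 
  group_by(year) %>% 
  summarise(pct_eating   = sum(eating == "Y")/n(),
            pct_bathing  = sum(bathing == "Y")/n(),
            pct_cooking  = sum(cooking == "Y")/n(),
            pct_shopping = sum(shopping == "Y")/n())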

Clientele over time

Despite the growing need for social services, over 50% of clients in the dataset receive services for only one year (Figure 11). The average number of years receiving services is only slightly longer, at 2.29 years.

In fact, there are almost as many clients exiting the system as entering each year (Figure 12).

Certain groups do receive services for longer, particularly very low income clients, clients who are 65 years and older, and African American clients (Figure 13).

Taken together, these indicators tell the story of increasing demand for vital public services over time. What can these data tell us about client demand in the future? Or about demand from a new client who has never received services before? To answer these questions, we turn to the sorts of machine learning algorithms used mainly in the healthcare and insurance industries to forecast risk and service uptake. A robust forecast can help Seattle and other cities collecting this kind of data do a better job of ensuring service provision matches service demand.

Forecasting Demand

Descriptive indicators like those visualized in Section 2 are useful for helping stakeholders understand important trends. However, these data are most powerful when used to help stakeholders make more efficient and effective planning and budgetary decisions.

In this section we use these data to develop a service demand forecasting model that predicts the count of services a client takes in a single year, conditional on data collected in the system. As we show in Section 4, these forecasts can help agencies make more informed resource allocation decisions and ensure that service demand is met across space and across populations.

We choose a single service type, Nutrition services, which makes up 31% of the total services tracked in the data. Our motivation is to borrow the observed client/service experience from the recent past and test the extent to which that experience generalizes to clients in the near future.

The forecast is optimized not only for accuracy, defined as the difference between observed and predicted service counts, but also for the extent to which predictions generalize across different populations and client types. These are measures of fairness, which has become an important component of the machine learning models we build at Urban Spatial.

Below we provide some brief insight into our algorithms, including visual evidence of their effectiveness.

Modeling Nutrition services

The models were trained on a sample of 2010 - 2015 data and then tested on clients who received services in 2016. The data are split into five different ‘stories’, each including features on demographics, location, the activities clients need help with, client history, and the Service Providers for that service. The policy motivation for these stories is to divide the data into policy-relevant buckets; the machine learning motivation is that this is a form of dimensionality reduction.
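To make the ‘stories’ idea concrete, here is one way they might be encoded as named feature sets; the groupings below are illustrative rather than our exact specification, and the provider field name is hypothetical.

#one way to encode the five 'stories' as named feature sets (illustrative groupings;
#'provider' is a hypothetical field name)
stories <- list(
  demographics = c("agerange", "racecode", "incomecode", "limitedenglish", "livealone"),
  location     = c("geographiclocation"),
  activities   = c("eating", "bathing", "dressing", "cooking", "shopping", "walking"),
  history      = c("total_services_2014", "total_services_2015", "numYrRec"),
  providers    = c("provider")
)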

Three separate models are estimated for each story - a Negative Binomial model, a Random Forest model, and an XGBoost model. We then ‘ensemble’ the predictions from these sub-models into a final model. Three different ensembling approaches are used - a Random Forest model, an xgbLinear model, and an xgbTree model.
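A minimal sketch of that workflow for one story is shown below, assuming a hypothetical train_dat data frame with a nutrition_count outcome; the tuning grids and the full feature engineering are omitted.

#sub-models for one story, then a stacked ensemble (a sketch, not the full workflow)
library(tidyverse)
library(caret)
library(randomForest)
library(MASS)   #glm.nb for the Negative Binomial sub-model

demo_story <- train_dat %>% 
  dplyr::select(nutrition_count, agerange, racecode, incomecode, livealone)

nb_fit  <- glm.nb(nutrition_count ~ ., data = demo_story)
rf_fit  <- randomForest(nutrition_count ~ ., data = demo_story, ntree = 500)
xgb_fit <- train(nutrition_count ~ ., data = demo_story, method = "xgbTree",
                 trControl = trainControl(method = "cv", number = 5))

#stack the sub-model predictions and fit a final ensemble on top of them
stacked <- data.frame(nutrition_count = demo_story$nutrition_count,
                      nb  = predict(nb_fit, demo_story, type = "response"),
                      rf  = predict(rf_fit, demo_story),
                      xgb = predict(xgb_fit, demo_story))

ensemble_fit <- train(nutrition_count ~ nb + rf + xgb, data = stacked,
                      method = "xgbLinear",
                      trControl = trainControl(method = "cv", number = 5))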

Figure 14 shows how well each ensemble model predicts for the 2016 test set. The grey bar represents the total observed count of Nutrition services in 2016. Each colored bar then shows how well a given ensemble model predicts total count of services. The Random Forest ensemble with XGBoost inputs performs the best according to this metric, predicting 95,658 nutrition services. The total observed count of nutrition services is 93,877, meaning a difference of only 1.90%.
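The aggregate check is simple to compute; a sketch assuming a hypothetical test_preds data frame of client-level observed and ensemble-predicted 2016 counts:

#total observed vs. total predicted Nutrition services for the 2016 hold-out
#(test_preds is a hypothetical data frame of observed/predicted counts per client)
test_preds %>% 
  summarise(observed  = sum(observed_count),
            predicted = sum(predicted_count),
            pct_diff  = abs(predicted - observed) / observed)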

To measure how fair, or generalizable, our models are across different populations, model performance is compared across groups using Mean Absolute Error (MAE; Figure 15). MAE is the average absolute difference between observed and predicted service counts, regardless of whether a given prediction is too high or too low.

In Figure 15, each grey bar represents the mean observed service count for that population, while each colored bar represents the mean absolute error (MAE) from the model, by ensemble type. We’re looking for consistency across population types here: that the model, for instance, works as well for Caucasian clients as it does for African American clients. That appears to be the case, for the most part.
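The group-wise comparison in Figure 15 follows the same pattern; a sketch using the same hypothetical test_preds data frame:

#mean observed counts and MAE by group (here racecode), mirroring Figure 15
test_preds %>% 
  mutate(abs_error = abs(observed_count - predicted_count)) %>% 
  group_by(racecode) %>% 
  summarise(mean_observed = mean(observed_count),
            MAE           = mean(abs_error))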

One of the most important stories in the ensemble model is the history of services taken by the client: the idea that the types and counts of services a client took in past years help predict service counts for this year. A big challenge is using our ensembling approach to predict for ‘New Clients’, who obviously have no service history. We find our predictions are robust for this population despite the absence of service history, likely because new clients take fewer services than those who have been in the system for several years.

The overall mean absolute error is 0.42, suggesting our service count forecasting models are robust. As a further test, Figure 16 explores how well the model predicts for different hold-out years. Here only three ensemble types are included. We see the model generalizing well across hold-out years, with the exception of 2014. Future work should investigate this further.

With some forecasted measures of demand in hand, we develop a Service Delivery System in the next section in order to better match service demand with service supply.

From data to decision-making intelligence

In this section we develop some wireframe concepts for a dashboard application that hits on the four previously mentioned use cases:

  1. Where is the actual and forecasted demand for services across space?
  2. Where does the provision of services currently meet that demand?
  3. How are Service Providers collaborating or competing in shared Service Areas?
  4. What can Service Provider audits tell us about the efficacy of these programs?

Our ability to address each use case will become apparent as each wireframe is explained. To create the dashboard, several assumptions are made. First, we assume that Service Providers gather and upload address-level data about each service. Second, we presume that client-level data are uploaded by Service Providers at regular intervals, such as monthly or quarterly. Third, we speculate that Service Providers may be raising funds from sources outside of the City of Seattle. Finally, we recognize that the City of Seattle, as shown below, covers only a subset of the geographic area included in the full dataset.

The user begins by evaluating service demand at the city level. Here we use tract data to investigate latent demand, instead of the revealed demand we observe in the data. Using the side-panel, the user can choose a feature to be mapped at the Census tract level for a given year. Citywide indicators, such as total services given and total services forecasted are shown at the bottom of the side-panel. The user can also click on a Census tract to get more detailed information and the option to explore Service Providers within that area.
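Although we present the interface only as wireframes, a minimal sketch of the map-and-side-panel concept in R (Shiny plus leaflet) might look like the following; tract_sf and its year, GEOID, and service count columns are hypothetical.

#a minimal sketch of the choropleth and side-panel concept (hypothetical tract_sf
#object with year, GEOID, observed_services, and forecasted_services columns)
library(shiny)
library(leaflet)
library(dplyr)
library(sf)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      selectInput("feature", "Feature to map",
                  c("observed_services", "forecasted_services")),
      sliderInput("year", "Year", min = 2010, max = 2016, value = 2016, sep = "")
    ),
    mainPanel(leafletOutput("tractMap"))
  )
)

server <- function(input, output) {
  output$tractMap <- renderLeaflet({
    dat <- tract_sf %>% filter(year == input$year)
    pal <- colorQuantile("YlGnBu", dat[[input$feature]], n = 5)
    leaflet(dat) %>%
      addProviderTiles(providers$CartoDB.Positron) %>%
      addPolygons(fillColor = pal(dat[[input$feature]]), fillOpacity = 0.7,
                  color = "white", weight = 1, layerId = ~GEOID)
  })
}

shinyApp(ui, server)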


When the dashboard is expanded, four plots are displayed describing data for that Census tract. The upper left plot displays the forecasted service counts compared to the observed counts for the previous four quarters for both that tract and the city. The next plot compares client composition from the Seattle client data to Census data, and the third plot shows what activities of daily living all clients need help with. The Auditing Metrics box is left blank for now, as it’s only applicable in a ‘Service Provision’ context.


Now that the user understands service demand across Seattle, the next step in the workflow is to explore service supply. There are several ways to navigate to the Service Provision interface. When the user does so, the map zooms to the selected tract and shows the geospatial Service Areas of the Service Providers working in that community.

The user can compare two Service Providers by moving them from the box on the left to the box on the right and clicking ‘map’. The service areas for these Service Providers will be displayed on the map and the user can click on a service area to learn more about a given Service Provider.


When the dashboard is expanded in Service Provision mode, the user can get a sense for how Service Providers working in a community are competing or complementing one another. Are they serving similar populations? Are they providing similar services? The dashboard shows observed versus forecasted services for an area by the two Service Providers.

Here, Auditing Metrics will be displayed. If jurisdictions are going to provide capital to these nonprofits, they should expect to see value in return. No actual auditing metrics are included in the box, only because it’s unclear what data are collected for audits. We do believe, however, that all users of this dashboard, including the Service Providers, should know objectively how well they are serving their populations and how they compare to other Service Providers. We hope such a comparison might help make Service Providers more accountable.

Conclusion

Seattle’s generous open data contribution provides an opportunity for data scientists and policy wonks like us at Urban Spatial to think critically about how these data can generate new value. As more jurisdictions wrestle with the legal, data governance, and financial costs of integrating data across agencies, these service delivery data reveal important use cases that individual agencies can investigate on their own.

The client-level data analyzed in this case study are collected by most cities and states to support the allocation of millions of dollars of taxpayer resources to our most vulnerable. We hope this case study helps other agencies consider how their data can be used to inform resource allocation in the future.

We hope you will consider taking the code in the Appendix below for a spin. You can also check out a slide deck created for this project here.

Code Appendix: Data Wrangling and Exploratory Analysis

The above analysis uses Aging and Disability Services data from 2010 to 2016, which can be found on the City of Seattle’s open data portal. The following code appendix walks through how to acquire the data through the API, clean the data, and create exploratory plots.

Begin by installing and loading the following packages:

library(RSocrata)
library(tidyverse)
library(sf)
library(scales)
library(tidycensus)

We use the color palettes below, as well as a custom mapTheme() and plotTheme(), to create the plots.

mapTheme <- function() {
  theme(
    plot.title = element_text(size = 14, family = "sans", face = "plain", hjust = 0),
    plot.subtitle=element_text(size = 11, family = "sans", hjust = 0),
    plot.caption=element_text(size = 10, family = "sans", face = "italic", hjust = 0),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    legend.title = element_text(size = 10, family = "sans"),
    legend.text = element_text(size = 9, family = "sans"),
    panel.border = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_line(colour = "white")
  )
}

plotTheme <- function() {
  theme(
    plot.title = element_text(size = 14, family = "sans", face = "plain", hjust = 0),
    plot.subtitle=element_text(size = 11, family = "sans", hjust = 0),
    plot.caption=element_text(size = 10, family = "sans", face = "italic", hjust = 0), 
    axis.title.x = element_text(size = 10, family = "sans", face = "plain", 
                                hjust = 1, vjust = -0.5),
    axis.title.y = element_text(size = 10, family = "sans", face = "plain", 
                                hjust = 1, vjust = 1),
    axis.text = element_text(size = 9, family = "sans", face = "plain"),
    panel.background = element_blank(),
    panel.grid.minor = element_line(colour = "gray"),
    panel.grid.major = element_line(colour = "gray"),
    axis.ticks = element_blank(),
    legend.title = element_text(size = 10, family = "sans"),
    legend.text = element_text(size = 9, family = "sans"),
    axis.line = element_blank()
  )
}

palette_9_colors <- c("#fffcba", "#edf8b1", "#c7e9b4", "#7fcdbb", 
                      "#41b6c4", "#1d91c0", "#225ea8", "#253494", "#081d58")
palette_5_colors <- c("#fffcba", "#c7e9b4", "#7fcdbb", "#1d91c0", "#225ea8")
palette_4_colors <- c("#fffcba", "#c7e9b4", "#41b6c4", "#225ea8")
palette_3_colors <- c("#fffcba", "#7fcdbb", "#225ea8")
palette_2_colors <- c("#fffcba", "#225ea8")

For replicability purposes, we acquire the Aging and Disability Services data through the API using the RSocrata package. The read.socrata function takes the URL for the dataset. We save each year separately - ensuring the year is coded properly - and then rbind every year together into one dataset.

#use the api to get the data for each year
#make sure the year is set correctly, and if not set it to the correct year
services10 <- read.socrata("https://data.seattle.gov/Community/Aging-and-Disability-Services-Client-Level-Data-20/b2rz-ma2f") %>% 
  mutate(serviceyear = ifelse(is.na(serviceyear), 2010, serviceyear))

services11 <- read.socrata("https://data.seattle.gov/Community/Aging-and-Disability-Services-Client-Level-Data-20/32ek-fqau") %>% 
  mutate(serviceyear = ifelse(is.na(serviceyear), 2011, serviceyear))

services12 <- read.socrata("https://data.seattle.gov/Community/Aging-and-Disability-Services-Client-Level-Data-20/bk4b-z4j9") %>% 
  mutate(serviceyear = ifelse(is.na(serviceyear), 2012, serviceyear))

services13 <- read.socrata("https://data.seattle.gov/Community/Aging-and-Disability-Services-Client-Level-Data-20/gd5n-q9e4") %>% 
  mutate(serviceyear = ifelse(is.na(serviceyear), 2013, serviceyear)) 

services14 <- read.socrata("https://data.seattle.gov/Community/Aging-and-Disability-Services-Client-Level-Data-20/e7wh-bzzm") %>% 
  mutate(serviceyear = ifelse(is.na(serviceyear), 2014, serviceyear))

services15 <- read.socrata("https://data.seattle.gov/Community/Aging-and-Disability-Services-Client-Level-Data-20/gp2p-r888") %>% 
  mutate(serviceyear = ifelse(is.na(serviceyear), 2015, serviceyear))

services16 <- read.socrata("https://data.seattle.gov/Community/Aging-and-Disability-Services-Client-Level-Data-20/vhuq-wqf8") %>% 
  mutate(serviceyear = ifelse(is.na(serviceyear), 2016, serviceyear))

#combine all years of data into one data frame
allyears <- rbind(services10, services11, services12, services13, services14, services15, services16)

Because these data are manually entered and each Service Provider collects different information, there are many inconsistencies and misclassifications. Following the metadata provided, we clean the dataset, filtering out any clientids that are misclassified and recoding the features we are interested in.

cleanDat <- allyears %>%
  mutate(year = serviceyear) %>% #give the year a consistent name for the plots below
  filter(clientid != "",
         !is.na(clientid),
         clientid != "Y",
         !is.na(serviceareaid)) %>% 
  mutate(racecode = ifelse(racecode == 9, 0,racecode),
         incomecode = ifelse(incomecode %in% c(5, 6, 7, 8), 0,incomecode),
         livealone = case_when(livealone %in% c("", 0, "u", "U", " ") ~ "U",
                               livealone %in% c("n", "N") ~ "N",
                               livealone %in% c("y", "Y") ~ "Y"),
         agerange = ifelse(agerange %in% c("N", "", " ", "-22", "-3", "-423"), "U", agerange),
         limitedenglish = case_when(limitedenglish %in% c("", " ", 0, "u", "U") ~ "U",
                                    limitedenglish %in% c("n", "N") ~ "N",
                                    limitedenglish %in% c("y", "Y") ~ "Y"),
         nutritionalrisk = case_when(nutritionalrisk %in% c("", " ", "u", "U") ~ "U",
                                     nutritionalrisk %in% c(0, 1,2) ~ "Good",
                                     nutritionalrisk %in% c(3,4,5) ~ "Moderate",
                                     nutritionalrisk %in% c(6,7,8,9) ~ "High",
                                     nutritionalrisk %in% c("n", "N") ~ "N",
                                     nutritionalrisk %in% c("y", "Y") ~ "Y"),
         singleparent = case_when(singleparent %in% c("", " ", "u", "U") ~ "U",
                                  singleparent %in% c("n", "N") ~ "N",
                                  singleparent %in% c("y", "Y") ~ "Y"),
         householdwithchildren = case_when(householdwithchildren %in% c("", " ", "u", "U") ~ "U",
                                           householdwithchildren %in% c("n", "N") ~ "N",
                                           householdwithchildren %in% c("y", "Y") ~ "Y"),
         eating = case_when(eating %in% c("", " ", "u", "U") ~ "U",
                            eating %in% c("n", "N") ~ "N",
                            eating %in% c("y", "Y") ~ "Y"),
         toileting = case_when(toileting %in% c("", " ", "u", "U") ~ "U",
                               toileting %in% c("n", "N") ~ "N",
                               toileting %in% c("y", "Y") ~ "Y"),
         walking = case_when(walking %in% c("", " ", "u", "U") ~ "U",
                             walking %in% c("n", "N") ~ "N",
                             walking %in% c("y", "Y") ~ "Y"),
         gettingplaces = case_when(gettingplaces %in% c("", " ", "u", "U") ~ "U",
                                   gettingplaces %in% c("n", "N") ~ "N",
                                   gettingplaces %in% c("y", "Y") ~ "Y"),
         transferring = case_when(transferring %in% c("", " ", "u", "U") ~ "U",
                                  transferring %in% c("n", "N") ~ "N",
                                  transferring %in% c("y", "Y") ~ "Y"),
         dressing = case_when(dressing %in% c("", " ", "u", "U") ~ "U",
                              dressing %in% c("n", "N") ~ "N",
                              dressing %in% c("y", "Y") ~ "Y"),
         bathing = case_when(bathing %in% c("", " ", "u", "U") ~ "U",
                             bathing %in% c("n", "N") ~ "N",
                             bathing %in% c("y", "Y") ~ "Y"),
         medicalmanagement = case_when(medicalmanagement %in% c("", " ", 1, 2, 3, 4, 5, 6, 7, 
                                                                8, 9, 10, 11, 12, "u", "U") ~ "U",
                                       medicalmanagement %in% c("n", "N") ~ "N",
                                       medicalmanagement %in% c("y", "Y") ~ "Y"),
         cooking = case_when(cooking %in% c("", " ", 2016, 2015, 2014, 2013, 2012, 
                                            2011, 2010, "u", "U") ~ "U",
                             cooking %in% c("n", "N") ~ "N",
                             cooking %in% c("y", "Y") ~ "Y"),
         shopping = case_when(shopping %in% c("", " ", 1034, 872, "u", "U") ~ "U",
                              shopping %in% c("n", "N") ~ "N",
                              shopping %in% c("y", "Y") ~ "Y"),
         driving = case_when(driving %in% c("", " ", 11, 19, 47, "u", "U") ~ "U",
                             driving %in% c("n", "N") ~ "N",
                             driving %in% c("y", "Y") ~ "Y"),
         heavyhousework = case_when(heavyhousework %in% c("", " ", 134, 135, 30, 
                                                          67, 12, "u", "U") ~ "U",
                                    heavyhousework %in% c("n", "N") ~ "N",
                                    heavyhousework %in% c("y", "Y") ~ "Y"),
         phoning = case_when(phoning %in% c("", " ", 1, 14, 15, 28, 35, 21, 3, 29, 30, 32, 
                                            33, 34, 38, 40, 36, 52, 8, "u", "U") ~ "U",
                             phoning %in% c("n", "N") ~ "N",
                             phoning %in% c("y", "Y") ~ "Y"),
         moneymanagement = case_when(moneymanagement %in% c("", " ", "Case", "Meals", 
                                                            "Contact", "u", "U") ~ "U",
                                     moneymanagement %in% c("n", "N") ~ "N",
                                     moneymanagement %in% c("y", "Y") ~ "Y"),
         chores = case_when(chores %in% c("", " ", 2616, 2693, "u", "U") ~ "U",
                            chores %in% c("n", "N") ~ "N",
                            chores %in% c("y", "Y") ~ "Y"),
         disabilitystatus = case_when(disabilitystatus %in% c("", " ", "-", "u", "U") ~ "U",
                                      disabilitystatus %in% c("n", "N") ~ "N",
                                      disabilitystatus %in% c("y", "Y") ~ "Y"),
         geographiclocation = ifelse(geographiclocation %in% c("", " ", "N"), "U", 
                                     geographiclocation),
         homeless = case_when(homeless %in% c("", " ", "u", "U") ~ "U",
                              homeless %in% c("n", "N") ~ "N",
                              homeless %in% c("y", "Y") ~ "Y"))

Plots on client composition are then made using the dataframe cleanDat. Using the tidyverse package, we pipe (%>%) directly into the ggplot. We style the plot by adding in our color scheme, using our custom plotTheme(), and adding commas or percent symbols to the axis labels through the scales package.

#number of clients each year
cleanDat %>% 
    group_by(year) %>%
    summarise(uniqueClients = length(unique(clientid))) %>% 
    ggplot(., aes(as.factor(year), uniqueClients)) +
    geom_bar(stat = "identity", fill = "#225ea8") +
    labs(title = "Number of clients over the last seven years",
         y = "unique clients",
         x = " ") +
    scale_y_continuous(labels = comma) +
    plotTheme()

#race breakdown of clientele
cleanDat %>% 
  group_by(year) %>%
  summarise(White = sum(racecode %in% c(6))/n(),
            Black = sum(racecode %in% c(3))/n(),
            Hisp = sum(racecode %in% c(5))/n(),
            Other = sum(racecode %in% c(0, 1, 2, 4, 7, 8))/n()) %>%
  select(-year) %>% 
  gather() %>% 
  mutate(year = rep(2010:2016, 4)) %>% 
  ggplot(., aes(factor(key), value, fill = factor(key))) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~year) + 
  scale_fill_manual(values = palette_4_colors) +
  labs(title = "Racial breakdown of clientele",
       x = " ",
       y = "percent",
       caption = "Other includes: Multiracial, American Indian, Asian-American,\nNative or Pacific Islander, Other, and Unknown") +
  scale_y_continuous(limits = c(0, 1),
                     labels = percent_format()) +
  plotTheme() +
  theme(legend.position = "none",
        strip.background = element_rect(fill="white"),
        strip.text = element_text(size = 11, family = "sans", hjust = 0))

#income breakdown of clientele
cleanDat %>%
  group_by(year) %>% 
  summarise("Very low" = sum(incomecode %in% c(1))/n(),
            Low = sum(incomecode %in% c(2))/n(),
            Moderate = sum(incomecode %in% c(3))/n(),
            "Above moderate" = sum(incomecode %in% c(4))/n()) %>% 
  dplyr::select(-year) %>% 
  gather() %>% 
  mutate(year = rep(2010:2016, 4)) %>% 
  ggplot(., aes(as.factor(year), value, fill = reorder(factor(key), value))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Income of clientele",
       subtitle = "Clientele is getting poorer",
       x = " ",
       y = "percent") +
  scale_fill_manual(values = palette_4_colors,
                    name = "Income Groups") +
  scale_y_continuous(limits = c(0, 1),
                     labels = percent_format()) +
  plotTheme()

#age breakdown of clientele
cleanDat %>%
  group_by(year) %>% 
  summarise("65 yr +" = sum(agerange %in% c("65 to 69", "70 to 74", "75 to 79",
                                            "80 to 84", "85 to 89", "90 to 94", 
                                            "95 to 99", "100 to 104", "105 to 109",
                                            "110 to 114", "115 to 119"))/n(),
            "40 to 64 yr" = sum(agerange %in% c("40 to 44", "45 to 49", "50 to 54",
                                                "55 to 59", "60 to 64"))/n(),
            "20 to 39 yr" = sum(agerange %in% c("20 to 24", "25 to 29", "30 to 34", 
                                                "35 to 39"))/n()) %>% 
  dplyr::select(-year) %>% 
  gather() %>% 
  mutate(year = rep(2010:2016, 3)) %>% 
  ggplot(., aes(as.factor(year), value, fill = reorder(factor(key), value))) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = palette_3_colors,
                    name = "Age Groups") +
  labs(title = "Age breakdown of clientele",
       subtitle = "Clientele is getting younger",
       x = " ",
       y = "percent") +
  scale_y_continuous(limits = c(0, 1),
                     labels = percent_format()) +
  plotTheme()

We can explore other indicators, such as homelessness and disability status as well.

#examining homelessness
cleanDat %>%
  group_by(year) %>% 
  summarise(pcthomeless = sum(homeless == "Y")/n()) %>% 
  ggplot(., aes(as.factor(year), pcthomeless)) +
  geom_bar(stat = "identity", fill = "#225ea8") +
  labs(title = "Percent of clients who are homeless",
       y = "percent",
       x = " ") +
  scale_y_continuous(limits=c(0,0.025), labels = percent_format()) +
  plotTheme()

#clients who are disabled
cleanDat %>%
  group_by(year) %>% 
  summarise(pctdisabled = sum(disabilitystatus == "Y")/n()) %>% 
  ggplot(., aes(as.factor(year), pctdisabled)) +
  geom_bar(stat = "identity", fill = "#225ea8") +
  labs(title = "Percent of clients who are disabled",
       x = " ",
       y = "percent") +
  scale_y_continuous(limits=c(0,1), labels = percent_format()) +
  plotTheme()

#clients who livealone
cleanDat %>%
  group_by(year) %>% 
  summarise(pctlivealone = sum(livealone == "Y")/n()) %>% 
  ggplot(., aes(as.factor(year), pctlivealone)) +
  geom_bar(stat = "identity", fill = "#225ea8") +
  labs(title = "Percent of clients who live alone",
       x = " ",
       y = "percent") +
  scale_y_continuous(limits=c(0,1), labels = percent_format()) +
  plotTheme()

Next, we look at the spatial distribution of services. The neighborhoods found in the dataset are groupings of zip codes. We acquire zip code geometries using the tidycensus package and regroup them into the appropriate neighborhoods.

#make a list of zip codes that we are interested in 
zipcodeList <- c(98014, 98019, 98024, 98045, 98050, 98065, 98068, 98224, 98251, 98288, 
                 98004, 98005, 98006, 98007, 98008, 98009, 98015, 98027, 98029, 98033, 
                 98034, 98039, 98040, 98052, 98053, 98059, 98073, 98074, 98075, 98083, 
                 98011, 98028, 98041, 98072, 98082, 98155, 98160, 98107, 98117, 98102, 
                 98112, 98122, 98106, 98126, 98101, 98104, 98111, 98114, 98121, 98129, 
                 98154, 98161, 98164, 98174, 98181, 98184, 98191, 98108, 98124, 98134, 
                 98103, 98125, 98133, 98105, 98115, 98145, 98177, 98185, 98195, 98109, 
                 98119, 98199, 98118, 98144, 98116, 98136, 98146, 98010, 98022, 98025, 
                 98038, 98051, 98001, 98002, 98003, 98023, 98030, 98031, 98032, 98035, 
                 98042, 98047, 98054, 98055, 98056, 98057, 98058, 98062, 98063, 98064, 
                 98071, 98092, 98093, 98131, 98132, 98138, 98148, 98151, 98158, 98166, 
                 98168, 98170, 98171, 98178, 98188, 98190, 98198, 98354, 98013, 98070)

#get all zip codes and then filter them for the ones in the above list
zips <- get_acs(geography = "zip code tabulation area", year = 2016, 
                variables = "B00001_001", geometry = T) %>% 
  filter(GEOID %in% zipcodeList) %>% 
  #select out columns we don't want (only need the geometry)
  select(-NAME, -variable, -estimate) %>% 
  #regroup into the neighborhoods
  mutate(geographiclocation = case_when(
    GEOID %in% c(98014, 98019, 98024, 98045, 98050, 98065, 
                 98068, 98224, 98251, 98288) ~ "East Rural",
    GEOID %in% c(98004, 98005, 98006, 98007, 98008, 98009, 98015, 
                 98027, 98029, 98033, 98034, 98039, 98040, 98052, 
                 98053, 98059, 98073, 98074, 98075, 98083) ~ "East Urban",
    GEOID %in% c(98011, 98028, 98041, 98072, 98082, 98155, 98160) ~ "North Urban",
    GEOID %in% c(98107, 98117) ~ "Seattle Neighborhoods: Ballard",
    GEOID %in% c(98102, 98112) ~ "Seattle Neighborhoods: Capitol Hill",
    GEOID == 98122 ~ "Seattle Neighborhoods: Central Seattle",
    GEOID %in% c(98106, 98126) ~ "Seattle Neighborhoods: Delridge",
    GEOID %in% c(98101, 98104, 98111, 98114, 98121, 98129, 98154, 
                 98161, 98164, 98174, 98181, 98184, 98191) ~ "Seattle Neighborhoods: Downtown",
    GEOID %in% c(98108, 98124, 98134) ~ "Seattle Neighborhoods: Duwamish",
    GEOID == 98103 ~ "Seattle Neighborhoods: Lake Union",
    GEOID %in% c(98125, 98133) ~ "Seattle Neighborhoods: NE Seattle",
    GEOID %in% c(98105, 98115, 98145, 98177, 98185, 98195) ~ "Seattle Neighborhoods: North Seattle",
    GEOID %in% c(98109, 98119, 98199) ~ "Seattle Neighborhoods: Queen Anne",
    GEOID %in% c(98118, 98144) ~ "Seattle Neighborhoods: SE Seattle",
    GEOID %in% c(98116, 98136, 98146) ~ "Seattle Neighborhoods: SW Seattle",
    GEOID %in% c(98010, 98022, 98025, 98038, 98051) ~ "South Rural",
    GEOID %in% c(98001, 98002, 98003, 98023, 98030, 98031, 98032, 98035, 
                 98042, 98047, 98054, 98055, 98056, 98057, 98058, 98062, 
                 98063, 98064, 98071, 98092, 98093, 98131, 98132, 98138, 
                 98148, 98151, 98158, 98166, 98168, 98170, 98171, 98178, 
                 98188, 98190, 98198, 98354) ~ "South Urban",
    GEOID %in% c(98013, 98070) ~ "Vashon")) %>% 
  group_by(geographiclocation) %>% 
  #union the zip code geometries together to get one large geometry for the neighborhood
  summarise(geometry = st_union(geometry)) %>% 
  st_sf() %>%   #transform into an sf object
  st_transform(4326)

Several variables are then calculated, capturing the number of services over time and the years in which each client was present in the data. The result is saved as newDat.

#demographic data we want to keep in the dataset
interestedIn <- list("incomecode", "agerange", "bathing", "chores", "cooking", 
                     "disabilitystatus", "dressing", "driving", "eating", 
                     "geographiclocation", "gettingplaces", "heavyhousework", "homeless",
                     "householdwithchildren","limitedenglish", "livealone", 
                     "medicalmanagement", "moneymanagement", "numberofchildren",
                     "nutritionalrisk","phoning", "racecode", "shopping", 
                     "singleparent", "toileting", "transferring", "walking")

newDat <- cleanDat %>% 
  group_by(year, clientid, serviceareaid) %>% #group by year, client, and service type
  summarise(count = n()) %>% #count the number of each service each client receives
  ungroup() %>%
  #rename the servicearea to take year
  mutate(serviceareaid = as.character(paste0("servicearea_",serviceareaid, "_", year))) %>% 
  select(-year) %>% #remove the year column
  group_by(clientid) %>%   #group by clientid
  #make data wide so each service type is a column filled with its count
  spread(serviceareaid, count) %>%  
  replace(is.na(.), 0) %>%  #fill NA values with zeros
  ungroup() %>%   #ungroup the data
  #calculate the total number of services per year
  mutate(total_services_2010 = rowSums(select(., contains('_2010'))),    
         total_services_2011 = rowSums(select(., contains('_2011'))),
         total_services_2012 = rowSums(select(., contains('_2012'))),
         total_services_2013 = rowSums(select(., contains('_2013'))),
         total_services_2014 = rowSums(select(., contains('_2014'))),
         total_services_2015 = rowSums(select(., contains('_2015'))),
         total_services_2016 = rowSums(select(., contains('_2016'))))  %>% 
  #calculate total services and whether a client was present in a given year (binary)
  mutate(total_services = rowSums(select(., contains('total_services_'))),
         present10 = ifelse(total_services_2010 == 0, 0, 1),
         present11 = ifelse(total_services_2011 == 0, 0, 1),
         present12 = ifelse(total_services_2012 == 0, 0, 1),
         present13 = ifelse(total_services_2013 == 0, 0, 1),
         present14 = ifelse(total_services_2014 == 0, 0, 1),
         present15 = ifelse(total_services_2015 == 0, 0, 1),
         present16 = ifelse(total_services_2016 == 0, 0, 1))  %>% 
  #calculate the number of years a client received services
  mutate(numYrRec = rowSums(select(., contains('present')))) %>% 
  select(., clientid, contains('total'), contains('present'), starts_with('n')) %>% 
  left_join(., cleanDat %>% 
              select(clientid, geographiclocation, which((names(.) %in% interestedIn)==TRUE)) %>% 
              distinct(clientid, .keep_all = TRUE), 
            by = "clientid")

We join ‘newDat’ to the neighborhoods and summarize the number of services and number of clients for each neighborhood. We then map services per acre and services per client.

zipsdat <- left_join(newDat, zips, by = "geographiclocation") %>% 
  mutate(total_services = as.numeric(total_services)) %>% 
  group_by(geographiclocation) %>% 
  summarise(servicesSum = sum(total_services),
            clientSum = length(unique(clientid))) %>% 
  left_join(., zips, by = "geographiclocation") %>% 
  st_sf() %>% 
  st_transform(4326) %>% 
  mutate(acre = as.numeric(st_area(.)*2.29568e-5),
         normalizedTotal = servicesSum/acre,
         servByClient = servicesSum/clientSum)

zipsdat %>% 
  ggplot(.) +
  geom_sf(aes(fill = ntile(normalizedTotal, 5)), color = "white") +
  scale_fill_gradientn(colors = palette_9_colors,
                       labels = as.character(round(quantile(zipsdat$normalizedTotal,
                                                         c(0.1, .2, .4, .6, .8)), 2)),
                       name = "services\nper acre\n(Quintile\nbreaks)") +
  labs(title = "Services per acre delivered in each neighborhood",
       subtitle = "2010 to 2016", 
       caption = "Figure 2") +
  mapTheme()

zipsdat %>% 
  ggplot(.) +
  geom_sf(aes(fill = ntile(servByClient, 5)), color = "white") +
  scale_fill_gradientn(colors = palette_9_colors,
                       labels = as.character(round(quantile(zipsdat$servByClient,
                                                         c(0.1, .2, .4, .6, .8)), 2)),
                       name = "services\nper client\n(Quintile\nbreaks)") +
  labs(title = "Services per client by neighborhood",
       subtitle = "2010 to 2016") +
  mapTheme()

Using this new data, we also create several plots exploring the longevity of clients receiving services, including the average number of years receiving services as well as client entrances and exits.

#number of years receiving services
newDat %>% 
  group_by(numYrRec) %>% 
  summarise(number = n()) %>% 
  mutate(pct = number/sum(number)) %>% 
  ggplot(., aes(x = as.factor(numYrRec), y = pct)) +
  geom_bar(stat = "identity", fill = "#225ea8") +
  labs(x = "Number of years gettting services",
       y = "percent of clients",
       title = "Longevity of clients receiving services",
       caption = "Figure 12") +
  scale_y_continuous(limits = c(0, 1),
                     labels = percent_format()) +
  plotTheme()

#entrances and exits
data.frame(
  "2010" = c(sum(newDat$present10), 0),
  "2011" = c(newDat %>% filter(present10 == 0 & present11 == 1) %>% n_distinct(), 
            newDat %>% filter(present10 == 1 & present11 == 0) %>% n_distinct()),
  "2012" = c(newDat %>% filter(present11 == 0 & present12 == 1) %>% n_distinct(), 
            newDat %>% filter(present11 == 1 & present12 == 0) %>% n_distinct()),
  "2013" = c(newDat %>% filter(present12 == 0 & present13 == 1) %>% n_distinct(), 
            newDat %>% filter(present12 == 1 & present13 == 0) %>% n_distinct()),
  "2014" = c(newDat %>% filter(present13 == 0 & present14 == 1) %>% n_distinct(), 
            newDat %>% filter(present13 == 1 & present14 == 0) %>% n_distinct()),
  "2015" = c(newDat %>% filter(present14 == 0 & present15 == 1) %>% n_distinct(), 
            newDat %>% filter(present14 == 1 & present15 == 0) %>% n_distinct()),
  "2016" = c(newDat %>% filter(present15 == 0 & present16 == 1) %>% n_distinct(), 
            newDat %>% filter(present15 == 1 & present16 == 0) %>% n_distinct())
) %>% 
  gather() %>% 
  mutate(type = rep(c("enter", "exit"), 7),
         key = substring(key, 2,5)) %>% 
  ggplot(., aes(fill = type, x = key, y = value)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = palette_2_colors,
                    name = " ") +
  scale_y_continuous(labels = comma) +
  labs(x = " ", 
       y = "number of clients",
       title = "Entrances and exits of clients over seven years",
       subtitle = "Base year: 2010",
       caption = "'Entrance' defined as not being present in the previous year but present in the current year;\n'Exit' defined as being present in the previous year but not present in the current year") +
  plotTheme()

#number of years receiving services per group
rbind(
  newDat %>% 
  mutate(group = case_when(agerange %in% c("65 to 69", "70 to 74", "75 to 79",
                                           "80 to 84", "85 to 89", "90 to 94", 
                                           "95 to 99", "100 to 104", "105 to 109",
                                           "110 to 114", "115 to 119") ~ "65 yrs +",
                           agerange %in% c("0 to 4", "5 to 9", "10 to 14", 
                                           "15 to 19") ~ "< 20 yrs",
                           agerange %in% c("40 to 44", "45 to 49", "50 to 54",
                                           "55 to 59", "60 to 64") ~ "45 to 65 yrs",
                           agerange %in% c("20 to 24", "25 to 29", "30 to 34", 
                                           "35 to 39") ~ "20 to 45 yrs",
                           agerange == "U" ~ "Age Unknown")) %>% 
  group_by(group) %>% 
  summarise(count = n(),
            length = sum(present10,present11,present12,present13,
                         present14,present15,present16)) %>% 
  mutate(avgTime = round(length/count,2)) %>% 
    select(group, avgTime),
  newDat %>% 
  filter(racecode %in% c(3,5,6)) %>% 
  group_by(racecode) %>% 
  summarise(count = n(),
            length = sum(present10,present11,present12,present13,
                         present14,present15,present16)) %>% 
  mutate(avgTime = round(length/count,2),
         group = case_when(racecode == 3 ~ "Black",
                              racecode == 5 ~ "Hispanic",
                              racecode == 6 ~ "White")) %>% 
  select(group, avgTime),
 newDat %>% 
  filter(incomecode %in% c(1,2,3,4)) %>% 
  group_by(incomecode) %>% 
  summarise(count = n(),
            length = sum(present10,present11,present12,present13,
                         present14,present15,present16)) %>% 
  mutate(avgTime = round(length/count,2),
         group = case_when(incomecode == 1 ~ "Very Low Income",
                           incomecode == 2 ~ "Low Income",
                           incomecode == 3 ~ "Moderate Income",
                           incomecode == 4 ~ "Above Moderate Income")) %>% 
  select(group, avgTime)
) %>% 
  rename(avgTime_yrs = avgTime) %>% 
  ggplot(., aes(x = reorder(group, avgTime_yrs), y = avgTime_yrs)) +
  geom_bar(stat = "identity", fill = "#225ea8") +
  coord_flip() +
  labs(title = "Average length of receiving services by demographic groups",
       y = "years",
       x = " ",
       caption = "Figure 14") +
  plotTheme()