Lab 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Dalanda Jalloh

Published

February 9, 2026

Assignment Overview

Scenario

You are a data analyst for the NYS Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

Apply dplyr functions to real census data for policy analysis
Evaluate data quality using margins of error
Connect technical analysis to algorithmic decision-making
Identify potential equity implications of data reliability issues
Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/

Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"

If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidycensus)
library(tidyverse)
library(knitr)
library(dplyr)

# Set your Census API key
# A Census API key was already stored in my environment
Sys.getenv("CENSUS_API_KEY")

[1] "[redacted]"
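
Printing the key with Sys.getenv() writes it into the rendered page, where anyone can read it. A minimal sketch of one alternative, assuming the key is stored in the CENSUS_API_KEY environment variable as above, checks that the key exists without echoing its value (the `key_is_set` name is only illustrative):

```r
# Confirm that a Census API key is available without printing its value
key_is_set <- nzchar(Sys.getenv("CENSUS_API_KEY"))

if (!key_is_set) {
  # tidycensus::census_api_key() is one way to store a key for later sessions
  message("No CENSUS_API_KEY found in the environment.")
}

key_is_set
```

The result is a single TRUE or FALSE, so the rendered document shows whether setup worked but never the key itself.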

# Choose your state for analysis - assign it to a variable called my_state
my_state<- "New York"

State Selection: I chose New York for this analysis because I grew up in Brooklyn and went to school in Buffalo and therefore would like to understand the state better.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
NY_census<- get_acs(
  geography="county",
  variables= c(median_income="B19013_001", total_pop="B01003_001"), 
  state="NY",
  year=2022, 
  survey="acs5", 
  output= "wide"
)

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()

NY_census_cleaned<- NY_census %>% 
  mutate(
   NAME= str_remove(NAME, "County"),
   NAME= str_remove(NAME, "New York"),
   NAME= str_remove(NAME, ",")
  )
  

 # Display the first few rows
glimpse(NY_census_cleaned)

Rows: 62
Columns: 6
$ GEOID          <chr> "36001", "36003", "36005", "36007", "36009", "36011", "…
$ NAME           <chr> "Albany  ", "Allegany  ", "Bronx  ", "Broome  ", "Catta…
$ median_incomeE <dbl> 78829, 58725, 47036, 58317, 56889, 63227, 54625, 61358,…
$ median_incomeM <dbl> 2049, 1965, 890, 1761, 1778, 2736, 1754, 2475, 2526, 28…
$ total_popE     <dbl> 315041, 47222, 1443229, 198365, 77000, 76171, 127440, 8…
$ total_popM     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
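
The glimpse() output shows trailing spaces left behind in NAME (for example "Albany  "). As a hedged alternative, the sketch below removes the whole " County, New York" suffix in one step with an anchored pattern. It assumes every county name ends with exactly that suffix, and the `raw_names` vector is a small illustrative sample rather than the full data.

```r
library(stringr)

# Illustrative sample of county names in the format returned by get_acs()
raw_names <- c(
  "Albany County, New York",
  "New York County, New York",
  "St. Lawrence County, New York"
)

# Anchoring the pattern with $ removes only the trailing suffix,
# so "New York County" keeps its own name and no stray spaces remain
clean_names <- str_remove(raw_names, " County, New York$")

clean_names
```

With this approach a later str_trim() step would not be needed before filtering on county names.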


## 2.2 Data Quality Assessment

**Your Task:** Calculate margin of error percentages and create reliability categories.

**Requirements:**
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%  
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)

**Hint:** Use `mutate()` with `case_when()` for the categories.
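
As a worked check of the formula and categories above, this sketch applies them to three county estimates that appear elsewhere in this lab. The `example` tibble and the `moe_pct` and `unreliable` column names are illustrative assumptions.

```r
library(dplyr)

# Estimates and margins of error for three counties in this lab
example <- tibble(
  county = c("Bronx", "Schuyler", "Hamilton"),
  median_incomeE = c(47036, 61316, 66891),
  median_incomeM = c(890, 5818, 7622)
)

example_reliability <- example %>%
  mutate(
    # MOE percentage: (margin of error / estimate) * 100
    moe_pct = round(median_incomeM / median_incomeE * 100, 2),
    # case_when() checks conditions in order, so each row gets the first match
    reliability = case_when(
      moe_pct < 5 ~ "High Confidence",
      moe_pct <= 10 ~ "Moderate Confidence",
      moe_pct > 10 ~ "Low Confidence"
    ),
    # Flag for unreliable estimates (MOE > 10%)
    unreliable = moe_pct > 10
  )

example_reliability
```

Because the conditions are checked in order, a value of exactly 5 or exactly 10 lands in the moderate category.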


::: {.cell}

```{.r .cell-code}
# Calculate MOE percentage and reliability categories using mutate()
NY_census_cleaned_reliability <- NY_census_cleaned %>%
  mutate(
    moe_per_income = round(median_incomeM / median_incomeE * 100, 2),
    reliability = case_when(
      moe_per_income < 5 ~ "High Confidence",
      moe_per_income >= 5 & moe_per_income <= 10 ~ "Moderate",
      moe_per_income > 10 ~ "Low Confidence"
    ),
    # Flag unreliable estimates (MOE > 10%)
    unreliable = moe_per_income > 10
  )

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
NY_census_summary <- NY_census_cleaned_reliability %>%
  count(reliability) %>%
  mutate(percent = n / sum(n) * 100)
# count(reliability) returns an ungrouped result, so sum(n) is the total across all counties
```
:::

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage

top5 <- NY_census_cleaned_reliability %>%
  arrange(desc(moe_per_income)) %>%
  slice(1:5) %>% 
  select(NAME, median_incomeE, median_incomeM, moe_per_income, reliability)


# Format as table with kable() - include appropriate column names and caption
# col.names must follow the column order produced by select() above
kable(top5,
      col.names = c("County", "Median Income", "Margin of Error", "MOE (%)", "Reliability"),
      caption = "NY Counties with Highest Income Data Uncertainty",
      format.args = list(big.mark = ","))

NY Counties with Highest Income Data Uncertainty
County	Median Income	Margin of Error	MOE (%)	Reliability
Hamilton	66,891	7,622	11.39	Low Confidence
Schuyler	61,316	5,818	9.49	Moderate
Greene	70,294	4,341	6.18	Moderate
Yates	63,974	3,733	5.84	Moderate
Essex	68,090	3,590	5.27	Moderate
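
The arrange() and slice() steps above can also be written with dplyr's slice_max(), which sorts and truncates in one call. A minimal sketch on a small illustrative subset of the county results (the `county_moe` tibble is an assumption for demonstration, using MOE percentages reported in this lab):

```r
library(dplyr)

county_moe <- tibble(
  NAME = c("Erie", "Hamilton", "Kings", "Schuyler", "Greene", "Yates", "Essex"),
  moe_per_income = c(1.18, 11.39, 1.27, 9.49, 6.18, 5.84, 5.27)
)

# slice_max() returns rows ordered from the largest value down;
# with_ties = FALSE guarantees exactly n rows even if values tie
top5_alt <- county_moe %>%
  slice_max(moe_per_income, n = 5, with_ties = FALSE)

top5_alt$NAME
```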

Data Quality Commentary:

[Write 2-3 sentences explaining what these results mean for algorithmic decision-making. Consider: Which counties might be poorly served by algorithms that rely on this income data? What factors might contribute to higher uncertainty?]

The results show that an algorithm relying on income data from these five counties, and Hamilton County in particular (MOE of 11.39%), could make decisions that do not reflect residents' true economic circumstances. For the NYS Department of Human Services, this makes it difficult to assess who is really in need of social benefits. The agency may overestimate or underestimate the needs of some families because of the uncertainty in the reported income estimates.

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# First, trim the leftover whitespace from the county names so that filter() can match them exactly
library(stringr)

NY_census_cleaned_reliability <- NY_census_cleaned_reliability %>%
  mutate(NAME = str_trim(NAME))
  
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties

selected_counties<- NY_census_cleaned_reliability %>% 
filter(NAME %in% c("Hamilton", "Erie", "Kings")) 

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category

selected_counties %>%
  select(NAME, median_incomeE, moe_per_income, reliability) %>%
  kable(
      col.names = c("County", "Median Income", "MOE (%)", "Reliability"),
      caption = "Selected Counties",
      format.args = list(big.mark = ","))

Selected Counties
County	Median Income	MOE (%)	Reliability
Erie	68,014	1.18	High Confidence
Hamilton	66,891	11.39	Low Confidence
Kings	74,692	1.27	High Confidence

Comment on the output: Of the three counties I chose, Erie has the most reliable income estimate (MOE of 1.18%), followed closely by Kings (1.27%); both are High Confidence. Hamilton's MOE of 11.39% places it in the Low Confidence category.

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.

# Define your race/ethnicity variables with descriptive names

acs_variables<- c(black="B03002_004", hispanic="B03002_012", white="B03002_003", totalpop="B03002_001")

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter

selected_counties_census <- get_acs(
  geography = "tract",
  variables = acs_variables,
  state = "NY",
  county= c("Erie", "Hamilton", "Kings"),
  year = 2022,
  survey= "acs5",
  output = "wide") # one row per tract, with estimate (E) and margin (M) columns for each variable
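
On the county-code challenge: the county GEOIDs shown in section 2.1 are five characters long, with the first two digits identifying the state (36 for New York) and the last three identifying the county. The sketch below splits a few GEOIDs from the glimpse() output. The resulting three-digit codes are the kind of value the county argument accepts in place of names. The `geoids` vector is illustrative.

```r
# County GEOIDs as they appear in the county-level data
geoids <- c("36001", "36003", "36005")

state_fips  <- substr(geoids, 1, 2)  # first two characters: state
county_fips <- substr(geoids, 3, 5)  # last three characters: county

data.frame(geoids, state_fips, county_fips)
```

Keeping the codes as character strings preserves the leading zeros, which would be lost if they were converted to numbers.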

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
selected_counties_census <- selected_counties_census %>%
  mutate(
    per_black = round(blackE / totalpopE * 100, 2),
    per_white = round(whiteE / totalpopE * 100, 2),
    per_his = round(hispanicE / totalpopE * 100, 2)
  ) %>%
  relocate(per_black, .after = blackM) %>%
  relocate(per_white, .after = whiteM) %>%
  relocate(per_his, .after = hispanicM)
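
Some census tracts (parks, water, industrial areas) have an estimated population of zero, in which case the division above returns NaN. A hedged sketch of one way to make that explicit, using made-up tract values purely for illustration:

```r
library(dplyr)

# Hypothetical tracts; the second has no residents
tracts <- tibble(
  blackE = c(250, 0),
  totalpopE = c(1000, 0)
)

tracts_pct <- tracts %>%
  mutate(
    # Return NA instead of NaN when there is no population to divide by
    per_black = if_else(totalpopE > 0, round(blackE / totalpopE * 100, 2), NA_real_)
  )

tracts_pct
```

Functions called with na.rm = TRUE then skip these tracts in later summaries.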

# Add readable tract and county name columns using str_extract() or similar

selected_counties_census<- selected_counties_census %>% 
  mutate(
   NAME= str_remove(NAME, "County"),
   NAME= str_remove(NAME, "New York"),
   NAME= str_remove(NAME, ";")
  )

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
# Keeps the top five tracts for context; the first row is the highest

highest_his <- selected_counties_census %>%
  arrange(desc(per_his)) %>%
  slice(1:5)

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
# The averages are weighted by tract population so that larger tracts count for more

avg_dem_county_1 <- selected_counties_census %>%
  group_by(
    county = case_when(
      str_detect(NAME, "Erie") ~ "Erie",
      str_detect(NAME, "Kings") ~ "Kings",
      str_detect(NAME, "Hamilton") ~ "Hamilton"
    )
  ) %>%
  summarise(
    n_tracts = n(),
    avg_black = weighted.mean(per_black, totalpopE, na.rm = TRUE),
    avg_his = weighted.mean(per_his, totalpopE, na.rm = TRUE),
    avg_white = weighted.mean(per_white, totalpopE, na.rm = TRUE)
  )

# Create a nicely formatted table of your results using kable()

avg_dem_county_1 %>%
kable(
      col.names = c("County", "Tracts", "Black (%)", "Hispanic (%)", "White (%)"),
      caption = "Erie, Hamilton, and Kings County Summary",
      digits = 2,
      format.args = list(big.mark = ","))

Erie, Hamilton, and Kings County Summary
County	Tracts	Black (%)	Hispanic (%)	White (%)
Erie	261	12.54	5.96	73.59
Hamilton	4	1.08	2.00	92.06
Kings	805	28.33	18.90	36.07
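
The county averages above weight each tract's percentage by its population. A small worked example with hypothetical tracts shows why this matters: the weighted mean matches the county-wide share (group total divided by population total), whereas a simple mean treats a 100-person tract the same as a 4,000-person tract.

```r
# Two hypothetical tracts of very different size
pop   <- c(4000, 100)
black <- c(400, 50)
pct   <- black / pop * 100          # 10% and 50%

simple_avg   <- mean(pct)                      # treats both tracts equally
weighted_avg <- weighted.mean(pct, w = pop)    # weights by tract population
county_share <- sum(black) / sum(pop) * 100    # share across all residents

c(simple = simple_avg, weighted = weighted_avg, county = county_share)
```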

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)

selected_counties_census<- selected_counties_census %>% 
mutate(
moe_black= round(blackM/blackE* 100, 2), 
moe_white= round(whiteM/whiteE* 100, 2),
moe_his= round(hispanicM/hispanicE * 100, 2))

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement

selected_counties_census <- selected_counties_census %>% 
  mutate(MOE_flag_black = ifelse(moe_black > 15, "flag", "OK")) %>% 
  mutate(MOE_flag_white = ifelse(moe_white > 15, "flag", "OK")) %>%
  mutate(MOE_flag_his   = ifelse(moe_his   > 15, "flag", "OK"))
  
selected_counties_census<- selected_counties_census %>% 
mutate(county= str_extract(NAME, "Erie|Kings|Hamilton")) %>% 
relocate(county, .after=NAME)

# Create summary statistics showing how many tracts have data quality issues

selected_counties_census_stats<- selected_counties_census %>% 
group_by(MOE_flag_black) %>% 
summarise(n_tracts= n())
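
The hint above suggests combining conditions with |. A minimal sketch of a single any-variable flag, on hypothetical MOE percentages (the `any_high_moe` column name is an assumption, not part of the assignment template):

```r
library(dplyr)

# Hypothetical MOE percentages for three tracts
moe_example <- tibble(
  moe_black = c(12, 40, 8),
  moe_white = c(9, 10, 14),
  moe_his   = c(30, 11, 13)
)

moe_flagged <- moe_example %>%
  mutate(
    # TRUE when at least one demographic variable exceeds the 15% threshold
    any_high_moe = moe_black > 15 | moe_white > 15 | moe_his > 15
  )

# Count tracts with and without a data quality issue
moe_flagged %>% count(any_high_moe)
```

A single logical column like this summarizes all three variables at once, which complements the separate per-group flags.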

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues

pattern_all <- selected_counties_census %>%
  pivot_longer(
    cols = c(MOE_flag_black, MOE_flag_white, MOE_flag_his),
    names_to = "group",
    values_to = "MOE_flag"
  ) %>%
  group_by(group, MOE_flag) %>%
  summarise(n_tracts = n(),
  avg_total_pop= mean(totalpopE, na.rm= TRUE),
  avg_black= mean(blackE, na.rm= TRUE),
  avg_white=mean(whiteE, na.rm=TRUE),
  avg_hisp= mean(hispanicE, na.rm=TRUE),
  .groups = "drop")
  


# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
# Note: these steps are combined into the pattern_all summary above, which reports average counts per tract

# Create a professional table showing the patterns

kable(pattern_all,
      col.names = c("group", "MOE_flag", "n_tracts", "avg_total_pop", "avg_black", "avg_white", "avg_hisp"),
      caption = "MOE Patterns Across Race",
      format.args = list(big.mark = ","))

MOE Patterns Across Race
group	MOE_flag	n_tracts	avg_total_pop	avg_black	avg_white	avg_hisp
MOE_flag_black	OK	6	2,403.167	1,873.3333	156.8333	189.3333
MOE_flag_black	flag	1,064	3,403.687	815.0620	1,569.8882	528.3365
MOE_flag_his	OK	2	2,865.000	426.5000	909.0000	983.0000
MOE_flag_his	flag	1,068	3,399.075	821.7350	1,563.1873	525.5805
MOE_flag_white	OK	140	4,183.814	137.7429	3,565.9500	191.7357
MOE_flag_white	flag	930	3,279.794	923.8516	1,260.2892	576.8204
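
Counts alone make the groups hard to compare, so it can help to convert the flag counts in the table above into shares of all 1,070 tracts. A quick sketch using those counts:

```r
# Flagged tract counts from the table above, out of 1,070 tracts in total
flagged <- c(black = 1064, hispanic = 1068, white = 930)
total_tracts <- 1070

share_flagged <- round(flagged / total_tracts * 100, 1)
share_flagged
```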



**Pattern Analysis:** [Describe any patterns you observe. Do certain types of communities have less reliable data? What might explain this?]

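- Nearly every tract is flagged for the Black (1,064 of 1,070) and Hispanic (1,068 of 1,070) estimates, while 140 tracts stay under the 15% threshold for the white estimate.
- Tracts with an OK flag tend to have many residents of the group in question. For example, tracts where the white estimate is OK average about 3,566 white residents, versus about 1,260 where it is flagged. This suggests that small subgroup counts drive high MOE percentages.
- To do: make a histogram of the MOE percentages to see how they are distributed.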


# Part 5: Policy Recommendations

## 5.1 Analysis Integration and Professional Summary

**Your Task:** Write an executive summary that integrates findings from all four analyses.

**Executive Summary Requirements:**
1. **Overall Pattern Identification**:
2. **Equity Assessment**: Which communities face the greatest risk of algorithmic
3. **Root Cause Analysis**: What underlying factors drive both data quality 
4. **Strategic Recommendations**: What should the Department implement to address these systematic issues?

**Executive Summary:**
Across Kings, Erie, and Hamilton counties, more tracts were flagged for a high margin of error (above 15%) on the Black and Hispanic estimates than on the white estimate. Many white estimates were also flagged, but the share was noticeably lower: 930 of 1,070 tracts (about 87%), compared with 1,064 (about 99%) for Black and 1,068 (nearly 100%) for Hispanic estimates.

Factors such as tract size and distance from a major city may affect reliability. Tracts flagged for the white estimate were more concentrated in Erie County than in Kings County (Brooklyn, NYC). I suspect that rural areas with sparser populations have less reliable data, and that smaller tracts have higher margins of error because a single error accounts for most of the variability when the population is small.

Before the Department of Human Services allocates resources for its outreach program, it needs to assess how many people are really in need of the services and how large a margin of error it is willing to accept. An index like the one built in this analysis lets the department compare data uncertainty across geographies. The margin of error should also be cross-referenced with, or normalized by, total population: a 50% margin of error means something very different for a population of 10 people than for a population of 10,000. Because Hamilton County has only four census tracts, I propose that the department invest in having one or two employees take a new count of residents so that funding can be allocated according to where people live.

Overall, data accuracy clearly varies across racial groups and between rural and urban areas. Further investigation is needed to determine whether the variability is tied to these racial groups themselves or arises because many members of these groups live in areas far from an urban hub, where the data are less reliable.
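
To make the point about normalizing by population concrete, the arithmetic below converts the same 50% margin of error into an absolute range for the two population sizes mentioned above. This is only an illustration of the calculation, not a result from the ACS data.

```r
moe_pct    <- 50
population <- c(10, 10000)

# Absolute margin of error implied by the percentage
moe_abs <- population * moe_pct / 100

data.frame(
  population,
  moe_abs,
  lower = population - moe_abs,
  upper = population + moe_abs
)
```

The same percentage corresponds to an uncertainty of a few people in one case and several thousand in the other, which is why the percentage alone is not enough for funding decisions.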


## 5.2 Specific Recommendations

**Your Task:** Create a decision framework for algorithm implementation.


::: {.cell}

```{.r .cell-code}
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
recommendations <- NY_census_cleaned_reliability %>%
  select(NAME, median_incomeE, moe_per_income, reliability) %>%
  mutate(algorithm_rec = case_when(
    reliability == "High Confidence" ~ "Safe for algorithmic decisions",
    reliability == "Moderate" ~ "Use with caution - monitor outcomes",
    reliability == "Low Confidence" ~ "Requires manual review or additional data"
  ))

# Format as a professional table with kable()
kable(recommendations,
      col.names = c("County", "Median Income", "MOE (%)", "Reliability", "Algorithm Recommendation"),
      caption = "Recommendations",
      format.args = list(big.mark = ","))
```

Recommendations
County	Median Income	MOE (%)	Reliability	Algorithm Recommendation
Albany	78,829	2.60	High Confidence	Safe for algorithmic decisions
Allegany	58,725	3.35	High Confidence	Safe for algorithmic decisions
Bronx	47,036	1.89	High Confidence	Safe for algorithmic decisions
Broome	58,317	3.02	High Confidence	Safe for algorithmic decisions
Cattaraugus	56,889	3.13	High Confidence	Safe for algorithmic decisions
Cayuga	63,227	4.33	High Confidence	Safe for algorithmic decisions
Chautauqua	54,625	3.21	High Confidence	Safe for algorithmic decisions
Chemung	61,358	4.03	High Confidence	Safe for algorithmic decisions
Chenango	61,741	4.09	High Confidence	Safe for algorithmic decisions
Clinton	67,097	4.18	High Confidence	Safe for algorithmic decisions
Columbia	81,741	3.39	High Confidence	Safe for algorithmic decisions
Cortland	65,029	4.42	High Confidence	Safe for algorithmic decisions
Delaware	58,338	3.67	High Confidence	Safe for algorithmic decisions
Dutchess	94,578	2.66	High Confidence	Safe for algorithmic decisions
Erie	68,014	1.18	High Confidence	Safe for algorithmic decisions
Essex	68,090	5.27	Moderate	Use with caution - monitor outcomes
Franklin	60,270	4.81	High Confidence	Safe for algorithmic decisions
Fulton	60,557	4.37	High Confidence	Safe for algorithmic decisions
Genesee	68,178	4.57	High Confidence	Safe for algorithmic decisions
Greene	70,294	6.18	Moderate	Use with caution - monitor outcomes
Hamilton	66,891	11.39	Low Confidence	Requires manual review or additional data
Herkimer	68,104	4.79	High Confidence	Safe for algorithmic decisions
Jefferson	62,782	3.64	High Confidence	Safe for algorithmic decisions
Kings	74,692	1.27	High Confidence	Safe for algorithmic decisions
Lewis	64,401	4.16	High Confidence	Safe for algorithmic decisions
Livingston	70,443	3.99	High Confidence	Safe for algorithmic decisions
Madison	68,869	4.04	High Confidence	Safe for algorithmic decisions
Monroe	71,450	1.35	High Confidence	Safe for algorithmic decisions
Montgomery	58,033	3.63	High Confidence	Safe for algorithmic decisions
Nassau	137,709	1.39	High Confidence	Safe for algorithmic decisions
New York	99,880	1.78	High Confidence	Safe for algorithmic decisions
Niagara	65,882	2.67	High Confidence	Safe for algorithmic decisions
Oneida	66,402	3.27	High Confidence	Safe for algorithmic decisions
Onondaga	71,479	1.57	High Confidence	Safe for algorithmic decisions
Ontario	76,603	2.94	High Confidence	Safe for algorithmic decisions
Orange	91,806	1.94	High Confidence	Safe for algorithmic decisions
Orleans	61,069	4.89	High Confidence	Safe for algorithmic decisions
Oswego	65,054	3.26	High Confidence	Safe for algorithmic decisions
Otsego	65,778	4.51	High Confidence	Safe for algorithmic decisions
Putnam	120,970	4.03	High Confidence	Safe for algorithmic decisions
Queens	82,431	1.06	High Confidence	Safe for algorithmic decisions
Rensselaer	83,734	2.27	High Confidence	Safe for algorithmic decisions
Richmond	96,185	2.60	High Confidence	Safe for algorithmic decisions
Rockland	106,173	2.88	High Confidence	Safe for algorithmic decisions
St. Lawrence	58,339	3.47	High Confidence	Safe for algorithmic decisions
Saratoga	97,038	2.26	High Confidence	Safe for algorithmic decisions
Schenectady	75,056	3.03	High Confidence	Safe for algorithmic decisions
Schoharie	71,479	3.96	High Confidence	Safe for algorithmic decisions
Schuyler	61,316	9.49	Moderate	Use with caution - monitor outcomes
Seneca	64,050	5.24	Moderate	Use with caution - monitor outcomes
Steuben	62,506	2.87	High Confidence	Safe for algorithmic decisions
Suffolk	122,498	1.18	High Confidence	Safe for algorithmic decisions
Sullivan	67,841	4.35	High Confidence	Safe for algorithmic decisions
Tioga	70,427	3.99	High Confidence	Safe for algorithmic decisions
Tompkins	69,995	4.01	High Confidence	Safe for algorithmic decisions
Ulster	77,197	4.52	High Confidence	Safe for algorithmic decisions
Warren	74,531	4.74	High Confidence	Safe for algorithmic decisions
Washington	68,703	3.41	High Confidence	Safe for algorithmic decisions
Wayne	71,007	3.10	High Confidence	Safe for algorithmic decisions
Westchester	114,651	1.56	High Confidence	Safe for algorithmic decisions
Wyoming	65,066	3.38	High Confidence	Safe for algorithmic decisions
Yates	63,974	5.84	Moderate	Use with caution - monitor outcomes

:::

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

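Counties suitable for immediate algorithmic implementation: The 56 counties rated High Confidence (MOE below 5%) in the table above, such as Erie, Kings, and Queens.
Counties requiring additional oversight: Essex, Greene, Schuyler, Seneca, and Yates have moderate-confidence data (MOE between 5% and 10%). Per the table above, results for these counties should be used with caution and their outcomes monitored.
Counties needing alternative approaches: Hamilton County had the least reliable income estimate (MOE of 11.39%, Low Confidence) and requires manual review or additional data.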

Questions for Further Investigation

[I would like to explore, ]

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on [07-Feb-26]

Reproducibility:
- All analysis conducted in R version [your version]
- Census API key required for replication
- Complete code and documentation available at: [your portfolio URL]

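Methodology Notes: All analysis choices are documented in the code and can be reproduced. Reliability categories are based on fixed MOE-percentage thresholds (below 5%, 5-10%, and above 10% for county income; above 15% for the tract-level demographic flags). Results would change if different thresholds were chosen. For example, a stricter maximum margin of error would move more counties out of the High Confidence category.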
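
To illustrate how the threshold choice changes results, this sketch reclassifies six county MOE percentages from the recommendations table under the lab's cutoffs (5% and 10%) and under a stricter, purely hypothetical pair (4% and 8%). The `classify()` helper is an illustrative assumption.

```r
library(dplyr)

# MOE percentages for six counties, taken from the recommendations table
moe <- c(Erie = 1.18, Franklin = 4.81, Orleans = 4.89,
         Greene = 6.18, Schuyler = 9.49, Hamilton = 11.39)

# Hypothetical helper: label each value given two cutoffs
classify <- function(x, high_cut, low_cut) {
  case_when(
    x < high_cut ~ "High Confidence",
    x <= low_cut ~ "Moderate",
    x > low_cut ~ "Low Confidence"
  )
}

lab_counts    <- table(classify(moe, 5, 10))
strict_counts <- table(classify(moe, 4, 8))

lab_counts
strict_counts
```

Under the stricter cutoffs, Franklin and Orleans drop from High Confidence to Moderate and Schuyler drops to Low Confidence, even though the underlying data are unchanged.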

Limitations: [Note any limitations in your analysis - sample size issues, geographic scope, temporal factors, etc.] This analysis lacks a more robust comparison between urban and rural areas, which is needed to understand whether the inequity in data quality arises from geographical location or from racial group.

Submission Checklist

Before submitting your portfolio link on Canvas:

All code chunks run without errors
All “[Fill this in]” prompts have been completed
Tables are properly formatted and readable
Executive summary addresses all four required components
Portfolio navigation includes this assignment
Census API key is properly set
Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html