Blog

Automation for Accountability

Common Knowledge
September 30, 2022
Automation for Accountability
Sample of the Violation Tracker Worker running in RStudio, Source: The Commons

The Commons first built the Violation Tracker 1.0 map application for Chesapeake Legal Alliance (CLA) in Spring of 2021. Developed to daylight facilities and construction sites in Maryland with wetlands, stormwater, and wastewater discharge permits, the map synthesizes inspection records and grades permit holders based on compliance, non-compliance, violation and enforcement history. When the map was built, the Maryland Department of the Environment (MDE) stored data on their own site. Users noted the interface was severely deficient in usability, and for our purposes, lacked programmatic access to the data through an Application Programming Interface, or API. This makes updating the map difficult. New inspection and site data required a manual download, passing it through a scoring script, and then updating the map application. This was no small data task - roughly ten new facilities and hundreds of inspection records are added each month. Thanks to efforts from CLA and other environmental organizations, the data landscape has changed since we built the map, enabling us to make dramatic improvements to data freshness with an automated workflow for data updates. Before we dive into the workflow, let's first take a moment to understand the legislative history that moved MDE from paper permits to programmatic data access.

Excerpt from draft of Maryland General Assembly bill HB 204, Source: Maryland General Assembly

Policy Supporting Data

Improvements in permit accessibility first began in 2015, when the federal government required all permits covered under the Clean Water Act to be submitted electronically, rather than paper copies. Further advancements were made in 2021, when CLA helped craft and pass Maryland General Assembly bill HB 204, forcing MDE to increase data availability and reporting of inspection and compliance data. As a consequence, MDE shifted data hosting from its own site to the Maryland Open Data Portal - a programmatically accessible data clearinghouse for state data. These improvements in reporting are particularly important given MDE’s documented failures to properly staff and maintain their enforcement arm and the litany of outdated permits, known as zombie permits. A prime example is Valley Proteins, a chicken waste rendering plant on Maryland’s Eastern Shore, with a wastewater discharge permit that expired back in 2006. Community outrage spurred an MDE investigation finding “overflows from treatment tanks”, leaking chlorine, and visible discharge in nearby creeks. In response, Valley Proteins faced multiple lawsuits and enforcement actions from MDE. Better data and oversight could have uncovered this problem sooner, however, legislative efforts over the past decade have paid dividends, moving from data only accessible by a Freedom of Information Act Request (FOIA), to a programmatically accessible database. This enables us to create a workflow to automatically pull in data daily to the Violation Tracker, increasing much needed public accountability and oversight. So how exactly did we modify the Violation Tracker, and what is the process to move data from a state portal, to a custom mapping application? The answer is, you guessed it: an automated workflow.

Wastewater runoff from Valley Proteins, Source: Maryland Department of the Environment

Working out Workflows

Automated workflows enable frequent or computationally intensive data tasks to be run on a scheduled basis. In short, an automated workflow or ‘worker’ is an application which ingests data, performs a task, and outputs the data - without human intervention. For example, if you wanted a map of the U.S. with analysis of USGS stream data and NOAA precipitation data, a worker would be helpful. You could fetch the data and perform analysis when a user interacts with the map or on application startup, but fetching and analyzing data from the country’s massive array of 26,000 NOAA and 11,000 USGS stations quickly becomes computationally intensive. A worker could solve this problem, performing analysis daily, and provide the data for the front-end application for a seamless user experience.

As workflows go, the Violation Tracker worker is simple, but represents a significant improvement in the front-end application, ensuring greater accountability of MDE and wastewater discharge permit holders in Maryland. This article provides a high-level walk through of the workflow and applications required. If you would like to explore the code involved, see here. To learn more about the Violation Tracker, see our original write-up here, including information about disproportionate non-compliance and unresolved violations in environmentally vulnerable communities.

Diagram of Violation Tracker Worker

The Central Node

The process starts with the central node, a dockerized R script, running in Airflow. That might be a mouthful (like it was for me when I started this project!), so let us unpack it. The code itself is written in R, the statistical programming language; much of this code was reused from the original Violation Tracker scoring script. The R script is then bundled into a Docker container - a self-encapsulated environment containing all the code necessary to run the script including packages, dependencies, and base R code. Docker containers allow for applications to run separately from their operating system, so whether you are using Mac, Windows, Amazon Web Services (AWS), or in our case Airflow, a dockerized application will reliably perform in its container. Originally created by Airbnb and made open source, Airflow allows applications to run at scheduled intervals following a DAG or Directed Acyclic Graphs. Another helpful aspect of Airflow is its ability to handle environment variables. The worker’s environment variables are secret keys for MapBox, Water Reporter, and AWS. It is notoriously bad practice to include secrets in the code itself, so Airflow requests the keys from AWS’s Secrets Manager and passes them to the container, ensuring security and preventing unauthorized access to our data.

View of Maryland Open Data Portal

Supporting Roles

When the worker runs in Airflow the script first fetches compliance, enforcement, and violation data from the Open Maryland Data Portal operated by Socrata technology. The request is made easy by Socrata’s RSocrata package, allowing us to request each dataset with just a single line of code. However, a major complication is the data does not include latitude and longitude information (despite this data being submitted with every permit), so we rely on MapBox’s geocoder API to locate our addresses. Some addresses are nice and clean, others not, and messy ones cannot be geocoded accurately or at all.

Good

Bad

To avoid redundant geocoding, we store previously attempted geocoded addresses in AWS, successful or not. The worker fetches this information, compares it to the Open Maryland Data Portal data, and only requests location data for newly added addresses from MapBox. It might only cost $0.05 per geocode, but if we geocode  ~10,000 locations every day it would add up to $182,500 a year! Our method needs to geocode only ten new locations a month, about $6 a year.

In addition to using the Mapbox API for addresses, we use the Water Reporter API’s handy watershed request function. This takes our latitude and longitude points and returns the location’s watershed for displaying on the map. New permit locations are then appended to the AWS files with location and watershed data for use next time the worker runs. Finally, the raw permit information from the Open Maryland Data Portal is passed through the scoring script written for the original application. This process gives each permit holder a grade based on their permit history. That data gets stored in AWS for use by the front end application. From start to finish, the worker takes roughly 5 minutes and costs less than a cent in computing and geocoding costs. Comparatively, a manual update would take at least an hour and could only be performed monthly or quarterly. Manually updating also introduces the potential for human error, whereas the worker has been running for two months at the time of writing, without a single error.

Pulling it Together

Key for the Violation Tracker Map

For the front-end mapping application, only minor changes were required to incorporate the data hosted in AWS. While creating the worker, we made sure to follow the same data model used in the original application, so variable names and types are identical. Small copy alterations were made to reflect data freshness and sourcing from the Open Maryland Data Portal. One upside of storing the data in AWS is the application no longer requires the ~5MB file of 30,000+ permit records, noticeably improving application load times.

It is important to remember this would not be possible if not for the tireless efforts of CLA, advancements to the Clean Water Act and active lobbying from environmental organizations and concerned citizens. To learn more about CLA and read their writeup about the Violation Tracker 2.0, see here. If you have a wastewater discharge permit holder or nuisance construction site nearby, you can follow along with the now continuously up-to-date Violation Tracker.