DisclosureDisco

DisclosureDisco is a data pipeline developed as part of the Digital Democracy Project at the Institute for Advanced Technology and Public Policy (IATPP). It automates the retrieval, parsing, and storage of financial disclosure data from the California Fair Political Practices Commission (FPPC) database, making these transparency records easier for the public to access.

Technologies Used

  • Database: MySQL for structured storage and efficient querying.
  • Web Scraping: Selenium for automated data retrieval from web sources.
  • API Integration: REST APIs to fetch structured financial data.
  • Parsing: PyPDF for extracting text from PDF disclosures.

Features

  • Automated extraction of financial disclosures from the FPPC database.
  • Efficient parsing and transformation of raw data into a structured format.
  • Database integration for organized and scalable data storage.
  • Support for both web-based and PDF-based data sources.
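
As a rough illustration of the parsing and storage features above, the sketch below turns a few lines of extracted text into structured rows and loads them into a table. Everything in it is an assumption made for illustration: the "filer | position | amount" line layout, the disclosures table and its columns, the parse_line and store helpers, and the sample rows are all hypothetical. SQLite's standard-library driver stands in for MySQL so that the example runs without a database server.

```python
import sqlite3
from typing import Iterable, NamedTuple, Optional


class Disclosure(NamedTuple):
    """Hypothetical record shape; the real schema is not described here."""
    filer: str
    position: str
    amount_usd: float


def parse_line(line: str) -> Optional[Disclosure]:
    """Parse one 'filer | position | amount' line; return None if malformed."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 3 or not all(parts):
        return None
    try:
        # Accept amounts written as "$1,250.00" or "500".
        amount = float(parts[2].replace("$", "").replace(",", ""))
    except ValueError:
        return None
    return Disclosure(parts[0], parts[1], amount)


def store(conn: sqlite3.Connection, records: Iterable[Disclosure]) -> int:
    """Create the table if needed, insert the records, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS disclosures ("
        "filer TEXT NOT NULL, position TEXT NOT NULL, amount_usd REAL NOT NULL)"
    )
    rows = list(records)
    # Parameterized inserts keep scraped text out of the SQL string itself.
    conn.executemany("INSERT INTO disclosures VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up sample input, not real FPPC data.
raw_lines = [
    "Example Filer A | City Council Member | $1,250.00",
    "malformed line without separators",
    "Example Filer B | Planning Commissioner | 500",
]
parsed = [r for r in map(parse_line, raw_lines) if r is not None]

conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
inserted = store(conn, parsed)
total = conn.execute("SELECT SUM(amount_usd) FROM disclosures").fetchone()[0]
print(inserted, total)
```

Malformed lines are dropped rather than raising an error, which leaves them to be caught in a later manual review; a real pipeline would more likely log them for that purpose.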

Challenges

One of the main challenges was handling inconsistent disclosure formats. PDFs ranged from structured digital documents to scanned, image-based files. Extracting text from structured PDFs was straightforward, but scanned documents required additional preprocessing with OCR. Handling dynamic website elements during scraping and optimizing parsing for large volumes of data also pushed us to refine our approach to data engineering and automation.
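
One common way to handle the split between digital and scanned PDFs is to try direct text extraction first and send a page to OCR only when extraction yields almost no text. The sketch below shows just that routing decision, assuming the per-page text has already been extracted by a PDF library. The needs_ocr and pages_needing_ocr helpers, the 20-character threshold, and the sample pages are illustrative assumptions, not the project's actual logic.

```python
from typing import List


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Return True when a page yielded too little real text to trust.

    Only letters and digits are counted, so pages that extract as
    whitespace or stray punctuation are treated as scans.
    """
    meaningful = sum(1 for ch in page_text if ch.isalnum())
    return meaningful < min_chars


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return the zero-based indexes of pages that should be routed to OCR."""
    return [i for i, text in enumerate(page_texts) if needs_ocr(text, min_chars)]


# Made-up sample: page 0 has a digital text layer, page 1 is a scan whose
# extraction returned only whitespace, page 2 contains only stray noise.
sample_pages = [
    "Schedule A: Investments. Filer reports stock holdings valued over $2,000.",
    "   \n\n",
    "-- 3 --",
]
ocr_queue = pages_needing_ocr(sample_pages)
print(ocr_queue)
```

The threshold is a tuning knob: set too low, noisy scans slip through as "text"; set too high, short but valid pages get sent through OCR unnecessarily.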

Impact

We significantly reduced the time clients need to analyze financial disclosure documents by automating data retrieval and parsing. Automation cut the manual effort of extracting relevant information and streamlined the analysis process. The remaining work in the pipeline is primarily human verification to ensure accuracy and completeness.