Methodology

Sources

Each of the 21 oversight bodies has a dedicated automated collector. It fetches the body’s published report list, identifies new PDFs not yet on file, and records them with their metadata. Most collectors run on a weekly or monthly schedule. Recent ingest activity is visible in the site changelog.
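The "identify new PDFs not yet on file" step can be sketched as a set difference over normalized URLs. This is only an illustration, not the site's actual collector: the `ListedReport` shape, the function names, and the choice to compare by URL (rather than, say, a file hash) are all assumptions.

```python
from dataclasses import dataclass
from urllib.parse import urljoin, urlsplit, urlunsplit


@dataclass(frozen=True)
class ListedReport:
    """One entry scraped from a body's report list (hypothetical shape)."""
    title: str
    pdf_url: str


def normalize_url(url: str, base: str) -> str:
    """Resolve a relative link against the list page and drop any #fragment."""
    parts = urlsplit(urljoin(base, url))
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, parts.query, "")
    )


def find_new_reports(listed, known_urls, base):
    """Return listed reports whose normalized PDF URL is not yet on file."""
    seen = set(known_urls)
    new = []
    for report in listed:
        url = normalize_url(report.pdf_url, base)
        if url not in seen:
            seen.add(url)  # also skips repeats within a single listing
            new.append(ListedReport(report.title, url))
    return new
```

Comparing normalized URLs keeps the check cheap, but it would treat a report re-published at a new address as new; a real collector might need an additional check for that case.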

All source URLs resolve to the originating body’s own website. caoversight.org does not host PDFs; every report link points to the body’s official source.

Storage

Ingested reports are stored in a structured database. Each record holds the report’s metadata and the URL of the source PDF.

OCR and text extraction

When a PDF contains embedded text (as most modern reports do), the text is extracted directly from the PDF structure. When embedded text is absent or low-quality, as is common with scanned reports from the 1990s and earlier, the PDF is processed with Surya OCR, an open-source document OCR model that runs on UnGovr’s own servers. Each report receives a text quality score after extraction; reports scoring below a set threshold are flagged as unreliable.
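The decision described above (use embedded text when it is good enough, otherwise fall back to OCR, then score the result) might look roughly like the sketch below. Everything specific here is an assumption: the scoring heuristic, the `MIN_QUALITY` value, and the two injected callables `read_embedded` and `run_ocr`, which stand in for a PDF text reader and for Surya OCR without reproducing either tool's real interface.

```python
import re

MIN_QUALITY = 0.6  # hypothetical threshold; the site's actual value is not stated

# A token counts as "word-like" if it is letters (with ' or -) plus optional
# trailing punctuation. This is an illustrative heuristic only.
_WORDLIKE = re.compile(r"[A-Za-z][A-Za-z'\-]*[.,;:!?]?")


def text_quality(text: str) -> float:
    """Share of whitespace-separated tokens that look like words (0.0 to 1.0)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(1 for t in tokens if _WORDLIKE.fullmatch(t))
    return wordlike / len(tokens)


def extract_text(pdf_path, read_embedded, run_ocr):
    """Prefer embedded text; fall back to OCR when it is absent or low-quality."""
    embedded = read_embedded(pdf_path)
    score = text_quality(embedded)
    if score >= MIN_QUALITY:
        return {"text": embedded, "method": "embedded",
                "score": score, "unreliable": False}
    ocr_text = run_ocr(pdf_path)
    ocr_score = text_quality(ocr_text)
    return {"text": ocr_text, "method": "ocr",
            "score": ocr_score, "unreliable": ocr_score < MIN_QUALITY}
```

Passing the readers in as arguments keeps the sketch independent of any particular PDF or OCR library and makes the fallback logic easy to exercise with stub functions.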

What “with extracted text” means

On each source page, the report list notes how many reports have extracted text. A report is shown as “without text” when either extraction failed or the PDF is a scanned image for which OCR did not produce reliable output. Most reports published after 2010 have full extracted text and are searchable via the site search. Older scanned reports may appear in search results but with limited or no text content.

What we do not do

caoversight.org does not editorialize report content. It does not summarize findings, assign conclusions, or track whether recommendations were implemented. (Recommendation tracking is handled separately by civilgrandjury.org for civil grand jury reports specifically.) Links always point to the originating body’s PDF, not to any cached or mirrored copy.

Coverage gaps

The Local pilot category currently covers a sample of county-level bodies (selected LAFCO districts and county auditors). Coverage in this category expands over time as additional crawlers are built and validated. Statewide bodies — the Audit, Policy, Inspector General, and Enforcement categories — aim for complete historical coverage back to the earliest publicly available report.

Some bodies publish documents other than formal reports (press releases, meeting minutes, informational bulletins). These are generally excluded; only documents the body itself designates as a report or formal findings are ingested.

Data freshness

The site rebuilds daily, but each crawler runs on its own schedule (typically weekly or monthly). A newly published report may therefore take up to a few weeks to appear, even though the site itself is rebuilt every day. The footer on every page shows the date of the last build.
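To make the delay concrete, the following sketch estimates when a report would first be picked up, given a crawler's last run and its interval. It assumes a fixed interval in days and that a crawl sees anything published on or before the day it runs; the function names are hypothetical and real schedules may differ.

```python
from datetime import date, timedelta


def next_crawl_on_or_after(published: date, last_crawl: date,
                           interval_days: int) -> date:
    """First scheduled crawl date that falls on or after the publication date."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    crawl = last_crawl
    while crawl < published:
        crawl += timedelta(days=interval_days)
    return crawl


def ingest_lag_days(published: date, last_crawl: date,
                    interval_days: int) -> int:
    """Days between publication and the crawl that would first ingest the report."""
    crawl = next_crawl_on_or_after(published, last_crawl, interval_days)
    return (crawl - published).days
```

For example, with a 30-day interval and a last crawl on 2024-03-01, a report published on 2024-03-05 would be ingested on 2024-03-31, a lag of 26 days; the daily build on or after that date would then show it.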

How to flag a missing report

If you know of a report published by one of the 21 indexed bodies that does not appear on its source page, email sean@ungovr.org or use the support form. Mention that you found the gap on caoversight.org and include the body name, report title, and URL if available.