Inspiration
Massachusetts court records are public, but actually using them is not. The MassCourts.org database holds over 9 million cases, yet accessing it requires SQL knowledge, familiarity with the schema, and patience to navigate messy, denormalized data stored in JSON blobs. For public defenders, journalists, researchers, and reform advocates, this creates a gap between data that is technically public and data that is practically accessible. We built MassCourtsPlusPlus to close that gap.
What it does
MassCourtsPlusPlus lets anyone search Massachusetts court data using plain English. Type a question like "How many assault and battery cases were filed in 2026?" or "Compare criminal filings between Worcester District Court and BMC Central" and the system translates it into SQL, runs it against a live database of 55,000+ cases, and returns results as interactive tables and charts.
Every query comes with full transparency: the generated SQL is shown alongside a plain-language explanation, a confidence score, and a Method Card documenting assumptions and limitations. Privacy is enforced through k-anonymity suppression (groups under 10 are automatically hidden) and a strict SQL validation layer that blocks unsafe operations.
Users can also click into individual cases to view detailed information including charges, docket entries, parties, events, dispositions, and judgments parsed from the underlying JSON data.
How we built it
- Frontend: Next.js 16 with App Router, React 19, Tailwind CSS 4, Recharts for charts, TanStack Table for data display
- Backend: Next.js API routes handling the full query pipeline server-side
- AI: OpenAI GPT-4o-mini translates natural language to MySQL with a schema-aware system prompt containing the full table DDL, court division reference, charge abbreviations, and query rules
- Database: Direct read-only connection to
civica_courtdocs.cases_masscourts_orgon MySQL 5.7 with connection pooling - Security: Multi-layer SQL validator (SELECT-only, table allowlist, blocked keywords, mandatory LIMIT, mandatory date filter, no comments, no stacked queries, string-literal-aware keyword checking) plus a read-only database user as the final safety net
- Privacy: K-anonymity enforced at two layers (LLM-generated HAVING clauses and server-side suppression)
- JSON extraction: Custom workaround for MySQL 5.7 BLOBs using CONVERT(column USING utf8mb4) + JSON_EXTRACT + numbers cross-join to unpack JSON arrays that the database stores in binary BLOB columns
Challenges we ran into
- BLOB columns instead of JSON type: The database stores structured JSON data in BLOB columns with binary charset. MySQL's JSON functions refuse to operate on binary data, throwing "Cannot create a JSON value from a string with CHARACTER SET binary." We worked around this with CONVERT(column USING utf8mb4) to re-interpret the bytes as text before applying JSON functions.
- MySQL 5.7 limitations: No JSON_TABLE (MySQL 8.0+), so we cannot cleanly unpack JSON arrays into rows. Instead, we use a CROSS JOIN with a generated numbers table (0-29) and CONCAT to build array index paths manually.
- LLM inconsistency: GPT-4o-mini sometimes generates slightly different SQL for the same question. Court division names are long and specific ("Suffolk County Probate and Family Court"), and the model would guess abbreviated values that don't exist. We solved this by adding a complete court division reference to the system prompt and enforcing LIKE-based geographic matching.
- Validator false positives: The word "revoke" inside a docket search pattern like '%motion to revise/revoke bail%' triggered the REVOKE blocked keyword check. Fixed by stripping string literals before running keyword detection.
- Domain terminology mismatch: Users ask about "sustained" motions, but Massachusetts courts use "ALLOWED" and "DENIED." We added terminology translation rules to the LLM prompt.
Accomplishments that we're proud of
- A non-technical user can type a plain English question and get verified, privacy-preserving results from a live database of 55,000+ court cases in under 10 seconds
- Every single result is fully transparent and reproducible: the SQL, the explanation, the confidence score, the assumptions, the limitations, and the data source are all visible
- The SQL validation layer catches injection attempts, unsafe operations, and schema violations before any query touches the database
- We extracted structured data from JSON blobs stored in BLOB columns on MySQL 5.7 without any schema changes to the source database
- CSV export includes the Method Card and generated SQL, so researchers can reproduce and verify any result independently
What we learned
- Prompt engineering for SQL generation is iterative and domain-specific. Generic instructions produce generic (often wrong) SQL. The system prompt grew to include charge abbreviations, court division mappings, name format conventions, motion terminology, and explicit examples for comparison queries, time granularity, and JSON blob access patterns.
- Privacy protection in aggregate data is nuanced. K-anonymity suppression makes sense for grouped data but is a false positive for single aggregate totals. We had to add logic to distinguish between the two.
- The gap between "data is public" and "data is accessible" is enormous. The raw database is technically available, but without a tool like this, it requires expert-level SQL and deep familiarity with court data conventions.
What's next for MassCourtsPlusPlus
- Expand the time scope: Currently limited to 2026 filings. The full database has 9M+ cases going back years, which would enable longitudinal analysis of trends and reform outcomes.
- Upgrade to MySQL 8.0+: Would unlock JSON_TABLE for clean array unpacking, eliminating the cross-join workaround and enabling richer docket/charge analysis.
- Query history and saved searches: Let users build on previous questions and track changes over time.
- Comparative dashboards: Pre-built views for common reform questions like court-by-court case volume, motion success rates, and charge distribution patterns.
- API access: Expose the query engine as a public API so other civic tech tools can build on top of the same transparent, privacy-preserving infrastructure.
Built With
- mysql
- next.js
- node.js
- openai-gpt-4o-mini
- react
- recharts
- tailwind-css
- tanstack-table
- typescript
- zod
Log in or sign up for Devpost to join the conversation.