feat: expand document corpus (DOC 554, DOCX 2850, MD 2501, RTF 153, ODT 1208)#149

Merged
StephanMeijer merged 4 commits into main from feature/document-corpus-expansion
Apr 3, 2026

@StephanMeijer
Contributor

Summary

Expands the DocSpec test corpus across 5 formats by adding 21 new source groups with ~5,500 new files.

Additions by Format

| Format | Before | After | New Sources |
|---|---|---|---|
| DOC | 155 | 554 | libreoffice (290), apache-tika (51), aspose-words (4), abiword (67) |
| DOCX | 204 | 2,850 | python-docx (46), apache-tika (66), docxcompose (88), docx4j (145), aspose-words (297), libreoffice (2,081) |
| Markdown | 159 | 2,501 | cmark-gfm (31), markdown-it (52), remark (106), markdownlint (120), hugo-docs (986), mkdocs-material (94), pandoc-md (1,025) |
| RTF | 118 | 153 | pandoc additions, rtfparserkit (23), openpreserve (5) |
| ODT | 944 | 1,208 | collabora (100), abiword (193) |

Quality Checks

  • ./bin/check-duplicates.sh — zero duplicates (SHA256-verified across all sources)
  • ./bin/validate-licenses.sh — all SPDX identifiers valid
  • ./bin/generate-attribution.sh — ATTRIBUTION.md regenerated cleanly
  • ✅ All binary files (DOC, DOCX, RTF, ODT) tracked via Git LFS
  • ✅ DOC format integrity: OLE2 magic bytes verified on all files
  • ✅ RTF format integrity: all files start with {\rtf
  • ✅ DOCX/ODT format integrity: valid ZIP with required entry files
  • ✅ No files >10MB; no binary files <500 bytes; no markdown files <100 bytes
  • ✅ All existing corpus files untouched (verified via git diff against base commit)
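
The duplicate and format-integrity checks listed above could be sketched roughly as follows. This is an illustrative Python sketch, not the contents of the `./bin/` scripts. The function names `check_format` and `find_duplicates` are hypothetical, and the "required entry files" for DOCX and ODT are an assumption, since the PR does not name them.

```python
import hashlib
import io
import os
import zipfile
from collections import defaultdict

# Header signature of OLE2 (Compound File Binary) containers such as .doc.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
RTF_PREFIX = b"{\\rtf"

# Assumption: the PR does not say which entries count as "required".
REQUIRED_ZIP_ENTRIES = {
    ".docx": {"[Content_Types].xml", "word/document.xml"},
    ".odt": {"mimetype", "content.xml"},
}


def check_format(name: str, data: bytes) -> bool:
    """Return True if data passes the structural check for name's extension."""
    ext = os.path.splitext(name)[1].lower()
    if ext == ".doc":
        return data.startswith(OLE2_MAGIC)
    if ext == ".rtf":
        return data.startswith(RTF_PREFIX)
    if ext in REQUIRED_ZIP_ENTRIES:
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                # Every required entry must be present in the archive.
                return REQUIRED_ZIP_ENTRIES[ext] <= set(zf.namelist())
        except zipfile.BadZipFile:
            return False
    return True  # no structural check defined here (e.g. Markdown)


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents share a SHA-256 digest."""
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]
```

Hashing file contents rather than comparing names catches the same test document being vendored by two upstream projects under different paths, which is likely with overlapping sources such as libreoffice and collabora.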

Licenses

All sources use open licenses:

  • Apache-2.0: apache-tika, libreoffice, docx4j, rtfparserkit, hugo-docs
  • MIT: python-docx, docxcompose, aspose-words, mkdocs-material, markdown-it, markdownlint, remark
  • MPL-2.0: libreoffice (DOCX + ODT), collabora
  • GPL-2.0-or-later: pandoc-md, abiword
  • BSD-2-Clause: cmark-gfm
  • CC0-1.0: openpreserve
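
A license gate over the attribution entries might look like the sketch below. It assumes each entry is a dict with `source` and `license` keys, which is a hypothetical schema since the layout of ATTRIBUTION.json is not shown here. It also checks against the identifiers listed above rather than the full SPDX list, so it may be stricter than `./bin/validate-licenses.sh`.

```python
# SPDX identifiers taken from the Licenses list in this PR.
ALLOWED_LICENSES = {
    "Apache-2.0",
    "MIT",
    "MPL-2.0",
    "GPL-2.0-or-later",
    "BSD-2-Clause",
    "CC0-1.0",
}


def unlisted_licenses(entries: list[dict]) -> list[tuple[str, str]]:
    """Return (source, license) pairs whose license is not in the allowlist.

    Assumes a hypothetical entry shape: {"source": ..., "license": ...}.
    """
    return [
        (entry["source"], entry["license"])
        for entry in entries
        if entry["license"] not in ALLOWED_LICENSES
    ]
```

An empty result means every source carries one of the listed licenses; any pair returned is a candidate for review or removal before merging.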

Notes

  • OASIS ODF TC source was evaluated but removed — license (OASIS RF on Limited Terms) does not clearly permit redistribution
  • LibreOffice DOCX contribution (2,081 files) comes from the comprehensive sw/qa/ test suite across all OOXML import/export test cases
  • Commits will be squash-merged to keep history clean

Added 21 new attribution entries for Wave 2 corpus expansion:
- DOC: libreoffice (290), apache-tika (51), aspose-words (4)
- DOCX: python-docx (46), apache-tika (66), docxcompose (88), docx4j (145),
        aspose-words (297), libreoffice (2081)
- Markdown: cmark-gfm (31), markdown-it (52), remark (106), markdownlint (120),
            hugo-docs (986), mkdocs-material (94), pandoc-md (1025)
- RTF: rtfparserkit (23), openpreserve (5)
- ODT: collabora (100), abiword (193), oasis-odf-tc (81)

Total: 5538 new files across 21 source groups
ATTRIBUTION.json: 16 -> 37 entries
ATTRIBUTION.md: regenerated
@StephanMeijer force-pushed the feature/document-corpus-expansion branch from 43ae8f5 to 49c08e5 on April 3, 2026 19:31
@StephanMeijer merged commit 50c5c83 into main on Apr 3, 2026
2 checks passed