Open Research Dataset
The largest open corpus of classified Word documents
Real .docx files from the public web, classified into 10 document types and 9 topics across 46+ languages.
Built by SuperDoc — DOCX editing and tooling.
Document Types
Topics
Languages
Browse Documents
How to download
Use the filters above to select a subset, then click Download manifest to get a text file with one URL per line.
# Download every file listed in the manifest into ./corpus/
wget -i manifest.txt -P ./corpus/
# Or with curl (files are saved to the current directory)
xargs -n 1 curl -O < manifest.txt
# Or fetch a filtered manifest directly from the API instead of using the button
curl "https://api.docxcorp.us/manifest?type=legal&lang=en&min_confidence=0.8" -o manifest.txt
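For larger subsets it can help to clean the manifest first and to keep going when a single file fails. The following is a minimal sketch, not an official script. It assumes only what is stated above, that the manifest is a text file with one URL per line. The function names `clean_manifest` and `download_manifest` are hypothetical helpers introduced for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print only the http(s) URLs from a manifest,
# dropping blank lines, other text, and duplicate entries.
clean_manifest() {
  # grep exits non-zero when nothing matches; "|| true" keeps that
  # case from aborting the script under "set -e" and pipefail.
  { grep -E '^https?://' "$1" || true; } | sort -u
}

# Hypothetical helper: download every URL in the manifest into a
# target directory, logging failures to stderr and continuing.
download_manifest() {
  local manifest="$1" dest="$2"
  mkdir -p "$dest"
  clean_manifest "$manifest" | while IFS= read -r url; do
    # -f: fail on HTTP errors, -sS: quiet but show errors,
    # -L: follow redirects, -O: keep the remote file name.
    (cd "$dest" && curl -fsSL -O "$url") || echo "failed: $url" >&2
  done
}
```

After sourcing the file that contains these functions, a run might look like `download_manifest manifest.txt ./corpus`. Failed URLs are printed to stderr, so they can be collected with `2> failed.log` and retried later.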
| Document | Type | Topic | Lang | Confidence |
|---|---|---|---|---|