All patent document formats from 1976 to the present parse into a single common Patent model. Validation is applied to many fields; invalid values are logged together with the document fragment they came from. Flexibility and completeness are stressed over efficiency.
| Format | Prefix | Years | Revisions | Documentation |
|---|---|---|---|---|
| Green Book (freetext key values; multi-line indented values) | pftaps | 1976-2001 | revisions until 1997 | Greenbook Documentation |
| Red Book SGML | pg | 2001-2004 | ST32-US-Grant-025xml.dtd | Redbook SGML v1.9; Redbook SGML v2.4; WIPO ST. 32 - SGML Standard |
| Red Book PAP XML | pa | 2002-2004 | Published Patent Applications (PAP) from 2002-2004 | |
| Red Book XML | ipg, ipa | 2004-Current | Grants: Applications: | Redbook XML Documentation; WIPO ST. 36 - XML Standard |
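
The prefix column above is enough to tell the formats apart from a bulk file name. The following is a minimal sketch of that idea, not the library's own detection code (the library provides `PatentDocFormatDetect` for this). The class `FormatByPrefix` and its `detect` method are hypothetical names introduced for illustration, and the sketch assumes it is given a bare file name without directories.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a bulk file name to the format listed in the table above. */
final class FormatByPrefix {
    // Prefixes and format names are copied from the table; nothing else is assumed.
    private static final Map<String, String> PREFIXES = new LinkedHashMap<>();
    static {
        PREFIXES.put("pftaps", "Green Book");
        PREFIXES.put("ipg", "Red Book XML");
        PREFIXES.put("ipa", "Red Book XML");
        PREFIXES.put("pg", "Red Book SGML");
        PREFIXES.put("pa", "Red Book PAP XML");
    }

    /** Returns the format name for a bare file name, or "Unknown" if no prefix matches. */
    static String detect(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : PREFIXES.entrySet()) {
            if (name.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "Unknown";
    }
}
```

For example, `FormatByPrefix.detect("ipa_corpusApps_2005.zip")` returns `"Red Book XML"`.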
A short list of some of the XML variations handled and improvements made by the Patent Document Parser:
| Field | Description |
|---|---|
| parties | variation: us-parties |
| applicant | variation: us-parties/us-applicants/us-applicant |
| references-cited | variation: us-references-cited |
| citation | variation: us-citation |
| inventor | variation: Applicant with attribute "app-type" value "applicant-inventor" |
| address/street | variation: address-1 address-2 |
| agent | fix: if missing use "correspondence-address" field |
| description | fix: correspond with the non-XML patent versions; improvement, since individual sections are often searched on: break the description into individual sections using XML processing instructions |
| claim | improvement: identify independent and dependent claims; capture dependent claim hierarchy |
| IPC classification | variations: classification-ipc and classification-ipcr; the first is flat, the other is separated into sections |
| classification | normalization: CPC, IPC and USPC patent classifications |
| documentId / patentId | normalization, including removal of leading zero padding; padding is currently added to patent ids shorter than 8 digits, and in the near future patent ids may increase to 13 digits |
| country | improvement: mapping of country codes to country names, covering current codes as well as historic codes used before 1978 or individually dropped or changed since |
| address and name | not fixed; look out for switched values: within a name, the first name, last name or middle name may be switched; within an address, the country and state may be switched. These errors become more likely the farther back in time a document goes. Older Greenbook patents sometimes have the first or last name switched with the middle name (presented as an initial), which makes searching by a person's name more difficult |
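
To make the documentId / patentId row concrete, here is a minimal sketch of removing leading zero padding from a document number. It is an illustration under stated assumptions, not the parser's actual normalization: the class `DocNumberNormalizer` and method `stripZeroPadding` are hypothetical, and the sketch assumes any letters before the first digit (for example a `D` on a design patent) should be kept unchanged.

```java
/** Hypothetical helper illustrating removal of leading zero padding from a document number. */
final class DocNumberNormalizer {

    /**
     * Keeps any non-digit prefix, then drops leading zeros from the remainder.
     * At least one character of the remainder is always kept, so "0000" becomes "0".
     */
    static String stripZeroPadding(String docNumber) {
        String value = docNumber.trim();

        // Find where the numeric part starts.
        int start = 0;
        while (start < value.length() && !Character.isDigit(value.charAt(start))) {
            start++;
        }
        String prefix = value.substring(0, start);
        String rest = value.substring(start);

        // Skip leading zeros, but never the last remaining character.
        int firstKept = 0;
        while (firstKept < rest.length() - 1 && rest.charAt(firstKept) == '0') {
            firstKept++;
        }
        return prefix + rest.substring(firstKept);
    }
}
```

With this sketch, `stripZeroPadding("07654321")` yields `"7654321"` and `stripZeroPadding("D0456789")` yields `"D456789"`. Because only leading zeros are removed, the approach does not depend on the padded width, which matters if ids grow beyond 8 digits.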
- CPC Classification XML
| Field | Description |
|---|---|
| Patent | instantiated with DocumentId |
| Assignee Type | valid USPTO assignee type (00 through 09) |
| CountryCode | WIPO Standard ST.3 two-letter codes |
| Classification | parse and normalize Classification (USPC, IPC, CPC) |
| Date | date format yyyyMMdd |
| Address | must have a country |
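
The rules in the table above can be approximated with a few small checks. The sketch below is a hedged illustration, not the library's validators: the class `FieldChecks` and all of its methods are hypothetical. The country check only verifies the two-letter shape and does not consult the actual ST.3 code list, and the `report` method mimics the "log the invalid value and its document fragment" behaviour with `java.util.logging`.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.logging.Logger;

/** Hypothetical field checks mirroring the validation table above. */
final class FieldChecks {
    private static final Logger LOG = Logger.getLogger(FieldChecks.class.getName());

    /** True if the value is eight digits forming a real calendar date (yyyyMMdd). */
    static boolean isValidDate(String value) {
        if (value == null || !value.matches("\\d{8}")) {
            return false;
        }
        try {
            LocalDate.of(
                    Integer.parseInt(value.substring(0, 4)),
                    Integer.parseInt(value.substring(4, 6)),
                    Integer.parseInt(value.substring(6, 8)));
            return true;
        } catch (DateTimeException e) {
            return false; // e.g. month 13 or February 30
        }
    }

    /** Shape check only: two upper-case letters. Membership in ST.3 is NOT verified. */
    static boolean hasCountryCodeShape(String value) {
        return value != null && value.matches("[A-Z]{2}");
    }

    /** True for the assignee types "00" through "09". */
    static boolean isValidAssigneeType(String value) {
        return value != null && value.matches("0[0-9]");
    }

    /** Logs an invalid value with the fragment it came from; returns the validity flag unchanged. */
    static boolean report(String field, String value, boolean valid, String fragment) {
        if (!valid) {
            LOG.warning("Invalid " + field + " '" + value + "' in fragment: " + fragment);
        }
        return valid;
    }
}
```

A caller might write `FieldChecks.report("Date", raw, FieldChecks.isValidDate(raw), fragment)` and continue parsing either way, which fits the stated preference for completeness over strict rejection.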
Single Line Per Record:
gov.uspto.patent.TransformerCli --input="ipa_corpusApps_2005.zip"
Single File Per Record:
gov.uspto.patent.TransformerCli --input="ipa_corpusApps_2005.zip" --outBulk=false
Options:
--input="FILE.zip" Patent Bulk Zip
--outdir="output" Output Directory
--outBulk=true Single file, JSON record per line
--limit=100 Total Record Limit
--flat=false Denormalized/Flat JSON or Object Hierarchy
--pettyPrint=true Pretty Print JSON
--stdout=true Write to Terminal instead of file
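
When the output is a single file with one JSON record per line, downstream code can consume it line by line without loading the whole file. The following is a minimal sketch of such a consumer. The class `BulkOutputReader` is hypothetical, and the sketch assumes UTF-8 output with pretty printing turned off so that each record really occupies exactly one line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for a "JSON record per line" output file. */
final class BulkOutputReader {

    /** Returns up to {@code limit} non-blank lines, each expected to hold one JSON record. */
    static List<String> readRecords(Path bulkFile, int limit) throws IOException {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(bulkFile, StandardCharsets.UTF_8)) {
            String line;
            while (records.size() < limit && (line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    records.add(line); // raw JSON text; parse with a JSON library of your choice
                }
            }
        }
        return records;
    }
}
```

Each returned string can then be handed to whichever JSON parser the consuming application already uses.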
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import gov.uspto.common.file.filter.FileFilterChain;
import gov.uspto.common.file.filter.SuffixFileFilter;
// The next four imports were missing; they are assumed to live in gov.uspto.patent, next to TransformerCli.
import gov.uspto.patent.PatentDocFormat;
import gov.uspto.patent.PatentDocFormatDetect;
import gov.uspto.patent.PatentReader;
import gov.uspto.patent.PatentReaderException;
import gov.uspto.patent.bulk.DumpFileAps;
import gov.uspto.patent.bulk.DumpFileXml;
import gov.uspto.patent.bulk.DumpReader;
import gov.uspto.patent.model.Patent;
import gov.uspto.patent.serialize.DocumentBuilder;
import gov.uspto.patent.serialize.JsonMapper;
import gov.uspto.patent.serialize.JsonMapperFlat;

public class ReadBulkPatentZip {

    public static void main(String... args) throws IOException, PatentReaderException {
        File inputFile = new File(args[0]);
        int skip = 100;                 // records to skip before reading
        int limit = 1;                  // maximum number of records to read
        boolean flatJson = false;       // flat JSON instead of an object hierarchy
        boolean jsonPrettyPrint = true;
        boolean writeFile = false;      // write <patentId>.json instead of printing

        // The format is detected from the bulk file name.
        PatentDocFormat patentDocFormat = new PatentDocFormatDetect().fromFileName(inputFile);

        DumpReader dumpReader;
        switch (patentDocFormat) {
        case Greenbook:
            dumpReader = new DumpFileAps(inputFile);
            break;
        default:
            dumpReader = new DumpFileXml(inputFile);
            FileFilterChain filters = new FileFilterChain();
            //filters.addRule(new PathFileFilter(""));
            filters.addRule(new SuffixFileFilter("xml"));
            dumpReader.setFileFilter(filters);
        }

        dumpReader.open();
        if (skip > 0) {
            dumpReader.skip(skip);
        }

        DocumentBuilder<Patent> json;
        if (flatJson) {
            json = new JsonMapperFlat(jsonPrettyPrint, false);
        } else {
            json = new JsonMapper(jsonPrettyPrint, false);
        }

        for (int i = 1; dumpReader.hasNext() && i <= limit; i++) {
            // Raw text of one patent document (XML, SGML or Greenbook, depending on the format).
            String xmlDocStr = (String) dumpReader.next();
            try (PatentReader patentReader = new PatentReader(xmlDocStr, patentDocFormat)) {
                Patent patent = patentReader.read();
                String patentId = patent.getDocumentId().toText();
                System.out.println(patentId);
                //System.out.println("Patent Object: " + patent.toString());

                Writer writer;
                if (writeFile) {
                    writer = new FileWriter(patentId + ".json");
                } else {
                    writer = new StringWriter();
                }

                try {
                    json.write(patent, writer);
                    if (!writeFile) {
                        System.out.println("JSON: " + writer.toString());
                    }
                } catch (IOException e) {
                    // getStackTrace() would only print an array reference, so report the message instead.
                    System.err.println("Failed to write file for: " + patentId + "\n" + e.getMessage());
                } finally {
                    writer.close();
                }
            }
        }

        dumpReader.close();
    }
}