PatentDocument

Patent Document Parser

All patent document formats from 1976 to current parse into a single common Patent model. Field validation is applied to many fields, where invalid values and their document fragment are logged. Flexibility and completeness is stressed over efficiency.

Formats

Format	Prefix	Years	Revisions	Documentation
Green Book (freetext key values; multi-line indented values)	pftaps	1976-2001	revisions until 1997	Greenbook Documentation
Red Book SGML	pg	2001-2004	ST32-US-Grant-025xml.dtd	Redbook SGML v1.9 Redbook SGML v2.4 WIPO ST. 32 - SGML Standard
Red Book PAP XML	pa	2002-2004	Published Patent Applications (PAP) from 2002-2004 pap-v15-2001-01-31.dtd pap-v16-2002-01-01.dtd
Red Book XML	ipg ipa	2004-Current	Grants: us-patent-grant-v40-2004-12-02.dtd us-patent-grant-v41-2005-08-25.dtd us-patent-grant-v42-2006-08-23.dtd us-patent-grant-v43-2012-12-04.dtd us-patent-grant-v44-2013-05-16.dtd us-patent-grant-v45-2014-04-03.dtd Applications: us-patent-application-v40-2004-12-02.dtd us-patent-application-v41-2005-08-25.dtd us-patent-application-v42-2006-08-23.dtd us-patent-application-v43-2012-12-04.dtd us-patent-application-v44-2014-04-03.dtd	Redbook XML Documentation WIPO ST. 36 - XML Standard

Redbook XML variations/fixes/normalization/improvements

Short list of some of the XML variations handled and improvements made by the Patent Document Parser

Field	Description
parties	variation: us-parties
applicant	variation: us-parties/us-applicants/us-applicant
references-cited	variation: us-references-cited
citation	variation: us-citation
inventor	variation: Applicant with attribute "app-type" value "applicant-inventor"
address/street	variation: address-1 address-2
agent	fix: if missing use "correspondence-address" field
description	fix to corresponds with non-xml patent versions, improvement since individual sections are often searched on: break description into individual sections by XML Processing Instructions
claim	improvement: identify independent and dependent claims; capture dependent claim hierarchy
IPC classification	variations: classification-ipc and classification-ipcr, first flat other separated in sections
classification	normalization: CPC, IPC and USPC patent classifications
documentId / patentId	normalization; including removing leading 0 padding, currently added to patent ids with length less than 8 digits, in the near future patent ids may increase to 13 digits
country	improvement: mapping of country codes to country name, current and historic codes used before 1978 or individual codes dropped or changed since
address and name	not-fixed, lookout for switched value errors: within name the first-name and last-name or middle name switched; within address the country and state switched ; farther back in time more likely to see these data errors. Older Greenbook patents sometimes have first name or last name switched with middle name (presented as an initial), making searching by a person's name more difficult

Also Parses

CPC Classification XML

Validate and Normalize

Field	Description
Patent	instantiated with DocumentId
Assignee Type	valid USPTO assigne type (00 through 09)
CountryCode	WIPO standard ST.3. two-letter codes
Classification	parse and normalize Classification (USPC, IPC, CPC)
Date	date format yyyyMMdd
Address	must have a country

CLI Tool to export all Patents as indexable JSON:

Single Line Per Record: 
   gov.uspto.patent.TransformerCli --input="ipa_corpusApps_2005.zip"
   
Single File Per Record:
   gov.uspto.patent.TransformerCli --input="ipa_corpusApps_2005.zip" --outBulk=false

 Options:
     --input="FILE.zip"    Patent Bulk Zip
     --outdir="output"     Output Directory      
     --outBulk=true        Single file, JSON record per line
     --limit=100           Total Record Limit
     --flat=false          Denormalized/Flat JSON or Objecet Hierarchy
     --pettyPrint=true     Pretty Print JSON
     --stdout=true         Write to Terminal instead of file

Example Usage:

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import gov.uspto.common.file.filter.FileFilterChain;
import gov.uspto.common.file.filter.SuffixFileFilter;
import gov.uspto.patent.bulk.DumpFileAps;
import gov.uspto.patent.bulk.DumpFileXml;
import gov.uspto.patent.bulk.DumpReader;
import gov.uspto.patent.model.Patent;
import gov.uspto.patent.serialize.DocumentBuilder;
import gov.uspto.patent.serialize.JsonMapper;
import gov.uspto.patent.serialize.JsonMapperFlat;

public class ReadBulkPatentZip {

    public static void main(String... args) throws IOException, PatentReaderException {
        
        File inputFile = new File(args[0]);
        int skip = 100;
        int limit = 1;
        boolean flatJson = false;
        boolean jsonPrettyPrint = true;
        boolean writeFile = false;

        PatentDocFormat patentDocFormat = new PatentDocFormatDetect().fromFileName(inputFile);

        DumpReader dumpReader;
        switch (patentDocFormat) {
        case Greenbook:
            dumpReader = new DumpFileAps(inputFile);
            break;
        default:
            dumpReader = new DumpFileXml(inputFile);
            FileFilterChain filters = new FileFilterChain();
            //filters.addRule(new PathFileFilter(""));
            filters.addRule(new SuffixFileFilter("xml"));
            dumpReader.setFileFilter(filters);
        }

        dumpReader.open();
        if (skip > 0) {
            dumpReader.skip(skip);
        }

        DocumentBuilder<Patent> json;
        if (flatJson) {
            json = new JsonMapperFlat(jsonPrettyPrint, false);
        } else {
            json = new JsonMapper(jsonPrettyPrint, false);
        }

        for (int i = 1; dumpReader.hasNext() && i <= limit; i++) {
            String xmlDocStr = (String) dumpReader.next();
            try (PatentReader patentReader = new PatentReader(xmlDocStr, patentDocFormat)) {
                Patent patent = patentReader.read();
                String patentId = patent.getDocumentId().toText();
                
                System.out.println(patentId);
                //System.out.println("Patent Object: " + patent.toString());

                Writer writer;
                if (writeFile) {
                    writer = new FileWriter(patentId + ".json");
                } else {
                    writer = new StringWriter();
                }

                try {
                    json.write(patent, writer);
                    if (!writeFile) {
                        System.out.println("JSON: " + writer.toString());
                    }
                } catch (IOException e) {
                    System.err.println("Failed to write file for: " + patentId + "\n" + e.getStackTrace());
                } finally {
                    writer.close();
                }
            }
        }

        dumpReader.close();
    }
}

Name		Name	Last commit message	Last commit date
parent directory ..
.settings		.settings
resources/samples		resources/samples
src		src
.classpath		.classpath
.gitignore		.gitignore
.project		.project
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Patent Document Parser

Formats

Redbook XML variations/fixes/normalization/improvements

Also Parses

Validate and Normalize

CLI Tool to export all Patents as indexable JSON:

Example Usage:

FilesExpand file tree

PatentDocument

Directory actions

More options

Directory actions

More options

Latest commit

History

PatentDocument

Folders and files

parent directory

README.md

Patent Document Parser

Formats

Redbook XML variations/fixes/normalization/improvements

Also Parses

Validate and Normalize

CLI Tool to export all Patents as indexable JSON:

Example Usage: