CRC_Extract

Many of the PDF files had corrupt font mapping, so their text could not be extracted directly. OCR was used to generate text files instead.

crcOCR.py: Uses pdf2image to convert each page of a PDF to a JPEG, uses pytesseract to read the images with OCR, writes the recognized text to a text file, then deletes the JPEGs. It repeats this for every PDF in the folder. pdf2image requires poppler, which must be downloaded first (http://blog.alivate.com.au/poppler-windows/).
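The steps described for crcOCR.py can be sketched roughly as follows. This is a minimal illustration, not the script itself: the function names (`output_path_for`, `ocr_pdf`, `ocr_folder`), the JPEG naming scheme, and the 300 dpi setting are all assumptions. It also assumes the Tesseract binary is installed and on the PATH, which pytesseract needs in addition to poppler.

```python
from pathlib import Path
from typing import List


def output_path_for(pdf_path: Path) -> Path:
    """Hypothetical naming rule: report.pdf -> report.txt in the same folder."""
    return pdf_path.with_suffix(".txt")


def ocr_pdf(pdf_path: Path, dpi: int = 300) -> Path:
    """OCR one PDF page by page and write the combined text to a .txt file."""
    # Imported here so the helper above can be used without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(str(pdf_path), dpi=dpi)  # one PIL image per page
    text_parts = []
    for number, page in enumerate(pages, start=1):
        # Assumed temporary file name; the real script may name JPEGs differently.
        jpeg_path = pdf_path.with_name(f"{pdf_path.stem}_page{number}.jpg")
        page.save(jpeg_path, "JPEG")
        try:
            text_parts.append(pytesseract.image_to_string(str(jpeg_path)))
        finally:
            jpeg_path.unlink()  # delete the JPEG once it has been read

    out_path = output_path_for(pdf_path)
    out_path.write_text("\n".join(text_parts), encoding="utf-8")
    return out_path


def ocr_folder(folder: Path) -> List[Path]:
    """Run ocr_pdf over every PDF in the folder and return the text file paths."""
    return [ocr_pdf(pdf) for pdf in sorted(folder.glob("*.pdf"))]
```

On Windows, if poppler is not on the PATH, `convert_from_path` also accepts a `poppler_path` argument pointing at poppler's `bin` directory.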

crcExtract.py: Reads the text files generated by crcOCR.py, extracts product and ingredient info, and generates a CSV to be uploaded to Factotum.
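The extraction step might look something like the sketch below. Everything specific here is an assumption made for illustration: the layout of an ingredient line (name, then a CAS number, then an amount), the function names, and the CSV column names. The actual parsing rules in crcExtract.py and the column layout Factotum expects are not described in this ReadMe.

```python
import csv
import io
import re

# Assumed line layout: "<ingredient name> <CAS number> <amount or range>".
INGREDIENT_LINE = re.compile(
    r"^(?P<name>.+?)\s+(?P<cas>\d{2,7}-\d{2}-\d)\s+(?P<amount>\S.*)$"
)

# Hypothetical column names, not the real Factotum upload template.
FIELDNAMES = ["product", "ingredient", "cas", "amount"]


def extract_rows(text, product):
    """Return one dict per line of OCR text that matches the assumed layout."""
    rows = []
    for line in text.splitlines():
        match = INGREDIENT_LINE.match(line.strip())
        if match:
            rows.append({
                "product": product,
                "ingredient": match["name"],
                "cas": match["cas"],
                "amount": match["amount"].strip(),
            })
    return rows


def rows_to_csv(rows):
    """Serialize the extracted rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because OCR output is noisy, lines that do not match the pattern are skipped here rather than raising an error; a real run would likely need the skipped lines reviewed by hand.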
