Information Extraction from Product Images

Overview

Our prediction process involved two key steps:

1. Image-to-Text Conversion: Extracting textual data from the provided test images using Optical Character Recognition (OCR).
2. Information Extraction with NER: Cleaning the extracted text and applying rule-based Named Entity Recognition (NER) to pull out the specific value and unit requested for each product attribute.

Flowchart of Processes

Explanation

Part 1: Image Text Extraction (Using Pytesseract)

Import Libraries: The code begins by importing necessary libraries such as requests (for HTTP requests), PIL (for handling images), pytesseract (for OCR), and pandas (for data manipulation).
Load DataFrame: The program loads a CSV file (test.csv) containing image URLs and related information into a pandas DataFrame.
extract_text_from_image Function:
- Download Image: Sends a GET request to the given URL to download the image.
- Open Image: Opens the downloaded image using PIL.Image.open.
- OCR: Uses pytesseract.image_to_string to extract text from the image.
- Error Handling: Includes a try-except block to catch errors and return error messages.
Apply OCR: The function is applied to the 'image_link' column of the DataFrame, storing the extracted text in a new column called 'tesseract'.
Save DataFrame: The updated DataFrame with the extracted text is saved as a new CSV file (csv/101000.csv).
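
The Part 1 steps can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names download_image_bytes, ocr_image_bytes, and run_ocr, the 10-second timeout, the "Error:" prefix, and index=False are assumptions made for the example. Only extract_text_from_image, the 'image_link' and 'tesseract' columns, and the file names come from the description above.

```python
import io


def download_image_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download the image at `url` and return its raw bytes (hypothetical helper)."""
    import requests  # third-party; imported lazily so the module loads without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP errors into exceptions
    return response.content


def ocr_image_bytes(data: bytes) -> str:
    """Open raw image bytes with PIL and run Tesseract OCR (hypothetical helper)."""
    from PIL import Image
    import pytesseract

    with Image.open(io.BytesIO(data)) as image:
        return pytesseract.image_to_string(image)


def extract_text_from_image(url, fetch=download_image_bytes, ocr=ocr_image_bytes) -> str:
    """Download an image and return its OCR text, or an error message on failure."""
    try:
        return ocr(fetch(url))
    except Exception as exc:  # broad on purpose: one bad image should not stop the run
        return f"Error: {exc}"  # assumed message format


def run_ocr(csv_in: str = "test.csv", csv_out: str = "csv/101000.csv") -> None:
    """Apply OCR to every URL in 'image_link' and save the result."""
    import pandas as pd

    df = pd.read_csv(csv_in)
    df["tesseract"] = df["image_link"].apply(extract_text_from_image)
    df.to_csv(csv_out, index=False)  # index=False is an assumption
```

Passing fetch and ocr as parameters is a design choice of this sketch: it lets the error handling be exercised with stub functions, without network access or a Tesseract installation.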

Part 2: Entity Extraction and Unit Standardization

Load and Prepare Data: Loads a new CSV file (likely the one generated in Part 1) and defines lists of different unit types (e.g., length, weight, volume, power).
standardize_unit Function: Standardizes units to ensure consistency (e.g., "cm" to "centimeter").
process_text Function:
- Lowercase Text: Converts text to lowercase.
- Entity-Specific Logic: Extracts values based on the 'entity_name' column (e.g., 'depth', 'width', 'item_weight').
- Regular Expressions: Finds numerical values followed by specific units.
- Unit Standardization: Calls the standardize_unit function for consistent unit representation.
- Print Values: Prints the extracted values for debugging purposes.
Apply Processing: Applies process_text to each row of the DataFrame, storing results in the 'extracted_value' column.
Entity-Unit Mapping: Creates a dictionary entity_unit_map to store unique units found for each entity type.
Save Results: Saves the processed DataFrame to a new CSV file (extracted_results_from_combined.csv).
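
A minimal sketch of the Part 2 logic is shown below. The README does not list the actual unit tables or regular expressions, so everything concrete here is an assumption made for illustration: the UNIT_ALIASES and ENTITY_UNITS tables, the pattern VALUE_UNIT_RE, the "value unit" output format, and the helper build_entity_unit_map. Only the names standardize_unit, process_text, and entity_unit_map and the example entities come from the description above.

```python
import re
from collections import defaultdict

# Hypothetical alias table: abbreviation or full name -> standardized unit name.
UNIT_ALIASES = {
    "cm": "centimeter", "centimeter": "centimeter",
    "mm": "millimeter", "millimeter": "millimeter",
    "m": "meter", "meter": "meter",
    "in": "inch", "inch": "inch", "inches": "inch",
    "g": "gram", "gram": "gram",
    "kg": "kilogram", "kilogram": "kilogram",
}

LENGTH_UNITS = {"centimeter", "millimeter", "meter", "inch"}
WEIGHT_UNITS = {"gram", "kilogram"}

# Hypothetical mapping from entity name to the units that make sense for it.
ENTITY_UNITS = {
    "depth": LENGTH_UNITS,
    "width": LENGTH_UNITS,
    "item_weight": WEIGHT_UNITS,
}

# A number (optionally with a decimal part) followed by an alphabetic unit token.
VALUE_UNIT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([a-z]+)")


def standardize_unit(unit: str) -> str:
    """Map a raw unit token such as 'cm' or 'Kg.' to its standardized name."""
    key = unit.lower().strip().rstrip(".")
    if key not in UNIT_ALIASES and key.endswith("s"):
        key = key[:-1]  # crude plural handling: "grams" -> "gram"
    return UNIT_ALIASES.get(key, key)


def process_text(text: str, entity_name: str) -> str:
    """Return the first 'value unit' pair whose unit fits the entity, else ''."""
    allowed = ENTITY_UNITS.get(entity_name, set())
    for number, raw_unit in VALUE_UNIT_RE.findall(str(text).lower()):
        unit = standardize_unit(raw_unit)
        if unit in allowed:
            return f"{number} {unit}"
    return ""


def build_entity_unit_map(rows):
    """Collect the unique units found per entity from (entity_name, extracted_value) pairs."""
    entity_unit_map = defaultdict(set)
    for entity_name, extracted_value in rows:
        if extracted_value:
            entity_unit_map[entity_name].add(extracted_value.split(" ", 1)[1])
    return dict(entity_unit_map)
```

With a DataFrame, a function like this could be applied row by row, for example df.apply(lambda row: process_text(row["tesseract"], row["entity_name"]), axis=1), assuming the OCR text is stored in the 'tesseract' column.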

Summary

We developed a two-step solution for extracting and standardizing information from product images:

Optical Character Recognition (OCR): Using the Pytesseract library, we converted image data into machine-readable text.
- Downloaded images via URLs.
- Processed the images to extract textual content.
Named Entity Recognition (NER): A rule-based NER system cleaned and analyzed the extracted text.
- Text Preprocessing: Cleaned and prepared the text.
- Entity-Specific Rules: Applied logic and regular expressions to extract numerical values and associated units.
- Unit Standardization: Standardized units (e.g., "cm" to "centimeter") for consistency.

This automated process removes the need to read product images by hand and returns each extracted value with a standardized unit.
