Inspiration

Current methods of extracting text from pictures are time-consuming, resource-intensive and costly. Many times in both my professional and personal life I have wanted a tool like this to make tasks easier, such as copying text from images or searching across images. Since converting each image is independent of converting any other image, this is a task that benefits greatly from distributed computing.

What it does

Extracts text from images using Distributed Compute Labs' distributed computing technology and Tesseract.js's OCR capabilities.

How we built it

-Used the DCP Node project starter code
-Found a PDF-to-JPG converter (tesseract.js does not take a PDF file as input)
-Used the converter to turn the PDF into an array of JPGs
-Fed each individual JPG to DCP as a single slice
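
The steps above can be sketched roughly as follows. This is a minimal local illustration, not the project's actual code: `extractTextFromPages` and `fakeRecognize` are hypothetical names, the OCR call is passed in as a parameter, and `Promise.all` stands in for the distribution step so the sketch runs without tesseract.js or a DCP connection.

```javascript
// Hypothetical sketch: one page image = one independent slice.
// The OCR function is injected, so nothing here depends on
// tesseract.js or the DCP network being available.

/**
 * Run `recognize` on every page image independently and return
 * the extracted text in page order.
 */
async function extractTextFromPages(pageImages, recognize) {
  // No slice depends on another, so all of them can start at once.
  const results = await Promise.all(
    pageImages.map(async (image, index) => {
      const text = await recognize(image);
      return { page: index + 1, text };
    })
  );
  // Promise.all preserves input order, so results stay sorted by page.
  return results;
}

// Stand-in for a real OCR call, used only for demonstration.
async function fakeRecognize(image) {
  return `text of ${image}`;
}
```

With tesseract.js, the injected function could plausibly be a thin wrapper such as `async (img) => (await Tesseract.recognize(img, 'eng')).data.text`. On DCP, the mapping step would be handed to the network instead of `Promise.all`; that part is not shown here.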

Challenges we ran into

-Finding a simple, efficient, independent and free PDF-to-PNG/JPG converter
-Running tesseract.js within the DCP work function

Accomplishments that we're proud of

-At the very least, the project works perfectly locally
-Learned about and implemented tesseract.js (a very cool open-source OCR library)
-Learned about and implemented a version of distributed computing

What we learned

-New technologies (tesseract.js, distributed computing, etc.)

What's next for imgTOText

-Figuring out exactly why it's not working on the DCP network, and making changes accordingly
