50% accuracy!
Interesting/novel problem-solving approaches

Geocoding

We realized that if we had coordinate data for receipt and user-entry addresses, the distance between any given receipt and entry could be used to match them. Since the OCR data didn't give us simple address strings, we tried two methods to extract addresses:

- Keyword position: some keywords frequently appeared early in addresses and others appeared later. Using the frequency and relative position of words in entry addresses, we established some words as strong "starters" and "enders". We then looked for a sudden increase in "starter" words followed by a sudden increase in "ender" words to identify the first and last lines of an address in the OCR data.
- Alignment detection: receipts tend to change indentation/alignment when moving from one block of information to another. Using the positional data of each line of text, we can identify when lines begin to share an alignment and when they break it. The first block with this behavior is the header block, which usually contains the company name, address, contact information, and registration/tax details. Everything other than the company name and address can be trivially discarded with keywords, and we're left with an address.

We also considered locality-sensitive hashing, but the alignment-detection approach worked well, so we went with that instead. Using geopy and the Google Maps API, we encoded the address strings of user entries and receipts as coordinates.

Regular expressions

To find dates in the user inputs and the receipts, we used a set of regular expressions to match string patterns. Anything in the format of a date (in any field order) was captured by our algorithm, even when the input wasn't a valid date (for example, one user entered 2916-4-17, with 2916 as the year, and this was still captured by our program).
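Once both a receipt and a user entry have coordinates, the distance-based matching described above reduces to a nearest-neighbor lookup. The write-up doesn't show this step, so here is a minimal stdlib-only sketch; the function names (`haversine_km`, `closest_entry`) are our own, and the actual project could equally use `geopy.distance` once coordinates come back from the geocoder.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius ~6371 km

def closest_entry(receipt_coord, entry_coords):
    """Index of the geocoded user entry nearest to the geocoded receipt."""
    return min(range(len(entry_coords)),
               key=lambda i: haversine_km(*receipt_coord, *entry_coords[i]))
```

A receipt geocoded to (40.0, -74.0) would then be matched to whichever candidate entry minimizes this distance.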
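The date-capturing regex could look something like the sketch below. This is an illustrative pattern, not the project's actual expression: it deliberately accepts any date-shaped token (including impossible ones like 2916-4-17), leaving validity checks for a later step, which matches the behavior described above.

```python
import re

# Permissive pattern: three digit groups separated by -, /, or .
# It captures anything date-shaped in any field order; whether the
# groups form a real calendar date is checked separately.
DATE_RE = re.compile(r"\b(\d{1,4})[-/.](\d{1,2})[-/.](\d{1,4})\b")

def extract_dates(text):
    """Return all date-shaped substrings found in OCR or user text."""
    return [m.group(0) for m in DATE_RE.finditer(text)]
```

For example, `extract_dates("Date: 2016-04-17 or 17/4/2016")` picks up both tokens regardless of field order, and `extract_dates("2916-4-17")` still captures the impossible year.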
Matching dates (fuzzywuzzy)

Users sometimes mistyped the dates of their purchases, and the OCR data also had some incorrectly identified numbers, so we tried two methods for matching dates:

- We parsed dates into date objects and tried both permutations of month and day, to account for users being inconsistent with the format they used.
- We used fuzzywuzzy to calculate Levenshtein distance, which let us match dates even in the presence of typos.
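Fuzzywuzzy's scores are built on Levenshtein edit distance. As a self-contained sketch of the second method (without the fuzzywuzzy dependency), the helper names below are hypothetical and the `max_edits=1` threshold is an assumed tolerance, not the project's tuned value:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def dates_match(d1, d2, max_edits=1):
    """Treat two date strings as the same if they differ by at most max_edits typos."""
    return levenshtein(d1, d2) <= max_edits
```

So "2016-04-17" still matches the mistyped "2016-04-77" (one substitution), while a genuinely different date like "2015-03-16" does not.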
Observations

Integrating predicted values for "total" actually decreased performance, and we're not sure why; perhaps because user price inputs are erroneous in less predictable ways?
Insightful exploratory data analysis and visualizations

Outliers were immediately identified in the user inputs:

- Dates: some dates were formatted very incorrectly, and there was no standard format for users to input dates (other than being all numeric). To account for this, we extracted the dates and converted them into both YYYY-MM-DD and YYYY-DD-MM formats for comparison.
- Addresses:
- Unique dates: dates weren't necessarily unique, since multiple transactions could have been made on the same date. To account for this, we took other parameters, such as location, into consideration.
- Tilting of receipts/image work: used geocoding to
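Trying both the YYYY-MM-DD and YYYY-DD-MM interpretations of a numeric date can be sketched with the standard library, letting `datetime.date` reject impossible orderings. The function name `candidate_dates` is our own shorthand for this step:

```python
from datetime import date

def candidate_dates(y, a, b):
    """Interpret (y, a, b) as both YYYY-MM-DD and YYYY-DD-MM, keeping only valid dates."""
    out = set()
    for month, day in ((a, b), (b, a)):
        try:
            out.add(date(y, month, day))
        except ValueError:
            pass  # e.g. month 17 is impossible, so that ordering is dropped
    return out
```

For an unambiguous input like (2016, 4, 17) only one interpretation survives; for (2016, 4, 5) both do, and other parameters (such as location) are needed to disambiguate.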
Possibilities for future work

- Parameter tuning with lasso or ridge regression.
- Determining which number in the OCR output is most likely to be the total, by feeding the coordinates of all boxes containing numbers, plus the coordinates of the box containing the word "total", into an MLP and learning the relationship between the location of "total" and the value of the total.
- More generally: provide the capability for a user to simply scan their receipt and have the data populated automatically instead of having to input it manually; they would only need to scan the receipt and then confirm the details after they were auto-populated.