Inspiration

We wanted to create this project after experiencing frustrations with the lack of useful information on second-hand car marketplaces.

When shopping for a used car, people find that information is fragmented: vehicle listings live on one site, MOT histories on another, and there is no easy way to estimate running costs. It's easy to spend hours switching between websites that each hold only part of the picture, and the search quickly becomes overwhelming.

The vision was to build a platform that brings all this critical data together in one place, helping buyers make fully informed decisions. We wanted to create something that not only shows what cars are available but also provides context about their history and projected expenses.

By integrating vehicle listings with comprehensive MOT history, mileage progression visualizations, and personalized cost estimations, the platform addresses the information gap in the used car market. The goal was to create a transparent, data-driven tool that reduces the uncertainty and stress typically associated with buying a used vehicle.

What it does

We scrape AutoTrader listings using a headless Selenium web scraper, which collects the images from each listing along with the vehicle's price and transmission type.

We then pass the listing images to the first step in our processing pipeline: a vehicle image analysis background job. This uses AI image extraction to read the car's number plate from the listing photos, a complex but necessary step because AutoTrader listings do not expose number plates directly. Once the plate is identified, the job finishes and enqueues the next job.
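The shape of that first stage can be sketched in plain Ruby. Everything here is illustrative: the class names, the confidence cutoff value, and the stubbed vision call stand in for our real code, and the in-memory array stands in for Sidekiq's Redis queue.

```ruby
ENQUEUED = []  # stand-in for the Sidekiq queue

# Hypothetical next stage in the pipeline (names assumed).
class DvlaLookupJob
  def self.perform_async(vehicle_id, plate)
    ENQUEUED << [vehicle_id, plate]
  end
end

class PlateExtractionJob
  CONFIDENCE_CUTOFF = 0.9  # illustrative threshold, not our tuned value

  def perform(vehicle_id, image_urls)
    plate, confidence = read_plate(image_urls)
    return :discarded if confidence < CONFIDENCE_CUTOFF  # bad reads never enter the system
    DvlaLookupJob.perform_async(vehicle_id, plate)       # chain the next stage
    :enqueued
  end

  private

  # Stub standing in for the AI image-extraction call.
  def read_plate(_image_urls)
    ["AB12 CDE", 0.97]
  end
end
```

The key property is that a stage only enqueues its successor after it has produced the data the successor needs.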

The next job calls the DVLA vehicle service API, which enriches the vehicle with metadata such as colour, engine size and road tax status. This in turn queues a job that queries the MOT history API, allowing us to view every MOT test the car has undertaken since tests were first digitised.
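To give a flavour of the MOT-history step, here is a sketch of turning an API-style JSON response into records. The response shape below is simplified and assumed for illustration; the real API returns many more fields.

```ruby
require "json"

# Simplified, assumed response shape for illustration only.
sample = <<~JSON
  {"registration": "AB12CDE",
   "motTests": [
     {"completedDate": "2023-05-01", "testResult": "PASSED", "odometerValue": "45210"},
     {"completedDate": "2022-04-28", "testResult": "FAILED", "odometerValue": "41880"}
   ]}
JSON

# Map each raw test into the fields we care about for history and mileage charts.
tests = JSON.parse(sample)["motTests"].map do |t|
  { date: t["completedDate"], result: t["testResult"], mileage: t["odometerValue"].to_i }
end

latest_mileage = tests.map { |t| t[:mileage] }.max
```

A per-test record like this is what feeds the mileage progression visualisations later on.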

Once the vehicle has been enriched with all of the above, we enqueue our AI analysis background job. This interprets the full history and summarises for the user whether the vehicle is good value, its expected lifespan, and the likely cost of passing its next MOT.

At this point the vehicle becomes available in our web app, fully enriched with data. If the pipeline detects an unreadable number plate or inconsistent data, the record is destroyed before it is ever shown to a user.

data processing pipeline

How we built it

We chose a Ruby on Rails backend and a React frontend built with Vite. This let us develop an elegant, modern UI on top of a robust, scalable backend.

Ruby on Rails lets us leverage background-processing libraries like Sidekiq, so the backend can churn through large quantities of data without destabilising the web app. Sidekiq queues these jobs in Redis and runs them in separate worker processes.

We split our processing pipeline into four jobs to minimise the memory footprint of each one, preventing out-of-memory errors from failing jobs. We also made each job idempotent, so any non-determinism in parallel processing can't corrupt our data.
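Idempotency here mostly means keyed writes: re-running a job overwrites the same record rather than appending a duplicate. A minimal sketch, using an in-memory hash in place of the database (names and values are illustrative):

```ruby
MOT_TESTS = {}  # stand-in for a DB table keyed on (vehicle_id, completed_date)

# Keyed upsert: replaying the same job overwrites the row, never duplicates it.
def upsert_mot_test(vehicle_id, completed_date, attrs)
  MOT_TESTS[[vehicle_id, completed_date]] = attrs
end

# Running the same job twice leaves exactly one record.
2.times { upsert_mot_test(42, "2023-05-01", result: "PASSED", mileage: 45_210) }
```

With real ActiveRecord this corresponds to a `find_or_create`/upsert keyed on a natural unique index, so retried or duplicated Sidekiq deliveries are harmless.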

We built the backend REST API with Rails controllers, which handle communication between the frontend and backend. We chose a Postgres database because its native Rails integration is very elegant, and we were storing structured, relational data, e.g. one Vehicle has many MotTests.
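As a sketch, that relationship maps naturally onto nested REST routes in `config/routes.rb` (resource names assumed, not our exact routing file):

```ruby
Rails.application.routes.draw do
  resources :vehicles, only: [:index, :show, :create] do
    resources :mot_tests, only: [:index]  # GET /vehicles/:vehicle_id/mot_tests
  end
end
```

Nesting keeps the API shape aligned with the `has_many` association in the schema.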

Challenges we ran into

Firstly, while scraping AutoTrader we encountered unexpected behaviour: the information we needed only appears after all of the page's elements have loaded dynamically. This forced us to drive a headless browser with Selenium rather than rely on plain HTTP requests.

Secondly, extracting registration plates from images was difficult. We had to find a readily available yet highly reliable model capable of reading number plate strings from a variety of extreme angles. To minimise recognition errors, we added a confidence field to the model's output so we could set a high cutoff, preventing invalid number plates from entering the system.

Thirdly, race conditions in the processing pipeline caused issues. Because our architecture relies on each job queueing the next when it finishes, situations arose where the previous job's database writes had not yet committed when the next job started running. Those jobs then failed, because the enrichment they depended on wasn't there yet. To solve this, we deferred enqueueing each background job until the preceding transaction's commit event had fired.
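The fix boils down to "enqueue after commit, not after save". In Rails this is what `after_commit` model callbacks provide; the toy simulation below demonstrates the ordering property without Rails or a database (all names invented):

```ruby
# Toy transaction that defers callbacks until commit, mimicking `after_commit`.
class FakeTransaction
  def initialize
    @callbacks = []
  end

  def after_commit(&block)
    @callbacks << block
  end

  def commit
    @callbacks.each(&:call)
  end
end

QUEUE = []
tx = FakeTransaction.new
tx.after_commit { QUEUE << :mot_history_job }  # enqueue is deferred, not immediate

before_commit_size = QUEUE.size  # nothing enqueued yet: the data isn't durable
tx.commit                        # now the follow-on job can safely run
```

Because the enqueue only happens once `commit` fires, the next job can never observe a database state from before its predecessor's writes.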

Accomplishments that we're proud of

We're proud that users now have enough data not only to make an informed decision on a used car purchase, but also to forecast the impact of owning the car as it ages. Our forecasting module uses a three-dimensional sigmoid regression function with manually tuned parameters.

Forecast pricing
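To illustrate the shape of a sigmoid cost forecast, here is a one-dimensional slice of the idea. The parameters below are invented for the example, not our tuned three-dimensional fit: annual cost sits near `base` while the car is young, climbs around `midpoint` years, and plateaus near `ceiling`.

```ruby
# Standard logistic function.
def sigmoid(x)
  1.0 / (1.0 + Math.exp(-x))
end

# Illustrative cost curve; every parameter value here is made up.
def forecast_annual_cost(age_years, base: 300.0, ceiling: 1500.0,
                         midpoint: 10.0, steepness: 0.5)
  base + (ceiling - base) * sigmoid(steepness * (age_years - midpoint))
end

at_midpoint = forecast_annual_cost(10.0)  # exactly halfway between base and ceiling
```

The S-shape is what makes the forecast plausible: costs don't grow without bound, they saturate as the car approaches end of life.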

Pasting the URL of an AutoTrader listing is all it takes to import it into AutoBiography; this seamless integration lets the user reach an informed decision more quickly.

Link paste box

Our purchase analysis summary condenses all known vehicle history, plus common faults for that model of car, into a digestible format, collecting fragmented data from many sources and making it understandable for the amateur car shopper.

Summary

What we learned

We learned about the dangers of stale documentation: as the codebase grew, adding new features became harder because the outdated documentation couldn't be entirely relied upon.

We learned how to overcome race conditions in parallel background processing after finding non-deterministic behaviours in our bulk scraping tasks.

We learned how to do headless web scraping to gather complex, dynamically loaded components after discovering that basic HTTP requests weren't returning the content we expected.

What's next for AutoBiography

The next steps for AutoBiography are to improve documentation coverage and reduce complexity across the codebase.

These changes would allow the codebase to be more maintainable, enabling us to implement custom listings, user logins and authentication. This would facilitate an advanced vehicle marketplace, presenting additional data collection opportunities and increasing the volume of data we have access to. This would allow us to provide greater insights to our customers and further differentiate our product from other vehicle search engines.

We could also implement advanced machine learning models to better predict failures, match drivers to their best-suited vehicles, and discover new statistical relationships between vehicle features and safety outcomes.
