This repository was archived by the owner on Feb 2, 2022. It is now read-only.

Local cache of scraped data #95

@mscarey

Description

Issue Type

  • Feature request

Current Behavior

Unless I'm mistaken, running scrAPD from the command line multiple times re-scrapes the entire APD news site on each run.

Expected Behavior

scrAPD could cache the results of the scrape locally when it's run once. Then, by default on subsequent runs, scrAPD could construct its output from the local cache, and then hit the APD site only to scrape reports newer than the newest record in the cache.

If I'm mistaken and it's possible to use a cache already, then my request would be to update that part of the documentation.

Possible Solution

The main reason for the feature request is to reduce the number of calls to the APD site. Even if the traffic isn't overwhelming for them, it seems like a better practice to have the ability to control it with caching.

My best idea for implementing a cache would be for the CLI to create a SQLite database in a local directory that's not under version control. (This may sound hypocritical coming from me, but I don't want to put too much more personally identifiable information on github.) So, maybe using SQLAlchemy?
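To make the idea concrete, here is a minimal sketch of such a cache using the stdlib `sqlite3` module (SQLAlchemy would work the same way, just with a model class instead of raw SQL). The table name, column names, and case-ID format are hypothetical, not anything scrAPD currently uses:

```python
import sqlite3

def open_cache(path=":memory:"):
    # Hypothetical schema: one row per APD fatality report.
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reports (
               case_id TEXT PRIMARY KEY,
               crash_date TEXT NOT NULL,  -- ISO 8601, so text sorts chronologically
               raw_html TEXT              -- optional: full report text
           )"""
    )
    return conn

def newest_cached_date(conn):
    # Subsequent runs would only scrape reports newer than this date.
    row = conn.execute("SELECT MAX(crash_date) FROM reports").fetchone()
    return row[0]  # None when the cache is empty

def cache_report(conn, case_id, crash_date, raw_html=""):
    # INSERT OR REPLACE lets a re-scrape overwrite a stale copy of a report.
    conn.execute(
        "INSERT OR REPLACE INTO reports VALUES (?, ?, ?)",
        (case_id, crash_date, raw_html),
    )
    conn.commit()
```

On each run the CLI would call `newest_cached_date()` first and stop paging through the APD news site once it reaches reports at or before that date. Storing the database under something like `~/.scrapd/` (rather than the working tree) keeps it out of version control by default.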

I'm not sure whether it would be better to cache just the output data, or to cache the entire text of each police report. If it's the latter, then when you re-ran the CLI you'd need options to (1) reuse the output data you already have, (2) re-parse the police reports in the local cache, or (3) re-download the police reports and then parse them.
