time-crawler

To test the crawler for yourself, follow the steps listed below.

Install Scrapy using the following command:

pip install scrapy

Create a time-crawler directory and begin a new Scrapy project by running the following command in the time-crawler directory:

scrapy startproject time

This will create a time directory with the following contents:

time/
    scrapy.cfg
    time/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

In the spiders directory, create a file named time_spider.py and paste in the code from this repository.

To run the crawler and write its output to a file named time.json, run the following command in the spiders directory (delete any existing time.json first, because -o appends to an existing file):

scrapy crawl time -o time.json

time.json should now contain a single JSON array of about 13,000 dictionaries, one per article, each storing the article's link, title, author, date, and body.
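A quick way to check the crawl result is to load the file and count the entries. The sketch below is only an illustration: the key names are assumed from the field list above and may not match the names used in time_spider.py, and summarize is a hypothetical helper, not part of this repository.

```python
import json

# Assumed key names, taken from the field list above; adjust them to match the spider.
EXPECTED_KEYS = {"link", "title", "author", "date", "body"}

def summarize(path):
    """Return (total articles, articles missing an expected key, duplicate links)."""
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)  # the whole file is one JSON array
    # issubset iterates over the dictionary's keys
    incomplete = sum(1 for a in articles if not EXPECTED_KEYS.issubset(a))
    # Serialize each link so that list-valued links can still be compared
    links = [json.dumps(a.get("link")) for a in articles]
    duplicates = len(links) - len(set(links))
    return len(articles), incomplete, duplicates
```

Calling summarize("time.json") in the spiders directory should report a total near 13,000 if the crawl completed.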

To upload the contents of `time.json` to a public Elasticsearch domain, follow the steps listed below.

Scrapy writes the dictionaries as a single JSON array, one dictionary per line, so delete the first and last lines of the file. The only characters deleted should be [ and ].

To remove the trailing comma from the end of each line, run the following command:

sed 's/,$//' < time.json > time1.json

This will create a new file named time1.json.

To follow the format required by Elasticsearch, insert {"index": {"_index": "stories"}} before each dictionary by running the following command:

sed 's/^/{"index": {"_index": "stories"}}\n/g' < time1.json > time2.json

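This will create a new file named time2.json. Note that treating \n in the replacement as a newline is GNU sed behavior; BSD sed (the default on macOS) does not insert a newline here.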
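The manual edit and the two sed commands can also be done in one step from the original time.json. The following is a minimal sketch of that alternative, not the method used by this repository; to_bulk_lines and convert are hypothetical names, and the index name stories is taken from the command above.

```python
import json

def to_bulk_lines(articles, index="stories"):
    """Yield one index line followed by one document line for each article."""
    action = json.dumps({"index": {"_index": index}})
    for article in articles:
        yield action
        # json.dumps without indent keeps each document on a single line
        yield json.dumps(article)

def convert(src="time.json", dst="time2.json", index="stories"):
    """Read the JSON array in src and write the bulk-format file dst."""
    with open(src, encoding="utf-8") as f:
        articles = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        for line in to_bulk_lines(articles, index):
            f.write(line + "\n")  # the bulk body must end with a newline
```

Because this parses the array rather than editing text, it does not depend on each dictionary sitting on its own line.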

To upload the contents of time2.json to your cluster, run the following command, replacing ENDPOINT with your domain's endpoint:

curl -XPOST ENDPOINT/_bulk --data-binary @time2.json -H 'Content-Type: application/json'
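For reference, the same request can be sent from Python with only the standard library. This is a sketch under the assumption that the domain accepts unauthenticated requests, as the curl command does; build_bulk_request and upload are hypothetical helper names.

```python
import json
import urllib.request

def build_bulk_request(endpoint, path):
    """Build (but do not send) a POST request to the domain's _bulk API."""
    with open(path, "rb") as f:
        body = f.read()  # send the bytes unchanged, like curl --data-binary
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/_bulk",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def upload(endpoint, path):
    """Send the file and return the parsed JSON response."""
    with urllib.request.urlopen(build_bulk_request(endpoint, path)) as resp:
        return json.load(resp)
```

The bulk response contains an errors field that is true if any individual item failed, so it is worth checking upload(ENDPOINT, "time2.json")["errors"] rather than relying on the HTTP status alone.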

NOTE: time2.json may be too large to upload in a single request. If so, split it into multiple files and upload each one individually. In the following command, COUNT is an integer giving the number of lines in each new file, and PREFIX is a string used as the prefix of each new file's name. COUNT must be even so that each index line stays in the same file as the dictionary that follows it:

split -l COUNT time2.json PREFIX
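Each dictionary in time2.json is preceded by its own index line, so a split has to keep those two lines together. The sketch below does the split by pairs instead of by lines; split_bulk is a hypothetical helper, and its numbered file names differ from the alphabetic suffixes that split produces.

```python
def split_bulk(path, prefix, pairs_per_file):
    """Split a bulk file into pieces of pairs_per_file index/document pairs.

    Returns the list of file names written.
    """
    with open(path, encoding="utf-8") as src:
        lines = src.readlines()
    step = 2 * pairs_per_file  # two lines per article: index line + document
    names = []
    for n, start in enumerate(range(0, len(lines), step)):
        name = f"{prefix}{n:03d}"
        with open(name, "w", encoding="utf-8") as out:
            out.writelines(lines[start:start + step])
        names.append(name)
    return names
```

For example, split_bulk("time2.json", "part_", 1000) would write part_000, part_001, and so on, each holding at most 1,000 articles.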
