Web scraping is the process of automatically extracting data from websites and converting it into a structured format such as tables or files. In this article, we will learn how to scrape quotes from a website using Python libraries like Requests and BeautifulSoup and store the extracted data in a DataFrame for analysis.
Prerequisites
Install the following Python libraries:
pip install requests beautifulsoup4 pandas tqdm
Implementation
Step 1: Import Required Libraries and Connect to the Website
We first import the required libraries, requests and BeautifulSoup, then send a request to the website.
- requests.get(): fetches the webpage.
- BeautifulSoup(): parses the HTML content so we can extract data.
Website used in this article: https://quotes.toscrape.com/
import requests
from bs4 import BeautifulSoup
link = 'https://quotes.toscrape.com/'
res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')
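Before parsing, it is worth confirming that the request actually succeeded. A minimal sketch (the 10-second timeout value is an arbitrary choice, not from the original code):

```python
import requests

# Confirm the request succeeded before parsing; the timeout is an arbitrary choice
res = requests.get('https://quotes.toscrape.com/', timeout=10)
res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
```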
Step 2: Extract All Quotes Text
Now, we extract all the quote texts present on the page.
quotes = []
for quote in soup.find_all('span', class_='text'):
    quotes.append(quote.text[1:-1])
    print(quote.text[1:-1], "\n")
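To see what find_all is matching, the same extraction can be tried on a small made-up fragment that mimics the site's markup:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the markup used on quotes.toscrape.com
html = '<span class="text">“Hello, world.”</span>'
demo = BeautifulSoup(html, 'html.parser')
texts = [q.text[1:-1] for q in demo.find_all('span', class_='text')]
print(texts)  # the [1:-1] slice strips the curly quotation marks
```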
Step 3: Extract Author Names
Next, we find all author names and store them in a list.
authors = []
for i in soup.find_all('small', class_='author'):
    authors.append(i.text)
Step 4: Extract quote details, tags and author links
In this step, we extract all the information related to each quote: the quote text, the author name, the author profile link and all tags, then print everything for verification.
for sp in soup.find_all('div', class_='quote'):
    quote = sp.find('span', class_='text').text[1:-1]
    authors = sp.find('small', class_='author').text
    details = sp.find('a').get('href')
    tags = []
    for tag in sp.find_all('a', class_='tag'):
        tags.append(tag.text)
    tags = ','.join(tags)
    print(quote)
    print(authors)
    print(details)
    print(tags)
    print("*" * 127)
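The author profile link comes from the first anchor tag inside each quote block, as can be seen on a made-up fragment with the same structure:

```python
from bs4 import BeautifulSoup

# A made-up fragment with the same structure as a quote block
snippet = '<div class="quote"><a href="/author/Jane-Doe">(about)</a></div>'
block = BeautifulSoup(snippet, 'html.parser').find('div', class_='quote')
href = block.find('a').get('href')
print(href)
```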
Step 5: Store Extracted Data in a List
We now store all extracted values together so they can be converted into a table later.
data = []
for sp in soup.find_all('div', class_='quote'):
    quote = sp.find('span', class_='text').text[1:-1]
    authors = sp.find('small', class_='author').text
    details = sp.find('a').get('href')
    tags = []
    for tag in sp.find_all('a', class_='tag'):
        tags.append(tag.text)
    tags = ','.join(tags)
    data.append([quote, authors, details, tags])
Step 6: Collect author elements
This step collects all author HTML elements from the page, which can be useful for further inspection or advanced data extraction.
authors_1 = []
for i in soup.find_all('small', class_='author'):
    authors_1.append(i)
authors_1
Step 7: Extract tags
Here, we extract all tags associated with a quote to understand the themes or categories linked to it (the sp variable still holds the last quote block from the earlier loop).
tags = []
for tag in sp.find_all('a', class_='tag'):
    tags.append(tag.text)
tags = ','.join(tags)
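Joining the tags into a single comma-separated string keeps each row flat; the original list can be recovered later with split. A quick illustration on made-up tags:

```python
# Joining keeps each row flat; splitting recovers the list when needed
tags = ','.join(['love', 'life', 'truth'])
print(tags)
print(tags.split(','))
```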
Step 8: Extract only quote text
This step focuses on isolating just the quote text from the HTML structure for clean and direct use.
for sp in soup.find_all('div', class_='quote'):
    quote = sp.find('span', class_='text').text[1:-1]
quote
Output:
A day without sunshine is like, you know, night.
Step 9: Convert data into a DataFrame
This converts the scraped data into a table, making it easier to analyze and store.
import pandas as pd
df = pd.DataFrame(data, columns=['Quote', 'Author', 'details', 'Tags'])
df.head()
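Once in a DataFrame, the data is also easy to persist. A small sketch on a made-up row (the filename quotes.csv is an arbitrary choice):

```python
import pandas as pd

# A single made-up row with the same columns as the scraped DataFrame
df = pd.DataFrame([['A sample quote', 'Jane Doe', '/author/Jane-Doe', 'life,truth']],
                  columns=['Quote', 'Author', 'details', 'Tags'])
df.to_csv('quotes.csv', index=False)  # omit the index column in the file
```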
Step 10: Scrape multiple pages
In this step, we automate the scraping process across multiple pages to gather a larger and more complete dataset.
from tqdm import tqdm

Multiple_Pages = []
for page in tqdm(range(1, 11)):
    link = 'http://quotes.toscrape.com/page/' + str(page)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    for sp in soup.find_all('div', class_='quote'):
        quote = sp.find('span', class_='text').text[1:-1]
        authors = sp.find('small', class_='author').text
        details = sp.find('a').get('href')
        tags = []
        for tag in sp.find_all('a', class_='tag'):
            tags.append(tag.text)
        tags = ','.join(tags)
        Multiple_Pages.append([quote, authors, details, tags])
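The loop above simply varies the page number in the URL. The URLs it visits can be built as shown below; when scraping real sites, adding a short time.sleep between requests is a polite default (quotes.toscrape.com is a practice site, so it matters less here):

```python
# The ten page URLs visited by the loop above
base = 'http://quotes.toscrape.com/page/'
urls = [base + str(page) for page in range(1, 11)]
print(urls[0])
```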
Step 11: Create final DataFrame
This converts all scraped pages into a DataFrame, renames the columns and builds full author profile URLs.
Multiple_Pages_df = pd.DataFrame(data=Multiple_Pages)
Multiple_Pages_df = Multiple_Pages_df.rename(
    columns={0: 'Quote', 1: 'Author', 2: 'Author_id', 3: 'Tags'})
# Author_id already starts with '/', so the base URL has no trailing slash
Multiple_Pages_df['Author_Link'] = 'http://quotes.toscrape.com' + \
    Multiple_Pages_df['Author_id']
Multiple_Pages_df.head()
Now we have created a DataFrame that can be used further for analysis and model building.
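As a quick example of such analysis, we can count how many quotes each author contributed (shown here on a small made-up frame rather than the scraped one):

```python
import pandas as pd

# Made-up rows standing in for the scraped DataFrame
df = pd.DataFrame({'Author': ['Albert Einstein', 'Jane Austen', 'Albert Einstein'],
                   'Tags': ['science', 'books', 'life']})
counts = df['Author'].value_counts()
print(counts)
```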