Web scraping is the process of automatically extracting data from websites and converting it into a structured format such as tables or files. In this article, we will learn how to scrape quotes from a website using Python libraries like Requests and BeautifulSoup and store the extracted data in a DataFrame for analysis.
Prerequisites
Install the following Python libraries:
pip install requests beautifulsoup4 pandas tqdm
Implementation
Step 1: Import Required Libraries and Connect to the Website
We first import the required libraries, requests and BeautifulSoup, then send a request to the website.
- requests.get(): fetches the webpage.
- BeautifulSoup(): parses the HTML content so we can extract data.
Website used in this article: https://quotes.toscrape.com/
import requests
from bs4 import BeautifulSoup
link = 'https://quotes.toscrape.com/'
res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')
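Before parsing, it is worth confirming that the request actually succeeded. A minimal sketch (the 10-second timeout value is an arbitrary choice, not from the original code):

```python
import requests

# Confirm the request succeeded before parsing; the timeout is an arbitrary choice
res = requests.get('https://quotes.toscrape.com/', timeout=10)
res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
```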
Step 2: Extract All Quotes Text
Now, we extract all the quote texts present on the page.
quotes = []
for quote in soup.find_all('span', class_='text'):
    quotes.append(quote.text[1:-1])
    print(quote.text[1:-1], "\n")
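To see what find_all is matching, the same extraction can be tried on a small made-up fragment that mimics the site's markup:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the markup used on quotes.toscrape.com
html = '<span class="text">“Hello, world.”</span>'
demo = BeautifulSoup(html, 'html.parser')
texts = [q.text[1:-1] for q in demo.find_all('span', class_='text')]
print(texts)  # the [1:-1] slice strips the curly quotation marks
```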
Step 3: Extract Author Names
Next, we find all author names and store them in a list.
authors = []
for i in soup.find_all('small', class_='author'):
    authors.append(i.text)
Step 4: Extract quote details, tags and author links
In this step, we extract all the information related to each quote: the quote text, the author name, the author profile link and all tags, then print everything for verification.
for sp in soup.find_all('div', class_='quote'):
    quote = sp.find('span', class_='text').text[1:-1]
    authors = sp.find('small', class_='author').text
    details = sp.find('a').get('href')
    tags = []
    for tag in sp.find_all('a', class_='tag'):
        tags.append(tag.text)
    tags = ','.join(tags)
    print(quote)
    print(authors)
    print(details)
    print(tags)
    print("*" * 127)
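The author profile link comes from the first anchor tag inside each quote block, as can be seen on a made-up fragment with the same structure:

```python
from bs4 import BeautifulSoup

# A made-up fragment with the same structure as a quote block
snippet = '<div class="quote"><a href="/author/Jane-Doe">(about)</a></div>'
block = BeautifulSoup(snippet, 'html.parser').find('div', class_='quote')
href = block.find('a').get('href')
print(href)
```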
Step 5: Store Extracted Data in a List
We now store all extracted values together so they can be converted into a table later.
data = []
for sp in soup.find_all('div', class_='quote'):
    quote = sp.find('span', class_='text').text[1:-1]
    authors = sp.find('small', class_='author').text
    details = sp.find('a').get('href')
    tags = []
    for tag in sp.find_all('a', class_='tag'):
        tags.append(tag.text)
    tags = ','.join(tags)
    data.append([quote, authors, details, tags])
Step 6: Collect author elements
This step collects all author HTML elements from the page, which can be useful for further inspection or advanced data extraction.
authors_1 = []
for i in soup.find_all('small', class_='author'):
    authors_1.append(i)
authors_1
Step 7: Extract tags
Here, we extract all tags associated with a quote to understand the themes or categories linked to it (the sp variable still holds the last quote block from the earlier loop).
tags = []
for tag in sp.find_all('a', class_='tag'):
    tags.append(tag.text)
tags = ','.join(tags)
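Joining the tags into a single comma-separated string keeps each row flat; the original list can be recovered later with split. A quick illustration on made-up tags:

```python
# Joining keeps each row flat; splitting recovers the list when needed
tags = ','.join(['love', 'life', 'truth'])
print(tags)
print(tags.split(','))
```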
Step 8: Extract only quote text
This step focuses on isolating just the quote text from the HTML structure for clean and direct use.
for sp in soup.find_all('div', class_='quote'):
    quote = sp.find('span', class_='text').text[1:-1]
quote
Output:
A day without sunshine is like, you know, night.
Step 9: Convert data into a DataFrame
This converts the scraped data into a table, making it easier to analyze and store.
import pandas as pd
df = pd.DataFrame(data, columns=['Quote', 'Author', 'details', 'Tags'])
df.head()
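Once in a DataFrame, the data is also easy to persist. A small sketch on a made-up row (the filename quotes.csv is an arbitrary choice):

```python
import pandas as pd

# A single made-up row with the same columns as the scraped DataFrame
df = pd.DataFrame([['A sample quote', 'Jane Doe', '/author/Jane-Doe', 'life,truth']],
                  columns=['Quote', 'Author', 'details', 'Tags'])
df.to_csv('quotes.csv', index=False)  # omit the index column in the file
```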
Step 10: Scrape multiple pages
In this step, we automate the scraping process across multiple pages to gather a larger and more complete dataset.
from tqdm import tqdm

Multiple_Pages = []
for page in tqdm(range(1, 11)):
    link = 'http://quotes.toscrape.com/page/' + str(page)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    for sp in soup.find_all('div', class_='quote'):
        quote = sp.find('span', class_='text').text[1:-1]
        authors = sp.find('small', class_='author').text
        details = sp.find('a').get('href')
        tags = []
        for tag in sp.find_all('a', class_='tag'):
            tags.append(tag.text)
        tags = ','.join(tags)
        Multiple_Pages.append([quote, authors, details, tags])
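The loop above simply varies the page number in the URL. The URLs it visits can be built as shown below; when scraping real sites, adding a short time.sleep between requests is a polite default (quotes.toscrape.com is a practice site, so it matters less here):

```python
# The ten page URLs visited by the loop above
base = 'http://quotes.toscrape.com/page/'
urls = [base + str(page) for page in range(1, 11)]
print(urls[0])
```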
Step 11: Create final DataFrame
This converts all scraped pages into a DataFrame, renames the columns and builds full author profile URLs.
Multiple_Pages_df = pd.DataFrame(data=Multiple_Pages)
Multiple_Pages_df = Multiple_Pages_df.rename(
    columns={0: 'Quote', 1: 'Author', 2: 'Author_id', 3: 'Tags'})
# Author_id already starts with '/', so the base URL has no trailing slash
Multiple_Pages_df['Author_Link'] = 'http://quotes.toscrape.com' + \
    Multiple_Pages_df['Author_id']
Multiple_Pages_df.head()
Now we have created a DataFrame that can be used further for analysis and model building.
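As a quick example of such analysis, we can count how many quotes each author contributed (shown here on a small made-up frame rather than the scraped one):

```python
import pandas as pd

# Made-up rows standing in for the scraped DataFrame
df = pd.DataFrame({'Author': ['Albert Einstein', 'Jane Austen', 'Albert Einstein'],
                   'Tags': ['science', 'books', 'life']})
counts = df['Author'].value_counts()
print(counts)
```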