Getting Started with BeautifulSoup4 – Part 2: Extracting Real Data from a Website


🥣 Getting Started with BeautifulSoup4

Welcome back! 👋 In the previous post, we learned the basics of BeautifulSoup4 and wrote a small program to extract links from a simple web page.

Now, let’s take it up a notch.

In this post, we’ll:

  • Download a real website’s HTML
  • Extract meaningful data (like article titles)
  • Print it in a clean, readable format

This will give you practical web scraping skills you can build on. Let’s go! 🚀


🧰 What We’ll Be Scraping

For this demo, we’ll use a real, beginner-friendly site: https://quotes.toscrape.com

This site was built specifically for practicing web scraping, so you can scrape it without worrying about permission.

We’ll extract:

  • Quote text
  • Author name

🧪 Full Code Example

Here’s the complete code, followed by a breakdown of what each part does:

import requests
from bs4 import BeautifulSoup

# Step 1: Download the webpage
url = "https://quotes.toscrape.com"
response = requests.get(url)

# Step 2: Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: Find all quote containers
quote_blocks = soup.find_all('div', class_='quote')

# Step 4: Loop through each quote and extract text and author
for quote in quote_blocks:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    # The quote text on this site already includes its own curly quotation marks
    print(f'{text} — {author}')

🧱 Code Breakdown

🧩 1. Import the Libraries

import requests
from bs4 import BeautifulSoup

We need:

  • requests to fetch the website’s HTML
  • BeautifulSoup to parse and search the HTML content
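
Before going further, it can help to confirm that both packages are actually installed (they are published on PyPI as requests and beautifulsoup4). Here is a minimal check, assuming you run it in the same environment you will use for the scraper:

```python
# Quick sanity check: both imports succeed only if the packages are installed.
import requests
import bs4

# Each package exposes its version as a string.
print("requests version:", requests.__version__)
print("beautifulsoup4 version:", bs4.__version__)
```

If either import raises ModuleNotFoundError, install the missing package and try again.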

🌍 2. Fetch the Web Page

url = "https://quotes.toscrape.com"
response = requests.get(url)

  • We set the target URL
  • requests.get(url) downloads the page
  • response.text contains the raw HTML
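
Real-world requests can hang or come back with an error page, so you may want a slightly more defensive fetch. The sketch below is one possible approach, not part of the original script: the helper name fetch_html and the 10-second timeout are illustrative assumptions.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising an error on failure.

    fetch_html and the default timeout are hypothetical choices for this sketch.
    """
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

You could then write html = fetch_html("https://quotes.toscrape.com") and pass html to BeautifulSoup instead of response.text.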

🧹 3. Parse the HTML

soup = BeautifulSoup(response.text, 'html.parser')

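This line parses the raw HTML into a BeautifulSoup object (soup): a tree of tags you can search and navigate, rather than one long string. The second argument, 'html.parser', selects Python's built-in HTML parser, so there is nothing extra to install.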
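
To get a feel for what the parsed object offers, here is a tiny sketch that runs on a hand-written HTML string instead of a live page. The sample markup is made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML so the example runs without any network access
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body><h1>Hello, soup!</h1></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached as attributes of the soup object
page_title = soup.title.get_text()
heading = soup.h1.get_text()

print(page_title)
print(heading)
```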


🔍 4. Find Quote Containers

quote_blocks = soup.find_all('div', class_='quote')

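Each quote on the page sits inside a <div> with the class quote. find_all() returns a list of every matching tag. The argument is spelled class_ (with a trailing underscore) because class is a reserved word in Python.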
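
Here is a small sketch of that lookup on made-up HTML, alongside the CSS-selector style (soup.select) that some people prefer. Both approaches should find the same blocks; the sample markup is an assumption for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two quote blocks and one unrelated div
sample_html = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote"><span class="text">Second quote</span></div>
<div class="sidebar">Not a quote</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keyword-argument style, as used in this post
quote_blocks = soup.find_all("div", class_="quote")
print(len(quote_blocks))

# Equivalent CSS selector style
same_blocks = soup.select("div.quote")
print(len(same_blocks))
```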


🔧 5. Extract Quote and Author

for quote in quote_blocks:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    # The quote text on this site already includes its own curly quotation marks
    print(f'{text} — {author}')

Let’s break this down:

  • Loop through each quote block
  • Use .find() to grab the first matching tag inside the block: the <span> holding the quote text and the <small> holding the author
  • .get_text() extracts the actual text content
  • Finally, print the quote and author nicely formatted
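
One thing to watch for: .find() returns None when nothing matches, and calling .get_text() on None raises an AttributeError. Below is a hedged sketch of a more forgiving loop that skips incomplete blocks. The function name extract_quotes and the sample HTML are hypothetical, not part of the original script.

```python
from bs4 import BeautifulSoup


def extract_quotes(html):
    """Return a list of {"text", "author"} dicts, skipping incomplete blocks."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all("div", class_="quote"):
        text_tag = block.find("span", class_="text")
        author_tag = block.find("small", class_="author")
        # find() returns None when a tag is missing, so check before using it
        if text_tag is None or author_tag is None:
            continue
        results.append({
            "text": text_tag.get_text(strip=True),
            "author": author_tag.get_text(strip=True),
        })
    return results


# Made-up HTML: the second block has no author and gets skipped
sample_html = """
<div class="quote">
  <span class="text">Sample quote text</span>
  <small class="author">Sample Author</small>
</div>
<div class="quote">
  <span class="text">A quote with no author tag</span>
</div>
"""

quotes = extract_quotes(sample_html)
print(quotes)
```

Returning a list of dictionaries instead of printing right away also makes it easier to save the data later, for example to a CSV or JSON file.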

📦 Example Output

When you run the script, you’ll see something like:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” — Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.” — J.K. Rowling
...

Beautiful, right? 😄


💡 Bonus Tip: Viewing the HTML Structure

To understand what to extract, always inspect the page using your browser’s Developer Tools (right-click → Inspect). Look at the HTML tags and class names.
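
You can do something similar from Python, too. As a rough sketch, prettify() prints a tag with one element per line and indentation, which makes the nesting and class names easier to spot. The sample HTML below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML squeezed onto one line, like much real-world markup
sample_html = (
    '<div class="quote"><span class="text">Sample quote</span>'
    '<small class="author">Sample Author</small></div>'
)

soup = BeautifulSoup(sample_html, "html.parser")
first_block = soup.find("div", class_="quote")

# prettify() returns an indented, multi-line string version of the tag
pretty = first_block.prettify()
print(pretty)
```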


⚠️ Friendly Reminder

  • Only scrape sites you have permission to scrape.
  • Be respectful: don’t overload servers with too many requests.
  • Use time.sleep() between requests if scraping multiple pages.
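
Here is a minimal sketch of that last tip. It assumes the practice site serves further pages at URLs like https://quotes.toscrape.com/page/2/, and the names scrape_pages, fetch_page, and delay_seconds are hypothetical. The fetching function is passed in as an argument, so the demo at the bottom can use a stand-in and run without touching the network.

```python
import time


def scrape_pages(fetch_page, page_numbers, delay_seconds=1.0):
    """Fetch several pages in order, pausing between consecutive requests."""
    pages = []
    for index, page_number in enumerate(page_numbers):
        # Pause before every request except the first one
        if index > 0:
            time.sleep(delay_seconds)
        # Assumed URL pattern for the practice site's pagination
        url = f"https://quotes.toscrape.com/page/{page_number}/"
        pages.append(fetch_page(url))
    return pages


def fake_fetch(url):
    """Stand-in for a real download, used only for this demo."""
    return f"<html>{url}</html>"


demo_pages = scrape_pages(fake_fetch, [1, 2], delay_seconds=0)
print(len(demo_pages))
```

In a real run you would pass a function that actually downloads the page (for example, one built on requests.get) and keep the delay at a second or more.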

✅ What’s Next?

You’ve now learned to:

  • Scrape a real website
  • Extract specific data
  • Print it in a readable format
