🥣 Getting Started with BeautifulSoup4
Welcome back! 👋 In the previous post, we learned the basics of BeautifulSoup4 and wrote a small program to extract links from a simple web page.
Now, let’s take it up a notch.
In this post, we’ll:
- Download a real website’s HTML
- Extract meaningful data (quote text and author names)
- Print it in a clean, readable format
This will give you practical web scraping skills you can build on. Let’s go! 🚀
🧰 What We’ll Be Scraping
For this demo, we’ll use a real, beginner-friendly site: https://quotes.toscrape.com
This website was built specifically for practicing web scraping, so you can scrape it without worrying about permission.
We’ll extract:
- Quote text
- Author name
🧪 Full Code Example
Here’s the complete code, followed by a breakdown of what each part does:
import requests
from bs4 import BeautifulSoup
# Step 1: Download the webpage
url = "https://quotes.toscrape.com"
response = requests.get(url)
# Step 2: Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Step 3: Find all quote containers
quote_blocks = soup.find_all('div', class_='quote')
# Step 4: Loop through each quote and extract text and author
for quote in quote_blocks:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    print(f'"{text}" — {author}')
🧱 Code Breakdown
🧩 1. Import the Libraries
import requests
from bs4 import BeautifulSoup
We need:
- requests to fetch the website’s HTML
- BeautifulSoup to parse and search the HTML content
🌍 2. Fetch the Web Page
url = "https://quotes.toscrape.com"
response = requests.get(url)
- We set the target URL
- requests.get(url) downloads the page
- response.text contains the raw HTML
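The fetch step assumes the request always succeeds. Here is a minimal sketch of a more defensive version, assuming a 10-second timeout suits your connection. The helper name fetch_html and its get parameter are hypothetical; get is there only so a stand-in can be passed when you want to try the function without network access.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return a page's HTML as text, raising if the request fails.

    fetch_html and its `get` parameter are illustrative names, not part
    of the requests library.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    # so an error page is never parsed as if it were real content.
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html("https://quotes.toscrape.com")
```

Without a timeout, requests.get() can wait indefinitely on a server that never answers, so setting one is usually worth the extra argument.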
🧹 3. Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
This line gives us a BeautifulSoup object (soup) to work with. Think of it as a structured version of the raw HTML.
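To see what "structured" means in practice, here is a small sketch that parses a hand-written HTML string (made up for illustration, not taken from the real site) and reads a few elements from the resulting soup object:

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used here instead of a real download.
html = """
<html>
  <head><title>Demo Page</title></head>
  <body>
    <h1>Hello</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.get_text()                  # text inside <title>
heading = soup.find("h1").get_text()                # first <h1> on the page
intro = soup.find("p", class_="intro").get_text()   # first <p class="intro">

print(page_title)
print(heading)
print(intro)
```

Once the HTML is parsed, you navigate it by tag name and attributes rather than by searching through one long string.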
🔍 4. Find Quote Containers
quote_blocks = soup.find_all('div', class_='quote')
Each quote on the page is inside a <div> with the class quote. This line finds all such blocks.
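Here is a quick sketch of how the class filter behaves, using a made-up HTML snippet. It also shows soup.select(), which takes a CSS selector and should return the same elements here:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div class="quote">First</div>
<div class="quote featured">Second</div>
<div class="sidebar">Not a quote</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any element that has "quote" among its classes,
# so the "quote featured" div is included too.
by_find_all = soup.find_all("div", class_="quote")

# The same search written as a CSS selector.
by_selector = soup.select("div.quote")

texts = [block.get_text() for block in by_find_all]
print(texts)
print(len(by_selector))
```

Which style you use is mostly a matter of taste: find_all() reads well for simple lookups, while select() can be handier once you need nested selectors.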
🔧 5. Extract Quote and Author
for quote in quote_blocks:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    print(f'"{text}" — {author}')
Let’s break this down:
- Loop through each quote block
- Use .find() to get the quote text and author
- .get_text() extracts the actual text content
- Finally, print the quote and author nicely formatted
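One thing to watch for: .find() returns None when nothing matches, and calling .get_text() on None raises an AttributeError. Below is a minimal sketch of a safer extraction, using made-up HTML in which one quote has no author. The helper text_or_default is a hypothetical name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second quote deliberately lacks an author element.
html = """
<div class="quote">
  <span class="text">  Stay curious.  </span>
  <small class="author">Ada Example</small>
</div>
<div class="quote">
  <span class="text">No author on this one.</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_default(parent, tag, class_name, default="Unknown"):
    """Return the stripped text of the first match, or a default if absent."""
    element = parent.find(tag, class_=class_name)
    if element is None:
        return default
    # strip=True trims leading and trailing whitespace from the text.
    return element.get_text(strip=True)


results = []
for quote in soup.find_all("div", class_="quote"):
    text = text_or_default(quote, "span", "text")
    author = text_or_default(quote, "small", "author")
    results.append((text, author))
    print(f"{text} - {author}")
```

On the practice site every quote has both elements, so the simpler loop works fine. A guard like this tends to matter more on real-world pages with inconsistent markup.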
📦 Example Output
When you run the script, you’ll see something like this. The quote text on the site already includes its own curly quotation marks, so the straight quotes added by print() wrap around them:
"“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”" — Albert Einstein
"“It is our choices, Harry, that show what we truly are, far more than our abilities.”" — J.K. Rowling
...
Beautiful, right? 😄
💡 Bonus Tip: Viewing the HTML Structure
To understand what to extract, always inspect the page using your browser’s Developer Tools (right-click → Inspect). Look at the HTML tags and class names.
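You can also peek at the structure from Python. This sketch, using a made-up snippet in place of a downloaded page, prints indented HTML with prettify() and lists every class name found in the markup:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
html = (
    '<div class="quote">'
    '<span class="text">Hi</span>'
    '<small class="author">Me</small>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the HTML re-indented, one tag per line.
print(soup.prettify())

# find_all(True) matches every tag; "class" comes back as a list of names.
class_names = sorted(
    {name for tag in soup.find_all(True) for name in tag.get("class", [])}
)
print(class_names)
```

This is no replacement for Developer Tools, but it can be a handy sanity check that the HTML your script received matches what you see in the browser.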
⚠️ Friendly Reminder
- Only scrape sites you have permission to scrape.
- Be respectful: don’t overload servers with too many requests.
- Use time.sleep() between requests if scraping multiple pages.
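Here is a rough sketch of what a polite multi-page loop could look like. The function name scrape_all_pages, the fetch parameter, the one-second delay, and the li.next > a selector for the "next page" link are all assumptions made for illustration. Check the real pagination markup in Developer Tools before relying on the selector.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(start_url, fetch, delay_seconds=1.0, max_pages=5):
    """Collect (text, author) pairs while following "next" links.

    `fetch` is any callable that takes a URL and returns HTML text.
    The "li.next > a" selector is an assumption about the pagination markup.
    """
    results = []
    url = start_url
    for page_number in range(max_pages):
        if page_number > 0:
            time.sleep(delay_seconds)  # pause before every request after the first

        soup = BeautifulSoup(fetch(url), "html.parser")
        for quote in soup.find_all("div", class_="quote"):
            text = quote.find("span", class_="text").get_text()
            author = quote.find("small", class_="author").get_text()
            results.append((text, author))

        next_link = soup.select_one("li.next > a")
        if next_link is None:
            break  # no further pages
        # Resolve a relative href such as "/page/2/" against the current URL.
        url = urljoin(url, next_link["href"])
    return results


# Example call (needs network access and the requests library):
# import requests
# quotes = scrape_all_pages(
#     "https://quotes.toscrape.com",
#     fetch=lambda u: requests.get(u, timeout=10).text,
# )
```

The max_pages cap is a safety net so that a mistake in the "next" selector cannot turn into an endless crawl.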
✅ What’s Next?
You’ve now learned to:
- Scrape a real website
- Extract specific data
- Print it in a readable format