BeautifulSoup4
Have you ever wanted to extract information from a website automatically? That’s what web scraping is all about. In this post, we’ll introduce you to a powerful Python library called BeautifulSoup4 that makes web scraping simple and fun!
We’ll go through a small program and explain each part step-by-step. By the end, you’ll know how to pull data from a webpage using BeautifulSoup4.
🌐 What is BeautifulSoup4?
BeautifulSoup4 is a Python library used to parse HTML or XML documents. It creates a parse tree from page source code, so you can easily extract the data you need—like titles, links, and more.
🧰 Prerequisites
Before we dive in, make sure you have the following installed:
pip install beautifulsoup4 requests
- `beautifulsoup4`: for parsing HTML.
- `requests`: for downloading the webpage.
🧪 The Example Program
We’ll write a Python program that:
- Downloads a webpage
- Parses the HTML
- Extracts and prints all the links (`<a>` tags)
Here’s the full code, followed by an explanation:
```python
import requests
from bs4 import BeautifulSoup

# Step 1: Download the webpage
url = "https://example.com"
response = requests.get(url)

# Step 2: Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: Find all the links
links = soup.find_all('a')

# Step 4: Print the links
for link in links:
    href = link.get('href')
    print(href)
```
🧱 Code Explanation (Block by Block)
🧩 1. Import Libraries
```python
import requests
from bs4 import BeautifulSoup
```

- `requests` helps us make HTTP requests (like downloading a webpage).
- `BeautifulSoup` is used to parse and search the HTML content.
🌍 2. Download the Webpage
```python
url = "https://example.com"
response = requests.get(url)
```

- We define a URL to scrape.
- `requests.get(url)` fetches the page.
- `response.text` contains the HTML content of the page.
🧹 3. Parse the HTML
```python
soup = BeautifulSoup(response.text, 'html.parser')
```

- This creates a BeautifulSoup object named `soup`.
- It uses Python's built-in `'html.parser'` to read the HTML.
- Now `soup` holds the whole structure of the webpage in a searchable format.
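To see what the parsed tree gives you without hitting the network, here's a small offline sketch; the HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page, so no network request is needed
html = "<html><head><title>Demo</title></head><body><p>Hello!</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)          # text inside <title>: Demo
print(soup.find('p').get_text())  # text inside the first <p>: Hello!
```

The same navigation works on any page you download with `requests`.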
🔗 4. Find All the Links
```python
links = soup.find_all('a')
```

- `find_all('a')` finds all `<a>` (anchor) tags in the HTML.
- These tags usually represent hyperlinks.
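Note that not every `<a>` tag has an `href`. You can ask `find_all` to keep only the ones that do by passing `href=True`. A quick sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup

# Made-up fragment: two real links and one anchor without an href
html = '<a href="/about">About</a><a name="top"></a><a href="/contact">Contact</a>'
soup = BeautifulSoup(html, 'html.parser')

all_anchors = soup.find_all('a')        # every <a> tag, href or not
linked = soup.find_all('a', href=True)  # only <a> tags that carry an href

print(len(all_anchors))  # 3
print(len(linked))       # 2
```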
🖨️ 5. Print the Links
```python
for link in links:
    href = link.get('href')
    print(href)
```

- We loop through each link.
- `link.get('href')` extracts the actual URL inside the `<a>` tag.
- Then we print it!
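Many of the printed hrefs will be relative paths like `/about`. The standard library's `urllib.parse.urljoin` can turn them into full URLs; here's a sketch using a hypothetical base URL and inline HTML:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com"  # hypothetical page the links came from
html = '<a href="/about">About</a><a href="https://other.org/x">X</a>'
soup = BeautifulSoup(html, 'html.parser')

# urljoin resolves relative paths and leaves absolute URLs untouched
absolute = [urljoin(base_url, a.get('href')) for a in soup.find_all('a')]
print(absolute)  # ['https://example.com/about', 'https://other.org/x']
```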
🎯 Output Example
If you run this code on a real website, you might see output like:

```
https://www.iana.org/domains/example
#
/about
/contact
```
These are the links found on the page!
🧠 Tips for Beginners
- Always check the site's `robots.txt` file (e.g., `example.com/robots.txt`) to see if scraping is allowed.
- Avoid scraping pages too quickly: add a delay (`time.sleep`) between requests.
- This example works only with public HTML content, not JavaScript-loaded pages.
✅ Conclusion
Congratulations! You’ve just built your first web scraper using BeautifulSoup4. 🎉
Try changing the URL and exploring what else you can extract: headings (`<h1>`), paragraphs (`<p>`), or even images (`<img>`).
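Those extractions follow the same `find_all` pattern as the links. Here's a sketch against a made-up page fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
html = """
<h1>Welcome</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<img src="/logo.png" alt="Logo">
"""
soup = BeautifulSoup(html, 'html.parser')

headings = [h.get_text() for h in soup.find_all('h1')]
paragraphs = [p.get_text() for p in soup.find_all('p')]
images = [img.get('src') for img in soup.find_all('img')]

print(headings)    # ['Welcome']
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
print(images)      # ['/logo.png']
```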
Stay tuned for the next post, where we’ll scrape data into a CSV file!