🌍 Overview
In this blog post, we will walk through building a real-world web scraping project using Python and BeautifulSoup4. Our goal is to scrape every book listed on Books to Scrape (books.toscrape.com), a sandbox site built specifically for scraping practice, extracting detailed information such as:
- Title
- Price
- Rating (converted to a number)
- Stock availability
- Product description
- Category
- UPC, Tax info, and more
By the end of this tutorial, you will have a fully working project that scrapes all 1,000 books on the site and saves the data into a CSV file for further analysis.
🔧 Prerequisites
Make sure you have Python installed, then install the necessary packages:
```bash
pip install requests beautifulsoup4
```
📂 Project Structure
```text
book_scraper_pro/
├── main.py
├── scraper.py
└── books.csv
```
📃 scraper.py
This file contains all the logic for fetching, parsing, and extracting book data from the website.
```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://books.toscrape.com/"


def get_soup(url):
    """Fetch a page and return a parsed BeautifulSoup object, or None on failure."""
    res = requests.get(url)
    if res.status_code != 200:
        return None
    # Parse from raw bytes so BeautifulSoup detects the encoding itself,
    # which avoids mangled "£" signs in the prices.
    return BeautifulSoup(res.content, 'html.parser')


def get_rating_as_number(class_list):
    """Convert a star-rating CSS class (e.g. 'star-rating Three') to an int."""
    ratings = {
        'One': 1, 'Two': 2, 'Three': 3,
        'Four': 4, 'Five': 5
    }
    for cls in class_list:
        if cls in ratings:
            return ratings[cls]
    return 0


def scrape_book_details(book_url):
    """Scrape a product page for its description, category, and product table."""
    soup = get_soup(book_url)
    if not soup:
        return {}
    description = soup.find('meta', attrs={"name": "description"})
    description = description['content'].strip() if description else "No description."
    # The product information table holds UPC, prices, tax, and availability.
    table = soup.find('table', class_='table table-striped')
    info = {row.th.get_text(): row.td.get_text() for row in table.find_all('tr')}
    # The third breadcrumb entry is the book's category.
    category = soup.select_one('ul.breadcrumb li:nth-of-type(3) a').get_text()
    return {
        'description': description,
        'upc': info.get('UPC'),
        'product_type': info.get('Product Type'),
        'price_excl_tax': info.get('Price (excl. tax)'),
        'price_incl_tax': info.get('Price (incl. tax)'),
        'tax': info.get('Tax'),
        'availability': info.get('Availability'),
        'category': category
    }


def scrape_all_books():
    """Walk every catalogue page, scraping each book's listing and detail page."""
    books = []
    next_page = "catalogue/page-1.html"
    while next_page:
        page_url = BASE_URL + next_page
        print(f"Scraping {page_url}")
        soup = get_soup(page_url)
        if not soup:
            break
        for book in soup.select('article.product_pod'):
            title = book.h3.a['title']
            # Strip the leading currency symbol from e.g. "£51.77".
            price = book.select_one('.price_color').get_text().lstrip('£')
            rating = get_rating_as_number(book.p['class'])
            # Resolve the relative href against the current page URL; urljoin
            # handles both plain relative links and "../../../..." style links.
            book_url = urljoin(page_url, book.h3.a['href'])
            details = scrape_book_details(book_url)
            books.append({
                'title': title,
                'price': price,
                'rating': rating,
                'url': book_url,
                **details
            })
        next_btn = soup.select_one('li.next a')
        next_page = 'catalogue/' + next_btn['href'] if next_btn else None
        time.sleep(1)  # Be polite to the server between pages
    return books
```
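Before kicking off the full crawl (50 catalogue pages plus 1,000 detail pages), it helps to smoke-test the detail parser on a single book. Here is a minimal sketch, assuming scraper.py above sits in the same directory; the URL is just one example product page from the site:

```python
from scraper import scrape_book_details

# One example product page; any book URL from the site works here.
url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

details = scrape_book_details(url)
for key, value in details.items():
    print(f"{key}: {value}")
```

If this prints a description, category, and the product table fields, the full crawl should work too.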
📃 main.py
This file runs the scraping logic and saves the results to a CSV file.
```python
import csv

from scraper import scrape_all_books


def save_books_csv(books, filename='books.csv'):
    """Write the scraped books to a CSV file, one row per book."""
    if not books:
        print("No books to save.")
        return
    # Use the first book's keys as the CSV header.
    keys = books[0].keys()
    with open(filename, mode='w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(books)


def main():
    print("Scraping all books from books.toscrape.com...")
    books = scrape_all_books()
    print(f"Total books scraped: {len(books)}")
    save_books_csv(books)
    print("Data saved to books.csv")


if __name__ == "__main__":
    main()
```
📊 Output Sample
The resulting `books.csv` will contain rows like:
| title | price | rating | category | availability | description |
|---|---|---|---|---|---|
| A Light in the Attic | 51.77 | 3 | Poetry | In stock | A book of poems and whimsical illustrations. |
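To sanity-check the output without opening a spreadsheet, a few lines of standard-library Python are enough (a minimal sketch, assuming books.csv sits in the working directory):

```python
import csv

# Read the CSV back and inspect the row count and a sample record.
with open('books.csv', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} books scraped")  # expect 1000 for the full site
print(rows[0]['title'], rows[0]['price'], rows[0]['category'])
```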
🤖 Tips and Next Steps
Here are some fun ways you can extend this project:
- Filter books by price or rating
- Export to JSON or Excel
- Analyze trends with pandas (see the sketch after this list)
- Build a simple web interface with Flask
- Store in a SQLite or MongoDB database
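As a starting point for the pandas idea, here is a minimal sketch, assuming you have installed pandas (`pip install pandas`) and already generated books.csv; the column names match the CSV written by main.py:

```python
import pandas as pd

# Load the scraped data; 'price' parses as a float since the
# currency symbol was stripped during scraping.
df = pd.read_csv('books.csv')

# Average price and rating per category, most expensive first.
summary = (
    df.groupby('category')[['price', 'rating']]
      .mean()
      .sort_values('price', ascending=False)
)
print(summary.head(10))

# This also covers the JSON tip: export the full dataset to JSON.
df.to_json('books.json', orient='records', indent=2)
```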
🎓 Summary
In this tutorial, you built a complete web scraper using BeautifulSoup that extracts valuable book data from a real website. You structured your code cleanly, handled pagination, and saved the results into a CSV for analysis. This is an excellent foundation for more advanced data scraping and analysis projects.
Happy scraping! 🚀