How to Web Scrape with Python

What you’ll build or solve

You’ll fetch a web page, parse its HTML, and extract the data you care about into a clean Python structure.

When this approach works best

Web scraping works well when you:

  • Pull public data from a page that has no API, like a directory page or simple listings.
  • Collect repeated page elements, like product cards, job posts, or article headlines.
  • Monitor a page for changes, like “new items added” or “price changed.”

Avoid scraping when the site offers an API or export option. Also skip scraping pages behind logins, paywalls, or access controls unless you have explicit permission.
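
You can check a site's robots.txt programmatically with the standard library before you scrape. A minimal sketch (note that robots.txt describes crawler rules, not legal permission, so still read the site's terms):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt policy. Here the rules are supplied inline for
# illustration; in practice you would call
# rp.set_url("https://example.com/robots.txt") and then rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/listings"))   # True (allowed)
print(rp.can_fetch("*", "https://example.com/private/x"))  # False (disallowed)
```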

Prerequisites

  • Python installed
  • You know how to run a Python script
  • Basic familiarity with HTML (tags, classes, attributes)

Step-by-step instructions

1) Install the tools you’ll use

A common setup is requests for downloading pages and beautifulsoup4 for parsing HTML.

pip install requests beautifulsoup4 lxml

What to look for:

lxml is typically faster and more tolerant of malformed HTML than Python's built-in html.parser, which Beautiful Soup falls back on otherwise.
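
If lxml isn't installed, Beautiful Soup raises FeatureNotFound when you request it. One way to prefer lxml but degrade gracefully (a sketch):

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>Hello"

# Prefer lxml, but fall back to the stdlib parser if it's unavailable.
try:
    soup = BeautifulSoup(html, "lxml")
except FeatureNotFound:
    soup = BeautifulSoup(html, "html.parser")

print(soup.p.get_text())  # Hello
```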


2) Fetch the page HTML with requests

Start by downloading the page and checking the status code.

import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

html = response.text
print(html[:200])

What to look for:

raise_for_status() turns HTTP errors (like 404 or 403) into an exception you can handle. A timeout helps your script fail fast instead of hanging.


3) Parse HTML and select elements with Beautiful Soup

Create a BeautifulSoup object, then select the parts you want.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

title = soup.title.get_text(strip=True)
links = soup.select("a")  # all links

print(title)
print("links:", len(links))

What to look for:

select() uses CSS selectors. This often reads more clearly than manual tag searches once you get used to it.
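
A few common selector patterns, demonstrated on a small inline snippet (the class names here are made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="card"><a href="/a" class="title">First</a></div>
<div class="card featured"><a href="/b" class="title">Second</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select(".card")))              # by class -> 2
print(len(soup.select("div.card.featured")))  # tag with both classes -> 1
print(len(soup.select(".card > a[href]")))    # direct child with attribute -> 2
print(soup.select_one(".featured .title").get_text())  # Second
```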


4) Extract structured data from repeated elements

Loop over matching elements and build a list of dictionaries.

items = []

for card in soup.select(".card"):
    name_el = card.select_one(".name")
    price_el = card.select_one(".price")

    items.append(
        {
            "name": name_el.get_text(strip=True) if name_el else "",
            "price": price_el.get_text(strip=True) if price_el else "",
        }
    )

print("found:", len(items))

What to look for:

select_one() can return None if the page changes. Handle missing elements instead of assuming every field exists.


Examples you can copy

Example 1: Scrape all headline links from a news-like page

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()

soup = BeautifulSoup(r.text, "lxml")

results = []
for a in soup.select("h2 a"):
    text = a.get_text(strip=True)
    href = urljoin(url, a.get("href", ""))
    if text and href:
        results.append((text, href))

print(results[:5])

Example 2: Scrape a table into rows

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
r.raise_for_status()

soup = BeautifulSoup(r.text, "lxml")

rows = []
for tr in soup.select("table tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if cells:
        rows.append(cells)

print("rows:", len(rows))
print(rows[0] if rows else "no rows found")

Example 3: Follow pagination until no “Next” link

This pattern keeps your scraper simple and avoids guessing page counts.

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = "https://example.com/list"
headers = {"User-Agent": "Mozilla/5.0"}

url = start_url
all_items = []

while url:
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")

    for card in soup.select(".card"):
        title_el = card.select_one(".title")
        title = title_el.get_text(strip=True) if title_el else ""
        if title:
            all_items.append(title)

    next_el = soup.select_one("a.next")
    url = urljoin(url, next_el["href"]) if next_el and next_el.get("href") else None

    time.sleep(1)  # be polite

print("items:", len(all_items))

Example 4: Extract JSON data embedded in a script tag

Some pages render data into the HTML as JSON.

import json
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
r.raise_for_status()

soup = BeautifulSoup(r.text, "lxml")
script = soup.select_one("script#data")

data = {}
if script and script.string:
    data = json.loads(script.string)

print(type(data))

Example 5: Save scraped results to a CSV file

import csv

items = [
    {"name": "Item A", "price": "10"},
    {"name": "Item B", "price": "25"},
]

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(items)

print("saved:", len(items))

Common mistakes and how to fix them

Mistake 1: Scraping the wrong HTML because the page is rendered by JavaScript

What you might do

import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "lxml")
print(soup.select(".card"))

Why it breaks

requests downloads the initial HTML. If the real content loads later via JavaScript, your selector returns nothing.

Fix

Look for a public JSON endpoint in the Network tab of your browser dev tools, or use an official API if available. If neither exists, use a browser automation tool (like Playwright or Selenium) with permission from the site.
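
If you do find a JSON endpoint, the response usually parses directly, with no HTML involved. A sketch, where /api/items and the payload shape are hypothetical (inspect the real response in the Network tab first):

```python
import json

def extract_names(payload: str) -> list:
    """Pull item names out of a JSON payload shaped like
    {"items": [{"name": ...}, ...]}. This shape is an assumption;
    adjust the keys to match what the endpoint actually returns."""
    data = json.loads(payload)
    return [item.get("name", "") for item in data.get("items", [])]

# In practice you would fetch it, e.g.:
# r = requests.get("https://example.com/api/items", timeout=10)
# names = extract_names(r.text)

sample = '{"items": [{"name": "A"}, {"name": "B"}]}'
print(extract_names(sample))  # ['A', 'B']
```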


Mistake 2: Assuming every element exists

What you might do

name = card.select_one(".name").get_text(strip=True)

Why it breaks

A small page change can remove or rename .name, so select_one() returns None and your code crashes.

Fix

name_el = card.select_one(".name")
name = name_el.get_text(strip=True) if name_el else ""

Mistake 3: Forgetting to handle relative links

What you might do

href = a["href"]  # "/post/123"

Why it breaks

A relative URL like "/post/123" can't be requested on its own; it only makes sense combined with the page's base URL.

Fix

from urllib.parse import urljoin

href = urljoin(base_url, a.get("href", ""))

Troubleshooting

If you see ModuleNotFoundError, run pip install requests beautifulsoup4 lxml in the same environment you use to run Python.

If you get 403 Forbidden, the site may block automated requests. Check the site’s terms and robots policy, then slow down requests and send a basic User-Agent. If the site disallows scraping, stop.

If you get 429 Too Many Requests, add a delay (like time.sleep(1)) and reduce how often you hit the server.
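
A simple exponential backoff between retries often resolves 429s. A sketch (the retry count, base delay, and status list are arbitrary choices, not a standard):

```python
import time
import requests

def get_with_backoff(url, retries=3, base_delay=1.0, **kwargs):
    """Retry on 429 and common 5xx codes, doubling the wait each time."""
    for attempt in range(retries):
        r = requests.get(url, timeout=10, **kwargs)
        if r.status_code not in (429, 500, 502, 503):
            return r
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return r

# The delay schedule on its own:
delays = [1.0 * (2 ** attempt) for attempt in range(3)]
print(delays)  # [1.0, 2.0, 4.0]
```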

If your selectors return empty lists, print a small chunk of response.text and confirm the elements exist in the downloaded HTML.

If scraping stops early or repeats pages, print the “next page” URL each loop and confirm it changes.

If characters look broken, use encoding="utf-8" when writing files, and check response.apparent_encoding if the site uses a different encoding.
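
The decoding problem is easiest to see on raw bytes; setting r.encoding = r.apparent_encoding before reading r.text is the requests-side equivalent of picking the right codec here:

```python
# Bytes encoded as Latin-1, not UTF-8: the wrong codec garbles them.
raw = "café".encode("latin-1")

print(raw.decode("latin-1"))                  # café
print(raw.decode("utf-8", errors="replace"))  # caf� (wrong codec)

# With requests, override the guessed encoding before reading .text:
# r.encoding = r.apparent_encoding
# text = r.text
```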


Quick recap

  • Install requests and beautifulsoup4, then fetch HTML with requests.get().
  • Parse HTML with Beautiful Soup and select elements using CSS selectors.
  • Extract fields carefully and handle missing elements.
  • Fix relative links with urljoin().
  • Respect site rules, rate limits, and permission boundaries.