How to Web Scrape with Python

What you’ll build or solve

You’ll fetch a web page, parse its HTML, and extract the data you care about into a clean Python structure.

When this approach works best

Web scraping works well when you:

  • Pull public data from a page that has no API, like a directory page or simple listings.
  • Collect repeated page elements, like product cards, job posts, or article headlines.
  • Monitor a page for changes, like “new items added” or “price changed.”

Avoid scraping when the site offers an API or export option. Also skip scraping pages behind logins, paywalls, or access controls unless you have explicit permission.
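
You can check a site's robots.txt programmatically with the standard library before you scrape. A minimal sketch (note that robots.txt describes crawler rules, not legal permission, so still read the site's terms):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt policy. Here the rules are supplied inline for
# illustration; in practice you would call
# rp.set_url("https://example.com/robots.txt") and then rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/listings"))   # True (allowed)
print(rp.can_fetch("*", "https://example.com/private/x"))  # False (disallowed)
```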

Prerequisites

  • Python installed
  • You know how to run a Python script
  • Basic familiarity with HTML (tags, classes, attributes)

Step-by-step instructions

1) Install the tools you’ll use

A common setup is requests for downloading pages and beautifulsoup4 for parsing HTML.

pip install requests beautifulsoup4 lxml

What to look for:

lxml is typically faster and more tolerant of malformed HTML than Python's built-in html.parser, which Beautiful Soup falls back on otherwise.
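
If lxml isn't installed, Beautiful Soup raises FeatureNotFound when you request it. One way to prefer lxml but degrade gracefully (a sketch):

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>Hello"

# Prefer lxml, but fall back to the stdlib parser if it's unavailable.
try:
    soup = BeautifulSoup(html, "lxml")
except FeatureNotFound:
    soup = BeautifulSoup(html, "html.parser")

print(soup.p.get_text())  # Hello
```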


2) Fetch the page HTML with requests

Start by downloading the page and checking the status code.

import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

html = response.text
print(html[:200])

What to look for:

raise_for_status() turns HTTP errors (like 404 or 403) into an exception you can handle. A timeout helps your script fail fast instead of hanging.


3) Parse HTML and select elements with Beautiful Soup

Create a BeautifulSoup object, then select the parts you want.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

title = soup.title.get_text(strip=True)
links = soup.select("a")  # all links

print(title)
print("links:", len(links))

What to look for:

select() uses CSS selectors. This often reads more clearly than manual tag searches once you get used to it.
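
A few common selector patterns, demonstrated on a small inline snippet (the class names here are made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="card"><a href="/a" class="title">First</a></div>
<div class="card featured"><a href="/b" class="title">Second</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select(".card")))              # by class -> 2
print(len(soup.select("div.card.featured")))  # tag with both classes -> 1
print(len(soup.select(".card > a[href]")))    # direct child with attribute -> 2
print(soup.select_one(".featured .title").get_text())  # Second
```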


4) Extract structured data from repeated elements

Loop over matching elements and build a list of dictionaries.

items = []

for card in soup.select(".card"):
    name_el = card.select_one(".name")
    price_el = card.select_one(".price")

    items.append(
        {
            "name": name_el.get_text(strip=True) if name_el else "",
            "price": price_el.get_text(strip=True) if price_el else "",
        }
    )

print("found:", len(items))

What to look for:

select_one() can return None if the page changes. Handle missing elements instead of assuming every field exists.


Examples you can copy

Example 1: Scrape all headline links from a news-like page

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()

soup = BeautifulSoup(r.text, "lxml")

results = []
for a in soup.select("h2 a"):
    text = a.get_text(strip=True)
    href = urljoin(url, a.get("href", ""))
    if text and href:
        results.append((text, href))

print(results[:5])

Example 2: Scrape a table into rows

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
r.raise_for_status()

soup = BeautifulSoup(r.text, "lxml")

rows = []
for tr in soup.select("table tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if cells:
        rows.append(cells)

print("rows:", len(rows))
print(rows[0] if rows else "no rows found")

Example 3: Follow pagination until no “Next” link

This pattern keeps your scraper simple and avoids guessing page counts.

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = "https://example.com/list"
headers = {"User-Agent": "Mozilla/5.0"}

url = start_url
all_items = []

while url:
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")

    for card in soup.select(".card"):
        title_el = card.select_one(".title")
        title = title_el.get_text(strip=True) if title_el else ""
        if title:
            all_items.append(title)

    next_el = soup.select_one("a.next")
    url = urljoin(url, next_el["href"]) if next_el and next_el.get("href") else None

    time.sleep(1)  # be polite

print("items:", len(all_items))

Example 4: Extract JSON data embedded in a script tag

Some pages render data into the HTML as JSON.

import json
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
r.raise_for_status()

soup = BeautifulSoup(r.text, "lxml")
script = soup.select_one("script#data")

data = {}
if script and script.string:
    data = json.loads(script.string)

print(type(data))

Example 5: Save scraped results to a CSV file

import csv

items = [
    {"name": "Item A", "price": "10"},
    {"name": "Item B", "price": "25"},
]

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(items)

print("saved:", len(items))

Common mistakes and how to fix them

Mistake 1: Scraping the wrong HTML because the page is rendered by JavaScript

What you might do

import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "lxml")
print(soup.select(".card"))

Why it breaks

requests downloads the initial HTML. If the real content loads later via JavaScript, your selector returns nothing.

Fix

Look for a public JSON endpoint in the Network tab of your browser dev tools, or use an official API if available. If neither exists, use a browser automation tool (like Playwright or Selenium) with permission from the site.
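
If you do find a JSON endpoint, the response usually parses directly, with no HTML involved. A sketch, where /api/items and the payload shape are hypothetical (inspect the real response in the Network tab first):

```python
import json

def extract_names(payload: str) -> list:
    """Pull item names out of a JSON payload shaped like
    {"items": [{"name": ...}, ...]}. This shape is an assumption;
    adjust the keys to match what the endpoint actually returns."""
    data = json.loads(payload)
    return [item.get("name", "") for item in data.get("items", [])]

# In practice you would fetch it, e.g.:
# r = requests.get("https://example.com/api/items", timeout=10)
# names = extract_names(r.text)

sample = '{"items": [{"name": "A"}, {"name": "B"}]}'
print(extract_names(sample))  # ['A', 'B']
```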


Mistake 2: Assuming every element exists

What you might do

name = card.select_one(".name").get_text(strip=True)

Why it breaks

A small page change can remove or rename .name, so select_one() returns None and your code crashes.

Fix

name_el = card.select_one(".name")
name = name_el.get_text(strip=True) if name_el else ""

Mistake 3: Forgetting to handle relative links

What you might do

href = a["href"]  # "/post/123"

Why it breaks

A relative URL like "/post/123" can't be requested on its own; it only makes sense combined with the page's base URL.

Fix

from urllib.parse import urljoin

href = urljoin(base_url, a.get("href", ""))

Troubleshooting

If you see ModuleNotFoundError, run pip install requests beautifulsoup4 lxml in the same environment you use to run Python.

If you get 403 Forbidden, the site may block automated requests. Check the site’s terms and robots policy, then slow down requests and send a basic User-Agent. If the site disallows scraping, stop.

If you get 429 Too Many Requests, add a delay (like time.sleep(1)) and reduce how often you hit the server.
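
A simple exponential backoff between retries often resolves 429s. A sketch (the retry count, base delay, and status list are arbitrary choices, not a standard):

```python
import time
import requests

def get_with_backoff(url, retries=3, base_delay=1.0, **kwargs):
    """Retry on 429 and common 5xx codes, doubling the wait each time."""
    for attempt in range(retries):
        r = requests.get(url, timeout=10, **kwargs)
        if r.status_code not in (429, 500, 502, 503):
            return r
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return r

# The delay schedule on its own:
delays = [1.0 * (2 ** attempt) for attempt in range(3)]
print(delays)  # [1.0, 2.0, 4.0]
```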

If your selectors return empty lists, print a small chunk of response.text and confirm the elements exist in the downloaded HTML.

If scraping stops early or repeats pages, print the “next page” URL each loop and confirm it changes.

If characters look broken, use encoding="utf-8" when writing files, and check response.apparent_encoding if the site uses a different encoding.
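
The decoding problem is easiest to see on raw bytes; setting r.encoding = r.apparent_encoding before reading r.text is the requests-side equivalent of picking the right codec here:

```python
# Bytes encoded as Latin-1, not UTF-8: the wrong codec garbles them.
raw = "café".encode("latin-1")

print(raw.decode("latin-1"))                  # café
print(raw.decode("utf-8", errors="replace"))  # caf� (wrong codec)

# With requests, override the guessed encoding before reading .text:
# r.encoding = r.apparent_encoding
# text = r.text
```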


Quick recap

  • Install requests and beautifulsoup4, then fetch HTML with requests.get().
  • Parse HTML with Beautiful Soup and select elements using CSS selectors.
  • Extract fields carefully and handle missing elements.
  • Fix relative links with urljoin().
  • Respect site rules, rate limits, and permission boundaries.