How to Web Scrape with Python
What you’ll build or solve
You’ll fetch a web page, parse its HTML, and extract the data you care about into a clean Python structure.
When this approach works best
Web scraping works well when you:
- Pull public data from a page that has no API, like a directory page or simple listings.
- Collect repeated page elements, like product cards, job posts, or article headlines.
- Monitor a page for changes, like “new items added” or “price changed.”
Avoid scraping when the site offers an API or export option. Also skip scraping pages behind logins, paywalls, or access controls unless you have explicit permission.
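Before scraping, it's also worth checking the site's robots.txt. Python ships with urllib.robotparser for this. The sketch below parses a made-up rule set inline (instead of fetching a real robots.txt) so you can see how the checks behave; for a live site you'd call rp.set_url(...) and rp.read() instead:

```python
# Minimal robots.txt check (the rules below are invented for illustration)
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# For a real site: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```

robots.txt is advisory, not a legal document, but respecting it is a good baseline for polite scraping.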
Prerequisites
- Python installed
- You know how to run a Python script
- Basic familiarity with HTML (tags, classes, attributes)
Step-by-step instructions
1) Install the tools you’ll use
A common setup is requests for downloading pages and beautifulsoup4 for parsing HTML.
pip install requests beautifulsoup4 lxml
What to look for:
lxml is typically faster than Python’s built-in html.parser and recovers more gracefully from malformed HTML.
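As a quick illustration of that forgiveness, here is a sketch that parses a deliberately malformed snippet (unclosed li tags) with lxml; the parser still recovers both list items:

```python
from bs4 import BeautifulSoup

# Malformed on purpose: the <li> tags are never closed
broken = "<ul><li>one<li>two</ul>"

soup = BeautifulSoup(broken, "lxml")
print([li.get_text() for li in soup.select("li")])  # ['one', 'two']
```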
2) Fetch the page HTML with requests
Start by downloading the page and checking the status code.
import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
html = response.text
print(html[:200])
What to look for:
raise_for_status() turns HTTP errors (like 404 or 403) into an exception you can handle. A timeout helps your script fail fast instead of hanging.
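In a longer script you usually want to catch those exceptions rather than crash. One common pattern, sketched here as a small helper (the function name fetch_html is our own, not part of requests), is to catch requests.exceptions.RequestException, which covers timeouts, connection errors, and the HTTP errors raised by raise_for_status():

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails."""
    try:
        response = requests.get(
            url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout
        )
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as err:
        # Covers timeouts, DNS/connection errors, and HTTP error statuses
        print("request failed:", err)
        return None

# An unreachable hostname fails cleanly instead of crashing the script
print(fetch_html("http://nonexistent.invalid/", timeout=2))  # None
```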
3) Parse HTML and select elements with Beautiful Soup
Create a BeautifulSoup object, then select the parts you want.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
title = soup.title.get_text(strip=True)
links = soup.select("a")  # all links
print(title)
print("links:", len(links))
What to look for:
select() uses CSS selectors. This often reads more clearly than manual tag searches once you get used to it.
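A few selector patterns cover most scraping work: classes (.card), ids (#intro), attributes (div[data-id="2"]), and descendant chains. The sketch below runs them against a small invented HTML snippet, using the stdlib html.parser so it needs no extra installs:

```python
from bs4 import BeautifulSoup

# A tiny made-up document to demonstrate common selector patterns
html = """
<div class="card" data-id="1"><a href="/a">First</a></div>
<div class="card" data-id="2"><a href="/b">Second</a></div>
<p id="intro">Hello</p>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select(".card")))                       # by class -> 2
print(soup.select_one("#intro").get_text())            # by id -> Hello
print(soup.select_one('div[data-id="2"] a')["href"])   # attribute + descendant -> /b
```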
4) Extract structured data from repeated elements
Loop over matching elements and build a list of dictionaries.
items = []
for card in soup.select(".card"):
    name_el = card.select_one(".name")
    price_el = card.select_one(".price")
    items.append(
        {
            "name": name_el.get_text(strip=True) if name_el else "",
            "price": price_el.get_text(strip=True) if price_el else "",
        }
    )
print("found:", len(items))
What to look for:
select_one() can return None if the page changes. Handle missing elements instead of assuming every field exists.
Examples you can copy
Example 1: Scrape all headline links from a news-like page
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()
soup = BeautifulSoup(r.text, "lxml")

results = []
for a in soup.select("h2 a"):
    text = a.get_text(strip=True)
    href = urljoin(url, a.get("href", ""))
    if text and href:
        results.append((text, href))
print(results[:5])
Example 2: Scrape a table into rows
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
r.raise_for_status()
soup = BeautifulSoup(r.text, "lxml")

rows = []
for tr in soup.select("table tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if cells:
        rows.append(cells)
print("rows:", len(rows))
print(rows[0] if rows else "no rows found")
Example 3: Follow pagination until no “Next” link
This pattern keeps your scraper simple and avoids guessing page counts.
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = "https://example.com/list"
headers = {"User-Agent": "Mozilla/5.0"}
url = start_url
all_items = []
while url:
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    for card in soup.select(".card"):
        title_el = card.select_one(".title")
        title = title_el.get_text(strip=True) if title_el else ""
        if title:
            all_items.append(title)
    next_el = soup.select_one("a.next")
    url = urljoin(url, next_el["href"]) if next_el and next_el.get("href") else None
    time.sleep(1)  # be polite
print("items:", len(all_items))
Example 4: Extract JSON data embedded in a script tag
Some pages render data into the HTML as JSON.
import json
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
r.raise_for_status()
soup = BeautifulSoup(r.text, "lxml")

script = soup.select_one("script#data")
data = {}
if script and script.string:
    data = json.loads(script.string)
print(type(data))
Example 5: Save scraped results to a CSV file
import csv

items = [
    {"name": "Item A", "price": "10"},
    {"name": "Item B", "price": "25"},
]
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(items)
print("saved:", len(items))
Common mistakes and how to fix them
Mistake 1: Scraping the wrong HTML because the page is rendered by JavaScript
What you might do
import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "lxml")
print(soup.select(".card"))
Why it breaks
requests downloads the initial HTML. If the real content loads later via JavaScript, your selector returns nothing.
Fix
Look for a public JSON endpoint in the Network tab of your browser dev tools, or use an official API if available. If neither exists, use a browser automation tool (like Playwright or Selenium) with permission from the site.
Mistake 2: Assuming every element exists
What you might do
name = card.select_one(".name").get_text(strip=True)
Why it breaks
A small page change can remove or rename .name, so select_one() returns None and your code crashes.
Fix
name_el = card.select_one(".name")
name = name_el.get_text(strip=True) if name_el else ""
Mistake 3: Forgetting to handle relative links
What you might do
href = a["href"]  # "/post/123"
Why it breaks
A relative URL like /post/123 can’t be requested on its own; it needs the page’s base URL.
Fix
from urllib.parse import urljoin

href = urljoin(base_url, a.get("href", ""))
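To see what urljoin actually does, here is a quick sketch with a made-up base URL. Root-relative paths replace the whole path, plain filenames resolve against the current directory, and absolute URLs pass through unchanged:

```python
from urllib.parse import urljoin

base_url = "https://example.com/posts/index.html"  # hypothetical page URL

print(urljoin(base_url, "/post/123"))         # https://example.com/post/123
print(urljoin(base_url, "page2.html"))        # https://example.com/posts/page2.html
print(urljoin(base_url, "https://other.com/x"))  # absolute URLs pass through
```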
Troubleshooting
If you see ModuleNotFoundError, run pip install requests beautifulsoup4 lxml in the same environment you use to run Python.
If you get 403 Forbidden, the site may block automated requests. Check the site’s terms and robots policy, then slow down requests and send a basic User-Agent. If the site disallows scraping, stop.
If you get 429 Too Many Requests, add a delay (like time.sleep(1)) and reduce how often you hit the server.
If your selectors return empty lists, print a small chunk of response.text and confirm the elements exist in the downloaded HTML.
If scraping stops early or repeats pages, print the “next page” URL each loop and confirm it changes.
If characters look broken, use encoding="utf-8" when writing files, and check response.apparent_encoding if the site uses a different encoding.
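For the 429 case above, a retry loop with backoff is usually enough. This is a sketch, not a definitive implementation: the function name, retry count, and delays are arbitrary, and it also honors the Retry-After header when the server sends one:

```python
import time
import requests

def get_with_backoff(url, retries=3, base_delay=1.0):
    """Retry on 429 responses, backing off exponentially between attempts."""
    response = None
    for attempt in range(retries):
        response = requests.get(
            url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10
        )
        if response.status_code != 429:
            return response
        # Honor Retry-After if present, otherwise back off exponentially
        delay = float(response.headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(delay)
    return response  # still 429 after all retries; let the caller decide
```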
Quick recap
- Install requests and beautifulsoup4, then fetch HTML with requests.get().
- Parse HTML with Beautiful Soup and select elements using CSS selectors.
- Extract fields carefully and handle missing elements.
- Fix relative links with urljoin().
- Respect site rules, rate limits, and permission boundaries.