Web scraping is writing code that reads a website and extracts data from it automatically.
Instead of manually copying information from a webpage, you write a script that does it for you. It visits the page, reads the HTML, and pulls out the parts you need.
## How it works
- Your script sends an HTTP request to a URL (just like your browser does)
- The server returns HTML
- Your script parses the HTML and extracts the data you want
- You save it, process it, or do something useful with it
## A simple example in Python
```python
import urllib.request
import re

# Fetch a page
url = "https://example.com"
html = urllib.request.urlopen(url).read().decode()

# Extract the title
title = re.search(r'<title>(.*?)</title>', html).group(1)
print(title)  # "Example Domain"
```
For more complex scraping, most people use libraries like BeautifulSoup or Scrapy:
```python
from bs4 import BeautifulSoup
import requests

page = requests.get("https://example.com")
soup = BeautifulSoup(page.content, "html.parser")

# Find all links on the page
for link in soup.find_all("a"):
    print(link.get("href"))
```
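Beyond grabbing every link, BeautifulSoup's CSS-selector support lets you target specific elements. A minimal sketch, parsing an inline HTML snippet (the markup and class names here are invented for illustration):

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet standing in for a real page
html = """
<ul class="products">
  <li class="product"><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns all matching elements
for item in soup.select("li.product"):
    name = item.select_one(".name").get_text()
    price = item.select_one(".price").get_text()
    print(name, price)
```

The same `select()` call works on HTML fetched with `requests`; the selectors themselves depend entirely on the structure of the site you are scraping.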
## Scraping vs APIs
| | Web Scraping | API |
|---|---|---|
| Data source | HTML pages | Structured JSON |
| Reliability | Breaks when site changes | Stable, versioned |
| Speed | Slower | Faster |
| Permission | Gray area | Explicitly allowed |
Always prefer an API when one exists. Scraping is for when there's no API available.
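To see why structured JSON beats HTML parsing, compare handling an API-style response. The response body below is invented for illustration; real APIs document their own endpoints and fields:

```python
import json

# A made-up API response body; real endpoints return something similar
api_body = '{"products": [{"name": "Widget", "price": 9.99}]}'

data = json.loads(api_body)

# Fields are addressed directly -- no selectors, no regexes, no guessing
print(data["products"][0]["price"])
```

If the site redesigns its pages, this code keeps working; the scraping equivalent would break the moment a class name changes.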
## When is scraping okay?
- ✅ Public data that anyone can see in a browser
- ✅ The site's `robots.txt` doesn't block it
- ✅ You're not hammering the server with requests
- ✅ You're not bypassing login walls or paywalls
- ❌ Don't scrape personal data (GDPR, privacy laws)
- ❌ Don't ignore rate limits or `robots.txt`
- ❌ Don't resell scraped content as your own
Always check the site's terms of service.
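Python's standard library can check `robots.txt` rules for you. A minimal sketch using `urllib.robotparser`, with a polite pause between requests; the rules string here is a made-up example of what a site might publish:

```python
import time
import urllib.robotparser

# Rules as a site might publish them at /robots.txt (made-up example)
rules = """
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/products"))   # allowed
print(rp.can_fetch("*", "https://example.com/private/x"))  # disallowed

# Be polite: pause between requests instead of hammering the server
time.sleep(1)
```

Against a live site you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of parsing an inline string.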
## Scraping without a browser
Some sites load content with JavaScript (single-page apps). Regular HTTP requests won't see that content. For those, you need a headless browser:
- Playwright – modern, supports all browsers
- Puppeteer – Chrome/Chromium only
- Selenium – older, widely used
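A headless-browser fetch looks roughly like this Playwright sketch (a sketch, not a drop-in solution: it assumes `pip install playwright` plus `playwright install`, and the URL is just a placeholder):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    html = page.content()             # HTML *after* JavaScript has run
    print(page.title())
    browser.close()
```

The key difference from `requests.get` is that `page.content()` returns the DOM after scripts have executed, so content a single-page app renders client-side is actually there to parse.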
## Common use cases
- Price monitoring – track product prices across stores
- Job boards – aggregate listings from multiple sites
- Research – collect data for analysis
- Content monitoring – watch for changes on pages
- Lead generation – extract business contact info from directories
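The price-monitoring case, for example, boils down to re-scraping a page and comparing against the value you saved last time. A toy sketch using inline HTML (the markup, selectors, and prices are all invented for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML a product page might return today
html = '<div id="product"><span class="price">$19.99</span></div>'

def extract_price(page_html: str) -> float:
    """Pull the numeric price out of the (made-up) product markup."""
    soup = BeautifulSoup(page_html, "html.parser")
    text = soup.select_one("#product .price").get_text()
    return float(text.lstrip("$"))

last_seen = 24.99  # price recorded on the previous run
current = extract_price(html)
if current != last_seen:
    print(f"Price changed: {last_seen} -> {current}")
```

A real monitor would fetch the page on a schedule, persist `last_seen` somewhere, and respect the rate-limit and `robots.txt` guidance above.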