
What is Web Scraping? A Simple Explanation for Developers


Web scraping is writing code that reads a website and extracts data from it automatically.

Instead of manually copying information from a webpage, you write a script that does it for you. It visits the page, reads the HTML, and pulls out the parts you need.

How it works

  1. Your script sends an HTTP request to a URL (just like your browser does)
  2. The server returns HTML
  3. Your script parses the HTML and extracts the data you want
  4. You save it, process it, or do something useful with it

A simple example in Python

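import re
import urllib.request

# Fetch a page (the context manager closes the connection when done)
url = "https://example.com"
with urllib.request.urlopen(url, timeout=10) as response:
    html = response.read().decode("utf-8", errors="replace")

# Extract the title; re.search returns None when nothing matches
match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else None
print(title)  # "Example Domain"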
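
A regular expression is fine for a one-off like this, but it gets fragile once tags carry attributes, change case, or span several lines. As a minimal sketch, assuming you want to stay inside the standard library, the same extraction can be done with html.parser. The names TitleParser and extract_title are illustrative, not part of any library, and the sample string stands in for a fetched page.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TITLE> matches too
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.title


sample = '<html><head><TITLE lang="en">\n  Example Domain\n</TITLE></head></html>'
print(extract_title(sample))  # Example Domain
```

The parser tolerates the attribute, the uppercase tag, and the line breaks without any changes to the matching logic.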

For more complex scraping, most people use libraries like BeautifulSoup or Scrapy:

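import requests
from bs4 import BeautifulSoup

# Without a timeout, requests can wait indefinitely on a slow server
page = requests.get("https://example.com", timeout=10)
page.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(page.content, "html.parser")

# Find all links on the page
for link in soup.find_all("a"):
    print(link.get("href"))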
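
The href values come back exactly as written in the HTML, so they can be relative paths, bare fragments, non-HTTP schemes, or None when an anchor has no href. The sketch below covers the "extract" and "save" steps for links, assuming you want absolute, de-duplicated URLs written out as CSV. The helpers normalize_links and write_links and the raw list are hypothetical names and sample data.

```python
import csv
import io
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url, hrefs):
    """Turn raw href values into absolute, de-duplicated http(s) URLs."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip missing hrefs and same-page anchors such as "#top"
        if not href or href.startswith("#"):
            continue
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if not absolute.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, tel:, and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def write_links(fileobj, links):
    """Write one URL per row, with a header, to any text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["url"])
    writer.writerows([link] for link in links)


# Sample values of the kind link.get("href") can return
raw = [
    "/about",
    "https://www.iana.org/domains/example",
    None,
    "#top",
    "mailto:hi@example.com",
    "/about#team",
]
links = normalize_links("https://example.com/", raw)
print(links)
# ['https://example.com/about', 'https://www.iana.org/domains/example']

buffer = io.StringIO()  # in-memory stand-in for a real file
write_links(buffer, links)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("links.csv", "w", newline="", encoding="utf-8") in place of the StringIO buffer.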

Scraping vs APIs

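|             | Web Scraping             | API                |
|-------------|--------------------------|--------------------|
| Data source | HTML pages               | Structured JSON    |
| Reliability | Breaks when site changes | Stable, versioned  |
| Speed       | Slower                   | Faster             |
| Permission  | Gray area                | Explicitly allowed |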

Always prefer an API when one exists. Scraping is for when there's no API available.
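
To make the reliability difference concrete, here is a small sketch that reads the same price from a JSON payload and from HTML. Both response bodies are invented for illustration; no real API or site is implied.

```python
import json
import re

# Hypothetical responses for the same product
api_body = '{"name": "Widget", "price": 19.99, "currency": "USD"}'
html_body = '<div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>'

# API: the structure is part of the contract, so parsing is a single call
product = json.loads(api_body)
api_price = product["price"]

# Scraping: the value has to be dug out of markup that can change at any time
PRICE_RE = re.compile(r'<span class="price">\$([\d.]+)</span>')
match = PRICE_RE.search(html_body)
scraped_price = float(match.group(1)) if match else None

print(api_price, scraped_price)  # 19.99 19.99

# A cosmetic redesign (renaming a CSS class) silently breaks the scraper
redesigned = html_body.replace('class="price"', 'class="amount"')
print(PRICE_RE.search(redesigned))  # None
```

The JSON path would be unaffected by the redesign, which is the practical meaning of "stable, versioned" in the comparison above.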

When is scraping okay?

  • ✅ Public data that anyone can see in a browser
  • ✅ The site's robots.txt doesn't block it
  • ✅ You're not hammering the server with requests
  • ✅ You're not bypassing login walls or paywalls
  • ❌ Don't scrape personal data (GDPR, privacy laws)
  • ❌ Don't ignore rate limits or robots.txt
  • ❌ Don't resell scraped content as your own

Always check the site's terms of service.
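
Two of the checks above, honoring robots.txt and not hammering the server, can be automated. The following is a minimal sketch using the standard library's urllib.robotparser. The user agent name, the sample rules, and the helper polite_fetch_all are assumptions for illustration. Passing a robots.txt check does not replace reading the terms of service.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-scraper"  # hypothetical bot name

# Sample rules, parsed offline for illustration. Against a live site you would
# call parser.set_url("https://example.com/robots.txt") and parser.read() instead.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

urls = [
    "https://example.com/products",
    "https://example.com/private/accounts",
]

# Keep only the URLs the rules allow for our user agent
allowed = [u for u in urls if parser.can_fetch(USER_AGENT, u)]
# Use the site's requested delay, or fall back to 1 second if none is given
delay = parser.crawl_delay(USER_AGENT) or 1

print(allowed)  # ['https://example.com/products']
print(delay)    # 2


def polite_fetch_all(urls, fetch, delay_seconds):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Here fetch is any callable that takes a URL, for example a small wrapper around urllib.request.urlopen or requests.get.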

Scraping JavaScript-rendered pages

Some sites load content with JavaScript (single-page apps). A plain HTTP request returns only the initial HTML, before any scripts run, so it won't see that content. For those, you need a headless browser:

  • Playwright: modern, drives Chromium, Firefox, and WebKit
  • Puppeteer: built primarily for Chrome/Chromium (recent versions also support Firefox)
  • Selenium: older, widely used
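
As a rough sketch of the headless-browser approach, the function below uses Playwright's synchronous Python API to load a page, let its scripts run, and return the resulting HTML. It assumes Playwright is installed (pip install playwright, then playwright install chromium). The function name fetch_rendered_html and the 15-second default timeout are illustrative choices.

```python
def fetch_rendered_html(url, timeout_ms=15000):
    """Return the page's HTML after JavaScript has run, using headless Chromium."""
    # Imported lazily so this module still loads where Playwright is not installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity has quieted down,
            # which is a common (not foolproof) signal that rendering is done
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

Calling fetch_rendered_html("https://example.com") returns a string that can be handed to the same parsing code used for ordinary HTTP responses. Launching a browser is much slower and heavier than a plain request, so it is worth checking first whether the data is already present in the initial HTML.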

Common use cases

  • Price monitoring: track product prices across stores
  • Job boards: aggregate listings from multiple sites
  • Research: collect data for analysis
  • Content monitoring: watch for changes on pages
  • Lead generation: extract business contact info from directories
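
Content monitoring is the simplest of these to sketch: store a fingerprint of the page and compare it on the next visit. The example below is one possible approach, assuming that whitespace-only differences should not count as a change. The function names and sample strings are hypothetical.

```python
import hashlib


def fingerprint(html):
    """Return a SHA-256 hex digest of the page with whitespace runs collapsed."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, html):
    """True when the page no longer matches the stored fingerprint."""
    return fingerprint(html) != previous_fingerprint


old = "<p>Price: $10</p>"
reformatted = "<p>Price:   $10</p>\n"  # same content, different whitespace
updated = "<p>Price: $12</p>"          # actual content change

baseline = fingerprint(old)
print(has_changed(baseline, reformatted))  # False
print(has_changed(baseline, updated))      # True
```

On real pages it usually helps to fingerprint only the extracted element of interest, since timestamps, ads, and session tokens elsewhere in the HTML change on every load.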