This guide explains how to use HTTPX, a powerful Python HTTP client, for web scraping:
- What Is HTTPX?
- Scraping with HTTPX: Step-By-Step Guide
- HTTPX Web Scraping Advanced Features and Techniques
- HTTPX vs Requests for Web Scraping
- Conclusion
HTTPX is a fully featured HTTP client for Python 3, built on top of the httpcore library. It is designed to deliver reliable performance even under heavy multithreading, and it offers both synchronous and asynchronous APIs with support for the HTTP/1.1 and HTTP/2 protocols.
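For example, HTTP/2 can be enabled on a client instance once the optional http2 extra is installed (pip install httpx[http2]). The snippet below is a minimal sketch that checks which protocol version the server negotiated:
import httpx

# Requires the optional HTTP/2 extra: pip install httpx[http2]
with httpx.Client(http2=True) as client:
    response = client.get("https://httpbin.io/anything")
    # Print the negotiated protocol, e.g. "HTTP/2" if the server supports it
    print(response.http_version)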
Features
- Requests-Compatible API: Exposes a broadly requests-compatible interface, which makes migration straightforward.
- Sync and Async Support: Provides both a standard synchronous API and an asynchronous client.
- HTTP/1.1 and HTTP/2: Supports both protocols, with HTTP/2 available via the optional httpx[http2] extra.
- Strict Timeouts: Applies timeouts everywhere by default, with fine-grained configuration options.
- Connection Pooling: Reuses TCP connections when requests go through a client instance.
- Advanced Features: Includes support for proxies, custom HTTP headers, configurable timeouts, basic authentication, and more.
Pros
- Command-Line Availability: Accessible via the httpx[cli] extra.
- Feature-Rich: Includes support for HTTP/2 and an asynchronous API.
- Actively Developed: Continuously improved with regular updates.
Cons
- Frequent Updates: New releases may introduce breaking changes.
- Less Popular: Not as widely used as the requests library.
HTTPX is an HTTP client, so to parse and extract data from the HTML it retrieves, you will need an HTML parser like BeautifulSoup.
Warning:
While HTTPX is only used in the early stages of the process, we will walk you through a complete workflow. If you are interested in more advanced HTTPX scraping techniques, you can skip ahead to the next chapter after Step 3.
Install Python 3+ on your machine and create a directory for your HTTPX scraping project:
mkdir httpx-scraper
Navigate into it and initialize a virtual environment:
cd httpx-scraper
python -m venv env
Open the project folder in your Python IDE, create a scraper.py file inside the folder, then activate the virtual environment. On Linux or macOS, run:
source ./env/bin/activate
On Windows:
env\Scripts\activate
Install HTTPX and BeautifulSoup:
pip install httpx beautifulsoup4
Import the added dependencies into your scraper.py
script:
import httpx
from bs4 import BeautifulSoup
In this example, the target page will be the “Quotes to Scrape” site (http://quotes.toscrape.com), a sandbox built for practicing web scraping.
Use HTTPX to retrieve the HTML of the homepage with the get()
method:
# Make an HTTP GET request to the target page
response = httpx.get("http://quotes.toscrape.com")
Behind the scenes, HTTPX will perform an HTTP GET request to the server, which will respond with the page's HTML. You can access the HTML content using the response.text
attribute:
html = response.text
print(html)
This will print the raw HTML content:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<!-- omitted for brevity... -->
</body>
</html>
Feed the HTML content to the BeautifulSoup constructor to parse it:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
The soup
variable now holds the parsed HTML and exposes the methods to extract the data.
Scrape quotes data from the page:
# Where to store the scraped data
quotes = []

# Extract all quotes from the page
quote_elements = soup.find_all("div", class_="quote")

# Loop through quotes and extract text, author, and tags
for quote_element in quote_elements:
    text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
    author = quote_element.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

    # Store the scraped data
    quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })
This snippet defines a list named quotes
to store the scraped data. It then selects all quote HTML elements and loops through them to extract the quote text, author, and tags. Each extracted quote is stored as a dictionary within the quotes
list, organizing the data for further use or export.
Export the scraped data to a CSV file:
# Specify the file name for export
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
    # Write the header row
    writer.writeheader()
    # Write the scraped quotes data
    writer.writerows(quotes)
This snippet opens a file named quotes.csv in write mode, defines the column headers (text, author, tags), writes the header row to the file, and then writes each dictionary from the quotes list to the CSV file. The csv.DictWriter handles the formatting, making it easy to store structured data.
Import csv
from the Python Standard Library:
import csv
The final HTTPX web scraping script will contain the following code:
import httpx
from bs4 import BeautifulSoup
import csv

# Make an HTTP GET request to the target page
response = httpx.get("http://quotes.toscrape.com")

# Access the HTML of the target page
html = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Where to store the scraped data
quotes = []

# Extract all quotes from the page
quote_elements = soup.find_all("div", class_="quote")

# Loop through quotes and extract text, author, and tags
for quote_element in quote_elements:
    text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
    author = quote_element.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

    # Store the scraped data
    quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })

# Specify the file name for export
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
    # Write the header row
    writer.writeheader()
    # Write the scraped quotes data
    writer.writerows(quotes)
Execute it with:
python scraper.py
Or, on Linux/macOS:
python3 scraper.py
A quotes.csv
file will appear in the root folder of your project with the following contents:
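The exact rows depend on the live page, but the file should look roughly like the sample below (the tags column contains Python list literals, since the script writes each tags list as-is):
text,author,tags
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.",Albert Einstein,"['change', 'deep-thoughts', 'thinking', 'world']"
"It is our choices, Harry, that show what we truly are, far more than our abilities.",J.K. Rowling,"['abilities', 'choices']"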
Let's use a more complex example. The target site will be the HTTPBin.io /anything
endpoint. This is a special API that returns the IP address, headers, and other information sent by the caller.
HTTPX allows you to specify custom headers using the headers
argument:
import httpx
# Custom headers for the request
headers = {
    "accept": "application/json",
    "accept-language": "en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3"
}
# Make a GET request with custom headers
response = httpx.get("https://httpbin.io/anything", headers=headers)
# Handle the response...
The User-Agent is one of the most important HTTP headers for web scraping. By default, HTTPX uses the following User-Agent:
python-httpx/<VERSION>
This value can easily indicate that your requests are automated, potentially resulting in the target site blocking them.
To avoid that, you can set a custom User-Agent
to mimic a real browser, like so:
import httpx
# Define a custom User-Agent
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"
}
# Make a GET request with the custom User-Agent
response = httpx.get("https://httpbin.io/anything", headers=headers)
# Handle the response...
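Since httpbin.io echoes the request back, you can confirm that the custom User-Agent actually reached the server by extending the snippet above. This assumes the endpoint's JSON response exposes the received headers under a headers key:
# Print the headers echoed by the server; the Chrome User-Agent should appear among them
print(response.json()["headers"])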
Set cookies in HTTPX using the cookies
argument:
import httpx
# Define cookies as a dictionary
cookies = {
    "session_id": "3126hdsab161hdabg47adgb",
    "user_preferences": "dark_mode=true"
}
# Make a GET request with custom cookies
response = httpx.get("https://httpbin.io/anything", cookies=cookies)
# Handle the response...
This gives you the ability to include session data required for your web scraping requests.
Now route your HTTPX requests through a proxy to protect your identity and avoid IP bans while performing web scraping. To do that, use the proxy
argument:
import httpx
# Replace with the URL of your proxy server
proxy = "<YOUR_PROXY_URL>"
# Make a GET request through a proxy server
response = httpx.get("https://httpbin.io/anything", proxy=proxy)
# Handle the response...
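If your proxy requires authentication, the credentials can typically be embedded directly in the proxy URL. The sketch below uses hypothetical placeholder values for the host and credentials:
import httpx

# Hypothetical authenticated proxy URL: replace the placeholders with real values
proxy = "http://<YOUR_USERNAME>:<YOUR_PASSWORD>@proxy.example.com:8080"

# Route the request through the authenticated proxy
response = httpx.get("https://httpbin.io/anything", proxy=proxy)
# The "origin" field of the echoed JSON should now report the proxy's IP address
print(response.json()["origin"])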
By default, HTTPX raises errors only for connection or network issues. To raise exceptions also for HTTP responses with 4xx
and 5xx
status codes, use the raise_for_status()
method as below:
import httpx

try:
    response = httpx.get("https://httpbin.io/anything")
    # Raise an exception for 4xx and 5xx responses
    response.raise_for_status()
    # Handle the response...
except httpx.HTTPStatusError as e:
    # Handle HTTP status errors
    print(f"HTTP error occurred: {e}")
except httpx.RequestError as e:
    # Handle connection or network errors
    print(f"Request error occurred: {e}")
When using the top-level API in HTTPX, a new connection is created for each request, meaning TCP connections are not reused. This approach becomes inefficient as the number of requests to a host increases.
Meanwhile, using a httpx.Client
instance enables HTTP connection pooling. This means that multiple requests to the same host can reuse an existing TCP connection instead of creating a new one for each request.
Here are some of the benefits of using a Client
over the top-level API:
- Reduced latency across requests, because there is no repeated handshaking
- Lower CPU usage and fewer round-trips
- Decreased network traffic
Additionally, Client
instances support session handling with features unavailable in the top-level API, including:
- Cookie persistence across requests
- Applying configuration across all outgoing requests
- Sending requests through HTTP proxies
It is typically recommended to use a Client
in HTTPX with a context manager (with
statement):
import httpx
with httpx.Client() as client:
    # Make an HTTP request using the client
    response = client.get("https://httpbin.io/anything")

    # Extract the JSON response data and print it
    response_data = response.json()
    print(response_data)
Alternatively, you can manually manage the client and close the connection pool explicitly with client.close()
:
import httpx
client = httpx.Client()
try:
    # Make an HTTP request using the client
    response = client.get("https://httpbin.io/anything")

    # Extract the JSON response data and print it
    response_data = response.json()
    print(response_data)
except:
    # Handle the error...
    pass
finally:
    # Close the client connections and release resources
    client.close()
Note:
If you are familiar with the requests library, httpx.Client() serves a similar purpose to requests.Session().
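To put those session features into practice, you can define the shared configuration once when creating the Client. Below is a minimal sketch; the custom header value and timeout are arbitrary choices:
import httpx

# Configuration set here applies to every request made with the client
with httpx.Client(
    base_url="https://httpbin.io",
    headers={"user-agent": "my-scraper/1.0"},
    timeout=10.0,
) as client:
    # Both requests reuse the same connection pool, default headers, and timeout,
    # and any cookies set by earlier responses are sent automatically
    response_1 = client.get("/anything")
    response_2 = client.get("/anything")
    print(response_1.json()["headers"])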
By default, HTTPX exposes a standard synchronous API, but it also offers an asynchronous client for when you need one. If you are working with asyncio, using an async client is essential for sending outgoing HTTP requests efficiently.
To make asynchronous requests in HTTPX, initialize AsyncClient
and use it to make a GET request as shown below:
import httpx
import asyncio
async def fetch(url):
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.text

async def main():
    urls = ["https://httpbin.io/anything"] * 5
    responses = await asyncio.gather(*(fetch(url) for url in urls))
    for response in responses:
        print(response)

asyncio.run(main())
The with statement ensures the client is automatically closed when the block ends. Alternatively, if you manage the client manually, you can close it explicitly with await client.aclose().
All HTTPX request methods (get()
, post()
, etc.) are asynchronous when using an AsyncClient
. Therefore, you must add await
before calling them to get a response.
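For example, a POST request with a JSON payload follows the same pattern. The payload below is just a placeholder:
import asyncio
import httpx

async def send_data():
    async with httpx.AsyncClient() as client:
        # await is required for every request method on AsyncClient
        response = await client.post(
            "https://httpbin.io/anything",
            json={"query": "web scraping"}  # placeholder payload
        )
        return response.json()

print(asyncio.run(send_data()))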
Network instability during web scraping may result in connection failures or timeouts. HTTPX simplifies handling such issues via its HTTPTransport
interface. This mechanism retries requests when an httpx.ConnectError
or httpx.ConnectTimeout
occurs.
The following code demonstrates how to configure a transport to retry requests up to 3 times:
import httpx
# Configure transport with retry capability on connection errors or timeouts
transport = httpx.HTTPTransport(retries=3)
# Use the transport with an HTTPX client
with httpx.Client(transport=transport) as client:
    # Make a GET request
    response = client.get("https://httpbin.io/anything")
    # Handle the response...
Only connection-related errors trigger a retry. To handle read/write errors or specific HTTP status codes, you need to implement custom retry logic with libraries like tenacity
.
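As an illustration, here is a hedged sketch using tenacity (installed with pip install tenacity) that retries on read timeouts and on 4xx/5xx responses. The retry policy values are arbitrary:
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry up to 3 times with exponential backoff on read timeouts or HTTP status errors
@retry(
    retry=retry_if_exception_type((httpx.ReadTimeout, httpx.HTTPStatusError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10)
)
def fetch(url):
    response = httpx.get(url, timeout=10.0)
    # Turn 4xx and 5xx responses into exceptions so they trigger a retry
    response.raise_for_status()
    return response

response = fetch("https://httpbin.io/anything")
print(response.status_code)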
The following table compares HTTPX and Requests for web scraping:
Feature | HTTPX | Requests |
---|---|---|
GitHub stars | 8k | 52.4k |
Async support | ✔️ | ❌ |
Connection pooling | ✔️ | ✔️ |
HTTP/2 support | ✔️ | ❌ |
User-agent customization | ✔️ | ✔️ |
Proxy support | ✔️ | ✔️ |
Cookie handling | ✔️ | ✔️ |
Timeouts | Customizable for connection and read | Customizable for connection and read |
Retry mechanism | Available via transports | Available via HTTPAdapters |
Performance | High | Medium |
Community support and popularity | Growing | Large |
Automated HTTP requests expose your public IP address, potentially revealing your identity and location, which compromises your privacy. To enhance your security and privacy, use a proxy server to hide your IP address.
Bright Data controls the best proxy servers in the world, serving Fortune 500 companies and more than 20,000 customers. Its offer includes a wide range of proxy types:
- Datacenter proxies – Over 770,000 datacenter IPs.
- Residential proxies – Over 72M residential IPs in more than 195 countries.
- ISP proxies – Over 700,000 ISP IPs.
- Mobile proxies – Over 7M mobile IPs.
Create a free Bright Data account today to test our scraping solutions and proxies!