This guide explains how to use curl_cffi to enhance a web scraping script in Python by mimicking real browser TLS fingerprints.
- What Is `curl_cffi`?
- How It Works
- How to Use `curl_cffi` for Web Scraping
- `curl_cffi`: Advanced Usage
- `curl_cffi` vs Requests vs AIOHTTP vs HTTPX for Web Scraping
- `curl_cffi` Alternatives for Web Scraping
provides Python bindings for the curl-impersonate
fork via CFFI and thus can impersonate browser TLS/JA3/HTTP2 fingerprints. This helps bypassing anti-bot blocks based on TLS fingerprinting.
Here are some of its features:
- Support for JA3/TLS and HTTP/2 fingerprint impersonation, including recent browsers and custom fingerprints
- Much faster than `requests` and `httpx`, on par with `aiohttp`
- Mimics the `requests` API
- Support for `asyncio` to perform asynchronous HTTP requests
- Support for proxy rotation on each request
- Support for HTTP/2.0 and WebSocket
When you send an HTTPS request, a TLS handshake takes place, generating a unique TLS fingerprint. Because HTTP clients operate differently from web browsers, their fingerprints can reveal automation, potentially activating anti-bot defenses.
cURL Impersonate, which `curl_cffi` is based on, customizes cURL to replicate authentic browser TLS fingerprints:
- TLS library tweaks: Rely on the TLS libraries used by browsers instead of cURL's default.
- Configuration changes: Adjust TLS extensions and SSL options to mimic browsers.
- HTTP/2 customization: Match browser handshake settings.
- Non-default cURL flags: Set `--ciphers`, `--curves`, and custom headers for accuracy.
This makes the requests resemble those from a real browser, aiding in bypassing bot detection.
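To see the difference in practice, you can compare the JA3 fingerprint reported for a plain HTTP client with the one reported when impersonating Chrome. The sketch below is illustrative: it assumes the public BrowserLeaks TLS echo endpoint and that both `requests` and `curl_cffi` are installed:

# Illustrative sketch: compare the JA3 fingerprint of a plain HTTP client
# with the one produced while impersonating Chrome.
# The echo endpoint is an assumption -- any service that reports JA3 data works.
import requests as std_requests
from curl_cffi import requests as cffi_requests

URL = "https://tls.browserleaks.com/json"

# Fingerprint of the standard requests library
print(std_requests.get(URL).json().get("ja3_hash"))

# Fingerprint reported when impersonating Chrome via curl_cffi
print(cffi_requests.get(URL, impersonate="chrome").json().get("ja3_hash"))

The two hashes will differ, with the impersonated one matching what a real Chrome browser would produce.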
Let's try to scrape the “Keyboard” page from Walmart:
If you try to access this page using any HTTP client, you will receive the following error page:
You will get this bot detection page even if you set the `User-Agent` header to simulate a real browser, because of TLS fingerprinting. This is where `curl_cffi` comes in handy.
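For instance, a sketch like the following, which uses the standard `requests` library with a browser-like `User-Agent` header (the header value is just an example), still lands on the bot detection page because the client's TLS fingerprint gives it away:

# Illustrative sketch: a realistic User-Agent alone is not enough,
# since the TLS handshake still identifies a standard HTTP client
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
}
response = requests.get("https://www.walmart.com/search?q=keyboard", headers=headers)
# The response body will contain the "Robot or human?" challenge page
print(response.text)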
Make sure that you have Python 3+ installed on your machine. Then, create a directory for your `curl_cffi` scraping project:
mkdir curl-cffi-scraper
Navigate into that directory and set up a virtual environment:
cd curl-cffi-scraper
python -m venv env
Open the project folder in your preferred Python IDE and create a `scraper.py` file in that folder.
In your IDE’s terminal, activate the virtual environment. On Linux or macOS, use:
source ./env/bin/activate
On Windows, launch:
env\Scripts\activate
In an activated virtual environment, install the HTTP client:
pip install curl-cffi
Import `requests` from `curl_cffi`:
from curl_cffi import requests
The imported `requests` module exposes a high-level, Requests-like API. You can use it to perform a GET request to the target page:
response = requests.get("https://www.walmart.com/search?q=keyboard", impersonate="chrome")
The `impersonate="chrome"` argument tells `curl_cffi` to make the HTTP request look like it is coming from the latest version of Chrome supported by the library. This makes Walmart treat the automated request as a regular browser request and return the standard web page.
You can access the HTML content of the target page with:
html = response.text
If you print `html`, you will see:
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charSet="utf-8"/>
<meta property="fb:app_id" content="105223049547814"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1, interactive-widget=resizes-content"/>
<link rel="dns-prefetch" href="https://tap.walmart.com "/>
<link rel="preload" fetchpriority="high" crossorigin="anonymous" href="https://i5.walmartimages.com/dfw/63fd9f59-a78c/fcfae9b6-2f69-4f89-beed-f0eeb4237946/v1/BogleWeb_subset-Bold.woff2" as="font" type="font/woff2"/>
<link rel="preload" fetchpriority="high" crossorigin="anonymous" href="https://i5.walmartimages.com/dfw/63fd9f59-a78c/fcfae9b6-2f69-4f89-beed-f0eeb4237946/v1/BogleWeb_subset-Regular.woff2" as="font" type="font/woff2"/>
<link rel="preconnect" href="https://beacon.walmart.com"/>
<link rel="preconnect" href="https://b.wal.co"/>
<title>Electronics - Walmart.com</title>
<!-- omitted for brevity ... -->
To perform web scraping, you will also need a library for HTML parsing like BeautifulSoup:
pip install beautifulsoup4
Import it in `scraper.py`:
from bs4 import BeautifulSoup
Use it to parse the HTML of the page:
soup = BeautifulSoup(response.text, "html.parser")
"html.parser"
is the default HTML parser from Python’s standard library used by BeautifulSoup for parsing the HTML string. It contains methods to select HTML elements on the page and extract data from them.
The next example illustrates how to scrape just the page title. You can select it with the `find()` method and then access its text via the `text` attribute:
title_element = soup.find("title")
title = title_element.text
Print the page title:
print(title)
This is your final `curl_cffi` web scraping script:
from curl_cffi import requests
from bs4 import BeautifulSoup
# Send a GET request to the Walmart search page for "keyboard"
response = requests.get("https://www.walmart.com/search?q=keyboard", impersonate="chrome")
# Extract the HTML from the page
html = response.text
# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Find the title element
title_element = soup.find("title")
# Extract data from it
title = title_element.text
# More complex scraping logic...
# Print the scraped data
print(title)
Launch it:
python3 scraper.py
On Windows:
python scraper.py
The result will be:
Electronics - Walmart.com
If you remove the `impersonate="chrome"` argument, you will instead get:
Robot or human?
`curl_cffi` supports impersonating several browsers via unique labels that you can pass to the `impersonate` argument:
response = requests.get("<YOUR_URL>", impersonate="<BROWSER_LABEL>")
You can use the following labels:
- `chrome99`, `chrome100`, `chrome101`, `chrome104`, `chrome107`, `chrome110`, `chrome116`, `chrome119`, `chrome120`, `chrome123`, `chrome124`, `chrome131`
- `chrome99_android`, `chrome131_android`
- `edge99`, `edge101`
- `safari15_3`, `safari15_5`, `safari17_0`, `safari17_2_ios`, `safari18_0`, `safari18_0_ios`
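For example, to lock the fingerprint to a specific Chrome release rather than the newest one supported:

response = requests.get("<YOUR_URL>", impersonate="chrome124")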
Here are some recommendations:
- To always impersonate the latest browser versions, you can simply use `chrome`, `safari`, and `safari_ios`.
- Firefox is currently not available because it uses NSS, while the other browsers use BoringSSL, and curl can only be linked to one TLS library at a time.
- Browser versions are added only when their fingerprints change. If a version is skipped, you can still impersonate it by using the headers of the previous version.
- For non-browser targets, use the `ja3`, `akamai`, and similar arguments to specify your own custom TLS fingerprints (see the sketch after this list). Refer to the documentation on impersonation for details.
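Here is a minimal sketch of that last option. It assumes a `curl_cffi` version that accepts the `ja3` and `akamai` keyword arguments, and the fingerprint strings are placeholders you would replace with real values:

from curl_cffi import requests

# Placeholders: substitute the JA3 and Akamai HTTP/2 fingerprint strings
# you actually want to present
response = requests.get(
    "<YOUR_URL>",
    ja3="<YOUR_JA3_STRING>",
    akamai="<YOUR_AKAMAI_STRING>",
)
print(response.status_code)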
`curl_cffi` can use `Session` objects to persist certain parameters across multiple requests, such as cookies, headers, or other session-specific data.
Here is a code example:
# Create a new session
session = requests.Session()
# This endpoint tells the server to set a cookie, which the session stores
session.get("https://httpbin.org/cookies/set/userId/5", impersonate="chrome")
# Print the session's cookies to confirm they are being stored
print(session.cookies)
The output of the above script will be:
<Cookies[<Cookie userId=5 for httpbin.org />]>
The result proves that the session is maintaining state across requests, such as storing cookies defined by the server.
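As a follow-up sketch, a second request on the same session automatically sends the stored cookie back. httpbin's `/cookies` endpoint simply echoes the cookies it receives:

# The stored cookie is sent automatically on subsequent requests
response = session.get("https://httpbin.org/cookies", impersonate="chrome")
print(response.text)  # Should include "userId": "5"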
Just like the `requests` library, `curl_cffi` supports proxy integration through a `proxies` object:
# Define your proxy URL
proxy = "YOUR_PROXY_URL"
# Create a dictionary of proxies for HTTP and HTTPS
proxies = {"http": proxy, "https": proxy}
# Make a request using a proxy and browser impersonation
response = requests.get("<YOUR_URL>", impersonate="chrome", proxies=proxies)
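Because the proxy is applied per request, rotating proxies boils down to choosing a different proxy URL on each call. A minimal sketch, with placeholder proxy URLs you would replace with your own:

import random
from curl_cffi import requests

# Placeholder proxy URLs -- replace with your own proxies
proxy_pool = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

# Pick a different proxy for this request
proxy = random.choice(proxy_pool)
proxies = {"http": proxy, "https": proxy}
response = requests.get("<YOUR_URL>", impersonate="chrome", proxies=proxies)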
`curl_cffi` supports performing async requests through `asyncio` via the `AsyncSession` object:
from curl_cffi.requests import AsyncSession
import asyncio
# Define an async function to execute the asynchronous code
async def fetch_data():
    async with AsyncSession() as session:
        # Perform the asynchronous GET request
        response = await session.get("https://httpbin.org/anything", impersonate="chrome")
        # Print the response text
        print(response.text)
# Run the async function
asyncio.run(fetch_data())
Using `AsyncSession` makes it easier to handle multiple asynchronous requests efficiently.
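For example, you can fan out several requests concurrently with `asyncio.gather()`. The sketch below uses httpbin.org as a stand-in target:

from curl_cffi.requests import AsyncSession
import asyncio

async def fetch_all(urls):
    async with AsyncSession() as session:
        # Schedule all GET requests concurrently and wait for every response
        tasks = [session.get(url, impersonate="chrome") for url in urls]
        responses = await asyncio.gather(*tasks)
        for response in responses:
            print(response.status_code)

urls = [
    "https://httpbin.org/anything?page=1",
    "https://httpbin.org/anything?page=2",
    "https://httpbin.org/anything?page=3",
]
asyncio.run(fetch_all(urls))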
`curl_cffi` also supports WebSockets through the `WebSocket` class:
from curl_cffi.requests import WebSocket
# Define a callback function to handle incoming messages
def on_message(ws, message):
    print(message)
# Initialize the WebSocket connection with the callback
ws = WebSocket(on_message=on_message)
# Connect to a sample WebSocket server and listen for messages
ws.run_forever("wss://api.gemini.com/v1/marketdata/BTCUSD")
This is especially useful for scraping real-time data from sites or APIs that use WebSockets to populate data dynamically. Instead of scraping rendered pages, you can directly target the WebSocket channel for efficient data retrieval.
Note: You can use WebSockets asynchronously thanks to the `AsyncWebSocket` class.
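A minimal async sketch, assuming a `curl_cffi` version where `AsyncSession.ws_connect()` opens the connection and returns an `AsyncWebSocket` whose `recv()` coroutine yields incoming messages (check the docs of your installed version for the exact API):

from curl_cffi.requests import AsyncSession
import asyncio

async def listen():
    async with AsyncSession() as session:
        # Assumption: ws_connect() returns an AsyncWebSocket for this URL
        ws = await session.ws_connect("wss://api.gemini.com/v1/marketdata/BTCUSD")
        # Read and print a handful of messages
        for _ in range(5):
            message = await ws.recv()
            print(message)

asyncio.run(listen())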
The following table compares `curl_cffi` with other popular Python HTTP clients for web scraping:
| Feature | curl_cffi | Requests | AIOHTTP | HTTPX |
| --- | --- | --- | --- | --- |
| Sync API | ✔️ | ✔️ | ❌ | ✔️ |
| Async API | ✔️ | ❌ | ✔️ | ✔️ |
| Support for **WebSocket**s | ✔️ | ❌ | ✔️ | ❌ |
| Connection pooling | ✔️ | ✔️ | ✔️ | ✔️ |
| Support for HTTP/2 | ✔️ | ❌ | ❌ | ✔️ |
| **User-Agent** customization | ✔️ | ✔️ | ✔️ | ✔️ |
| TLS fingerprint spoofing | ✔️ | ❌ | ❌ | ❌ |
| Speed | High | Medium | High | Medium |
| Retry mechanism | ❌ | Available via `HTTPAdapter`s | Available only via a third-party library | Available via built-in `Transport`s |
| Proxy integration | ✔️ | ✔️ | ✔️ | ✔️ |
| Cookie handling | ✔️ | ✔️ | ✔️ | ✔️ |
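As the table shows, `curl_cffi` has no built-in retry mechanism. If you need one, a small hand-rolled wrapper is usually enough; here is a sketch with exponential backoff (the attempt count and delays are arbitrary):

import time
from curl_cffi import requests

def get_with_retries(url, attempts=3, base_delay=2):
    # Retry the GET request, doubling the wait after each failed attempt
    for attempt in range(attempts):
        try:
            response = requests.get(url, impersonate="chrome")
            response.raise_for_status()  # Mirrors the requests API
            return response
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

response = get_with_retries("https://httpbin.org/anything")
print(response.status_code)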
`curl_cffi` requires a manual approach to web scraping, where most of the code must be written by hand. While effective for simple static websites, it can be challenging when dealing with dynamic or highly secure sites.
Bright Data provides several `curl_cffi` alternatives:
- Scraping Browser API: Fully managed cloud browser instances seamlessly integrated with Puppeteer, Selenium, and Playwright. These browsers come with built-in CAPTCHA solving and automated proxy rotation.
- Web Scraper APIs: Pre-configured endpoints that provide fresh, structured data from over 100 popular domains.
- No-Code Scraper: A user-friendly, on-demand data collection service that requires no coding.
- Datasets: Browse pre-built datasets from various websites or tailor data collections to meet your specific needs.
Create a free Bright Data account today to test our proxies and scraping solutions!