Subverting PerimeterX Protections while Web Scraping citygear.com (Hibbetts Sports)

Puppeteer-stealth is a puppeteer plugin that can help by patching fingerprint leaks, the small details that controlled browsers leak into the JavaScript environment. There's also Selenium-stealth, which is no longer maintained but can be a good starting point for your own patches.
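
For context, puppeteer-stealth is distributed as the stealth plugin for puppeteer-extra; a minimal setup looks like this (the target URL is just this thread's example):

```js
// Minimal puppeteer-extra + stealth plugin setup.
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin()); // patches known fingerprint leaks on every page

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://www.citygear.com/');
  // ... scrape as usual ...
  await browser.close();
})();
```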

That being said, many public solutions like puppeteer-stealth will not get past PerimeterX because, well, they are public: the PerimeterX team can simply go through them and update their service to counter them.

So, to bypass PerimeterX, you need to implement your own patches and fixes. To start, we should first understand how PerimeterX determines whether the connecting client is a scraper or a real human being.

PerimeterX (and others) use a score-based system to determine who's a bot and who's a real user. Let's call it a trust score.
How can we raise our trust score?

Proxies

The first and easiest step is to use high-quality residential-type proxies. There are generally three types of IPs: datacenter (assigned to hosting providers like Google Cloud or Amazon AWS), residential (assigned to home internet connections), and mobile (assigned to cellular networks: 4G, 5G, etc.).

When we connect to a PerimeterX-protected target, the first piece of information PerimeterX sees is our IP address. If that address hasn't been seen before and is of residential type, then we start off with a high trust score!
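
In practice that usually means rotating a fresh residential IP per browser session. A rough sketch, run inside an async context like the one above (the proxy pool, hosts, and credentials are placeholders; substitute your provider's endpoints):

```js
// Sketch of per-session proxy rotation; the pool below is hypothetical -
// substitute your residential provider's endpoints and credentials.
const proxies = [
  { host: 'res1.provider.example:8000', username: 'user', password: 'pass' },
  { host: 'res2.provider.example:8000', username: 'user', password: 'pass' },
];
const proxy = proxies[Math.floor(Math.random() * proxies.length)];

const browser = await puppeteer.launch({
  args: [`--proxy-server=${proxy.host}`], // standard Chromium flag
});
const page = await browser.newPage();
// Chromium ignores credentials embedded in --proxy-server,
// so authenticate through puppeteer instead.
await page.authenticate({ username: proxy.username, password: proxy.password });
```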

JavaScript

We already covered Puppeteer-stealth, but there are thousands of other details your browser leaks about itself that can indicate it's not being driven by a real user.

For this, it's best to learn how to fingerprint the browser yourself so you can patch these details. I cover this in great detail on my blog: How to avoid web scraping blocking: Javascript. If you'd like to learn more, a good starting point is to fork puppeteer-stealth and start experimenting, as in the sketch below.
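
For a taste of what such a patch looks like: the best-known leak is the navigator.webdriver flag that automated Chromium exposes. A hand-rolled fix (one of many that puppeteer-stealth already applies, and nowhere near sufficient on its own) could look like this:

```js
// Run before any page script: redefine the leaky getter on Navigator.prototype.
await page.evaluateOnNewDocument(() => {
  Object.defineProperty(Object.getPrototypeOf(navigator), 'webdriver', {
    get: () => undefined,
  });
});
```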

Connection Patterns

Finally, the way your scraper connects to the website matters too. If you request thousands of example.com/product/X URLs in a steady stream, it's a strong signal that you're most likely not a real user. It's important to properly distribute and throttle your scraper and to introduce a bit of chaos (like visiting the homepage) into your connection patterns, as in the sketch below.
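
A minimal sketch of that idea, assuming a puppeteer page from the earlier setup; productUrls and scrapeProduct() are hypothetical stand-ins for your own URL list and extraction logic:

```js
// Randomized pacing to avoid a machine-like request pattern.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

for (const url of productUrls) {
  // Occasionally wander to the homepage to break the pattern.
  if (Math.random() < 0.1) {
    await page.goto('https://www.citygear.com/');
    await sleep(1000 + Math.random() * 4000);
  }
  await page.goto(url);
  await scrapeProduct(page); // hypothetical extraction step
  // Random delay between product pages instead of a fixed interval.
  await sleep(2000 + Math.random() * 5000);
}
```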


It's a huge subject and, unfortunately, a very opaque one, because publishing patches online only makes these services stronger. If you're ready to do more research, check out my full introduction to this subject on my blog: How to scrape without getting blocked
