
Scraping project: how we failed, and what an anti-bot system is

We tried to scrape a website called "athome" and failed. This post explains what we tried and why it failed. It is a good example of how an anti-bot system stops a scraper.


Here is the athome site.

 

A CAPTCHA ("Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge–response test used in computing to determine whether or not the user is human.

The athome site uses strong CAPTCHA functionality as its anti-bot system.

The site has two kinds of CAPTCHA: a drag-and-drop CAPTCHA and the Distil CAPTCHA.

We used two programs to try to extract the necessary data from the athome site.

One script was written with Scrapy, a Python scraping framework.

The other script was built with the Selenium package.

The result of running the Scrapy script is as follows.

 

We could not get any further with Scrapy, because the Distil CAPTCHA kicked in.
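A scraper can at least notice this situation by checking each response before parsing it. The following is a minimal, framework-independent sketch. The status codes and marker strings are assumptions chosen for illustration, not values observed on athome or documented by Distil, and `looks_blocked` is a hypothetical helper name.

```python
# Hypothetical signals; a real block page may look entirely different.
BLOCK_STATUS_CODES = {403, 405, 429}
BLOCK_MARKERS = ("captcha", "are you a human", "access denied")


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if a response looks like a block or CAPTCHA page."""
    # A blocking status code is treated as a block regardless of the body.
    if status_code in BLOCK_STATUS_CODES:
        return True
    # Otherwise search the body, case-insensitively, for a known marker.
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


if __name__ == "__main__":
    print(looks_blocked(200, "<html><h1>Property list</h1></html>"))
    print(looks_blocked(200, "<html>Please complete the CAPTCHA</html>"))
```

A check like this only detects that the scraper has been stopped. It does nothing to get past the CAPTCHA.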

So we switched to the Selenium script to get past the CAPTCHA.

With a real browser open, the drag-and-drop CAPTCHA can be passed by dragging and dropping the button manually.

 

With Selenium we could reach the destination URL once or twice.

However, after the operation is repeated several times, the Distil CAPTCHA starts to appear.
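The pattern "the first visits pass, repeated visits are challenged" is what a rate-based check would produce. The sketch below is a toy model of such a check, written only to illustrate the behaviour we observed. Distil's actual detection logic is not known to us, and the class name, limit, and window length are all hypothetical.

```python
from collections import deque


class SlidingWindowCounter:
    """Flag a client that makes too many requests within a time window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def should_challenge(self, now: float) -> bool:
        """Record a request at time `now`; return True if it exceeds the limit."""
        # Drop timestamps that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        self._timestamps.append(now)
        return len(self._timestamps) > self.max_requests


if __name__ == "__main__":
    # Hypothetical policy: at most 2 requests per 60 seconds.
    counter = SlidingWindowCounter(max_requests=2, window_seconds=60)
    for t in (0, 5, 10, 100):
        print(t, counter.should_challenge(t))
```

With these example settings the first two requests pass and the third is challenged. A request made after the window has expired passes again.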

Here is the Distil CAPTCHA.

Distil CAPTCHA provides an easier test for humans than Google ReCAPTCHA, while being harder for bots to solve. It works in any language or country and defeats CAPTCHA bypass techniques such as OCR, brute force, machine learning, and human CAPTCHA farms.

Best of all, Distil only serves CAPTCHAs to visitors that it determines are bots. The vast majority of legitimate human users remain blissfully unaware testing has occurred.

In our case, Distil proved effective as an anti-scraping service. It embeds JavaScript in the page for bot recognition and threat analysis, and it can show CAPTCHA pop-ups or block visitors on demand.


 

The Distil CAPTCHA receives the cookie information sent by the browser and analyzes it.

Based on this analysis, it decides whether the visitor is a person or an automated scraping bot.
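As a rough illustration of a cookie-based decision, the sketch below classifies a visitor from the `Cookie` request header. It assumes that a script embedded in the page sets a marker cookie in real browsers. The cookie name `js_check`, its value `passed`, and the function `classify_visitor` are hypothetical, and the real analysis is certainly more elaborate.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie that an embedded script would set in a real browser.
CHALLENGE_COOKIE = "js_check"
EXPECTED_VALUE = "passed"


def classify_visitor(cookie_header: str) -> str:
    """Return "human" or "bot" based on the Cookie request header."""
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    morsel = cookies.get(CHALLENGE_COOKIE)
    # No marker cookie: the client probably never ran the embedded script.
    if morsel is None:
        return "bot"
    # Marker cookie present but with an unexpected value.
    if morsel.value != EXPECTED_VALUE:
        return "bot"
    return "human"


if __name__ == "__main__":
    print(classify_visitor("session=abc"))
    print(classify_visitor("session=abc; js_check=passed"))
```

A plain HTTP client such as Scrapy does not execute JavaScript, so under this assumption it would never carry the marker cookie.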

We could not get around the Distil CAPTCHA with Selenium.

 

 
