How to Programmatically Clean Up Data Collected from Web Crawlers


When you publish your website’s content on the internet, you want it to be read by people who have an interest in what you have to say, which usually means publishing articles regularly so that people can find your site through Google or other search engines. But how do you get ranked? How do you know whether your content is being read, and whether a given visitor is a person or a crawler? Answering these questions requires some kind of analytics. This article goes over how crawler data is collected and organized.
Several users have asked how the data behind our crawler-identification site is organized, so today we are happy to reveal how that crawler information is collected and organized.

Verifying crawlers with reverse and forward DNS lookups

We can reverse-resolve a crawler’s IP address to query its rDNS record. For example, for the IP 116.179.32.160, a reverse DNS lookup tool returns the hostname baiduspider-116-179-32-160.crawl.baidu.com.

From this hostname, we can tentatively conclude that the IP belongs to a Baidu search engine spider. However, because a hostname can be forged, a reverse lookup alone is not conclusive; we also need a forward lookup. If we resolve baiduspider-116-179-32-160.crawl.baidu.com, it resolves back to 116.179.32.160, which confirms that this really is a Baidu search engine crawler.
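The reverse-then-forward check above can be sketched in Python with the standard library. This is a minimal sketch: the lookup functions are injectable so the logic can be tested without live DNS, and the allowed hostname suffix is whatever the crawler’s official documentation specifies.

```python
import socket

def verify_crawler_ip(ip, allowed_suffixes=(".crawl.baidu.com",),
                      reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Verify a claimed crawler IP: reverse lookup, check the domain,
    then forward lookup and confirm it resolves back to the same IP."""
    try:
        hostname = reverse(ip)[0]          # step 1: rDNS, IP -> hostname
    except OSError:
        return False                       # no PTR record: cannot verify
    if not hostname.endswith(allowed_suffixes):
        return False                       # hostname is not in the official domain
    try:
        addresses = forward(hostname)[2]   # step 2: forward lookup, hostname -> IPs
    except OSError:
        return False
    return ip in addresses                 # must resolve back to the original IP
```

A spoofed PTR record fails at step 2: the forged hostname will not resolve back to the attacker’s IP.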

Searching by ASN-related information

Only some crawlers follow the rules above; for most crawler IPs, a reverse lookup returns no result. In those cases, we need to query the IP address’s ASN information to determine whether the claimed crawler identity is correct.

For example, take the IP 74.119.118.20. By querying its IP information, we can see that this address is located in Sunnyvale, California, USA.

From the ASN information, we can see that it is an IP belonging to Criteo Corp.

The screenshot above shows log entries from the Criteo crawler: the yellow part is its User-agent, followed by the IP. Nothing is wrong with this entry; the IP is indeed a CriteoBot IP address.
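An ASN check can be sketched as a longest-prefix match against a local table of prefix-to-ASN mappings built from routing data. The prefix and AS number below are illustrative assumptions, not authoritative values; in practice the table would be loaded from a routing database dump.

```python
import ipaddress

# Illustrative prefix-to-ASN table; real entries would come from routing data.
ASN_PREFIXES = [
    ("74.119.116.0/22", 44788, "Criteo Corp."),  # assumed values for illustration
]

def lookup_asn(ip):
    """Return (asn, org) for the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, asn, org in ASN_PREFIXES:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, asn, org)
    return (best[1], best[2]) if best else None
```

If the ASN’s organization does not match the crawler’s claimed operator, the record is treated as unverified.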

IP address segments published in the crawler’s official documentation

Some crawlers publish their IP address segments, and we can save the officially published segments directly to the database. This is an easy and fast way to verify a crawler.
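Checking an IP against published segments is a simple membership test. The ranges below are illustrative placeholders (Googlebot, for instance, publishes its ranges in a JSON file); in practice you would fetch the official file, store it, and refresh it periodically.

```python
import ipaddress

# Illustrative ranges standing in for a crawler's officially published segments.
PUBLISHED_RANGES = ["66.249.64.0/19", "2001:4860:4801::/48"]

NETWORKS = [ipaddress.ip_network(r) for r in PUBLISHED_RANGES]

def is_official_crawler_ip(ip):
    """True if the IP falls inside any officially published segment."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)
```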

Through public logs

We can often find public logs on the Internet; for example, the following image is a public log record I found.

We can parse these log records and, based on the User-agent, determine which entries are crawlers and which are ordinary visitors, which greatly enriches our database of crawler records.
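Parsing logs in the combined log format and classifying entries by User-agent might look like the sketch below. The substring hints are a coarse first-pass filter, and the sample line in the usage is illustrative; verified identification still relies on the DNS, ASN, and published-range checks above.

```python
import re

# Fields of the Apache/Nginx combined log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

CRAWLER_UA_HINTS = ("bot", "spider", "crawler")  # coarse User-agent filter

def classify(line):
    """Parse one log line; return (ip, user_agent, 'crawler' | 'visitor')."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None  # malformed line
    ua = m.group("ua").lower()
    kind = "crawler" if any(h in ua for h in CRAWLER_UA_HINTS) else "visitor"
    return m.group("ip"), m.group("ua"), kind
```

The IPs labeled “crawler” can then be fed through the verification steps before being added to the database.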


These four methods describe how a crawler-identification site collects and organizes crawler data, and how to ensure the accuracy and reliability of that data. Of course, these are not the only methods used in practice, but the others are used far less often, so they are not introduced here.

