[analog-help] Excluding Robots

Jeremy Wadsack jeremy at 7simplemachines.com
Thu Jan 17 09:01:05 PST 2008


The list of robots came from the robots database, which is no longer active. If you want an archived version of that list, there's probably someone here who has a copy.

Blocking hosts that request robots.txt may not be accurate. Larger search engines certainly have dedicated IP addresses for their crawlers, but there are thousands of home-spun crawlers, robots, etc. (thus the stated reason for the demise of the robots database), and many of these could be run on networks where the visible IP address also carries real user traffic. Examples are AOL (where a single user may have several IPs in a session) and any network behind a corporate firewall.

You can use Analog to produce a list of hosts that requested robots.txt. Just use a configuration file like this:

	ALL OFF
	HOST ON
	OUTPUT TEXT
	FILEINCLUDE /robots.txt

From that list you could create a series of HOSTEXCLUDE commands to include in your subsequent run. It's just going to take two passes through the log files.
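
As a rough sketch of that second step, the Python script below turns the text Host Report into HOSTEXCLUDE lines. It is not part of Analog; the function name host_excludes is made up for illustration. It assumes each data row of the report starts with a request count and ends with the host name, which depends on your column settings, so check it against your own output before relying on it.

```python
import sys


def host_excludes(report_lines):
    """Return one HOSTEXCLUDE command per distinct host in the report.

    Assumption (not guaranteed by Analog): each data row starts with a
    request count and ends with the host name, for example
    "     123:  1.23%: crawler.example.com".
    """
    commands = []
    seen = set()
    for line in report_lines:
        fields = line.split()
        # Skip blank lines, column headers and separator rows.
        if len(fields) < 2 or not fields[0][0].isdigit():
            continue
        host = fields[-1]
        # Skip summary rows such as "[not listed: 5 hosts]".
        if host.endswith("]"):
            continue
        if host not in seen:
            seen.add(host)
            commands.append(f"HOSTEXCLUDE {host}")
    return commands


# Hypothetical usage: python make_excludes.py hosts.txt > robots.cfg
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as report:
        for command in host_excludes(report):
            print(command)
```

The generated lines can then be pasted into, or included from, the configuration file used for the second pass. Review the list by hand first, since shared addresses such as proxies would take real visitors out of the report as well.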

--
 
Jeremy Wadsack
Seven Simple Machines
(206) 545-4850

-----Original Message-----
From: analog-help-bounces at lists.meer.net [mailto:analog-help-bounces at lists.meer.net] On Behalf Of Sabine Henneberger
Sent: Thursday, January 17, 2008 5:30 AM
To: analog-help at lists.meer.net
Subject: [analog-help] Excluding Robots

Hello,
how can I exclude hosts that request the file robots.txt?
Other log file analysis programs do it automatically, because they
count every host with such a request as a robot.
How do users of Analog exclude robots if there is no current list at
http://www.wadsack.com/robot-list.html?
Best Regards,
Sabine

-- 
Sabine Henneberger

Humboldt-Universität Berlin
Computer- und Medienservice
Arbeitsgruppe Elektronisches Publizieren
Tel. 030 2093 7075

Humboldt University Berlin, Germany
Computer and Media Service 
Electronic Publishing Group
phone: +49+30+2093-7075 



