[analog-help] Identifying Known Spiders?

Aengus analog07 at eircom.net
Fri Jul 4 06:04:19 PDT 2008


On 7/4/2008 12:30 AM, Michael Crawford wrote:
> On Thu, Jul 3, 2008 at 8:37 AM, Jeremy Wadsack
> <jeremy at 7simplemachines.com> wrote:
>> The robots list from which that page was built no longer exists. The group that was maintaining it decided that it didn't make sense to maintain a database of "known robots" any more as anyone can make a robot.
> 
> In my personal case, it's not so much that I want to watch all the
> bots, as to monitor my progress at getting a new site indexed by the
> search engines.
> 
> While Google, Yahoo and MSN together provide the vast majority of
> search engine referrals, there are still a few small, independent
> players such as JGDO.
> 
> There are lots of reasons for running a bot, some good, some bad.  I'd
> be happy if I could get a report of visits by the bots belonging to, say,
> the top half-dozen search engines.
> 
> Note that it often happens, with new sites, that a search engine
> spider may not visit at all for months, and even then will only fetch
> the home page.  By creating config files for each of my pages, I hope
> to monitor spider visits throughout my site.
> 
> If this isn't yet possible with analog, I don't think it would be hard
> to implement, and would be very popular, and so would get Analog a lot
> more users, and maybe some consulting fees for Analog experts.

It all comes down to the same simple question - how do you decide that 
any given request is from a spider/bot rather than a real person? If you 
rely on the User-Agent string you then have to decide how to identify 
the relevant strings - assume that everything that isn't a "well known 
browser" is a spider, or assume that everything that asks for 
/robots.txt is a spider.

Unfortunately, there's nothing to stop a bot using a "well known 
browser" User-Agent (see recent controversy about the AVG LinkScanner, 
for example), and there's nothing to stop an ordinary user from 
requesting /robots.txt. That means that there's no simple way to 
automate the identification of spiders - it requires some judgement, and 
Analog doesn't do judgement :-).

Once you come up with a set of rules that work for you (or for the set 
of log files that you're working with at the moment), then it's not 
difficult to use Analog to delve deeper into the robot traffic. You can 
use FILEINCLUDE /robots.txt to get a list of IP addresses or Browser 
strings that have requested /robots.txt. You can then use this 
information with HOSTINCLUDE or with BROWINCLUDE to get a view of the 
rest of the traffic from either one specific spider, or all of the 
spiders as a whole, bearing in mind that the job of spidering your site 
might be spread between a number of different machines, so you might 
need to HOSTINCLUDE a range of machines if you use that technique.
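
As a rough illustration of the first step done outside Analog, here is a 
minimal Python sketch that collects the hosts and User-Agent strings that 
requested /robots.txt, so they can be reviewed by hand before being turned 
into HOSTINCLUDE or BROWINCLUDE patterns. It assumes the log is in 
Apache/NCSA "combined" format; the function name, the regular expression 
and the sample lines are all made up for the example and are not part of 
Analog.

```python
import re
from collections import Counter

# Assumption: Apache/NCSA "combined" log format. Adjust for other formats.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_requesters(lines):
    """Count hosts and User-Agent strings that requested /robots.txt."""
    hosts, agents = Counter(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that do not parse as combined format
        path = m.group("path").split("?", 1)[0]  # ignore any query string
        if path == "/robots.txt":
            hosts[m.group("host")] += 1
            agents[m.group("agent")] += 1
    return hosts, agents


# Hypothetical sample data, using documentation-only IP addresses.
SAMPLE = [
    '192.0.2.10 - - [04/Jul/2008:06:00:00 -0700] '
    '"GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [04/Jul/2008:06:00:05 -0700] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [04/Jul/2008:06:01:00 -0700] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    hosts, agents = robots_txt_requesters(SAMPLE)
    for host, count in hosts.most_common():
        print("host ", host, count)
    for agent, count in agents.most_common():
        print("agent", agent, count)
```

The output is only a list of candidates. As noted above, an ordinary user 
can request /robots.txt too, so each entry still needs a judgement call 
before it goes into a config file.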

So you can certainly use Analog to watch this type of traffic - indeed 
Analog's configurability makes it an ideal tool for the job. But because 
there are no black and white rules for deciding what is or is not a 
robot/spider, this functionality can't be built in to Analog. The 
decisions that you might make today to do this analysis on your site 
might be different for someone else, and might be different in a few 
months' time, as the list of search engines changes.

Aengus


