[analog-help] Identifying Known Spiders?
Aengus
analog07 at eircom.net
Fri Jul 4 06:04:19 PDT 2008
On 7/4/2008 12:30 AM, Michael Crawford wrote:
> On Thu, Jul 3, 2008 at 8:37 AM, Jeremy Wadsack
> <jeremy at 7simplemachines.com> wrote:
>> The robots list from which that page was built no longer exists. The group that was maintaining it decided that it didn't make sense to maintain a database of "known robots" any more as anyone can make a robot.
>
> In my personal case, it's not so much that I want to watch all the
> bots, as to monitor my progress at getting a new site indexed by the
> search engines.
>
> While Google, Yahoo and MSN together provide the vast majority of
> search engine referrals, there are still a few small, independent
> players such as JGDO.
>
> There are lots of reasons for running a bot, some good, some bad. I'd
> be happy if I could get a report of visits by the bots belonging to, say,
> the top half-dozen search engines.
>
> Note that it often happens, with new sites, that a search engine
> spider may not visit at all for months, and even then will only fetch
> the home page. By creating config files for each of my pages, I hope
> to monitor spider visits throughout my site.
>
> If this isn't yet possible with analog, I don't think it would be hard
> to implement, and would be very popular, and so would get Analog a lot
> more users, and maybe some consulting fees for Analog experts.
It all comes down to the same simple question - how do you decide that
any given request is from a spider/bot rather than a real person? If you
rely on the User-Agent string you then have to decide how to identify
the relevant strings - assume that everything that isn't a "well known
browser" is a spider, or assume that everything that asks for
/robots.txt is a spider.
Unfortunately, there's nothing to stop a bot using a "well known
browser" User-Agent (see recent controversy about the AVG LinkScanner,
for example), and there's nothing to stop an ordinary user from
requesting /robots.txt. That means that there's no simple way to
automate the identification of spiders - it requires some judgement, and
Analog doesn't do judgement :-).
Once you come up with a set of rules that work for you (or for the set
of log files that you're working with at the moment), then it's not
difficult to use Analog to delve deeper into the robot traffic. You can
use FILEINCLUDE /robots.txt to get a list of IP addresses or Browser
strings that have requested /robots.txt. You can then use this
information with HOSTINCLUDE or with BROWINCLUDE to get a view of the
rest of the traffic from either one specific spider or all of the
spiders as a whole. Bear in mind that the job of spidering your site
might be spread across a number of different machines, so you might
need to HOSTINCLUDE a range of machines if you use that technique.
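
As a rough sketch of that two-pass approach (the file names and the
Googlebot patterns below are placeholders I've made up for illustration,
not anything taken from your setup), a first configuration file might
restrict the analysis to requests for /robots.txt and list who made them:

```
# pass1.cfg - who has asked for /robots.txt?
# access.log and robots-requesters.html are hypothetical file names.
LOGFILE access.log
OUTFILE robots-requesters.html

# Only analyse requests for this one file.
FILEINCLUDE /robots.txt

# Turn everything off, then enable just the reports of interest.
ALL OFF
HOST ON
BROWSERREP ON
```

A second configuration could then take one of the names found in the
first pass and look at everything else that it fetched:

```
# pass2.cfg - all traffic from one spider (Googlebot used as an example).
LOGFILE access.log
OUTFILE googlebot.html

# Select by browser string...
BROWINCLUDE *Googlebot*
# ...or by host instead, if your log contains resolved host names:
# HOSTINCLUDE *.googlebot.com

ALL OFF
REQUEST ON
DAILYREP ON
```

The Browser Report and BROWINCLUDE only work if your log format actually
records the User-Agent, and you may need to lower the report floors to
see hosts or browsers with only a handful of requests.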
So you can certainly use Analog to watch this type of traffic - indeed
Analog's configurability makes it an ideal tool for the job. But because
there are no black and white rules for deciding what is or is not a
robot/spider, this functionality can't be built into Analog. The
decisions that you might make today to do this analysis on your site
might be different for someone else, and might be different in a few
months' time, as the list of search engines changes.
Aengus