[analog-help] Identifying Known Spiders?

Jeremy Wadsack jeremy at 7simplemachines.com
Thu Jul 3 08:37:43 PDT 2008


The robots list from which that page was built no longer exists. The group that was maintaining it decided that it didn't make sense to maintain a database of "known robots" any more, as anyone can make a robot. However, from a quick review, it looks like there's a new user-submitted list at http://www.robotstxt.org/db.html and a "wild caught" list at http://www.botsvsbrowsers.com/category/1/index.html. If I have time this weekend, maybe I'll update the scripts to pull from one of those sources.

I have to say, though, that the more I think about it, the more I'm of the mindset that anything that is *not a known web browser* is most likely a bot. Maybe inverting the logic would make sense at this point.
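
As a rough illustration of that inverted logic, here is a minimal Python sketch. It is not part of Analog or of the existing scripts; the function name and the short allowlist of browser patterns are assumptions made up for the example, and a real list would need to be much longer and kept up to date.

```python
import re

# Hypothetical allowlist of patterns for mainstream browsers.
# Illustrative assumption only, not a maintained or complete list.
KNOWN_BROWSER_PATTERNS = [
    re.compile(r"MSIE \d"),
    re.compile(r"Firefox/\d"),
    re.compile(r"Safari/\d"),
    re.compile(r"Opera[/ ]\d"),
]


def is_probable_bot(user_agent: str) -> bool:
    """Return True when the User-Agent matches no known browser pattern.

    This inverts the usual approach: instead of listing robots,
    list browsers and treat everything else as a robot.
    """
    return not any(p.search(user_agent) for p in KNOWN_BROWSER_PATTERNS)
```

One caveat with this approach: a robot that sends a browser-like User-Agent string would be counted as a browser, so the result is only an estimate.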

--

Jeremy Wadsack
Seven Simple Machines
Main: (206) 545-4850
Direct: (206) 812-6829

-----Original Message-----
From: analog-help-bounces at lists.meer.net [mailto:analog-help-bounces at lists.meer.net] On Behalf Of Aengus
Sent: Thursday, July 03, 2008 4:30 AM
To: Support for analog web log analyzer
Subject: Re: [analog-help] Identifying Known Spiders?

On 7/3/2008 3:48 AM, Michael Crawford wrote:
> I'd like to know the success of my efforts to submit a new site to all
> the search engines; some spiders won't visit a site until it's been
> online for a while, and some will only visit the home page.
>
> I can see some of the spiders in the BROWSERREP and BROWSERSUM, but
> it's missing some because it's definitely missing Googlebot and Yahoo
> Slurp.
>
> Also the BROWSERREP shows all the browsers used by my human visitors;
> it will get hard to spot spiders when my traffic picks up.
>
> Is there a report specifically for known spiders?

No, the only special treatment for spiders in Analog is the ROBOTINCLUDE
command which tells Analog to count the requests with the specified
User-Agents as Search Engines in the OS Report.

There used to be a list of Spider User-Agents at
http://www.wadsack.com/robot-list.html but it seems to be empty at the
moment. There's a list from May 2007 at
http://www2.owen.vanderbilt.edu/mike.shor/diversions/analog/RobotInclude.txt

You might want to do a report with FILEINCLUDE /robots.txt, which should
give you a good indication of which search engines are hitting your site.
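
The same idea can be tried outside Analog with a short script. The following is a minimal Python sketch, assuming the log is in the Apache/NCSA "combined" format (where the User-Agent is the last quoted field); the function name and the regular expression are assumptions for illustration and would need adjusting for other log formats.

```python
import re
from collections import Counter

# Assumes the Apache/NCSA "combined" log format:
# host ident user [date] "METHOD path protocol" status bytes "referrer" "agent"
COMBINED_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] '
    r'"\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_agents(lines):
    """Count requests for /robots.txt per User-Agent string."""
    counts = Counter()
    for line in lines:
        match = COMBINED_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        # Ignore any query string when comparing the requested path.
        path = match.group("path").split("?", 1)[0]
        if path == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts
```

Passing an open log file to robots_txt_agents() and printing counts.most_common() would list the agents that fetch robots.txt most often, which are usually the well-behaved spiders.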

Aengus



