Global Crawler Traps

A crawler trap is any sequence of web pages that a crawler can blindly and endlessly follow without harvesting any new information. A common example is a calendar system with hyperlinks to the next or previous dates. Crawler traps can be avoided by specifying, as regular expressions, URLs that the crawler should ignore. In NetarchiveSuite, crawler traps can be specified either per domain or globally. This section describes the management of global crawler traps.

A list of crawler traps is just a plain text file containing crawler-trap regular expressions, one per line. Crawler traps specified this way must be XML-escaped. For instance, one standard set of global crawler traps includes the following calendar-related expressions:

.*index\.php\?module=PostCalendar&func=view&tplview=.*
.*modules\.php\?name=vwar&file=calendar.*
.*modules\.php\?name=Calendar&op=modload&file=index.*
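
As a rough illustration of how such expressions are applied, the following Java sketch tests a candidate URL against the trap patterns above. It is only a sketch: the class name and example URL are invented, and it is not NetarchiveSuite's actual matching code.

import java.util.List;
import java.util.regex.Pattern;

public class CrawlerTrapExample {
    public static void main(String[] args) {
        // The calendar trap expressions from the list above.
        List<Pattern> traps = List.of(
                Pattern.compile(".*index\\.php\\?module=PostCalendar&func=view&tplview=.*"),
                Pattern.compile(".*modules\\.php\\?name=vwar&file=calendar.*"),
                Pattern.compile(".*modules\\.php\\?name=Calendar&op=modload&file=index.*"));

        // Hypothetical URL used only for this example.
        String url = "http://example.org/index.php?module=PostCalendar&func=view&tplview=month";

        // A URL is ignored by the crawler if it matches any trap expression.
        boolean trapped = traps.stream().anyMatch(p -> p.matcher(url).matches());
        System.out.println(trapped ? "ignore (crawler trap)" : "harvest");
    }
}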

Lists may be active or inactive. When NetarchiveSuite creates a new job for any harvest, the crawler traps from all active lists (with duplicates removed) are added to that job's crawl template.
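
The combination of active lists can be pictured with the sketch below, which merges two hypothetical lists while dropping duplicates. The list contents and class name are made up for illustration; this is not NetarchiveSuite's implementation.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CombineTrapLists {
    public static void main(String[] args) {
        // Two hypothetical active lists that share one expression.
        List<String> listA = List.of(".*calendar.*", ".*vwar.*");
        List<String> listB = List.of(".*calendar.*", ".*PostCalendar.*");

        // A LinkedHashSet keeps only the first occurrence of each expression,
        // mirroring the duplicate removal described above.
        Set<String> combined = new LinkedHashSet<>();
        combined.addAll(listA);
        combined.addAll(listB);

        // These expressions would then be added to the job's crawl template.
        combined.forEach(System.out::println);
    }
}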

To upload a list of global traps, first click on the Edit link and fill in a name and a description for the list of crawler traps, together with the path to the file containing the crawler-trap expressions. You can also choose whether the list should initially be active or inactive. Click Create to upload the list.

A list may be made active or inactive by clicking the Activate and Deactivate buttons. Lists may also be viewed (via the Retrieve button), deleted, or edited. Note that the retrieved version of a crawler-trap list may differ from the original upload: duplicates are removed during upload, and the lines may not appear in the same order as in the original file. The Edit action allows a new version of the list to be uploaded.

A side effect of using global crawler-trap lists is that the database grows more rapidly, because the modified crawl template, including all the active crawler traps, is stored for every job.
