BNE

KB-Sweden

Questions:

Do you treat certain types of web sites/domains as uninteresting to harvest, and limit their budget or reduce the harvest in other ways? If yes:

  • Which categories of web sites?
  • How do you identify the category and find which web sites to treat specially?
  • How do you reduce the harvest there – data limit, object count limit, reject rules?

We would like to avoid the very large number of web sites containing huge product catalogues, often with lots of images on each product. But are there ways to find and avoid/limit them in some (semi-)automatic way?
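
One possible semi-automatic approach – not a Heritrix feature, just a rough Python sketch where the URL patterns, thresholds and function name are invented for illustration – would be to count how many of a host's URLs look catalogue-like and then shrink that host's object-count and data budget once the count gets high:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative thresholds and URL patterns only -- not tuned, not Heritrix settings.
MAX_DOCS_PER_HOST = 10_000            # default object-count budget per host
MAX_BYTES_PER_HOST = 2_000_000_000    # default data budget per host (~2 GB)
CATALOGUE_HINTS = ("/product/", "/produkt/", "/artikel/", "/p/", "sku=")
CATALOGUE_TRIGGER = 500               # matching URLs before we call the host a catalogue

docs_per_host = defaultdict(int)
bytes_per_host = defaultdict(int)
catalogue_hits = defaultdict(int)

def should_fetch(url: str, expected_bytes: int = 0) -> bool:
    """Return False once a host that looks like a product catalogue
    has used up a (deliberately much smaller) budget."""
    parts = urlparse(url)
    host = parts.netloc
    path_and_query = parts.path + "?" + parts.query
    if any(hint in path_and_query for hint in CATALOGUE_HINTS):
        catalogue_hits[host] += 1
    doc_budget, byte_budget = MAX_DOCS_PER_HOST, MAX_BYTES_PER_HOST
    if catalogue_hits[host] > CATALOGUE_TRIGGER:
        doc_budget //= 10      # shrink the budget for catalogue-like hosts
        byte_budget //= 10
    if docs_per_host[host] >= doc_budget or bytes_per_host[host] >= byte_budget:
        return False
    docs_per_host[host] += 1
    bytes_per_host[host] += expected_bytes
    return True
```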

(Also on the wish list – once such a site has been identified – would be a way to harvest a specified proportion of it, e.g. 1 %, randomly selected among a representative selection of different types of pages … :) )
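
As a rough illustration of that wish-list idea (purely a sketch; the "page type" here is just the first path segment, and the function name and parameters are made up for the example), proportional sampling per page type could look something like this:

```python
import random
from urllib.parse import urlparse

def sample_urls(urls, proportion=0.01, seed=1):
    """Pick roughly `proportion` of the URLs, spread over crude 'page types'
    (here simply the first path segment; a real classifier could be swapped in)."""
    rng = random.Random(seed)
    by_type = {}
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        page_type = segments[0] if segments else ""
        by_type.setdefault(page_type, []).append(url)
    sample = []
    for bucket in by_type.values():
        k = min(len(bucket), max(1, round(len(bucket) * proportion)))
        sample.extend(rng.sample(bucket, k))
    return sample
```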

A side-track to this is the more complicated crawler traps that often show up on these (and other) sites, e.g. infinite loops of types that Heritrix can't detect (a/b/c/a/b/c, pages referring to themselves with extra parameters, etc.). Any hints?
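
A naive way to flag such traps outside the crawler – a hedged Python sketch, not a Heritrix rule; function names and thresholds are invented – could be to look for repeating path-segment cycles and for URLs that point back at their referrer with ever more query parameters:

```python
from urllib.parse import urlparse, parse_qs

def looks_like_path_cycle(url: str, min_cycle: int = 2, min_repeats: int = 2) -> bool:
    """Flag paths such as /a/b/c/a/b/c/... where a group of segments repeats."""
    segs = [s for s in urlparse(url).path.split("/") if s]
    for cycle_len in range(min_cycle, len(segs) // min_repeats + 1):
        tail = segs[-cycle_len * min_repeats:]
        if len(tail) == cycle_len * min_repeats and all(
            tail[i] == tail[i + cycle_len] for i in range(len(tail) - cycle_len)
        ):
            return True
    return False

def looks_like_param_growth(url: str, referrer: str, max_params: int = 10) -> bool:
    """Flag a URL that points back at its referrer's own page but keeps
    accumulating additional query parameters."""
    u, r = urlparse(url), urlparse(referrer)
    same_page = (u.netloc, u.path) == (r.netloc, r.path)
    gained_params = len(parse_qs(u.query)) > len(parse_qs(r.query))
    too_many = len(parse_qs(u.query)) > max_params
    return same_page and gained_params and too_many
```

Such checks could be run over crawl logs after the fact, or adapted into whatever reject-rule mechanism the crawler offers.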

Next meetings

  • October 6, 2020
  • November 3, 2020
  • December 8, 2020
  • January 5, 2021

Any other business?
