- Progression/Queues
- Crawllog: check cache update, filtering, paging
- Reports: click on at least two of them. Check that the Processors report shows that deduplication has been disabled.
- Show/delete frontier: delete some items from the frontier
- Add RejectRules: add a new rule
- Modify budget: add an object limit to some domain or subdomain
Now restart (unpause) the job. Then immediately ...
- Check that netarkivet.dk and sulnudu.dk are not listed as being harvested.
- Check that neither domain appears in the crawler beans of the harvest (order) template for any of the jobs, with the possible exception of the following lines:
[devel@kb-prod-udv-001 ~]$ ssh netarkiv@sbnetarkdv@sb-test-bar-001.statsbiblioteket.dk grep netarkivet.dk /netarkiv/0001/TEST2/filedir/*-metadata-1.warc | grep -v 'metadata:'
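The command above searches for netarkivet.dk only. To cover both domains in one pass, a small helper along the following lines could be run on the machine that holds the metadata files. This is a minimal sketch: the function name `check_domains` is hypothetical and not part of NetarchiveSuite, and it assumes the metadata files follow the `*-metadata-1.warc` naming used above.

```shell
# Hypothetical helper (not part of NetarchiveSuite): for each domain, print
# the lines in the job metadata files that mention it, skipping any line
# that contains 'metadata:' (the same filter as the command above).
check_domains() {
    dir="$1"    # directory holding the *-metadata-1.warc files
    shift       # the remaining arguments are the domains to look for
    for domain in "$@"; do
        # -F matches the domain as a fixed string, so '.' is not a wildcard.
        # '|| true': a domain with no matches is not treated as an error.
        grep -F -- "$domain" "$dir"/*-metadata-1.warc | grep -v 'metadata:' || true
    done
}
```

For the test setup above it might be called as `check_domains /netarkiv/0001/TEST2/filedir netarkivet.dk sulnudu.dk`; any output then has to be compared by hand against the permitted lines.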