How Do We Do QA Now?
At a recent international meeting, a group of leading web-archiving practitioners was asked, "How do you do QA for your archive?" The embarrassed silence was deafening. This is not because web archivists are unprofessional, but because the tools available to them have not kept pace with the changing nature of both web archiving and the web itself. For example, NetarchiveSuite's Viewerproxy does not function at all with HTTPS sites.
Fortunately, the IIPC recently convened a hackathon to address the issue of QA, and this project, from Innovation Week in May 2019, is a follow-up to that.
Extending Logtrix
Logtrix (https://github.com/iipc/logtrix) is a tool to process and visualise Heritrix crawl logs. What is a crawl log? Here is a section from one:
2018-06-22T06:10:20.034Z 1 67 dns:nyheder.tv2.dk P http://nyheder.tv2.dk/ text/dns #008 20180622061019287+218 sha1:YG6VB62O3RSD3GR562AVCME2KVLES4LG http://nyheder.tv2.dk/ content-size:67
2018-06-22T06:10:20.042Z 1 54 dns:ekstrabladet.dk P http://ekstrabladet.dk/ text/dns #032 20180622061019352+119 sha1:KXFJLASYI7IADC6SVELBVBNRFAVREDRJ http://ekstrabladet.dk/ content-size:54
2018-06-22T06:10:20.933Z 404 1245 http://altinget.dk/robots.txt P http://altinget.dk/ text/html #046 20180622061020454+428 sha1:AS23RBWCBWELK7XKNWH7RATCJJFMDZI5 http://altinget.dk/ content-size:1425
2018-06-22T06:10:20.942Z 301 0 http://jyllands-posten.dk/robots.txt P http://jyllands-posten.dk/ unknown #019 20180622061020421+466 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://jyllands-posten.dk/ content-size:202
2018-06-22T06:10:20.950Z 301 0 http://dr.dk/robots.txt P http://dr.dk/ unknown #016 20180622061020473+415 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://dr.dk/ content-size:304
2018-06-22T06:10:20.956Z 200 237 http://ekstrabladet.dk/robots.txt P http://ekstrabladet.dk/ text/plain #014 20180622061020465+440 sha1:JYUYCJ43MLGZYUHOREYB2ZHMCQJUVDZQ http://ekstrabladet.dk/ content-size:628
Each line represents a URL handled by Heritrix and includes such information as the payload size, the HTTP status code, the MIME type, and the parent URL (if any) from which this URL was derived. Logtrix consists of
- A command-line tool to compile statistics from a crawl log into a JSON structure, and
- A simple web UI to view the results in tabular form.
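To make the "compile statistics into a JSON structure" step concrete, here is a minimal Python sketch. It is not Logtrix's own code or output format. The function names `parse_line` and `compile_stats` are hypothetical, and it assumes the leading fields of each crawl-log line are separated by whitespace in the order shown in the sample above (timestamp, status code, size, URL, discovery path, parent URL, MIME type).

```python
import json
from collections import defaultdict

# Four of the sample crawl-log lines shown above.
SAMPLE = [
    "2018-06-22T06:10:20.933Z 404 1245 http://altinget.dk/robots.txt P http://altinget.dk/ text/html #046 20180622061020454+428 sha1:AS23RBWCBWELK7XKNWH7RATCJJFMDZI5 http://altinget.dk/ content-size:1425",
    "2018-06-22T06:10:20.942Z 301 0 http://jyllands-posten.dk/robots.txt P http://jyllands-posten.dk/ unknown #019 20180622061020421+466 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://jyllands-posten.dk/ content-size:202",
    "2018-06-22T06:10:20.950Z 301 0 http://dr.dk/robots.txt P http://dr.dk/ unknown #016 20180622061020473+415 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://dr.dk/ content-size:304",
    "2018-06-22T06:10:20.956Z 200 237 http://ekstrabladet.dk/robots.txt P http://ekstrabladet.dk/ text/plain #014 20180622061020465+440 sha1:JYUYCJ43MLGZYUHOREYB2ZHMCQJUVDZQ http://ekstrabladet.dk/ content-size:628",
]


def parse_line(line):
    """Pick out the leading fields of one crawl-log line (assumed field order)."""
    fields = line.split()
    return {
        "status": fields[1],  # kept as a string so it can serve as a JSON key
        "size": int(fields[2]) if fields[2].isdigit() else 0,  # "-" counts as 0
        "url": fields[3],
        "parent": None if fields[5] == "-" else fields[5],
        "mime": fields[6],
    }


def compile_stats(lines, key):
    """Count records and sum sizes, grouped by `key` ("status" or "mime")."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0})
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_line(line)
        bucket = stats[record[key]]
        bucket["count"] += 1
        bucket["bytes"] += record["size"]
    return dict(stats)


if __name__ == "__main__":
    print(json.dumps(compile_stats(SAMPLE, "status"), indent=2))
```

Grouping by a different field only requires changing the `key` argument, for example `compile_stats(SAMPLE, "mime")`.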
The latest Logtrix release can show the statistics by status code, MIME type or domain. For our use case in Netarkivet we have extended this in a couple of directions, but first a digression ...
Domains and Harvests and Seeds and Trees and Graphs
Researchers like to envisage the web as a directed graph in which web pages link to other web pages (via hyperlink, embedding, implication, redirect, etc.). This link-graph concept is something of an oversimplification because the actual linkage of the web is dynamic: the links you find in a browser or crawler depend on your context (user agent, time of day, login, etc.) and on how the web server responds to that context, which may be completely arbitrary and will often include a random component.
However, any actual crawl of the web will nevertheless be a concrete realisation of some part of this theoretical link-graph. A web crawl in, for example, Heritrix is defined by
- A set of seeds
- A set of rules for dealing with fetched content, including for example
  - how to extract links from content
  - how wide to harvest (e.g. when to include other domains)
  - how deep to harvest
  - how much to harvest
Any actual crawl is therefore not only a subgraph of the true link-graph, but is actually a tree structure (one tree for each seed). [I'm assuming we treat any URLs refetched during the harvest as new nodes in the tree.]
Extending Logtrix
Logtrix shows statistics for a harvest per domain. But, as discussed above, we define harvests not by domains but by seeds (which are URLs, not domain names or host names). So, at least if we are thinking like curators rather than researchers, we want to gather harvest statistics per seed, not per domain. The Heritrix crawl log does not explicitly specify the seed for each URL harvested, but the information is implicit: we can follow the chain of parent URLs from any URL until we reach a URL with no parent, which must be the corresponding seed.
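The parent-following idea can be sketched in a few lines of Python. This is an illustration only, not the code of our Logtrix extension. The names `build_parent_map` and `find_seed` and the example URLs are hypothetical, and the sketch assumes each URL is represented by its first occurrence in the log, which sidesteps the question of refetched URLs.

```python
# Hypothetical (url, parent) pairs as they might be extracted from a crawl log.
# A parent of None means the log recorded no parent for that URL.
RECORDS = [
    ("http://example.org/", None),
    ("http://example.org/a", "http://example.org/"),
    ("http://cdn.example.net/img.png", "http://example.org/a"),
]


def build_parent_map(records):
    """Map each URL to its parent, keeping the first occurrence of each URL."""
    parents = {}
    for url, parent in records:
        parents.setdefault(url, parent)
    return parents


def find_seed(url, parents):
    """Follow parent links upwards until a URL with no known parent is reached.

    Returns None if the parent links loop back on themselves, which can happen
    when refetched URLs are collapsed into a single entry.
    """
    seen = set()
    while url not in seen:
        seen.add(url)
        parent = parents.get(url)
        if parent is None:
            return url  # no parent recorded: treat this URL as the seed
        url = parent
    return None


if __name__ == "__main__":
    parents = build_parent_map(RECORDS)
    print(find_seed("http://cdn.example.net/img.png", parents))
```

Note that a URL whose parent never appears in the log is also reported as a seed by this sketch, so a truncated or filtered log would give misleading results.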
This is not a trivial point! Let's take one of our large snapshot harvest jobs with over 8000 seeds and compare the biggest domains in the harvest with the biggest seeds:
Note that the largest domains in the crawl are places like facebook.com and cdn.simplesite.com. But these domains are not specified in any of the seeds. They typically represent embedded content included on a page. The largest seeds are quite a distinct set (and, of course, are URLs).
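The difference between the two views can be reproduced on toy data. The following Python sketch uses invented URLs and sizes, not figures from our harvest, and the names `seed_of` and `totals` are hypothetical. It also approximates "domain" by the host name of the URL, which is an assumption rather than the way Logtrix or Netarkivet defines a domain.

```python
from collections import Counter
from urllib.parse import urlsplit

# Invented (url, parent, size_in_bytes) records for illustration only.
RECORDS = [
    ("http://example.org/", None, 1000),
    ("http://example.org/page", "http://example.org/", 2000),
    ("http://cdn.example.net/video.mp4", "http://example.org/page", 50000),
    ("http://other.example/", None, 500),
    ("http://cdn.example.net/logo.png", "http://other.example/", 300),
]


def seed_of(url, parents):
    """Walk up the parent links; stop at a URL with no parent (or on a loop)."""
    seen = set()
    while url not in seen and parents.get(url) is not None:
        seen.add(url)
        url = parents[url]
    return url


def totals(records):
    """Return (bytes per host, bytes per seed) for the given records."""
    parents = {}
    for url, parent, _size in records:
        parents.setdefault(url, parent)  # first occurrence of each URL wins
    by_host, by_seed = Counter(), Counter()
    for url, _parent, size in records:
        by_host[urlsplit(url).hostname] += size
        by_seed[seed_of(url, parents)] += size
    return by_host, by_seed


if __name__ == "__main__":
    by_host, by_seed = totals(RECORDS)
    print("largest host:", by_host.most_common(1))
    print("largest seed:", by_seed.most_common(1))
```

In this toy data the largest host is a content-delivery host that is not a seed at all, while the per-seed view attributes its bytes to the pages that embedded the content. Records such as `dns:` entries have no host name under `urlsplit` and would be grouped under `None` in the per-host totals.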