How Do We Do QA Now?
At a recent international meeting, a group of leading web-archiving practitioners was asked, "How do you do QA for your archive?" The embarrassed silence was deafening. This is not because web archivists are unprofessional, but because the tools available to them have not kept up with the changing nature of both web archiving and the web. For example, NetarchiveSuite's Viewerproxy doesn't function at all with HTTPS sites.
Fortunately the IIPC recently convened a hackathon to address the issue of QA, and this project from Innovation Week May 2019 is a follow-up to that.
Extending Logtrix
Logtrix (https://github.com/iipc/logtrix) is a tool to process and visualise Heritrix crawl logs. What is a crawl log? Here is a section from one:
2018-06-22T06:10:20.034Z 1 67 dns:nyheder.tv2.dk P http://nyheder.tv2.dk/ text/dns #008 20180622061019287+218 sha1:YG6VB62O3RSD3GR562AVCME2KVLES4LG http://nyheder.tv2.dk/ content-size:67
2018-06-22T06:10:20.042Z 1 54 dns:ekstrabladet.dk P http://ekstrabladet.dk/ text/dns #032 20180622061019352+119 sha1:KXFJLASYI7IADC6SVELBVBNRFAVREDRJ http://ekstrabladet.dk/ content-size:54
2018-06-22T06:10:20.933Z 404 1245 http://altinget.dk/robots.txt P http://altinget.dk/ text/html #046 20180622061020454+428 sha1:AS23RBWCBWELK7XKNWH7RATCJJFMDZI5 http://altinget.dk/ content-size:1425
2018-06-22T06:10:20.942Z 301 0 http://jyllands-posten.dk/robots.txt P http://jyllands-posten.dk/ unknown #019 20180622061020421+466 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://jyllands-posten.dk/ content-size:202
2018-06-22T06:10:20.950Z 301 0 http://dr.dk/robots.txt P http://dr.dk/ unknown #016 20180622061020473+415 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://dr.dk/ content-size:304
2018-06-22T06:10:20.956Z 200 237 http://ekstrabladet.dk/robots.txt P http://ekstrabladet.dk/ text/plain #014 20180622061020465+440 sha1:JYUYCJ43MLGZYUHOREYB2ZHMCQJUVDZQ http://ekstrabladet.dk/ content-size:628
Each line represents a URL handled by Heritrix and includes information such as the HTTP return code, the payload response size, the MIME type, and the parent URL (if any) from which this URL was derived. Logtrix consists of:
- A command-line tool to compile statistics from a crawl log into a JSON structure, and
- A simple web UI to view the results in tabular form.
The latest Logtrix release offers views of the statistics by status code, MIME type or domain. For our use case in Netarkivet we have extended this in a couple of directions, but first a digression ...
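To make the first of these components concrete, here is a minimal sketch of compiling per-status-code statistics from crawl-log lines into JSON. It is not Logtrix's own code or output format: the function names `parse_line` and `stats_by` and the JSON layout are invented for illustration, and the field positions are simply read off the sample lines above.

```python
import json
from collections import defaultdict

# Four of the sample crawl-log lines shown earlier.
SAMPLE_LOG = """\
2018-06-22T06:10:20.933Z 404 1245 http://altinget.dk/robots.txt P http://altinget.dk/ text/html #046 20180622061020454+428 sha1:AS23RBWCBWELK7XKNWH7RATCJJFMDZI5 http://altinget.dk/ content-size:1425
2018-06-22T06:10:20.942Z 301 0 http://jyllands-posten.dk/robots.txt P http://jyllands-posten.dk/ unknown #019 20180622061020421+466 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://jyllands-posten.dk/ content-size:202
2018-06-22T06:10:20.950Z 301 0 http://dr.dk/robots.txt P http://dr.dk/ unknown #016 20180622061020473+415 sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://dr.dk/ content-size:304
2018-06-22T06:10:20.956Z 200 237 http://ekstrabladet.dk/robots.txt P http://ekstrabladet.dk/ text/plain #014 20180622061020465+440 sha1:JYUYCJ43MLGZYUHOREYB2ZHMCQJUVDZQ http://ekstrabladet.dk/ content-size:628
"""


def parse_line(line):
    """Split one whitespace-separated crawl-log line into named fields."""
    fields = line.split()
    size = fields[2]
    return {
        "timestamp": fields[0],
        "status": int(fields[1]),
        # Assumption: a non-numeric size (e.g. "-") is counted as 0 bytes.
        "bytes": int(size) if size.isdigit() else 0,
        "url": fields[3],
        "parent": fields[5],
        "mime": fields[6],
    }


def stats_by(records, key):
    """Count URLs and sum bytes for each distinct value of `key`."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0})
    for record in records:
        bucket = stats[str(record[key])]
        bucket["count"] += 1
        bucket["bytes"] += record["bytes"]
    return dict(stats)


records = [parse_line(line) for line in SAMPLE_LOG.splitlines() if line.strip()]
by_status = stats_by(records, "status")
print(json.dumps(by_status, indent=2, sort_keys=True))
```

Passing "mime" instead of "status" to `stats_by` gives the corresponding MIME-type view from the same records.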
Domains and Harvests and Seeds and Trees and Graphs
Researchers like to envisage the web as a directed graph in which web pages link to other web pages (via hyperlink, embedding, implication, redirect etc.). This link-graph concept is something of an oversimplification, because the actual linkage of the web is dynamic: the links you find in a browser or crawler depend entirely on your context (User-agent, time of day, login, etc.) and on how the webserver responds to that context (which may be completely arbitrary and will often include a random component).
However, any actual crawl of the web will nevertheless be a concrete realisation of some part of this theoretical link-graph. A web crawl in e.g. Heritrix is defined by:
- A set of seeds
- A set of rules for dealing with fetched content, including for example
  - how to extract links from content
  - how wide to harvest (e.g. when to include other domains)
  - how deep to harvest
  - how much to harvest
Any actual crawl is therefore not only a subgraph of the true link-graph, but is actually a tree structure (one for each seed). [I'm assuming we treat any URLs refetched during the harvest as new nodes in the tree.]
Extending Logtrix
Logtrix shows statistics for a harvest per domain. But as discussed above, we don't define harvests by domains but by seeds (which are URLs, not domain names or host names). So, at least if we are thinking like curators rather than researchers, we want to gather harvest statistics per seed, not per domain. The Heritrix crawl log does not explicitly specify the seed for each URL harvested, but the information is implicit: we can follow the chain of parent URLs from each URL until we reach a URL with no parent, which must be the corresponding seed.
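The parent-following idea can be sketched in a few lines. This is an illustration only, not the code in our Logtrix fork: the names `build_parent_map` and `find_seed` are made up, the example URLs are invented, and it assumes that a seed's parent field is logged as "-". For simplicity it also keeps only the first parent seen for each URL, whereas the discussion above treats refetched URLs as new nodes.

```python
def build_parent_map(rows):
    """Map each URL to its parent URL, or None if it has no parent.

    `rows` is an iterable of (url, parent) pairs taken from the crawl log.
    Assumption: a missing parent is written as "-".
    """
    parent_of = {}
    for url, parent in rows:
        # Keep the first parent seen for a URL (a simplification).
        parent_of.setdefault(url, None if parent == "-" else parent)
    return parent_of


def find_seed(url, parent_of):
    """Walk up the parent chain until a URL with no known parent is reached."""
    visited = set()
    while True:
        parent = parent_of.get(url)
        # Stop at a root, or bail out if the chain loops back on itself.
        if parent is None or parent in visited:
            return url
        visited.add(url)
        url = parent


# Invented example rows: (url, parent)
rows = [
    ("http://example.dk/", "-"),
    ("http://example.dk/about", "http://example.dk/"),
    ("http://cdn.example.net/logo.png", "http://example.dk/about"),
    ("dns:example.dk", "http://example.dk/"),
    ("http://other.dk/", "-"),
]

parent_of = build_parent_map(rows)
seed_of = {url: find_seed(url, parent_of) for url, _ in rows}
for url, seed in seed_of.items():
    print(url, "<-", seed)
```

On a large log it would be worth caching the seed found for each intermediate URL, since many URLs share the same chain of ancestors.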
This is not a trivial point! Let's take one of our large snapshot harvest jobs with over 8000 seeds and compare the biggest domains in the harvest with the biggest seeds:
Note that the largest domains in the crawl are places like facebook.com and cdn.simplesite.com. But these domains are not specified in any of the seeds. They typically represent embedded content included on a page. The largest seeds are a quite distinct set (and, of course, are URLs).
But here's a slight surprise. This particular crawl log was taken from a snapshot harvest with a 10MB per-seed crawl limit. So why are we seeing some seeds with much larger sizes? Typically what happens is that the crawler finds one very large file (usually a video) and downloads it, after which the crawl of that seed is stopped by its quota. How often does this happen? Let's look at some visual data. Here is a scatter plot of objects crawled versus bytes harvested for all the seeds in this crawl:
Note how the data mostly clusters under the 10MB line with just a few outliers lying well over it. We can also visualize this with a histogram of bytes harvested over the 8000 or so seeds in the harvest. Here are both normal and cumulative histograms.
There are a bunch of seeds clustered around zero bytes, then another large peak at around 10MB, then a small tail. (Note the logarithmic y-axis!) The cumulative graph tells the same story.
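The numbers behind plots like these are simple per-seed totals. The following sketch shows one way to compute them, along with the list of seeds that overshot their quota. It is an illustration under stated assumptions: the function names and input data are invented, 10MB is taken to mean 10,000,000 bytes, the 1MB bin width is an arbitrary choice, and the input is assumed to be (seed, bytes) pairs with the seed already resolved for every fetched URL.

```python
from collections import Counter, defaultdict

QUOTA_BYTES = 10_000_000  # assumption: the 10MB limit read as 10^7 bytes


def totals_per_seed(fetches):
    """Sum object counts and bytes for each seed from (seed, bytes) pairs."""
    totals = defaultdict(lambda: {"objects": 0, "bytes": 0})
    for seed, size in fetches:
        totals[seed]["objects"] += 1
        totals[seed]["bytes"] += size
    return dict(totals)


def over_quota(totals, quota=QUOTA_BYTES):
    """Return only the seeds whose harvested bytes exceed the quota."""
    return {seed: t for seed, t in totals.items() if t["bytes"] > quota}


def histogram(totals, bin_width=1_000_000):
    """Count seeds per bin; bin n covers n*bin_width up to (n+1)*bin_width - 1."""
    counts = Counter(t["bytes"] // bin_width for t in totals.values())
    return dict(sorted(counts.items()))


def cumulative(hist):
    """Running total of seeds up to and including each bin."""
    running, result = 0, {}
    for bin_index in sorted(hist):
        running += hist[bin_index]
        result[bin_index] = running
    return result


# Invented example: seed A picks up one large video and overshoots the quota.
fetches = [
    ("http://a.example/", 2_000_000),
    ("http://a.example/", 3_000_000),
    ("http://a.example/", 40_000_000),
    ("http://b.example/", 500_000),
    ("http://b.example/", 700_000),
    ("http://c.example/", 0),
]

totals = totals_per_seed(fetches)
outliers = over_quota(totals)
hist = histogram(totals)
print(outliers)
print(hist)
print(cumulative(hist))
```

The (objects, bytes) pairs in `totals` are the points of the scatter plot, and `hist` and `cumulative(hist)` correspond to the two histograms.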
Diving In
What we'd really like to know is which domains were harvested under which seeds. Fortunately Logtrix already contains examples of how to create grouping statistics. To group by seed, we just needed to add a new grouping function based on the seed extraction we had already implemented. What Logtrix did not have was a UI for visualising this grouped structure, so we built one using standard jQuery components.
There's a lot going on here. By clicking on a seed, we open up a new table showing the domains harvested as part of that seed. DNS lookups are treated as a separate domain. The URL counts for all the domains should add up to the URL count for the corresponding seed, and the percentages to 100%. The two duplicate-related columns show that all 4459 duplicates found came from the domain ruedinger.dk. That is to say, of the 5.15% of the total data which was duplicated, 100% came from that domain.
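The seed-then-domain grouping behind such a table can be sketched as a nested dictionary. Again this is a hedged illustration rather than the Logtrix implementation: the function names and data are invented, the host name stands in for the domain, and all DNS lookups are lumped under a single "dns" label.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(url):
    """Return a grouping label for a URL (simplified: the host name)."""
    parsed = urlparse(url)
    if parsed.scheme == "dns":
        return "dns"  # DNS lookups are treated as their own "domain"
    return parsed.hostname or "unknown"


def group_by_seed_and_domain(records):
    """Build {seed: {domain: {"urls": n, "bytes": b}}} from (seed, url, bytes)."""
    grouped = defaultdict(lambda: defaultdict(lambda: {"urls": 0, "bytes": 0}))
    for seed, url, size in records:
        cell = grouped[seed][domain_of(url)]
        cell["urls"] += 1
        cell["bytes"] += size
    return grouped


def with_percentages(domains):
    """Add each domain's share of the seed's URLs and bytes."""
    total_urls = sum(d["urls"] for d in domains.values())
    total_bytes = sum(d["bytes"] for d in domains.values())
    rows = {}
    for name, d in domains.items():
        rows[name] = {
            **d,
            "pct_urls": 100 * d["urls"] / total_urls,
            "pct_bytes": 100 * d["bytes"] / total_bytes if total_bytes else 0.0,
        }
    return rows


# Invented example records: (seed, url, bytes)
SEED = "http://example.dk/"
records = [
    (SEED, "http://example.dk/", 1000),
    (SEED, "http://example.dk/about", 3000),
    (SEED, "http://cdn.example.net/v.mp4", 5900),
    (SEED, "dns:example.dk", 100),
]

grouped = group_by_seed_and_domain(records)
table = with_percentages(grouped[SEED])
for name, row in table.items():
    print(name, row)
```

Because every URL lands in exactly one domain row, the per-domain URL counts sum to the seed's total and the percentage columns sum to 100, which is a useful sanity check on the grouping.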
Finally here is an example related to an actual issue which has been bothering us:
Here we see a site where the seed harvest has reached its byte limit of around 10MB, but the main corresponding domain has only been harvested for 5.3MB. With this tool, we can very quickly (because jQuery is quick) check that the remaining 4MB is contained in just 14 objects from another domain, something we might otherwise have found difficult to check.
Finally it should be noted that most of the effort in this project was spent in relearning Java 8 syntax and jQuery, and getting to know plotly.js for the first time. Take a look at the source code at https://github.com/csrster/logtrix to see exactly what was learned.