...
At a recent international meeting, a group of leading web-archiving practitioners were asked "How do you do QA for your archive?". The embarrassed silence was deafening. This is not because webarchivers are unprofessional, but because the tools available to them have not kept up with the changing nature of both webarchiving and the web. For example, Netarchivesuite's Viewerproxy doesn't function at all with https sites.
...
Each line represents a URL handled by heritrix and includes such information as the payload response-size, the http return code, the mime-type, and the parent URL (if any) from which this URL derived. Logtrix consists of:
- A command line tool to compile statistics from a crawl log into a JSON structure, for example:

```json
{
  "totals" : {
    "count" : 477826,
    "bytes" : 11429071824,
    "millis" : 3238610536,
    "uniqueCount" : 357799,
    "uniqueBytes" : 4961357812,
    "uniqueMillis" : 2837500967,
    "firstTime" : "2019-02-01T23:30:23.904Z",
    "lastTime" : "2019-02-03T00:36:46.328Z"
  },
  "statusCodes" : {
    "-1" : {
      "count" : 168,
      "bytes" : 0,
      "millis" : 166298,
      "uniqueCount" : 168,
      "uniqueBytes" : 0,
      "uniqueMillis" : 166298,
      "firstTime" : "2019-02-01T23:31:25.925Z",
      "lastTime" : "2019-02-02T18:59:39.321Z",
      "description" : "DNS lookup failed"
    },
    "-2" : {
      "count" : 176,
      "bytes" : 0,
      "millis" : 0,
      .....
```

- A simple web ui to view the results in tabular form.
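To make the shape of this aggregation concrete, here is a minimal sketch in Python (not the Logtrix implementation, which is Java). It assumes the usual whitespace-separated crawl-log layout in which the first seven fields are timestamp, status code, size, URL, discovery path, parent URL and mime-type. The function names and the sample lines are hypothetical, and only `count` and `bytes` are aggregated.

```python
import json
from collections import defaultdict

def parse_line(line):
    """Split one crawl-log line into the fields used below (assumed field order)."""
    f = line.split()
    return {
        "time": f[0],
        "status": f[1],
        "bytes": int(f[2]) if f[2].isdigit() else 0,  # "-" when nothing was fetched
        "url": f[3],
        "parent": None if f[5] == "-" else f[5],
        "mime": f[6],
    }

def compile_stats(lines):
    """Aggregate count and bytes overall and per status code."""
    totals = {"count": 0, "bytes": 0}
    by_status = defaultdict(lambda: {"count": 0, "bytes": 0})
    for line in lines:
        if not line.strip():
            continue
        rec = parse_line(line)
        for bucket in (totals, by_status[rec["status"]]):
            bucket["count"] += 1
            bucket["bytes"] += rec["bytes"]
    return {"totals": totals, "statusCodes": dict(by_status)}

# Hypothetical sample lines, invented for illustration only.
SAMPLE = [
    "2019-02-01T23:30:23.904Z 200 1500 http://example.dk/ - - text/html #001 20190201233023100+804 sha1:AAA - -",
    "2019-02-01T23:30:25.010Z 200 500 http://example.dk/a.css E http://example.dk/ text/css #002 20190201233024900+110 sha1:BBB - -",
    "2019-02-01T23:31:25.925Z -1 - dns:missing.example P http://example.dk/ unknown #003 - - - -",
]

if __name__ == "__main__":
    print(json.dumps(compile_stats(SAMPLE), indent=2))
```

The same loop extends naturally to other groupings (mime-type, domain) by swapping the key used to pick the bucket.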
...
Researchers like to envisage the web as a directed graph in which web-pages link to (via hyperlinking, embedding, implication, redirection etc.) other web-pages. This link-graph concept is something of an oversimplification because the actual linkage of the web is dynamic: the links you find in a browser or crawler depend entirely on your context (User-agent, time-of-day, login, etc.) and on how the webserver responds to that context (which may be completely arbitrary and will often include a random component).
However, any actual crawl of the web will nevertheless be a concrete realisation of some part of this theoretical link-graph. A web-crawl in e.g. heritrix is defined by
- A set of seed URLs
- A set of rules for dealing with fetched content, including for example:
  - how to extract links from content
  - how wide to harvest (e.g. when to include other domains)
  - how deep to harvest
  - how much to harvest
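As a toy illustration of how seeds plus rules pick out one concrete part of the link-graph, the sketch below walks a small in-memory graph breadth-first. It is not how heritrix works internally: the graph, the function name `crawl` and its parameters are all hypothetical, and the three limits are simplified stand-ins for "how deep", "how wide" and "how much".

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: url -> (payload size in bytes, outgoing links)
GRAPH = {
    "http://a.dk/": (100, ["http://a.dk/1", "http://b.dk/"]),
    "http://a.dk/1": (200, ["http://a.dk/2"]),
    "http://a.dk/2": (300, []),
    "http://b.dk/": (400, ["http://b.dk/x"]),
    "http://b.dk/x": (500, []),
}

def crawl(seeds, max_depth, max_bytes, allow_other_domains):
    """Breadth-first walk of GRAPH from each seed, bounded by depth, bytes and scope."""
    harvested = {}  # url -> seed it was harvested under
    for seed in seeds:
        seed_host = urlsplit(seed).hostname
        budget = max_bytes  # "how much": per-seed byte budget
        queue = deque([(seed, 0)])
        while queue and budget > 0:
            url, depth = queue.popleft()
            if url in harvested or url not in GRAPH:
                continue
            if not allow_other_domains and urlsplit(url).hostname != seed_host:
                continue  # "how wide": stay on the seed's host
            size, links = GRAPH[url]
            budget -= size  # the last fetch may overshoot the budget slightly
            harvested[url] = seed
            if depth < max_depth:  # "how deep"
                queue.extend((link, depth + 1) for link in links)
    return harvested
```

Calling `crawl` on the same seed with different limits returns different subsets of `GRAPH`, which is the sense in which each crawl is only one realisation of the underlying graph.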
...
This is not a trivial point! Let's take one of our large snapshot harvest jobs with over 2000 seeds and compare the biggest domains in the harvest with the biggest seeds:
...
Note how the data mostly clusters under the 10MB line with just a few outliers lying well over it, and that these outliers are not associated with especially large numbers of objects. We can also visualize this with a histogram of bytes harvested over the 2000 or so seeds in the harvest. Here are both normal and cumulative histograms.
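For readers who want to reproduce this kind of plot from their own numbers, the two histograms can be computed along these lines. This is a sketch only: the per-seed byte values, the bin width and the function name are made up for illustration, and the actual charts were drawn with plotly.js.

```python
from itertools import accumulate

def histogram(values, bin_width, n_bins):
    """Count values per fixed-width bin; the last bin also collects overflow."""
    counts = [0] * n_bins
    for v in values:
        counts[min(v // bin_width, n_bins - 1)] += 1
    return counts

MB = 1_000_000
# Hypothetical bytes harvested per seed
bytes_per_seed = [200_000, 1_500_000, 3_000_000, 9_800_000,
                  9_900_000, 10_100_000, 42_000_000]

normal = histogram(bytes_per_seed, bin_width=2 * MB, n_bins=6)
cumulative = list(accumulate(normal))  # running total of the normal histogram
```

The cumulative form makes it easy to read off how many seeds fall below a given size, such as the 10MB line.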
...
What we'd really like to know is which domains were harvested under which seeds. Fortunately Logtrix already contains examples of how to create grouping statistics. To group by seed, we just needed to add a new grouping function based on the seed-extraction we had already implemented. What Logtrix did not have was a ui for visualising this grouped structure, so we built one using standard jQuery components and a "borrowed" stylesheet.
There's a lot going on here. By clicking on a seed, we open up a new table showing the domains harvested as part of that seed. There are still some things to be ironed out about what is and isn't a domain, so "ruedinger.dk" and "www.ruedinger.dk" both appear. DNS lookups are treated as a separate domain. The URLs for all the domains should add up to the URLs for the corresponding seed, and the percentages to 100%. The two duplicate-related columns show that all 4459 duplicates found came from the domain ruedinger.dk. That is to say, of the 5.15% of the total data which was duplicated, 100% came from that domain.
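The seed-then-domain grouping described above can be sketched as a nested dictionary. This is an illustration in Python rather than the actual Java grouping function: it assumes each crawl-log record has already been tagged with its seed, and the record layout, function names and byte sizes are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def group_by_seed_and_domain(records):
    """records: iterable of (seed, url, bytes) -> {seed: {domain: {"urls", "bytes"}}}."""
    groups = defaultdict(lambda: defaultdict(lambda: {"urls": 0, "bytes": 0}))
    for seed, url, size in records:
        parts = urlsplit(url)
        # DNS lookups get their own pseudo-domain, as in the table described above
        domain = "dns" if parts.scheme == "dns" else parts.hostname
        cell = groups[seed][domain]
        cell["urls"] += 1
        cell["bytes"] += size
    return groups

def url_percentages(domains):
    """Share of a seed's URLs contributed by each domain, in percent."""
    total = sum(d["urls"] for d in domains.values())
    return {name: 100.0 * d["urls"] / total for name, d in domains.items()}

SEED = "http://ruedinger.dk/"
# Hypothetical records; sizes are invented
RECORDS = [
    (SEED, "dns:ruedinger.dk", 60),
    (SEED, "http://ruedinger.dk/", 1000),
    (SEED, "http://www.ruedinger.dk/a", 3000),
    (SEED, "http://ruedinger.dk/b", 2000),
]

groups = group_by_seed_and_domain(RECORDS)
percentages = url_percentages(groups[SEED])
```

Because hostnames are used as-is, "ruedinger.dk" and "www.ruedinger.dk" end up as separate rows here too; merging them would need an explicit normalisation rule.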
Finally here is an example related to an actual issue which has been bothering us:
Here we see a site where the seed-harvest has reached its byte limit of around 10MB, but the main corresponding domain has only been harvested for 5.3MB. With this tool, we can very quickly (because jQuery is quick) check that the remaining 4MB is contained in just 14 objects from another domain, something we might otherwise have found difficult to check.
Finally it should be noted that most of the effort in this project was spent in relearning Java 8 syntax and jQuery, and getting to know plotly.js for the first time. Take a look at the source code at https://github.com/netarchivesuite/logtrix to see exactly what was learned.