...

At a recent international meeting, a group of leading web-archiving practitioners was asked "How do you do QA for your archive?". The embarrassed silence was deafening. This is not because webarchivers are unprofessional, but because the tools available to them have not kept up with the changing nature of both webarchiving and the web. For example, Netarchivesuite's Viewerproxy doesn't function at all with https sites.

...

Each line represents a URL handled by Heritrix and includes such information as the payload response-size, the http return code, the mime-type, and the parent URL (if any) from which this URL was derived. Logtrix consists of

  1. A command line tool to compile statistics from a crawl log into a JSON structure such as:

    {
      "totals" : {
        "count" : 477826,
        "bytes" : 11429071824,
        "millis" : 3238610536,
        "uniqueCount" : 357799,
        "uniqueBytes" : 4961357812,
        "uniqueMillis" : 2837500967,
        "firstTime" : "2019-02-01T23:30:23.904Z",
        "lastTime" : "2019-02-03T00:36:46.328Z"
      },
      "statusCodes" : {
        "-1" : {
          "count" : 168,
          "bytes" : 0,
          "millis" : 166298,
          "uniqueCount" : 168,
          "uniqueBytes" : 0,
          "uniqueMillis" : 166298,
          "firstTime" : "2019-02-01T23:31:25.925Z",
          "lastTime" : "2019-02-02T18:59:39.321Z",
          "description" : "DNS lookup failed"
        },
        "-2" : {
          "count" : 176,
          "bytes" : 0,
          "millis" : 0,
    .....


  2. A simple web UI to view the results in tabular form.
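To make the first step concrete, here is a minimal Python sketch of compiling per-status-code statistics from crawl-log lines. It is not the Logtrix implementation (which is written in Java). The sample lines are made up, the whitespace-separated field order (timestamp, status, size, URL, discovery path, parent URL, mime-type, ...) is an assumption based on the usual Heritrix 3 crawl.log layout, and only the `count` and `bytes` keys from the JSON above are reproduced.

```python
import json
from collections import defaultdict

# Made-up sample lines; the field order is an assumption (see lead-in).
SAMPLE_LOG = """\
2019-02-01T23:30:23.904Z   200   5120 http://example.dk/ - - text/html #001 20190201233023100+804 sha1:AAAA http://example.dk/ -
2019-02-01T23:30:25.010Z   200  20480 http://example.dk/logo.png E http://example.dk/ image/png #002 20190201233024500+510 sha1:BBBB http://example.dk/ -
2019-02-01T23:31:25.925Z    -1      - dns:missing.example.dk P http://example.dk/ unknown #003 - - http://example.dk/ -
"""


def parse_line(line):
    """Split one crawl-log line into the handful of fields used here."""
    f = line.split()
    return {
        "time": f[0],
        "status": f[1],
        "bytes": 0 if f[2] == "-" else int(f[2]),  # "-" means no payload
        "url": f[3],
        "parent": None if f[5] == "-" else f[5],
        "mime": f[6],
    }


def compile_stats(lines):
    """Accumulate count and bytes overall and per status code."""
    totals = {"count": 0, "bytes": 0}
    by_status = defaultdict(lambda: {"count": 0, "bytes": 0})
    for line in lines:
        if not line.strip():
            continue
        rec = parse_line(line)
        for bucket in (totals, by_status[rec["status"]]):
            bucket["count"] += 1
            bucket["bytes"] += rec["bytes"]
    return {"totals": totals, "statusCodes": dict(by_status)}


stats = compile_stats(SAMPLE_LOG.splitlines())
print(json.dumps(stats, indent=2))
```

Keeping the status code as a string key mirrors the JSON structure, where "-1" and "-2" appear as object keys.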

...

Researchers like to envisage the web as a directed graph in which web-pages link to other web-pages (via hyperlinking, embedding, implication, redirection etc.). This link-graph concept is somewhat of an oversimplification because the actual linkage of the web is dynamic: the links you find in a browser or crawler depend entirely on your context (User-agent, time-of-day, login, etc.) and on how the webserver responds to that context (which may be completely arbitrary and will often include a random component).

However, any actual crawl of the web will nevertheless be a concrete realisation of some part of this theoretical link-graph. A web-crawl in e.g. Heritrix is defined by

  1. A set of seed URLs
  2. A set of rules for dealing with fetched content, including for example
    1. how to extract links from content
    2. how wide to harvest (e.g. when to include other domains)
    3. how deep to harvest
    4. how much to harvest
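The way a set of seeds plus a set of rules picks out one concrete part of the link-graph can be illustrated with a toy breadth-first crawl. This is only a sketch: the `WEB` dictionary is a hypothetical, static link-graph, and the parameter names `max_depth`, `max_bytes` and `allowed_hosts` are invented stand-ins for "how deep", "how much" and "how wide". They are not Heritrix settings.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph: URL -> (size in bytes, outgoing links).
WEB = {
    "http://a.example/": (1000, ["http://a.example/p1", "http://cdn.example/lib.js"]),
    "http://a.example/p1": (2000, ["http://a.example/p2", "http://other.example/"]),
    "http://a.example/p2": (3000, []),
    "http://cdn.example/lib.js": (500, []),
    "http://other.example/": (4000, []),
}


def crawl(seed, max_depth, max_bytes, allowed_hosts):
    """Breadth-first crawl from one seed under depth, width and size rules."""
    seen = {seed}
    queue = deque([(seed, 0)])
    fetched, total = [], 0
    # The byte budget is only checked between fetches, so the last
    # object fetched can push the total over the limit.
    while queue and total < max_bytes:
        url, depth = queue.popleft()
        size, links = WEB[url]
        fetched.append(url)
        total += size
        if depth == max_depth:
            continue  # "how deep": do not follow links any further
        for link in links:
            # "how wide": only follow links to allowed hosts
            if link not in seen and urlsplit(link).hostname in allowed_hosts:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched, total


hosts = {"a.example", "cdn.example"}
print(crawl("http://a.example/", max_depth=1, max_bytes=10_000, allowed_hosts=hosts))
print(crawl("http://a.example/", max_depth=5, max_bytes=2_500, allowed_hosts=hosts))
```

Changing any one rule changes which nodes of the same graph end up in the harvest, which is the sense in which a crawl is one realisation among many.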

...

This is not a trivial point! Let's take one of our large snapshot harvest jobs with over 2000 seeds and compare the biggest domains in the harvest with the biggest seeds:

...

Note that the largest domains in the crawl are places like facebook.com and cdn.simplesite.com. But these domains are not specified in any of the seeds. They typically represent embedded content included on a page. The largest seeds are a quite distinct set (and, of course, are URLs).
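The difference between the two rankings comes down to which key the bytes are summed under. A small sketch, using made-up records and assuming each crawled URL is already tagged with the seed it was harvested under (how that tag is obtained is not shown here). The `domain_of` helper is a deliberate simplification that keeps the last two labels of the hostname rather than consulting a public-suffix list.

```python
from collections import Counter
from urllib.parse import urlsplit

# Made-up records: (url, bytes, seed the url was harvested under).
RECORDS = [
    ("http://www.shop.example/", 4000, "http://www.shop.example/"),
    ("http://static.cdn.example/a.js", 9000, "http://www.shop.example/"),
    ("http://blog.example/", 3000, "http://blog.example/"),
    ("http://static.cdn.example/b.js", 8000, "http://blog.example/"),
]


def domain_of(url):
    """Crude domain extraction: last two labels of the hostname."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


by_domain, by_seed = Counter(), Counter()
for url, size, seed in RECORDS:
    by_domain[domain_of(url)] += size  # grouped by where the bytes came from
    by_seed[seed] += size              # grouped by why they were fetched

print("largest domains:", by_domain.most_common(2))
print("largest seeds:  ", by_seed.most_common(2))
```

In this toy data the largest domain, cdn.example, is not the domain of any seed, which mirrors the facebook.com case above.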

But here's a slight surprise. This particular crawl log was taken from a snapshot harvest with a 10MB per-seed crawl limit. So why are we seeing some domains with much larger sizes? Typically, the crawler finds one very large file (usually a video) and downloads it, after which the crawl of that seed is stopped by its quota. How often does this happen? Let's look at some visual data. Here is a scatter plot of objects crawled versus bytes harvested for all the seeds in this crawl:

[Image: scatter plot of objects crawled versus bytes harvested per seed]

Note how the data mostly clusters under the 10MB line with just a few outliers lying well over it, and that these outliers are not associated with especially large numbers of objects. We can also visualize this with a histogram of bytes harvested over the 2000 or so seeds in the harvest. Here are both normal and cumulative histograms.

[Images: normal and cumulative histograms of bytes harvested per seed]

There are a bunch of seeds clustered around zero bytes, then another large peak at around 10MB, then a small tail. (Note the logarithmic y-axis!) The cumulative graph tells the same story.
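For readers who want to reproduce this kind of view from their own per-seed totals, here is a minimal sketch of the binning behind a normal and a cumulative histogram. The byte totals are invented for illustration and the bin width is an arbitrary choice. The actual plots were made with plotly.js, which is not used here.

```python
from itertools import accumulate

MB = 1_000_000

# Invented per-seed byte totals: some near zero, a cluster around the
# 10MB limit, and one outlier caused by a single very large file.
bytes_per_seed = [0, 0, 120_000, 9_900_000, 10_000_000,
                  10_050_000, 10_100_000, 48_000_000]


def histogram(values, bin_width, n_bins):
    """Count values per fixed-width bin; the last bin collects the tail."""
    counts = [0] * n_bins
    for v in values:
        counts[min(v // bin_width, n_bins - 1)] += 1
    return counts


counts = histogram(bytes_per_seed, bin_width=5 * MB, n_bins=4)
cumulative = list(accumulate(counts))  # running total of seeds per bin

for i, (c, cum) in enumerate(zip(counts, cumulative)):
    label = f"{i * 5:>2}MB+"
    print(f"{label} {'#' * c:<4} count={c} cumulative={cum}")
```

With real data spanning several orders of magnitude in seed counts, a logarithmic y-axis (as in the plots above) keeps the small tail visible next to the large peaks.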

Diving In

What we'd really like to know is which domains were harvested under which seeds. Fortunately Logtrix already contains examples of how to create grouping statistics. To group by seed, we just needed to add a new grouping function based on the seed-extraction we had already implemented. What Logtrix did not have was a UI for visualising this grouped structure, so we built one using standard jQuery components and a "borrowed" stylesheet.

[Image: Logtrix UI showing statistics grouped by seed, with one seed expanded into its domains]

There's a lot going on here. By clicking on a seed, we open up a new table showing the domains harvested as part of that seed. DNS lookups are treated as a separate domain. The URL counts for all the domains should add up to the URL count for the corresponding seed, and the percentages to 100. The two duplicate-related columns show that all 4459 duplicates found came from the domain ruedinger.dk. That is to say, of the 5.15% of the total data which was duplicated, 100% came from that domain.
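The two-level grouping behind such a table can be sketched as follows. This is an illustration in Python rather than the Logtrix Java code. The records are made up, and the key names `urlPercent` and `duplicatePercent` are hypothetical rather than taken from the Logtrix JSON output.

```python
from collections import defaultdict

# Made-up records: (seed, domain, was this URL a duplicate?).
# DNS lookups get their own pseudo-domain, as in the table above.
RECORDS = [
    ("http://site.example/", "site.example", False),
    ("http://site.example/", "site.example", True),
    ("http://site.example/", "site.example", True),
    ("http://site.example/", "cdn.example", False),
    ("http://site.example/", "dns:", False),
]


def group_by_seed_and_domain(records):
    """Build seed -> domain -> counters."""
    groups = defaultdict(lambda: defaultdict(lambda: {"urls": 0, "duplicates": 0}))
    for seed, domain, is_duplicate in records:
        cell = groups[seed][domain]
        cell["urls"] += 1
        cell["duplicates"] += int(is_duplicate)
    return groups


def domain_table(domains):
    """Rows for one seed, largest domain first, with percentage columns."""
    total_urls = sum(d["urls"] for d in domains.values())
    total_dups = sum(d["duplicates"] for d in domains.values())
    rows = []
    for name, d in sorted(domains.items(), key=lambda kv: -kv[1]["urls"]):
        rows.append({
            "domain": name,
            "urls": d["urls"],
            "urlPercent": 100 * d["urls"] / total_urls,
            # Share of this seed's duplicates that came from this domain.
            "duplicatePercent": 100 * d["duplicates"] / total_dups if total_dups else 0.0,
        })
    return rows


groups = group_by_seed_and_domain(RECORDS)
rows = domain_table(groups["http://site.example/"])
for row in rows:
    print(row)
```

Because every URL lands in exactly one domain row under its seed, the `urls` column sums to the seed's URL count and `urlPercent` sums to 100, which gives a cheap consistency check on the grouping.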

Finally here is an example related to an actual issue which has been bothering us:

[Image: seed view for a site whose harvest reached its byte limit]

Here we see a site where the seed-harvest has reached its byte limit of around 10MB, but the main corresponding domain has only been harvested for 5.3MB. With this tool, we can very quickly (because jQuery is quick) check that the remaining 4MB is contained in just 14 objects from another domain, something we might otherwise have found difficult to check.

Finally it should be noted that most of the effort in this project was spent in relearning Java 8 syntax and jQuery, and getting to know plotly.js for the first time. Take a look at the source code at https://github.com/netarchivesuite/logtrix to see exactly what was learned.