...
Note that the largest domains in the crawl are places like facebook.com and cdn.simplesite.com. But these domains are not specified in any of the seeds; they typically represent embedded content included on other pages. The largest seeds are quite a distinct set (and, of course, are URLs).
But here's a slight surprise. This particular crawl log was taken from a snapshot harvest with a 10MB per-seed crawl limit, so why are we seeing some domains with much larger sizes? Typically what happens is that the crawler finds one very large file (usually a video) and downloads it, after which the crawl of that seed is stopped by its quota. How often does this happen? Let's look at some visual data. Here is a scatter plot of objects-crawled versus bytes-harvested for all the seeds in this crawl.
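The quota-overshoot effect is easy to reproduce in miniature. Here is a hedged sketch of the per-seed byte aggregation, using invented records of the form `(seed, url, bytes)`; a real crawl log (e.g. Heritrix's `crawl.log`) carries many more fields, and the seed attribution would come from whatever seed-extraction the analysis already performs.

```python
from collections import defaultdict

# Hypothetical, simplified log records: (seed, url, bytes_fetched).
# All figures are invented for illustration.
records = [
    ("http://ruedinger.dk/", "http://ruedinger.dk/index.html", 40_000),
    ("http://ruedinger.dk/", "http://cdn.example.com/big.mp4", 25_000_000),
    ("http://smallsite.dk/", "http://smallsite.dk/", 1_200),
]

def bytes_per_seed(records):
    """Sum the bytes harvested under each seed."""
    totals = defaultdict(int)
    for seed, _url, size in records:
        totals[seed] += size
    return dict(totals)

totals = bytes_per_seed(records)

# A seed can end up well past a 10MB quota when one huge object
# (here a hypothetical video) is fetched before the quota stops the crawl.
over_limit = {seed: b for seed, b in totals.items() if b > 10_000_000}
print(over_limit)
```

The single 25MB fetch pushes its seed far past the limit even though every other fetch was tiny, which is exactly the pattern behind the outliers in the scatter plot.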
Note how the data mostly clusters under the 10MB line, with just a few outliers lying well above it. We can also visualize this with a histogram of bytes harvested over the 8000 or so seeds in the harvest. Here are both normal and cumulative histograms.
There is a cluster of seeds around zero bytes, then another large peak at around 10MB, then a small tail. (Note the logarithmic y-axis!) The cumulative graph tells the same story.
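The shape of those histograms can be sketched with a few lines of plain Python. The byte totals and bin edges below are invented; the point is only how the per-seed totals get bucketed and how the cumulative curve is derived from the same counts.

```python
# Invented per-seed byte totals standing in for the ~8000 real ones.
seed_bytes = [0, 0, 500, 3_000, 9_800_000, 10_000_000, 10_100_000, 48_000_000]

def histogram(values, edges):
    """Count values falling into the half-open bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Bins: near-zero, small, sub-quota, around the 10MB quota, and the tail.
edges = [0, 1_000, 1_000_000, 10_000_000, 11_000_000, float("inf")]
counts = histogram(seed_bytes, edges)

# The cumulative histogram is just a running sum of the same counts.
cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(running)

print(counts, cumulative)
```

With realistic data the first and fourth bins dominate, matching the two peaks described above; the logarithmic y-axis is what makes the small tail visible at all.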
Diving In
What we'd really like to know is which domains were harvested under which seeds. Fortunately, Libtrix already contains examples of how to create grouping statistics. To group by seed, we just needed to add a new grouping function based on the seed extraction we had already implemented. What Libtrix did not have was a UI for visualizing this grouped structure, so we built one using standard jQuery components.
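The grouped structure itself can be sketched as a nested aggregation: seed, then the domain of each URL harvested under it. The function and record names below are ours, not Libtrix's, and the records are invented.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical records: (seed, url, bytes_fetched).
records = [
    ("http://ruedinger.dk/", "http://ruedinger.dk/a.html", 4_000),
    ("http://ruedinger.dk/", "http://www.ruedinger.dk/b.html", 6_000),
    ("http://ruedinger.dk/", "http://cdn.simplesite.com/style.css", 2_000),
]

def group_by_seed_and_domain(records):
    """Bucket URL counts and bytes first by seed, then by the URL's host."""
    stats = defaultdict(lambda: defaultdict(lambda: {"urls": 0, "bytes": 0}))
    for seed, url, size in records:
        # Note: "ruedinger.dk" and "www.ruedinger.dk" stay distinct here,
        # the same wrinkle visible in the real tool.
        domain = urlparse(url).hostname
        stats[seed][domain]["urls"] += 1
        stats[seed][domain]["bytes"] += size
    return stats

stats = group_by_seed_and_domain(records)
for domain, s in stats["http://ruedinger.dk/"].items():
    print(domain, s["urls"], s["bytes"])
```

Expanding a seed in the UI corresponds to reading out the inner dictionary for that seed, one row per domain.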
There's a lot going on here. Clicking on a seed opens a new table showing the domains harvested as part of that seed. There are still some wrinkles to iron out about what is and isn't a domain, so "ruedinger.dk" and "www.ruedinger.dk" both appear, and DNS lookups are treated as a separate domain. The URL counts for all the domains should add up to the URL count for the corresponding seed, and the percentages to 100%. The two duplicate-related columns show that all 4459 duplicates found came from the domain ruedinger.dk. That is to say, of the 5.15% of the total data which was duplicated, 100% came from that domain.
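Those sum-to-total properties are worth stating as invariants, since they are what makes the table trustworthy. A small sketch with invented figures:

```python
# Invented per-domain stats for one seed; the invariant, not the numbers,
# is the point: domain counts sum to the seed's totals, percentages to 100.
domains = {
    "ruedinger.dk":     {"urls": 900, "bytes": 9_000_000},
    "www.ruedinger.dk": {"urls": 80,  "bytes": 800_000},
    "dns:ruedinger.dk": {"urls": 1,   "bytes": 100},
}

total_urls = sum(d["urls"] for d in domains.values())
total_bytes = sum(d["bytes"] for d in domains.values())

# The percentage column is each domain's share of the seed's bytes.
percentages = {name: 100 * d["bytes"] / total_bytes for name, d in domains.items()}

# Sanity check: shares of a whole must sum to 100 (up to rounding).
assert abs(sum(percentages.values()) - 100) < 1e-9
print(total_urls, total_bytes)
```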
Finally, here is an example related to an actual issue which has been bothering us.
Here we see a site where the seed harvest has reached its byte limit of around 10MB, but the main corresponding domain accounts for only 5.3MB. With this tool, we can very quickly check that the remaining 4MB is contained in just 14 objects from another domain, something that would otherwise have been difficult to verify.
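Given the grouped per-domain stats for a seed, that check is a one-liner. The domain names below are hypothetical (the seed's actual domain isn't named above); the 5.3MB/4MB/14-object split mirrors the case described.

```python
# Hypothetical per-domain stats for one seed that hit its ~10MB quota.
seed_domain = "eksempel.dk"  # invented stand-in for the seed's own domain
per_domain = {
    "eksempel.dk":        {"urls": 1_200, "bytes": 5_300_000},
    "static.example.com": {"urls": 14,    "bytes": 4_000_000},
    "dns:eksempel.dk":    {"urls": 1,     "bytes": 80},
}

# Where did the bytes beyond the seed's own domain go?
others = {d: s for d, s in per_domain.items() if d != seed_domain}
biggest = max(others.items(), key=lambda kv: kv[1]["bytes"])
print(biggest)
```

Here the answer falls out immediately: a handful of large objects on one foreign domain consumed the rest of the quota.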