15-minute presentation with 5-minute Q&A
Topics: TOOLS & INFRASTRUCTURE
Keywords: validate, warc, wacz, jwat, warchaeology
Why, When and How to Validate WACZ/WARC Files in a Mixed Heritrix/Browser-Based Crawl Platform
Tue Hejlskov Larsen
Royal Danish Library, Denmark
The presentation looks into four questions: Why is it necessary to validate WACZ/WARC files?
Do we have the tools to validate both WARC and WACZ files, e.g. wacz, warchaeology and jwat?
When is it necessary to validate?
And how do these example tools perform from a usage point of view?
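To make the idea of validating a WACZ package concrete, here is a minimal sketch using only the Python standard library. It is not how wacz, warchaeology or jwat work internally; it is a hand-written stand-in that assumes the package is a ZIP file containing a `datapackage.json` whose `resources` entries carry a `path` and a `sha256:`-prefixed `hash`, as the WACZ format describes. The function name `check_wacz` is hypothetical, and the sketch does not validate the WARC records inside the package.

```python
import hashlib
import json
import zipfile


def check_wacz(path_or_file):
    """Return a list of problems found in a WACZ package (empty list = none found)."""
    try:
        zf = zipfile.ZipFile(path_or_file)
    except zipfile.BadZipFile:
        return ["not a readable ZIP file"]

    problems = []
    with zf:
        # testzip() returns the name of the first member with a bad CRC, or None.
        bad_member = zf.testzip()
        if bad_member is not None:
            return [f"CRC error in {bad_member}"]

        names = set(zf.namelist())
        if "datapackage.json" not in names:
            return ["datapackage.json is missing"]

        try:
            package = json.loads(zf.read("datapackage.json"))
        except ValueError:
            return ["datapackage.json is not valid JSON"]

        # Assumption: each resource entry has "path" and "hash": "sha256:<hex>".
        for resource in package.get("resources", []):
            path = resource.get("path")
            if path not in names:
                problems.append(f"listed resource is missing: {path}")
                continue
            algorithm, _, expected = resource.get("hash", "").partition(":")
            if algorithm != "sha256":
                problems.append(f"unsupported or missing hash for {path}")
                continue
            digest = hashlib.sha256()
            with zf.open(path) as member:
                # Read in 1 MiB chunks so large WARC files are not loaded whole.
                for chunk in iter(lambda: member.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected:
                problems.append(f"hash mismatch for {path}")
    return problems
```

A check like this could run right after a browser-based crawl writes its output, so that damaged packages are caught before ingest; the dedicated tools named in the abstract go considerably further.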
15-minute presentation with 5-minute Q&A
Topics: CURATION & COLLECTIONS
Keywords: automation, organising, curating domains, clustering
Automatic Clustering of Domains by Industry for Effective Curation
Thomas Smedebøl
Royal Danish Library, Denmark
When archiving 1.4 million .dk domains, we need practical tools to curate them in clusters.
We have observed that, for example, hairdressers, massage therapists, and physiotherapists often have crawler traps around their booking systems. The same applies to hotels.
Takeaway restaurants often have crawler traps around their ordering systems. Car dealerships have general issues with their used car databases.
Online shops tend to pose great difficulties around the sorting of the products on offer.
Each industry seems to have its own set of specifics that we should take into account when curating the archiving of its domains.
It would be useful to analyse and manage these domains by industry.
In Denmark, all companies have a CVR number that identifies them.
This number must be displayed on their website.
The central business register lists each company's industry. By scraping all domains for the company's CVR number, we can establish a connection between domains and industries. We can then quickly generate a list of domains belonging to museums, churches, dentists, water utilities, and every other industry in the register.
All it takes is good planning, a lot of scraping, access to the central business register's database, and a database to store the results.
By working with industries as a starting point, we can improve our insight, quickly manage large volumes, and spend focused time on special cases. We can also offer researchers a unique register of segmented domains.
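The linking step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's actual pipeline: it assumes the CVR number appears in the page text as eight digits shortly after the letters "CVR", and that a register extract is available as a simple mapping from CVR number to industry label. The names `find_cvr_numbers` and `domains_by_industry`, the regular expression, and all sample data are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: "CVR", then up to 15 non-digit characters (e.g. "-nr.: "),
# then eight digits that may be separated by single whitespace characters.
CVR_PATTERN = re.compile(r"CVR\D{0,15}?((?:\d\s?){7}\d)(?!\d)", re.IGNORECASE)


def find_cvr_numbers(page_text):
    """Return the set of CVR-like numbers found in a page, without whitespace."""
    return {re.sub(r"\s", "", match) for match in CVR_PATTERN.findall(page_text)}


def domains_by_industry(pages, register):
    """Group domains by industry.

    pages:    mapping of domain -> scraped page text
    register: mapping of CVR number -> industry label (hypothetical extract
              of the central business register)
    """
    clusters = defaultdict(set)
    for domain, text in pages.items():
        for cvr in find_cvr_numbers(text):
            industry = register.get(cvr)
            if industry is not None:
                clusters[industry].add(domain)
    return {industry: sorted(domains) for industry, domains in clusters.items()}


# Made-up example data; neither the domains nor the numbers are real.
pages = {
    "salon.example": "Kontakt os. CVR-nr.: 12345678",
    "tand.example": "CVR 87 65 43 21",
    "ukendt.example": "Ring til os paa 12345678",
}
register = {"12345678": "Hairdressers", "87654321": "Dentists"}
clusters = domains_by_industry(pages, register)
```

Domains where no CVR number is found, or where the number is not in the register, simply fall outside every cluster and would need separate handling.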
Doing What 'Is Not Humanly Possible'
Thomas Smedebøl
Royal Danish Library, Denmark
With four annual crawls of 1.4 million .dk domains, hands-on curation is necessary.
We manually adjust the byte limit for each domain that exceeds its current limit.
My predecessor called the task impossible! 'An endless job that you will never finish.'
I took on the challenge and, after some time, managed to get through all the domains between harvests, without getting a repetitive strain injury and without spending more time than absolutely necessary.
The result is better quality and reduced waste of storage.
Learn how I did it, and be inspired in your own work.