2017 Workshop Conclusion
Member updates
All NetarchiveSuite community members participated in the workshop: ONB (our host), KB (with the Copenhagen and Aarhus teams), BNE and BnF. We also welcomed a new member, the Royal Library of Sweden, which is restarting its Web Archiving Programme. Member updates showed that current efforts are mainly focused on improving access to our archives (CDX indexing, OpenWayback, full-text indexing, usage analysis, connection to other library catalogs) and strengthening our organisations (scheduling and sharing responsibilities, networking selection processes). All presentations are available on the agenda page: 2017 NAS workshop.
NetarchiveSuite developments
NetarchiveSuite 5 / Heritrix 3 is now used in production by almost all community members. A demo of the latest features and a discussion of the migration challenges gave BNE a clear picture of how to move from NAS4 to NAS5. We all warmly thanked (and thank again) the Netarkivet team for supporting the whole development effort and defining the first H3 template files, which we all used as a starting point. KB estimates that H3 integration took about two developer-years, plus one more for a technical expert to redefine all templates. BnF contributed to the NAS 5.3 release (mainly on the H3 Remote Access feature).
The next two releases were discussed:
NAS 5.3.1: https://kb-dk.atlassian.net/projects/NAS/versions/12945 (end of May, lead DK)
It should only be a bug-fix release to solve issues that Netarkivet encountered in the last broad crawl (job generation, H3 crawl log caching and the byte limit being reached with seedQueueAssignementPolicy). This release will be used to launch the next DK broad crawl in June. BnF will use it in July to run its first broad crawl tests and as a basis for 5.4 developments. The DK development team asked for help with the release tests.
NAS 5.4: https://kb-dk.atlassian.net/projects/NAS/versions/12944 (beginning of September, lead BnF)
This release will include fixes from BnF (marked as P1 in this document https://kb-dk.atlassian.net/wiki/download/attachments/38896730/BnF-NASbugsandfeatures-april2017.docx?version=1&modificationDate=1493134139845&api=v2).
Two features will be discussed within the community:
- How to improve the design of H3 job page (it contains too much information and is not easy to use)
- How to improve the H3 caching feature. The first points raised in the discussion were that there is no need to cache entire crawl logs from all running jobs: this is impossible in terms of performance and unnecessary, since only the latest lines are useful for making decisions. It would be useful to be able to configure the number of lines to extract.
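The idea raised in the discussion (keep only a configurable number of the latest crawl log lines rather than the whole log) can be sketched as follows. This is only an illustration in Python, not NetarchiveSuite code: the function name `tail_lines` and the parameter `max_lines` are hypothetical, and the sample log lines are invented placeholders.

```python
from collections import deque
from typing import Iterable, List


def tail_lines(lines: Iterable[str], max_lines: int) -> List[str]:
    """Return at most the last `max_lines` items of `lines`.

    `max_lines` stands in for the configurable "number of lines to
    extract" (hypothetical name). A bounded deque discards older lines
    as new ones arrive, so memory use stays proportional to `max_lines`
    instead of to the size of the whole crawl log.
    """
    if max_lines <= 0:
        return []
    return list(deque(lines, maxlen=max_lines))


# Placeholder lines standing in for crawl log entries.
sample_log = ["line 1", "line 2", "line 3", "line 4"]
latest = tail_lines(sample_log, 2)
```

Because the function accepts any iterable, an open file object could be passed in directly so that the log is streamed line by line rather than loaded at once.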
BnF will carry most of the development effort but will need participation from Nicholas to work on the H3 crawl log feature.
Procedures for contributing to the NAS code were discussed again to ease the integration and test processes at all levels. All community developers should work directly in the netarchivesuite repository. We should then create a staging branch from master (e.g. bnf-staging) to include the features and bug fixes described in Jira. Colin reminded us that there is documentation for newcomers: https://kb-dk.atlassian.net/wiki/display/NAS/Development. It is important to keep new code consistent with unit tests (access to Jenkins and how to use it will be discussed in the next development phase).
NAS 6.0 (no date, no lead)
We discussed what we would like to include in the next major release:
- introduction of a login system,
- structuring of crawl documentation and traceability of user actions,
- validation of new and existing seeds,
- easy management of TLDs.
Above all, the most important feature would be to get NAS to work with an additional crawler to improve the harvesting of JavaScript-heavy websites and to streamline the harvesting of video platforms (Netarkivet has successfully tested Internet Archive Brozzler: https://github.com/internetarchive/brozzler). There are several options for this:
- Option 1: make NAS completely modular and extensible to other crawlers, including ones that do not exist yet. This would require a complete refactoring of the NAS code (the introduction of H3 offered a basis but is clearly not enough) and the definition of crawler APIs (the WASAPI group is currently defining data transfer APIs: https://github.com/WASAPI-Community/data-transfer-apis but to our knowledge there is no existing crawler API).
- Option 2: get Umbra (https://github.com/internetarchive/umbra) or an Umbra-like messaging system to keep Heritrix as our main harvester which would use complementary tools (such as Chromium, PhantomJS, youtube-dl) to identify and extract complex URLs and feed them back to Heritrix.
Option 1 is much more ambitious and satisfactory, but option 2 would also require significant development. All members need to go back to their institutions to see whether resources could be allocated to this topic in 2018. In the meantime, members will keep testing new tools in small time slots and sharing the results.
Curator roadmap
NAS curators shared their experiences of keeping track of crawl documentation. There is a need to centralize and structure crawl documentation to make it usable by different user profiles (curators, crawl operators, researchers). Presentations are available on the agenda: 2017 NAS workshop. NAS curators have also started to review the curator roadmap to list and prioritize the features that need to be implemented in NetarchiveSuite: https://kb-dk.atlassian.net/projects/NASC/summary/statistics. This is a work in progress.
BCweb
BCweb has been in use at BnF to manage all focused crawls since 2012 (circa 40 000 URLs today). BNE has also been using it to share website selection with a regional library network. BNE has developed two new features: the possibility to add a title to the description and the possibility to pick a subject from a thesaurus when defining a harvest. BNE would like to submit these features to the main BCweb repository. BNE also plans to work on linking BCweb data (keywords, topics, collections) with OpenWayback.
Netarkivet has been testing BCweb in the past weeks (and made a bug fix to make it work). Netarkivet is considering using it to strengthen its selection team and increase its focused crawls. Sweden is also very interested in using BCweb to facilitate the organisation of its selections. BCweb is currently hosted on BnF's internal Git repository, but as the interest of the NAS community is growing, BnF is considering making it an open source project (political approval is underway; legal issues still need to be discussed with BnF lawyers).
Community next steps
In 2017, the community members, both developers and curators, should focus on the following:
- work collaboratively on the features and bug fixes included in NAS 5.3.1 and NAS 5.4 and participate in the release tests,
- keep updating the curator roadmap and make it a discussion tool for exchanging needs and ideas on the different features,
- discuss the direction of NAS regarding the integration of other crawlers,
- move forward in making BCweb an open source project,
- share documentation on our activities (e.g. BNE handbook for web curators, BnF annual harvest plan and Harvest naming conventions, ONB guide on using credentials). Documents can be shared directly by email with the workshop participants or using the mailing lists:
- netarchivesuite-curator: https://ml.sbforge.org/mailman/listinfo/netarchivesuite-curator
- netarchivesuite-devel: https://ml.sbforge.org/mailman/listinfo/netarchivesuite-devel