2015 Workshop Conclusion

Summary of H3 curator discussion

The discussions focused on the current NAS and Heritrix features used by curators to monitor and QA crawls.

- important figures for following crawl progress are included in the running jobs page. On the Heritrix console, curators also look at the job status, the number of active threads, the progress bar and the percentage.

- important Heritrix logs/reports used by curators are:

  • crawl-report: BnF
  • seed-report: BnF, KB, ONB
  • host-report: BnF, KB
  • source-report: BnF
  • mimetype-report: BnF, KB
  • responsecode-report: BnF
  • frontier-report: BnF, KB/SB, BNE, EST, ONB
  • crawl-log: BnF, KB/SB, BNE, EST, ONB
  • toe-threads-report: KB
  • order.xml: BnF; order.xml in NAS: KB

- important Heritrix features used by curators are:

  • search the crawl-log and the frontier using regular expressions: BnF, KB
  • view/search/delete URLs from the frontier using regular expressions: KB, BnF
  • see/add new filters to exclude URLs in the crawl settings and the frontier: BnF

- documentation on response codes and regular expressions is useful to KB curators.
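As an illustration of the crawl-log searches mentioned above, here is a minimal Python sketch that filters crawl-log lines by a regular expression on the URI and, optionally, by response code. It assumes the usual whitespace-separated Heritrix crawl.log layout (timestamp, status code, size, URI, discovery path, referrer, MIME type, ...). The function names and the sample lines are invented for the example and are not part of NAS or Heritrix.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional


class CrawlLogEntry(NamedTuple):
    timestamp: str
    status: int
    size: str
    uri: str
    mimetype: str


def parse_line(line: str) -> Optional[CrawlLogEntry]:
    """Parse one crawl-log line; return None if it does not look like an entry."""
    fields = line.split()
    if len(fields) < 7:
        return None
    try:
        status = int(fields[1])
    except ValueError:
        return None
    # Assumed field order: 0 timestamp, 1 status, 2 size, 3 URI,
    # 4 discovery path, 5 referrer, 6 MIME type.
    return CrawlLogEntry(fields[0], status, fields[2], fields[3], fields[6])


def search_crawl_log(
    lines: Iterable[str], uri_pattern: str, status: Optional[int] = None
) -> Iterator[CrawlLogEntry]:
    """Yield entries whose URI contains a match for uri_pattern."""
    regex = re.compile(uri_pattern)
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if status is not None and entry.status != status:
            continue
        if regex.search(entry.uri):
            yield entry


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    "2015-03-10T09:15:02.123Z 200 5120 http://example.org/ - - text/html #001 20150310091501900+200 sha1:AAAA - -",
    "2015-03-10T09:15:03.456Z 404 312 http://example.org/missing.png L http://example.org/ image/png #002 20150310091503200+150 sha1:BBBB - -",
    "2015-03-10T09:15:04.789Z 200 20480 http://example.org/calendar?month=13 L http://example.org/ text/html #003 20150310091504500+180 sha1:CCCC - -",
]

if __name__ == "__main__":
    for hit in search_crawl_log(SAMPLE_LOG, r"calendar\?"):
        print(hit.status, hit.uri)
```

In practice the lines would be read from an open crawl.log file rather than from a list; the generator accepts any iterable of strings.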

Summary of H3 coding discussion

Mikis and Soren presented the new code structure. The code base needs to be fixed before it can receive contributions. Soren and Nicholas will also analyse the structure further to see which parts could be developed by other institutions.

Summary of WARC discussion

ISO has opened a revision process, which makes it possible to propose adaptations. Tue wondered about the differences between WARC files produced with Archive-It and those produced with NAS. The conclusions are:

  • warcinfo record: harvest description produced by NAS is more structured, no changes needed.
  • request and metadata record: it should be possible to configure templates to generate request and/or metadata records.
  • revisit record: NAS should generate revisit records; deduplication information is currently stored only in log files. This is the most important point. There is a proposal within the IIPC to add the WARC-Target-URI and the WARC-Date of the previously harvested document to facilitate indexing or any other processing of this information. The proposal is officially supported by the IIPC, so these fields could be included in NAS from now on; the change of Heritrix version is a good opportunity to change the format.

There is nothing urgent for the 5.0 release, but the format has to be consistent with previous H1 releases.
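To make the revisit record point concrete, the following Python sketch builds the named header fields of such a record, including the URI and date of the previously harvested document. The notes do not name the new fields; the sketch assumes they correspond to WARC-Refers-To-Target-URI and WARC-Refers-To-Date, the names later standardised in WARC 1.1. The function name and sample values are hypothetical, and the mandatory Content-Length field and the record block are deliberately left out.

```python
import uuid
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Revisit profile URI defined for WARC 1.0 payload-digest deduplication.
REVISIT_PROFILE = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"


def build_revisit_headers(
    target_uri: str,
    payload_digest: str,
    original_uri: str,
    original_date: str,
    now: Optional[datetime] = None,
) -> List[Tuple[str, str]]:
    """Return the named fields of a revisit record as (name, value) pairs.

    original_uri and original_date describe the previously harvested
    document. Content-Length and the record block are omitted here.
    """
    now = now or datetime.now(timezone.utc)
    return [
        ("WARC-Type", "revisit"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", now.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("WARC-Payload-Digest", payload_digest),
        ("WARC-Profile", REVISIT_PROFILE),
        # Assumed field names for the proposal discussed above.
        ("WARC-Refers-To-Target-URI", original_uri),
        ("WARC-Refers-To-Date", original_date),
    ]


def render_fields(headers: List[Tuple[str, str]]) -> str:
    """Render the fields as CRLF-terminated 'Name: value' lines."""
    return "".join(f"{name}: {value}\r\n" for name, value in headers)


if __name__ == "__main__":
    fields = build_revisit_headers(
        target_uri="http://example.org/logo.png",
        payload_digest="sha1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
        original_uri="http://example.org/logo.png",
        original_date="2014-11-02T08:30:00Z",
    )
    print(render_fields(fields))
```

With these two fields in the record itself, an indexer can resolve a duplicate to its original capture without consulting the crawl logs.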

 

Community next steps

In 2015, the community members, both developers and curators, should focus on the integration and use of Heritrix 3. The following action plan was defined:

- Testing: start/keep on testing Heritrix 3 as a standalone application / starting now / KB/SB, EST, BnF

- Testing: run several release tests of NAS with Heritrix 3 / April-June / KB/SB, EST, BnF; ONB will contribute to tests in the second half of the year.

- Migrating the templates: developers need help from curators to migrate the H1 templates to the H3 format. This represents a large amount of work. The community needs to share advice on how to migrate efficiently, comparing the templates most commonly used in operations. Soren will send an example containing a list of placeholders for NAS; BnF will see if it fits with its current templates. BnF will also send out a reference table maintained by curators to document the differences between the different templates / April - June / KB/SB, BnF, EST, BNE

- New NAS features: the running jobs feature will be included in 5.0, developed by KB and code reviewed by BnF. Regarding the important Heritrix features currently identified by curators, BnF will further describe use cases, share them with the community for feedback and implement the following features as a minimal Heritrix UI add-on (included in the curator roadmap):

  • Possibility to view URLs in the frontier (NASC56)
  • Possibility to delete URLs from the frontier (NASC57)
  • Possibility to search the frontier with a regular expression (NASC58)
  • Possibility to view current filters/crawler traps (MatchesListRegExpDecideRule) (NASC59)
  • Possibility to add new filters/crawler traps in the job settings to exclude URLs from the crawl (NASC60)
  • Possibility to search the crawl log with a regular expression (NASC61)
  • Possibility to have a progress bar and percentage for the current job (this last feature can be postponed to the next release) (NASC62)
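
The frontier and filter features listed above (NASC57 to NASC60) can be pictured with a small Python sketch that applies a list of reject patterns to a list of queued URLs. This only illustrates the intended behaviour, not the add-on itself: the function names and sample URLs are invented, and the sketch assumes that each pattern must match the whole URL, which is how Heritrix regex decide rules are understood to work.

```python
import re
from typing import Iterable, List, Pattern, Tuple


def compile_filters(patterns: Iterable[str]) -> List[Pattern[str]]:
    """Compile the reject patterns once, before scanning the queue."""
    return [re.compile(p) for p in patterns]


def is_rejected(uri: str, filters: List[Pattern[str]]) -> bool:
    """True if any pattern matches the entire URI (assumed whole-string semantics)."""
    return any(f.fullmatch(uri) for f in filters)


def prune_frontier(
    queued: Iterable[str], patterns: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split queued URIs into (kept, removed) according to the reject patterns."""
    filters = compile_filters(patterns)
    kept: List[str] = []
    removed: List[str] = []
    for uri in queued:
        (removed if is_rejected(uri, filters) else kept).append(uri)
    return kept, removed


# Made-up queue and crawler-trap patterns, for illustration only.
SAMPLE_QUEUE = [
    "http://example.org/",
    "http://example.org/calendar?month=13",
    "http://example.org/news?sessionid=abc123",
    "http://example.org/calendar.html",
]
SAMPLE_TRAPS = [r".*/calendar\?.*", r".*[?&]sessionid=.*"]

if __name__ == "__main__":
    kept, removed = prune_frontier(SAMPLE_QUEUE, SAMPLE_TRAPS)
    print("kept:", kept)
    print("removed:", removed)
```

Under whole-string matching, a bare pattern such as `calendar` removes nothing; it has to be written as `.*calendar.*`. This is the kind of detail the regular expression documentation requested by curators could cover.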