Added 3 new link extractors (from the British Library) to Heritrix:
org.archive.modules.extractor.ExtractorRobotsTxt
org.archive.modules.extractor.ExtractorSitemap
org.archive.modules.extractor.ExtractorJson
Added caching of crawl logs when Hadoop is used for processing
Added caching of metadata-file indexes when Hadoop is used for processing
Added retry functionality to improve the robustness of the WarcRecordClient
Fixed a bug whereby files uploaded from a harvester were not deleted when the Bitmagasin backend was in use
Highlights in 7.0
NetarchiveSuite 7.0 introduces an entirely new backend storage and mass-processing implementation based on software from bitrepository.org and Hadoop. The new functionality is enabled by defining the following key in the settings file for all applications:
The older arcrepositoryClient implementation, dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient, will be deprecated in future releases. (The developers are unaware of any other organisations currently using the older client, but please contact us if you still rely on it.)