Added three new link extractors (from the British Library) to Heritrix:
org.archive.modules.extractor.ExtractorRobotsTxt
org.archive.modules.extractor.ExtractorSitemap
org.archive.modules.extractor.ExtractorJson
Added caching of crawl logs when Hadoop is used for processing
Added caching of metadata-file indexes when Hadoop is used for processing
Added retry functionality to improve the robustness of the WarcRecordClient
Fixed a bug whereby files uploaded from a harvester were not deleted when the Bitmagasin backend was in use
The new caching functionality stores its data in the directory specified by the setting settings.common.webinterface.metadata_cache_dir. The default value is "metadata_cache", resolved relative to the working directory from which the GUIApplication is started. At present this directory is not cleaned automatically.
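As an illustration, the cache directory could be pointed at an absolute location so that it does not depend on the working directory of the GUIApplication. The fragment below is a sketch only: it assumes the dotted setting name maps onto nested elements in the XML settings file, the path shown is hypothetical, and any namespace attributes on the root element of an existing settings file are omitted here.

```xml
<!-- Sketch: merge into the existing settings file of the GUIApplication. -->
<settings>
  <common>
    <webinterface>
      <!-- Hypothetical absolute path; the default is "metadata_cache". -->
      <metadata_cache_dir>/var/cache/netarchivesuite/metadata_cache</metadata_cache_dir>
    </webinterface>
  </common>
</settings>
```

Because nothing cleans the directory automatically, choosing a location on a volume with monitored free space may make manual clean-up easier to schedule.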
Highlights in 7.0
NetarchiveSuite 7.0 introduces an entirely new backend storage and mass-processing implementation based on software from bitrepository.org and Hadoop. The new functionality is enabled by defining the following key in the settings file for all applications:
The older arcrepositoryClient implementation dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient will be deprecated in future releases. (The developers are unaware of any other organisations currently using the older client, but please contact us if you still rely on it.)