7.0 Release Date: 2021-03-19

7.1 Release Date: 2021-07-06

Highlights in 7.1

Current Snapshot: 

https://sbforge.org/nexus/service/local/artifact/maven/redirect?r=snapshots&g=org.netarchivesuite&a=distribution&v=7.1-SNAPSHOT&e=zip

https://sbforge.org/nexus/service/local/artifact/maven/redirect?r=snapshots&g=org.netarchivesuite&a=heritrix3-bundler&v=7.1-SNAPSHOT&e=zip

  1. Fixed (after many years) NAS-2870, whereby all generated revisit-records had badly formatted WARC-Payload-Digest fields and were therefore invalid according to the WARC standard.
  2. Added three new link extractors (from the British Library) to Heritrix:
    • org.archive.modules.extractor.ExtractorRobotsTxt
    • org.archive.modules.extractor.ExtractorSitemap
    • org.archive.modules.extractor.ExtractorJson
    Note that ExtractorSitemap deviates slightly in functionality from the British Library version: it is considerably more lenient both in what it identifies as a sitemap and in which URLs it accepts in sitemaps.
  3. Added caching of crawl logs and metadata indexes when Hadoop is used for processing.
    1. The new caching functionality for crawl logs and metadata indexes stores data in a directory specified by the setting

      Code Block
      settings.common.webinterface.metadata_cache_dir

       whose default value is "metadata_cache" (relative to the current working directory where the GUIApplication is started). At present there is no automatic cleaning of this directory.

  4. Added retry functionality to improve the robustness of the WarcRecordClient
  5. Fixed a bug whereby files uploaded from a harvester were not being deleted when the Bitrepository backend is in use
  6. Added retry-handling to Bitrepository uploads via two new settings keys under settings.common.arcrepositoryClient.bitrepository

    Code Block
     <store_retries>3</store_retries>
     <retryWaitSeconds>1800</retryWaitSeconds>


  7. Added parameters to manage memory and core usage in Hadoop mapper-only jobs:

    Code Block
    settings.common.hadoop.mapred.mapMemoryMb
    settings.common.hadoop.mapred.mapMemoryCores


  8. Added support for uberized jobs, optimised for small tasks in Hadoop, via

    Code Block
    settings.common.hadoop.mapred.enableUbertask


  9. Added HDFS caching to Hadoop jobs. When this feature is enabled, any local files passed as input to a Hadoop job are first copied into HDFS and cached for future use. This should yield savings when the same file is processed multiple times, as is often the case for metadata files. The feature is controlled by the following parameters:

    Code Block
    settings.common.hadoop.mapred.hdfsCacheEnabled
    settings.common.hadoop.mapred.hdfsCacheDir
    settings.common.hadoop.mapred.hdfsCacheDays

    Note that if the cache is enabled but the "hdfsCacheDays" parameter is set to zero, files are still copied into HDFS before processing, but they are deleted and recopied each time they are used. This can be useful for benchmarking.

  10. Added parameters to determine which Hadoop MapReduce job queue is used for different jobs. Two queues are currently supported:

    Code Block
    settings.common.hadoop.mapred.queue.batch
    settings.common.hadoop.mapred.queue.interactive

    "Interactive" is used for jobs started by GUI operations and "batch" for all other jobs. By assigning these to different Hadoop queues, each with a non-zero minimum quota, one can ensure that interactive jobs do not have to wait indefinitely while batch jobs are being processed.
  11. Improved the performance of the GUI functionality associated with the button "Browse only relevant crawl-log lines for this domain".
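
The settings keys introduced in 7.1 can be gathered into a single settings-file fragment. The following is only a sketch: it assumes that each dotted key maps onto nested XML elements in the same way as the other settings examples in these notes. The values for metadata_cache_dir, store_retries and retryWaitSeconds are those quoted above. Every other value (memory, cores, the boolean switches, cache directory, cache lifetime, queue names) is a hypothetical placeholder, not a documented default.

```xml
<settings>
   <common>
      <webinterface>
         <!-- Default value quoted in item 3; relative to the GUIApplication working directory -->
         <metadata_cache_dir>metadata_cache</metadata_cache_dir>
      </webinterface>
      <arcrepositoryClient>
         <bitrepository>
            <!-- Values as shown in item 6 -->
            <store_retries>3</store_retries>
            <retryWaitSeconds>1800</retryWaitSeconds>
         </bitrepository>
      </arcrepositoryClient>
      <hadoop>
         <mapred>
            <!-- Hypothetical values: tune to the cluster -->
            <mapMemoryMb>4096</mapMemoryMb>
            <mapMemoryCores>2</mapMemoryCores>
            <!-- Assumed to be a boolean switch -->
            <enableUbertask>true</enableUbertask>
            <!-- Hypothetical cache location and lifetime -->
            <hdfsCacheEnabled>true</hdfsCacheEnabled>
            <hdfsCacheDir>/user/nas/hdfs_cache</hdfsCacheDir>
            <hdfsCacheDays>7</hdfsCacheDays>
            <queue>
               <!-- Hypothetical queue names; they would need to exist on the cluster -->
               <batch>nas_batch</batch>
               <interactive>nas_interactive</interactive>
            </queue>
         </mapred>
      </hadoop>
   </common>
</settings>
```

With the cache enabled, setting hdfsCacheDays to 0 in such a fragment would give the copy-and-recopy behaviour described in item 9, which can be useful for benchmarking.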

Highlights in 7.0

NetarchiveSuite 7.0 introduces an entirely new backend storage and mass-processing implementation based on software from bitrepository.org and Hadoop. The new functionality is enabled by defining the following key in the settings file for all applications:

Code Block
<settings>
   <common>
      <arcrepositoryClient>
         <class>dk.netarkivet.archive.arcrepository.distribute.BitmagArcRepositoryClient</class>
      </arcrepositoryClient>
   </common>
</settings>

and additionally 

Code Block
<settings>
   <common>
      <useBitmagHadoopBackend>true</useBitmagHadoopBackend>
   </common>
</settings>

The older arcrepositoryClient implementation dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient will be deprecated in future releases. (The developers are unaware of any other organisations currently using the older client, but please contact us if you still rely on it.)

The new architecture introduces many new keys and external configuration files. There is therefore a separate Guide To Configuring the NetarchiveSuite 7.0 Backend.

Upgrading From Previous NetarchiveSuite Releases

For those using either JMSArcRepositoryClient or LocalArcRepositoryClient there should be no special requirements to upgrade.

Issues Resolved in Release 7.0



Most-recent updates for 7.1: