NetarchiveSuite 7.x Release Notes
- 1 Highlights in 7.7
- 2 Highlights in 7.6
- 3 Highlights in 7.5
- 4 Highlights in 7.4.4
- 5 Highlights in 7.4.3
- 6 Highlights in 7.4.2
- 7 Highlights in 7.4.1
- 8 Highlights in 7.4
- 9 Highlights in 7.3
- 10 Highlights in 7.2
- 11 Highlights in 7.1
- 12 Highlights in 7.0
- 13 Upgrading From Previous NetarchiveSuite Releases
- 14 Issues Resolved in Release 7.0
7.0 Release Date: 2021-03-19
7.1 Release Date: 2021-07-06
7.2 Release Date: 2021-08-19
7.3 Release Date: 2022-01-31
7.4 Release Date: 2022-07-15
7.4.1 Release Date: 2022-08-08
7.4.2 Release Date: 2022-09-05
7.4.3 Release Date: 2022-10-03
7.4.4 Release Date: 2022-12-22
7.5 Release Date: 2023-09-07
7.6 Release Date: 2024-10-14
7.7 Release Date: 2025-03-31
Highlights in 7.7
Java Version
NetarchiveSuite 7.7 is now compiled with Java 17. When running with Java 17, the following arguments need to be added to the Java command line for all applications:
--add-exports java.base/sun.security.ssl=ALL-UNNAMED --add-exports jdk.naming.rmi/com.sun.jndi.rmi.registry=ALL-UNNAMED --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED --add-opens java.management/javax.management.openmbean=ALL-UNNAMED
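As a sketch, these arguments can be collected into a single variable in a launch script. The classpath and main class in the commented invocation are placeholders, not part of the release:

```shell
# The four module flags required when running NetarchiveSuite 7.7 under Java 17.
JAVA17_ARGS="--add-exports java.base/sun.security.ssl=ALL-UNNAMED \
  --add-exports jdk.naming.rmi/com.sun.jndi.rmi.registry=ALL-UNNAMED \
  --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
  --add-opens java.management/javax.management.openmbean=ALL-UNNAMED"

# Example invocation (placeholder classpath and main class; substitute your
# own deployment's values):
# java $JAVA17_ARGS -cp "lib/*" dk.netarkivet.common.webinterface.GUIApplication
```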
The quickstart project in netarchivesuite-docker-compose has also been upgraded to use a Java 17 base image.
Heritrix Dependency
NetarchiveSuite is now bundled with Heritrix 3.7.0. This represents a major upgrade to the crawling abilities of the software.
Limitations in 7.7
The hadoop processing infrastructure does not function in NAS 7.7 and is likely to be deprecated in future releases.
Javadoc for NAS 7.7 has not yet been released.
Highlights in 7.6
Heritrix Dependency
Version 7.6 of NetarchiveSuite represents a significant modernisation of the code base. In particular it fully incorporates all changes from the recent 2024-09-09 interim release of Heritrix (https://github.com/internetarchive/heritrix3/releases/tag/3.4.0-20240909). The actual heritrix dependency used in NetarchiveSuite is released from our own fork at https://github.com/netarchivesuite/heritrix3 but contains very little that is not also found in the internetarchive release (mostly some customisations to ExtractorSitemap).
CrawlRSS Dependency
In addition we have created a new release of crawlrss (https://github.com/netarchivesuite/crawlrss forked from https://github.com/Landsbokasafn/crawlrss) with an appropriately updated heritrix dependency.
NetarchiveSuite
NetarchiveSuite 7.6 contains a small number of improvements and bug fixes to the actual NetarchiveSuite code.
Netarchivesuite-Docker-Compose
The quickstart environment using netarchivesuite-docker-compose (https://github.com/netarchivesuite/netarchivesuite-docker-compose) has been modernised to use a newer base image, along with other minor improvements (for example to the default crawler beans used for harvesting). Note that there is still an issue with the proftpd container crashing on restart. We recommend that users delete the proftpd container before attempting to restart the netarchivesuite-docker-compose assembly.
Highlights in 7.5
Highlights in 7.4.4
Fixed an issue causing hadoop jobs to fail on extraction of very large crawl-logs.
Highlights in 7.4.3
Highlights in 7.4.2
A bugfix for a slightly mysterious intermittent failure to create deduplication indexes using hadoop
Default seedlists for new domains now include "https://" and URLs both with and without "www", so "foobar.dk" gets a seedlist containing:
http://foobar.dk
http://www.foobar.dk
https://foobar.dk
https://www.foobar.dk
Highlights in 7.4.1
This version fixes a bug in the implementation of NAS-2876 (Add compressed data size on All Running Jobs page).
Highlights in 7.4
The CrawlRSS module has been updated to be compatible with the current version of Heritrix. See the documentation on RSS Harvests.
Fixed various issues with caching in hdfs
Highlights in 7.3
Fixed a bug in the Bitmagasin logic used by WaybackIndexer to fetch all filenames
Made fetching of hadoop results from hdfs pipe directly to disk, thereby avoiding a potential OutOfMemory issue
Refactored the hadoop version of the CDX-indexing workflow
Added a number of upstream fixes to Heritrix, including one to fix unwanted behaviour when a URL redirects to a top-level domain
Added two new settings parameters to make FileResolver more robust in the event of server instability.
/**
 * Number of retries for the FileResolver if an empty result is obtained (0 = try only once). Default: 3.
 */
public static String FILE_RESOLVER_RETRIES = "settings.common.fileResolver.retries";

/**
 * Seconds to wait between retries. Default: 5.
 */
public static String FILE_RESOLVER_RETRY_WAIT = "settings.common.fileResolver.retrywaitSeconds";
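Assuming the usual mapping of dotted settings keys onto nested XML elements, these two parameters would appear in the settings file as follows (the values shown are the stated defaults):

```xml
<settings>
  <common>
    <fileResolver>
      <retries>3</retries>
      <retrywaitSeconds>5</retrywaitSeconds>
    </fileResolver>
  </common>
</settings>
```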
Highlights in 7.2
Fixed NAS-2868 (Automatic Heritrix Version Detection) and NAS-2864 (https://kb-dk.atlassian.net/browse/NAS-2864) so that the version of Heritrix reported in all archive and metadata files is correct and consistent.
Included all Heritrix patches up to the 2021-08-03 Interim Release, as well as a number of even more recent minor bugfixes. This upgrade includes, as a major new feature, the ExtractorChrome module, which enables browser-based harvesting directly within the Heritrix extractor chain. To enable browser-based harvesting, add a bean like this
<bean id="extractorChrome" class="org.archive.modules.extractor.ExtractorChrome">
  <property name="executable" value="/usr/bin/google-chrome"/>
</bean>
to the FetchChain of your crawler-beans, before the ExtractorHTTP element. Then make sure your harvest job runs on a machine where Chrome (or Chromium) is available at the specified executable path. You can use NetarchiveSuite's existing harvest-channel mappings functionality if only some of your harvesting machines are to be used for browser-based harvesting. Content harvested by the browser can be identified in the crawl log, as it will be annotated "browser".
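For orientation, the placement described above might look like this in an abbreviated FetchChain sketch (the surrounding refs follow the standard Heritrix profile and are elided here):

```xml
<!-- Abbreviated crawler-beans sketch; only the ordering of extractorChrome
     relative to extractorHttp matters here. -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- ... preselector, fetch processors, etc. ... -->
      <ref bean="extractorChrome"/>
      <ref bean="extractorHttp"/>
      <ref bean="extractorHtml"/>
      <!-- ... further extractors ... -->
    </list>
  </property>
</bean>
```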
ExtractorSitemap has been modified with two optional properties:
<bean id="extractorSitemap" class="org.archive.modules.extractor.ExtractorSitemap">
  <property name="urlPattern" value=".*sitemap.*\.xml.*"/>
  <property name="enableLenientExtraction" value="true" />
</bean>
If "urlPattern" is set, then any URL matching this pattern is assumed to be a sitemap. Otherwise ExtractorSitemap reverts to its default behaviour, whereby it checks the mime-type of every URL and sniffs the start of any XML URL to see whether it looks like a sitemap. If "enableLenientExtraction" is set to true, then every URL found in the sitemap is extracted. Otherwise the extractor omits any URLs which do not obey the scoping rules defined in the sitemap specification.
Highlights in 7.1
Fixed (after many years) NAS-2870 (Integrate the latest H3 IIPC community version 3.4.0-202010XX, with the fix for WARC-Payload-Digest on revisit records), whereby all generated revisit records had badly formatted WARC-Payload-Digest fields and were therefore invalid according to the WARC standard.
Added three new link extractors (from the British Library) to Heritrix:
org.archive.modules.extractor.ExtractorRobotsTxt
org.archive.modules.extractor.ExtractorSitemap
org.archive.modules.extractor.ExtractorJson
Note that ExtractorSitemap deviates slightly in functionality from the British Library version, in that it is considerably more lenient both in what it identifies as a sitemap and in what URLs it accepts in sitemaps.
Added caching of crawl logs and metadata-indexes when hadoop is used for processing
The new caching functionality for crawl logs and metadata indexes stores data in a directory specified by the setting
settings.common.webinterface.metadata_cache_dir
whose default value is "metadata_cache" (relative to the current working directory where the GUIApplication is started). At present there is no automatic cleaning of this directory.
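Since the cache directory is never cleaned automatically, operators may want to prune it themselves. A minimal sketch, assuming a 30-day retention period is acceptable (the directory name and retention period are illustrative, not recommendations):

```shell
# CACHE_DIR should match settings.common.webinterface.metadata_cache_dir;
# "metadata_cache" is the default value.
CACHE_DIR="${CACHE_DIR:-metadata_cache}"
mkdir -p "$CACHE_DIR"  # idempotent; avoids errors on a fresh deployment

# Delete cached files untouched for 30 days (retention period is illustrative).
find "$CACHE_DIR" -type f -mtime +30 -delete
```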
Added retry functionality to improve the robustness of the WarcRecordClient
Fixed a bug whereby files uploaded from a harvester were not being deleted when the Bitrepository backend is in use
Added retry-handling to Bitrepository uploads via two new settings keys under settings.common.arcrepositoryClient.bitrepository
<store_retries>3</store_retries>
<retryWaitSeconds>1800</retryWaitSeconds>
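If the usual nesting of settings keys is assumed, the two new keys sit under the existing arcrepositoryClient element, with the values quoted above:

```xml
<settings>
  <common>
    <arcrepositoryClient>
      <bitrepository>
        <store_retries>3</store_retries>
        <retryWaitSeconds>1800</retryWaitSeconds>
      </bitrepository>
    </arcrepositoryClient>
  </common>
</settings>
```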
Added parameters to manage memory and core usage in hadoop mapper-only jobs
settings.common.hadoop.mapred.mapMemoryMb
settings.common.hadoop.mapred.mapMemoryCores
Added support for uberized jobs, optimised for small tasks in hadoop, via
settings.common.hadoop.mapred.enableUbertask
Added hdfs-caching functionality to hadoop jobs. When this feature is enabled, any local files passed as input to the hadoop job are first copied into hdfs and cached for future use. This should create savings when the same file is processed multiple times, as is often the case for metadata files. This functionality is controlled by the following parameters:
settings.common.hadoop.mapred.hdfsCacheEnabled
settings.common.hadoop.mapred.hdfsCacheDir
settings.common.hadoop.mapred.hdfsCacheDays
Note that if the cache is enabled but the "hdfsCacheDays" parameter is set to zero, then files are still copied into hdfs before processing but are deleted and recopied each time they are used. This can be useful for benchmarking.
Added parameters to determine which hadoop mapreduce job queue is used for different jobs. Currently two possibilities are allowed for:
settings.common.hadoop.mapred.queue.batch
settings.common.hadoop.mapred.queue.interactive
"Interactive" is used for jobs started by GUI operations and "batch" for all other jobs. By assigning these to different hadoop queues, each with a non-zero minimum quota, one can ensure that interactive jobs do not have to wait indefinitely while batch jobs are being processed.
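Taken together, and again assuming the usual mapping of dotted settings keys onto nested XML elements, the new hadoop settings could be gathered in one fragment like this (all values, including the queue names, are illustrative rather than recommended defaults):

```xml
<settings>
  <common>
    <hadoop>
      <mapred>
        <mapMemoryMb>4096</mapMemoryMb>
        <mapMemoryCores>2</mapMemoryCores>
        <enableUbertask>true</enableUbertask>
        <hdfsCacheEnabled>true</hdfsCacheEnabled>
        <hdfsCacheDir>/user/nas/cache</hdfsCacheDir>
        <hdfsCacheDays>7</hdfsCacheDays>
        <queue>
          <batch>batch</batch>
          <interactive>interactive</interactive>
        </queue>
      </mapred>
    </hadoop>
  </common>
</settings>
```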
Improved the performance of the GUI functionality associated with the button "Browse only relevant crawl-log lines for this domain".
Highlights in 7.0
NetarchiveSuite 7.0 introduces an entirely new backend storage and mass-processing implementation based on software from bitrepository.org and hadoop. The new functionality is enabled by defining the following key in the settings file for all applications:
<settings>
  <common>
    <arcrepositoryClient>
      <class>dk.netarkivet.archive.arcrepository.distribute.BitmagArcRepositoryClient</class>
    </arcrepositoryClient>
  </common>
</settings>
and additionally
<settings>
  <common>
    <useBitmagHadoopBackend>true</useBitmagHadoopBackend>
  </common>
</settings>
The older arcrepositoryClient implementation dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient
will be deprecated in future releases. (The developers are unaware of any other organisations currently using the older client, but please contact us if you still rely on it.)
The new architecture introduces many new keys and external configuration files. There is therefore a separate Guide To Configuring the NetarchiveSuite 7.0 Backend.
Upgrading From Previous NetarchiveSuite Releases
For those using either JMSArcRepositoryClient or LocalArcRepositoryClient there should be no special requirements to upgrade.
Issues Resolved in Release 7.0