NetarchiveSuite 5.2.x Release Notes

5.2.2 Release Date: 25th November 2016

5.2.1 Release Date: 23rd November 2016

5.2 Release Date: 4th November 2016

Highlights in 5.2.2

NAS 5.2.2 restores functionality, missing since the upgrade to Heritrix 3, that allows deduplication to be switched on or off through a setting of the HarvestJobManager component. The setting, defined in settings_HarvestJobManagerApplication.xml, is harvester.harvesting.deduplication.enabled, which takes a boolean value. It is applied to harvests generated from any crawler template that includes the DeDuplicator bean and specifies the appropriate placeholder, for example:

  <bean id="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
    <property name="indexLocation" value="%{DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER}"/>
    <property name="matchingMethod" value="URL"/>
    <property name="tryEquivalent" value="TRUE"/>
    <property name="changeContentSize" value="false"/>
    <property name="mimeFilter" value="^text/.*"/>
    <property name="filterMode" value="BLACKLIST"/>
    <property name="origin" value=""/>
    <property name="originHandling" value="INDEX"/>
    <property name="statsPerHost" value="true"/>
    <property name="enabled" value="%{DEDUPLICATION_ENABLED_PLACEHOLDER}" />
  </bean>

The %{DEDUPLICATION_ENABLED_PLACEHOLDER} placeholder is replaced with the current value of the setting when jobs are generated. The placeholder is optional: deduplication is enabled by default for any template that includes the DeDuplicator in its disposition chain and does not explicitly define the "enabled" property.
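
As a minimal sketch, the setting named above might be switched off in settings_HarvestJobManagerApplication.xml as follows. The element nesting is an assumption: it supposes that each dot-separated part of the setting name maps to a nested element under a settings root, and only the relevant elements are shown.

```xml
<!-- Sketch only: nesting is assumed to mirror the dotted setting name. -->
<settings>
  <harvester>
    <harvesting>
      <deduplication>
        <!-- "false" is substituted for %{DEDUPLICATION_ENABLED_PLACEHOLDER} when jobs are generated -->
        <enabled>false</enabled>
      </deduplication>
    </harvesting>
  </harvester>
</settings>
```

Templates that omit the placeholder are unaffected by this value and keep deduplication enabled, as described above.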

Highlights in 5.2.1

NAS 5.2.1 is a bugfix release addressing an issue in wayback-indexing of deduplicate records.

Highlights in 5.2

Java 8


NetarchiveSuite now requires a Java 8 runtime for all components.

New Settings

  • ChecksumFileApplication

     

    /**
     * <b>settings.archive.checksum.usePrecomputedChecksumDuringUpload</b>: This decides whether or not to use
     * the pre-computed checksum sent as part of the StoreMessage and UploadMessage.
     * The default is false.
     */
    public static String CHECKSUM_USE_PRECOMPUTED_CHECKSUM_DURING_UPLOAD = "settings.archive.checksum.usePrecomputedChecksumDuringUpload";

    This boolean can be used to optimise the upload process to the bitarchives.

     

  • GUIApplication, HarvestJobManager

    /**
     * <b>settings.common.topLevelDomains.tld</b>: <br>
     * Extra valid top-level domain, like .co.uk, .dk, .org, not part of the current embedded public_suffix_list.dat file
     * in common/common-core/src/main/resources/dk/netarkivet/common/utils/public_suffix_list.dat
     * downloaded from https://www.publicsuffix.org/list/public_suffix_list.dat
     */
    public static String TLDS = "settings.common.topLevelDomains.tld";

  • HarvestControllerApplication

    /**
     * The version number which goes in metadata file names like 12345-metadata-&lt;version number&gt;.warc.gz
     */
    public static String METADATA_FILE_VERSION_NUMBER = "settings.harvester.harvesting.metadata.filename.versionnumber";

    This parameter allows different generations of metadata files to be distinguished by version number.

    /**
     * <b>settings.harvester.harvesting.metadata.compression</b> Do we compress the
     * metadata associated with a given harvest job. 
     * default: false 
     */
    public static String METADATA_COMPRESSION = "settings.harvester.harvesting.metadata.compression";

    Controls whether metadata files are generated in compressed (warc.gz) format.

  • ViewerproxyApplication, IndexServerApplication, WaybackIndexerApplication

    /**
     * Specifies the suffix of a regex which can identify valid metadata files by job number. Thus preceding
     * the value of this setting with .* will find all metadata files.
     */
    public static String METADATAFILE_REGEX_SUFFIX = "settings.common.metadata.fileregexsuffix";

    This parameter allows one to determine which metadata files to include in indexing (for Viewerproxy or Wayback). The full regex string to be searched consists of the string <jobid>-<harvestid> followed by this suffix. The default value is -metadata-[0-9]+.(w)?arc(.gz)? which matches all metadata files using the standard NetarchiveSuite naming scheme.

  • GUIApplication

    /**
     * <b>settings.harvester.viewerproxy.allowFileDownloads</b> If set to false, there will be no links to
     * allow download of warcfiles via the Viewerproxy GUI.
     */
    public static String ALLOW_FILE_DOWNLOADS = "settings.harvester.viewerproxy.allowFileDownloads";

    A simple security feature to hinder operators from easily downloading harvested archive files. (default: true)

    public static String HERITRIX3_MONITOR_TEMP_PATH = "settings.harvester.harvesting.monitor.tempPath";

    Path to a directory which the new Heritrix3 monitor feature can use for caching. This is empty by default, and falls back to the system-wide temporary directory (usually /tmp).
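
As a rough illustration, several of the settings listed above could be written in a settings XML file along the following lines. This is a sketch under stated assumptions: the element nesting is assumed to mirror the dot-separated setting names, the values are illustrative rather than recommended, and the cache path is hypothetical. In a real deployment each element would go into the settings file of the application named in the corresponding bullet.

```xml
<!-- Sketch only: nesting is assumed to mirror the dotted setting names. -->
<settings>
  <common>
    <metadata>
      <!-- default value as quoted in the text above -->
      <fileregexsuffix>-metadata-[0-9]+.(w)?arc(.gz)?</fileregexsuffix>
    </metadata>
  </common>
  <archive>
    <checksum>
      <!-- illustrative: the documented default is false -->
      <usePrecomputedChecksumDuringUpload>true</usePrecomputedChecksumDuringUpload>
    </checksum>
  </archive>
  <harvester>
    <harvesting>
      <metadata>
        <filename>
          <!-- illustrative version number, giving names like 12345-metadata-2.warc.gz -->
          <versionnumber>2</versionnumber>
        </filename>
        <!-- illustrative: the documented default is false -->
        <compression>true</compression>
      </metadata>
      <monitor>
        <!-- hypothetical directory; when empty, the system temporary directory is used -->
        <tempPath>/var/cache/nas-h3-monitor</tempPath>
      </monitor>
    </harvesting>
    <viewerproxy>
      <!-- illustrative: the documented default is true -->
      <allowFileDownloads>false</allowFileDownloads>
    </viewerproxy>
  </harvester>
</settings>
```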

Control Heritrix from NetarchiveSuite (beta)


In earlier versions of NetarchiveSuite, there was limited monitoring of running Heritrix harvests in the NetarchiveSuite GUI, but management of running jobs required opening the Heritrix3 console itself. From NetarchiveSuite 5.2, much of the Heritrix3 console functionality has been moved into NetarchiveSuite. It is now possible, from NAS itself, to:

  • pause, unpause or terminate running Heritrix jobs
  • inspect reports on running jobs
  • show the crawl-log of a running job, either in its entirety or filtered by regex
  • show and manipulate the Heritrix frontier

These extensive new features are experimental in NAS 5.2 and the developers welcome feedback, bug-reports, and code-patches.

Top-Level Domains Can Be Defined Externally

From NAS 5.2, all ICANN-recognized domains are accepted as valid in NAS. NAS contains an embedded copy of https://publicsuffix.org/list/public_suffix_list.dat, but this may be overridden, if necessary, by placing an alternative copy at the hard-coded path conf/public_suffix_list.dat in the installation on the machine where the GUIApplication and HarvestJobManager run.

warc.gz metadata files

NAS now supports compression of metadata files (warc.gz format) via the setting settings.harvester.harvesting.metadata.compression.

WARC Revisit Records

NAS now generates WARC revisit records when using the is.hi.bok.deduplicator.DeDuplicator module.

Tomcat

The web GUI now uses an embedded Tomcat, rather than Jetty, as a servlet container. This changeover should be invisible to the end user.

New Heritrix Version

NAS now uses the most recent (unofficial) Heritrix release from Kristinn Sigurðsson at the National Library of Iceland (version 3.3.0-LBS-2016-02).

RSS Crawling

The Heritrix crawl-rss extension from Kristinn Sigurðsson at the National Library of Iceland is now also bundled with NAS, and is therefore available for use in NAS crawls. (See RSS Harvests for documentation).

GUI Styling

The styling of the web interface has been improved.

Most-recent updates for 5.2.x:

Issues resolved in release 5.2.2

Issues resolved in release 5.2.1

Issues resolved in release 5.2

Known issues
