NetarchiveSuite 5.4.x Release Notes

5.4.2 Release Date: 2018-06-15

BugFix Release 5.4.2

This release addresses issue NAS-2514, which caused many URLs to receive crawl-status code -50 in some harvests. It is only relevant for users of SeedUriDomainnameQueueAssignmentPolicy. The fix is in two parts:

  • A new QuotaEnforcer implementation, dk.netarkivet.harvester.harvesting.PrerequisiteIgnoringQuotaEnforcer, which can be used in a crawler-bean harvest template and which never enforces harvesting quotas on prerequisite URLs (typically DNS lookups and robots.txt), and
  • An alteration to SeedUriDomainnameQueueAssignmentPolicy to ensure that DNS queries are queued on the same queue as other URLs for the same seed. This appears to work around an undocumented race condition in Heritrix which was causing many crawl failures.
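
As a rough illustration of the first part, the new enforcer could be wired into a crawler-bean harvest template by changing the class of the existing quota enforcer bean. This is only a sketch: the bean id "quotaenforcer" and the assumption that the existing quota properties can be kept unchanged are not confirmed by these notes, so check them against your own template.

```xml
<!-- Sketch only: the bean id and the retained properties are assumptions
     about the existing harvest template, not taken from these notes. -->
<bean id="quotaenforcer"
      class="dk.netarkivet.harvester.harvesting.PrerequisiteIgnoringQuotaEnforcer">
  <!-- Keep the quota properties already defined in your template here. -->
</bean>
```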

BugFix Release 5.4.1

NAS 5.4.1 is a bug-fix release addressing some issues found during the acceptance test phase of NAS 5.4. The issues addressed are:

  • A memory leak introduced by a new feature in NAS 5.4 (NAS-2614) to manage the number of jobs on the JMS queues,
  • An error in the functionality for searching/browsing in the frontier of running jobs, and
  • Introduction of a new setting, settings.harvester.indexserver.tryToMigrateDuplicationRecords, a switch that controls new functionality associated with the Danish netarchive's project to compress its archive. This functionality caused an unnecessary slowdown in indexing and is now disabled by default.

The functionality for browsing in the Heritrix frontier is still somewhat experimental and is in need of a usability overhaul. This is a priority for a future release.



Highlights in 5.4

  • NetarchiveSuite now ships with a customised version of Heritrix 3, forked from the version maintained by Kristinn Sigurdsson at the National Library of Iceland.
  • The integration between the NetarchiveSuite Web interface and Heritrix 3 has been much improved, both in regard to scaling and usability.
  • There is significant improvement to the job generation algorithm, so that the production of spurious duplicate jobs is now largely eliminated.
  • Support for Heritrix1 has now been removed from the distribution.
  • You can now limit how many jobs are submitted to each job channel simultaneously by setting settings.harvester.scheduler.limitSubmittedJobsInQueue to true (the default is false). When the limit is enabled, the default is one job at a time; you can change this by overriding settings.harvester.scheduler.submittedJobsInQueueLimit. The latter setting is ignored if limitSubmittedJobsInQueue is false.
  • The setting settings.harvester.scheduler.jobgenerationperiode has been renamed settings.harvester.scheduler.jobgenerationperiod (the default value is still 60 seconds, i.e. 1 minute).
  • Added a new setting, settings.webinterface.runningjobsFilteringMethod, to choose between filtering methods on History/Harveststatus-running.jsp (default: database; alternative: cachedLogs).
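
The scheduler and web interface settings above could be overridden in a settings file roughly as follows. This is a sketch under the assumption that each dotted setting name maps to nested XML elements under the settings root; the limit of 3 is an illustrative value, not a recommendation.

```xml
<!-- Sketch only: assumes dotted setting names map to nested elements. -->
<settings>
  <harvester>
    <scheduler>
      <!-- Enable the per-channel limit on submitted jobs (default: false). -->
      <limitSubmittedJobsInQueue>true</limitSubmittedJobsInQueue>
      <!-- Illustrative value; the default is 1 when the limit is enabled. -->
      <submittedJobsInQueueLimit>3</submittedJobsInQueueLimit>
      <!-- Renamed from jobgenerationperiode; 60 is the default. -->
      <jobgenerationperiod>60</jobgenerationperiod>
    </scheduler>
  </harvester>
  <webinterface>
    <!-- Either "database" (default) or "cachedLogs". -->
    <runningjobsFilteringMethod>cachedLogs</runningjobsFilteringMethod>
  </webinterface>
</settings>
```

If an existing settings file still uses the old jobgenerationperiode element, it presumably needs to be renamed for the value to take effect.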

Upgrading from previous releases of NetarchiveSuite

  • Upgrading the database: After finishing the installation of NetarchiveSuite and starting it for the first time, go to the server where GUIApplication and HarvestJobManager are installed and run:

    cd NAS_INSTALLDIR/conf
    bash update_external_harvest_database.sh

    Please examine the INSTALLDIR/update_external_harvest_database.log for any errors.


Most-recent updates for 5.4.1:

Issues resolved in release 5.4
