This is a quick overview of the software and its major features.

Introduction

The primary function of NetarchiveSuite is to plan, schedule and archive web harvests of parts of the internet. It uses Heritrix3 as its web crawler.
NetarchiveSuite was released in July 2007 as Open Source under the LGPL and is used by the Danish organization Netarkivet.dk. This organization has been using NetarchiveSuite since July 2005 to harvest Danish websites as authorized by the latest Danish Legal Deposit Act. A number of other national libraries also use NetarchiveSuite to harvest their national web domains, and the software is maintained as an Open Source partnership between these organizations.

The NetarchiveSuite can organize three different kinds of web harvest:

The software has been designed with the following in mind:

The modules in the NetarchiveSuite

The NetarchiveSuite is split into four main modules: one module with common functionality and three modules corresponding to the processes of harvesting, archiving and accessing, respectively.

The Common Module

This module contains the framework and utilities used by the whole suite, such as exceptions, settings, messaging, file transfer (RemoteFile), and logging. It also defines the Java interfaces used to communicate between the different modules, so that alternative implementations can be plugged in. The Common Module includes the web front-end through which curators and managers can define harvests, monitor running harvests, and perform quality assurance on completed harvests.
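
As a rough illustration of what "interfaces that support alternative implementations" can look like in Java, the sketch below defines a small contract and picks the concrete class from a settings value. Every name in it (PluggableTransferSketch, FileTransfer, InMemoryTransfer, fromSetting, the "memory" key) is hypothetical and chosen for this example only; it is not the NetarchiveSuite API and does not show how RemoteFile is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative only: every name here is hypothetical, not NetarchiveSuite API. */
class PluggableTransferSketch {

    /** Contract shared by all callers; they never see the concrete class. */
    interface FileTransfer {
        void store(String name, byte[] content);

        byte[] fetch(String name);
    }

    /** One possible implementation: keeps files in memory (handy for tests). */
    static class InMemoryTransfer implements FileTransfer {
        private final Map<String, byte[]> files = new HashMap<>();

        @Override
        public void store(String name, byte[] content) {
            files.put(name, content.clone());
        }

        @Override
        public byte[] fetch(String name) {
            byte[] content = files.get(name);
            if (content == null) {
                throw new IllegalArgumentException("Unknown file: " + name);
            }
            return content.clone();
        }
    }

    /** Maps a (hypothetical) settings value to a factory for an implementation. */
    private static final Map<String, Supplier<FileTransfer>> IMPLEMENTATIONS = new HashMap<>();

    static {
        IMPLEMENTATIONS.put("memory", InMemoryTransfer::new);
    }

    /** Creates the implementation registered under the given settings value. */
    static FileTransfer fromSetting(String settingValue) {
        Supplier<FileTransfer> factory = IMPLEMENTATIONS.get(settingValue);
        if (factory == null) {
            throw new IllegalArgumentException("No transfer implementation named " + settingValue);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        FileTransfer transfer = fromSetting("memory");
        transfer.store("example.warc", new byte[] {1, 2, 3});
        System.out.println(transfer.fetch("example.warc").length);
    }
}
```

In a design like this, a different transport would be added by registering another factory under a new key, without touching the code that calls store and fetch.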

The Harvester Module

This module handles defining, scheduling, and performing harvests.

The Archive Module

This module makes it possible to set up and run a repository with replication, active bit consistency checks for bit preservation, and support for distributed batch jobs on the archive.
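
To make the idea of a bit consistency check across replicas concrete, here is a minimal sketch that computes a checksum per replica and flags any replica that disagrees with the strict majority. It is an assumption-laden illustration, not the Archive Module's code: the class and method names are hypothetical, SHA-256 is chosen only for the example, and the majority-vote rule is one possible policy among several.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not NetarchiveSuite code. */
class ReplicaChecksumSketch {

    /** Returns the hex-encoded SHA-256 digest of the given bytes. */
    static String checksum(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /**
     * Returns the replicas whose checksum differs from the one held by a
     * strict majority. If no checksum has a strict majority, every replica
     * is reported, since none of them can be trusted over the others.
     */
    static List<String> findSuspectReplicas(Map<String, String> checksumByReplica) {
        Map<String, Integer> votes = new HashMap<>();
        for (String sum : checksumByReplica.values()) {
            votes.merge(sum, 1, Integer::sum);
        }
        String majority = null;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() * 2 > checksumByReplica.size()) {
                majority = entry.getKey();
            }
        }
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, String> entry : checksumByReplica.entrySet()) {
            if (!entry.getValue().equals(majority)) {
                suspects.add(entry.getKey());
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        byte[] original = "archived record".getBytes(StandardCharsets.UTF_8);
        byte[] damaged = "archived recorc".getBytes(StandardCharsets.UTF_8);

        // Hypothetical replica names; insertion order is kept for readable output.
        Map<String, String> checksums = new LinkedHashMap<>();
        checksums.put("replicaA", checksum(original));
        checksums.put("replicaB", checksum(damaged));
        checksums.put("replicaC", checksum(original));

        System.out.println("Suspect replicas: " + findSuspectReplicas(checksums));
    }
}
```

With three replicas, a single damaged copy is outvoted by the two intact ones and could then be repaired from either of them; with only two replicas a mismatch can be detected but not attributed.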

The Access (Viewerproxy) Module

This module gives access to previously harvested material through a proxy solution.

For developers