...

The framework and utilities used by the whole suite, such as exceptions, settings, messaging, file transfer (RemoteFile), and logging. It also defines the Java interfaces through which the modules communicate, so that alternative implementations can be substituted.

...

This module handles defining, scheduling, and performing harvests.

  • Harvesting uses the Heritrix crawler developed by the Internet Archive. The harvesting module allows flexible, automated definition of harvests. Users with adequate knowledge of Heritrix have access to its full power. NetarchiveSuite wraps the crawler in an easy-to-use interface that handles scheduling and configuring crawls and distributes them across several crawling servers.
  • The harvester module supports de-duplication: it uses an index of URLs already crawled and stored in the archive to avoid storing the same content more than once. This function uses the de-duplicator module from Kristinn Sigurdsson.
  • The harvester module supports packaging metadata about the harvest together with the harvested data.
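
The de-duplication idea can be illustrated with a minimal sketch. It assumes the index maps each URL to a digest of the content already stored for it. The class and method names (DedupSketch, shouldStore) are hypothetical and are not the API of the de-duplicator module or of NetarchiveSuite.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; not the de-duplicator module's real API.
class DedupSketch {
    // Assumed index layout: URL -> hex SHA-1 digest of the payload already archived.
    private final Map<String, String> index = new HashMap<>();

    // Computes the hex SHA-1 digest of a payload.
    static String sha1Hex(byte[] payload) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(payload)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the payload is new for this URL and should be stored,
    // false if identical content is already recorded in the index.
    boolean shouldStore(String url, byte[] payload) {
        String digest = sha1Hex(payload);
        if (digest.equals(index.get(url))) {
            return false;
        }
        index.put(url, digest);
        return true;
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        byte[] page = "<html>same</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(dedup.shouldStore("http://example.org/", page)); // first crawl
        System.out.println(dedup.shouldStore("http://example.org/", page)); // unchanged recrawl
    }
}
```

In this sketch a changed page at the same URL gets a new digest and is stored again; only byte-identical content is skipped.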

...

  • The archiving component offers a secure environment for storing your harvested material. It is designed to give strong guarantees on bit preservation.
  • It allows for replication of data at different locations, and distribution of content across several servers at each location. It supports different software and hardware platforms.
  • The module allows for distributed batch jobs, i.e. running the same job on all servers at a location in parallel and merging the results.
  • An index of data in the archive allows fast access to the harvested materials.
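
The batch-job pattern described above (the same job on every server in parallel, then a merge) can be sketched as follows. This is an illustration under stated assumptions: each "server" is modelled as an in-memory list of file names, threads stand in for remote machines, and all names (BatchJobSketch, runOnAllServers, countFilesWithSuffix) are hypothetical rather than NetarchiveSuite's batch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch only; threads stand in for the servers at one location.
class BatchJobSketch {
    // Runs the same job against every server's file list in parallel
    // and returns one partial result per server, in server order.
    static <R> List<R> runOnAllServers(List<List<String>> servers,
                                       Function<List<String>, R> job) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, servers.size()));
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (List<String> files : servers) {
                Callable<R> task = () -> job.apply(files);
                pending.add(pool.submit(task));
            }
            List<R> partials = new ArrayList<>();
            for (Future<R> f : pending) {
                partials.add(f.get());
            }
            return partials;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Example job plus merge step: count files with a given suffix on each
    // server, then sum the per-server counts.
    static long countFilesWithSuffix(List<List<String>> servers, String suffix) {
        Function<List<String>, Long> job =
                files -> files.stream().filter(name -> name.endsWith(suffix)).count();
        long total = 0;
        for (Long partial : runOnAllServers(servers, job)) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> servers = Arrays.asList(
                Arrays.asList("1-1.arc", "1-2.arc", "notes.txt"),
                Arrays.asList("2-1.arc"));
        System.out.println(countFilesWithSuffix(servers, ".arc"));
    }
}
```

The merge here is a simple sum; a real batch job might instead concatenate per-server output.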

...

  • The modules are loosely coupled, communicating through Java interfaces, with the implementation replaceable without recompiling.
  • A rich set of settings in an XML structure allows the applications to be tuned in many ways for special needs.
  • All design and code are peer-reviewed.
  • There are Javadoc and implementation comments throughout the code.
  • The code is tested with unit tests (coverage of about 70%) and thorough release tests.
  • Development happens in a well-defined development model (originally based on evolutionary prototyping).
  • NetarchiveSuite is available under a well-known, integration-friendly, open source license (LGPL).
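
One common way to make an implementation replaceable without recompiling, as described in the first point of this list, is to name the implementing class in a setting and load it reflectively. The sketch below assumes that approach. It uses a flat java.util.Properties object in place of the XML settings structure, and every name in it (PluginSketch, Store, sketch.store.class) is hypothetical, not taken from NetarchiveSuite.

```java
import java.util.Properties;

// Hypothetical sketch only; shows interface-based coupling with the
// implementation class chosen by a setting rather than at compile time.
class PluginSketch {
    // The interface that callers depend on.
    public interface Store {
        String describe();
    }

    // Two interchangeable implementations.
    public static class MemoryStore implements Store {
        public String describe() { return "memory"; }
    }

    public static class DiskStore implements Store {
        public String describe() { return "disk"; }
    }

    // Assumed setting key holding the fully qualified implementation class name.
    static final String STORE_CLASS_KEY = "sketch.store.class";

    // Instantiates whichever implementation the settings name,
    // using its no-argument constructor.
    static Store loadStore(Properties settings) {
        String className = settings.getProperty(STORE_CLASS_KEY);
        if (className == null) {
            throw new IllegalArgumentException("Missing setting " + STORE_CLASS_KEY);
        }
        try {
            Class<?> cls = Class.forName(className);
            return (Store) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties settings = new Properties();
        settings.setProperty(STORE_CLASS_KEY, MemoryStore.class.getName());
        System.out.println(loadStore(settings).describe());
    }
}
```

Changing the setting value to another class name switches the implementation while the calling code stays untouched.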