...

The DeDuplicator is a module authored by Kristinn Sigurdsson of the National Library of Iceland. It is part of the Write-processor chain and lets us avoid saving duplicates in our storage. For each potential duplicate object, it looks up the object's URL in the index associated with this module. If the URL is found in the index and the object's checksum matches the checksum recorded for that URL, the object is not stored; instead, a reference to where the object is already stored is written to the crawl log. If the URL is not found in the index, the object is stored normally. Only non-text objects are examined by this module, i.e. objects whose mimetype does not match "^text/." (text/html and text/plain, for example, do match and are therefore not examined). Deduplication is disabled if either the DeDuplicator element in the harvest template is disabled (its "enabled" attribute is set to false) or the general setting settings.harvester.harvesting.deduplication.enabled is set to false. NetarchiveSuite uses version 0.4.0 of the DeDuplicator.
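The decision described above can be sketched roughly as follows. This is only an illustration of the logic, not the DeDuplicator's actual code or API: the names `IndexEntry`, `Decision` and `decide` are hypothetical, the index is modelled as a plain dictionary keyed by URL, and the two boolean parameters stand in for the harvest template attribute and the general setting. The sketch also assumes that an object whose URL is in the index but whose checksum differs is stored normally.

```python
import re
from typing import Dict, NamedTuple, Optional

# Mimetype pattern quoted in the text; matching objects are not examined.
TEXT_MIMETYPE = re.compile(r"^text/.")


class IndexEntry(NamedTuple):
    """Hypothetical index record: checksum and location of a stored object."""
    checksum: str
    origin: str


class Decision(NamedTuple):
    """Hypothetical result: whether to store, and what to reference if not."""
    store: bool
    duplicate_of: Optional[str]


def decide(url: str, mimetype: str, checksum: str,
           index: Dict[str, IndexEntry],
           template_enabled: bool = True,
           setting_enabled: bool = True) -> Decision:
    # Deduplication is off if either switch is set to false.
    if not (template_enabled and setting_enabled):
        return Decision(store=True, duplicate_of=None)

    # Text objects are never examined, so they are always stored.
    if TEXT_MIMETYPE.search(mimetype):
        return Decision(store=True, duplicate_of=None)

    entry = index.get(url)
    # Known URL with an unchanged checksum: skip storing and point
    # the crawl log at the copy that already exists.
    if entry is not None and entry.checksum == checksum:
        return Decision(store=False, duplicate_of=entry.origin)

    # Unknown URL, or (assumption) known URL with a changed checksum.
    return Decision(store=True, duplicate_of=None)
```

With an index containing one image URL, calling `decide` for that URL with the same checksum would return a decision not to store, carrying the recorded origin, whereas the same call with either switch set to false would return a decision to store.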

...