NetarchiveSuite 4.0.X Release Notes

Planned release 28.01.2013.

Highlights

WARC functionality is now of production quality and is considered ready to replace the ARC format in NetarchiveSuite.

Upgrade-notes

Update of the harvest database

After installing this release of NetarchiveSuite, you need to run the HarvestdatabaseUpdateApplication program. Information on how to do that can be found in the Additional Tools manual.

Viewerproxy module merged into Harvester module

See NAS-2104 for details.

The following changes need to be made:

Settings change

The viewerproxy settings should be moved into the harvester settings. For example, the following section in the settings XML file should be changed

from

<settings>
  ...
  <harvester>
    ...
  </harvester>
  ...
  <viewerproxy>
     ...
  </viewerproxy>
  ...
</settings>

to

<settings>
  ...
  <harvester>
    ...
    <viewerproxy>
      ...
    </viewerproxy>
    ...
  </harvester>
  ...
</settings>
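For sites with several settings files, the move can be scripted. The following is a minimal sketch, not part of NetarchiveSuite: it assumes the settings file has a settings root element without an XML namespace, and the function name move_viewerproxy and the sample child element someSetting are hypothetical. Note that ElementTree drops XML comments when parsing, so review the result before deploying it.

```python
import xml.etree.ElementTree as ET


def move_viewerproxy(settings_xml):
    """Return the settings XML with <viewerproxy> nested inside <harvester>."""
    root = ET.fromstring(settings_xml)
    viewerproxy = root.find("viewerproxy")
    if viewerproxy is None:
        # Nothing to move: already migrated, or no viewerproxy overrides.
        return settings_xml
    harvester = root.find("harvester")
    if harvester is None:
        harvester = ET.SubElement(root, "harvester")
    root.remove(viewerproxy)
    harvester.append(viewerproxy)
    return ET.tostring(root, encoding="unicode")


# Sample input; <someSetting> is a placeholder, not a real setting name.
old = """<settings>
  <harvester><performer>releaseTest</performer></harvester>
  <viewerproxy><someSetting>value</someSetting></viewerproxy>
</settings>"""
migrated = move_viewerproxy(old)
print(migrated)
```

Running the function on an already migrated file returns it unchanged, so it is safe to apply to a whole directory of settings files.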

Script changes

The viewerproxy.jar file no longer exists; its classes have been moved to the harvester.jar file. This means that all references to the viewerproxy.jar file in manually maintained files should be replaced by a reference to the harvester.jar file (if not already included).

Indexing code merged into Harvester module

The indexing code in the archiver module has been moved to the harvester module where it logically belongs.
This means that you need to update your existing deploy configurations as follows:

All overrides, if any, of settings beginning with settings.archive.indexserver must be moved to settings.harvester.indexserver.
The default indexrequestserver implementation is renamed from dk.netarkivet.archive.indexserver.distribute.IndexRequestServer to dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer.
The default indexClient class, set by the setting settings.common.indexClient.class, is renamed from dk.netarkivet.archive.indexserver.distribute.IndexRequestClient to dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.
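The changes listed above can be sketched as a small script. This is an illustration under assumptions, not a supported tool: it assumes a settings root element without an XML namespace, it does not merge into an already existing settings.harvester.indexserver section, and the function name migrate_indexserver and the sample element exampleOverride are hypothetical. Only the two class names listed above are rewritten.

```python
import xml.etree.ElementTree as ET

# The two renames listed in these release notes.
RENAMED_CLASSES = {
    "dk.netarkivet.archive.indexserver.distribute.IndexRequestServer":
        "dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer",
    "dk.netarkivet.archive.indexserver.distribute.IndexRequestClient":
        "dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient",
}


def migrate_indexserver(settings_xml):
    """Move archive/indexserver under harvester and rewrite renamed classes."""
    root = ET.fromstring(settings_xml)
    archive = root.find("archive")
    indexserver = archive.find("indexserver") if archive is not None else None
    if indexserver is not None:
        harvester = root.find("harvester")
        if harvester is None:
            harvester = ET.SubElement(root, "harvester")
        archive.remove(indexserver)
        harvester.append(indexserver)
    # Rewrite any element whose text is exactly one of the old class names.
    for element in root.iter():
        text = (element.text or "").strip()
        if text in RENAMED_CLASSES:
            element.text = RENAMED_CLASSES[text]
    return ET.tostring(root, encoding="unicode")


# Sample input; <exampleOverride> is a placeholder, not a real setting name.
old = """<settings>
  <common>
    <indexClient>
      <class>dk.netarkivet.archive.indexserver.distribute.IndexRequestClient</class>
    </indexClient>
  </common>
  <archive>
    <indexserver><exampleOverride>value</exampleOverride></indexserver>
  </archive>
</settings>"""
migrated = migrate_indexserver(old)
print(migrated)
```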

Changes to the settings related to the harvest scheduler

The fix for issue NAS-2069 has changed the layout of the harvester.scheduler settings.

The old layout was:

 <scheduler>
   <errorFactorPrevResult>10</errorFactorPrevResult>
   <errorFactorBestGuess>2</errorFactorBestGuess>
   <expectedAverageBytesPerObject>38000</expectedAverageBytesPerObject>
   <maxDomainSize>5000</maxDomainSize>
   <jobs>
     <maxRelativeSizeDifference>100</maxRelativeSizeDifference>
     <minAbsoluteSizeDifference>2000</minAbsoluteSizeDifference>
     <maxTotalSize>2000000</maxTotalSize>
     <maxTimeToCompleteJob>0</maxTimeToCompleteJob>
   </jobs>
   <configChunkSize>10000</configChunkSize>
   <splitByObjectLimit>false</splitByObjectLimit>
   <jobtimeouttime>604800</jobtimeouttime>
   <dispatchperiode>30000</dispatchperiode>
   <jobgenerationperiode>60</jobgenerationperiode>
   <useQuotaEnforcer>true</useQuotaEnforcer>
 </scheduler>

The new layout, with defaults, is:

<scheduler> 
  <jobtimeouttime>604800</jobtimeouttime>
  <jobgenerationperiode>10</jobgenerationperiode>
  <jobGen>
    <class>dk.netarkivet.harvester.scheduler.jobgen.DefaultJobGenerator</class>
    <objectLimitIsSetByQuotaEnforcer>true</objectLimitIsSetByQuotaEnforcer>
    <maxTimeToCompleteJob>0</maxTimeToCompleteJob>
    <domainConfigSubsetSize>10000</domainConfigSubsetSize>
    <config>
      <!-- Only used by DefaultJobGenerator -->
      <splitByObjectLimit>false</splitByObjectLimit>
      <maxRelativeSizeDifference>100</maxRelativeSizeDifference>
      <minAbsoluteSizeDifference>2000</minAbsoluteSizeDifference>
      <maxTotalSize>2000000</maxTotalSize>
      <errorFactorPrevResult>10</errorFactorPrevResult>
      <errorFactorBestGuess>2</errorFactorBestGuess>
      <expectedAverageBytesPerObject>38000</expectedAverageBytesPerObject>
      <maxDomainSize>5000</maxDomainSize>
      <!-- Only used by FixedDomainConfigurationCountJobGenerator -->
      <fixedDomainCountSnapshot>0</fixedDomainCountSnapshot>
      <fixedDomainCountFocused>0</fixedDomainCountFocused>
    </config>
  </jobGen>
</scheduler>
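To find settings files that still use the old layout, a check along the following lines may help. It is a sketch only: it assumes the scheduler section sits at settings.harvester.scheduler in a file without an XML namespace, and the function name find_old_scheduler_settings is hypothetical. It reports every direct child of scheduler that is not a direct child in the new layout; it does not guess where each old setting should go.

```python
import xml.etree.ElementTree as ET

# Direct children of <scheduler> in the new layout shown above.
NEW_SCHEDULER_CHILDREN = {"jobtimeouttime", "jobgenerationperiode", "jobGen"}


def find_old_scheduler_settings(settings_xml):
    """Return names of <scheduler> children that only exist in the old layout."""
    root = ET.fromstring(settings_xml)
    scheduler = root.find("harvester/scheduler")
    if scheduler is None:
        return []
    return [child.tag for child in scheduler
            if child.tag not in NEW_SCHEDULER_CHILDREN]


old = """<settings><harvester><scheduler>
  <errorFactorPrevResult>10</errorFactorPrevResult>
  <jobs><maxTotalSize>2000000</maxTotalSize></jobs>
  <jobtimeouttime>604800</jobtimeouttime>
  <useQuotaEnforcer>true</useQuotaEnforcer>
</scheduler></harvester></settings>"""
print(find_old_scheduler_settings(old))
# prints ['errorFactorPrevResult', 'jobs', 'useQuotaEnforcer']
```

An empty list means the scheduler section has no leftover elements from the old layout.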

How to change your setup if you want to switch NetarchiveSuite to WARC mode

By default NetarchiveSuite writes its metadata to ARC files, and Heritrix produces ARC files as well.

In 4.0 there is a separate setting for each choice, so you can, for example, keep writing metadata to ARC files while asking Heritrix to produce WARC files.

If you want to generate metadata WARC files, change the setting settings.harvester.harvesting.metadata.metadataFormat to "warc" in your deploy configuration, as seen below:

<settings>
  <harvester>
    <harvesting>
      <metadata>
        <heritrixFilePattern>.*(\.journal|\.xml|\.txt|\.log|\.out)</heritrixFilePattern>
        <reportFilePattern>.*-report.txt</reportFilePattern>
        <logFilePattern>.*(\.log|\.out|\.gz)</logFilePattern>
        <metadataFormat>warc</metadataFormat>
      </metadata>
    </harvesting>
  </harvester>
</settings>

To make Heritrix produce WARC files, change the setting settings.harvester.harvesting.heritrix.archiveFormat to "warc", as seen below:

<settings>
  <harvester>
    <performer>releaseTest</performer>
    <aliases>
      <timeout>31536000</timeout>
    </aliases>
    <harvesting>
      <deduplication>
        <enabled>true</enabled>
      </deduplication>
      <heritrix>
        <inactivityTimeout>1800</inactivityTimeout>
        <noresponseTimeout>1800</noresponseTimeout>
        <crawlLoopWaitTime>60</crawlLoopWaitTime>
        <archiveFormat>warc</archiveFormat>
      </heritrix>
    </harvesting>
  </harvester>
</settings>
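Both format settings can also be set by a script. The sketch below is an illustration, not a NetarchiveSuite tool: it assumes a settings root element without an XML namespace, and the helper names set_setting and enable_warc are hypothetical. Missing intermediate elements are created as needed.

```python
import xml.etree.ElementTree as ET


def set_setting(root, path, value):
    """Set the text of the element at a slash-separated path, creating it if missing."""
    element = root
    for name in path.split("/"):
        child = element.find(name)
        if child is None:
            child = ET.SubElement(element, name)
        element = child
    element.text = value


def enable_warc(settings_xml, metadata=True, heritrix=True):
    """Switch the metadata format and/or the Heritrix archive format to WARC."""
    root = ET.fromstring(settings_xml)
    if metadata:
        set_setting(root, "harvester/harvesting/metadata/metadataFormat", "warc")
    if heritrix:
        set_setting(root, "harvester/harvesting/heritrix/archiveFormat", "warc")
    return ET.tostring(root, encoding="unicode")


# Keep metadata in ARC files while Heritrix writes WARC files.
sample = "<settings><harvester><performer>releaseTest</performer></harvester></settings>"
result = enable_warc(sample, metadata=False)
print(result)
```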

Having Heritrix write WARC files also requires that you add a WARCWriterProcessor to the templates you are using. We recommend that you add the following to your templates, as done in the template used in our release test, default_orderxml-with-source-tag-seeds-enabled.xml:

<newObject name="WARCArchiver" class="dk.netarkivet.harvester.harvesting.WARCWriterProcessor">
  <boolean name="enabled">false</boolean>
  <newObject name="WARCArchiver#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules">
    </map>
  </newObject>
  <boolean name="compress">false</boolean>
  <string name="prefix">netarkivet</string>
  <string name="suffix">${HOSTNAME}</string>
  <long name="max-size-bytes">100000000</long>
  <stringList name="path">
    <string>warcs</string>
  </stringList>
  <integer name="pool-max-active">5</integer>
  <integer name="pool-max-wait">300000</integer>
  <long name="total-bytes-to-write">0</long>
  <boolean name="skip-identical-digests">false</boolean>
  <boolean name="write-requests">true</boolean>
  <boolean name="write-metadata">true</boolean>
  <boolean name="write-revisit-for-identical-digests">true</boolean>
  <boolean name="write-revisit-for-not-modified">true</boolean>
</newObject>

This template uses dk.netarkivet.harvester.harvesting.WARCWriterProcessor as its WARC writer, which is an extension of the WARCWriterProcessor bundled with Heritrix. The extension only adds some NetarchiveSuite metadata to the WarcInfo record, so if you don't need that, you can use the class org.archive.crawler.writer.WARCWriterProcessor instead.

Full list of issues resolved in this release


Known issues
