Workflow for generating deduplication indexes in NetarchiveSuite

Activating a snapshot harvest on page HarvestDefinition/Definitions-snapshot-harvests.jsp calls the SnapshotHarvestDefinition#flipActive() method.

This method has the following logic if deduplication is enabled:

log.info("Snapshot harvest #{} activated. Requesting preparation of deduplicationIndex before jobgeneration can commence", harvestId);
Set<Long> jobSet = hdDaoProvider.get().getJobIdsForSnapshotDeduplicationIndex(harvestId);
jobIndexCache.requestIndex(jobSet, harvestId);

This sends an IndexRequestMessage to the IndexServer, or more specifically the IndexRequestServer, requesting a deduplicationIndex for the given set of jobs. (See IndexRequestClient, ll. 358-383.)

After processing the request, the IndexRequestServer sends an IndexReadyMessage to the HarvestJobManager with either indexOK=true (the index is ready) or indexOK=false (the server failed to generate the index).

(See dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer#doProcessIndexRequestMessage(), ll. 416-423)


The HarvestJobManager receives the IndexReadyMessage in the method HarvestSchedulerMonitorServer#processIndexReadyMessage().
Here the 'isindexready' field in the table 'fullharvests' is set to true if the 'indexOK' field in the IndexReadyMessage is true; otherwise it is set to false.

Note that the 'isindexready' field is used by HarvestDefinitionDBDAO#getReadyHarvestDefinitions() to return only those snapshot harvest definitions where 'isindexready' is true.
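
The reply handling and the filtering described above can be pictured with a small in-memory sketch. It is not the NetarchiveSuite implementation: the class IndexReadyGateSketch, its register() method and the map standing in for the 'fullharvests' table are hypothetical, and the initial value isindexready=false is an assumption of the sketch. Any other conditions the real getReadyHarvestDefinitions() applies are ignored here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of the isindexready gate; not the NetarchiveSuite code. */
class IndexReadyGateSketch {

    // Hypothetical stand-in for the fullharvests table: harvest_id -> isindexready.
    private final Map<Long, Boolean> isIndexReady = new LinkedHashMap<>();

    /** Assumption of this sketch: a newly registered harvest starts with isindexready=false. */
    void register(long harvestId) {
        isIndexReady.put(harvestId, false);
    }

    /** Copies the indexOK value of the reply into the isindexready flag. */
    void processIndexReadyMessage(long harvestId, boolean indexOK) {
        if (isIndexReady.containsKey(harvestId)) {
            isIndexReady.put(harvestId, indexOK);
        }
    }

    /** Returns only the harvests whose isindexready flag is true. */
    List<Long> getReadyHarvestDefinitions() {
        List<Long> ready = new ArrayList<>();
        for (Map.Entry<Long, Boolean> entry : isIndexReady.entrySet()) {
            if (entry.getValue()) {
                ready.add(entry.getKey());
            }
        }
        return ready;
    }

    public static void main(String[] args) {
        IndexReadyGateSketch gate = new IndexReadyGateSketch();
        gate.register(1L);
        gate.register(2L);
        gate.processIndexReadyMessage(1L, true);  // index was generated
        gate.processIndexReadyMessage(2L, false); // index generation failed
        System.out.println(gate.getReadyHarvestDefinitions()); // prints [1]
    }
}
```

In this model a harvest whose index generation failed stays out of the ready list until a later reply arrives with indexOK=true.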

Selecting the list of jobs included in the deduplicationIndex

The method HarvestDefinitionDBDAO#getJobIdsForSnapshotDeduplicationIndex is responsible for computing the list of jobs included in the deduplication index.
It uses the rather complex (and maybe wrong) getPreviousFullHarvests() method, also in HarvestDefinitionDBDAO:

    /**
     * Get list of harvests previous to this one.
     *
     * @param thisHarvest The id of this harvestdefinition
     * @return a list of IDs belonging to harvests previous to this one.
     */
    private List<Long> getPreviousFullHarvests(Long thisHarvest) {
        List<Long> results = new ArrayList<Long>();
        try (Connection c = HarvestDBConnection.get();) {
            // Follow the chain of originating IDs back
            for (Long originatingHarvest = thisHarvest; originatingHarvest != null;
                // Compute next originatingHarvest
                 originatingHarvest = DBUtils.selectFirstLongValueIfAny(c, "SELECT previoushd FROM fullharvests"
                         + " WHERE fullharvests.harvest_id=?", originatingHarvest)) {
                if (!originatingHarvest.equals(thisHarvest)) {
                    results.add(originatingHarvest);
                }
            }

            // Find the first harvest in the chain (but last in the list).
            Long firstHarvest = thisHarvest;
            if (!results.isEmpty()) {
                firstHarvest = results.get(results.size() - 1);
            }

            // Find the last harvest in the chain before
            Long olderHarvest = DBUtils.selectFirstLongValueIfAny(c, "SELECT fullharvests.harvest_id"
                            + " FROM fullharvests, harvestdefinitions," + "  harvestdefinitions AS currenthd"
                            + " WHERE currenthd.harvest_id=?" + " AND fullharvests.harvest_id "
                            + "= harvestdefinitions.harvest_id"
                            + " AND harvestdefinitions.submitted " + "< currenthd.submitted"
                            + " ORDER BY harvestdefinitions.submitted " + HarvestStatusQuery.SORT_ORDER.DESC.name(),
                    firstHarvest);
            // Follow the chain of originating IDs back
            for (Long originatingHarvest = olderHarvest; originatingHarvest != null; originatingHarvest = DBUtils
                    .selectFirstLongValueIfAny(c, "SELECT previoushd FROM fullharvests"
                            + " WHERE fullharvests.harvest_id=?", originatingHarvest)) {
                results.add(originatingHarvest);
            }
        } catch (SQLException e) {
            log.warn("Exception thrown while updating fullharvests.isindexready field: {}",
                    ExceptionUtils.getSQLExceptionCause(e), e);
        }
        return results;
    }
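
To make the three steps of the method easier to follow, here is a minimal in-memory sketch of the same chain-following logic. It is a sketch under assumptions, not the NetarchiveSuite code: the class PreviousHarvestsSketch, the Row type and the add() helper are hypothetical, a map replaces the SQL queries, 'submitted' is modelled as a plain number, and the 'previoushd' links are assumed to contain no cycles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of the chain-following logic; not the NetarchiveSuite code. */
class PreviousHarvestsSketch {

    /** Hypothetical stand-in for a fullharvests row joined with harvestdefinitions. */
    static final class Row {
        final Long previousHd; // fullharvests.previoushd, or null if there is none
        final long submitted;  // harvestdefinitions.submitted, as a plain number

        Row(Long previousHd, long submitted) {
            this.previousHd = previousHd;
            this.submitted = submitted;
        }
    }

    private final Map<Long, Row> rows = new HashMap<>();

    void add(long harvestId, Long previousHd, long submitted) {
        rows.put(harvestId, new Row(previousHd, submitted));
    }

    /** Stands in for "SELECT previoushd FROM fullharvests WHERE harvest_id=?". */
    private Long previousOf(Long harvestId) {
        Row row = rows.get(harvestId);
        return row == null ? null : row.previousHd;
    }

    /** Stands in for the ORDER BY submitted DESC query: newest harvest submitted before the given one. */
    private Long newestSubmittedBefore(Long harvestId) {
        Row current = rows.get(harvestId);
        if (current == null) {
            return null;
        }
        Long best = null;
        for (Map.Entry<Long, Row> entry : rows.entrySet()) {
            long submitted = entry.getValue().submitted;
            if (submitted < current.submitted
                    && (best == null || submitted > rows.get(best).submitted)) {
                best = entry.getKey();
            }
        }
        return best;
    }

    List<Long> getPreviousFullHarvests(Long thisHarvest) {
        List<Long> results = new ArrayList<>();
        // Step 1: walk the previoushd chain back from this harvest, skipping the harvest itself.
        for (Long h = thisHarvest; h != null; h = previousOf(h)) {
            if (!h.equals(thisHarvest)) {
                results.add(h);
            }
        }
        // Step 2: the first harvest in the chain is the last one in the list.
        Long firstHarvest = results.isEmpty() ? thisHarvest : results.get(results.size() - 1);
        // Step 3: find the newest harvest submitted before firstHarvest and walk its chain back.
        for (Long h = newestSubmittedBefore(firstHarvest); h != null; h = previousOf(h)) {
            results.add(h);
        }
        return results;
    }

    public static void main(String[] args) {
        PreviousHarvestsSketch sketch = new PreviousHarvestsSketch();
        sketch.add(1L, null, 100L); // first stage of an older snapshot harvest
        sketch.add(2L, 1L, 200L);   // continues harvest 1
        sketch.add(3L, null, 300L); // first stage of a newer snapshot harvest
        sketch.add(4L, 3L, 400L);   // continues harvest 3
        System.out.println(sketch.getPreviousFullHarvests(4L)); // prints [3, 2, 1]
    }
}
```

In this example harvest 4 first collects its own chain (3), then jumps to the newest harvest submitted before harvest 3 (harvest 2) and collects that chain as well (2, 1). Note that the jump in step 3 relies only on the 'submitted' timestamps, so the result depends on those timestamps matching the order of the chains.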



Classes involved in this workflow:

  • harvester/harvester-core/src/main/java/dk/netarkivet/harvester/webinterface/SnapshotHarvestDefinition.java, ll. 251-299 (esp. 267-282)
  • harvester/harvest-scheduler/src/main/java/dk/netarkivet/harvester/scheduler/HarvestSchedulerMonitorServer.java, ll. 196-224
  • harvester/harvester-core/src/main/java/dk/netarkivet/harvester/indexserver/distribute/IndexRequestServer.java, ll. 416-423
  • harvester/harvester-core/src/main/java/dk/netarkivet/harvester/datamodel/HarvestDefinitionDBDAO.java, ll. 1167-1187, 1189-1233
  • harvester/harvester-core/src/main/java/dk/netarkivet/harvester/indexserver/distribute/IndexRequestClient.java, ll. 358-383