Implement support for the WARC format in the main NetarchiveSuite project.

See NAS-1720 (Enable WARC file writing and handling in the NetarchiveSuite) for the specific list of tasks.

Suggestion for appending harvestInfo.xml to the existing Heritrix warc-info:

 

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2012-11-23T17:32:58Z
WARC-Filename: 1-1-20121123173258-00000-kb-test-har-002.kb.dk.warc
WARC-Record-ID: <urn:uuid:c01abb4a-44ef-4ab7-9d35-69a64066a107>
Content-Type: application/warc-fields
Content-Length: 872

software: Heritrix/1.14.4 http://crawler.archive.org
ip: 130.226.228.8
hostname: kb-test-har-002.kb.dk
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
operator: Admin
isPartOf: default_orderxml
description: Default Profile
robots: ignore
http-header-user-agent: Mozilla/5.0 (compatible; heritrix/1.14.4 +http://netarkivet.dk/webcrawler/)
http-header-from: svc@kb.dk

harvestInfo.version: 0.4
harvestInfo.jobId: 1
harvestInfo.priority: HIGHPRIORITY
harvestInfo.harvestNum: 0
harvestInfo.origHarvestDefinitionID: 1
harvestInfo.maxBytesPerDomain: 500000000
harvestInfo.maxObjectsPerDomain: 2000
harvestInfo.orderXMLName: default_orderxml
harvestInfo.origHarvestDefinitionName: netarkivet-harvest
harvestInfo.scheduleName: Once_a_week
harvestInfo.harvestFilenamePrefix: 1-1
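
The appending step could be sketched roughly as follows. This is only an illustration in Python, not the NetarchiveSuite implementation: the function names `harvest_info_fields` and `append_to_warcinfo` are hypothetical, the harvest info is assumed to be available as a simple mapping of names to values already read from harvestInfo.xml, and CRLF line endings are assumed for the warc-fields payload.

```python
def harvest_info_fields(info, prefix="harvestInfo."):
    """Render harvest info entries as 'name: value' lines (hypothetical helper)."""
    return ["{}{}: {}".format(prefix, key, value) for key, value in info.items()]


def append_to_warcinfo(payload, info):
    """Append harvestInfo fields to an existing warc-fields payload.

    Returns the new payload as bytes together with its length in bytes,
    which is the value the record's Content-Length header has to carry.
    """
    lines = payload.splitlines()
    # Drop trailing empty lines so the appended fields follow directly.
    while lines and not lines[-1].strip():
        lines.pop()
    lines.extend(harvest_info_fields(info))
    # Assumption: each field line in the payload is terminated by CRLF.
    body = ("\r\n".join(lines) + "\r\n").encode("utf-8")
    return body, len(body)
```

Because the payload grows, the Content-Length header of the warcinfo record has to be rewritten with the returned length; otherwise readers of the file would cut the record short.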

Suggestion for the warc-info in the NetarchiveSuite metadata warc-files.

 

 
