Implement support for the WARC format in the main NetarchiveSuite project.

See NAS-1720 (Enable WARC file writing and handling in the NetarchiveSuite) for the specific list of tasks.

Suggestion for appending the contents of harvestInfo.xml to the existing Heritrix warcinfo record:

 

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2012-11-14T17:00:49Z
WARC-Filename: 1-1-20121114170049-00000-kb-test-har-002.kb.dk.warc
WARC-Record-ID: <urn:uuid:ae0139ba-efee-4e09-824f-e65d9c248118>
Content-Type: application/warc-fields
Content-Length: 964

software: Heritrix/1.14.4 http://crawler.archive.org
ip: 130.226.228.8
hostname: kb-test-har-002.kb.dk
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
operator: Admin
isPartOf: default_orderxml
description: Default Profile
robots: ignore
http-header-user-agent: Mozilla/5.0 (compatible; heritrix/1.14.4 +http://netarkivet.dk/webcrawler/)
http-header-from: info@netarkivet.dk
harvest_info_version: 0.4

harvest_job_id: 1
harvest_job_number: 0
harvest_job_priority: HIGHPRIORITY
harvest_original_harvestdefinition_id: 1
harvest_max_bytes_per_domain: 500000000
harvest_max_objects_per_domain: 500000000
harvest_max_objects_per_domain: 2000
harvest_orderXML_name: default_orderxml
harvest_harvestdefinition_name: netarkivet
harvest_schedule_name: Once_a_week
harvest_filename_prefix: 1-1
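
The record above could be produced roughly as follows. This is only a minimal sketch in Python, not the NetarchiveSuite or Heritrix implementation: the function names `append_harvest_info` and `build_warcinfo_record` are hypothetical, and it assumes the harvest information is already available as simple name/value pairs. It appends the `harvest_*` fields to the existing warc-fields payload and recomputes `Content-Length`, since the payload grows when fields are added.

```python
CRLF = "\r\n"


def append_harvest_info(payload: str, harvest_info: dict) -> str:
    """Append harvest_* fields to an existing warc-fields payload.

    Hypothetical helper: blank lines in the original payload are dropped
    and every line is terminated with CRLF.
    """
    lines = [line for line in payload.splitlines() if line.strip()]
    lines += [f"{name}: {value}" for name, value in harvest_info.items()]
    return CRLF.join(lines) + CRLF


def build_warcinfo_record(headers: dict, payload: str) -> bytes:
    """Serialize a WARC/1.0 record with a freshly computed Content-Length."""
    body = payload.encode("utf-8")
    headers = dict(headers)  # do not modify the caller's dict
    headers["Content-Length"] = str(len(body))  # length in bytes, not characters
    head_lines = ["WARC/1.0"] + [f"{k}: {v}" for k, v in headers.items()]
    # Header block, an empty line, the payload, then two CRLFs ending the record.
    head = (CRLF.join(head_lines) + CRLF + CRLF).encode("utf-8")
    return head + body + (CRLF + CRLF).encode("utf-8")


if __name__ == "__main__":
    original = (
        "software: Heritrix/1.14.4 http://crawler.archive.org\r\n"
        "harvest_info_version: 0.4\r\n"
    )
    harvest_info = {
        "harvest_job_id": 1,
        "harvest_job_priority": "HIGHPRIORITY",
        "harvest_filename_prefix": "1-1",
    }
    record = build_warcinfo_record(
        {
            "WARC-Type": "warcinfo",
            "WARC-Date": "2012-11-14T17:00:49Z",
            "Content-Type": "application/warc-fields",
        },
        append_harvest_info(original, harvest_info),
    )
    print(record.decode("utf-8"))
```

Recomputing the length from the encoded bytes keeps the header consistent with the payload no matter how many harvest fields are appended.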

 

 
