...

6. Verify that the harvest is activated and done

  1. Click 'Harvest status'->'All Jobs' in the left menu
  2. Select "All" in "Only display job status" to the right of the menu
  3. Click the "Show" button repeatedly until the jobs have stepped through the statuses "NEW", "SUBMITTED", "STARTED", "DONE"
  4. Wait until all jobs have status "DONE"
  •  Surely the following steps are a bit superfluous? (maybe)
  1. Check the following for the domain 'raeder.dk': (Using page Harvest Status -> All jobs per domain)
    1. Check that the domain has been harvested by one job of the name <eh. name>
    2. Check that this job has configuration <eh. name>_default_orderxml_400000000Bytes_500Objects
    3. Check that there is a number for 'Run number' and 'Job ID'
    4. Check that the 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> harvest
    5. Check that the 'Bytes Harvested' and 'Documents Harvested' columns contain positive numbers
    6. Check that the 'Stopped due to' column contains "Domain Completed"
  2. Check the following job details for the domain 'netarkivet.dk': (Using page SelectiveHarvests->History->Run Number 0 ->JobID 1)
    1. Check that the 'Submit time', 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> harvest
    2. Click on "Browse reports for jobs"
    3. Check that you don't get any errors when you click on some of the links
    4. Click on "Browse harvest files for job"
    5. Check that you don't get any errors when you click on some of the links
    6. Click on "Browse only relevant crawl-log lines for domain netarkivet.dk"
    7. Check that you don't get any errors when you click on some of the links
  3. Check the following for the domain 'netarkivet.dk': (Using page Harvest Status -> All jobs per domain)
    1. Check that the domain has been harvested by 2 jobs of the name <eh. name>
    2. Check that one of the jobs has configuration <eh. name>_default_orderxml_400000000Bytes_500Objects
    3. Check that the 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> harvest
    4. Check that one of the jobs has configuration <eh. name>_default_orderxml_500000000Bytes_300Objects
    5. Check that the 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> harvest
    6. Check that the 'Run number' and 'Job ID' columns contain positive numbers
    7. Check that the 'Bytes Harvested' and 'Documents Harvested' columns contain positive numbers
    8. Check that the 'Stopped due to' column contains "Domain Completed"
  4. Check the following for the domain 'kaarefc.dk': (Using page Harvest Status -> All jobs per domain)
    1. Check that the domain has been harvested by 1 job of the name <eh. name>
    2. Check that the job has configuration <eh. name>_default_orderxml_500000000Bytes_300Objects
    3. Check that the 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> harvest
    4. Check that the 'Run number' and 'Job ID' columns contain positive numbers
    5. Check that the 'Bytes Harvested' and 'Documents Harvested' columns contain positive numbers
    6. Check that the 'Stopped due to' column contains "Domain Completed"
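
The per-domain checks above can be recorded in a small script when the table rows are transcribed by hand. The following is only a sketch: the naming pattern is inferred from the configuration names listed in the steps, and the row dictionary with its keys is a hypothetical structure, not something the GUI exports.

```python
import re

# Assumed pattern, inferred from names such as
# <eh. name>_default_orderxml_400000000Bytes_500Objects
CONFIG_PATTERN = re.compile(
    r"^(?P<prefix>.+)_(?P<max_bytes>\d+)Bytes_(?P<max_objects>\d+)Objects$"
)


def parse_config_name(name):
    """Split a configuration name into (prefix, max bytes, max objects)."""
    match = CONFIG_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected configuration name: {name!r}")
    return (
        match.group("prefix"),
        int(match.group("max_bytes")),
        int(match.group("max_objects")),
    )


def row_looks_complete(row):
    """Check one hand-transcribed table row (hypothetical dict keys)."""
    return (
        row["bytes_harvested"] > 0
        and row["documents_harvested"] > 0
        and row["stopped_due_to"] == "Domain Completed"
    )
```

For example, `parse_config_name("myevent_default_orderxml_400000000Bytes_500Objects")` yields the prefix `myevent_default_orderxml` with limits 400000000 and 500, where `myevent` stands in for the actual <eh. name>.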

7. Browse in data from the first event harvest only

These steps require that your browser is set up to use viewerproxy. For example, in the DK test environment use the instructions at Setup DK test environment#ViewerproxySetup, or for a standalone installation use the instructions here.

  1. Click 'Definitions'->'Selective Harvests' in the left menu
  2. Click 'History' in column 8 on the line with the event harvest <eh. name>
  3. Click 'Show jobs' in column 'Total number of jobs' on the line with 'Run number' 1
  4. Click 'Select these jobs for QA with viewerproxy' (it may take some time to create page)
  5. Check the following in the 'Current Viewerproxy status'
    1. No errors are reported
    2. Check that the line "Currently does _not_ collect missing URLs." appears
    3. Check that the line "Current list of missing URLs contains 0 URLs." appears
    4. Check that there is a line stating that the index in use comes from harvest <eh. name>, run 1, and is built on the jobs being looked at.
  6. Open a new tab or window in the browser (optionally, and in the same kind of browser)
  7. Go to page http://netarkivet.dk/in-english/
  8. Check that this page contains data
  9. Go to page http://indvandrerbiblioteket.dk
  10. Check that an error occurs saying that www.indvandrerbiblioteket.dk was not found
  11. Go to page http://localtimes.info/Europe/Denmark/Copenhagen/
  12. (Check that a page containing the date and time of the first harvest appears. This does not seem to work at the moment; maybe localtimes.info is dead?)
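
The page visits above could also be made from a script that sends its requests through the viewerproxy instead of a browser. This is a minimal sketch using only the Python standard library; the proxy host and port are placeholders for whatever the viewerproxy setup instructions give you. How the proxy signals a page that is not in the index is not described in these steps, so compare the returned status with what the browser shows.

```python
import urllib.error
import urllib.request


def make_proxy_opener(proxy_host, proxy_port):
    """Build an opener that sends plain-http requests through the given proxy."""
    proxy = urllib.request.ProxyHandler(
        {"http": f"http://{proxy_host}:{proxy_port}"}
    )
    return urllib.request.build_opener(proxy)


def fetch_status(opener, url, timeout=30):
    """Return the HTTP status code answered for url via the proxy."""
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # Error responses (4xx/5xx) arrive as exceptions carrying the code.
        return error.code
```

A call such as `fetch_status(make_proxy_opener("localhost", 8070), "http://netarkivet.dk/in-english/")` would then correspond to the 'Go to page' step for that URL, with `localhost` and `8070` being assumed values.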

8. Browse in data from the second event harvest only

  1. Click 'Definitions'->'Selective Harvests' in the left menu
  2. Click 'History' in column 8 on the line with event harvest <eh. name>
  3. Click 'Show jobs' in column 'Total number of jobs' on the line with 'Run number' 2
  4. Click 'Select these jobs for QA with viewerproxy' (it may take some time to create page)
  5. Check the following in the 'Current Viewerproxy status'
    1. No errors are reported
    2. Check that the line "Currently does _not_ collect missing URLs." appears
    3. Check that the line "Current list of missing URLs contains 0 URLs." appears
    4. Check that there is a line stating that the index in use comes from harvest <eh. name>, run 2, and is built on the jobs being looked at.
  6. Open a new tab or window in the browser (optionally, and in the same kind of browser)
  7. Go to page http://www.netarkivet.dk
  8. Check that an error occurs saying that www.netarkivet.dk was not found.
  9. Go to page http://indvandrerbiblioteket.dk
  10. Check that an error occurs saying that www.indvandrerbiblioteket.dk was not found
  11. Go to page http://localtimes.info/Europe/Denmark/Copenhagen/
  12. Check that a page containing the date and time of the second harvest appears (Note: "Refresh" may be necessary)

9. Browse in data from the selective harvest only

(Note! This only works with http-data, not https-data.)

  1. Click 'Definitions'->'Selective Harvests' in the left menu
  2. Click 'History' in column 8 on the line with the selective harvest <sh. name>
  3. Click 'Show jobs' in column 'Total number of jobs' on the line with 'Run number' 0
  4. Click 'Select these jobs for QA with viewerproxy' (it may take some time to create page)
  5. Check the following in the 'Current Viewerproxy status'
    1. No errors are reported
    2. Check that the line 'Currently does _not_ collect missing URLs.' appears
    3. Check that the line 'Current list of missing URLs contains 0 URLs.' appears
    4. Check that there is a line stating that the index in use comes from harvest <sh. name>, run 0, and is built on the jobs being looked at.
  6. Open a new tab or window in the browser (optionally, and in the same kind of browser)
  7. Go to page http://mazda.dk
  8. Check that this page contains data and all links are functional
  9. Go to a random internet page that is not on http://netarkivet.dk (use http, not https). The page should NOT be found. (Example: http://www.pligtaflevering.dk)

10. Verify that data is deduplicated

  1. Click on the JobID for your second finished event harvest <eh. name> in the Job status overview
  2. Click on "Browse reports for jobs"
  3. Click on the "processors-report", e.g. "metadata://netarkivet.dk/crawl/reports/processors-report.txt?heritrixVersion=3.3.0-LBS-2014-03&harvestid=1&jobid=1" (or similar; the harvestid and jobid will probably differ)
  4. Check that the processors-report contains a deduplicator section similar to the one below. The numbers will be different, but "Duplicates found" should be non-zero:

    Total handled: 88
    Duplicates found: 20 20.0%
    Bytes total: 6391852 (6.1 MB)
    Bytes discarded: 0 (0 0.0%
    New (no hits): 88
    Exact hits: 0
    Equivalent hits: 0
    .....


  5. Check also the deduplicator report for the first run of the event harvest. The number of duplicates should be zero.
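
If the report text is copied out of the browser, the "Duplicates found" check can be done mechanically. The sketch below assumes only the line layout shown in the sample above (a label followed by a count); the function name is made up for illustration.

```python
import re


def duplicates_found(report_text):
    """Return the count on the 'Duplicates found:' line of a report."""
    match = re.search(
        r"^\s*Duplicates found:\s*(\d+)", report_text, flags=re.MULTILINE
    )
    if match is None:
        raise ValueError("no 'Duplicates found' line in report")
    return int(match.group(1))
```

The second run of the event harvest should give a value above zero, and the first run should give zero.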

11. Define and run low bandwidth selective harvest

(The idea behind this is to create a job that is slow enough that one has time to terminate it before it is finished.)

  1. Go to the 'Edit Harvest Templates' page. Download default_orderxml.
  2. Edit it to set the value of disposition.maxPerHostBandwidthUsageKbSec to 30.
  3. Upload it as a new config: default_orderxml_low_bandwidth
  4. Go to the edit page for domain 'netarkivet.dk', edit defaultconfig, and replace the harvest template with 'default_orderxml_low_bandwidth'
  5. (Skip the rest of this section)
  6. Make a new selective harvest definition with a name you can remember
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Create new harvest definition' at the bottom of the main window
    3. Fill in the Harvest name and note the name for later use (from now on referred to as <sh1. name>)
    4. Choose "Once_a_week" in the drop-down list for 'Schedule'
    5. Write netarkivet.dk in the 'Enter Domain...' window and click 'Add domains'
    6. Click 'Save'
  7. Activate the selective harvest
    1. Click 'Activate' in column 5 on the line with the <sh1. name>
    2. Check that the time in the "Next Run" column on the line with the <sh1. name> is now.
  8. Check harvest status of the selective harvest
    1. Click 'Harvest status'->'All Jobs' in the left menu
    2. Select "All" in "Only display job status" to the right of the menu
    3. Click the "Show" button repeatedly until the <sh1. name> appears in a new job line (after approx. a minute)
    4. Check that the job has status "NEW". It may already have changed to status "SUBMITTED" or "STARTED" before you see it.
  9. Check job creation in the system status for the selective harvest
    1. Click 'Systemstate'->'Overview of the system state'
    2. Find and click 'HarvestJobManagerApplication' in the 'Application' column for the KB kb-test-adm-001
    3. Click 'show all' in the "Index" header
    4. Check that there is a line with the message "INFO: Created 1 jobs for harvest definition", a line after that with "INFO: Job #1 submitted", and a later line with "INFO: Job #1 has been started by the harvester."
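
The last check above is about three messages appearing in a particular order. If the log lines are copied out of the system state page, that ordering can be verified as follows. This is a sketch with a made-up helper name; it only looks for the message fragments quoted in the step and assumes each message is on its own line.

```python
def messages_in_order(log_lines, expected_fragments):
    """Return True if every fragment occurs on some line, in the given order."""
    position = 0
    for fragment in expected_fragments:
        # Advance to the next line that contains this fragment.
        while position < len(log_lines) and fragment not in log_lines[position]:
            position += 1
        if position == len(log_lines):
            return False
        position += 1  # the next fragment must be on a later line
    return True


EXPECTED = [
    "Created 1 jobs for harvest definition",
    "Job #1 submitted",
    "Job #1 has been started by the harvester",
]
```

Calling `messages_in_order(lines, EXPECTED)` on the copied lines returns `True` only when all three messages are present in that order; unrelated lines in between are ignored.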

12. Terminate a running harvest (Skip)

Use the H3 Remote Access section under Harvest status to terminate the job once it has started harvesting.


13. Check the Heritrix terminated job is logged in the Job details in ADM GUI (Skip)

  1. Click on refresh until the job disappears in 'System Overview' (about 5 min.), i.e. until you see "Starts to listen to new jobs" on the HarvestControllerApplication where the job was running

  2. Click 'Harvest status' and select your terminated job by clicking on the Job ID number
  3. Verify that under 'Included domains and configurations' some domains have "Stopped due to" set to "Harvesting aborted" (this should be the case only for the domain netarkivet.dk)

14. Start a snapshot harvest with max 5000000 bytes

  1. Make a new snapshot harvest definition with a name you can remember
    1. Click 'Definitions'->'Snapshot Harvests' in the left menu
    2. Click 'Create new harvestdefinition' at the bottom of the main window
    3. Fill in the 'Harvest name' and note the name for later use (from now on referred to as <snh. name>)
    4. Set Max number of bytes per domain to 5000000 (5 Mbytes)
    5. Click Save
    6. Click 'Activate' in column 4 on the line with the <snh. name>
  2. Check scheduling of jobs
    1. Click 'Harvest status'->'All Jobs' in the left menu
    2. Select to view NEW jobs
    3. Check that a new snapshot harvest <snh. name> job has been generated (may take a minute before jobs appear)
    4. Click 'Systemstate' in the left menu
    5. Check that the HarvestJobManager application contains the message "INFO: Created X jobs for harvest definition" (choose Application HarvestJobManager and Show all lines)
    6. Check that there are no warnings on the different applications

15. Terminate a Job

  1. Terminate the job harvesting netarkivet.dk in the snapshot harvest.
  2. Wait for the other jobs to finish.

16. Start a snapshot harvest with max 100000 bytes

  1. Make a new snapshot harvest definition with a name you can remember
    1. Click 'Definitions'->'Snapshot Harvests' in the left menu
    2. Click 'Create new harvestdefinition' at the bottom of the main window
    3. Fill in the 'Harvest name' and note the name for later use (from now on referred to as <snh. name.2>)
    4. Set 'Max number of bytes per domain' to 100000.
    5. Click on the <snh. name> under 'Harvest only domains that were not completely harvested in a previous harvest'
    6. Click Save
    7. Click 'Activate' in column 4 on the line with the <snh. name.2>
  2. Check scheduling of jobs
    1. Click 'Harvest status'->'All Jobs' in the left menu
    2. Select to view NEW jobs
    3. Check that a new snapshot harvest <snh. name.2> job has been generated (may take a minute before jobs appear)
    4. Click 'System status' in the left menu
  3. Verify job status
    1. Click 'Harvest status'->'All Jobs' in the left menu
    2. Select "All" in "Only display job status" to the right of the menu
    3. Click the "Show" button repeatedly until the jobs have stepped through the statuses "NEW", "SUBMITTED", "STARTED", "DONE"
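
When noting down the status seen on each click of "Show", the expected progression can be checked with a few lines of code. The sketch below is an illustration only: it treats the four statuses named in the step as the complete expected order, and it accepts skipped intermediate statuses, since a job may move on between two clicks.

```python
EXPECTED_ORDER = ["NEW", "SUBMITTED", "STARTED", "DONE"]


def statuses_progress_correctly(observed):
    """Check statuses noted for one job on successive clicks of 'Show'.

    The sequence must never move backwards, must contain only the
    expected statuses, and must end in DONE.
    """
    last_index = -1
    for status in observed:
        if status not in EXPECTED_ORDER:
            return False
        index = EXPECTED_ORDER.index(status)
        if index < last_index:
            return False
        last_index = index
    return bool(observed) and observed[-1] == "DONE"
```

Any status outside the four listed ones makes the check fail, so such a job is flagged for manual inspection.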

17. Check that the domains stopped by Heritrix termination are not part of the next harvest.

  1. Click on refresh until the job disappears in system overview (about 5 min.)
  2. Click 'Harvest status' and, for each job in the snapshot harvest:
    1. Click on the Job ID number
    2. Verify that the 'Included domains and configurations' do not include the domains that were stopped due to 'Harvesting aborted' in the previous harvest, and that the rest are 'Domain Completed' or 'Max Bytes limit reached'
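
The verification above is a set comparison between two harvests. Assuming the 'Stopped due to' values of both harvests have been noted per domain, it could be sketched like this. The dictionaries and the function name are hypothetical; the stop reasons are the ones named in the steps.

```python
ALLOWED_REASONS = {"Domain Completed", "Max Bytes limit reached"}


def check_next_harvest(previous, following):
    """Compare two harvests given as {domain: 'Stopped due to' value}.

    Returns a list of problems; an empty list means the check passed.
    """
    aborted = {
        domain
        for domain, reason in previous.items()
        if reason == "Harvesting aborted"
    }
    problems = []
    # Domains aborted in the previous harvest must not be harvested again.
    for domain in sorted(aborted.intersection(following)):
        problems.append(f"{domain}: aborted earlier but harvested again")
    # Every remaining domain must have one of the allowed stop reasons.
    for domain, reason in sorted(following.items()):
        if reason not in ALLOWED_REASONS:
            problems.append(f"{domain}: unexpected stop reason {reason!r}")
    return problems
```

Returning a list of problems instead of a single boolean makes it easier to see which domain caused the check to fail.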