...

  • Click 'Definitions'->'Find Domain(s)'
  • Search for =netarkivet.dk= by writing this text and click 'Search'
  • Check that the GUI returns a result-set of one, namely the domain =netarkivet.dk=
  • Click on the link =netarkivet.dk=, and the page for domain =netarkivet.dk= should be shown without errors
  • Click 'Edit' on the configuration line for defaultconfig
  • Check that Name is "defaultconfig"
  • Check that Harvest template is "default_orderxml"
  • Check that Maximum number of objects is "2,000" (in some languages, e.g. Danish, this is represented as 2.000)
  • Check that Maximum number of bytes is "500,000,000" (in some languages, e.g. Danish, this is represented as 500.000.000)
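
The two number checks above depend on the GUI language, since the same value can be shown with commas or with dots as the thousands separator. A minimal sketch of a comparison helper, assuming the displayed value is copied from the page as text; the function name `parse_grouped_int` is hypothetical and not part of NetarchiveSuite:

```python
import re

# Accept either plain digits or groups of three digits separated by
# a comma, a dot, a space, or a non-breaking space.
_GROUPED = re.compile(r"^(?:\d+|\d{1,3}(?:[.,\u00a0 ]\d{3})+)$")


def parse_grouped_int(text):
    """Turn a displayed value such as "2,000" or "2.000" into an int."""
    cleaned = text.strip()
    if not _GROUPED.match(cleaned):
        raise ValueError(f"not a grouped integer: {text!r}")
    # Drop every separator and parse what is left.
    return int(re.sub(r"[.,\u00a0 ]", "", cleaned))


# The English and Danish renderings should denote the same limits.
print(parse_grouped_int("2,000") == parse_grouped_int("2.000"))
print(parse_grouped_int("500,000,000") == parse_grouped_int("500.000.000"))
```

A value with a malformed group, such as "2.5", is rejected rather than silently read as 25.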

...

  1. Make a new selective (event) harvest definition with a name you can remember
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Create new harvestdefinition' in the bottom of the main window
    3. Fill in the Harvest name and note the name for later use (from now on referred to as <eh. name> EH)
    4. Choose '''Once_an_hour''' in the drop-down list for 'Schedule'
    5. Click Save (DO NOT CLICK ACTIVATE YET)
  2. Add seeds to the selective (event) harvest
    1. Click 'Edit' in column 6 on the line with the <eh. name> EH
    2. Write the domain list from 'Seed list 1' given below to a file on your desktop (using e.g. Notepad)
    3. Click 'Add seeds from a file' at the bottom of the main page
    4. Click 'Browse' and pick the seed file you just created
    5. Choose '''frontpages''' in the drop-down list for 'Harvest template' (set max objects per domain to 500; max bytes to 400.000.000)
    6. Click 'Insert'
    7. Now click 'Add seeds'
    8. Choose '''frontpages_plus_2levels''' in the drop-down list for 'Harvest template'
    9. Write the domain list from 'Seed list 2' given below (you can cut and paste from this page) (set max objects per domain to 300; max bytes to 500.000.000)
    10. Click 'Insert'
    11. Click 'Save'
  3. Check that the seed lists for the domains in Seed list 1 have changed correspondingly (you have to click on Show unused configurations/seedlists show all)
    1. For each of the domains =raeder.dk=, =netarkivet.dk= do:
    2. Click 'Definitions'->'Find Domain(s)'
    3. Search for domain by writing its name as text and click 'Search'
    4. Check that there exists a configuration with the name "<eh. name> EH_frontpages"
    5. Check that there exists a seed list with the name "<eh. name> EH_frontpages"
    6. Click 'Edit' in the line with seed list "<eh. name> EH_frontpages"
    7. Check that the seed list shown corresponds to the seed list for the domain (see below)
    8. Check that the seed lists for the domains in Seed list 2 have changed correspondingly (you have to click on Show unused configurations/seedlists show all)
    9. For the domains =kaarefc.dk=, =netarkivet.dk= do:
    10. Click 'Definitions'->'Find Domain(s)'
    11. Search for the domain by writing its name (either =kaarefc.dk= or =netarkivet.dk=) and click 'Search'
    12. Check that there exists a configuration with the name "<eh. name> EH_frontpages_plus_2levels"
    13. Check that there exists a seed list with the name "<eh. name> EH_frontpages_plus_2levels"
    14. Click 'Edit' in the line with seed list "<eh. name> EH_frontpages_plus_2levels"
    15. Check that the seed list shown corresponds to the seed list for the domain (see below)
  4. Activate the harvest
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Activate' in column 5 on the line with the <eh. name>
  5. Check harvest status of the event harvest using menu "All Jobs"
    1. Click 'Harvest status'->'All Jobs' in the left menu
    2. Select "All" in "Only display job status" to the right of the menu
    3. Click the "Show" button repeatedly until <eh. name> appears in a new job line (after approximately a minute)
    4. Check that two jobs appear and that they both have Harvest name <eh. name>
    5. Check in the menu "Running jobs" that the jobs appear and that you can go to the Heritrix GUI by clicking on the host link and logging in with the login/password "admin"/"adminPassword"; then close the window again.
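
The steps above place =raeder.dk= and =netarkivet.dk= in Seed list 1 (template frontpages) and =kaarefc.dk= and =netarkivet.dk= in Seed list 2 (template frontpages_plus_2levels). A minimal sketch of how a tester might tabulate which templates each domain should end up with, as a checklist for the later per-domain checks. The names `SEED_LIST_DOMAINS` and `templates_per_domain` are hypothetical, and only the domains named in the steps are included, not the full seed lists:

```python
from collections import defaultdict

# Template -> domains, as named in the steps above (not the full seed lists).
SEED_LIST_DOMAINS = {
    "frontpages": ["raeder.dk", "netarkivet.dk"],                # Seed list 1
    "frontpages_plus_2levels": ["kaarefc.dk", "netarkivet.dk"],  # Seed list 2
}


def templates_per_domain(seed_list_domains):
    """Invert the template -> domains mapping into domain -> templates."""
    result = defaultdict(list)
    for template, domains in seed_list_domains.items():
        for domain in domains:
            result[domain].append(template)
    # Sort for a stable, easy-to-read checklist.
    return {domain: sorted(templates) for domain, templates in result.items()}


for domain, templates in sorted(templates_per_domain(SEED_LIST_DOMAINS).items()):
    print(domain, templates)
```

Under these assumptions =netarkivet.dk= is the only domain with two templates.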

...

  1. Click 'Harvest status'->'All Jobs' in the left menu
  2. Select "All" in "Only display job status" to the right of the menu
  3. Click the "Show" button, until the jobs have stepped through statuses "NEW", "SUBMITTED", "STARTED", "DONE"
  4. Wait until all jobs have status "DONE"
  5. Check that you can search on Harvest name, start and end date
  6. Check that you can change the number of rows displayed per page, e.g. to 1
  7. Check that you can go to the next and the previous page
  8. Check that the reset button resets all changes to default (note that the display value is also blanked, but is 100 by default)
  9. Check the following for the domain '''raeder.dk''': (Using page Harvest Status -> All jobs per domain)
    1. Check that the domain has been harvested by one job of the name <eh. name>
    2. Check that this job has configuration <eh. name> EH_frontpages
    3. Check that there is a number for 'Run number' and 'Job ID'
    4. Check that the 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> EH harvest
    5. Check that the 'Bytes Harvested' and 'Documents Harvested' columns contain positive numbers
    6. Check that the 'Stopped due to' column contains "Domain Completed"
  10. Check the following job details for the domain '''netarkivet.dk''': (Using page SelectiveHarvests->History->Run Number 0 ->JobID 1)
    1. Check that the 'Submit time', 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> EH harvest
    2. Click on "Browse reports for jobs"
    3. Check that you don't get any errors when you click on some of the links
    4. Click on "Browse harvest files for job"
    5. Check that you don't get any errors when you click on some of the links
    6. Click on "Browse only relevant crawl-log lines for domain netarkivet.dk"
    7. Check that you don't get any errors when you click on some of the links
  11. Check the following for the domain '''netarkivet.dk''': (Using page Harvest Status -> All jobs per domain)
    1. Check that the domain has been harvested by 2 jobs of the name <eh. name> EH
    2. Check that one of the jobs has configuration <eh. name> EH_frontpages
    3. Check that the 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> EH harvest
    4. Check that one of the jobs has configuration <eh. name> EH_frontpages_plus_2levels
    5. Check that the 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> EH harvest
    6. Check that the 'Run number' and 'Job ID' columns contain positive numbers
    7. Check that the 'Bytes Harvested' and 'Documents Harvested' columns contain positive numbers
    8. Check that the 'Stopped due to' column contains "Domain Completed"
  12. Check the following for the domain '''kaarefc.dk''': (Using page Harvest Status -> All jobs per domain)
    1. Check that the domain has been harvested by 1 job of the name <eh. name> EH
    2. Check that the job has configuration <eh. name> EH_frontpages_plus_2levels
    3. Check that the 'Start time' and 'End time' columns approximately correspond to the time of the test with the <eh. name> EH harvest
    4. Check that the 'Run number' and 'Job ID' columns contain positive numbers
    5. Check that the 'Bytes Harvested' and 'Documents Harvested' columns contain positive numbers
    6. Check that the 'Stopped due to' column contains "Domain Completed"
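
The list above asks the tester to watch the jobs step through the statuses "NEW", "SUBMITTED", "STARTED", "DONE" while clicking "Show". A minimal sketch of that check, assuming the tester notes the status seen at each click; the function name `check_progression` is hypothetical. Because clicks can miss a short-lived status, the sketch tolerates repeated and skipped statuses and only rejects unknown statuses, a step backwards, or a run that does not end in "DONE":

```python
EXPECTED_ORDER = ["NEW", "SUBMITTED", "STARTED", "DONE"]


def check_progression(observed):
    """Return True if the noted statuses never go backwards and end in DONE."""
    rank = {status: index for index, status in enumerate(EXPECTED_ORDER)}
    last = -1
    for status in observed:
        if status not in rank:
            return False  # a status outside the expected sequence
        if rank[status] < last:
            return False  # the job appeared to move backwards
        last = rank[status]
    return last == rank["DONE"]


# Statuses noted for one job across successive clicks on "Show".
print(check_progression(["NEW", "NEW", "SUBMITTED", "STARTED", "DONE"]))
```

A stricter variant could require every status to be seen at least once, at the cost of failing whenever a click happens to miss one.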

...