...

  1. Make a new selective (event) harvest definition with a name you can remember
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Create new harvestdefinition' at the bottom of the main window
    3. Fill in the Harvest name and note the name for later use (referred to below as EH)
    4. Choose 'Once_an_hour' in the drop-down list for 'Schedule'
    5. Click Save (DO NOT CLICK ACTIVATE YET)
  2. Add seeds to the selective (event) harvest
    1. Click 'Edit' in column 6 on the line with the EH
    2. Write the domain list from 'Seed list 1' given below to a file on your desktop (e.g. using Notepad)
    3. Click 'Add seeds from a file' at the bottom of the main page
    4. Click 'Browse' and select the seed file you just created
    5. Choose default_orderxml in the drop-down list for 'Harvest template' (set maxobjects per domain to 500, max bytes to 400.000.000, maxhops to 0, 'obey robots.txt?' unchecked and extract_javascript checked) [previously used template frontpages]
    6. Click 'Insert'
    7. Now click 'Add seeds'
    8. Choose default_orderxml in the drop-down list for 'Harvest template'
    9. Enter the domain list from 'Seed list 2' given below (you can copy and paste from this page) (set maxobjects per domain to 300, max bytes to 500.000.000, maxhops to 2, 'obey robots.txt?' unchecked and extract_javascript checked) [previously used template frontpages_2levels]
    10. Click 'Insert'
    11. Click 'Save'
  3. Check that the seed lists for the domains in Seed list 1 have changed correspondingly (you have to click on Show unused configurations/seedlists to show all)
    1. For each of the domains raeder.dk and netarkivet.dk, do:
    2. Click 'Definitions'->'Find Domain(s)'
    3. Search for the domain by writing its name as text and click 'Search'
    4. Click the domain name link and check that there exists a configuration with the name "EH_default_orderxml_400000000Bytes_500Objects" (you have to click on Show unused configurations/seedlists to show all)
    5. Verify that the config has maxHops=0, obey robots unchecked and extract javascript checked
    6. Check that there exists a seed list with the name "EH_default_orderxml_400000000Bytes_500Objects"
    7. Click 'Edit' in the line with seed list "EH_default_orderxml_400000000Bytes_500Objects"
    8. Check that the seed list shown corresponds to the seed list for the domain (see below)
    9. Check that the seed lists for the domains in Seed list 2 have changed correspondingly (you have to click on Show unused configurations/seedlists to show all)
    10. For each of the domains kaarefc.dk and netarkivet.dk, do:
    11. Click 'Definitions'->'Find Domain(s)'
    12. Search for the domain by writing its name as text (either kaarefc.dk or netarkivet.dk) and click 'Search'
    13. Check that there exists a configuration with the name "EH_default_orderxml_500000000Bytes_300Objects" (verify that the config has maxHops=2)
    14. Check that there exists a seed list with the name "EH_default_orderxml_500000000Bytes_300Objects"
    15. Click 'Edit' in the line with seed list "EH_default_orderxml_500000000Bytes_300Objects"
    16. Check that the seed list shown corresponds to the seed list for the domain (see below)
  4. Activate the harvest
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Activate' in column 5 on the line with the <eh. name>
  5. Check harvest status of the event harvest using menu "All Jobs"
    1. Click 'Harvest status'->'All Jobs' in the left menu
    2. Select "All" in "Only display job status" to the right of the menu
    3. Click the "Show" button repeatedly until the <eh. name> appears in a new job line (after approx. one minute)
    4. Check that two jobs appear and that they both have Harvest name <eh. name>
    5. Check in the menu "Running jobs" that the jobs appear and that you can go to H3 Remote Access and monitor the jobs' progress, e.g. by viewing the cached crawl-log.
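
The configuration and seed list names checked in step 3 appear to follow the pattern <harvest name>_<template>_<max bytes>Bytes_<max objects>Objects. The following is a minimal Python sketch for generating the names to look for, assuming that pattern holds; it is inferred only from the two names quoted in this procedure, not from the NetarchiveSuite source. The function and variable names are hypothetical and are not part of NetarchiveSuite.

```python
# Sketch: derive the configuration / seed list names to look for in step 3.
# ASSUMPTION: the naming pattern is inferred from the two names quoted in this
# test procedure; it is not confirmed against the NetarchiveSuite code.

def expected_config_name(harvest_name: str, template: str,
                         max_bytes: int, max_objects: int) -> str:
    """Build '<harvest>_<template>_<bytes>Bytes_<objects>Objects'."""
    return f"{harvest_name}_{template}_{max_bytes}Bytes_{max_objects}Objects"


# Settings entered in steps 2.5 (Seed list 1) and 2.9 (Seed list 2).
SEED_LIST_SETTINGS = {
    "Seed list 1": {"template": "default_orderxml",
                    "max_bytes": 400_000_000, "max_objects": 500},
    "Seed list 2": {"template": "default_orderxml",
                    "max_bytes": 500_000_000, "max_objects": 300},
}


def expected_names(harvest_name: str) -> dict:
    """Map each seed list to the name expected under the given harvest name."""
    return {
        seed_list: expected_config_name(harvest_name, **settings)
        for seed_list, settings in SEED_LIST_SETTINGS.items()
    }


if __name__ == "__main__":
    # Replace "EH" with the harvest name noted in step 1.3.
    for seed_list, name in expected_names("EH").items():
        print(f"{seed_list}: {name}")
```

With the harvest name given as EH, this reproduces the two names quoted in step 3, which makes it easier to search for them once a real harvest name has been chosen.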

...