...

  1. Go to http://$GUIadminserver:$http-port/HarvestDefinition/ where GUIadminserver and http-port are specified in the deploy configuration file under the application named dk.netarkivet.common.webinterface.GUIApplication. In the one-machine setup (deploy_example_one_machine.xml) the link will be: http://localhost:8074
  2. Make a new selective harvest definition with a name you can remember
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Create new harvest definition' at the bottom of the main window
    3. Fill in the Harvest name and note the name for later use (from now on referred to as sh.name)
    4. Choose "Once_a_week" in the drop down list for 'Schedule'
    5. In the 'Enter Domain...' window add the name of a domain not already in the system (e.g. mazda.dk) and click 'Add domains'
    6. There should be a button "Create and add to the harvest definition" shown. Click on it.
    7. Click 'Save'
  3. Activate the selective harvest
    1. Click 'Activate' in column 5 on the line with the sh.name
    2. Check that the time in the "Next Run" column on the line with the sh.name is now.
  4. Check harvest status of the selective harvest
    1. Click 'Harvest status'->'All Jobs' in the left menu
    2. Click the "Show" button repeatedly until the name appears in a new job line (after approximately a minute)
    3. Check that the job has status "NEW"; it may already have changed to status "SUBMITTED" or status "STARTED" before you see it.
  5. Check job creation in the system status for the selective harvest
    1. Click 'Systemstate'->'Overview of the system state'
    2. Find and click 'HarvestJobManagerApplication' in the 'Application' column.
    3. Click 'show all' in the header.
    4. Check that there is a line beginning with the message "INFO: Created 1 jobs for harvest definition '", a later line with "INFO: Job #1 submitted", and after that the line "INFO: Job #1 has been started by the harvester."
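
The log check in the last step can also be done on text copied out of the 'show all' view. The following is a minimal sketch, not part of NetarchiveSuite: the function name `messages_in_order` and the idea of pasting the log lines into a Python list are assumptions made for illustration. It only verifies that the three message fragments quoted above occur in the stated order.

```python
# Message fragments taken from the step above; Job #1 is the job number
# used in this procedure and will differ on a system with earlier jobs.
EXPECTED = [
    "INFO: Created 1 jobs for harvest definition '",
    "INFO: Job #1 submitted",
    "INFO: Job #1 has been started by the harvester.",
]


def messages_in_order(log_lines, expected=EXPECTED):
    """Return True if every expected fragment occurs in log_lines, in order."""
    position = 0
    for fragment in expected:
        # Search forward, starting on the line after the previous match.
        for index in range(position, len(log_lines)):
            if fragment in log_lines[index]:
                position = index + 1
                break
        else:
            # The fragment was not found after the previous match.
            return False
    return True
```

Call `messages_in_order(lines)` with the copied log lines as a list of strings; a result of `False` means at least one message is missing or appears out of order.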

2.1 Run an Umbra Harvest

Note: In the netarkivet test environment, Umbra runs on test@kb-test-har-005. If Umbra is not running, it can be started with:

[test@kb-test-har-005 ~]$ /opt/rh/rh-python36/root/usr/bin/python3 /opt/rh/rh-python36/root/usr/bin/umbra --max-browsers 5 --executable /home/test/run-chrome.sh --url amqp://guest:guest@localhost:5672/%2f --log_config_file /home/test/logging.conf > umbra.log
  1. For the same domain as above create a new Harvest Configuration using template default_orderxml (the other template available has not been modified for use with Umbra)
  2. Create a Selective Harvest definition using this new configuration
  3. Use the Mapping functionality under Harvest Channels in the GUI to map this new configuration to the UMBRA channel
  4. Activate the harvest and wait for it to complete
  5. Check the crawl log for the completed harvest for the strings "sentToAMQP" and "receivedFromAMQP"
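
The crawl log check in the last step can be scripted once the log has been saved to a local file. The sketch below is an illustration under assumptions: the helper names `count_markers` and `umbra_was_used` are hypothetical, and the log is assumed to be plain text. It only looks for the two strings named above and does not interpret the rest of the log line.

```python
# The two strings this procedure expects to find in the crawl log.
MARKERS = ("sentToAMQP", "receivedFromAMQP")


def count_markers(crawl_log_text):
    """Count how many log lines contain each marker string."""
    counts = {marker: 0 for marker in MARKERS}
    for line in crawl_log_text.splitlines():
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def umbra_was_used(crawl_log_text):
    """Return True only if both marker strings occur at least once."""
    return all(count > 0 for count in count_markers(crawl_log_text).values())
```

For example, read the saved log with `open(path).read()` (the path depends on where you stored the file) and pass the text to `umbra_was_used`.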

...

  1. Make a new selective (event) harvest definition with a name you can remember
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Create new harvest definition' at the bottom of the main window
    3. Fill in the Harvest name and note the name for later use (from now on referred to as EH)
    4. Choose "Once_an_hour" in the drop down list for 'Schedule'
    5. Click Save (DO NOT CLICK ACTIVATE YET)
  2. Add seeds to the selective (event) harvest
    1. Click 'Edit' in column 6 on the line with the EH
    2. Write the domain list from 'Seed list 1' given below to a file on your desktop (e.g. using Notepad)
    3. Click 'Add seeds from a file' at the bottom of the main page
    4. Click 'Browse' and pick the file with seeds that you just created
    5. Choose default_orderxml in the drop-down list for 'Harvest template' (set max objects per domain to 500, max bytes to 400.000.000, maxhops to 0, 'obey robots.txt?' unchecked and extract_javascript checked) [previously used template: frontpages]
    6. Click 'Insert'
    7. Now click 'Add seeds'
    8. Choose default_orderxml in the drop-down list for 'Harvest template'
    9. Enter the domain list from 'Seed list 2' given below (you can cut and paste from this page) and set max objects per domain to 300, max bytes to 500.000.000, maxhops to 2, 'obey robots.txt?' unchecked and extract_javascript checked [previously used template: frontpages_2levels]
    10. Click 'Insert'
    11. Click 'Save'
  3. Check that the seed lists for the domains in Seed list 1 have changed correspondingly (you have to click on Show unused configurations/seedlists show all)
    1. For each of the domains raeder.dk, netarkivet.dk do:
    2. Click 'Definitions'->'Find Domain(s)'
    3. Search for domain by writing its name as text and click 'Search'
    4. Check that there exists a configuration with the name "EH_default_orderxml_400000000Bytes_500Objects" (verify that the config has maxHops=0, obey robots unchecked, extract javascript checked)
    5. Check that there exists a seed list with the name "EH_default_orderxml_400000000Bytes_500Objects"
    6. Click 'Edit' in the line with seed list "EH_default_orderxml_400000000Bytes_500Objects"
    7. Check that the seed list shown corresponds to the seed list for the domain (see below)
    8. Check that the seed lists for the domains in Seed list 2 have changed correspondingly (you have to click on Show unused configurations/seedlists show all)
    9. For the domains kaarefc.dk, netarkivet.dk do:
    10. Click 'Definitions'->'Find Domain(s)'
    11. Search for the domain by writing this text (either kaarefc.dk or netarkivet.dk) and click Search
    12. Check that there exists a configuration with name EH_default_orderxml_500000000Bytes_300Objects (verify that the config has maxHops=2)
    13. Check that there exists a seed list with the name EH_default_orderxml_500000000Bytes_300Objects
    14. Click 'Edit' in the line with seed list EH_default_orderxml_500000000Bytes_300Objects
    15. Check that the seed list shown corresponds to the seed list for the domain (see below)
  4. Activate the harvest
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Activate' in column 5 on the line with the <eh. name>
  5. Check harvest status of the event harvest using menu "All Jobs"
    1. Click 'Harvest status'->'All Jobs' in the left menu
    2. Select "All" in "Only display job status" to the right of the menu
    3. Click the "Show" button repeatedly until the <eh. name> appears in a new job line (after approximately a minute)
    4. Check that two jobs appear and that they both have Harvest name <eh. name>
    5. Check in the menu "Running jobs" that the jobs appear and that you can reach the Heritrix GUI by clicking on the host link and logging in with the login/password "admin"/"adminPassword"; then close the window again.
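
The configuration and seed list names checked in step 3 appear to follow the pattern harvest name, template, max bytes and max objects, joined by underscores. This pattern is inferred only from the two names quoted in this procedure and is not confirmed against the NetarchiveSuite source; the helper `expected_config_name` is hypothetical. A minimal sketch for deriving the name to look for:

```python
def expected_config_name(harvest_name, template, max_bytes, max_objects):
    """Build the name to look for, following the pattern seen in step 3.

    Assumption: <harvest>_<template>_<bytes>Bytes_<objects>Objects,
    inferred from the two example names in this procedure.
    """
    return f"{harvest_name}_{template}_{max_bytes}Bytes_{max_objects}Objects"


# Values for Seed list 1 and Seed list 2 as entered in step 2.
seed_list_1_name = expected_config_name("EH", "default_orderxml", 400000000, 500)
seed_list_2_name = expected_config_name("EH", "default_orderxml", 500000000, 300)
```

If you chose a harvest name other than EH, pass that name as the first argument.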

...