...
- Go to http://$GUIadminserver:$http-port/HarvestDefinition/ where GUIadminserver and http-port are specified in the deploy configuration file under the application named dk.netarkivet.common.webinterface.GUIApplication. In the one-machine setup (deploy_example_one_machine.xml) the link will be: http://localhost:8074
- Make a new selective harvest definition with a name you can remember
- Click 'Definitions'->'Selective Harvests' in the left menu
- Click 'Create new harvest definition' at the bottom of the main window
- Fill in the Harvest name and note the name for later use (from now on referred to as sh.name)
- Choose "Once_a_week" in the drop down list for 'Schedule'
- In the 'Enter Domain...' window add the name of a domain not already in the system (e.g. mazda.dk) and click 'Add domains'
- There should be a button "Create and add to the harvest definition" shown. Click on it.
- Click 'Save'
- Activate the selective harvest
- Click 'Activate' in column 5 on the line with sh.name
- Check that the time in the "Next Run" column on the line with sh.name is now.
- Check harvest status of the selective harvest
- Click 'Harvest status'->'All Jobs' in the left menu
- Click the "Show" button repeatedly until sh.name appears in a new job line (after approximately one minute)
- Check that the job has status "NEW". It may already have changed to status "SUBMITTED" or "STARTED" before you see it.
- Check job creation in the system status for the selective harvest
- Click 'Systemstate'->'Overview of the system state'
- Find and click 'HarvestJobManagerApplication' in the 'Application' column.
- Click 'show all' in the header.
- Check that there is a line with a message starting "INFO: Created 1 jobs for harvest definition", followed by a line with "INFO: Job #1 submitted", and later the line "INFO: Job #1 has been started by the harvester."
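The log check in the last step can also be scripted once the log text has been copied out of the system state page. The following is a minimal sketch in Python, not part of NetarchiveSuite: the function name `messages_in_order`, the constant names and the sample harvest name are hypothetical, and only the three message fragments are taken from the step above.

```python
# Message fragments from the test step, in the order they are expected.
EXPECTED_FRAGMENTS = [
    "INFO: Created 1 jobs for harvest definition",
    "INFO: Job #1 submitted",
    "INFO: Job #1 has been started by the harvester.",
]


def messages_in_order(log_text, fragments):
    """Return True if every fragment occurs in log_text, each one on a
    later line than the line that matched the previous fragment."""
    lines = log_text.splitlines()
    position = 0  # index of the first line still eligible for a match
    for fragment in fragments:
        for index in range(position, len(lines)):
            if fragment in lines[index]:
                position = index + 1
                break
        else:
            # No remaining line contains this fragment.
            return False
    return True


# Hypothetical log excerpt; 'my-test-harvest' stands in for sh.name.
SAMPLE_LOG = """\
INFO: Created 1 jobs for harvest definition 'my-test-harvest'
INFO: Job #1 submitted
INFO: Job #1 has been started by the harvester.
"""

print(messages_in_order(SAMPLE_LOG, EXPECTED_FRAGMENTS))
```

Matching on fragments rather than whole lines keeps the check independent of timestamps or other prefixes that the log viewer may add.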
2.1 Run an Umbra Harvest
Note: In the netarkivet test environment, Umbra runs on test@kb-test-har-005. If Umbra is not running, it can be started with:

    [test@kb-test-har-005 ~]$ /opt/rh/rh-python36/root/usr/bin/python3 /opt/rh/rh-python36/root/usr/bin/umbra --max-browsers 5 --executable /home/test/run-chrome.sh --url amqp://guest:guest@localhost:5672/%2f --log_config_file /home/test/logging.conf > umbra.log

- For the same domain as above, create a new Harvest Configuration using the template default_orderxml (the other available template has not been modified for use with Umbra)
- Create a Selective Harvest definition using this new configuration
- Use the Mapping functionality under Harvest Channels in the GUI to map this new configuration to the UMBRA channel
- Activate the harvest and wait for it to complete
- Check the crawl log for the completed harvest for the strings "sentToAMQP" and "receivedFromAMQP"
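The crawl log check in the last step can be automated with a few lines of Python. This is a minimal sketch under stated assumptions: the crawl log is available as a local text file, and the names `UMBRA_MARKERS`, `missing_markers` and `missing_markers_in_file` are hypothetical helpers, not part of NetarchiveSuite or Umbra. Only the two marker strings come from the step above.

```python
from pathlib import Path

# The two strings the test step asks for.
UMBRA_MARKERS = ("sentToAMQP", "receivedFromAMQP")


def missing_markers(crawl_log_text, markers=UMBRA_MARKERS):
    """Return the markers that do not occur anywhere in the given text."""
    return [marker for marker in markers if marker not in crawl_log_text]


def missing_markers_in_file(crawl_log_path, markers=UMBRA_MARKERS):
    """Read a crawl log file and return the markers it does not contain."""
    # errors="replace" avoids failing on stray non-UTF-8 bytes in the log.
    text = Path(crawl_log_path).read_text(encoding="utf-8", errors="replace")
    return missing_markers(text, markers)
```

An empty result means both strings were found; any returned marker is one the crawl log lacks.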
...
- Make a new selective (event) harvest definition with a name you can remember
- Click 'Definitions'->'Selective Harvests' in the left menu
- Click 'Create new harvest definition' at the bottom of the main window
- Fill in the Harvest name and note the name for later use (from now on referred to as EH)
- Choose "Once_an_hour" in the drop down list for 'Schedule'
- Click Save (DO NOT CLICK ACTIVATE YET)
- Add seeds to the selective (event) harvest
- Click 'Edit' in column 6 on the line with the EH
- Write the domain list from 'Seed list 1' given below to a file on your desktop (e.g. using Notepad)
- Click 'Add seeds from a file' at the bottom of the main page
- Click 'Browse' and pick the seed file you just created
- Choose default_orderxml in the drop-down list for 'Harvest template' (set max objects per domain to 500, max bytes to 400,000,000, maxhops to 0, 'obey robots.txt?' unchecked and extract_javascript checked) [previously used template frontpages]
- Click 'Insert'
- Now click 'Add seeds'
- Choose default_orderxml in the drop-down list for 'Harvest template'
- Write the domain list from 'Seed list 2' given below (you can cut and paste from this page) and set max objects per domain to 300, max bytes to 500,000,000, maxhops to 2, 'obey robots.txt?' unchecked and extract_javascript checked [previously used template frontpages_2levels]
- Click 'Insert'
- Click 'Save'
- Check that the seed lists for the domains in Seed list 1 have changed correspondingly (you have to click on Show unused configurations/seedlists show all)
- For each of the domains raeder.dk, netarkivet.dk do:
- Click 'Definitions'->'Find Domain(s)'
- Search for domain by writing its name as text and click 'Search'
- Check that there exists a configuration with the name "EH_default_orderxml_400000000Bytes_500Objects" (verify that the config has maxHops=0, obey robots unchecked, extract javascript checked)
- Check that there exists a seed list with the name "EH_default_orderxml_400000000Bytes_500Objects"
- Click 'Edit' in the line with seed list "EH_default_orderxml_400000000Bytes_500Objects"
- Check that the seed list shown corresponds to the seed list for the domain (see below)
- Check that the seed lists for the domains in Seed list 2 have changed correspondingly (you have to click on Show unused configurations/seedlists show all)
- For the domains kaarefc.dk, netarkivet.dk do:
- Click 'Definitions'->'Find Domain(s)'
- Search for the domain by writing this text (either kaarefc.dk or netarkivet.dk) and click Search
- Check that there exists a configuration with name EH_default_orderxml_500000000Bytes_300Objects (verify that the config has maxHops=2)
- Check that there exists a seed list with the name EH_default_orderxml_500000000Bytes_300Objects
- Click 'Edit' in the line with seed list EH_default_orderxml_500000000Bytes_300Objects
- Check that the seed list shown corresponds to the seed list for the domain (see below)
- Activate the harvest
- Click 'Definitions'->'Selective Harvests' in the left menu
- Click 'Activate' in column 5 on the line with the EH
- Check harvest status of the event harvest using menu "All Jobs"
- Click 'Harvest status'->'All Jobs' in the left menu
- Select "All" in "Only display job status" to the right of the menu
- Click the "Show" button repeatedly until the EH appears in a new job line (after approximately one minute)
- Check that two jobs appear and that they both have Harvest name EH
- In the menu "Running jobs", check that the jobs appear and that you can go to the Heritrix GUI by clicking on the host link and using the login/password "admin"/"adminPassword". Close the window again afterwards.
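The configuration and seed list names checked in the steps above appear to follow the pattern harvest name, template, max bytes and max objects, joined with underscores. This pattern is inferred only from the two names given in this test, not from the NetarchiveSuite source, so treat the following Python sketch as an assumption. The function name `expected_config_name` is hypothetical, and "EH" stands in for the harvest name you actually chose.

```python
def expected_config_name(harvest_name, template, max_bytes, max_objects):
    """Build the name expected for the generated configuration and seed list.

    The pattern is inferred from the two example names in this test and is
    not confirmed against the NetarchiveSuite source.
    """
    return f"{harvest_name}_{template}_{max_bytes}Bytes_{max_objects}Objects"


# Settings used for Seed list 1 and Seed list 2 in the steps above.
print(expected_config_name("EH", "default_orderxml", 400000000, 500))
print(expected_config_name("EH", "default_orderxml", 500000000, 300))
```

Computing the names this way makes it easier to search for them on the domain page when the harvest name is long.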
...