...
- Make a new selective (event) harvest definition with a name you can remember
- Click 'Definitions'->'Selective Harvests' in the left menu
- Click 'Create new harvestdefinition' at the bottom of the main window
- Fill in the Harvest name and note the name for later use (referred to as EH from now on)
- Choose 'Once_an_hour' in the drop-down list for 'Schedule'
- Click Save (DO NOT CLICK ACTIVATE YET)
- Add seeds to the selective (event) harvest
- Click 'Edit' in column 6 on the line with the EH
- Write the domain list from 'Seed list 1' given below to a file on your desktop (e.g. using Notepad)
- Click 'Add seeds from a file' at the bottom of the main page
- Click 'Browse' and pick the file you just created with the seeds
- Choose default_orderxml in the drop-down list for 'Harvest template' (set maxobjects per domain to 500, max bytes to 400.000.000, maxhops to 0, 'obey robots.txt?' unchecked and extract_javascript checked) [previously used template frontpages]
- Click 'Insert'
- Now click 'Add seeds'
- Choose default_orderxml in the drop-down list for 'Harvest template'
- Enter the domain list from 'Seed list 2' given below (you can cut and paste from this page) and set maxobjects per domain to 300, max bytes to 500.000.000, maxhops to 2, 'obey robots.txt?' unchecked and extract_javascript checked [previously used template frontpages_2levels]
- Click 'Insert'
- Click 'Save'
- Check that the seed lists for the domains in Seed list 1 have changed correspondingly (you have to click on 'Show unused configurations/seedlists' to show all)
- For each of the domains raeder.dk, netarkivet.dk do:
- Click 'Definitions'->'Find Domain(s)'
- Search for domain by writing its name as text and click 'Search'
- Check that there exists a configuration with the name "EH_default_orderxml_400000000Bytes_500Objects"
- Check that there exists a seed list with the name "EH_default_orderxml_400000000Bytes_500Objects"
- Click 'Edit' in the line with seed list "EH_default_orderxml_400000000Bytes_500Objects"
- Check that the seed list shown corresponds to the seed list for the domain (see below)
- Check that the seed lists for the domains in Seed list 2 have changed correspondingly (you have to click on 'Show unused configurations/seedlists' to show all)
- For the domains kaarefc.dk, netarkivet.dk do:
- Click 'Definitions'->'Find Domain(s)'
- Search for the domain by writing this text (either kaarefc.dk or netarkivet.dk) and click Search
- Check that there exists a configuration with the name "EH_default_orderxml_500000000Bytes_300Objects"
- Check that there exists a seed list with the name "EH_default_orderxml_500000000Bytes_300Objects"
- Click 'Edit' in the line with seed list "EH_default_orderxml_500000000Bytes_300Objects"
- Check that the seed list shown corresponds to the seed list for the domain (see below)
- Activate the harvest
- Click 'Definitions'->'Selective Harvests' in the left menu
- Click 'Activate' in column 5 on the line with the <eh. name>
- Check harvest status of the event harvest using menu "All Jobs"
- Click 'Harvest status'->'All Jobs' in the left menu
- Select "All" in "Only display job status" to the right of the menu
- Click the "Show" button repeatedly until the <eh. name> appears in a new job line (after approx. a minute)
- Check that two jobs appear and that they both have Harvest name <eh. name>
- Check in the menu "Running jobs" that the jobs appear and that you can go to the Heritrix GUI by clicking on the host link and logging in with the login/password "admin"/"adminPassword"; then close the window again.
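The configuration and seed list names checked in the steps above appear to follow the pattern harvest name, template name, byte limit, object limit. The following is a minimal sketch for computing the name to look for, assuming that pattern holds; it is inferred from the names in this procedure only, and the function `expected_config_name` is a hypothetical helper, not part of the harvester.

```python
def expected_config_name(harvest_name: str, template: str,
                         max_bytes: int, max_objects: int) -> str:
    """Build the configuration/seed list name a tester should look for.

    Assumption: names follow <harvest>_<template>_<bytes>Bytes_<objects>Objects,
    as suggested by the names listed in this test procedure.
    """
    return f"{harvest_name}_{template}_{max_bytes}Bytes_{max_objects}Objects"


if __name__ == "__main__":
    # Settings used for Seed list 1 and Seed list 2 in this procedure.
    print(expected_config_name("EH", "default_orderxml", 400_000_000, 500))
    print(expected_config_name("EH", "default_orderxml", 500_000_000, 300))
```

Replace "EH" with the harvest name you noted when creating the definition.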
...
http://www.raeder.dk/
Seed list "<eh. name>_default_orderxml_400000000Bytes_500Objects" for domain localtimes.info
http://localtimes.info/Europe/Denmark/Copenhagen/
Seed list "<eh. name>_default_orderxml_400000000Bytes_500Objects" for domain netarkivet.dk
http://netarkivet.dk/in-english/
http://netarkivet.dk/adgang/
Seed list "<eh. name>_default_orderxml_500000000Bytes_300Objects" for domain netarkivet.dk
http://netarkivet.dk/in-english/
Seed list "<eh. name>_default_orderxml_500000000Bytes_300Objects" for domain kaarefc.dk
...