...

  • Click 'Systemstate'->'Overview of the system state'. (At first, only the 'SystemState' link is visible; it leads to the same script, viz. Status/Monitor-JMXsummary.jsp.)
  • Check that all internally developed applications are up and running. This depends on the configuration, of course, but there should be
    • One GUIApplication
    • One HarvestJobManagerApplication
    • One or more HarvestControllerServer applications. One of these should be HIGHPRIORITY if you need to run selective/event harvests, and/or LOWPRIORITY if you need to run snapshot harvests
    • One IndexServerApplication
    • One or more ViewerProxy applications
    • One WaybackIndexerApplication (optional)
    • One AggregatorApplication (Optional)
    • If you're using the distributed archive solution (i.e. using the JMSArcrepositoryClient and not the LocalArcRepositoryClient or BitmagArcRepositoryClient), there should also be
      • One ArcRepositoryApplication
      • One or more BitarchiveServer applications per bitarchive replica in your configuration
      • One ChecksumFileApplication representing a checksum replica (if you have a checksum replica)
      • One BitarchiveMonitorServer for each bitarchive replica in your configuration
  • Check that the last status message for each application does not contain errors or warnings
  • Check that there are no empty log messages
  • Click on a physical location in the 'Location' column, e.g. "K" (you might need to add "location" to the shown columns by clicking on "Location" in the "show:" line just above the column headers)
  • Check that you now only see relevant SW applications for the chosen location
  • Click 'Show all' in the 'Location' header
  • Check that you return to the full listing again
  • Repeat the above 4 steps for "Machine", "HTTP Port", "Application", "Instance Id", "Priority", "Use Replica" and "Index" (Index shows all log lines in the given application log)
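
The application counts in the checklist above can also be checked mechanically once the Application column has been copied out of the overview. The following is a minimal sketch only: the `EXPECTED` table and the function name are hypothetical, the input is assumed to be a plain list of application names collected by hand, and the optional and distributed-archive applications are left out because they depend on the configuration.

```python
from collections import Counter

# Hypothetical expectations taken from the checklist above:
# application name -> (minimum count, maximum count or None for no upper limit).
EXPECTED = {
    "GUIApplication": (1, 1),
    "HarvestJobManagerApplication": (1, 1),
    "HarvestControllerServer": (1, None),
    "IndexServerApplication": (1, 1),
    "ViewerProxy": (1, None),
}


def check_applications(observed):
    """Return a list of problems; an empty list means every count is in range."""
    counts = Counter(observed)
    problems = []
    for name, (low, high) in EXPECTED.items():
        n = counts.get(name, 0)
        if n < low:
            problems.append(f"{name}: expected at least {low}, found {n}")
        elif high is not None and n > high:
            problems.append(f"{name}: expected at most {high}, found {n}")
    return problems
```

Calling `check_applications` with the names seen in the overview gives a quick first pass; the status messages and log lines still have to be inspected in the GUI.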

The Instance ID column in the System overview GUI is a technical suffix to the Application column that distinguishes multiple applications of the same type on the same server. If there is only one application on a server, the column is normally empty. If there is more than one application of the same type on the same server, a suffix (the Instance ID) must be added. It is user-defined (in the deploy script) and must be unique.
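
The uniqueness rule can be illustrated with a short sketch. It assumes the Machine, Application and Instance Id columns have been copied from the overview into tuples; the function name and the row format are hypothetical and not part of NetarchiveSuite.

```python
from collections import defaultdict


def find_instance_id_conflicts(rows):
    """Find applications that cannot be told apart on the same machine.

    rows: iterable of (machine, application, instance_id) tuples, where an
    empty or missing instance id is given as "" or None.
    Returns a sorted list of (machine, application, instance_id) keys that
    occur more than once.
    """
    seen = defaultdict(int)
    for machine, application, instance_id in rows:
        seen[(machine, application, instance_id or "")] += 1
    return sorted(key for key, count in seen.items() if count > 1)
```

An empty result means every application of the same type on the same machine carries its own Instance ID.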

  • Click 'Systemstate' -> 'Overview of the system state'
  • Check that you are back to the full overview with log line 0 (ignore any spurious messages from "GUIWebServer")
  • Check that there are no empty log messages

Check that basic database data is present

...

  1. For the same domain as above create a new Harvest Configuration using template default_orderxml (the other template available has not been modified for use with Umbra)
  2. Create a Selective Harvest definition using this new configuration
  3. Use the Mapping functionality under Harvest Channels in the GUI to map this new configuration to the UMBRA channel
  4. Activate the harvest and wait for it to complete
  5. Check the crawl log for the completed harvest for the strings "sentToAMQP" and "receivedFromAMQP"
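
The check in step 5 can be sketched as a small script for a crawl log that has been downloaded to disk. This is an illustration only: the function names are hypothetical, and the location of the crawl log file is whatever your installation provides.

```python
from pathlib import Path

# The two strings that step 5 asks for.
REQUIRED_MARKERS = ("sentToAMQP", "receivedFromAMQP")


def missing_markers(crawl_log_text, markers=REQUIRED_MARKERS):
    """Return the markers that do not occur anywhere in the crawl log text."""
    return [marker for marker in markers if marker not in crawl_log_text]


def check_crawl_log(path):
    """Read a crawl log file and return the list of missing markers."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return missing_markers(text)
```

An empty list from `check_crawl_log` means both strings were found somewhere in the log.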

...

  1. Make a new selective (event) harvest definition with a name you can remember
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Create new harvestdefinition' at the bottom of the main window
    3. Fill in the Harvest name and note the name for later use (from now referred as EH)
    4. Choose 'Once_an_hour' in the drop-down list for 'Schedule'
    5. Click Save (DO NOT CLICK ACTIVATE YET)
  2. Add seeds to the selective (event) harvest
    1. Click 'Edit' in column 6 on the line with the EH
    2. Write the domain list from 'Seed list 1' given below to a file on your desktop (e.g. using Notepad)
    3. Click 'Add seeds from a file' at the bottom of the main page
    4. Click 'Browse' and pick the file with seeds you just created
    5. Choose default_orderxml in the drop-down list for 'Harvest template' (set max objects per domain to 500, max bytes to 400.000.000, maxhops to 0, obey robots.txt? unchecked and extract_javascript checked) [previously used template frontpages]
    6. Click 'Insert'
    7. Now click 'Add seeds'
    8. Choose default_orderxml in the drop-down list for 'Harvest template'
    9. Write the domain list from 'Seed list 2' given below (you can cut and paste from this page) (set max objects per domain to 300, max bytes to 500.000.000, maxhops to 2, obey robots.txt? unchecked and extract_javascript checked) [previously used template frontpages_2levels]
    10. Click 'Insert'
    11. Click 'Save'
  3. Check that seed lists for domains in Seed list 1 have changed correspondingly (You have to click on Show unused configurations/seedlists show all)
    1. For each of the domains raeder.dk, netarkivet.dk do:
    2. Click 'Definitions'->'Find Domain(s)'
    3. Search for domain by writing its name as text and click 'Search'
    4. Check that there exists a configuration with the name "EH_default_orderxml_400000000Bytes_500Objects" (verify that the config has maxHops=0, obey robots unchecked, extract javascript checked)
    5. Check that there exists a seed list with the name "EH_default_orderxml_400000000Bytes_500Objects"
    6. Click 'Edit' in the line with seed list "EH_default_orderxml_400000000Bytes_500Objects"
    7. Check that the seed list shown corresponds to the seed list for the domain (see below)
    8. Check that seed lists for domains in Seed list 2 have changed correspondingly (you have to click on Show unused configurations/seedlists show all)
    9. For the domains kaarefc.dk, netarkivet.dk do:
    10. Click 'Definitions'->'Find Domain(s)'
    11. Search for the domain by writing this text (either kaarefc.dk or netarkivet.dk) and click Search
    12. Check that there exists a configuration with name EH_default_orderxml_500000000Bytes_300Objects (verify that the config has maxHops=2)
    13. Check that there exists a seed list with the name EH_default_orderxml_500000000Bytes_300Objects
    14. Click 'Edit' in the line with seed list EH_default_orderxml_500000000Bytes_300Objects
    15. Check that the seed list shown corresponds to the seed list for the domain (see below)
  4. Activate the harvest
    1. Click 'Definitions'->'Selective Harvests' in the left menu
    2. Click 'Activate' in column 5 on the line with the <eh. name>
  5. Check harvest status of the event harvest using menu "All Jobs"
    1. Click 'Harvest status'->'All Jobs' in the left menu
    2. Select "All" in "Only display job status" to the right of the menu
    3. Click the "Show" button, until the <eh. name> appears in a new job line (approx. after a minute)
    4. Check that two jobs appear and that they both have Harvest name <eh. name>
    5. Check in the menu "Running jobs" that the jobs appear and that you can go to the Heritrix GUI (H3 Remote Access) by clicking on the host link and using the login/password "admin"/"adminPassword". Monitor the jobs' progress, e.g. by viewing the cached crawl log, and close the window again.
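
The configuration and seed list names checked in step 3 appear to follow a fixed pattern built from the harvest name, the template, the byte limit and the object limit. The sketch below reproduces that pattern so the expected name can be computed for any harvest name. The pattern is inferred only from the two names quoted in this checklist, and the function name is hypothetical.

```python
def expected_config_name(harvest_name, template, max_bytes, max_objects):
    """Build the name this checklist expects for a generated configuration.

    Assumed pattern (inferred from the examples in step 3, not from the
    NetarchiveSuite source): <harvest>_<template>_<bytes>Bytes_<objects>Objects
    """
    return f"{harvest_name}_{template}_{max_bytes}Bytes_{max_objects}Objects"
```

For a harvest named EH with Seed list 1 settings, `expected_config_name("EH", "default_orderxml", 400000000, 500)` yields the name quoted in step 3.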

Seed list 1 (Harvest template "default_orderxml", maxhops=0, extract_javascript=true, robots.txt=ignore, max objects=500; max bytes=400.000.000):

...