...

  • Netarchive (Tue):

    We are on track again. Indexing for the broad crawl is now parallelized, and the total start-up time
    for the broad crawl, including creation of an 80 GB deduplication index, was only about 24 hours
    without any manual intervention (in 3.16 it took 4-5 days).

    Be aware that the new index creation method places a heavy load on the tmpdircommon folder during sorting.

    We had 24 broad crawl harvesters and 33 selective harvesters active during start-up (no single low prio
    harvester).

    What were the main problems during start-up:

    1) Every low prio harvester ran out of Java heap space and died after it got the index.
       It seems that the new parallelized broad crawl index demands more memory for the Heritrix processes.
       Fix: increased memory to 3 GB per Heritrix instance in the local settings.xml file
       <harvester>
         <harvesting>
           <heritrix>
             <heapSize>2936M</heapSize>
           </heritrix>
         </harvesting>
       </harvester>

       on each 64-bit server, and shut down all four 32-bit harvesters.

    2) Harvesters continuously started, ran, and failed, with log spam about trying to generate or find a new
       index even though the index was in place and ready, until no more jobs were left in the queue.
       Fix: in each harvester's cache/DEDUP_CRAWL_LOG/ folder, the newly requested index name was created as a link to the already generated index, e.g.
       ln -s 127261-127262-127263-127264-571d26967e066b5a3ccbf384c937d74d-cache 127268-127269-127270-127271-27f47726643b267f48a0368d21f7a0fe-cache
           (ln -s takes the existing target first: the first folder contains the generated Lucene index, and the last is the newly requested name)
       and all jobs were resubmitted.

    3) Selective harvests wait for their index until the broad crawl index is finished.
       Fix: no fix currently.

    4) The Running jobs GUI was out of sync with the jobs actually running.
       Fix: used SVC's ad hoc Java tool to delete zombie "Running jobs" entries.
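
    The link workaround from problem 2 could be scripted roughly as follows when it has to be
    repeated on many harvesters. This is only a minimal sketch, not the procedure that was actually used:
    the function name link_requested_index and its parameters are hypothetical, and it assumes that the
    generated index folder already exists in the harvester's cache/DEDUP_CRAWL_LOG/ folder.

```python
import os


def link_requested_index(cache_dir, requested_name, existing_name):
    """Make requested_name a symlink to existing_name inside cache_dir.

    Hypothetical helper: cache_dir stands for a harvester's
    cache/DEDUP_CRAWL_LOG/ folder, existing_name for the folder that
    holds the generated index, requested_name for the name the jobs ask for.
    Returns True if a link was created, False if requested_name already exists.
    """
    target = os.path.join(cache_dir, existing_name)
    link = os.path.join(cache_dir, requested_name)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"generated index not found: {target}")
    if os.path.lexists(link):
        # Never overwrite an existing cache entry or an earlier link.
        return False
    # Relative target, like running `ln -s existing requested` inside cache_dir.
    os.symlink(existing_name, link)
    return True
```

    The relative link target keeps the link valid if the cache folder is moved as a whole, and the
    existence check makes the helper safe to run more than once per harvester.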


    What were our main problems during the upgrade to 3.18 in production:

    1) Corrupt indexes in the Derby admin database.
       Fix: recreated the indexes.

    2) Very slow new lookup table in the admin database.
       Fix: reconfigured the lookup table to one with only 1 record.

...