External sort of index saturates disk

Description

Our production engineers reported that index generation for our semiannual crawl had filled the disk partition that holds the system temp directory.

We had configured the common.settings.tempDir property to point to a dedicated large partition, but the setting appeared to have no effect in this case.

Here is the stack trace we obtained:

Nov 7, 2011 4:28:24 PM dk.netarkivet.archive.indexserver.distribute.IndexRequestServer doGenerateIndex
WARNING: Unable to generate index for jobs [823,822,825,824]
dk.netarkivet.common.exceptions.IOFailure: Error code 2 sorting crawl log '/data/PROD_CIRCUIT_3.1.0/cache/crawllog/crawllog-823-cache'
at dk.netarkivet.common.utils.FileUtils.sortCrawlLog(FileUtils.java:1005)
at dk.netarkivet.archive.indexserver.CrawlLogIndexCache.getSortedCrawlLog(CrawlLogIndexCache.java:244)
at dk.netarkivet.archive.indexserver.CrawlLogIndexCache.indexFile(CrawlLogIndexCache.java:179)
at dk.netarkivet.archive.indexserver.CrawlLogIndexCache.combine(CrawlLogIndexCache.java:146)
at dk.netarkivet.archive.indexserver.CombiningMultiFileBasedCache.cacheData(CombiningMultiFileBasedCache.java:80)
at dk.netarkivet.archive.indexserver.CombiningMultiFileBasedCache.cacheData(CombiningMultiFileBasedCache.java:48)
at dk.netarkivet.archive.indexserver.FileBasedCache.cache(FileBasedCache.java:167)
at dk.netarkivet.archive.indexserver.distribute.IndexRequestServer.doGenerateIndex(IndexRequestServer.java:157)
at dk.netarkivet.archive.indexserver.distribute.IndexRequestServer.access$000(IndexRequestServer.java:58)
at dk.netarkivet.archive.indexserver.distribute.IndexRequestServer$1.run(IndexRequestServer.java:137)

A little investigation revealed that the IndexServer process had child processes running the Unix sort command. By default, sort writes its temporary files to the system temp directory (/tmp), which caused the saturation.

The suggested fix is to add the '-T <value of common.settings.tempDir>' parameter when building the sort command in the application code.
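The suggested fix can be sketched roughly as follows. This is a minimal illustration and not the actual NetarchiveSuite code: the class name SortCommandSketch, the methods buildSortCommand and runSort, and the example paths are all hypothetical. It only assumes that the external sort accepts -o for the output file and -T for the temporary directory, and that the process runs with LANG=C.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and paths are illustrative assumptions,
// not the NetarchiveSuite implementation.
class SortCommandSketch {

    /**
     * Builds the argument list for an external Unix sort.
     * If tempDir is null, no -T option is added and sort falls back
     * to its default temporary directory.
     */
    static List<String> buildSortCommand(File input, File output, File tempDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("sort");
        cmd.add(input.getPath());
        cmd.add("-o");
        cmd.add(output.getPath());
        if (tempDir != null) {
            // Direct sort's temporary files to the configured directory.
            cmd.add("-T");
            cmd.add(tempDir.getPath());
        }
        return cmd;
    }

    /** Runs the command with LANG=C and returns the exit code of sort. */
    static int runSort(List<String> cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("LANG", "C");
        pb.inheritIO();
        return pb.start().waitFor();
    }

    public static void main(String[] args) {
        // Example paths are placeholders for the cache file and the configured temp dir.
        List<String> cmd = buildSortCommand(
                new File("/data/cache/crawllog-cache"),
                new File("/data/bigtmp/sorted-crawllog"),
                new File("/data/bigtmp"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing null for the temporary directory keeps the old behaviour, so the change could be guarded by a configuration switch.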

Activity

Colin Samuel Rosenthal, July 4, 2012 at 6:31 AM

Did a much simplified release test:

  1. Created a snapshot harvest based on an earlier harvest. The IndexApplication says: FINE: Running external program: sort /home/test/QUICKSTART/cache/cdxdata/cdxdata-21-cache -o /home/test/QUICKSTART/tmpdircommon/sorted4945368293446503877cdx with environment LANG=C

  2. Stopped NAS, deleted cache, set settings.common.unixSort.useCommonTempDir to true

  3. Restarted NAS

  4. Reran step 1. The log now says FINE: Running external program: sort /home/test/QUICKSTART/cache/crawllog/crawllog-28-cache -k 4b -o /home/test/QUICKSTART/tmpdircommon/sorted1645591089624950991crawllog -T /home/test/QUICKSTART/tmpdircommon with environment LANG=C

This looks fine to me. I also ran the logged sort command again by hand to check its syntax.

Sr, June 29, 2012 at 12:41 PM

This update was part of revision 2287 of NetarchiveSuite.

Sr, March 30, 2012 at 11:23 AM

This will be done after the release test of 3.19.0 is finished.

Sr, March 30, 2012 at 10:28 AM

The file

needs to be updated as well.

Fixed

Details

Organization: BNF
Accuracy of estimate: Rough
Time tracking: 2h logged

Created November 9, 2011 at 1:23 PM
Updated July 4, 2012 at 9:45 AM
Resolved July 4, 2012 at 9:45 AM