Our production engineers reported that index generation for our semestrial crawl had filled up the disk space of the system temp directory.
We had configured the common.settings.tempDir property to point to a dedicated large partition, but this setting seemed to have no effect in this case.
Here is the stack trace we obtained:
Nov 7, 2011 4:28:24 PM dk.netarkivet.archive.indexserver.distribute.IndexRequestServer doGenerateIndex
WARNING: Unable to generate index for jobs [823,822,825,824]
dk.netarkivet.common.exceptions.IOFailure: Error code 2 sorting crawl log '/data/PROD_CIRCUIT_3.1.0/cache/crawllog/crawllog-823-cache'
at dk.netarkivet.common.utils.FileUtils.sortCrawlLog(FileUtils.java:1005)
at dk.netarkivet.archive.indexserver.CrawlLogIndexCache.getSortedCrawlLog(CrawlLogIndexCache.java:244)
at dk.netarkivet.archive.indexserver.CrawlLogIndexCache.indexFile(CrawlLogIndexCache.java:179)
at dk.netarkivet.archive.indexserver.CrawlLogIndexCache.combine(CrawlLogIndexCache.java:146)
at dk.netarkivet.archive.indexserver.CombiningMultiFileBasedCache.cacheData(CombiningMultiFileBasedCache.java:80)
at dk.netarkivet.archive.indexserver.CombiningMultiFileBasedCache.cacheData(CombiningMultiFileBasedCache.java:48)
at dk.netarkivet.archive.indexserver.FileBasedCache.cache(FileBasedCache.java:167)
at dk.netarkivet.archive.indexserver.distribute.IndexRequestServer.doGenerateIndex(IndexRequestServer.java:157)
at dk.netarkivet.archive.indexserver.distribute.IndexRequestServer.access$000(IndexRequestServer.java:58)
at dk.netarkivet.archive.indexserver.distribute.IndexRequestServer$1.run(IndexRequestServer.java:137)
A little investigation revealed that the IndexServer process had child processes running the Unix sort command, which by default writes its temporary files to the system temp directory (/tmp), and this caused the saturation.
The suggested fix is to add the '-T <value of common.settings.tempDir>' parameter when building the sort command within the application code.
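As an illustration of the suggested fix, here is a minimal sketch of how the sort command could be assembled with the extra '-T' option. It is not the actual NetarchiveSuite code: the class name SortCommandSketch, the methods buildSortCommand and prepare, and the useCommonTempDir flag are hypothetical names chosen for this example, and the argument order and the '-k 4b' key simply mirror the sort invocation seen in the logs quoted in the comments on this issue.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the NetarchiveSuite implementation. */
public class SortCommandSketch {

    /**
     * Builds the argument list for an external Unix sort of a crawl log.
     * When useCommonTempDir is true, "-T <tempDir>" is appended so that
     * sort writes its temporary files there instead of the system default.
     */
    static List<String> buildSortCommand(File input, File output,
                                         File tempDir, boolean useCommonTempDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("sort");
        cmd.add(input.getPath());
        cmd.add("-k");
        cmd.add("4b");               // sort key as seen in the logged command
        cmd.add("-o");
        cmd.add(output.getPath());
        if (useCommonTempDir) {
            cmd.add("-T");           // directory for sort's temporary files
            cmd.add(tempDir.getPath());
        }
        return cmd;
    }

    /** Prepares (but does not start) the process, with LANG=C as in the logs. */
    static ProcessBuilder prepare(List<String> cmd) {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.environment().put("LANG", "C");
        return pb;
    }

    public static void main(String[] args) {
        // Illustrative file names only; real paths come from the application settings.
        List<String> cmd = buildSortCommand(
                new File("crawllog-cache"), new File("sorted-crawllog"),
                new File("tmpdircommon"), true);
        ProcessBuilder pb = prepare(cmd);
        System.out.println(String.join(" ", pb.command()));
    }
}
```

Passing the command as a list of separate arguments, rather than as one shell string, avoids quoting problems if the temp directory path contains spaces.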
Colin Samuel Rosenthal, July 4, 2012 at 6:31 AM
Did a much simplified release test:
1. Created a snapshot harvest based on an earlier harvest. The IndexApplication says: FINE: Running external program: sort /home/test/QUICKSTART/cache/cdxdata/cdxdata-21-cache -o /home/test/QUICKSTART/tmpdircommon/sorted4945368293446503877cdx with environment LANG=C
2. Stopped NAS, deleted the cache, and set settings.common.unixSort.useCommonTempDir to true.
3. Restarted NAS.
4. Reran step 1. The log now says: FINE: Running external program: sort /home/test/QUICKSTART/cache/crawllog/crawllog-28-cache -k 4b -o /home/test/QUICKSTART/tmpdircommon/sorted1645591089624950991crawllog -T /home/test/QUICKSTART/tmpdircommon with environment LANG=C
This looks fine to me. I also ran the logged sort command again by hand to check its syntax.
Sr, June 29, 2012 at 12:41 PM
This update was part of revision 2287 of the NetarchiveSuite.
Sr, March 30, 2012 at 11:23 AM
This will be done after the release test of 3.19.0 is finished.