Hadoop-based processing
Following on from our experience in Sprint 4 (March 2019), we have concluded that a generic processing interface encompassing both the existing (batch) and future (hadoop) processing would necessarily be over-engineered. It therefore makes more sense to adopt a pragmatic approach:
- Many batch jobs will simply cease to be relevant when bitpreservation is moved to bitmagasinet
- Others, e.g. FileListBatchJob, will be replaceable by bitmagasinet API calls
- The few remaining cases are all associated with indexing of one kind or another and should be manageable on a case-by-case basis.
There are two workflows to be considered:
- Wayback CDX generation. This is (currently) just a simple extraction of CDX lines per file, after which sorting and merging are carried out locally by a separate application. The current approach scales just fine, so there is no need to deploy the full power of map-reduce on it. A simple hadoop job could be developed that works on one warcfile at a time and extracts its CDX (see the sketch after this list). WaybackIndexerApplication should talk directly to hadoop.
- Deduplication and Viewerproxy indexes. Batch calls to get data for these are generated by the IndexRequestServer, a separate application which is responsible for caching index requests. As in the CDX case, the current code extracts records per file over batch and combines them locally; see e.g. the complex lucene-magic in CrawlLogIndexCache.combine(). This approach does not scale well, and it would be natural to reimplement it as a proper map-reduce job. (It is possible that such a job already exists; we should check with our international collaborators in case they are ahead of us.) The place to fork the execution pathway is the call to CombiningMultiFileBasedCache.cacheData(Set<T> ids), which takes a set of job ids and indexes them to a single file. The current implementation loops over the set of ids. A better implementation would resolve the ids to the list of warcfiles to be indexed and let hadoop do the whole thing (a sketch of this fork follows the flag discussion below).
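To make the CDX case concrete, here is a minimal sketch of such a per-file job, assuming the standard Hadoop MapReduce API: the input is a text file listing one warcfile path per line, the job is map-only, and extractCdxLines() is a hypothetical placeholder for whatever CDX-extraction library (e.g. jwat) we end up wiring in. Class and method names are illustrative only.

    import java.io.IOException;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /** Map-only job: each input line names one warcfile; the mapper emits its CDX lines. */
    public class CdxExtractionJob {

        public static class CdxMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text warcPath, Context context)
                    throws IOException, InterruptedException {
                // One input line = one warcfile; emit its CDX lines directly, no reducer needed.
                for (String cdxLine : extractCdxLines(new Path(warcPath.toString().trim()))) {
                    context.write(NullWritable.get(), new Text(cdxLine));
                }
            }

            /** Hypothetical helper: plug the actual CDX extraction (e.g. jwat) in here. */
            private List<String> extractCdxLines(Path warcFile) throws IOException {
                return Collections.emptyList(); // placeholder
            }
        }

        /** Builds the job; sorting and merging of the output stays in the local application for now. */
        public static Job createJob(Configuration conf, Path fileList, Path outputDir) throws IOException {
            Job job = Job.getInstance(conf, "cdx-extraction");
            job.setJarByClass(CdxExtractionJob.class);
            job.setMapperClass(CdxMapper.class);
            job.setNumReduceTasks(0); // map-only
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, fileList);
            FileOutputFormat.setOutputPath(job, outputDir);
            return job;
        }
    }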
In both cases, rather than using generic interfaces we should be pragmatic and use alternative pathways of execution selected by simple flags. We can go a long way with just
    if (Settings.getBoolean(CommonSettings.USE_HADOOP_FLAG)) {
        // ... do something including one or more hadoop calls
    } else {
        // ... do something else including one or more batch calls
    }
This has the added huge advantage that, with a suitable default value, the change is intrinsically backwards compatible for our other current users.
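The same flag can drive the fork in cacheData() described above. The following is only a sketch under stated assumptions: USE_HADOOP_FLAG is the setting proposed above, while resolveIdsToWarcFiles(), cacheFileFor(), runHadoopIndexingJob() and cacheDataWithBatch() are hypothetical helpers standing in for code that does not exist yet (or, in the batch case, for the existing per-id loop).

    import java.io.File;
    import java.util.List;
    import java.util.Set;

    import dk.netarkivet.common.CommonSettings;
    import dk.netarkivet.common.utils.Settings;

    /** Sketch of how the id-set-to-index-file step could fork between batch and hadoop. */
    public abstract class IndexCacheSketch<T extends Comparable<T>> {

        protected void cacheData(Set<T> ids) {
            if (Settings.getBoolean(CommonSettings.USE_HADOOP_FLAG)) {
                // Resolve the job ids to the warcfiles they cover and index them all in one hadoop job.
                List<File> warcFiles = resolveIdsToWarcFiles(ids);   // hypothetical helper
                runHadoopIndexingJob(warcFiles, cacheFileFor(ids));  // hypothetical helper
            } else {
                // Existing behaviour: loop over the ids, run a batch job per file, combine locally.
                cacheDataWithBatch(ids);                             // stands in for the current loop
            }
        }

        protected abstract List<File> resolveIdsToWarcFiles(Set<T> ids);
        protected abstract File cacheFileFor(Set<T> ids);
        protected abstract void runHadoopIndexingJob(List<File> warcFiles, File targetIndex);
        protected abstract void cacheDataWithBatch(Set<T> ids);
    }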
This also means that our final implementation of ArcRepositoryClient will not implement the batch() method - or rather, it will have a null-implementation that always throws an exception. Our task is then to create a fully functional NetarchiveSuite which never triggers this exception, because we have removed all batch calls.
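A possible shape for that null-implementation, assuming the batch() signature currently declared on ArcRepositoryClient (BatchStatus batch(FileBatchJob, String, String...)); the class name is purely illustrative.

    import dk.netarkivet.common.distribute.arcrepository.BatchStatus;
    import dk.netarkivet.common.utils.batch.FileBatchJob;

    /** Fragment of the future client: batch() survives only as a guard that nothing still calls it. */
    public class HadoopArcRepositoryClientSketch {

        public BatchStatus batch(FileBatchJob job, String replicaId, String... args) {
            throw new UnsupportedOperationException(
                    "batch() is no longer supported: use bitmagasinet API calls or hadoop jobs instead.");
        }
    }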