Under development
A number of new technologies have appeared which make it possible to replace the current archive codebase with standard components. The current archive functionality can be split into the following 3 areas:
Mass processing
Current solution
This is currently implemented through the BatchJob functionality, which has been developed in-house.
Advantage:
- Full control of source code
Disadvantages:
- All improvements have to be done with NetarchiveSuite resources.
- Somewhat unstable.
Alternative
The de facto standard platform for mass processing is Hadoop, which is used at an increasing number of web archiving institutions for analysing the stored web data. Both SB and KB have established Hadoop clusters, which are already used for processing the Netarkivet.dk archive.
Advantages:
- Hadoop is a mature standard processing platform for large datasets.
- Comes with a huge set of tools, including web archiving tools.
- Very robust and scalable.
- Enables processing resources and data to be separated (SB setup)
Disadvantage:
- Migration cost
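To make the comparison more concrete, the following is a minimal sketch of the map/reduce processing pattern that Hadoop is built around, written in plain Python with no Hadoop libraries. It only illustrates the shape of a job that might replace a BatchJob, here counting archived records per MIME type. The record layout, the sample data, and the function names (`map_record`, `reduce_counts`, `run_job`) are all assumptions made for illustration; they are not part of NetarchiveSuite or Hadoop.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample records (url, mime_type). A real job would read these
# from archive files rather than from an in-memory list.
RECORDS = [
    ("http://example.dk/", "text/html"),
    ("http://example.dk/logo.png", "image/png"),
    ("http://example.dk/about", "text/html"),
]


def map_record(record):
    """Map step: emit one (mime_type, 1) pair per archived record."""
    _url, mime_type = record
    yield (mime_type, 1)


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(records):
    """Run the map step over every record, then reduce the combined output."""
    mapped = chain.from_iterable(map_record(r) for r in records)
    return reduce_counts(mapped)


if __name__ == "__main__":
    print(run_job(RECORDS))
```

Because the map step handles each record independently, a platform such as Hadoop can spread that work across many machines and only combine results in the reduce step, which is where the scalability listed above comes from.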