Institutional Usage of NetarchiveSuite

At the KB-Denmark Netarkiv we are working on some quite radical changes to our backend architecture: replacing our ArcRepository storage with bitrepository.org software, and implementing a new mass-processing architecture, probably based on Hadoop. As part of this process we would like to know which parts of NetarchiveSuite (NAS) are actually in use at our partner institutions so that we can develop a strategy for future support.

NAS Applications

Which of the following NAS applications (services) are in use in your production environment?

| Application | Denmark | France | Austria | Spain | Sweden | Comments |
| HarvestControllerServer | y | y | y | y | y | |
| GUIWebServer | y | y | y | n | y | |
| HarvestJobManager | y | y | y | y | y | |
| ChecksumFileServer | y | n | n | n | y | |
| ViewerProxy | y | y | n | y | y | BnF: we only use the ViewerProxy to get access to warc files. Sweden: currently not working due to storage requirements. |
| WaybackIndexer | y | n | n | n | | |
| AggregationWorker | y | n | n | n | | |
| IndexServer | y | y | y | y | y | |
| ArcRepository | y | y | y | y | y | |
| BitarchiveServer | y | n | y | n | y | |
| BitarchiveMonitorServer | y | n | y | n | y | |
| AccessBitarchiveServer | y/n | n | n | n | | This is a special read-only server which is used in a specific data-extraction system in DK, outside the main Netarkivet installation. |

Plugins

Which of the following plugins are used in your production setup? Those marked with a (star) are default values set in the packaged settings file.

| Interface | Implementation | Denmark | France | Austria | Spain | Sweden |
| AbstractRemoteFile | HTTPRemoteFile | | | y | | |
| | HTTPSRemoteFile | | | | | |
| | FTPRemoteFile (star) | y | (star) | | | y |
| ActiveBitPreservation | DatabaseBasedActiveBitPreservation | | | y | | |
| | FileBasedActiveBitPreservation (star) | y | (star) | | | y |
| Admin | UpdateableAdminData | | | | | |
| | DatabaseAdmin (star) | y | (star) | y | | y |
| arcrepositoryadmin.DBSpecifics | DerbyServerSpecifics (star) | | | | | |
| | DerbyEmbeddedSpecifics | | | | | |
| | MySQLSpecifics | | | | | |
| | PostgreSQLSpecifics | y | y | y | | |
| ChecksumArchive | FileChecksumArchive (star) | y | (star) | | | y |
| | DatabaseChecksumArchive | | | | | |
| JMSConnection | JMSConnectionSunMQ (star) | y | y | y | y | y |
| ArcRepositoryClient | JMSArcRepositoryClient | y | | | | |
| | LocalArcRepositoryClient | | y | | | |
| MonitorRegistryClient | PrintMonitorRegistryClient | | | | | |
| | JMSMonitorRegistryClient (star) | y | y | y | | y |
| JobIndexCache | IndexRequestClient (star) | y | y | y | | y |
| Notifications | EMailNotifications (star) | y | y | | | |
| | PrintNotifications | | | y | | |
| FreeSpaceProvider | DefaultFreeSpaceProvider (star) | y | (star) | y | | |
| | FreeSpaceProvider | | | | | |
| | OnbFreeSpaceProvider | | | y | | |
| datamodel.DBSpecifics | DerbyServerSpecifics (star) | | | | | |
| | DerbyEmbeddedSpecifics | | | | | |
| | MySQLSpecifics | | | | | |
| | PostgreSQLSpecifics | y | y | y | | |
| JobGenerator | DefaultJobGenerator (star) | y | | y | | |
| | FixedDomainConfigurationCountJobGenerator | | y | | | y |
| ArchiveFileNaming | LegacyNamingConvention (star) | y | | y | | y |
| | CollectionPrefixNamingConvention | | y | | | |
| FrontierReportFilter | TopTotalEnqueuesFilter (star) | y | y | y | | y |
| | ExhaustedQueuesFilter | | | | | |
| | MaxSizeFrontierReportExtract | | | | | |
| | RetiredQueuesFilter | | y | | | |
| HeritrixLauncher | AbstractHeritrixLauncher (star) | y | y | y | | y |
| IHeritrixController | HeritrixController (star) | y | y | y | | y |
| HarvestReport | LegacyHarvestReport (star) | y | | y | | y |
| | BnFHarvestReport | | y | | | |
| IndexRequestServerInterface | IndexRequestServer (star) | y | y | y | | y |


Command Line Tools

Over the years, the NetarchiveSuite codebase has accumulated a lot of command line utilities. Some of these were probably developed for a single specialised use-case or for test purposes, but others may have become part of the normal workflow at the various repositories. Here is a partial list of those that look most likely to be of general interest. Please mark any of those you know of that are used as part of your workflows.

| Tool | Purpose | Denmark | France | Austria | Spain | Sweden |
| DeployApplication | Creates deploy scripts from a deploy-config | y | y | y | | n |
| HarvestdatabaseUpdateApplication | Updates HarvestDB schema | y | | y | | y |
| BuildCompleteSettings | Merges module settings files in NAS to one large global default settings file. Run as part of release process. | y | | | | n |
| GetFile | Retrieves a file via the ArcRepository interface | y | | | | n |
| GetRecord | Retrieves a (w)arc-record via the ArcRepository interface | y | | | | n |
| LoadDatabaseChecksumArchive | Migration tool from file-based checksums to database-based checksums | | n(?) | | | |
| ReestablishAdminDatabase | For reestablishing the admin database from an 'admin.data' file | | | | | |
| RunBatch | Runs a batch job from the command line | y | | | | |
| Upload | Uploads a file to the ArcRepository from the command line. (Handy for test data.) | y | | y | | |
| ReestablishAdminDatabase | Should be deprecated (question) Reads old admin.data file. | | | | | |
| ClassDependencies | Non-NAS utility (license is not ours) | | | | | |
| CreateIndex | CLI to talk to IndexServer via IndexClient | | | | | |
| RunChecksum | CLI to get all checksums from a Bitarchive (deprecated) | | n(?) | | | |
| SendDedupIndexRequestToIndexserver | Asynchronously starts a dedup indexing on an IndexServer and then exits. Tue Hejlskov Larsen, is this what you use to generate deduplication indexes? | i don't know ... | | | | |
| MakeIndex | Runs a CDX extraction on a single file in a remote ArcRepository | | | | | |
| FindRelevantCrawllogLines | Finds crawl-log lines matching a given domain name in a local metadata file | | | | | |
| JMXProxy | "This tool will simply reregister all MBeans that matches the given query from the JMX hosts read in settings, using its own platformmbeanserver. It will then wait forever." | | | | | |
| DeduplicateToCDXApplication | Extracts CDX records for deduplicate annotations from a local crawl log file | | | | | |
| ResetFailedFiles | Utility for WaybackIndexer to reset files that have failed more than 3 times so they can be retried | | n(?) | | | |
| ARCReaderUtils | Splits an arcfile (not warc) and dumps results to a directory | | | | | |
| ArcWrap | Creates an arcfile by wrapping a file | | | | | |
| ExtractCDX | Extracts CDX records, unsorted, from a list of local input arcfiles (not warcs) | | | | | |
| JMSBroker | Checks that a JMS broker (as specified in NAS settings) is up and running. | | | | | |
| WriteBytesToFile | Just creates large files full of null bytes | | | | | |
| FTPValidator | Tests whether an FTP server configuration in a NAS settings file points to a NAS-compliant FTP server. | | | | | |
| ArcMerge | Merges several arcfiles into one arcfile | | | | | |
| ArchiveExtractCDX | Extracts CDX records, unsorted, from a list of local input (w)arcfiles | | | | | |
| WARCExtractCDX | Extracts CDX records, unsorted, from a list of local input warcfiles | | | | | |
| ReformatTranslationFile | i) Reorders a translation file so keys are in the same order as a reference file, and ii) allows the encoding of the output file to be changed | | | | | |
| MailValidator | Checks the validity of a mail server configured in NAS settings by sending a test mail | | | | | |
| MakeNewMetadataFile | Creates a metadata file. For use when postprocessing fails. Is this used? | | | | | |
| FindDomainsForCrawllogExtraction | ? | | | | | |
| CheckDuplicateReduction | Validates deduplication by comparing a crawl log with a collection of arcfiles (not warc). | | | | | |
| StandaloneApplicationReduced | Creates a standalone NetarchiveSuite in a single JVM | | | | | |
| MigrateDefaultHarvestDatabase | This just initialises a SiteSection object which is supposed to upgrade the harvest database as a side-effect | | | | | |
| CreateCDXMetadataFile | Complex tool that takes a set of filenames and runs a batch job to extract the CDX records from each file and pack them in a metadata arc or warc file, one record per input file | | | | | |
| HarvesterQueueControl | Tool to count the number of messages in a given JMS queue | | | | | |
| HarvestDatabaseValidator | Validates whether you can connect to the harvest database with the settings in a given settings file | | | | | |
| HarvestTemplateApplication | Utility for uploading and updating Heritrix templates | y (in test) | | | | |
| CheckDomainCrawltraps | Runs through all domains in the harvest database and checks whether each crawlertrap regexp can validly be included as text-content in an xml document | y | | | | |
| CheckTrapsInFile | Runs through a list of crawler-trap regexes in a file and checks whether each crawlertrap regex can validly be included as text-content in an xml document | y(?) | | | | |