Documentation and implementation notes.

Overview

The goal of the JWAT library was to provide a small package for reading and validating ARC, GZip and WARC files.

All the parsers were implemented on the premise that input data would be supplied in the form of streams and not files.

Parsing and validating a file is therefore a sequential operation in which each record and its payload are read only once.

The same applies when parsing or validating compressed ARC/WARC files in which each record is GZip'ed: the compressed data can likewise only be processed sequentially.

It is, however, possible to randomly access individual ARC/GZip/WARC records when working with the logical files and a known file offset. This is done with the simple RandomAccessFileInputStream class in the common package.
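The idea of reaching a single GZip'ed record through a file offset can be sketched with the Java standard library alone. This is a minimal illustration, not JWAT's implementation: it does not use RandomAccessFileInputStream or any other JWAT class, and the class name GzipMemberSeek, the methods writeMembers and readMember, and the sample records are all hypothetical. It assumes the start offset of every record is already known, for example from an index built during an earlier sequential pass.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only; not part of JWAT.
class GzipMemberSeek {

    // Writes each record as its own gzip member, back to back in one file.
    // Returns records.length + 1 offsets: the start of every member plus
    // the end of the file, so that member i spans offsets[i]..offsets[i + 1].
    static long[] writeMembers(Path file, String[] records) throws IOException {
        long[] offsets = new long[records.length + 1];
        long pos = 0;
        try (OutputStream out = Files.newOutputStream(file)) {
            for (int i = 0; i < records.length; i++) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                    gz.write(records[i].getBytes(StandardCharsets.UTF_8));
                }
                offsets[i] = pos;
                out.write(buf.toByteArray());
                pos += buf.size();
            }
        }
        offsets[records.length] = pos;
        return offsets;
    }

    // Seeks to one member and decompresses only the bytes in [start, end).
    static String readMember(Path file, long start, long end) throws IOException {
        byte[] compressed = new byte[(int) (end - start)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(start);
            raf.readFully(compressed);
        }
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("members", ".gz");
        try {
            String[] records = {"record-one", "record-two", "record-three"};
            long[] offsets = writeMembers(file, records);
            // Jump straight to the second member without reading the first.
            System.out.println(readMember(file, offsets[1], offsets[2]));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Bounding the read to a single member matters in this sketch because java.util.zip.GZIPInputStream continues into any following gzip member when more input is available, so handing it the rest of the file would return every later record as well.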

Package layout

This toolkit includes the following packages:

  • jwat-common: General-purpose classes, including specialized streams, binary-to-string encoding and common ARC/WARC HTTP response/payload code.
  • jwat-gzip: GZip input stream/entry reader and validator.
  • jwat-arc: ARC reader/validator specific classes.
  • jwat-warc: WARC reader/validator specific classes.