
Documentation and implementation notes.

...

The goal of the JWAT library was to provide a small package for reading and validating WARC, ARC and GZip files.

All the parsers were implemented on the premise that input data would be supplied in the form of streams and not files.

Parsing and validating a file is therefore a sequential operation in which each record and its payload are read only once.
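The read-once, stream-based model can be illustrated with a minimal sketch. It uses only the Java standard library and is not the JWAT API: the class and method names (`SequentialWarcSketch`, `payloadLengths`) are hypothetical, and it assumes well-formed input in which every record has a version line, header fields with a `Content-Length`, a blank line, the payload, and two trailing CRLFs. It performs no validation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of JWAT.
public class SequentialWarcSketch {

    /** Reads one header line up to LF, dropping CR; returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                break;
            }
            if (b != '\r') {
                buf.write(b);
            }
        }
        if (b == -1 && buf.size() == 0) {
            return null;
        }
        return new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    /** Walks the stream once, front to back, and returns each record's payload length. */
    static List<Long> payloadLengths(InputStream in) throws IOException {
        List<Long> lengths = new ArrayList<>();
        String line;
        while ((line = readLine(in)) != null) {
            if (line.isEmpty()) {
                continue; // blank lines that terminate the previous record
            }
            // 'line' is now the version line (e.g. "WARC/1.0"); header fields follow.
            long contentLength = 0;
            while ((line = readLine(in)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                    contentLength = Long.parseLong(line.substring(colon + 1).trim());
                }
            }
            // Consume the payload exactly once; the stream is never rewound.
            byte[] chunk = new byte[8192];
            long remaining = contentLength;
            while (remaining > 0) {
                int n = in.read(chunk, 0, (int) Math.min(chunk.length, remaining));
                if (n == -1) {
                    throw new IOException("Truncated payload");
                }
                remaining -= n;
            }
            lengths.add(contentLength);
        }
        return lengths;
    }

    public static void main(String[] args) throws IOException {
        String data = "WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
                + "WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n";
        InputStream in = new ByteArrayInputStream(data.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(payloadLengths(in)); // prints [5, 3]
    }
}
```

Because the sketch only ever calls `read` on the stream, it works equally well on a file, a socket or any other source that cannot be rewound.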

This is also the case when parsing or validating compressed WARC/ARC files in which each record is GZip'ed individually; the compressed data can likewise only be processed sequentially.
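As a rough illustration of per-record compression, the sketch below writes each record as its own GZip member, concatenates the members, and then decompresses the result in a single forward pass. It relies only on `java.util.zip` and does not use JWAT's GZip reader; the class and method names are hypothetical and the record contents are made-up sample data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration, not part of JWAT.
public class GzipMembersSketch {

    /** Compresses each record as its own GZip member and concatenates the members. */
    static byte[] compressPerRecord(String... records) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        for (String record : records) {
            // Closing the GZIPOutputStream finishes the member; closing the
            // underlying ByteArrayOutputStream is a no-op, so it can be reused.
            try (GZIPOutputStream gz = new GZIPOutputStream(file)) {
                gz.write(record.getBytes(StandardCharsets.UTF_8));
            }
        }
        return file.toByteArray();
    }

    /** Decompresses all members in order, reading the compressed bytes once. */
    static String decompressAll(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = compressPerRecord("record-1\n", "record-2\n");
        System.out.print(decompressAll(compressed));
    }
}
```

Note that `GZIPInputStream` hides the member boundaries from the caller. A reader that needs to report or validate individual GZip entries has to track those boundaries itself.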

It is, however, possible to access individual WARC/ARC/GZip records at random when working with the logical files and a known file offset. This is done with the simple RandomAccessFileInputStream class in the common package.
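The idea of offset-based access can be sketched with the standard library alone. The adapter below is a hypothetical stand-in written for this example; it is not JWAT's RandomAccessFileInputStream and makes no claim about that class's constructor or methods. The sketch assumes the record's offset and length are already known, for example from an index.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of JWAT.
public class RandomAccessSketch {

    /** Exposes a RandomAccessFile as an InputStream starting at its current position. */
    static class OffsetInputStream extends InputStream {
        private final RandomAccessFile raf;

        OffsetInputStream(RandomAccessFile raf) {
            this.raf = raf;
        }

        @Override
        public int read() throws IOException {
            return raf.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return raf.read(b, off, len);
        }
    }

    /** Seeks to the given offset and reads up to 'length' bytes through the stream adapter. */
    static String readAt(File file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            InputStream in = new OffsetInputStream(raf);
            byte[] buf = new byte[length];
            int total = 0;
            while (total < length) {
                int n = in.read(buf, total, length - total);
                if (n == -1) {
                    break;
                }
                total += n;
            }
            return new String(buf, 0, total, StandardCharsets.UTF_8);
        }
    }

    /** Writes two sample records to a temp file and reads only the second one. */
    static String demo() throws IOException {
        File file = File.createTempFile("records", ".txt");
        file.deleteOnExit();
        byte[] first = "record-1\n".getBytes(StandardCharsets.UTF_8);
        byte[] second = "record-2\n".getBytes(StandardCharsets.UTF_8);
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.write(first);
            raf.write(second);
        }
        // The second record starts where the first one ends.
        return readAt(file, first.length, second.length);
    }

    public static void main(String[] args) throws IOException {
        System.out.print(demo());
    }
}
```

Once positioned, the stream can be handed to any parser that expects an InputStream, so the sequential parsing code needs no changes to support random access.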

Since this project is intended to be of general use, it also includes WARC and GZip writers.

Further documentation

The pages below describe the individual packages and the process by which ARC and WARC files are read and validated.

If you cannot find the information you seek, you can always consult the javadocs or look at the source code. As a last resort, you are also welcome to email us with your inquiry.

Package layout

This toolkit includes the following packages:

  • jwat-common: General-purpose classes, including specialized streams, binary->string encoding and common ARC/WARC HTTP response/payload code.
  • jwat-gzip: GZip reader/validator/writer, including input/output streams for data.
  • jwat-arc: Contains ARC reader/validator/writer specific classes.
  • jwat-warc: Contains WARC reader/validator/writer specific classes.


...