Using Hibernate as the harvest database layer

Migration of Harvest Database to Hibernate-based Object-Relational Framework

Background

The current NAS architecture maintains a persistent store of harvest definition information in a database which is accessed by a Data Access Object (DAO) layer. The DAO layer is written and maintained by NAS developers and accesses the database directly via SQL in the form of JDBC prepared statements.

Hibernate is an object-relational framework which provides components that allow Java objects (such as HarvestDefinition-s or Job-s) to be persisted directly. The mapping of the objects to a database layer is carried out by the Hibernate framework itself, based on the structure of the objects to be persisted and additional information supplied in Java annotations. Objects are retrieved from storage via an object-based query language (HQL) or via a Query-API. In principle, therefore, well-written Hibernate applications are database-neutral.
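To make the idea of annotation-driven mapping concrete, here is a toy sketch that uses only the Java standard library. The annotations `ToyTable` and `ToyColumn`, the class `ToySchedule` and the helper `ToyMapper` are all hypothetical names invented for this illustration. They are not Hibernate's API (Hibernate relies on the JPA annotations such as `@Entity`, `@Id` and `@Column`) and they are not taken from the NAS codebase. The sketch only shows the principle: the framework reads metadata from the class and derives the SQL itself.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for mapping annotations; these are NOT Hibernate's.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ToyTable {
    String name();
}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ToyColumn {
    String name();
}

// Simplified, hypothetical schedule object; the real NAS class differs.
@ToyTable(name = "schedules")
class ToySchedule {
    @ToyColumn(name = "name")
    String name;

    @ToyColumn(name = "comments")
    String comments;

    // Not annotated, so the mapper ignores it.
    int cachedHash;
}

class ToyMapper {
    // Derives a parameterised INSERT statement purely from the annotations.
    static String insertSql(Class<?> type) {
        ToyTable table = type.getAnnotation(ToyTable.class);
        if (table == null) {
            throw new IllegalArgumentException(type.getName() + " is not mapped");
        }
        List<String> columns = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            ToyColumn column = field.getAnnotation(ToyColumn.class);
            if (column != null) {
                columns.add(column.name());
            }
        }
        // Reflection does not guarantee field order, so sort for stable output.
        Collections.sort(columns);
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table.name()
                + " (" + String.join(", ", columns) + ") VALUES (" + placeholders + ")";
    }
}
```

Calling `ToyMapper.insertSql(ToySchedule.class)` produces `INSERT INTO schedules (comments, name) VALUES (?, ?)`. No SQL appears in `ToySchedule` itself, which is the property that makes an annotation-based persistence layer attractive.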

Hibernate is used in NAS for the wayback-indexer component, but with a very simple data model. This document discusses the idea of moving our entire Harvest Definition Database to Hibernate.

Problems with the Current Setup

Multiple Database Support

In the current code we support DerbyDB, MySQL and Postgres, thus trebling the workload on every database change. Moreover, the relevant expertise in the different databases is spread out over the different partners. This makes significant change to the persistence layer very challenging.
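As a small illustration of why each supported database adds work, consider limiting the number of rows returned by a query. The sketch below is not taken from the NAS DAO code. The enum `SqlDialect`, the method `limitRows` and the `jobs` table in the usage note are assumptions made for the example. It relies only on the fact that Derby expresses a row limit with `FETCH FIRST n ROWS ONLY`, whereas MySQL and PostgreSQL accept `LIMIT n`.

```java
// Illustrative only: one logical query, dialect-specific spellings.
enum SqlDialect {
    DERBY, MYSQL, POSTGRES;

    // Appends a row limit to a SELECT using the dialect's own syntax.
    String limitRows(String select, int maxRows) {
        if (maxRows < 1) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        switch (this) {
            case DERBY:
                // Derby has no LIMIT keyword; it uses the SQL-standard clause.
                return select + " FETCH FIRST " + maxRows + " ROWS ONLY";
            case MYSQL:
            case POSTGRES:
                return select + " LIMIT " + maxRows;
            default:
                throw new IllegalStateException("unsupported dialect: " + this);
        }
    }
}
```

For example, `SqlDialect.DERBY.limitRows("SELECT job_id FROM jobs", 10)` and `SqlDialect.MYSQL.limitRows("SELECT job_id FROM jobs", 10)` return different strings for the same intent. Every such branch in hand-written SQL has to be written, tested and maintained separately for each database.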

Complexity of Code

The current Harvester Database persistence layer has been reworked so many times that the DAO layer has become encrusted with hard-to-understand and hard-to-maintain code.

Schema Management System

In the current code, updates to the database schema are carried out by the Java code when the DAO-s are initialized. This is contrary to industry-standard practice in which database integrity is the responsibility of a DBA.

Advantages of a Shift to Hibernate

Database Neutrality

Hibernate is database-neutral by design. It should therefore be possible to create code which can be run on any major RDBMS by configuration.

Flexibility

Coding new features or refactoring existing features should be much easier in Hibernate, as adding a generic persistence layer requires little more than adding some annotations to the relevant code. This is in contrast to the present situation, where adding new features with persistence is a major challenge.

Rewriting of Critical Code

The Harvest Definition layer of NAS is arguably in need of a major overhaul. This would be a natural side-effect of a migration to Hibernate.

Risks Associated with a Shift to Hibernate

Diversion of Resources from Other Tasks

Migrating to Hibernate will divert a considerable amount of developer time from other tasks with no obvious gain in new features for the end-users.

Estimation Uncertainty

It is very difficult to estimate how much work the migration would involve, but it is probably measurable in man-months rather than man-days.

Performance Issues and Maintaining Database Neutrality

Some of our database queries involve complex operations on large tables. There is a risk that it may be difficult or impossible to get satisfactory performance from Hibernate for these operations. In such a case it might be necessary to bypass Hibernate's object-based queries by reintroducing direct SQL queries, thus losing many of the advantages of migrating in the first place, database neutrality in particular.

Migration of Existing Data

We would need to create tools for migrating and validating all existing data.

Conclusions

If we could move to Hibernate tomorrow at zero cost we would undoubtedly obtain a system which was easier to maintain and develop. However, the cost of migration is sufficiently large and uncertain that the process cannot reasonably be specified as a task or series of tasks as part of the regular NAS coding and maintenance efforts. It is essentially a complete rewrite of a large part of NAS's core functionality and would need to be managed as a mini-project in itself.

One way to tackle the problem would be to explore the possibility of a piecewise migration, starting with simpler parts of the database, such as schedules and global crawler traps, in order to build up more experience in coding techniques, estimation, and the pitfalls of the migration process. We would then be better placed to make a final decision about migrating the more challenging parts of the codebase such as Domains and Job-generation. Design and architecture decisions for even a simplified partial migration should be made carefully, so sufficient time needs to be allocated for a proper analysis. With that analysis included, it is estimated that a Hibernate migration test-project concentrating on less challenging parts of the system could be completed in 10-15 man-days.
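One possible shape for such a piecewise migration is sketched below, assuming that callers can be made to depend on a storage interface rather than on a concrete DAO. All names (`ScheduleStore`, `LegacyScheduleStore`, `HibernateScheduleStore`, `ScheduleStores` and the setting values) are hypothetical, and both implementations are in-memory stand-ins so that the sketch runs without a database. In a real test-project the first would wrap the existing JDBC code and the second would use Hibernate.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical storage interface for one simple part of the database.
interface ScheduleStore {
    void save(String name, String definition);

    Optional<String> load(String name);

    String backend();
}

// Shared in-memory behaviour so the sketch runs without a database.
abstract class InMemoryScheduleStore implements ScheduleStore {
    private final Map<String, String> rows = new HashMap<>();

    @Override
    public void save(String name, String definition) {
        rows.put(name, definition);
    }

    @Override
    public Optional<String> load(String name) {
        return Optional.ofNullable(rows.get(name));
    }
}

// Stand-in for the existing hand-written JDBC DAO.
class LegacyScheduleStore extends InMemoryScheduleStore {
    @Override
    public String backend() {
        return "legacy-jdbc";
    }
}

// Stand-in for a Hibernate-backed replacement.
class HibernateScheduleStore extends InMemoryScheduleStore {
    @Override
    public String backend() {
        return "hibernate";
    }
}

class ScheduleStores {
    // Selects the implementation from a (hypothetical) configuration value.
    static ScheduleStore forSetting(String setting) {
        switch (setting) {
            case "hibernate":
                return new HibernateScheduleStore();
            case "legacy-jdbc":
                return new LegacyScheduleStore();
            default:
                throw new IllegalArgumentException("unknown backend: " + setting);
        }
    }
}
```

With this arrangement, schedules could be switched to the new backend by configuration while Domains and Job-generation continue to use the existing DAOs, and the switch could be reverted if problems appear.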