AVH aggregation and mobilisation

position paper

Niels Klazenga, Aaron Wilton and Alison Vaughan

Preface
The aim of this paper is to outline the requirements and issues for data mobilisation so that HISCOM can reach a decision on a strategy for future data mobilisation and aggregation.

Recommendations

 * HISCOM to endorse the two-step aggregation model advocated by ALA, using MEL as the aggregation point for the coming few months. This will enable the AVH portal to go live as soon as possible and allow us to debug the data providers and do the necessary quality checks on the provided data.
 * HISCOM/CHAH to request ALA to work on the model with a single data aggregation point in the ALA cloud that meets all Requirements from the AVH community. This solution must be in place and properly tested by the end of the current ALA funding round.
 * HISCOM to develop a protocol for data that does not comply with standard vocabularies.
 * HISCOM to endorse BioCASe and TAPIR as well as simpler transfer methods, and to investigate other data exchange standards and formats, DarwinCore Archive being a prime candidate.
 * HISCOM to review HISPID5, particularly the HISPID vocabularies, along with the simpler exchange format.
 * HISCOM to require data custodians to assign permanent identifiers to the provided records, following TDWG recommendations (http://www.tdwg.org/standards/150) where practicable.

Starting points

 * ALA is an infrastructure project and there is an expectation that it will provide tools for the community to mobilise and disseminate its data.
 * AVH is its data and therefore mobilisation and aggregation of data (i.e. getting the data out of the herbaria and into the AVH cache) is at least as important as disseminating that data (i.e. the fancy front end). Data aggregation is a very important component of the AVH hub.
 * At the end of the road, i.e. at the end of the current round of ALA funding, there must be a sustainable data aggregation process for the AVH hub (and all other hubs).
 * Any aggregation and mobilisation solution must include a solution for all herbaria (not just the Commonwealth, state or territory herbaria).

Requirements from AVH community

 * A robust data aggregation service.
 * Ability to update at least daily, including ‘unpublishing’ records.
 * Data validation.
 * Automated periodic reporting of harvesting activity, including successful harvesting activity, numbers of records harvested and validation reports.
 * Database access by AVH administrator from within AVH community.
 * Administration tools, e.g. ability to set up new providers, create or modify data mappings, administer user rights, vocabulary management, that are accessible to the AVH community.
 * Transparency in data storage in the ALA bio-cache.
 * Ability to deal with de-accessioned and unpublished data.
 * Transfer mechanisms that are able to deal with XML as well as CSV.
 * All transfer mechanisms need to support HISPID5 (and subsequent versions).
 * Minimisation (or ideally elimination) of manual handling of data at the aggregation point.
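
The daily-update, 'unpublishing' and reporting requirements above can be illustrated with a minimal sketch in Python. It assumes that each harvest returns the full set of records a provider currently publishes, keyed by a permanent identifier. The function name `plan_update`, the record layout, the report keys and the sample records are all hypothetical and are not part of any existing ALA or AVH tool.

```python
def plan_update(cache, harvested):
    """Compare the cached records with a fresh harvest.

    Both arguments are dicts that map a record identifier to a dict of
    field values (a hypothetical layout chosen for illustration only).
    """
    to_insert = sorted(k for k in harvested if k not in cache)
    to_update = sorted(
        k for k in harvested if k in cache and harvested[k] != cache[k]
    )
    # Records that are in the cache but are no longer provided get unpublished.
    to_unpublish = sorted(k for k in cache if k not in harvested)
    # Counts that could feed an automated harvesting report.
    report = {
        "harvested": len(harvested),
        "inserted": len(to_insert),
        "updated": len(to_update),
        "unpublished": len(to_unpublish),
    }
    return to_insert, to_update, to_unpublish, report


# Made-up sample data: rec-1 changed, rec-2 was withdrawn, rec-3 is new.
cache = {
    "rec-1": {"taxon": "Acacia dealbata"},
    "rec-2": {"taxon": "Banksia marginata"},
}
harvested = {
    "rec-1": {"taxon": "Acacia dealbata Link"},
    "rec-3": {"taxon": "Eucalyptus regnans"},
}
inserts, updates, unpublished, report = plan_update(cache, harvested)
print(report)
```

A full comparison like this is only one option. A provider that can report changes since the last harvest would make the comparison unnecessary, but would then need another way to signal withdrawn records.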

Aggregation
The current ALA proposal is for a two-step data aggregation model, whereby data is harvested from the herbaria by the Royal Botanic Gardens (RBG) Melbourne on behalf of CHAH and the cache in Melbourne pushes that data to the ALA cache. This model will allow us to move forward with the AVH Hub in the short term. We believe the preferred medium and long term solution is a single-step aggregation model whereby data from the herbaria is aggregated straight into the ALA bio-cache. Note: the current ALA proposal does not resolve (for want of a better word) whether data from university herbaria will be aggregated via the two-step or single-step model in the short term.

Advantages of the two-step model

 * Aggregation in Melbourne is already in place.
 * Obnoxious de facto AVH administrator. (Means things happen!)

Disadvantages of the two-step model

 * Current code base is limited to specimen data, and would need to be updated any time new fields or data types (e.g., images, descriptions) are added.
 * Administration is currently limited to two key staff at MEL, although access to people outside of MEL can be provided.
 * Relies on RBG Melbourne continuing to provide this resource. Niels may not want to do this in the long term (his obnoxiousness manifests itself in a variety of ways).
 * Robustness of solution. The current PHP solution does the job and is good for development, but once it’s working it should be ported to Java and ideally incorporated as part of the ALA infrastructure.
 * Maintenance of separate code base that is specific for aggregation of CHAH data.
 * Requires code to do initial aggregation, transfer and then upload to second aggregation point.



Figure 1. Two-step aggregation model. In this model AVH data is aggregated in the AVH cache (as it is now) and the cache of the AVH hub is updated from there. This scenario does not yet include a solution for university herbaria.

Advantages of the single-step model

 * Common code base that can be maintained as part of ALA infrastructure, used by more than CHAH/AVH, and updated by and for the whole community when changes are required.
 * Development of common tool set for administrators, e.g., is an AVH Administrator role needed by OZCAM?
 * Single aggregation point for all data mobilisation methods and communities.
 * Requires access by obnoxious de facto AVH Administrator to tools within ALA infrastructure. This requires greater access and collaboration and should therefore be seen as an advantage.

Disadvantages of the single-step model

 * It remains to be seen how some of the requirements listed above can be met under the single-step aggregation model.



Figure 2. Single-step aggregation model. In this model all data is aggregated directly into the AVH hub.

Mobilisation
Most Australian national, state and Commonwealth herbaria already have running BioCASe providers. It has been a major effort to make this happen and there is now considerable knowledge about the BioCASe provider in the AVH community. Many New Zealand herbaria use a TAPIR provider, which delivers data in the same format (XML). It is therefore a requirement that the aggregation solution can deal with the XML delivered by these providers. ALA is also to provide a lighter, CSV-based method (which they call simple-HISPID, but which I have re-dubbed HISPID-light) that should make it easier for herbaria with fewer resources to deliver their data into AVH.

Advantages of the BioCASe/TAPIR providers

 * Very well supported.
 * One can query the BioCASe/TAPIR provider on any mapped field. This means that if a herbarium maps a new concept or re-maps a concept, which may affect only a small part of the records, one can query for just these records and doesn’t have to do a complete re-index. This is also great for debugging.
 * Utilises existing standards and transfer formats.

Disadvantages of the BioCASe/TAPIR providers

 * While the provider software itself is quite easy to install, it takes quite an effort to get the data into a format the providers can deal with. Almost always a separate database is needed, one that is either more or much less normalised than the collections database and that sometimes even runs in a different database management system.
 * Harvesting through the BioCASe/TAPIR provider can be a slow, or, depending on the server, very slow process and therefore is not ideal for initial loading of data.
 * Requires access to servers.
 * Harvesting through the BioCASe/TAPIR provider requires a firewall exception at the data provider’s end. For small herbaria (and some larger ones) this proves to be a major obstacle.
 * The BioCASe/TAPIR providers do not transfer non-compliant data.

Note: The data dump feature that is available in TAPIR (we are unsure whether it is available in all implementations of TAPIR) and will be available in the coming version of the BioCASe provider will ameliorate many of the problems associated with the BioCASe and TAPIR providers. It is therefore recommended that all Australian herbaria that use the BioCASe provider upgrade to the latest version.

Advantages of HISPID-light

 * It’s simple
 * It’s lightweight, so suitable for massive data transfer. Furthermore, it can be transferred in compressed form.
 * No firewall exception is required, as the data will be pushed to an upload server via SFTP.

Disadvantages of HISPID-light

 * More a concern than a disadvantage: how well will simple HISPID CSV deal with structured data? At the moment all data that is harvested can be transferred in completely flat form, if we transfer collectors as primary and additional collectors and phenology as a semicolon-separated string. However, some of us would very much like to see determination history in AVH, which is not possible in a flat format. Also, although this is rare, a type specimen could typify more than one name.


 * In HISPID-light, structured information needs to be concatenated into a single string. For collector information this can be done by having primary and additional collectors fields, with the individual collectors in each separated by a semicolon (not a comma). For determination history, we can have the current determination split into its components (taxon name, determiner, determination date, determination notes) and all previous determinations concatenated into something similar to the Verification history (vhist) field in HISPID3. We need to have strict formatting rules on such strings, so they can be parsed.


 * Another concern: If herbaria have a vocabulary on a field which is different from the vocabulary on the corresponding concept in HISPID, will the ALA data mobilisation tool be able to deal with it and convert the custom herbarium vocabulary into the standard HISPID vocabulary? (See Review of HISPID)
 * How simple is simple-HISPID? How easy is it to convert data from database structure to HISPID-light format? How easy is it to map to HISPID vocabularies?
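
To make the point about strict formatting rules concrete, here is a minimal sketch in Python of how collectors and previous determinations might be flattened into single strings and parsed back. The separators (a semicolon between items, a vertical bar between the components of a determination), the fixed component order and all function and field names are assumptions made for illustration. They are not the HISPID3 vhist format and not an agreed HISPID-light rule.

```python
FIELDS = ("taxon_name", "determiner", "date", "notes")
RECORD_SEP = "; "  # between collectors or between determinations (assumed)
FIELD_SEP = " | "  # between the components of one determination (assumed)


def join_collectors(names):
    """Concatenate collector names into one semicolon-separated string."""
    for name in names:
        if ";" in name:
            raise ValueError(f"collector name contains ';': {name!r}")
    return RECORD_SEP.join(names)


def split_collectors(text):
    """Split a semicolon-separated collectors string into names."""
    return [part.strip() for part in text.split(";") if part.strip()]


def join_determinations(determinations):
    """Flatten a list of determination dicts into one string."""
    chunks = []
    for det in determinations:
        values = [det.get(field, "") for field in FIELDS]
        for value in values:
            # Strict rule: the separators may not occur inside a value.
            if ";" in value or "|" in value:
                raise ValueError(f"reserved character in {value!r}")
        chunks.append(FIELD_SEP.join(values))
    return RECORD_SEP.join(chunks)


def split_determinations(text):
    """Parse a flattened determination history back into dicts."""
    determinations = []
    for chunk in text.split(";"):
        if not chunk.strip():
            continue
        values = [value.strip() for value in chunk.split("|")]
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} components: {chunk!r}")
        determinations.append(dict(zip(FIELDS, values)))
    return determinations


# Made-up example data.
collectors = ["Smith, J.", "Jones, A.B."]
history = [
    {"taxon_name": "Acacia sp.", "determiner": "Smith, J.", "date": "1987-05-02", "notes": ""},
    {"taxon_name": "Acacia dealbata", "determiner": "Jones, A.B.", "date": "1999-11-20", "notes": "flowers seen"},
]
flat_collectors = join_collectors(collectors)
flat_history = join_determinations(history)
print(flat_collectors)
print(flat_history)
```

The commas inside the collector names show why a semicolon rather than a comma is used as the separator. Rejecting values that contain a separator is the simplest strict rule. An escaping scheme would be more forgiving but makes the strings harder to produce by hand.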

Permanent identifiers
Data that is disseminated through the AVH hub needs to have permanent unique identifiers. We have been talking about LSIDs at every HISCOM meeting for years. It really needs to happen now. If we don’t do it, ALA will.

Review of HISPID
Many of the HISPID vocabularies are antiquated and differ from the vocabularies that are used in herbarium databases. It may be very difficult for herbaria with fewer resources to shoehorn terms from their vocabularies into the standard vocabularies. We therefore recommend that a review of HISPID, in particular its vocabularies, take place over the next 12 months or so. We also suggest making some of the vocabularies recommended rather than enforced, to make it easier for herbaria to deliver data to AVH. Terminology that does not comply with the standard would be shown in the record details in the AVH front end and would be marked as such. It should also be included in the validation reports that the aggregator feeds back to the data provider.
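
A minimal sketch in Python of how a recommended (rather than enforced) vocabulary could work at the aggregation point: terms are translated where a mapping exists, and anything else is passed through unchanged but flagged for the front end and the validation report. The function names, the record layout and the vocabulary terms below are invented for illustration and are not taken from HISPID or from any ALA tool.

```python
def normalise_term(value, standard_terms, provider_mapping):
    """Return (term, compliant) for one field value.

    standard_terms: set of terms in the standard vocabulary.
    provider_mapping: dict translating a provider's own terms to standard ones.
    """
    key = value.strip().lower()
    if key in standard_terms:
        return key, True
    if key in provider_mapping:
        return provider_mapping[key], True
    # Recommended, not enforced: keep the original value but mark it.
    return value.strip(), False


def validate_records(records, field, standard_terms, provider_mapping):
    """Normalise one field in every record and collect a validation report."""
    cleaned, report = [], []
    for record in records:
        term, compliant = normalise_term(
            record[field], standard_terms, provider_mapping
        )
        cleaned.append({**record, field: term, field + "_compliant": compliant})
        if not compliant:
            # One report line per non-compliant value, for the data provider.
            report.append((record["id"], field, term))
    return cleaned, report


# Made-up vocabulary, mapping and records.
standard_terms = {"flowers", "fruit", "sterile"}
provider_mapping = {"fl.": "flowers", "fr.": "fruit"}
records = [
    {"id": "rec-1", "phenology": "Fl."},
    {"id": "rec-2", "phenology": "fruit"},
    {"id": "rec-3", "phenology": "in bud"},
]
cleaned, report = validate_records(records, "phenology", standard_terms, provider_mapping)
print(report)
```

Keeping the mapping per provider means a herbarium can keep its own vocabulary in its database. Only the mapping table needs maintaining, and the report shows which terms still lack a mapping.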