AVH database structure

From Hiscom
Jump to navigation Jump to search

Current database structure

A diagram of the current AVH database structure is given below. Most of the table names are intuitive. The resource table is where the parameters of the BioCASe providers are set. It also reports some statistics from the BioCASe providers, such as when a harvest was last attempted (actually, that is the only statistic).

Current database structure

readable version (PDF, 183kB)

Problems with the current structure

There are three tables dealing with taxonomy and four tables dealing with geography, all of which individually link to the specimen table. There are many-to-many relationships between taxon name and family, between taxon name and scientific name, and between state and country. This means that taxa may belong to different families, that canonical taxon names and scientific names do not necessarily match, and that Tasmania may or may not be in Australia. A problem that has been observed in AVH is that, in some cases, when you query for one taxon the specimen detail gives the name of another taxon. While this was caused by a bug in the AVH harvester, which apparently has been fixed, the data model should have prevented this from happening. The fix in the harvester should have been accompanied by a change in the data model.

The database structure shows that the issue of more records being returned in club mode than in public mode (as reported by Alex) cannot have been caused by the data model, but is an error in the AVH query. However, a better database structure would have prevented the error from occurring. [Actually, it wouldn't have, as it turns out the PERTH records are really duplicated in AVH, the old ones with a leading 0 in the unit ID and the new ones without.]

Parts of the AVH database structure are highly overnormalised (see the tables on the right-hand side of the diagram). While normalisation is supposed to enhance data integrity, this degree of abnormalisation compromises data integrity. Moreover, increased normalisation of data reduces the speed of data retrieval. No data entry is done in the AVH cache, so the database structure should be optimised for data retrieval rather than data integrity. The current data model definitely isn't optimised for data retrieval.

Apart from the problems with the linking of the tables, several fields, for instance hybrid_flag and author_team, are in the wrong table.

There are no such obvious problems on the left-hand side of the diagram, although some of the relationships are the wrong way around. Under the current model, a single herbarium can have multiple sets of metadata and more than one BioCASe provider.

Proposed structure

My suggestion for a better structure is presented below. Tables or fields that are not in the current data model and fields that are placed in a different table from the current model are in red.

Preferred database structure

readable version (PDF, 180kB)

Proposed changes

The following changes are proposed:

  • Taxonomy: taxon_name, full_scientific_name and family have been merged into a single table, taxon. Moreover, hybrid_flag, hybrid_flag_insertion_point, author_team_parentheses and author_team have been moved into this table from specimen_detail.
  • Geography: country and state have been merged into one table, geography. The region and near_named_place fields have been accommodated in specimen.
  • The relationship between specimen and specimen_detail has been changed, so that now these tables have a de facto one-to-one (or one-to-none) relationship. This way there can not be more than one record in specimen_detail for any record in specimen. This was already the case in the current database structure, but the proposed relationship just makes more sense. For instance, you can see in the specimen table whether there is a linked record in specimen detail, foreign key constraints work predictably and there are no foreign key constraints on primary keys.
  • The relationships between the herbarium table and metadata and resource have been reversed, so that each herbarium can only have one set of metadata and only one BioCASe provider that is harvested by AVH. You really wouldn't want it any other way.
  • The relationship between the AVH cache and the user administration tables has been removed.