Controlled Vocabularies in HISPID

A controlled vocabulary is a set of allowed values for an element of any database, file, or language. For HISPID, which has been revised in XML, a controlled vocabulary may be implemented in a number of ways. Each method has its benefits and drawbacks, so choosing between them is a task for the project. In XML Schema, we can limit an element to exactly one value from the controlled vocabulary, to as many values as needed (including duplicates), or to as many values as needed (excluding duplicates). A controlled vocabulary must be developed for each element separately.

Phenology as an example
In a number of fields in HISPID 3, a controlled vocabulary exists that is to be interchanged as comma-separated values sorted alphabetically. Furthermore, any number of values may be used (from 0 to all values). Phenology (phe) permits the following values:

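Flowers (Pre-fertilisation):

 * bisexual flowers
 * buds
 * female cones
 * female flowers
 * flowers
 * male/female cones
 * male cones
 * male flowers

Fruits (Post-fertilisation):

 * fruit
 * fruiting cones

Cryptogams:

 * gametophyte
 * sporophyte
 * spore-bearing bodies

General Terms:

 * fertile
 * sterile
 * leafless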

Thus, values must be supplied as follows:

phe	"bisexual flowers, fertile, fruit",

or

phe	"bisexual flowers",

and not:

phe	"fruit, bisexual flowers",

(because the values are not sorted alphabetically).

In recasting this field into XML, we want to continue to permit the same (or a very similar) set of values, but also to enable validation, so that applications using an XML version of HISPID can test that a file is valid HISPID before making dangerous assumptions about its content. There are at least three methods we could use to recast this field in XML: a pattern, a list, or an enumeration.

Pattern
A pattern restricts the content of an element or attribute in XML Schema by means of a regular expression. This ensures that the element or attribute can only contain text that matches the expression. However, the syntax of regular expressions limits their use. For example, it is difficult (if not impossible) to enforce uniqueness or alphabetic order on the individual values.
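As an illustration only, a pattern-based declaration might look like the sketch below. The type name PhenologyTypePattern is hypothetical, and the regular expression is just one possible way to accept comma-separated values from the vocabulary; it is not taken from the HISPID schema.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Assumption: the element is called Phenology, as elsewhere in this document. -->
  <xsd:element name="Phenology" type="PhenologyTypePattern"/>

  <!-- Hypothetical type name. The pattern accepts one vocabulary value,
       optionally followed by further values, each preceded by ", ".
       The attribute value must stay on one line, because any whitespace
       inside it is treated as part of the expression. -->
  <xsd:simpleType name="PhenologyTypePattern">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="(bisexual flowers|buds|female cones|female flowers|flowers|male/female cones|male cones|male flowers|fruit|fruiting cones|gametophyte|sporophyte|spore\-bearing bodies|fertile|sterile|leafless)(, (bisexual flowers|buds|female cones|female flowers|flowers|male/female cones|male cones|male flowers|fruit|fruiting cones|gametophyte|sporophyte|spore\-bearing bodies|fertile|sterile|leafless))*"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

Note that the whole vocabulary has to be written out twice, and that a pattern of this kind accepts repeated or unsorted values such as "fruit, fruit" or "fruit, bisexual flowers".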

   	 

A fragment from an XML document implementing this schema might be:

<Phenology>fruit, bisexual flowers</Phenology>

List
Compared to a pattern, a list gives easier control over the allowed values, because they can simply be enumerated in the schema document. However, a list places an extra requirement on its values: they cannot contain spaces, because whitespace is used to separate the individual values in the list. Thus the values (specified above) must be converted to fit this requirement, e.g.

 

  

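<!-- The opening declarations of this fragment are missing from the source.
     The element declaration, the list type, the type names PhenologyTypeList
     and PhenologyTypeListItem, and the first enumeration value are assumed. -->
<xsd:element name="Phenology" type="PhenologyTypeList" minOccurs="0" maxOccurs="1"/>

<xsd:simpleType name="PhenologyTypeList">
  <xsd:list itemType="PhenologyTypeListItem"/>
</xsd:simpleType>

<xsd:simpleType name="PhenologyTypeListItem">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="bisexual_flowers"/>
    <xsd:enumeration value="buds"/>
    <xsd:enumeration value="female_cones"/>
    <xsd:enumeration value="female_flowers"/>
    <xsd:enumeration value="flowers"/>
    <xsd:enumeration value="male_or_female_cones"/>
    <xsd:enumeration value="male_cones"/>
    <xsd:enumeration value="male_flowers"/>
    <xsd:enumeration value="fruit"/>
    <xsd:enumeration value="fruiting_cones"/>
    <xsd:enumeration value="gametophyte"/>
    <xsd:enumeration value="sporophyte"/>
    <xsd:enumeration value="spore-bearing_bodies"/>
    <xsd:enumeration value="fertile"/>
    <xsd:enumeration value="sterile"/>
    <xsd:enumeration value="leafless"/>
  </xsd:restriction>
</xsd:simpleType>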

A fragment from an XML document implementing this schema might be:

<Phenology>flowers bisexual_flowers fertile</Phenology>

As was the case with using a pattern, there is no way to ensure each list value is unique in a single HISPID record. The following xmlschema-dev postings illustrate this.


 * http://lists.w3.org/Archives/Public/xmlschema-dev/2003Sep/0073.html
 * http://lists.w3.org/Archives/Public/xmlschema-dev/2006Apr/0019.html

Additionally, sorting is again unavailable.
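To make both limitations concrete, the two instance fragments below would be expected to validate against a list type whose items are restricted to the underscore-separated vocabulary shown in this section. They are illustrative fragments, not taken from a real HISPID file.

```xml
<!-- A repeated value: a list type has no way to forbid the second "flowers". -->
<Phenology>flowers flowers fertile</Phenology>

<!-- Values out of alphabetic order: equally acceptable to a list type. -->
<Phenology>fruit buds</Phenology>
```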

Enumeration
An enumeration "facet" can be specified easily in the schema document. There is no need to avoid spaces in the controlled vocabulary, and each value can be validated, even for uniqueness within a HISPID record.

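<xsd:element name="Phenology" minOccurs="0" maxOccurs="1">
  <xsd:complexType>
    <!-- maxOccurs="unbounded" lets PhenologicalState repeat; the default of 1 would allow only a single value -->
    <xsd:choice minOccurs="1" maxOccurs="unbounded">
      <xsd:element name="PhenologicalState" type="PhenologyTypeDList"/>
    </xsd:choice>
  </xsd:complexType>
  <!-- each PhenologicalState value may occur only once within a Phenology element -->
  <xsd:unique name="Phenology">
    <xsd:selector xpath="PhenologicalState"/>
    <xsd:field xpath="."/>
  </xsd:unique>
</xsd:element>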

<xsd:simpleType name="PhenologyTypeDList">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="bisexual flowers"/>
    <xsd:enumeration value="buds"/>
    <xsd:enumeration value="female cones"/>
    <xsd:enumeration value="female flowers"/>
    <xsd:enumeration value="flowers"/>
    <xsd:enumeration value="male/female cones"/>
    <xsd:enumeration value="male cones"/>
    <xsd:enumeration value="male flowers"/>
    <xsd:enumeration value="fruit"/>
    <xsd:enumeration value="fruiting cones"/>
    <xsd:enumeration value="gametophyte"/>
    <xsd:enumeration value="sporophyte"/>
    <xsd:enumeration value="spore-bearing bodies"/>
    <xsd:enumeration value="fertile"/>
    <xsd:enumeration value="sterile"/>
    <xsd:enumeration value="leafless"/>
  </xsd:restriction>
</xsd:simpleType>

A fragment from an XML document implementing this schema might be:

<Phenology>
  <PhenologicalState>flowers</PhenologicalState>
  <PhenologicalState>buds</PhenologicalState>
</Phenology>

or (with minor changes to the schema code)

<Phenology>flowers</Phenology> <Phenology>buds</Phenology>
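The schema changes needed for this second form are not shown in the source. One way they might look is sketched below, as a fragment to be placed inside an xsd:schema next to the PhenologyTypeDList type. The parent element name Unit and the constraint name UniquePhenology are hypothetical, and the sketch assumes the elements are not namespace-qualified.

```xml
<!-- Assumption: "Unit" stands in for whatever element holds one HISPID record. -->
<xsd:element name="Unit">
  <xsd:complexType>
    <xsd:sequence>
      <!-- Phenology itself now carries one enumerated value and may repeat. -->
      <xsd:element name="Phenology" type="PhenologyTypeDList"
                   minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <!-- The uniqueness constraint moves up to the parent, because the
       repeated elements are now its direct children. -->
  <xsd:unique name="UniquePhenology">
    <xsd:selector xpath="Phenology"/>
    <xsd:field xpath="."/>
  </xsd:unique>
</xsd:element>
```

The design choice is the same as in the first form: uniqueness is declared on the element that contains the repeated values, since an identity constraint can only select descendants of the element it is attached to.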

The XML Schema Datatypes document notes that an enumeration does not impose an order relation on the value space it creates. Thus, just as in the earlier examples, there is no way to check that values from the controlled vocabulary were supplied in alphabetic order.

Summary
The benefits and problems are summarised as follows.

Benefits
XML Schema brings us the ability to describe the allowed values in one place. There has never been an easy way to validate a HISPID 3 document, so whichever method we choose, we gain considerably by adopting XML. (I think this is well understood, but it is useful to remind ourselves of this.)


 * Pattern
   * Able to keep the same comma-separated values syntax for the list in XML as in HISPID 3.
 * List
   * Able to keep a list syntax, although with minor changes from the method used in HISPID 3.
 * Enumeration
   * The list of values can be easily maintained in the schema.
   * The values can be tested for uniqueness within a HISPID record.

Drawbacks
Alphabetic sorting of lists, however they are specified in XML, is not possible. This is a minor problem, as sorting is easily done outside XML Schema, and thus could be done by the program that processes the file into a database, if it is required.


 * Pattern
   * Difficult to maintain in the schema due to the complex regular expression syntax.
   * The values cannot be checked for uniqueness within a record.
 * List
   * List values cannot contain spaces.
   * The values cannot be checked for uniqueness within a record.
 * Enumeration
   * Major change away from the comma-separated values format for the list of values in a record. This may be problematic for some herbaria.
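
As one illustration of sorting outside XML Schema, a processing step could reorder the values before they are loaded into a database. The XSLT 1.0 stylesheet below is a minimal sketch of that idea, assuming the element-per-value form (Phenology containing PhenologicalState children) and no namespaces. It is only one of many possible approaches, and the exact collation is left to the XSLT processor.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- For each Phenology element, emit its PhenologicalState children
       in ascending text order. Other child nodes (including whitespace
       text) are dropped in this sketch. -->
  <xsl:template match="Phenology">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="PhenologicalState">
        <xsl:sort select="." data-type="text" order="ascending"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```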