Controlled Vocabularies in HISPID

From Hiscom


A controlled vocabulary is a set of allowed values for an element of a database, file, or language. In HISPID, which has been recast in XML, a controlled vocabulary can be implemented in several ways. Each method has benefits and drawbacks, so choosing between them is a decision for the project. In XML Schema, we can restrict an element to one value from the controlled vocabulary, to as many values as needed (including duplicates), or to as many as needed (excluding duplicates). A controlled vocabulary must be developed separately for each element.

Phenology as an example

Several fields in HISPID 3 have a controlled vocabulary whose values are interchanged as a comma-separated list sorted alphabetically. Any number of values may be used (from none to all of them). Phenology (phe) permits the following values:

Flowers (Pre-fertilisation):

  • bisexual flowers
  • buds
  • female cones
  • female flowers
  • flowers
  • male/female cones
  • male cones
  • male flowers

Fruits (Post-fertilisation):

  • fruit
  • fruiting cones

Cryptogams:

  • gametophyte
  • sporophyte
  • spore-bearing bodies

General Terms:

  • fertile
  • sterile
  • leafless

Thus, values must be supplied as follows:

phe	"bisexual flowers, fertile, fruit",

or

phe	"bisexual flowers",

and not:

phe	"fruit, bisexual flowers",

(because the values are not sorted alphabetically).

In recasting this field into XML, we want to continue to permit the same (or a very similar) set of values, while enabling validation, so that applications using an XML version of HISPID can test that a file is valid HISPID before making dangerous assumptions about its content. There are at least three methods we could use to recast this field in XML: a pattern, a list, or an enumeration.

Pattern

A pattern restricts the content of an element or attribute in XML Schema using a regular expression, so that the element or attribute can contain only values that match the expression. However, regular expression syntax limits what can be checked. For example, it is difficult (if not impossible) to enforce uniqueness or alphabetic order on the individual values.

<xsd:element name="Phenology" minOccurs="0" maxOccurs="1">
	<xsd:simpleType>
		<xsd:restriction base="xsd:string">
			<xsd:pattern value="(bisexual flowers(, |,| )?|buds(, |,| )?|female cones(, |,| )?|....
		</xsd:restriction>
	</xsd:simpleType>
</xsd:element>

A fragment from an XML document implementing this schema might be:

<Phenology>fruit, bisexual flowers</Phenology>
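Because the pattern above is truncated, here is a minimal, self-contained sketch of how such a pattern could be completed. It uses only three of the phenology values and a simplified separator (a comma followed by an optional space). Both are assumptions made for brevity, not the HISPID definition. XML Schema patterns are implicitly anchored, so the whole element content must match.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
	<xsd:element name="Phenology">
		<xsd:simpleType>
			<xsd:restriction base="xsd:string">
				<!-- One term, then zero or more repetitions of
				     "comma, optional space, term".
				     Only three terms are listed here (illustrative subset). -->
				<xsd:pattern value="(bisexual flowers|buds|fruit)(, ?(bisexual flowers|buds|fruit))*"/>
			</xsd:restriction>
		</xsd:simpleType>
	</xsd:element>
</xsd:schema>
```

This sketch accepts both `fruit, bisexual flowers` and `buds,buds`, which shows that neither alphabetic order nor uniqueness is enforced. It rejects a trailing comma and any term outside the subset.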

List

A list gives easier control over the allowed values than a pattern, because the values can simply be enumerated in the schema document. However, it places an extra requirement on the values: they cannot contain spaces, because whitespace separates the individual items in the list. The values specified above must therefore be converted to fit this requirement, e.g.

<xsd:element name="Phenology" type="PhenologyTypeA" minOccurs="0" maxOccurs="1">
</xsd:element>

<xsd:simpleType name="PhenologyTypeA">
	<xsd:list itemType="PhenologyTypeAList"/>
</xsd:simpleType>

<xsd:simpleType name="PhenologyTypeAList">
	<xsd:restriction base="xsd:string">
		<xsd:enumeration value="bisexual_flowers"/>
		<xsd:enumeration value="buds"/>
		<xsd:enumeration value="female_cones"/>
		<xsd:enumeration value="female_flowers"/>
		<xsd:enumeration value="flowers"/>
		<xsd:enumeration value="male_or_female_cones"/>
		<xsd:enumeration value="male_cones"/>
		<xsd:enumeration value="male_flowers"/>
		<xsd:enumeration value="fruit"/>
		<xsd:enumeration value="fruiting_cones"/>
		<xsd:enumeration value="gametophyte"/>
		<xsd:enumeration value="sporophyte"/>
		<xsd:enumeration value="spore-bearing_bodies"/>
		<xsd:enumeration value="fertile"/>
		<xsd:enumeration value="sterile"/>
		<xsd:enumeration value="leafless"/>
	</xsd:restriction>
</xsd:simpleType>

A fragment from an XML document implementing this schema might be:

<Phenology>flowers bisexual_flowers fertile</Phenology>

As with a pattern, there is no way to ensure that each list value is unique within a single HISPID record, as postings on the xmlschema-dev mailing list illustrate. Alphabetic order cannot be enforced either.
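To make the limitation concrete: assuming the `PhenologyTypeA` list type defined above, both of the following hypothetical fragments would validate, even though the first repeats a value and the second is not in alphabetic order.

```xml
<!-- Valid against PhenologyTypeA: the duplicate "buds" is not detected. -->
<Phenology>buds buds fertile</Phenology>

<!-- Also valid: the order of the list items is not checked. -->
<Phenology>leafless fruit buds</Phenology>
```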

Enumeration

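In this approach each value is carried in its own child element, whose type is restricted by enumeration "facets" that are easy to specify in the schema document. There is no need to avoid spaces in the controlled vocabulary, and each value can be validated individually. Because the values are separate elements, an xsd:unique identity constraint can also enforce uniqueness within a HISPID record.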

<xsd:element name="Phenology" minOccurs="0" maxOccurs="1">
	<xsd:complexType>
		<xsd:choice minOccurs="1" maxOccurs="unbounded">
			<xsd:element name="PhenologicalState" type="PhenologyTypeDList"/>
		</xsd:choice>
	</xsd:complexType>
	<!-- Putting the uniqueness constraint here ensures that we enforce
	uniqueness *within the Phenology element in this HISPID record*, not
	across all Phenology elements in the document. -->
	<xsd:unique name="Phenology">
		<xsd:selector xpath="PhenologicalState"/>
		<xsd:field xpath="."/>
	</xsd:unique>
</xsd:element>

<xsd:simpleType name="PhenologyTypeDList">
	<xsd:restriction base="xsd:string">
		<xsd:enumeration value="bisexual flowers"/>
		<xsd:enumeration value="buds"/>
		<xsd:enumeration value="female cones"/>
		<xsd:enumeration value="female flowers"/>
		<xsd:enumeration value="flowers"/>
		<xsd:enumeration value="male/female cones"/>
		<xsd:enumeration value="male cones"/>
		<xsd:enumeration value="male flowers"/>
		<xsd:enumeration value="fruit"/>
		<xsd:enumeration value="fruiting cones"/>
		<xsd:enumeration value="gametophyte"/>
		<xsd:enumeration value="sporophyte"/>
		<xsd:enumeration value="spore-bearing bodies"/>
		<xsd:enumeration value="fertile"/>
		<xsd:enumeration value="sterile"/>
		<xsd:enumeration value="leafless"/>
	</xsd:restriction>
</xsd:simpleType>

A fragment from an XML document implementing this schema might be:

<Phenology>
	<PhenologicalState>flowers</PhenologicalState>
	<PhenologicalState>buds</PhenologicalState>
</Phenology>

or (with minor changes to the schema code)

<Phenology>flowers</Phenology>
<Phenology>buds</Phenology>
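The "minor changes" are not spelled out in the text. One possible reading is sketched below. It assumes a hypothetical parent element named `Record`, standing in for whatever element holds one HISPID record, and a shortened enumeration named `PhenologySubset` to keep the example self-contained. The uniqueness constraint moves up to the parent so that it still applies within each record.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

	<!-- "Record" is a hypothetical stand-in for the element that
	     holds one HISPID record. -->
	<xsd:element name="Record">
		<xsd:complexType>
			<xsd:sequence>
				<!-- Phenology now repeats, carrying one value per occurrence. -->
				<xsd:element name="Phenology" type="PhenologySubset"
					minOccurs="0" maxOccurs="unbounded"/>
			</xsd:sequence>
		</xsd:complexType>
		<!-- Declared on the parent: Phenology values must be unique
		     within one Record, not across the whole document. -->
		<xsd:unique name="UniquePhenology">
			<xsd:selector xpath="Phenology"/>
			<xsd:field xpath="."/>
		</xsd:unique>
	</xsd:element>

	<!-- Shortened for illustration; the full vocabulary is listed in
	     PhenologyTypeDList. -->
	<xsd:simpleType name="PhenologySubset">
		<xsd:restriction base="xsd:string">
			<xsd:enumeration value="buds"/>
			<xsd:enumeration value="flowers"/>
			<xsd:enumeration value="male/female cones"/>
		</xsd:restriction>
	</xsd:simpleType>

</xsd:schema>
```

The trade-off is that the constraint must then be declared on every element that can contain Phenology, rather than once on Phenology itself.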

The XML Schema Datatypes document notes that an enumeration "does not impose an order relation on the value space it creates". Thus, as in the earlier examples, there is no way to check that the values were supplied in alphabetic order.

Summary

The benefits and problems are summarised as follows.

Benefits

XML Schema lets us describe the allowed values in one place. There has never been an easy way to validate a HISPID 3 document, so whichever method we choose, we gain considerably by adopting XML. (I think this is well understood, but it is useful to remind ourselves of it.)

  • Pattern
    • Able to keep the same comma-separated values syntax for the list in XML as is used in HISPID 3.
  • List
    • Able to maintain a list syntax, although with minor changes from the method used in HISPID 3.
  • Enumeration
    • The list of values can be easily maintained in the schema.
    • The values can be tested for uniqueness within a HISPID record.

Drawbacks

None of these methods can enforce alphabetic order on the values, however they are specified in XML. This is a minor problem: sorting is easily done outside XML Schema, for example by the program that loads the file into a database, if it is required.

  • Pattern
    • Difficult to maintain in the schema because of the complex regular expression syntax.
    • The values cannot be checked for uniqueness within a record.
  • List
    • List values cannot contain spaces
    • The values cannot be checked for uniqueness within a record.
  • Enumeration
    • Major change away from comma-separated values format for the list of values in a record. This may be problematic for some herbaria.
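As noted under Drawbacks, ordering can be handled outside XML Schema. One possible way, sketched here as an XSLT 1.0 stylesheet, is to re-emit the values in sorted order before further processing. It assumes the enumeration layout (`PhenologicalState` children of `Phenology`) and elements in no namespace. The exact collation used by `xsl:sort` depends on the XSLT processor, so this is an illustration and not a prescribed part of HISPID.

```xml
<xsl:stylesheet version="1.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

	<!-- Identity template: copy everything that is not handled below. -->
	<xsl:template match="@*|node()">
		<xsl:copy>
			<xsl:apply-templates select="@*|node()"/>
		</xsl:copy>
	</xsl:template>

	<!-- Re-emit the PhenologicalState children of each Phenology
	     element sorted by their text content. -->
	<xsl:template match="Phenology">
		<xsl:copy>
			<xsl:apply-templates select="@*"/>
			<xsl:apply-templates select="PhenologicalState">
				<xsl:sort select="."/>
			</xsl:apply-templates>
		</xsl:copy>
	</xsl:template>

</xsl:stylesheet>
```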

External links