Nathan's Masters Thesis




                         University of California
                                Santa Cruz



                 Identifying Organisms with Computers:

  An Implementation of a Computerized Synoptic Identification System

                        with Fungi as a Test Case



                     A thesis submitted in partial satisfaction
                      of the requirements for the degree of

                                      
                            Master of Science
                                       

                                     in   
                                        
                    Computer and Information Science  
                                         
                                    by     
                                         
                               Nathan Wilson    
                                         
                                 June 1994    
                                        
                                        
                                        
                                        
                                         
                                     The thesis of Nathan Wilson is approved:
                                        
                                     Manfred K. Warmuth
                                        
                                     Charles McDowell
                                      
                                     Dennis Desjardin
 
 


                   Dean of Graduate Studies and Research




                                  Copyright by
                                  Nathan Wilson
                                      1994
                                     iii



Contents



Abstract                                                                vi



Acknowledgments                                                       vii



1. Introduction                                                          1


   1.1 Why Use Computers for Identification?                             1


       1.1.1 Identification with Books                                   2


       1.1.2 Computer Keys                                               5


   1.2 Project Goals                                                     7



2. Background                                                            9


   2.1 The Issues                                                        9


       2.1.1 Taxonomy                                                    9


       2.1.2 Database Issues                                            11


   2.2 Existing Computer Tools for Taxonomy                             13



3. The Interface                                                        18


   3.1 Concepts                                                         19


   3.2 High-Level Commands                                              22


       3.2.1 Identification System                                      23


       3.2.2 High-Level Editing                                         28


   3.3 Help System                                                      31


   3.4 Low-Level Commands                                               32


   3.5 Future Possibilities                                             33
                                     iv



4. The Internals                                                        37


   4.1 Standard Utility Objects                                         37


       4.1.1 Lists                                                      37


       4.1.2 Hash-Tables                                                38


       4.1.3 String Objects                                             38


       4.1.4 Escape Streams                                             41


   4.2 The Database                                                     42


   4.3 Database Elements                                                42


       4.3.1 Descriptions                                               43


       4.3.2 Features                                                   43


   4.4 The File Format                                                  48


       4.4.1 Database Objects                                           49



5. The Fungi Database                                                   54


   5.1 Categories                                                       54


       5.1.1 Descriptions                                               55


       5.1.2 Features                                                   55


       5.1.3 A Mushroom by Any Other Name...                            56


       5.1.4 Acknowledging Users and Authors                            58


   5.2 Future Possibilities                                             60



6. Conclusions and Future Directions                                    62


   6.1 Fungi Database Issues                                            62


   6.2 Program Issues                                                   64



A. Feature Sorting Heuristics                                           66



B. Low Level Commands                                                   70
                                     v



C. Features and Values for Fungi Database                               72



D. Example Database File                                                84



References                                                              86




                 Identifying Organisms with Computers:


  An Implementation of a Computerized Synoptic Identification System


                        with Fungi as a Test Case


                               Nathan Wilson



                                abstract

   Computers can be effectively used to provide a significantly
improved identification, and information storage system for taxonomic
information as compared with traditional paper methods. Much of the
improvement from using computers for identification comes from their
ability to fully realize, in an easy to use way, the power of synoptic
keys for identification over the more traditional dichotomous
keys. Areas of improvement include dynamic updating so the information
remains current, greater flexibility with respect to ignorance, ease
of use, and free distribution. A system developed as part of this
thesis, demonstrating these improvements is described along with a
sample database for identifying fungi.
                                    vii

Acknowledgments

   First of all I would like to acknowledge my mother, Nancy Wilson,
who taught me the difference between a Chanterelle and an Amanita 20
years ago, and nurtured my interest in fungi through her own lasting
interest in them. In addition, I would like to acknowledge my father,
Jerry Wilson, whose interest and proficiency in engineering is
ultimately responsible for my own interest in computers.
   Next I must thank with all my heart my spouse, Shoshana Abrass, who
has provided emotional, spiritual and financial support throughout the
last two years, and has in her quiet way humored and supported my
inexplicable interest in fungi.
   I would also like to acknowledge David Arora and both editions of
Mushrooms Demystified [Aro86]. While we have not had a lot of direct
personal contact, his books taught me and inspired the passionate
interest I have for the fungal world. It is also quite possible that
were it not for the still mysterious arrival of a copy of the first
edition of his book sometime around 1979, I would not have followed
the path that led to this thesis.
   I must express my deep gratitude to Dr. Manfred Warmuth, my
advisor, who allowed himself to be convinced that there really was
enough computer science in this project to make up a Masters thesis.
In addition, I thank Dr. Charles McDowell, chairman of the UCSC
Computer and Information Science board, for his work as one of my
readers and his interest in the project.
   On the mycological side, I would like to thank Dr. Dennis Desjardin
and Kris Shanks of the San Francisco State University for their help
in trying to make the system of as much value to mycologists as
possible.
   The members of the Fungus Federation of Santa Cruz have also
inspired me in the creation of this project. I would particularly like
to thank Gregg Ferguson who as a friend and fellow amateur mycologist
has been one of my most valued teachers.
   For specific work on Taxy, I must first thank Phyllis Cole for
providing the first complete entries to the database (genus Lactarius)
and a glossary of mycological terms. In addition,
                                    viii

Bill Maxfield (genus Boletus, Chalciporus and Xerocomus), Dan Farber,
David Fortwengler, Mark Gillespie, Ford Johnson, Shae Moss, Blaise
Pabon, Ed Ratcliffe, Julia Sauer and Christian Smith have devoted time
and enthusiasm to help make Taxy a reality.
   Finally, I would like to acknowledge the Free Software Foundation
for demonstrating that best things, even in the world of computer
software, can be for free.
                                     1

1.  Introduction

   The majority of work done in relation to this thesis was the
creation of the program Taxy and a database of fungi created for and
with it.  Taxy is designed to be a general purpose system for
describing and identifying biological species. The name of the program
comes from taxonomy which is the science of naming biological
species. The demonstration database created as part of this thesis
focuses on fungi and mycology (the study of fungi).  The primary
reason for this orientation is the author's familiarity with this
group of organisms.

1.1   Why Use Computers for Identification?

   Identification of biological species is a common problem faced
almost anytime we observe the natural world. The number of species
world-wide is estimated to be between 10 and 100 million [Wil92]. The
need to accurately identify them is important both in order to know
how they might affect us personally as well as understanding the
diverse roles they play in maintaining the health of the biosphere.
   In recent centuries the information needed for doing identification
has been principally printed in books. Several very powerful and
widespread ways for accessing the information in books have been
created.  Given the vast amount of information currently in written
form and existing techniques for accessing it, why involve computers?
After all, computers have several significant drawbacks when compared
with books. Computers are much easier to break than books, they
require power, are less readily available, and are quite expensive.
   However, computers provide the opportunity to significantly improve
upon the techniques for identification that are possible with
books. In addition, computers can greatly increase the availability
and usability of the information. With the continuing rapid
improvement in computers, the problems presented by computers are
diminishing. While it will be a long time until computers are as
reliable and convenient to use as books, the
                                     2

current state of the art is sufficient to begin to seriously realize
the potential improvements for doing identification.
   To illustrate the ways in which computers can improve upon the
standard printed identification techniques in current use, we must
first describe these techniques.  After describing these techniques,
we will discuss how computers could be used to create an
identification system that eliminates most of the disadvantages
inherent in books, combines all of the significant advantages of the
different written techniques and provides additional functionality
that is essentially impossible to provide in printed form.

1.1.1  Identification with Books

   There are three common techniques for identifying an unfamiliar
organism using books.  The simplest is to go through a set of
descriptions, either textual or pictorial, until you find a match.  We
will refer to this as the `field guide technique'.  The second
technique is the dichotomous key. In their simplest form dichotomous
keys are sequences of binary choices that lead to an
identification. This technique is the most widely used method of
identification by professional taxonomists. Finally, synoptic keys are
discussed. Synoptic keys allow a set of species descriptions to be
successively reduced based on the value of a user chosen
feature. While synoptic keys are more powerful than dichotomous keys,
the difficulty associated with creating and using them has prevented
their widespread acceptance.

The Field Guide Technique

   The field guide technique is just a simple linear search through a
set of descriptions.  This technique works just fine if the set of
descriptions is small and sufficiently precise to accurately
discriminate between each description.  However, with this technique
there is a tension between having easily understood descriptions (to
reduce the search time), and the precision of the descriptions. The
most extreme example of this difficulty is with guides that rely
primarily on photographs.  In this case, a single image of a
particular individual or small group of individuals is relied on to
provide sufficient information to
                                     3

decide upon the identity of the specimen at hand. If that specimen is
unusual in any way or is from a significantly different population
than the subject of the photograph, then errors can be
made. Fortunately, most field guides that rely on this technique also
include textual descriptions of the species which can be consulted to
decide on the correctness of the identification.  While the
deficiencies of this technique are obvious, it has the major advantage
of ease of use and it is commonly used in conjunction with the other
techniques once they have reduced the set of descriptions to a
manageable size.

Dichotomous Keys

   Dichotomous keys are textually based, typically binary, decision
trees.  They greatly improve upon the field guide technique by
significantly reducing the potential for errors and can be effectively
used to handle hundreds, thousands, or potentially millions of
possible descriptions.  Dichotomous keys are typically composed of
pairs of short discriminating descriptions, or couplets. The
expectation is that the specimen in hand will match only one of the
members of the couplet. Each member then either provides a name or
directs the user to some other couplet in the key.
   The biggest advantage of dichotomous keys over the field guide
technique is a somewhat less than a base-two logarithmic speedup in
the search time assuming the tree is well designed. The speedup is
less than logarithmic since it is often difficult for the designer to
evenly split the current set of solutions and it is sometimes
necessary to include a description on both sides of a node. In fact,
in large keys it is not unusual for the tree to actually be a directed
acyclic graph (DAG) with more than one path to a particular internal
node.
   Another advantage to dichotomous keys is that the time necessary to
decide between the two components of a couplet is normally
significantly faster than the time needed to read and compare a
textual description. However, for fungi, it is probably a little bit
slower than the average time needed by most people to decide whether a
given photograph matches or not. The textual form of the couplets, on
the other hand, allow non-visual features such as taste or odor to be
used effectively. The couplets also provide a simple consistent way
for
                                     4

the person designing the key to emphasize the features that they
consider to be significant and discriminating. This allows the
designer to take advantage of descriptions of accepted larger
groupings such as genera and families, and indirectly teaches the user
to make these distinctions when trying to identify other species in
the future.
   The major problem with dichotomous keys is their reliance on the
user's ability to make any particular discrimination the designer
chooses. This can be a very significant problem if the specimen is
damaged in some way, if an important feature such as habitat is
unknown, if a necessary tool like a microscope is unavailable, if the
user is color blind or unable to detect a particular odor, or if the
designer simply used a term that the user is not familiar with. In
response to these problems, most people who use dichotomous keys are
used to marking particular couplets as uncertain and then following
the different paths through the tree to find a set of results. They
then rely on reading the pertinent descriptions to decide between the
members of the resulting set. This process significantly reduces the
speed with which a key can be used and greatly increases the
frustration for the user.

Synoptic Keys

   A synoptic key takes a set of important discriminating features and
lists all their possible values for some set of descriptions.  Each
description is typically assigned a number and then for each value of
each feature, a list of matching descriptions is given. To use the
key, a user selects an initial feature and then writes down the set of
matches that are listed under whatever value or values they observe in
their specimen. They then select another feature, find the set of
matches and intersect the two sets by hand. They then select a third
feature, intersect these matches and so on.
   Synoptic keys can be significantly more efficient than dichotomous
keys since they can greatly increase the branching factor for a given
feature. More importantly, synoptic keys allow the user to chose which
feature to use next. This completely circumvents the uncertain choice
problem with dichotomous keys.
                                     5

   The major problems with textual synoptic keys are essentially all
clerical problems. In particular, it can be quite tedious to write
down by hand the initial large set of numbers, and then to find the
intersection of that set and another set by crossing out all the
numbers not included in the second set. Furthermore, it is extremely
easy to make a clerical error in this process.  Because of the
abstraction of the method, such errors are very difficult to
detect. Finally, textual synoptic keys do not scale well. By their
nature synoptic keys cannot be efficiently structured using static
text. Textual synoptic keys work reasonably well up to about 150
descriptions, but even at this size they begin to become unwieldy. A
textual synoptic key with 1000 descriptions would be unusable.
   Another significant problem with synoptic keys is that they often
present the user with too many options. It is often difficult while
using such a key to know which feature to select next in order to
reduce the set. In addition, synoptic keys do not provide the designer
the opportunity to present the choices in a structured way that allows
the user to learn the conventionally accepted groupings of
descriptions.
   A final problem with synoptic keys is that they rely entirely on
discretized features. For example, the only way to handle measurement
features such as spore size, or stipe length is to break up the range
of possible sizes into a fixed set of ranges. This is actually also a
problem with dichotomous keys, but it is less apparent because of the
form of the key.

1.1.2  Computer Keys

   The major advantage of a computer over books comes from its ability
to quickly rearrange data under user control. Simulating any of the
standard textual techniques with a computer is very straight
forward. The technique that gets the most obvious boost from the
computer is the synoptic key.  All of the clerical problems of set
intersection can be efficiently handled by the computer. Names can be
used instead of numbers making the intermediate states more
meaningful.
   In addition, the computer can further assist the user by
automatically determining the expected `effectiveness' of the
different feature choices at any point during the identification
                                     6

process. The computation of the effectiveness can be done entirely
automatically based on distributions of values in the remaining set of
descriptions. In addition, heuristic information provided by an expert
can be added to encourage the use of particular features in preference
to others. Such heuristic information could potentially increase the
efficiency of the search in addition to providing a way to indirectly
communicate the recognized importance or certain features.
   A computerized synoptic key can provide further improvement over a
textual one by dynamically determining which values are possible for
the current set.  Similarly, such a system can make much more
effective use of measurement information rather than relying on a
predetermined fixed set of ranges.
   Furthermore, such a system can be seamlessly integrated with the
field guide technique by providing a simple way to pull up textual and
pictorial descriptions of the species in question. Computers could
significantly improve this technique by automatically highlighting the
difference between a set of descriptions so the user can quickly
determine the important features to examine.
   Going even further, the computer provides the opportunity for a
variety of integrated capabilities that are not possible with books.
Books are essentially fixed snapshots of information. Information in a
computer, on the other hand, can be efficiently updated with new
information. For example, a database of descriptions could be extended
by the user to include new descriptions or to comment on, refine or
reorganize existing descriptions.  These changes could then be
efficiently exchanged with other existing, possibly divergent,
databases. Because of the high speed at which computers can duplicate
information and the increasing availability of inexpensive world-wide
networks, this exchange can be done essentially for free.
   Other possibilities include automatic generation of any of the
various types of standard textual descriptions or keys, or maintaining
a record of the identifications made by a particular user as a form of
field notes for a collected specimen. This field note idea could be
expanded further to provide a field note subsystem for correlating
different collections,
                                     7

creating lists of collected specimens, or identifying unusual
specimens that could represent new species.
   The work presented in this thesis focuses on the creation of a
computerized identification system that is most analogous to textual
synoptic keys. The implemented system includes the idea of feature
sorting based on their effectiveness in reducing the size of the
current set, and makes more effective use of measurement information
than is possible with standard textual techniques.  The system also
provides rudimentary support for viewing textual descriptions and
maintaining a log of comments on each member of the database.

1.2   Project Goals

   The most basic goal of this project is to create computer tools
that are of use to taxonomic biologists and mycologists in
particular. In addition, it is intended to provide a way for the user
to easily access recent, accurate taxonomic information. Currently
this type of information can be very difficult for the amateur to
find, since it tends to appear only in scientific journals or to be
out of print. The principal reason for this difficulty it that the
cost of making the information more widely available in printed form
is not justified by the level of demand. In comparison, publishing the
information in electronic form can be done essentially for free
through world-wide networks such as the Internet.
   Another goal of the project is to attempt to take advantage of the
knowledge of people interested in any area of taxonomy at any level of
expertise.  As a result, the system is designed to be applicable to
any taxonomic subject and to be easily extended by amateurs as well as
experts.  Furthermore, since amateurs are expected to extend as well
as use the system, it is designed to work with incomplete information,
and to take advantage of expert's knowledge when it is available.
   By appealing to the expert as well as the novice and by making it
an open system, it is hoped that the databases created with the system
will become lasting resources for taxonomists. Furthermore, as new
taxonomic knowledge develops the system can act as a conduit for the
dissemination of this information.
                                     8

   Pursuant to the goal of being as widely usable as possible, the
system is designed to be as portable as possible between different
computer platforms. One negative side effect of this decision is a
relatively crude user interface for the initial version of the
software. However, now that the initial version has been completed,
more effective, easy to use interfaces can be designed for the
different platforms.
   Finally, although the system is expected to remain free in order to
encourage the free exchange of knowledge, extending the system is not
intended to be without personal reward.  In addition to the obvious
benefit of increased knowledge from the exchange of information, the
system has been designed to properly acknowledge people who work on it
by carefully maintaining records of who created the information as
well as who entered it into the system.
                                     9

2.  Background

   A number of difficult problems arise when trying to create a
computerized taxonomic identification system that is consonant with
all of the goals for this project. This chapter discusses these
issues, possible approaches and existing solutions to them.

2.1   The Issues

   At the surface taxonomic identification seems to be a fairly
straight forward database problem. A set of taxonomic descriptions
need to be compiled and entered into any of a wide variety of
available commercial databases. The database must then be searched for
descriptions that match the features of a particular specimen.
However, this analysis is naive both with respect to modern taxonomy
and current database technology.

2.1.1  Taxonomy

   Looking at most popular field guides one gets the impression that
the different species discussed are clearly defined and agreed upon,
or at the very least that concepts such as species and genera are well
defined and universally agreed upon.
   Unfortunately, this is not the case. In fact one of the primary
tasks of the taxonomic biologist is to review, reorganize and refine
these definitions. During this review process it is not uncommon to
discover multiple descriptions of the same biological entity, but with
different species names. A set of internationally accepted rules
exists that determine which of these descriptions take
precedence. Whichever of these descriptions takes precedence is known
as the `type description' for that species. In some cases it turns out
that a relatively obscure name that is not in wide use has priority
over a common, widely used name. When this happens it can cause a
significant amount of confusion, especially for the unsuspecting
amateur.
   Similarly, it is not uncommon for familiar species to change genus.
This can happen either for historical reasons as with species names,
or when a person working on these
                                    10

species recognizes a logical grouping among the species they are
studying that is sufficiently different from the other species to
warrant recognition at the genus level. These definitions are
fundamentally a matter of educated opinion on the part of the
taxonomist, and as such are often hotly debated. Because the names of
even familiar species are not absolute, the envisioned identification
system must have a sophisticated yet flexible notion of concepts such
as species or genus.  In particular, it must be able to handle name
changes while maintaining the ability to find the information based on
previously used names.
   In addition, the set of features used to describe a species needs
to be flexible. This is especially important as new technology and
techniques are applied to taxonomy creating entirely new classes of
features that the taxonomist is interested in. Advances in technology
can also lead to radical restructuring of the taxonomic hierarchy. For
example, DNA analysis can result in the recognition of relationships
between species that indicate significant errors in previously held
beliefs about these relationships [HV94].
   In general, it is the type descriptions that are of the greatest
importance to the taxonomist.  Type descriptions are typically based
upon what we shall call `collection descriptions' made for a specific
collection known as the `type collection'. Ultimately, type
descriptions are the correct place for names to be stored in a
taxonomic database. In addition, it is clearly important in the ideal
system to support collection descriptions as well as type
descriptions.
   However, neither of these types of descriptions are really what the
less experienced user is interested in. Since collection descriptions
are only for specific collections, they do not necessarily provide
complete information about the range of possibilities that are
generally accepted as part of the definition for that species.  Such
`functional descriptions' are the kind typically found in field
guides. They are usually based on the author's experience with
collecting the species and careful detailed comparisons between what
they collect and the type collections that define the species. The
fungi database portion of this project focuses on such functional
descriptions since they have the widest range of
applicability. However, the Taxy has been designed in a general enough
way that it should be possible to expand
                                    11

it in the future to appropriately handle all three kinds of
descriptions.

2.1.2  Database Issues

   Currently the most successful, standard databases are relational
databases.  These databases are quite powerful and have strong
theoretical underpinnings.  Unfortunately, their development was based
on solving certain types of problems that are not entirely consonant
with the needs for a taxonomic database. In particular, relational
databases are expected to be expressed in terms of small, uniform,
fixed-length records with no internal structure, and with a fixed
`scheme' (most analogous to the set of description features discussed
in this project) [KS91].
   While an experienced relational database developer might well be
able to create an effective identification system using an off the
shelf relational database engine, it is not a natural fit.
Consequently, the resulting system would almost certainly be
inefficient and difficult to use. Particular problems that would be
difficult to handle within the framework of relational databases
include name changes, term synonyms, representation of feature values
as subsets of larger sets such as color, and correctly handling
missing information.  In addition, the functionality carried in the
different feature types described in Chapter 4 would be very difficult
to simulate with a relational database. Even given these difficulties
this approach has apparently been taken by the creators of PolyKey one
of the systems discussed in Section 2.2.
   In recent years, another style of databases known as
object-oriented databases have become popular [KS91]. These databases
deal with many of the limitations of relational databases. However,
the database engines for these tend to be very specialized. Part of
the reason for this is there has yet to emerge a consistent theory
with which to create a truly general purpose object-oriented database.
   The system created for this project can be considered an object
oriented database.  Object oriented databases are characterized as
containing large, complex, often recursively defined objects. The
database developed for this project certainly contains large, complex
                                    12

objects. Strictly speaking none of them are recursively defined,
though it is not uncommon for certain descriptions or features to
functionally dominate others.  For example, genus descriptions
dominate species descriptions, and features that indicate the presence
of a structure dominate features that describe that structure.
   Object oriented databases are also characterized by containing
objects whose response to the same command may vary significantly
based on their type or content. This aspect of Taxy is most clearly
demonstrated by the different types of features that have been
implemented. However, one important distinction between Taxy and a
fully general, object-oriented database is that most of this type of
variation is implemented within the program rather than controlled by
the designer of the database. The designer may choose between the
different types, but they do not have control over the behavior of the
types.
   Finally, object oriented databases often contain `meta-knowledge'
or rules that allow the complex data dependencies to be simply
expressed. Such systems are commonly referred to as `expert
systems'. Because of the power of the synoptic identification system
within Taxy, adding a true rule system has not been necessary. The
only aspect of the system that has a bit of this flavor are the
computed features that allow the database designer to use nested
conditionals to determine the value of a feature based on the values
of other features.
   Because of these similarities it is very likely that the work done
for this project could be efficiently and reasonably implemented
within a general purpose object oriented database if one could be
found. However, currently most database systems that are readily
available require paying a license fee to the database developers for
each copy.  Such a license fee would completely defeat the goal of
making the system freely available to anyone who is interested in the
system. Some systems allow an interface to be attached to a fixed
database and then freely distributed. However, this would defeat the
goal of allowing the system to be user extensible.
                                    13

2.2   Existing Computer Tools for Taxonomy

   For a variety of reasons it is difficult to effectively search for
existing computer systems that perform tasks as specialized as those
discussed in this thesis. In addition, when they are found they are
often without complete referencing information. When possible
bibliographic references are given.  In the cases where only
incomplete information is available, this information is given in the
text.
   A few related commercial products exist, but most of these are
essentially computerized field guides that make no serious attempt at
doing identification.  Exceptions include PolyKey and XID which are
discussed in more detail below.
   Currently, most professional taxonomists are not using computers
for much beyond word processing and some statistical clustering
analyses. The ones that are using databases are generally working with
some commercial database systems and are cataloging information such
as the contents of herbaria, or creating a literature database.
   Other people have certainly attempted to create computerized
identification systems.  Most of these have been created by
taxonomists and tend to be of limited scope.  The few more general
systems that were found were for the most part difficult to use or had
significant design flaws that would prevent them from realizing the
goals set out for Taxy.
   One of the simplest, yet most extensive examples is a database,
`MYC.ASK', created by Gene Yetter. MYC.ASK contains over 5 megabytes
of textual mycological information.  Yetter, an amateur mycologist
from New York, has recently made this large free form fungi database
of notes and species names freely available. The database was created
with the database program `askSam' [ask].  It appears that askSam is
strictly for IBM PC-style machines. However, the database is stored in
an ASCII format and can be accessed through MYCONET, a computerized,
mycological bulletin board system..  Hopefully, it will be possible to
incorporate at least some of the information in MYC.ASK into the Taxy
fungi database.
   Most of the existing systems are small subject specific examples
that do not seem to have been developed any further.  Hypercard on the
Macintosh appears to be the
                                    14

most popular development environment for these systems. The majority
of these keys are simple computerized implementations of dichotomous
keys and do not attempt to take real advantage of the
computer. Examples include Jeremy Ahouse's `Key to the Saxifragaceae'
(a family of flowering plants), `HyperTaxonomy' by Merlin Software
Inc., and `A Key and Guide to the Grasses of North America' by Feil
and Pickering.
   In addition to dichotomous keys, six programs were found that
support synoptic style identification. One is a small, very limited
bird key called `HyperBird', again implemented in Hypercard. The other
five are all serious attempts to address some taxonomic problem. One
recent system is a commercial product called `PolyKey' by Adaskaveg
and Dunlap [AD93].  PolyKey is a computerized synoptic identification
system for the fungi family Polyporaceae.  The information in PolyKey
is based entirely on the recently published two volume work `North
American Polypores' by Gilbertson and Ryvarden [GR87].  The program
itself is implemented on top of the relational database FoxBase
[Mic93]. Unfortunately due to the cost of the system, it was not
possible to actually examine the system and evaluate how the authors
have handled the issues brought up in the discussion of relational
databases.  From discussions with one of the users, it appears that
PolyKey is not user extensible, but does cover its subject in
significant depth. In addition, there appears to be a significant
amount of subject specific information carried in the way the features
are presented to the user.  The closed, commercial nature of PolyKey
as well as its domain specificity clearly differentiate it from the
work presented here.
   `XID', a system developed by Richard Old, is another commercially
available system [Pen94], [Old].  This system has many of the same
feature as Taxy and has been under development significantly
longer. The current version of XID only works on IBM compatible
machines. The interface is menu driven and can encorporate
graphics. It also relies solely on discretized features to represent
identification information about a species. This means that
information such as the size of various components of an organism are
represented using fixed designer chosen ranges rather than the more
natural and general arbitrary floating point ranges used by Taxy. It
also means that relationships between descriptions are either
                                    15

ignored or handled through a different mechanism.
   Databases created for XID can be distributed as standalone static
identification systems.  In addition, an authoring system is
available.  The general orientation of XID is to find experts on a
particular subject and have them create a database related to their
area of expertise. This contrasts with Taxy which allows the
information in the system to be easily extended and attempts to take
advantage of the information available to people at any level of
expertise.
   The final three synoptic systems are `Linnaeus' [ERB+92], `MEKA'
[DM86] and `DELTA' [Web88]. These systems are more similar to Taxy in
terms of their goals. There are several similarities between these
systems. They all run on only a single platform. Apple Macintosh in
the case of Linnaeus and IBM PC-compatibles for the other two. They
are all designed to be general purpose identification systems, and are
freely available. Linnaeus appears to be the most recently developed
of the systems. The other two were developed in the mid to late `80s
and do not appear to have been developed or used significantly since
that time.
   Linnaeus is a HyperCard program of surprising depth.  The system
described in [ERB+92] contains over 30 megabytes of information on
protists (single celled organisms).  They chose Hypercard for the
system \because it offers a highly visual system [sic] programming
environment and greatly eases the construction of the graphic user
interface...".  However, using Hypercard limits their system in
several significant ways. First, Hypercard is for the most part
limited to use with Apple Macintosh systems. While Hypercard may
eventually become available for other systems, it will always be
limited to window oriented graphical environments.  As a result other
common interfaces such as simple TTY-style text or audio interfaces
used by the blind cannot be implemented. Finally, Hypercard is a
proprietary commercial product owned by Apple as opposed to a freely
available language specification such as C or C++.  As a result the
continued support and right to use the language are controlled by
Apple. While the initial versions of Hypercard were essentially freely
available, this access has since been restricted to just the ability
to access information
                                    16

in a hypercard stack.
   The remaining systems, MEKA and DELTA, use extremely crude command
interfaces and very simple techniques for memory allocation. In the
case of MEKA the interface is entirely through commands roughly
equivalent to the low-level commands of Taxy.  The DELTA system, on
the other hand, is composed of a set of separate executables that
operate on rigidly formated command files. This type of interface is
fine for the designer of the system, but is difficult for other people
learning the system.  In addition, neither identification system
appears to do any runtime reallocation, but rather require their input
files to provide this information initially.  The database editing for
both systems is separate from the identification system.  MEKA
provides a simple, specialized editor called MEKAEDIT that permits
relatively convenient editing of its database files. DELTA, on the
other hand, expects the user to understand the various file formats
and correctly edit the inputs with the user's favorite
editor. Unfortunately, the file format used by DELTA has been adopted
by the International Botanical Congress as the \world standard" for
representing botanical and entomological information. As a result, it
may be appropriate for some future version of Taxy to be able to read
and write files in this format in addition to its current, more
general format.
   Finally, according to the documentation for both systems the
authors hoped that the systems would achieve widespread use and allow
the taxonomic databases created with them to be shared.  However, no
features to support this exchange or to allow multiple people to work
on a single database are described. In comparison, Taxy provides a set
of high-level commands that are comparatively easy to learn. In
addition, database editing and database use in Taxy are fully
integrated and all memory allocation is dynamic. Taxy also includes
features that allow databases to be merged easily and takes into
consideration some of the issues that must be addressed to create a
persistent database that many people have worked on.
   As far as functionality, MEKA is the simpler system. Its sole
purpose is identification.  The data consist of a set of binary
attributes which it refers to as `characters' and a set
                                    17

of `taxa' that consist of values for the characters. The user builds
up a set of characters that they can then use to select from the
entire set of taxa. Since characters can be added or deleted, the
system does not constrain successive selections. The value of a
character across the selected taxa can be examined in order to
evaluate its distribution within the selected set. MEKA also provides
a way to search for neighboring descriptions by allowing some number
of mismatches between the user created key and the selected
species. This is similar to Taxy's proposed Weaken command described
in Section 3.2.1.
   DELTA, on the other hand, is primarily oriented towards the
automatic creation of textual keys (either dichotomous or synoptic)
and natural language descriptions based on the user's data. These
capabilities appear to be quite powerful, and have been used to
produce scientific publications [WD85], [Web87]. Neither MEKA nor Taxy
support these functions.  Even though it is not the main emphasis of
the project, DELTA's identification system is quite powerful. DELTA's
data consists of a set of features with an arbitrary number of
discrete values. The identification process is fundamentally one of
successive refinement of the set of descriptions provided in the input
file. In particular, once a give description has been eliminated, it
can only be restored by restarting the identification
process. However, similar to MEKA, DELTA permits a user controlled
number of mismatches in feature values before a description is
eliminated. As with Taxy, DELTA provides an automatic feature sorting
function to determine which characters best divide the remaining
descriptions. DELTA also deals with unknown and inapplicable features,
though in a less detailed way than Taxy.
   In summary, all of the identification systems are significantly
weaker than Taxy either on the basis of the generality, flexibility,
or the power of the feature set. The only known features that other
systems support that Taxy does not are the automatic generation of
keys and natural language description. In addition, support for images
as well as a method for tolerating mismatches during the
identification process have not yet been implement in Taxy. Concrete
proposals for supporting images, key generation and tolerating
mismatches are discussed in Section 3.5 and Chapter 6.
                                    18

3.  The Interface

   The only interface to the current system is command line oriented.
The command line interface was developed first principally to simplify
cross-platform development. The approach has paid off since the
typical time to move the system to other types of computers is between
one and two hours, most of which is spent learning how to work with
that system's compiler. A more user-friendly graphical interface is
planned for future versions, however the command line interface will
remain available for consistency with previous versions and to support
scripting.
   Several of the commands switch the system into a different mode,
e.g.  species identification mode or editing mode.  In general, the
modes follow a consistent design.  The general syntax of the command
interface is a command followed by a space, followed by a
comma-separated sequence of arguments. Wherever possible the system is
case insensitive.  A general command abbreviation system makes it
easier for the experienced user to use the system.  The abbreviation
system depends on the capitalization of the name of the command as it
appears in its hard coded definition. When the name of a command
appears in a help screen it is always shown in the correct
capitalization so an inexperienced user quickly learns the correct
abbreviation for an unfamiliar command.
   Almost all screen printing within the system is handled by a single
low-level printing routine. This routine provides automatic, user
controlled screen paging to stop information from unintentionally
scrolling off the top of the screen.  It also allows the output of a
command to be redirected to a file for use with other programs.
   The first section of this chapter discusses three important
concepts that are needed to understand the material that follows. The
rest of this chapter is organized to reflect the value of various
commands to the expected user of the system. The most important
commands from the user's perspective are the `high-level', database
specific commands created for working with the fungi database.  The
high-level commands include the identification command as well as
commands for editing and extending the fungi database. In addition,
                                    19

the user is expected to make frequent use of the help facility.
Finally, several low-level commands that users will frequently use are
described.

3.1   Concepts

   Before describing the interface in detail it is necessary to more
precisely explain some common concepts used throughout Taxy.
   The simplest of these concepts is the term. Terms are simple data
structures used in Taxy to handle synonymy. A term consists of a set
of strings with one of the strings marked as the `canonical'
string. The canonical string is used for all output. The remaining
strings are considered to be equivalent to the canonical string. For
example, there might be a term whose canonical string is Red with
Crimson and Scarlet as alternates. If the user were to enter Crimson
for some value, the system would use this term as the value. If the
user later asked for the value, the system would output Red. Terms are
used widely throughout the system and can be thought of as
replacements for simple strings.
   Figure 3.1 illustrates the relationships between the more complex
concepts.  At the highest level a Taxy database is composed of two
types of database elements, descriptions and features. In the figure,
the rectangles labeled Da and Db are the descriptions and the circles
Fa through Ff are the features. Each database element has a unique id
string. For convenience id strings are usually the same as or closely
related to a name associated with the database element such as the
species name. However, if there are two descriptions of the same
species in a database they will have different id strings. The
italicized labels in the illustration could be interpreted as the id
strings
   Categories provide a very general way to group database elements.
Each database element belongs to a list of categories. Categories
provide a way for the system to handle important distinctions between
descriptions such as the difference between people and fungi and to
provide a way to control the applicability of features to
descriptions. In the figure, the large irregular shapes labelled C1
and C2 represent the cateogries. The description Da and features Fa
through Fd are in category C1. The other database elements, along with
                             20

Figure 3.1: Schematic Relationship Between Important Concepts in Taxy
                                    21

Fd, are in category C2. As an example C1 might be the category
`fungi', C2 might be the category `people' and Fd might be the feature
indicating the person who last modified a particular description.
   Descriptions represent the subjects of the database, such as
species or genera of fungi, or other important entities such as people
or books. Descriptions are composed of a list of feature values. These
are illustrated by the different sub-rectangles within the
descriptions.
   Features are attributes of the entities described in the database
by the descriptions, e.g.  color, edibility, size or who last edited
the description are all features. Most features have specified ranges
of accepted values. For example, edibility might have the values
Edible, Inedible and Poisonous. Such ranges are illustrated by the
values inside the circles.
   Features can differ in type. The illustration includes term
features (Fa, Fb, Fe and Ff) that are used to represent features, like
edibility or color.  The values for term features in a description are
subsets of a set of possible terms. Fc represents a float range
feature which is for numeric values such as cap size. Fd represents a
description feature. In this case, the range is a category and a
description's value is a list of descriptions from this
category. Description features are used to indicate relationships
between descriptions such as the genus of a species or the person who
last edited a particular description.
   A given feature is not always meaningful with respect to a given
description. There are two possible reasons for this.  First, the
description may be of a type of entity that the feature was never
intended to be applied to. For example, it doesn't really make sense
to have a value for the edibility of a person. Control based on the
type of the description is handled by categories.
   The other case is when a feature depends on the value of another
feature in order to be meaningful. For example, it does not make sense
to talk about the color of a mushroom's partial veil if it does not
have one. Such relationship between features is handled through
activators. A feature's activator is a value for another feature. If
the value for the second feature intersects this value in a particular
description then the first feature is considered active in that
description.
                                    22

   The small arrows between features in the figure indicate
activators. Several possible uses of activators are illustrated. The
description Da has the value A, B for feature Fa. As a result both the
features Fb and Fc are active since the activator values and the
description value intersect.  This means that it makes sense for Da to
have values for both of these features. In fact the description has
the value D for Fb but no value for Fc. The lack of value means that
the value is unknown. Description Db illustrates the case where a
feature, Ff, is not active.
   Interestingly, it is possible to represent categories using just
features and activators. In the example, one could add a feature Fg
whose possible values are C1 and C2. Adding an activator to Fa with
the value C1, another to Fd with the value C1, C2, and another to Fe
with the value C2, along with appropriate values for the descriptions
completely captures the grouping information provided by
categories. However, the distinction has proved to be useful for
database design. In addition, much of the design of the high-level
commands depends on testing category membership which would be more
difficult using features.

3.2   High-Level Commands

   The high-level commands provide relatively user friendly, database
specific interaction modes. The ultimate intention is for these
commands to be built up as scripts from the more general, low-level,
database independent commands. However due to time constraints it was
necessary to implement the existing high-level commands directly in
the underlying language, C++. In general, high-level commands are
distinguished by their dependence on particular categories or features
within a database for them to perform their function.
   In most cases the high-level commands are modal. This means that
after the command is given the user enters a different interaction
mode.  As a result most of the top-level commands are no longer
directly accessible, and a new set of `sub-commands' become
available. In order to minimize user confusion and to ease learning,
functionally identical sub-commands in the different modes always have
the same names.  There are several universally available commands.  In
particular, the `!'  command always terminates the
                                    23

current high-level command returning the user to the original
top-level command mode.  In addition, the help and search systems can
always be accessed with the `?' and `$' sub-commands
respectively. Finally, the `>' sub-command allows the user to save a
copy of the `log file'. A log file is a list of all the commands a
user has given. One is always kept in order to help reproduce
errors. If no error occurs the log file is automatically deleted and
the sequences of commands is lost unless the `>' has been used to save
it.

3.2.1  Identification System

   Identification using the fungi database is done with the IdSpecies
command. As with all the interactive commands, IdSpecies supports the
sub-commands `?', `$', `!', and `>'.
   Prior to the command prompt, IdSpecies always indicates to the user
how many species remain for them to distinguish between and gives the
Latin name of at most two of them.  It also lists the name of a
feature. This is the current feature. The entire set of features is
ordered by the database designer. Initially the current feature is set
to the first `interesting' feature in this list. The exact meaning of
`interesting' is discussed below under the feature analyzing and
sorting sub-commands, but in general it refers to features that make
sense in light of the remaining set of species and have yet to be
given values.
   The IdSpecies command uses the concept of a selection to actually
perform an identification. There are a set of sub-commands that modify
the current selection. These include the basic sub-commands Keep,
Remove, and Add, as well as the sub-commands UnDo and ReDo whose
effects depend on the previous history of sub-commands. After
discussing these sub-commands in more detail, we will discuss
sub-commands for changing the current feature followed by the
sub-commands which provide the user with information about the
features.

Selections

   Central to actually doing an identification is the
selection. Currently selections are only used by the IdSpecies
command, but they are designed to be used more generally in the
                                    24

future. Conceptually a selection is a set of descriptions. The goal of
IdSpecies is to help in reducing the set in order to make an
identification. Like database elements, selections can be members of
categories. The description list for a selection is initialized by
searching the set of descriptions in a database for descriptions that
share some category with the selection. In the future it will be
possible to develop a selection in one database, but then apply it to
another. This would be useful if one were working with an small less
complete database and reached a dead end. The selection could then be
saved and later applied to a more complete database which would
hopefully provide an identification. The complete list of descriptions
in the current selection is given by the DescriptionList sub-command.
   A selection is usually changed with selection actions. When a
selection action is applied to a selection it is automatically
recorded in a list associated with the selection. This list allows
actions to be undone or, after an action has been undone, to be
redone. This allows the user to easily backup during the
identification process to follow a different path. This is useful if
for some reason the user decides they have made a mistake or simply
wish to explore another option.
   The possible actions that can be applied to a selection are keep,
delete, and add, and are performed with the Keep, Delete, and Add
sub-commands, respectively. All of these actions are with respect to a
range of values for the current feature. This range is given in the
arguments to the sub-commands.
   The keep action removes from the selection those descriptions that
do not match the value, thus `keeping' the matches.  The delete action
removes from the selection those descriptions that match the given
value. Finally, the add action looks through the database for matches
and adds them to the selection.
   The exact semantics of calling these three sub-commands with
multiple arguments were carefully chosen to provide as much expressive
power as possible. The best to remember the semantics is to remember
that the meaning of giving a single sub-command two arguments is
always different from that of giving the same sub-command twice, once
with each argument.  For example, the command Keep orange, yellow will
keep any descriptions whose value
                                    25

for the current features contains either the value orange or the value
yellow. On the other hand, the command Keep orange followed by the
command Keep yellow will, of course, result in the set of only those
descriptions that contain both values. Similarly, the command Delete
orange, yellow removes those descriptions that contain both values,
since the command Delete orange followed by Delete yellow would remove
all descriptions with either value. Finally, the command Add orange,
yellow adds only those descriptions that have both values by a similar
argument.
   This choice of semantics results in a nice duality between the keep
and delete actions.  The result of applying one of them with multiple
arguments is the complement, with respect to the original selection,
of the set that would result from applying the other action multiple
times with each argument in turn. The dual of the Add command, on the
other hand, has no obvious, intuitive meaning and was not implemented.

Undo and Redo

   Returning to the sub-commands of the IdSpecies command, the UnDo
sub-command restores the list of descriptions to the set it was prior
to the last action.  The ReDo command re-applies an undone action. The
UnDo sub-command is actually implemented by reinitializing the
selection and going through the entire list of actions except the
last.  This implementation saves space and increases the efficiency of
the regular application of the actions. It is possible that in the
future this will prove to be unacceptably slow and will need to be
implemented through a set of check points or by finding a way to
actually reverse an action, but so far it has not been a problem.

Features

   The simplest feature related sub-command is FeatureList which
provides a list of all features the system currently considers to be
`interesting'. A feature is consider interesting with respect to a
selection if it is in the `id' category, has not yet been used in any
selection action and it has a value in at least one selected
description. In addition, a feature is not
                                    26

considered interesting if it has an activator and the feature is
`inactive' in any description.  Inactive in this case means that the
description has a value for some feature in the chain of activators
for the feature and the value does not intersect the activation value.
   While the exact description of the method of selection may seem a
little complicated the results are quite intuitive. The `id' category
provides a way to exclude unobservable features such as edibility, who
last edited this description, etc. Excluding features that have been
already given a value makes intuitive sense. Similarly, if the value
for a given feature is unknown for all members of selection it should
be excluded. Finally, it is important that the features all make sense
with respect to the entire set of selections, which is ensured by
excluding the inactive features. One nice side effect of this method
of selection, is that as soon as a feature is active in the entire
selection it is automatically added to the list. For example, if the
user has selected all the members of the genus Lactarius, most of the
pileus features are immediately active since all member of this genus
are stipitate agarics and have pilei even though the user has not
explicitly specified that this is the case.
   Whenever a feature is listed throughout the system it is preceded
by a list of all the features and values in its activator chain.  For
example, the feature for describing the color of the partial veil
would appears as Overall shape[Stipitate agaric]:Partial
Veil[Present]:Color.  This indicates that this feature only makes
sense for when the Partial Veil feature has the value Present, which
in turn depends on the Overall shape being Stipitate agaric.
   The user can change the current feature with the Next, Previous and
Find sub-commands. The Next and Previous commands follow the standard
feature order to find the interesting feature immediately following or
just prior to the current feature.  Both commands can be given an
optional set of strings as argument.  If this is done then the nearest
interesting feature in the given direction that contains all the
strings is selected. If the end (or beginning) of the feature list is
found, the user is asked if they wish to wrap around to the beginning
(or end) of the list. If no match is found the user is notified and
the current feature remains unchanged. Recent user feedback has
suggested either eliminating
                                    27

or at least changing these two sub-commands. The commands could safely
be eliminated since the Find command can used for all feature
navigation. The proposed change would be to use the order indicated by
the most recent feature sorting operation rather than the standard
feature order.
   The Find command allows the user to jump to any feature,
interesting or uninteresting.  This command is always given at least
one argument.  The first argument is checked to see if it is the id
string for a feature. If so then that feature is searched for any
remaining arguments, and if a match is made that feature is made
current. Otherwise, the entire set of features are searched in the
standard order from the beginning until a match is found.

Feature Analysis and Sorting

   The most difficult task for the person using the IdSpecies command
is deciding which feature to provide a value for next. Two
sub-commands are provided to assist the user in this task: FeatureSort
and FeatureAnalyze.  FeatureAnalyze takes a single feature and
presents the user with the distribution of the descriptions across the
possible values.  FeatureSort computes for each interesting feature
the expected set size from these distributions. It then sorts the
features based on this value.
   Throughout the various stages of this project a number of different
techniques for evaluating and sorting features were explored.  For the
purposes of this discussion only the final technique and two possible
generalizations will be discussed. A detail presentation of the final
technique is given in Appendix A.
   The general idea of the technique currently used for sorting is to
find how many descriptions would remain if the user provides one of
the possible values for a feature.  A weighted average of these values
is then taken.  The current technique assumes that all species are
equally likely so the weights are simply the normalized set sizes. One
possible generalization of the current method would be to consider the
possibility that the user may enter more than one value for a given
feature. However, the increased computational cost
                                    28

and the difficulty of deciding on a reasonable weighting scheme for
such multi-value input make this generalization of questionable value.
   The other generalization would be to implement a simple learning
system for refining the weights. The learning system would attempt to
learn the probability of each species based on the number of times it
had been identified with the system. The computational cost of such a
learning system would not be a major issue. It could also provide
valuable automatic feedback on how the system is being used. The
problems with this generalization include a significant potential for
inappropriate biases and user confusion, as well as difficulties
involved in merging divergent databases.

3.2.2  High-Level Editing

   There are currently two high-level commands that add or modify the
current database, EditPerson and EditSpecies. Both commands are set up
as a sequence of stages. The user can move from one stage to the next
with the `.' command. The first stage is always for name editing and
the last stage requests the user to add free form comments about
whatever they have been editing. The name editing stage includes both
determining the description's unique id string and any description
specific names such as common names for fungi species or name variants
and nicknames for people.  Such names are stored as terms and use a
common name editing routine for adding and removing alternates as well
as setting the canonical value of the term.
   The EditPerson command has only a name editing stage and a comments
stage.  The EditSpecies command includes two additional stages, a
general editing stage and a specialized sub-stage for efficiently
adding values for features that currently have no value.  Both of the
commands automatically update a set of `automatic' fields in the
description being edited. These fields are discussed in detail in
Chapter 5.

EditPerson

   EditPerson is principally used to add people's names to the system
so they can be
                                    29

associated with their work. As with all database elements the
description of a person has a unique id string associated with it.  If
a user is going to add information to the fungi database they are
expected to tell the system who they are with the SetCurrentUser
command. This command takes a single argument which is expected to be
the id string of their description.
   Person descriptions are distinguished from other descriptions by
being members of the `people' category. In addition to allowing the
new users to select their unique id string, the EditPerson command
allows the user to provide their full name as well as any number of
alternate names by which they may be recognized. These names are
stored in a term that becomes the value of the Name feature for the
new description. Finally, EditPerson gives the user the opportunity to
add text comments about the person through the Comments file
feature. When comments are added they are automatically preceded by
the name of the current user and a date and time string.
   In addition to creating new person descriptions the EditPerson
command can be given an existing id string as an argument. The user
can then edit the person's name term by either adding new names or
removing old ones. They may also enter new comments.

EditSpecies

   The EditSpecies command is the most important command for adding
new information to the fungi database. All the descriptions created
with this command are put in the category `fungi'.  In addition to
creating descriptions for species, this command will automatically
create minimal genus descriptions if the new species is in a genus the
system does not yet recognize. These descriptions are added to the
`species' and `genus' categories respectively.
   In the name editing stage the system requests the Latin name (genus
and species) of the species to be edited. The system parses the name
into the genus and species name. If no description matching the genus
name currently exists in the system, the user is asked if they wish to
add a new genus description. Currently the genus descriptions do
little more
                                    30

than hold the name of the genus. However, even this minimal
information is useful since it provides an easy, consistent way to
change the name of a genus or to add synonyms for a genus as discussed
in Section 5.1.3.
   If the system finds another description matching the given genus
and species names, the user is asked if they wish to edit that
description. If not they are asked for the id string they wish to
use. If that id string matches an existing description, then that
description is edited, otherwise a new description is created. After
finding or creating the species description, the system asks the user
to add any common names using the standard name editing routine
mentioned above.
   After the name editing stage is the edit stage. While in the edit
stage there is always a `current feature' that the user is assumed to
be working on. The current feature can be changed by the sub-commands
Next, Previous and Find. The only difference between the way these
commands work in this context and that described for the IdSpecies
command is due to a slightly different interpretation of when a
feature is `interesting'. In the case of EditSpecies a feature is
considered interesting if it is in the category `fungi' and not in
either of the categories `automatic' or `file'. Furthermore, the
feature's activator value must intersect the value provided for the
activator's feature.
   The edit stage has a collection of sub-commands for changing the
value of the current feature. These are Extend, Reduce, Set, and Clear
with the expected meanings. The argument syntax for the first three of
these varies depending on the feature type as well as between the
commands for a particular feature. The user can find out the syntax
for a particular command for the current feature by giving the command
`?' as the argument.
   The editing stage is terminated with either the `.' or the
`!'. Even if the `!' command is used, the user is always asked if they
wish to add comments to the current species.
   A specialize sub-stage known as an `unknown review' can be started
from the more general edit stage using the sub-command
ReviewUnknowns. The idea for this stage is to make a single pass
through all the features in the database and search the description
for any unset values. A feature is considered unset with respect to a
description if it shares
                                    31

a category with the description, has no value in the description and
the activator for the feature, if one exists, is set in the
description.
   When an unset feature is found the user is asked to provide a
value. To facilitate this process the unknown review stage does not
conform to the standard command syntax used elsewhere in the system.
Instead the user is expected to simply give the value for the feature
in question. However, the sub-commands `?', `$', `>', `.', and `!' are
still recognized to provide help, to search the database, to save the
log file, to quit the add stage, or to exit the EditSpecies command
respectively. The unknown review stage is extremely helpful for
quickly entering information into new descriptions or for filling in
the values for newly added features. The syntax for entering the
values depends on the type of the feature. The default help screen
while in the unknown review stage describes the syntax needed for the
current feature. In addition, the range of possible values for the
feature is displayed prior to the prompt.
   In both the unknown review and the more general edit stage, after
the user enters a value, it is checked to make sure it is valid. If it
is not, but the range of the feature could be extended to make it
valid, the user is asked if they wish to extend the feature.

3.3   Help System

   Help is provided through two commands, Help and Search, or with
just the symbols `?'  and `$' respectively. Both of these commands or
at least their abbreviations are available anytime the system is
waiting for input. The Help command searches for the best matches to
its arguments and provides detailed help. If no arguments are given
then a default, mode dependent, help screen is displayed. The Search
command, on the other hand, performs a thorough search through the
entire database for anything matching its arguments. It then gives a
small amount of information about each match.
   Help always searches a general help file for a match.  The help
files are set up as keyword, value pairs that get loaded into a
hash-table.  The values are blocks of text of arbitrary length. When
searching the help file, the first argument is first used as a key for
                                    32

the hash-table. If a match is found then the corresponding value is
searched for any other arguments given to the command. If this process
fails, then all the values of the hash-table are searched for any
single value that matches all of the arguments.
   If a database is loaded, then after searching the general help file
and either finding a match or not.  The Help command does a similar
search on any help file associated with that database.  The command
then searches for matches in the database itself.  If performs
separate searches of the features and descriptions.  For each of
these, a match with the unique id string is first sought. If this
fails then any other name related strings are searched. If a match is
found, then a complete description of that database element is
shown. In summary, in the most extreme case, the help facility might
print two blocks of text (one from each help file), print out a single
description, and a single feature.
   The Search command, on the other hand, looks through all the
descriptions and features of the current database. The name and id
string of any matching database elements are given.  After a search is
performed, the help command can be used to provide a more detailed
description of any of the results of the search.

3.4   Low-Level Commands

   Most of these commands are an exhaustive set of simple commands
that allow the user to view and edit all aspects of the database. The
complete list of these along with brief descriptions are given in
Appendix B. Most of the low-level commands are not expected to be used
by the typical user of the system but are provided as building blocks
for creating completely new databases. Eventually the low-level
commands will be incorporated into a general purpose interpreted
scripting language such as John Ousterhout's Tool Command Language
(TCL) [RJ94] to provide an easy way for users to create new
high-level, database specific, interactive commands.  A simple system
for setting a particular description or feature to be the default for
later commands greatly reduces the tedium of building a database by
hand and should eventually allow for a simple yet powerful way to
create
                                    33

command macros. For most of the common uses for the system, in
particular the high-level commands, these defaults can be safely
ignored.
   Some of the low-level commands are used regularly by users and
should be at least briefly discussed.  The Quit command exits the
system after warning the user of any unsaved changes and ensuring they
actually wish to leave. The DBOpen expects a filename as an argument
and tries to load a database from that file. Multiple databases maybe
loaded at the same time. The user can select a loaded database with
the command SetCurrentDB.  DBSave saves any changes made to the loaded
database. By default it will write over any existing database file. If
it is given an argument it will save a copy of the database to that
name.
   The other commonly used low-level commands are for getting
information about the database or elements contained in it. DBDescribe
gives a short description of the loaded database including the
database id string, filename, help file, database version number, and
the number of features and descriptions in the
database. DescriptionList gives a list of the print names and id
strings for all the descriptions in the system. DescriptionDescribe
gives a detailed description of a set of descriptions given as
arguments. If no arguments are given then the default description is
described. FeatureList and FeatureDescribe provide similar information
about features.  Finally, DescriptionCompare, FeatureCompare and
DBCompare provide an automated way to do comparisons.

3.5   Future Possibilities

   The work done so far on species identification has suggested some
interesting possibilities for future extensions.  One possibility is
the addition of a weaken action to the IdSpecies command. The purpose
of this action would be to allow the user to have the system
automatically find neighbors of the current selection. The technique
currently under consideration would make weaken a `meta-action' by
going through the list of regular actions used to create the current
selection and finding all selections that could have been made if any
one of the actions were removed from the list. The resulting selection
would be
                                    34

the union of all these sets. The goal of this procedure would be to
increase the confidence in the identification. Ideally after weakening
the selection the user would add additional actions to again reduce
the selection. This new set of actions could again be weakened and the
process repeated.
   The ultimate goal would be to have the weaken action have no effect
on the selection.  This would increase the confidence in the
identification since there would have to be at least two separate
actions separating the final selection from all other
possibilities. Another use for the weakened set would be to help
increase the confidence in some unobservable feature such as
edibility. If all the members of the weakened set are edible then this
greatly decreases the chance that a dangerous misidentification has
occurred. Obviously, this does not substitute for experience and
knowledge, and should not be used as the sole form of verification for
something as critical as edibility.
   Another interesting possibility would be to use the feature sorting
heuristic to automatically construct dichotomous keys. The problem of
constructing a dichotomous key with the least depth is a provably hard
problem. Formally, the problem requires the solution of an NP-complete
decision problem. Functionally, this means that the problem is a
member of a class of problems that as far as is currently known can
only be solved by algorithms whose running times are exponentially
related to the size of its input, in this case the number of
descriptions.
   The underlying decision problem is whether there exists a tree of
the type constructed for the dichotomous key of height less than some
value N. The proof that this is NP-complete is fairly straight
forward. The problem is in NP (a larger class that includes most
algorithms with less than exponential running times as well as all
NP-complete problems) since a tree of height N is a succinct
certificate for the problem. That it is NP-hard can be shown by using
a solution for this problem to solve another known NP-complete
problem. In this case the decision problem underlying the minimal
depth binary decision tree problem will suffice.  This problem is a
simplification of the dichotomous key problem since it restricts
itself to binary trees. A full explanation of this problem and the
proof of its NP-completeness is
                                    35

presented in [Riv87].
   Even though finding the optimal tree is an NP-complete problem, it
seems likely that a very good solution can be created with the simple
greedy method of selecting at each node the feature that is best
according to the sorting heuristic. Technically, this would not be a
dichotomous key unless all the features had only two possible
values. However, such multi-branched keys are already common variants
of regular dichotomous keys.  If there was some strong motivation for
the nodes to always be binary, the values for a feature with more than
two values could be automatically combined in the way that produced
the least overlap between the two sets. Alternatively, a small binary
tree that performed a binary search on the possible values could be
inserted in place of the original node.
   A key created in this way would have a slightly different flavor
from the regular hand constructed dichotomous key. In traditional
dichotomous keys, the designer tries very hard to cleanly divide the
set of species below a given node into two independent sets.  The
method proposed here would not have such a bias.  In fact, it would
naturally produce nodes where more than one option could be correct
for a single specimen, but any single, accurate choice would still
lead to the correct species. It would be interesting to produce such a
key and compare it to one produced by hand.  If a dichotomous key
generation feature were added to the system, it seems likely that a
mechanism for user selection of features would be desirable, both for
pedagogic reasons and potentially to reduce the depth of the tree.
   Finally, there are a handful of relatively simple improvements that
would help the current interface. First, it would be useful to provide
an automated way to merge two divergent databases. Currently it is
possible to compare two database, but the details of a way to merge
them have yet to be worked out.
   At the moment, the only major bottle-neck in the current system is
the database loading speed. Any improvement in this area would be
valuable. At the moment Taxy always loads the entire database into
memory. Use of a more sophisticated file access and storage method
would allow the system to run more effectively on smaller, slower
computers.
                                    36

   Finally, increasing the separation between `official' database
entries and recently added user info would be valuable. This
separation would both help with database merging and would help
alleviate any fears users might have with regard to modifying
information in the database.
                                    37

4.  The Internals

   Taxy is written in the object oriented language C++.  In keeping
with the intended style for C++ much of the code is directly
associated with the data objects used by Taxy.  Consequently this
discussion is organized around these data structures.
   The data structures include a handful of simple utility objects
which will be discussed briefly in the first section as most of the
more complex objects make use of them. This is followed by a
discussion of the database object, and in turn by discussions of the
different objects that are contained in a database. The final section
describes the file format used for storing the objects.

4.1   Standard Utility Objects

   The most widespread utility object is the list. In addition,
hash-tables, which provide an efficient mechanism for associating
strings with particular objects, as well as several specialized string
related types and objects are used throughout the system. Finally, a
set of objects called `escape streams' handle most of the system's
input and output.

4.1.1  Lists

   In this program lists provide the simplest way to make arbitrary,
ordered collections of objects. The generally preferred method for
implementing lists in C++ is to use templates.  Unfortunately, since
templates are relatively new additions to the C++ language, some of
the implementations used for compiling the system did not adequately
support templates.  As a result C++ objects were used instead. All
objects that can be members of lists are either considered to be
permanent objects, such as strings (see below) or numbers, or have a
reference count that is updated whenever the object is added or
removed from a list. If an object is removed from a list and its
reference count is zero, the object is destroyed.
                                    38

4.1.2  Hash-Tables

   Hash-tables provide a very efficient way to lookup a particular
data structure based on some piece of information or `key'. All of the
hash-tables in this program use strings as keys.  There are actually
three distinct types of hash-tables used in the system.  All strings
in the system are stored in a single specialized hash-table known as
the string hashtable. The string hash-table provides a canonical
instance of each string used in the system as discussed more fully
below. The second type of hash-table is the reference hash-table.
These store reference counted objects such as database
elements. Reference hash-tables are used most often in time consuming
search operations. To boost efficiency, reference hashtables take
advantage of the canonical strings provided by the string
hash-table. Finally, the id hash-table is a case insensitive
hash-table used to provide the unique id strings described below.

4.1.3  String Objects

   Most of the information for this system is stored in character
strings.  As a result, a consistent method of dealing with strings
throughout the system is critical. At the lowest level, strings are
handled by the string hash-table. The strings stored in this table
include both regular strings used for printing and comparison as well
as strings used to create other important components of the system.
These include the set of unique database element identifiers, or `id
strings', the names of `categories' used to create various groups of
database elements, and strings used to build `terms' which provide a
general mechanism for handling synonyms.

The String Hash-Table

   Since many strings, for example the string `White', are used many
times in a particular database a canonical instance of each string is
stored in a special string hash-table. The string hash-table is
designed to find a particular copy of a given string given any
instance of that string. Whenever a new string is created either by
reading it from a file or as input
                                    39

from the user, the string hash table is immediately consulted to find
the canonical version of that string. This allows a significant amount
of space to be saved as well as increasing the speed of many string
comparisons. Initially the string hash-table was made to be case
insensitive. However, a small amount of testing showed only a small
gain in terms of space (0.5%) from this.  On the other hand, making
the hash-table case sensitive sped up file loading, by elminating case
conversion, by 2%. While this is not a huge gain, file loading is the
only significant bottle neck in the system so any improvement helps.
   Furthermore, a more subtle problem is introduced by using case
insensitive hashing.  For output case is considered important. For
example, the system always capitalizes genus names and prints species
names in lowercase.  While these simple cases can be handled
algorithmically, the case of people's names cannot, e.g. McIlvaine. As
a result, when case insensitive hashing was done rather than
converting every string to a standard form, a particular string with a
particular capitalization was stored as the canonical form of the
string. This had the possible negative result that if the canonical
form, usually the first version of the string seen by the system, had
incorrect capitalization it would persist.

Id Strings

   In most databases there is some type of unique `key' associated
with each member of a database. In some cases the unique key is
created by combining a set of fields in each element to create values
that are known to be unique.  Initially, the Latin binomial of a
species seemed like a good candidate for the key.  However, this will
not work since the system needs to be able to deal with descriptions
that have no official Latin name yet and with cases where more than
one description are considered to have the same Latin name.  As a
result a separate, unique `id string' is associated with each database
element.
   In addition to providing a unique string to reference a given
element of the database, the id string is a convenient string to use
for associating external information, such as files containing text or
images, with an element. In order to make this use work in all the
desired environments (Microsoft-DOS in particular), it was necessary
to restrict the length of the
                                    40

id strings to eight characters and to make them case insensitive.  If
in the future these limitations are considered too stringent, they
could easily be relaxed by dissociating the two uses of the id strings
and by taking advantage of the hierarchical structure of the file
systems of all the machines.
   The id strings are, of course, stored in the string hash-table. An
additional specialized hash-table is associated with each database and
used to ensure the uniqueness of a new id string within that
database. In addition, there is a function that is guaranteed to
return an unused id string (if one exists) derived from any given
string. This will be particularly useful for the merge command, when
the two databases being merged happen to have used the same id string
for different elements. It also provides a way for the system to
propose an alternative string when the user attempts to reuse an
already existing id string.

Categories

   Each database element has associated with it a special set of
strings known as categories.  Categories provide a very general way to
group together similar or related database elements.  For example all
descriptions of fungi genera are in the categories `genus' and
`fungi', all descriptions of species are in the categories `species'
and `fungi' and all descriptions of people are in the category
`people'.  This provides an easy way to collect together these
different sets of descriptions for various purposes.
   A list of existing categories is maintained and category membership
can be efficiently tested. However, no particularly efficient way of
finding the members of a given category has been created. This could
easily be done in later versions of the system should it become
desirable.

Terms

   Terms are a simple way of handling synonyms within the system. A
term consists of a canonical string which is used by the system for
output, and a list of alternate strings that are considered to be
equivalent. Terms are typically used when a user has to choose
                                    41

between a set of possible values. When the user provides their input,
the set of canonical strings are searched.  If no match is found, then
the set of alternatives for each term is searched. This allows the
user to use more familiar, less precise terms, such as `stem', and
have the system automatically translate it to the more precise
scientific term, in this case `stipe'. There is no constraint that a
given string be used in only one term, or even that the canonical
strings be distinct. This certainly has the potential to lead to
confusion, but this has yet to cause any noticeable problems. In
general, terms are not shared between different lists of possible
values.
   One of the intended uses for terms is to support foreign
languages. Terms will ultimately be configurable by a user in order to
add alternatives or to choose the canonical string. Term
configurations could then be saved and reloaded later.

4.1.4  Escape Streams

   Escape streams are C++ objects that support the reading and writing
of characters, strings, and numbers, while ensuring that some set of
characters are reserved for special purposes. If a string containing
one of these characters is written to an escape stream then it will be
automatically preceded by the `escape character'. Similarly when such
streams are read, special characters will only be returned as part of
a string or number if they are preceded by the escape character. One
of the most important features of escape streams is that they
guarantee that any string can be written to a stream and successfully
read back without getting modified.
   Escape streams actually use two sets of special characters. The
first set are assumed to have special meaning and automatically
terminate whatever type of object is being read. The second set are
characters that are automatically ignored when reading from a
stream. As an example the database file format described in Section
4.4 uses the characters #*;:()[]<>| as special characters and
automatically ignores the tab and carriage-return characters unless
they are preceded by the escape character, \.  The escape character is
automatically considered a special character.
                                    42

   Escape streams were written to be general purpose and are not
extremely efficient. It is quite likely that they are a major
contributor to the time it takes to load a database. It would be
interesting to try writing a parser for this purpose using standard
compiler parser generation tools, such lex and yacc, and comparing the
performance.  It would be very surprising if the standard techniques
did not provide a significant improvement over escape streams. Escape
streams, however, remain useful for parsing user input where
performance is less of an issue than configurability and code
compactness.

4.2   The Database

   Each database object is assigned a unique id string similar to the
one described above.  However, rather than being unique with respect
to the members of a database, it is required to be unique among the
loaded databases. A database also has a bit indicating if it has been
modified since it was last written or loaded, a filename for saving
the database, an optional filename for a help file, a version number
indicating the version of the database, a string indicating the
version of the database program that last wrote the database, a
hash-table and a list for features, and a separate hash-table and list
for descriptions. The lists provide an ordering on the database
elements that is maintained when the database is read from or written
to disk. The ordering of features is particularly important since it
controls the default order in which features are presented to the
user. Ideally the hash-table and list for a particular type of
database element would be combined into a threaded hash-table which
combines the functionality of the two data-types.

4.3   Database Elements

   As discussed above in Section 4.1, all database elements have a
unique id string and a list of the categories to which they
belong. There are two classes of database elements, `descriptions' and
`features'. Features are further refined based on the type of
information needed to represent values for a feature. Intuitively,
features provide ranges of values for attributes that are used to
describe the members of the database and descriptions are sets
                                    43

of values for these features that describe a particular member. For
example, the color of a mushroom's cap is a feature that can take on a
wide variety of values. The collection of feature values that
correspond to a particular species like the Chanterelle is a
description.

4.3.1  Descriptions

   Descriptions are fairly simple but very important objects. In
addition to the standard id string and category list, descriptions
have a hash-table of `feature values'. Feature values represent
particular settings of a feature.  The details of a feature value
depend on the feature they correspond to, but in general they
represent a range of possible values. If there is no feature value in
a description corresponding to a particular feature that feature is
considered to be either unknown or inactive. See the discussion of
feature activators below for an explanation of what it means for a
feature to be inactive. Note that a value being unknown is very
different from the value `Absent' which might be a value for some
feature like the presence of the stipe.
   A more compact bit-vector representation was used in an early
prototype of the system to represent descriptions.  It is possible
that future versions will return to this type of representation should
the simpler system now in use prove to be too inefficient.

4.3.2  Features

   Features are attributes that vary across the descriptions in the
database. In addition to the standard id string and category list, all
features have a name, a run-time type indicator, and an optional
activator. A feature's name is a term, as discussed above, which
provides a short (usually single word), meaningful description of the
feature.  The run-time type indicator mostly provides assistance for
debugging and an easy way to do error detection. Finally, the
activator provides a way to represent dependencies between features
and is discussed in more detail below.  Most of the important
functionality of a feature is dependent on the type of the feature.
In the current system there are eight distinct types of features: term
features, name features, description features, float range features,
                                    44

time period features, file features, alias features, and computed
features. Each of these are discussed below.

Feature Activators

   Some features only make sense when some other feature has a
particular value or range of values. For example, the color of the
partial veil (or ring) depends on the partial veil being present. This
type of dependency is implemented through a feature's activator. An
activator is a feature value for another feature. A feature is
considered `active' for a given description if the activator's value
and the description's value for the same feature `intersect'.  The
exact meaning of intersection is dependent on the type of the feature
in question. If a feature has no activator it is automatically
considered active.
   The activation of a feature can be used by the program to determine
which features the user is currently likely to specify for a given
description.  In addition, if the user provides a value for an
inactive feature in a new or partial description, then the program can
automatically add the activating values to the current
description. Obviously, if the activator's feature also has an
activator then it too is made active and so on.
   Activators have turned out to be very valuable organizing tool for
the system. In the fungi database created for this project only 33 of
the 173 fungi related features did not have an activator. Furthermore,
only 9 of those features are actually observable features that could
be used for identification. The remaining features are all reference
features related to either the naming of database members or
unobservable features such as edibility or which books contain further
information about the species.

Term Features

   By far the most important feature type for the system is the term
feature. The range of values for this type of feature consists of the
set of subsets of an unordered set of terms that are considered
possible values for that feature. Examples from the fungi database
include edibility, odor, taste, and a large set of features with
specialized scientific terms such as the
                                    45

degree or type of scales, hairs, particles, partial veil, universal
veil etc. For the purpose of activation, `intersection' is just set
intersection between the two subsets.

Name Features

   Name features are similar to term features in that the values for
these features are terms.  However, there is not a fixed set of terms
from which the names are drawn. Instead each description has its own
term for storing the value. This allows descriptions to share some
names but not all names. They can even have the same canonical names,
but different sets of alternative names.  In addition, name features
can be significantly more efficient than term features since set
membership does not have to be tested.

Description Features

   Description features provide a way to link descriptions
together. By default the range of a description feature is the entire
set of subsets of descriptions in the feature's database.  This range
can be constrained either by number or by category or both. For
example the mushroom database has a Parent feature which indicates the
currently accepted parent of a description in the taxonomic
hierarchy. Values for this feature are constrained to be a single
description drawn from the set of features in the category `fungi'.

Float Range Feature

   Float range features provide a way to represent measurements in the
database, e.g.  pileus diameter or spore length. An optional units
string can be stored in the feature, as well as optional upper and
lower bounds for the value ranges.  Values are always ranges though
the upper and lower values can be equal. Intersection for float ranges
is the obvious range intersection.
                                    46

Time Period Features

   Time period features are used to represent calendar times. Calendar
times are represented by a long integer that represents the year
relative to 0AD, a floating point value indicating the time since the
beginning of the year, and a floating point value for the error in the
value in terms of days. The desired range of representable values is
from 600million years B.C. (evolution of eukaryotic life on earth)
through the foreseeable future. Fungi have been found in the fossil
record (embedded in amber) so being able to represent such values in a
consistent way is not unreasonable. To represent recent times,
accuracy to the date is necessary and to the hour is reasonable. The
error term should be able to handle most of the desired range. The
current representation meets all of these requirements and is
relatively easy to interpret and use for computations. For convenience
the input routines for periods provide the option of giving a date
which includes a month, but these are converted to days since January
1st to avoid the computational difficulties of leap years when
computing differences between dates.
   Currently this feature type is only used to record the last
modification time of a description. One expected future use is for
creating records of when a species was collected.  The expectation is
that this information can ultimately be used to explore the
distribution of species through time either to track growing season or
to track changes in species distribution over time as is currently
being done in the Netherlands [Arn92].

File Features

   File features provide a way to associate external files with
descriptions.  The feature indicates the type of file and a set of
file paths that should be searched to find the desired files. The
types include log files that the user can view or add to from within
the program and description files that the user can only view. The
values for a file feature in a description are a list of file prefixes
that are combined with the path and a file type specific suffix to
create the list of strings that are used when trying to find an
appropriate file. The current system uses log files for storing
arbitrary comments on any description. In order to ensure
                                    47

that there is no contention between the file names, by default the
file prefix is set to a description's id string. If multiple databases
are loaded into the system, it is assumed that they are contained in
different directories.

Alias Features

   Alias features provide a way for multiple features to share the
same range. The principle advantages of alias features are a
potentially significant savings in memory and file space, and the
ability to manipulate the possible range of values for an entire set
of features.  The most prominent example in the fungi database is
color. To represent color there is a single term feature called
Color. In addition to this feature there are over 40 alias features
that correspond to the colors of various parts of a mushroom from the
cap, to the spores, to the staining reaction of the volva. By using
the exact same range of colors for all the color features, the user
can quickly become familiar with the possible range of values for any
of these features.  Furthermore, if the database designer decides to
add a new color option at some point in the future, this can be done
quickly, and consistently throughout the database. The other prominent
example of alias features are features that indicate the presence or
absence of some common structure of mushrooms such as the stipe or
universal veil.

Computed Features

   Computed features provide a way to derive values for a feature
based on the values for some other set of features. For a given
description a computed value is either present or not. Each computed
feature has a computation associated with it. Computations can be
either string expressions that describe the value for the feature, or
an if-then-else style conditional that tests the value of some other
feature to decide between the two smaller computations. String
expressions consist of a possibly empty list of components that are
either literal strings or references to the value of some other
feature. The elements of the list are simply concatenated together to
derive the actual value for the feature.
                                    48

   The chief use for computed features in the fungi database is for
deriving the names associated with a particular description. For
example, the Latin name for a description is a computed feature that
uses the Name, Level and Parent features of a description to derive
its value. The Name feature is a name feature which gives the print
name for a description.  The Level feature is a term feature with the
range: Family, genus, species, and variety.  The Parent feature is a
description feature that indicates the description that is considered
to be the taxonomic parent of the current description.
   A simple case is when the Level feature has the value genus. In
this case the Latin name is the literal string `Genus ' followed by
the value of the Name feature. A more complex example is when the
Level feature is species. Under this condition the value for Latin
name is the value of the Name feature of the description that is the
Parent of the current description followed by a space and the Name of
the current description. This computation results in the latin
binomial.

4.4   The File Format

   The purpose of the following file format is to describe the set of
database objects in a human readable form that can be loaded into an
interface program. The ASCII character set is used.  In addition to
being human readable some consideration is given to loading efficiency
and error detection. An example of the file format is provided in
Appendix D.
   The format consists of a sequence of object representations.  The
objects used in the current format include the database descriptor,
features, descriptions and comments. The format is designed to be
easily extended to include new object types in the future. However, it
was generally found to be better to extend the existing objects than
to add completely new independent objects.  For example, an early
prototype included separate objects for representing species, people
and books.  These have since all been reduced to the single concept of
a description with the difference maintained by the categories of the
respective descriptions.
                                    49

   In this file format object descriptions are separated by runs of
number signs (#). The first character after the number signs indicates
the type of the object. The object indicated by one number sign
followed by a character is distinct from two number signs followed by
the same character. All the objects used in the current format only
use a single number sign, but the multiple number sign convention is
provided for possible future expansion.
   Throughout a file the backslash character (\) allows special
characters to be escaped. So the string \# indicates that a number
sign should be considered to be a part of the current object rather
than starting a new object. A backslash in front of a non-special
character is ignored. E.g. \b\\a would be the string b\a. To improve
human readability carriage returns and tabs are ignored unless
preceded by a backslash. Spaces are considered to be part of a
string. In general, everything in the file format is case insensitive.

4.4.1  Database Objects

   There are four object types defined in the current version: a
single database descriptor, features, descriptions, and comments. The
id strings for the first three of these always follow the type
character. The id string is terminated by a semicolon. Throughout the
database, term objects are represented by the canonical string
followed by a semicolon separated list of alternates that are all
enclosed in square braces, e.g. Red[Crimson;Scarlet]. If there are no
alternates, the square braces are optional.

Database Descriptor

   The database descriptor is indicated by the character I. It
consists of the unique id string for the database, an optional version
number for the database, and a program version string that indicates
the version of the program that wrote the database.

Comments

   Comments are indicated by the character C. Comments are ignored by
the program reading the file.  In the current version of Taxy comments
are not preserved when a
                                    50

database is saved. Note that these comments are strictly for
annotating a database file or for controlling by hand which database
elements are loaded. They are not related to comments stored by the
database about a particular description.  Such description comments
are considered to be part of the database and are preserved, typically
through external file features.

Features

   Features are indicated by the character F. The feature's id string
is followed by its name term, a feature type string, a list of
categories and a type specific range of possible values for that
feature. Finally, there can be an optional activator.
   The following type strings are supported. As with the commands
discussed in Chapter 3, the type strings may be abbreviated by their
capitalization.  In the list below, the type strings are followed by a
description of any type specific ranges that the feature may use.  All
ranges are required to be optional.
   Type designators:
   `Term' indicates a term feature. The range is a semicolon separated
set of terms.
   `Name' indicates a name feature. No range.
   `Description' indicates a description feature. If no range is given
it is assumed that any number of descriptions from any categories are
legitimate values. The range can be restricted with an optional
maximum list size or by a list of acceptable categories.  The list of
categories is always preceded by a colon and follows any maximum list
size value.
   `TimePeriod' indicates a period feature. No range.
   `FloatRange' indicates a double precision (IEEE 8 byte) floating
point range. The range is specified by two standard notation floating
point values separated by a semicolon. The numbers may be followed by
an optional units string. Ultimately it would be nice to provide
automatic conversion between different units, but in the current
system it the units string is just used when printing out values.
                                 51

The relative order of the two numbers is arbitrary.  Note that comma
(,) is not a special character in case there is demand to allow use of
comma in addition to period as the decimal point character. Programs
that read and write this format should be able to read and are allowed
to write values of higher precision with the understanding that this
higher precision need not be preserved.
   `ExternalFile' indicates an external reference to other files that
use different formats.  The range consists of a semicolon separated
list of type strings, followed by a semicolon separated list of
directory paths. The two lists are separated by a colon. The currently
recognized types as described earlier in the chapter (4.3.2) are log
and description.  It is up to the implementation to decide which and
how the various strings should be used to find the referenced files.
   `Alias' indicates an alias feature. The range is the id string of
the feature it is equivalent too.
   `Compute' indicates a computed feature. The range for these features
is a description of the expression for the computation. String
expressions consist of sequences of strings and references to other
feature values. The references are indicated by angle brackets.
Inside the angle brackets, references can have up to three parts. It
is required to have an id string for a feature. This may be either
preceded by another feature id followed by the character `|'. or
followed by a colon and either of the characters `u' or `l'. The
additional feature id must be a description feature. The value of this
feature indicates the description (or descriptions) in which the value
for the given feature should be found rather than the description
whose value is being computed. The characters that follow impose a
particular capitalization on the feature value. `u' indicates that the
first character should be in uppercase, and `l' indicates it should be
lowercase.  The other type of expression is the conditional. A
conditional begins with a square brace which is followed by a
parenthesized condition which in turn is followed by a one or two
semicolon separated expressions.  Conditions consist of a single
character operator and up to two parenthesized arguments.
                                    52

   The current system supports only two operators: the ? operator,
which tests if a feature has a value in the current description, and
the ~ operator, which tests if the values of its arguments
intersect. At least one of the arguments of a condition must be an
angle bracketed feature name. The second argument (if needed) can be a
either another angle bracketed feature name, or a feature value string
that is interpreted by the feature of the other argument. Here is a
concrete example:

   [(~()(Genus));;]

   To evaluate the value for this feature for some description the
value of the Level feature is examined. If the value matches the
literal string `Genus' then the value of the result is the
description's name with the first character capitalized. Otherwise,
the value is the Genus feature's value in the description indicated by
the value of the Parent feature in the original description. If the
Level is not `Genus' and the Genus feature is not known in the Parent
description, then it would be unknown in this description.
   In the representation of a feature, after the type indicator and
optional range there can be an optional colon (:) and an activator.
An activator consists of the id string for the activator feature
followed by a type specific feature value as described below for
descriptions.  Note that there is no constraint that a feature be
defined before it is used as an activator.  However, the feature must
be defined before the activator is actually used. The validity of all
the activators is checked immediately after the file is loaded.

Descriptions

   Descriptions are indicated by the character D. After the id string,
there is a sequence of feature id strings each followed by a
parenthesized type specific value. The parentheses act to separate the
feature/value pairs as well as separating the feature id strings from
the value strings. The syntax for the various feature types are as
follows.
   `Term' - A semicolon separated list of strings. Typically, the
strings are the canonical value for the term, but alternates are also
permitted.
                                 53

   `Name' - A term.
   `Description' - A semicolon separated list of description id strings.
   `TimePeriod' - Two sequences of numbers.  The two sequences are
separated by a semicolon.  All the numbers can be floating point
values.  The first sequence is required to have an initial number that
is interpreted as the year relative to 0AD.. The other numbers in the
sequence are optional and are each preceded by an alphabetic
character: M for month, D for day and H for hour.  The second sequence
has the same form as the first and indicates the error in the date.
In the error sequence the year is optional and is assumed to be 0 if
not specified. In addition, it is not valid for a number of months to
be given when specifying the error since it would not necessarily be
symmetric as months vary in length.
   `FloatRange' - Two standard notation floating point values separated
by a semicolon.
   `ExternalFile' - A list of file type strings. The value is a
semicolon separated list of file prefixes. These are combined with the
paths and file types given in the feature specification to find the
actual files.
   `Alias' - Same as the value for the feature it is an alias of.
   `Compute' - The range is just the empty string. However, the feature
must be in the description list in order for the feature be computed.
If a computed feature is not given in a description, then the value of
that feature for that description is considered to be unknown.
                                    54

5.  The Fungi Database

   The set of features in the fungi database and their values are
given in Appendix C.  The books that were most influential in creating
this set were David Largent's How to Identify Mushrooms to Genus I:
Macroscopic Features [Lar86], David Arora's Mushrooms Demystified,
second edition [Aro86], and Smith, Smith and Weber's How to
Know... books [SSW79], [SSW81]. Largent's book is particularly
recommended to anyone wishing to make use of the database.
   One general organizing principle used in creating this list is a
structure. A structure is a physical attribute that can be present or
not, such as an annulus (or ring on the stem) or ornamentation on the
spore. Another general principle was to be as scientifically precise
as possible. The expectation is that future interfaces will be able to
simplify the information for the novice user.
   The initial emphasis has been on macroscopic structures and other
features that can be easily observed, such as odor or
habitat. Microscopic features have yet to be attempted, but are
planned. In addition, features for ascomycetes (Morels, cup fungi
etc.) and gasteroid species (puffballs, earthstars, stinkhorns,
truffles, etc.) are incomplete.
   The principle discussion of the database will be centered around
the different categories used in the database. This is followed by a
discussion of some important types of information that are
incompletely handled or not yet included in the fungi database.

5.1   Categories

   Categories turned out to be very useful and powerful tools for
organizing the fungi database.  As discussed in Chapter 4, categories
are a simple way of grouping features and descriptions together. Each
database element has a list of categories that it belongs to.
Currently there is no way of enforcing any particular relationships
(e.g.  inclusion) on categories. It is up to the commands that add and
modify members of a database to maintain any such relationships. It is
certainly possible that categories may be too powerful
                                    55

in the sense that they will easily lead to a significant amount of
confusion for people creating new databases. As a result it may be
desirable in the future to implement a more complex, but less easily
misused system that allows various types of relationships to be made
explicit.

5.1.1  Descriptions

   The descriptions are split into two categories. The fungi and the
people. Members of the fungi category are further divided into groups
based on their taxonomic level.  The current database includes the
categories variety, species, genus and family. The intended members of
these various groups are obvious from the category names.

5.1.2  Features

   The use of categories for features is significantly more
complex. This is partly by design.  An effort was made to keep the
descriptions as simple as possible, since ultimately there will be
many more descriptions than features. For a feature to be considered
applicable to a description, they must share a category. This means
that features for describing attributes of fungi are all in the
category fungi. Similarly, for a feature to apply to all descriptions,
such as the Last Editor feature, it must be in both the fungi and
people categories.

Base Features

   The simplest category of features in the fungi database are the
base features. These features are provided solely to support alias
features and do not belong to any other categories.  The two base
categories in the fungi database are Color and Status.  The Status
feature has the two possible values Present and Absent and is used as
the basis for all structure features such as Pileus, Stipe, Universal
Veil, etc.

Id Features

   The remaining features are all members of one of two categories, id
or reference. The id category is the larger of the two and is for
features that are observable such as Overall
                                    56

shape[Stipitate agaric]:Pileus disc color or Odor.  It turns out that
all features with activators are id features. The members of the id
category are further broken down into the categories macroscopic and
olfactory.  These categories are not used in any special way at the
moment, but the intention is for the user to be able to specifically
select or exclude categories of feature using these names. Other
categories that have been considered for grouping into categories are
color features and habit features (i.e. Habitat, Frequency and
Grouping).

Reference Features

   The other major category for features is the reference
category. This category is for features that are not directly
observable and hence typically require consulting some sort of
reference.  This category includes features such as Name, Latin Name,
Edibility, or References (which lists books that contain more
information about a particular description). An important category
within the reference features are the automatic features.  These are
features whose values are automatically or indirectly derived by the
high-level command used to edit the descriptions.  These include
feature such as the Last Editor and Last Changed features as well as
most of the feature related to the naming of fungi species. The
automatic category is used by the high-level editing commands to avoid
them when searching for features that the user may be interested in
setting themselves. Finally, the file category is shared by all
external file features, and exists principally so the editing commands
can avoid them.

5.1.3  A Mushroom by Any Other Name...

   The species naming system is somewhat complex and deserves
discussion.  The features related to naming are Common Name, Name,
Level, Parent, Family, Genus, species, variety and Latin Name. The
Common Name feature is the simplest of these since it is just a simple
name feature that gives all the common names for a species. The Name
feature, on the other hand, is a name feature that provides the basis
for the taxonomic naming system.
                                    57

The Name of a description is the piece of the taxonomic tree which the
description pertains to. For example, in the description of the Golden
Chanterelle (Cantharellus cibarius), the Name is cibarius. Similarly,
for the description of the genus Cantharellus the Name is
Cantharellus.  In addition, the term feature Level indicates the level
in the taxonomic tree for a given description. In the two examples
just given, species and Genus respectively. Finally, the taxonomic
tree is hooked together with the description feature Parent.  The
combination of the Name, Level and Parent features provide sufficient
information for the remaining computed features, Family, Genus,
species, variety and Latin Name, to derive their respective values.
   The reason for this complex design is to ultimately permit the user
to easily edit the taxonomic tree. For example, if a researcher
decides to change the genus of a description, a simple change of the
value of the Parent feature results in all the correct changes to the
Family, Genus and Latin Name features. Actually, as complex as the
current scheme is, it is quite possible that it is not complex enough
to capture all the desirable properties for the naming system.  In
particular, it is not clear whether there will be an adequate way in
which to search for descriptions based on names that have been
changed. Since the Name feature is a name feature, simple name changes
can be handled by making the now defunct name an alternate in the
value term. The current search feature handles this type of change. In
addition, when new nodes are created within the tree, for example when
a genus is split into two separate genera, the genus with the new name
can retain the old name as an alternate. The problem arises when a
species is transfered from one existing genus to another already
existing genus. One possible solution to this problem would be to
provide an additional description feature that keeps a list of
historical membership. This has the nice property of providing a place
for groupings that have been, or may soon be completely disbanded. A
good example of this are the hypogeous (underground) gasteroid fungi.
These have traditionally been placed in a separate taxonomic class
(class is three nodes above genus in the standard taxonomic
tree). However, most of the genera within this class are now believed
to be most closely related to a variety of genera spread throughout
                                    58

the other classes of fungi. By linking these genera through a common
historical group, this historical, morphological grouping could be
maintained.
   Another problem related to species naming relates to the ownership
of a name.  The current system allows for the possibility of multiple
descriptions with the same value for the Name feature. Note that these
descriptions would still have distinct id strings. However, given
multiple descriptions for the same name, it would be nice to have a
particular description specially designated as the official
description for that name.  The obvious candidate for this designation
would be a description of the type specimen for the taxon in
question. Furthermore one could imagine actually adding a Type Species
description feature that indicated for non-type descriptions which
description it is considered to match.  Under such a system, non-type
descriptions would have no value for the Name feature and instead
would adopt the Name of its Type Species. This sort of structure could
be extended even further to support various types of intra-taxon
groupings such as tribes, sections or stirps.

5.1.4  Acknowledging Users and Authors

   Another interesting aspect of the fungi database is the handling of
the relationships between the information in the system and the people
who create that information. There are two important groups of people
related to the information in the fungi database, the taxonomists who
create the information by describing and naming the various taxa, and
the people who enter data into the system. The association of
taxonomists with their work has a long and significant tradition. An
internationally accepted standard notation for describing the nature
of these associations is supported in the fungi database. Fortunately,
this system has recently been greatly simplified [Kor94]. As a result
only three features are needed for this information: Author,
Reclassifier, and the computed feature Author Citation. See the
definition of the Author Citation feature in Appendix C. In addition,
to these fields the Workers feature maintains a list of all people who
have contributed significantly to the understanding of a particular
taxon. Between the addition of the Workers feature and the
                                    59

References feature mentioned earlier, all of the information that used
to be carried in the older more complex author citations can be
retained.
   There are a variety of fields within the fungi database to keep
track of the people who have entered or edited the information it
contains. The Last Change, Last Editor features keep track of when the
last change to this database element was made and who made that
change.  The Editors feature maintains a list of all people who have
ever modified this element. The Approved feature is a list of people
who have reviewed the current database element and believe it to be
correct. Finally, the Element Status feature indicates if the element
comes from an official Taxy database distribution, if the element is
based on an official source but has been modified, or if the element
is completely new and has not been officially approved.
   In addition, two new features, First Editor and Entry Data may soon
be added. One of the purposes of this information would be to confim
the identity of two descriptions between divergent databases.  One
argument against at least the First Editor field is the not
necessarily appropriate emphasis this places on this person with
relation to the database. It is very possible that many of the first
entries in the database will be little more than just a species
name. On the other hand, it would be nice to acknowledge the person
who first provided a significant description of a particular taxon.
   All of the features related to database editing are automatic
features that are maintained by the fungi database's high-level
commands.  The goal of all the features is to provide a mechanism both
for recognizing a person's work and to provide a traceable history for
the information.  Because the system is completely open there is no
way to actually prevent someone with malicious intent from modifying
the database in a way that could lead to erroneous identifications and
to leave no trace of their changes.  As a result it is important that
the editing history not be construed to imply any sort of
responsibility for an identification. Rather it should provide a way
for users to contact the editors of a particular element in order to
verify an identification or to track down the reasons for any errors
found in the database.
                                    60

5.2   Future Possibilities

   This section discusses possible future changes to the database that
could be implemented using the existing program or with a few
relatively straight forward extensions. Two more significant
extensions are discussed in Chapter 6.
   While developing the database, it was not always clear what should
be a feature and what should be a value for a more general feature. A
good example is the surface texture of a mushroom cap. Surface texture
includes attributes such as the sliminess (viscidity) of the cap, for
which most books recognize four or five distinct categories, or the
level of shagginess, for which there are both different degrees and
different types [Lar86]. On the other hand, there are a number of
features that are only loosely related to one another, such as whether
the cap is wrinkled (rugose) or has a small bump in the center
(umbonate).  The basic reason for creating features is to provide a
reasonable human interface, grouping together things that people want
or expect to find in the same place. Some of these organizational
decisions will have to be made after the system is in use. This
highlights the importance of providing the ability to rearrange
features and values in a reasonably general way after the data have
begun to be entered into the system. An exact mechanism for this has
not yet been explored.
   Another interesting idea that arose during the development of the
fungi feature list is the possibility of allowing a feature to
`inherit' a default value from another feature. Examples of such
features are those related to different sections of the stipe (a
mushroom's stem).  For many mushrooms the stipe is nearly
homogeneous. However, a significant number of species have distinct
differences in the color and/or texture of the upper and lower portion
of the stipe. Representing this in an intuitive way is
difficult. Clearly you want to have a set of general stipe
characteristics. At the same time there is no a priori reason to
assume that the general characteristics always apply to a particular
portion of the stipe such that the differences could be made into a
substructure.
   One solution would be to provide a set of optional substructures.
By default these substructures would simply inherit the value from the
parent structure, but this value
                                    61

could be overridden providing the desired effect. Unfortunately, this
means that a particular configuration (say red above and yellow below)
could be represented in multiple ways (yellow with red above or red
with yellow below). The solution settled on in the current feature set
is to have a Texture variation feature that when present activates
both a Texture below and a separate Texture above feature.
   Finally, representation of habitat information is quite
challenging. Different mushroom species not only live in different
habitats, but the level of affinity to a particular habitat varies.
For example, some species are only found growing in close association
with a particular species of tree, others can turn up just about
anywhere. The structure of this problem leads most naturally to a
hierarchical, or partial order representation. The details of this
representation and a mechanism for comparing it to other features for
purposes of the feature sorting heuristics have yet to be worked
out. The current system just uses a standard term feature for habitat
information and relies on the person entering the descriptions to
include any habitats which are implied by the habitats they choose.
                                    62

6.  Conclusions and Future Directions

   The program, Taxy, demonstrates the fundamental thesis of this
work: that a computerized synoptic-style identification system can be
built that significantly improves upon the textual systems available
for identification.  By providing significant added functionality, the
improvements extend well beyond simply handling the clerical overhead
associated with textual synoptic systems.  Areas of added
functionality include dynamic feature sorting and analysis, user
extensibility, efficient updating and better handling of diverse types
of information.
   Taxy and the fungi database have already been used by a number of
people. To date, the majority of serious users are members of the
Fungus Federation of Santa Cruz. For the most part the users do not
consider themselves to be experts on fungi. In fact several of the
users have little or no previous experience with identifying
fungi. Taxy has proved to be not only usable by these people, but also
to be a powerful learning tool.
   However, a huge amount of work remains ahead in order fully realize
the potential of the system. The most obvious component that needs
work is the fungi database. This includes both simple data entry and
resolution of the more straight-forward issues discussed at the end of
Section 5.2.  There are also two larger database issues that should be
addressed in order to handle all important types of information for
fungi. These issues are discussed below. In addition to the
possibilities discussed in Section 3.5, there are several major pieces
of added functionality planned for the system. These are intended both
to increase Taxy's value for handling the fungi database and to make
it fully general so it can be effectively used for a wide range of
possible subjects.

6.1   Fungi Database Issues

   One important type of information that the system does not yet
handle is the geographical distribution of a species. It would be
appropriate to have some sort of generalization mechanism for this
information. For example a species should not be eliminated if it has
                                    63

been recorded only in Santa Cruz, California and San Francisco, and
the organism at hand was found in Half Moon Bay (a town between the
two). On the other hand, if a species had previously been found only
in Japan and Thailand and the collection was made in New York, it can
probably be safely eliminated. We could take the stance that any
possible species should be considered in which case geographical
information is useless. On the other hand, regional guidebooks are
common and useful especially to novice collectors. Furthermore, one
desired application of the database is to create just such regional
sub-databases so a person new to an area can know what to expect to
find. In addition, many features such as size or color commonly vary
with geographical location. However, representing such correlational
information between features would be extremely difficult. Probably
the easiest way this problem would be to have separate descriptions
for the different geographical forms.
   One idea for representing location is to use a list of latitude,
longitude, and error tuples.  Each tuple would represent a circle on
the globe.  In addition, there would be a simple mechanism for
relating place names to such tuple lists.  Using this information, the
user could then say, for example, that they would like to work with
the set of species known to occur within 100 miles of Santa Cruz. The
system would then select only those descriptions whose set of circles
intersect this circle. A list of circles, as opposed to a single
circle, is necessary to handle species with pan-tropical or other
complex distributions. Using a list of circles permits any
distribution to be represented in a reasonably simple way.
   Another valuable type of information that could be explicitly
represented in the database is a list of subjectively similar
species. The only way to represent this information in the current
database is in the free form external files associated with a
description. The proposed `weaken' command discussed in Chapter 3
should provide an automatic way to find species that are similar to
those in the current selection. However, it remains to be seen if such
a tool successfully captures the subjective experience of similarity
that often confounds novice collectors.  If such automatic techniques
fail in this regard, then explicit addition of this information could
be of great educational value.
                                    64

6.2   Program Issues

   Probably the most obvious area for major improvement for Taxy is
the user interface.  The current command oriented interface is
powerful and extremely portable and will be maintained in order to
allow for scripting. However, an easier to use, graphical user
interface should be added. Ultimately, the systom should have a fully
windowed system capable of supporting many different types of
interactions. To begin with it would be very useful to be able to
bring up images and human-readable descriptions of species in their
own windows.  Similarly, help windows with drawings or digitized
photographs of various terms as well as their textual definitions
would make it much easier for new users to learn the terms used in the
database.
   Separate windows for reviewing and editing different descriptions
and features would be extremely useful. In addition, a graphical,
direct manipulation (click and drag) style editor for changing the
relationships between descriptions, e.g.  their position in the
taxonomic tree, or for changing the activation relationships between
features, would remove much of the complexity and tedium of the
current system.  Separate windows for displaying and changing
selections would allow the user to carry out multiple identifications
at the same time in order to explore different approaches for the same
specimen, or to maintain lists of species for a variety of
purposes. Along with such species lists one could imagine adding
facilities to maintain collection records, accession logs, and field
notes. Facilities for analyzing such records or other information in
the system could provide useful information such as distributions over
time or data for cluster analysis.
   In addition to improving the user interface, the system could be
extended to use databases available on servers connected through a
network.  This has the potential of allowing there to be huge
world-wide databases of species maintained by professional taxonomists
which could be accessed as needed by interested people anywhere in the
world.
   Another idea that should be explored is automatic textual key
generation and possibly automatic natural language descriptions. These
facilities would provide a useful connection to the standard textual
methods for identification, as well as providing a more easily
                                    65

understood summary of a description. A related idea is to provide a
method for rapidly translating existing textual synoptic keys into
Taxy descriptions.
   Finally, a significant amount of work remains to make the system
truly generalizable to other subjects. The most important component of
this is the addition of a scripting language for creating database
specific (`high-level') commands. In addition, the only mechanism for
creating and modifying a database's feature set are the low-level
commands. A more user friendly interface for these tasks, similar to
those for the high-level commands, would greatly reduce the difficulty
involved in creating a completely new database, and allow the system
to reach its full potential.
                                    66

Appendix A. Feature Sorting Heuristics

   For the purposes of this discussion we define the features, F_i
with i between 1 and N, to be sets of symbols, f_ij with j between 1
and n_i. A feature value is a single f_ij. A feature setting is a
subset of a particular F_i. A description, d_k with k bewteen 1 and M,
is a tuple of feature settings. Finally, a feature distribution,
Phi(Fi_, D), with members phi(f_ij, D), is a tuple of frequencies for
the feature values of the feature F_i in the set of descriptions D.
   The technique actually in use in the current version 1.0 of Taxy is
called the Expected size using Feature Values (EFV) heuristic. The
generalization to using multiple values is the Expected size using
Feature Settings (EFS) heuristic. Both methods attempt to compute the
expected size of the set of descriptions after the user provides a
value for the feature.  The difference between the two methods is that
the EFV heuristic only considers the sets that could result from the
user providing a single value for the feature. The EFS heuristic, on
the other hand, considers all possible inputs the user might provide.
   For the EFV heuristic, the expected size, E(F_i, D) for feature i
and set of descriptions D is a weighted sum of the phi(f_ij, D). The
heuristic assumes that all fungi are equally likely,
so the weights are the phi(f_ij, D) normalized by sum across j of the
phi(f_ij, D) so

              sum across j of phi(f_ij, D) squared
   E(F_i, D) = ____________________________________
              sum across k of phi(f_ij, D)

   The EFS heuristic is similar except that summations are across the
set of possible feature settings given the set D. This is simply the
union, U_i(D), of all the feature settings for the given feature in the
set D and all feature settings that are subsets of actual feature
settings.  Note that we do not necessarily include all subsets of the
union of the feature values of all the feature settings. For each
member of U_i(D) we can find the size of the subset of D that would
result from that feature setting. A weighted average of these values
gives an expected number of descriptions that will remain given a
setting for that feature. The EFS heuristic reduces to the EFV
heuristic if the weights for all subsets with more than one member is
set to zero and the weights for the remaining subsets are as discussed
above.
   For the EFS method the simple weight scheme of using the size of
the set for each
                                    67

possible feature setting over the sum of the sizes makes an additional
even more implausible assumption. In particular, it assumes that the
user will be just as likely to enter multiple values for a feature as
they are to enter only the single most prominent value.  This
assumption is somewhat implausible since it relies on the users
understanding and complete, correct use of the semantics of the
different ways of providing multiple values for a feature.  The EFV
method, on the other hand, is easily understood by telling the user
that it assumes they will only be providing a single value. The other
big advantage of the EFV method over the EFS method is a significant
reduction in computation time. Since the EFS method potentially relies
on the entire set of subsets of possible values, in the worst case,
the time for computing the EFS expectation is exponential in the
number of values for the feature.  The EFV method, on the other hand,
will always be a constant multiple of this value. In most cases the
number of values for a feature is small so the exponential behavior of
the EFS method is unlikely to have dramatic effects on the computation
time, but it will always be the case that the EFV method will be
faster.
   Finally, prior to deciding which heuristics to use, the EFV and EFS
methods were compared on a small but exhaustive set of test
distributions.  In particular, all possible feature distributions for
a feature with three possible values across three descriptions were
sorted using both methods. When this test was made several other
feature sorting methods were also analyzed. Of all of the methods
tested the EFS method made the most distinctions between the different
possible patterns. The EFV and EFS methods were chosen as the final
contenders since they were the most theoretically sound, they could be
explained relatively simply, and none of the other methods were found
to have any significant advantage, with the possible exception of
computational cost, over these two methods. The EFV method was chosen
for the actual implementation primarily because of explanatory
simplicity and its lower computational cost.
   For three descriptions drawn from a set of three possible feature
values, there are 22 distinct types of feature settings. The types are
represented by triples of sets drawn from the set A, B, and C. Thus
(A,AB,ABC) represents three descriptions one of which has a single
                                    68

value for this feature, the second has two values for this feature one
of which it shares with the first feature and the last description has
all three values.  (B,AB,ABC) is considered equivalent since A and B
have just exchanged rolls, but (A,BC,ABC) is considered to be
different. The EFS heuristic identified 19 distinct groups of feature
setting patterns. The EFV heuristics combines some of these groups to
form 12 distinct groups. Ranks are from the best feature for
distinguishing between the descriptions to the worst.

EFS
                                  (A,B,C)
                             (A,A,BC) (A,B,AC)
                                 (A,B,ABC)
                                 (A,A,ABC)
                                 (A,AB,BC)
                                (AB,AC,BC)
                                  (A,A,B)
                                 (A,AB,AC)
                                (A,BC,ABC)
                                (A,AB,ABC)
                                 (A,B,AB)
                                 (A,BC,BC)
                                 (A,A,AB)
                                (AB,AB,AC)
                                (AB,AC,ABC)
                                (AB,AB,ABC)
                                (A,ABC,ABC)
                               (AB,ABC,ABC)
                      (A,A,A) (AB,AB,AB) (ABC,ABC,ABC)

EFV
                                  (A,B,C)
                             (A,A,BC) (A,B,AC)
                                  (A,A,B)
                       (A,BC,BC) (A,AB,BC) (A,B,ABC)
                            (A,AB,AC) (A,A,ABC)
                            (A,B,AB) (AB,AC,BC)
                      (AB,AB,AC) (A,AB,ABC) (A,BC,ABC)
                                 (A,A,AB)
                          (A,ABC,ABC) (AB,AC,ABC)
                                (AB,AB,ABC)
                                    69

                                (AB,ABC,ABC)
                       (A,A,A) (AB,AB,AB) (ABC,ABC,ABC)
                                    70

Appendix B. Low Level Commands

   The following is a list of all the low-level interface commands
along with brief descriptions of each.

Description commands:
   DescriptionList - List all descriptions in current db
   DescriptionDescribe id* - Get info on descriptions in current db
   DescriptionAdd id - Add description to current db
   DescriptionRemove - Remove description from current db
   DescriptionCompare [id [id]] - Compares two descriptions
   DescriptionCategoriesClear - Clear all categories from a description
   DescriptionCategoryRemove category* - Remove description from a category
   DescriptionCategoryAdd category* - Add description to a category
   DescriptionFeatureReduce range - Reduce description range for feature
   DescriptionFeatureExtend range - Extend description range for feature
   DescriptionFeatureClear - Clear description range for feature
   DescriptionFeatureSet range - Set description range for feature
   DescriptionIdSet newid - Change description id string

Feature commands:
   FeatureList - List all features in current db
   FeatureDescribe id* - Get info on features in current db
   FeatureAdd id type - Add feature to current db
   FeatureRemove - Remove feature from current db
   FeatureFind string* - Search all the features for the given strings
   FeatureCompare [id [id]] - Compares two features
   FeatureActivatorRangeReduce range - Reduce a feature's activator's range
   FeatureActivatorRangeExtend range - Extend a feature's activator's range
   FeatureActivatorRangeSet range - Set a feature's activator's range
   FeatureActivatorClear - Clear a feature's activator
   FeatureActivatorSet id - Set a feature's activator feature
   FeatureRangeReduce range - Reduce a feature's range
   FeatureRangeExtend range - Extend a feature's range
   FeatureRangeSet range - Set a feature's range
   FeatureCategoriesClear - Remove feature from all categories
   FeatureCategoryRemove category* - Remove feature from a category
   FeatureCategoryAdd category* - Add feature to a category
   FeatureNameRemove name - Remove a name from a feature
   FeatureNameAdd name - Add a name to a feature
   FeatureCanonicalSet name - Change feature's canonical name
   FeatureIdSet newid - Change feature's unique id string
                                 71

Database commands:
   DBList - Lists all open databases
   DBDescribe [id*] - Get info on open databases
   DBCompare [id [id]] - Compares two databases
   DBSaveAll - Save all database
   DBSave [filename] - Save current database
   DBOpen filename [id] - Load a new database [changing its name]
   DBNEW id [filename] - Create new database [instead of id]

Miscellaneous commands:
   SetOutputFile [filename] - Command output goes to file.  No filename resets
     output to standard out
   AboutTaxy - Get info about the current version of Taxy
   SetCurrentDescription id - Change current description
   SetCurrentFeature id - Change current feature
   SetCurrentDB id - Change current database
   SetCurrentUser id - Change current user (a description)
   DescribeCurrentUser - Display user's description
   LoadCmdFile filename - Loads and executes a file of commands
   Search string* - Search features and descriptions for strings
   Help string* - Find help on a particular subject or element
   Quit - Ends program
   ExternalFileShow - Print the contents of default file
   ExternalFileAdd - Append to end of default file
   SaveLog [string] - Preserve the command log.
   SetPageHeight [number] - Set the number of lines printed before asking user
                                    72

Appendix C. Features and Values for Fungi Database

   This appendix contains a complete list of the features and values
developed for the fungi database. The features are arranged based on
what categories that belong to. Synonyms are enclosed in square braces
([]). In addition, each feature name is followed by the feature's id
string enclosed in brackets ({}).  The id strings are necessary to
distinguish between features with the same name, but different
activators.


Features in the category base:

Color {Color}:  A term feature with the range:  Black [Blackish], DarkBrown
  [Dark brown], Brown [Brownish, Medium brown], LightBrown [Light brown],
  Gray [Grayish], Off white [OffWhite, Buff, Cream], White, Pink [Pinkish],
  Red [Redish], Orange [Orangeish], Yellow [Yellowish], Green [Greenish],
  Blue [Blueish], Purple [Purpleish].

Status {Status}: A term feature with the range: Present [Yes, Y], Absent
  [No, N].


Features in the categories people, reference, automatic:

Author abbreviation {Abbrev}: A name feature.


Features in the categories people, fungi, reference, automatic:

Last Change {LastChng}: A period feature.

Last Editor {LastEdtr}: A description feature. Values must be a
  single description from the category people.

Editors {Editors}:  A description feature.  Values must be from the category
  people.

Approved {Approved}: A description feature.  Values must be from
  the category people.

Element Status {ElemStat}:  A term feature with the range:  O [Official], M
  [Modified], N [New].

Name {Name}: A name feature.


Features in the categories fungi, reference, automatic:

Common Name {CmmnName}: A name feature.

Level {Level}: A term feature with the range: Family, Genus, species,
  variety.

Parent {Parent}: A description feature. Values must be a single
  description from the category fungi.

Family {Family}: A compute feature.
  Computation: if intersect(, Level: Family) then
   (with leading cap)
  else  from descriptions in 
                                 73

Genus {Genus}: A compute feature.
  Computation: if intersect(, Level: Genus) then
   (with leading cap)
  else  from descriptions in 

species {species}: A compute feature.
  Computation: if intersect(, Level: species) then
   (with leading lowercase)
  else  from descriptions in 

variety {variety}: A compute feature.
  Computation: if intersect(, Level: variety) then
   (with leading lowercase)

Latin Name {LtnName}: A compute feature.
  Computation: if intersect(, Level: Family) then
  'Family ', 
  else if intersect(, Level: Genus) then
  'Genus ', 
  else if intersect(, Level: species) then
  , ' ', 
  else if intersect(, Level: variety) then
  , ' ', , ' var. ', 

Author Citation {Citation}: A compute feature.
  Computation: if set() then
  '(',  from descriptions in , ') ',
       from descriptions in 
  else  from descriptions in 


Features in the categories fungi, reference:

Author {Dscrbr}: A description feature. Values must be from the
  category people.

Reclassifier {Rclssfr}: A description feature. Values must be from the category
  people.

Contributors {Cntrbtrs}: A description feature. Values must be from
  the category people.

References {Refs}:  A term feature with the range:  AoC [Thiers, H.D. et al
  (1982-) The Agaricales (Gilled Fungi) of California.  Mad River Press;
  Eureka, Ca.], Amanitas [Jenkins, David T. (1986) Amanita of North
  America.  Mad River Press; Eureka, Ca.], MD2[Arora, David (1986)
  Mushrooms Demystified (2nd ed.).  Ten Speed Press; Berkeley, Ca.],
  AtRP [Arora, David (1991) All that the Rain Promises and More....  Ten
  Speed Press; Berkeley, Ca.], Audubon [Lincoff, Gary (1981) The Audubon
  Society Field Guide to North American Mushrooms.  New York:  Alfred
  Knopf.], MNA [Phillips, Roger (1991) Mushrooms of North America.  Little,
  Brown and Co.; Boston, Ma.].

Edibility {Edibilit}:  A term feature with the range:  Choice, Good, Edible,
  Inedible, Too small, Caution, Poisonous, Dangerous, Deadly,
  Hallucinogenic, Medicinal, Questionable [Unknown].
                                 74

Features in the categories fungi, id, macroscopic:

Overall shape {Shape}: A term feature with the range: Stipitate agaric
  [Mushroom], Shelf, Jelly fungi, Crust fungi [Non-stalked flask fungi],
  Club [Earth tongues], Branching [Coral], Gastroid [Puffball, False
  truffles, Truffles], Earthstar, Stalked puffball, Birdnest fungi, Stinkhorn,
  Cup fungi [Disc fungi], Stalked ascomycete [Morels, Elfin saddle].

Hymenophore Type {HmTp}: A term feature with the range: Gilled, Pored,
  Toothed, Smooth [Flat], Wrinkled [Folds, Rugose].

Size from above {SzFmAb}:  A float range feature.  Values are in mm.  Feature
  is active when the value for Overall shape {Shape} is in the range Stipitate
  agaric, Shelf, Club, Branching, Earthstar, Stalked puffball, Birdnest fungi,
  Stinkhorn, Cup fungi, Stalked ascomycete.

Total height {Height}:  A float range feature.  Values are in mm.  Feature is
  active when the value for Overall shape {Shape} is in the range Stipitate
  agaric, Shelf, Club, Branching, Earthstar, Stalked puffball, Birdnest fungi,
  Stinkhorn, Cup fungi, Stalked ascomycete.

Head height {HdHt}: A float range feature.  Values are in mm.  Feature
  is active when the value for Overall shape {Shape} is in the range
  Stinkhorn, Stalked puffball, Stalked ascomycete.

Size {Size}:  A float range feature.  Values are in mm.  Feature is active when
  the value for Overall shape {Shape} is in the range Jelly fungi, Crust fungi,
  Gastroid.

Surface color {SrCl}:  An alias of Color.  Feature is active when the value for
  Overall shape {Shape} is in the range Club.

Surface stains {SrSt}: An alias of Status. Feature is active when the value for
  Overall shape {Shape} is in the range Club.

Color {SrStCl}: An alias of Color. Feature is active when the value for Surface
  stains {SrSt} is in the range Present.

Upper surface color {UpCl}: An alias of Color. Feature is active when the value
  for Overall shape {Shape} is in the range Shelf, Birdnest fungi, Cup fungi.

Lower surface color {LowCl}: An alias of Color. Feature is active when
  the value for Overall shape {Shape} is in the range Shelf, Birdnest
  fungi, Cup fungi.

Head color {HdCl}:  An alias of Color.  Feature is active when the value for
  Overall shape {Shape} is in the range Stinkhorn, Stalked puffball, Stalked
  ascomycete.

Flesh stains {FlSt}:  An alias of Status.  Feature is active when the value for
  Overall shape {Shape} is in the range Shelf, Club, Cup fungi.

Color {FlStCl}:  An alias of Color.  Feature is active when the value for Flesh
  stains {FlSt} is in the range Present.

Pileus disc color {PlCl}: An alias of Color. Feature is active when
  the value for Overall shape {Shape} is in the range Stipitate agaric.

Pileus height {PlHt}: A float range feature. Values are in mm. Feature
  is active when the value for Overall shape {Shape} is in the range
  Stipitate agaric.
                                 75

Pileus surface staining {PlSt}: An alias of Status. Feature is active when the
  value for Overall shape {Shape} is in the range Stipitate agaric.

Color {PlStCl}: An alias of Color.  Feature is active when the value for Pileus
  surface staining {PlSt} is in the range Present.

Pileus shape from side {PlSd}:  A term feature with the range:  Parabolic,
  Campanulate [Bell-shaped], Conic, Convex, Plane [Applanate],
  Convoluted, Umbonate, Depressed, Infundibuliform [Funnel-shaped],
  Umbilicate. Feature is active when the value for Overall shape {Shape} is in
  the range Stipitate agaric.

Degree Umbonate {PlUm}: A term feature with the range: Cuspidate, Papillate
  [Mammilate]. Feature is active when the value for Pileus shape from side
  {PlSd} is in the range Umbonate, Umbilicate.

Degree Depressed {PlDp}: A term feature with the range: Shallow, Deep, Narrow,
  Broad. Feature is active when the value for Pileus shape from side {PlSd} is
  in the range Depressed, Umbilicate.

Margin cross section {PlCS}:  A term feature with the range:  Convoluted,
  Inrolled, Incurved, Decurved [Straight], Plane, Upturned.  Feature is active
  when the value for Overall shape {Shape} is in the range Stipitate agaric.

Margin color {PlMrCl}: An alias of Color.  Feature is active when the value for
  Overall shape {Shape} is in the range Stipitate agaric.

Margin edge {PlMrEd}: A term feature with the range: Entire, Crenate
  [Scalloped], Crisped [Crenulate], Undulating [Frilled], Eroded, Rimose,
  Appendiculate. Feature is active when the value for Overall shape {Shape} is
  in the range Stipitate agaric.

Margin surface {PlMrSr}:  A term feature with the range:  Smooth [Plain],
  Striate [Pectinate], Translucent-striate [Pellucid], Sulcate,
  Tuberculate-striate, Plicate-striate [Plicate, Pleated].  Feature is active
  when the value for Overall shape {Shape} is in the range Stipitate agaric.

Shape from above {ShSh}:  A term feature with the range:  Round [Orbicular],
  Conchate [Flabelliform, Fan-shaped], Dimidate [Semi-circular].  Feature is
  active when the value for Overall shape {Shape} is in the range Shelf.

Other pileus features {ShFt}: A term feature with the range: Absent [N, None,
  Plain], Zonate [Zoned], Rugulose, Rugose [Wrinkled, Corrugate].  Feature is
  active when the value for Overall shape {Shape} is in the range Shelf.

Pileus surface texture {PlTx}: A term feature with the range: Glabrous
  [Smooth, Glabrose], Shiny [Lucidus], Dull, Silky [Sericeous]. Feature is
  active when the value for Overall shape {Shape} is in the range Stipitate
  agaric.

Pileus ornamentation {PlOr}: A term feature with the range: Absent [N, None],
  Fibrils, Attached scales, Particles, Patch, Warts, Cracks.  Feature is active
  when the value for Overall shape {Shape} is in the range Stipitate agaric.

Color {PlOrCl}: An alias of Color.  Feature is active when the value for Pileus
  ornamentation {PlOr} is in the range Fibrils, Attached scales, Particles,
  Patch, Warts.
                                 76

Location {PlOrLc}: A term feature with the range: Entire surface, Margin only,
  Center only. Feature is active when the value for Pileus ornamentation {PlOr}
  is in the range Fibrils, Attached scales.

Fibril Type {PlFbTp}:  A term feature with the range:  Velutinous [Velvety],
  Villose, Pubesent, Hirsute, Hispid, Strigose, Fibrillose, Appressed-
  fibrillose, Virgate [Radiating-streaks], Downy-fibrillose, Canescent
  [Hoary], Floccose, Downy-wooly [Cottony], Tomentose, Matted-fibrillose.
  Feature is active when the value for Pileus ornamentation {PlOr} is in the
  range Fibrils.

Scale type {PlScTp}:  A term feature with the range:  Squamose, Squamulose,
  Appressed-squamulose, Imbricate-scaly, Recurved-squamulose, Squarrose,
  Punctate-squamulose, Scabrous, Scabrulose. Feature is active when the value
  for Pileus ornamentation {PlOr} is in the range Attached scales.

Particle type {PlPrTp}:  A term feature with the range:  Atomate, Micaceous,
  Superficial, Innate, Pruinose [Pulverulent], Furfuaceous, Granulose
  [Granular]. Feature is active when the value for Pileus ornamentation {PlOr}
  is in the range Particles.

Surface cracks {PlCrTp}: A term feature with the range: Split [Rimose,
  Lacinate], Areolate [Cracked].  Feature is active when the value for Pileus
  ornamentation {PlOr} is in the range Cracks.

Other pileus features {PlFt}: A term feature with the range: Absent [N, None,
  Plain], Zonate [Zoned], Rugulose, Rugose [Wrinkled, Corrugate],
  Hygrophanous, Scrobiculate, Rivulose, Spotted, Wavy.  Feature is active when
  the value for Overall shape {Shape} is in the range Stipitate agaric.

Viscidity {PlVs}: A term feature with the range: Dry [Arid], Viscid, Viscid
  when wet [Tacky, Subviscid], Glutinous, Gelatinous, Moist, Lubricous
  [Oily]. Feature is active when the value for Overall shape {Shape} is in the
  range Stipitate agaric.

Pileus KOH reaction {PlKOH}: A term feature with the range: Green, Absent [No,
  N, Negative], Yellow. Feature is active when the value for Overall shape
  {Shape} is in the range Stipitate agaric.

Pileus flesh color {PlFC}: An alias of Color. Feature is active when the value
  for Overall shape {Shape} is in the range Stipitate agaric.

Pileus flesh staining {PlFlSt}: An alias of Status. Feature is active when the
  value for Overall shape {Shape} is in the range Stipitate agaric.

Color {PlFlStCl}: An alias of Color. Feature is active when the value
  for Pileus flesh staining {PlFlSt} is in the range Present.

Pileus flesh consistency {PlCn}: A term feature with the range: Gelatinous,
  Fragile, Soft, Chalky, Brittle, Cartilaginous, Firm, Hard, Fibrous, Leathery
  [Tough, Coriaceous], Corky, Woody. Feature is active when the value
  for Overall shape {Shape} is in the range Stipitate agaric.

Stipe surface color {StSC}: An alias of Color. Feature is active when the value
  for Overall shape {Shape} is in the range
  Stipitate agaric.
                                 77

Stipe flesh color {StIC}: An alias of Color. Feature is active when
  the value for Overall shape {Shape} is in the range Stipitate agaric.

Stipe position {StPs}: A term feature with the range: Central, Eccentric [Off
  center, Excentric], Lateral. Feature is active when the value for
  Overall shape {Shape} is in the range Stipitate agaric.

Stipe shape {StSh}: A term feature with the range: Base Enlarged, Ventricose,
  Radicated [Pseudorhiza, Radicating], Base pinched, Bulbous, Tapered upward
  [Subclavate], Clavate, Incrassate, Clavate-bulbous, Equal, Tapered
  downward.  Feature is active when the value for Overall shape
  {Shape} is in the range Stipitate agaric.

Basal bulb {StBl}: A term feature with the range: Absent [No, N], Marginate,
  Marginate-depressed, Abruptly-bulbose.  Feature is active when the value for
  Overall shape {Shape} is in the range Stipitate agaric.

Stipe cross section {StCS}: A term feature with the range: Round, Compressed.
  Feature is active when the value for Overall shape {Shape} is in the range
  Stipitate agaric.

Stipe pith {StPt}:  A term feature with the range:  Hollow, Solid, Stuffed,
  Gelatinous. Feature is active when the value for Overall shape
  {Shape} is in the range Stipitate agaric.

Stipe consistency {StCn}: A term feature with the range: Gelatinous, Fragile,
  Soft, Chalky, Brittle, Cartilaginous, Firm, Hard, Fibrous, Leathery [Tough,
  Coriaceous], Corky, Woody.  Feature is active when the value for
  Overall shape {Shape} is in the range Stipitate agaric.

Stipe surface staining {StSt}: An alias of Status. Feature is active when the
  value for Overall shape {Shape} is in the range Stipitate agaric.

Color {StStCl}:  An alias of Color.  Feature is active when the value for Stipe
  surface staining {StSt} is in the range Present.

Stipe flesh staining {StFlSt}: An alias of Status. Feature is active when the
  value for Overall shape {Shape} is in the range Stipitate agaric.

Color {StFlStCl}: An alias of Color. Feature is active when the value for Stipe
  flesh staining {StFlSt} is in the range Present.

Stipe surface texture {StTx}: A term feature with the range: Smooth, Pruinose,
  Pubescent,  Hispid,  Downy,  Fibrillose, Fibrous,  Finely striate,  Floccose,
  Lacunose, Reticulate, Glandular-dotted, Scabrous, Ridged, Scrobiculate,
  Longitudinal-striate [Striate], Rugose [Wrinkled], Rugulose, Veined
  [Ribbed, Costate], Canaliculate [Canalliculate, Fluted].  Feature is active
  when the value for Overall shape {Shape} is in the range Stipitate agaric.

Stipe texture varies {StTxVr}: An alias of Status. Feature is active when the
  value for Overall shape {Shape} is in the range Stipitate agaric.

Texture above {StTxAb}: An alias of StTx.  Feature is active when the value for
  Stipe texture varies {StTxVr} is in the range Present.

Texture below {StTxBl}: An alias of StTx.  Feature is active when the value for
  Stipe texture varies {StTxVr} is in the range Present.
                                 78

Stipe length {StLn}: A float range feature. Values are in mm. Feature is active
  when the value for Overall shape {Shape} is in the range Stipitate agaric.

Stipe diameter at apex {StDm}: A float range feature. Values are in mm. Feature
  is active when the value for Overall shape {Shape} is in the range Stipitate
  agaric.

Rhizomorphs {StRz}:  An alias of Status.  Feature is active when the value for
  Overall shape {Shape} is in the range Stipitate agaric.

Basal tomentum {StTm}: An alias of Status. Feature is active when the value for
  Overall shape {Shape} is in the range Stipitate agaric.

Type {StTmTp}: A term feature with the range: Small hyphae [Small], Rhizoidal,
  Strigose.  Feature is active when the value for Basal tomentum
  {StTm} is in ther range Present.

Amount {StTmAm}:  A term feature with the range:  Scarce, Moderate, Abundant.
  Feature is active when the value for Basal tomentum {StTm} is in the
  range Present.

Partial Veil {PV}: An alias of Status. Feature is active when the
  value for Overall shape {Shape} is in the range Stipitate agaric.

Color {PVCl}: An alias of Color. Feature is active when the value for
  Partial Veil {PV} is in the range Present.

Type {PVTp}: A term feature with the range: Fibrillose, Gelatinous, Cortinate,
  Membranous. Feature is active when the value for Partial Veil {PV}
  is in the range Present.

Form {PVFr}: A term feature with the range: Single, Double, Stellate, Pendant,
  Intermediate,  Peronate [Sheathlike].   Feature is active when the value for
  Partial Veil {PV} is in the range Present.

Attachment {PVAt}: A term feature with the range: Free [Movable], Attached.
  Feature is active when the value for Partial Veil {PV} is in the
  range Present.

Position {PVPs}: A term feature with the range: Inferior [Basal],
  Central, Superior [Apical]. Feature is active when the value for
  Partial Veil {PV} is in the range Present.

Thickness {PVTh}: A term feature with the range: Thick, Medium [Intermediate],
  Thin.  Feature is active when the value for Partial Veil {PV} is in the range
  Present.

Fate {PVFt}: A term feature with the range: Persistent, Appendiculate,
  Evanescent. Feature is active when the value for Partial Veil {PV}
  is in the range Present.

Staining {PVSt}: An alias of Status. Feature is active when the value
  for Partial Veil {PV} is in the range Present.

Color {PVStCl}: An alias of Color. Feature is active when the value
  for Staining {PVSt} is in the range Present.

Under Color {PVUdCl}:  An alias of Color.  Feature is active when the value for
  Partial Veil {PV} is in the range Present.

Universal Veil {UV}:  An alias of Status.  Feature is active when the value for
  Overall shape {Shape} is in the range Stipitate agaric, Stinkhorn.
                                 79

Color {UVCL}: An alias of Color. Feature is active when the value for Universal
  Veil {UV} is in the range Present.

Type {UVTp}: A term feature with the range: Gelatinous, Collar
  [Circumsessile], Constricted [Flaring], Indistinct, Zoned [Rings],
  Saccate, Fragmented [Scaly], Powdery [Farinose].  Feature is active when the
  value for Universal Veil {UV} is in the range Present.

Staining {UVSt}: An alias of Status. Feature is active when the value
  for Universal Veil {UV} is in the range Present.

Color {UVStCl}: An alias of Color. Feature is active when the value
  for Staining {UVSt} is in the range Present.

Cup slit {CupSl}: An alias of Status. Feature is active when the value
  for Overall shape {Shape} is in the range Cup fungi.

Branch color {BrCl}:  An alias of Color.  Feature is active when the value for
  Overall shape {Shape} is in the range Branching.

Branch staining {BrSt}: An alias of Status. Feature is active when the
  value for Overall shape {Shape} is in the range Branching.

Color {BrStCl}: An alias of Color.  Feature is active when the value for Branch
  staining {BrSt} is in the range Present.

Branches with pendent teeth {BrTh}: An alias of Status. Feature is active when
  the value for Overall shape {Shape} is in the range Branching.

Tip shape {BrTS}:  A term feature with the range:  Blunt, Toothed, Wrinkled.
  Feature is active when the value for Branches with pendent teeth {BrTh} is in
  the range Absent.

Branch width {BrSz}: A float range feature. Values are in mm. Feature is active
  when the value for Overall shape {Shape} is in the range Branching.

Branch KOH reaction {BrKOH}: A term feature with the range: Green,
  Absent [No, N, Negative], Yellow. Feature is active when the value
  for Overall shape {Shape} is in the range Branching.

Peridiole color {PerCl}: An alias of Color. Feature is active when the
  value for Overall shape {Shape} is in the range Birdnest fungi.

Peridiole attachment {PerAt}:  A term feature with the range:  Gelatinous,
  Threads.  Feature is active when the value for Overall shape {Shape}
  is in the range Birdnest fungi.

Ray color {RayCl}: An alias of Color. Feature is active when the value
  for Overall shape {Shape} is in the range Earthstar.

Rays Hygroscopic {RayHy}: A term feature with the range: Yes. Feature is active
  when the value for Overall shape {Shape} is in the range Earthstar.

Peridium attachment {PrdAt}: A term feature with the range: Sessile,
  Stipitate. Feature is active when the value for Overall shape
  {Shape} is in the range Earthstar.

Peridium opening {GsOp}:  A term feature with the range:  None, Irregular,
  Apical pore [ApicalPore], Multiple pores [MultiplePores]. Feature is active
  when the value for Overall shape {Shape} is in the range Gastroid, Earthstar.
                                 80

Peridium thickness {GsPS}:  A term feature with the range:  Thin, Moderate,
  Thick. Feature is active when the value for Overall shape {Shape} is
  in the range Gastroid, Earthstar.

Surface color {GsSC}:  An alias of Color.  Feature is active when the value for
  Overall shape {Shape} is in the range Gastroid, Earthstar.

Surface stains {GsSS}: An alias of Status. Feature is active when the value for
  Overall shape {Shape} is in the range Gastroid, Earthstar.

Color {GsSSCl}: An alias of Color. Feature is active when the value for Surface
  stains {GsSS} is in the range Present.

Gleba color {GsIC}: An alias of Color. Feature is active when the
  value for Overall shape {Shape} is in the range Gastroid, Earthstar.

Gleba stains {GsIS}:  An alias of Status.  Feature is active when the value for
  Overall shape {Shape} is in the range Gastroid, Earthstar.

Color {GsISCl}:  An alias of Color.  Feature is active when the value for Gleba
  stains {GsIS} is in the range Present.

Hymenophore type {USTp}: A term feature with the range: Pitted, Gyrose,
  Convoluted, Saddle, Cup.  Feature is active when the value for Overall shape
  {Shape} is in the range Stalked ascomycete.

Lower surface texture {LSTx}: A term feature with the range: Smooth,
  Downy. Feature is active when the value for Overall shape {Shape} is
  in the range Stalked ascomycete.

Pore Color {PrCl}:  An alias of Color.  Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Pored.

Pores staining {PrSt}: An alias of Status. Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Pored.

Color {PrStCl}:  An alias of Color.  Feature is active when the value for Pores
  staining {PrSt} is in the range Present.

Pore diameter {PrSz}: A float range feature. Values are in mm. Feature
  is active when the value for Hymenophore Type {HmTp} is in the range Pored.

Pores Radiating {PrRd}: An alias of Status. Feature is active when the
  value for Hymenophore Type {HmTp} is in the range Pored.

Tube Color {TbCl}:  An alias of Color.  Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Pored.

Tubes staining {TbSt}: An alias of Status. Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Pored.

Color {TbStCl}:  An alias of Color.  Feature is active when the value for Tubes
  staining {TbSt} is in the range Present.

Tube length {TbSz}: A float range feature.  Values are in mm.  Feature
  is active when the value for Hymenophore Type {HmTp} is in the range Pored.

Tubes Separable {TpSp}: An alias of Status. Feature is active when the
  value for Hymenophore Type {HmTp} is in the range Pored.
                                 81

Gill Color {GlCl}:  An alias of Color.  Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Gilled.

Gill color changes with age {GlClCh}:  An alias of Status.  Feature is active
  when the value for Hymenophore Type {HmTp} is in the range Gilled.

Color when young {GlClYn}: An alias of Color. Feature is active when the value
  for Gill color changes with age {GlClCh} is in the range Present.

Color when old {GlClOd}: An alias of Color. Feature is active when the
  value for Gill color changes with age {GlClCh} is in the range Present.

Gill distance {GlDs}: A term feature with the range: Remote, Distant,
  Subdistant, Close [Typical], Crowded.  Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Gilled.

Gill attachment {GlAt}: A term feature with the range: Free, Adnexed, Sinuate
  [Notched], Emarginate [Abruptly adnexed], Adnate,
  Subdecurrent [Slightly-decurrent], Decurrent. Feature is active when
  the value for Hymenophore Type {HmTp} is in the range Gilled.

Collarium {Cllr}:  An alias of Status.  Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Gilled.

Gills seceding {GlSe}: An alias of Status. Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Gilled.

Gills staining {GlSt}: An alias of Status. Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Gilled.

Color {GlStCl}:  An alias of Color.  Feature is active when the value for Gills
  staining {GlSt} is in the range Present.

Gill height {GlHg}: A float range feature.  Values are in mm.  Feature
  is active when the value for Hymenophore Type {HmTp} is in the range Gilled.

Gill shape {GlSp}: A term feature with the range: Narrow, Broad, Moderately
  broad, Ventricose [Swollen], Concave, Straight, Convex. Feature is
  active when the value for Hymenophore Type {HmTp} is in the range Gilled.

Gill margin distinctive {GlMr}: An alias of Status. Feature is active when the
  value for Hymenophore Type {HmTp} is in the range Gilled.

Color {GlMrCl}: An alias of Color. Feature is active when the value
  for Gill margin distinctive {GlMr} is in the range Present.

Type {GlMrTp}: A term feature with the range: Entire [Smooth], Serrate,
  Serrulate, Crenate [Scalloped], Eroded, Fimbriate, Crisped, Undulating.
  Feature is active when the value for Gill margin distinctive {GlMr} is in the
  range Present.

Gills deliquescent {GlDl}: An alias of Status. Feature is active when the value
  for Hymenophore Type {HmTp} is in the range Gilled.

Gill face pattern {GlPt}:  A term feature with the range:  Mottled, Absent
  [None, N]. Feature is active when the value for Hymenophore Type
  {HmTp} is in the range Gilled.
                                 82

Gill face texture {GlFc}:  A term feature with the range:  Smooth [Plain],
  Pruinose [Pulverulent], Laccate [Waxed]. Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Gilled.

Lamellulae {GlLm}:  An alias of Status.  Feature is active when the value for
  Hymenophore Type {HmTp} is in the range Gilled.

Series {GlLmSr}:  A float range feature.  Feature is active when the value for
  Lamellulae {GlLm} is in the range Present.

Branching gills {GlBr}: An alias of Status. Feature is active when the
  value for Hymenophore Type {HmTp} is in the range Gilled.

Type {GlBrTp}:  A term feature with the range:  Furcate, Forked [Bifurcate,
  Dichotomus, Dichotomusly branched], Intervenose [Costate, Anatomosing].
  Feature is active when the value for Branching gills {GlBr} is in
  the range Present.

Latex {Lt}: An alias of Status.

Color {LtCl}: An alias of Color. Feature is active when the value for
  Latex {Lt} is in the range Present.

Color changes {LtCh}: An alias of Status.  Feature is active when the value for
  Latex {Lt} is in the range Present.

Color {LtClCh}:  An alias of Color.  Feature is active when the value for Color
  changes {LtCh} is in the range Present.

Opacity {LtOp}: A term feature with the range: Opaque, Clear, Watery [Whey-
  like]. Feature is active when the value for Latex {Lt} is in the
  range Present.

Spore Print {SpPr}: A term feature with the range: Obtainable [Yes, Y],
  Unobtainable [No, N].

Color {SpCl}: An alias of Color. Feature is active when the value for
  Spore Print {SpPr} is in the range Obtainable.

Habitat {Habitat}: A term feature with the range: Woods, Mixed woods
  [MixedWoods], Conifer, Hardwoods, Alder, Aspen, Birch, Cypress, Eucalyptus,
  Fir, Larch, Madrone, Manzanita, Oak, Pine, Redwood, Spruce, Tanoak, Willow,
  Live Oak [LiveOak], Lignicolous [Rotting wood, RottingWood], Compost,
  Hard packed soil [HardPackedSoil], Sandy soil [SandySoil], Lawns, Pasture,
  Moss, Bogs [Swamps], Terrestrial, Coprophilous [Dung, Manure],
  Fungicolous [Parasitic on fungi], Parasitic on insects.

Attachment to substrate {SbAt}:  A term feature with the range:  Stipitate,
  Substipitate, Piliate-sessile [Sessile], Effuso-reflexed, Imbricate
  [Shelf-like, Shelving], Ungulate [Hoof-like], Resupinate, Hypogeous.

Frequency {Freq}: A term feature with the range: Common, Uncommon, Rare.

Grouping {Group}: A term feature with the range: Connate, Caespitose,
  Gregarious, Scattered, Solitary.
Features in the categories fungi, id, olfactory:

Taste {LtTs}: A term feature with the range: Mild, Bitter, Acrid.  Feature is
  active when the value for Latex {Lt} is in the range Present.
                                 83

Odor {Odor}: A term feature with the range: Almond, Fruity, Indistinct [None,
  No, N], Phenolic, Pleasant, Unpleasant [Disagreeable], Green corn
  [GreenCorn], Raphanoid [Radishes], Anise, Fabaceous [Beans], Farinaceous
  [Ground meal, GroundMeal], Cucumber, Spermatic, Cedar, Alliaceous [Garlic,
  Onion], Skunk [Rotten cabbage].

Taste {Taste}: A term feature with the range: Mild, Bitter, Sweet, Acrid [Hot,
  Peppery], Sour, Farinaceous, Alliaceous [Garlic, Onion], Skunk [Rotten
  cabbage].
Features in the categories file, fungi, people:

Comments {Comments}: A file feature.  File type is log and the search
  paths are :comments:}, {\tt comments/.
                                    84

Appendix D. Example Database File

   The following is a toy example of a file conforming to the file
format in Section 4.4. The objects described are juggling balls.

#IJuggling;0;Taxy 1.0
#C  A name feature
#FN;Name(id;people)Name
#C  Term features
#FT;Color(id)Color(Green;Blue;Red[Crimson];White;Orange;Brown)
#FT;Shape(id)Shape(Sphere;Cube[Box])
#FT;Text(id)Text(Present;Absent)
#C  An activator
#FT;Message(id)Message(Indy;JOPC):Text;Present
#C  An alias feature
#FA;MsgColor(id)Color(Color):Text;Present
#FT;Contents(id)Contents(Rubber;Plastic;Nut shells)
#C  A float range feature
#FFR;MaxWidth(id)Max width(1;3;inches)
#C  A description feature
#FD;Owner(reference)Owner
#C  A computer feature.  If the value for the level feature
    includes the value "Genus", then something like
    "Nathan's Lacross Balls" with be the value.
    Otherwise it will just be the value for the name feature.
#FC;Title(reference;automatic)Title([(~()(Genus));
's ;])
#FT;Level(reference)Level(Family;Genus;species;subspecies)
#FD;Group(reference)Group
#FD;Members(reference)Members
#C  A time period feature
#FTP;Aquired(reference)Aquired
#C  An external file feature
#FEF;Comments(file;people;reference)Comments(log:\:comments\:;comments/)
                                    85

#C People:
#DNathan(people)Name(Nathan Wilson[N. Wilson;N.J. Wilson])Comments(Nathan)
#C Descriptions of individual juggling balls:
#DGeorge(id;reference)Name(George)Color(Green;White)Shape(Sphere)
Text(Present)Message(Indy)Contents(Plastic)Max width(1.71;1.8)
Level(subspecies)Group(Indy)Owner(Nathan)Aquired(1992D288;D15)
#DBill(id;reference)Name(Bill)Color(Blue;White)Shape(Sphere)
Text(Present)Message(Indy)Contents(Plastic)Max width(1.72;1.8)
Level(subspecies)Group(Indy)Owner(Nathan)Aquired(1992D288;D15)
#DRobert(id;reference)Name(Robert)Color(Red;White)Shape(Sphere)
Text(Present)Message(Indy)Contents(Plastic)Max width(1.73;1.8)
Level(subspecies)Group(Indy)Owner(Nathan)Aquired(1992D288;D15)
#DWombat(id;reference)Name(Wombat)Color(White)Shape(Sphere)Text(Absent)
Contents(Rubber)Max width(2.2;2.3)Level(subspecies)Group(Lacross)
Owner(Nathan)Aquired(1977D0;3D0)
#DOlive(id;reference)Name(Olive)Color(Orange)Shape(Sphere)Text(Absent)
Contents(Rubber)Max width(2.2;2.3)Level(subspecies)Group(Lacross)
Owner(Nathan)Aquired(1977D0;3D0)
#DJim(id;reference)Name(Jim)Color(Red)Shape(Cube)Text(Present)
Message(JOPC)Contents(Nut shells)Max width(2.2;2.3)Level(species)
Group(Bean)Owner(Nathan)Aquired(1981D211;D30)
#DJenny(id;reference)Name(Jenny)Color(Red)Shape(Cube)Text(Present)
Message(JOPC)Contents(Nut shells)Max width(2.2;2.3)Level(species)
Group(Bean)Owner(Nathan)Aquired(1981D211;D30)
#DRebecca(id;reference)Name(Rebecca)Color(Red)Shape(Sphere)Text(Absent)
Max width(2.2;2.3)Level(species)Group(Bean)Owner(Nathan)Aquired(1983D0;2D0)
#DBernice(id;reference)Name(Bernice)Color(Brown)Shape(Sphere)Text(Absent)
Max width(2.2;2.3)Level(species)Group(Bean)Owner(Nathan)Aquired(1984D0;2D0)
#C Descriptions of categories of juggling balls:
#DIndy(id;reference)Name(Indy balls)Level(species)Group(Rubber)Members(
Robert;Bill;George)
#DLacross(id;reference)Name(Lacross balls)Level(species)Group(Rubber)
Members(Wombat;Olive)
#DBean(id;reference)Name(Bean bags)Level(Genus)Group(Balls)Members(
Jenny;Jim;Bernice;Rebecca)
#DRubber(id;reference)Name(Rubber balls)Level(Genus)Group(Balls)Members(
Indy;Lacross)
#DBalls(id;reference)Name(Juggling balls)Members(Bean;Rubber)
                                    86

References

[AD93] J.E. Adaskaveg and Dunlap.  PolyKey.  Software package, 1993.
         University of California, Davis.

[Arn92] Eef Arnolds.  Mapping and monitoring of macromycete in
         relation to nature conservation. McIlvainea, 10(2):4{27, 1992.

[Aro86] David Arora. Mushrooms Demystified. Ten Speed Press, Berkeley,
         Ca., second edition, 1986.

[ask]    askSam Systems. askSam. Software package. Perry, Florida.

[DM86] T. Duncan and C.A. Meacham. MEKA 1.2. Software package,
         1986. University of California, Berkeley.

[ERB+92] K.W. Estep, F. Rey, K. Bjorklund, T. Dale, B.R. Heimdal,
         A.J.W. van Hertum, D. Hill, D. Hodell, E.E. Syvertsen,
         K. Tangen, and J. Throndsen. Deus Creavit; Linneaus
         Disposuit: An international effort to create a catalogue and
         expert system for the identification of protistan species. Sarsia,
         77:275{285, 1992.

[GR87] R.L. Gilbertson and L. Ryvarden. North American
         Polypores. Fungiflora, Oslo, Norway, 1986-1987.

[HV94] J.S. Hopple, Jr. and R. Vilgalys. Phylogenetic relationships
         among coprinoid taxa and allies based on data from
         restriction site mapping of nuclear rDNA. Mycologia,
         86(1):96{107, 1994.

[Kor94] Richard Korf. Simplified author citations for fungi and some
         new complications and old traps.  Unpublished. May appear in
         one of McIlvainea, Mycologia or Mycological Research, 1994.

[KS91] H.F. Korth and A. Silberschatz. Database System
         Concepts. McGraw-Hill, Inc., New York, New York, second edition, 1991.

[Lar86] David Largent. How to Identify Mushrooms to Genus I:
         Macroscopic Features. Mad River Press, Eureka, Ca., 1986.

[Mic93]  Microsoft, Inc. FoxBase. Software package, 1993. Redmond, Washington.

[Old] Robert Old. Personal communication. XID Services Inc., Pullman,
         Washington.

[Pen94] Elizabeth Pennisi. Name that fly: Computers help make species
         identification child's play. Science News, 145(7):108{111,
         February 1994.

[Riv87] R.L Rivest. Learning decision lists. Machine Learning,
         2:229{246, 1987.

[RJ94] Kevin Reichard and Eric F. Johnson. Tickled pink: The tool
         command language, or tcl, scripting langauge simplifies x
         window system development. UNIX Review, 12(3):87{91, 1994.

[SSW79] Alexander H. Smith, Helen V. Smith, and Nancy S. Weber. How to
         Know the Gilled Mushrooms. Wm. C. Brown Company Publishers,
         Dubuque, Iowa, 1979.

[SSW81] Alexander H. Smith, Helen V. Smith, and Nancy S. Weber. How to
         Know the Non-gilled Mushrooms.  Wm. C. Brown Company
         Publishers, Dubuque, Iowa, 1981.
                                    87

[WD85] L. Watson and M.J. Dallwitz. Australian Grass Genera: anatomy,
         morphology, keys, and classification.  Australian National
         University, Canberra, Australia, 1985.

[Web87] R.D. Webster. The Australian Paniceae (Poaceae). J. Cramer,
         Stuttgart, West Germany, 1987.

[Web88] Robert D. Webster.  DELTA.  Software package, 1988.
         Systematic Botany, Mycology, and Nematology Laboratory,
         Beltsville, Maryland.

[Wil92] E.O. Wilson. The Diversity of Life. W.W. Norton & Co., New
         York, New York, 1992.

Last modified: Jun 29, 1995

Nathan Wilson, nathan at collectivesource dot com