Using the Web to Find `It': Solving the Identification Problem
Copyright 2000, 2001 by Nathan Wilson,
further information on this idea and the author is available at http://www.collectivesource.com.
Permission is given to copy and distribute this document in its entirety.
In particular, the copyright, reference URL and copy permission must be
included.
Contents
History and Background
Products and Technology
Markets
Competition
Manufacturing
History and Background
I am a 13-year veteran of software development and education, with a
master's degree in computer science and a master's degree in cognitive
psychology, currently leading development for Dreamworks. My expertise
spans a wide range of applications including artificial intelligence,
robotics, computer languages, distributed intranet applications and
network-based game engines. I am a published author with contributions to
scientific literature, publication of photographs and repeated requests
to participate in prominent amateur events.

An area of particular interest to me is accurately and efficiently choosing
the best solutions to difficult search problems. Such problems exist in
healthcare, e-business, job placement, manufacturing, finance, and many
other areas. For example, in healthcare, it can be critical that a doctor
correctly identify the cause of an illness. In online retail, finding the
exact product that best meets a user's needs makes them more likely to buy
and to return as a repeat customer. In job placement, the better the match
a recruiter can make between a job opportunity and a candidate, the more
successful the recruiter will be. In each of these cases, there is only one
best solution.

Part of the problem is having efficient access to the information. However,
an often neglected problem is how to efficiently search and sort through
all that information once it is available. As a professional software
developer I have always found this lack of efficient access and searching
very frustrating. The information should be available through a computer
and it should be easily searchable. In 1992 I left my job in artificial
intelligence and robotics to go back to school and study computer science.
The master's thesis I wrote in 1994 addressed exactly this type of search
problem and included an early version of the prototype discussed below.
Since that time I have worked on and off on this problem as time has
permitted.

With the growth of the Web, I have realized that there is a growing need
for solving exactly the type of search problem my research addresses. One
of the most common things people try to do with the Web is to identify
a specific resource they need to solve a problem. They want to find
the perfect piece of furniture, or the best book to introduce them to
soccer, or the ideal stockbroker, etc. The data structures and algorithms
that I have developed could be easily adapted to solve any of these
problems.
Products and Technology
I call the technology I have developed for solving identification problems
ID (for Identification Database technology). It is based on techniques used
by scientists to identify unfamiliar organisms. The most interesting of
these techniques for computer-based systems are synoptic keys. A synoptic
key lets the user select the members of a group that have a particular
value for some feature, e.g., a blue-colored beak or 24-hour customer
service. By applying additional features to each successive result, the
user eventually arrives at a solution. One interesting thing about synoptic
keys is that the user is not forced to provide the features in any
particular order. This differentiates the method from most databases, which
use relational data. The other distinctive feature is that each successive
search is done only over the members of the group that were selected in the
last search. This separates the method from most standard Internet search
engines.
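
The successive narrowing that a synoptic key performs can be sketched in
a few lines of Python. This is only an illustration, not the Taxy
implementation: the toy candidates, the feature names and the `narrow`
function are all made up for the example.

```python
# Hypothetical toy database: candidate -> {feature: value}.
BIRDS = {
    "bird_a": {"beak": "blue", "habitat": "coast"},
    "bird_b": {"beak": "yellow", "habitat": "pond"},
    "bird_c": {"beak": "blue", "habitat": "pond"},
    "bird_d": {"beak": "yellow", "habitat": "pond"},
}


def narrow(candidates, feature, value):
    """Keep only the active candidates whose `feature` has `value`."""
    return {
        name: feats
        for name, feats in candidates.items()
        if feats.get(feature) == value
    }


# Each step searches only the survivors of the previous step.
step1 = narrow(BIRDS, "habitat", "pond")
step2 = narrow(step1, "beak", "blue")
print(sorted(step2))  # -> ['bird_c']

# Supplying the same features in the opposite order gives the same result.
reverse = narrow(narrow(BIRDS, "beak", "blue"), "habitat", "pond")
print(reverse == step2)  # -> True
```

Because every step is just a filter over the current set, the order in
which the user supplies features does not change the final answer.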

The biggest problem in using textual synoptic keys is the need to maintain
the list of active candidates. ID relieves this burden completely. It goes
even further by suggesting which features would best help the user reduce
the set of candidates under consideration. With this addition, the
computer/human interaction becomes an active dialog in which the user
provides the information they care about and the computer helps direct the
user's attention so that the best answer is found as quickly as possible.
Finally, ID recognizes that features are inter-related. For example, if an
organism has a blue-colored beak, it must have a beak and in fact must be
an animal. Current database technology and search engines cannot make this
inference. ID can do this type of inference automatically, allowing the
user to accelerate the identification process.
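
The two ideas in this paragraph, inferring related features and suggesting
which feature to ask about next, can be sketched as follows. The text does
not say how ID does either, so everything here is an assumption made for
illustration: the `IMPLIES` rules, the toy organisms, and especially the
heuristic in `suggest_feature`, which simply prefers the feature whose most
common answer leaves the fewest candidates.

```python
from collections import Counter

# Hypothetical implication rules: one (feature, value) fact implies others.
IMPLIES = {
    ("beak_color", "blue"): [("has_beak", True)],
    ("has_beak", True): [("kingdom", "animal")],
}


def infer(facts):
    """Return `facts` plus every fact reachable through IMPLIES."""
    known = set(facts)
    pending = list(facts)
    while pending:
        fact = pending.pop()
        for implied in IMPLIES.get(fact, []):
            if implied not in known:
                known.add(implied)
                pending.append(implied)
    return known


# Hypothetical toy candidates: name -> {feature: value}.
ORGANISMS = {
    "org_a": {"kingdom": "animal", "has_beak": True, "habitat": "forest"},
    "org_b": {"kingdom": "animal", "has_beak": True, "habitat": "pond"},
    "org_c": {"kingdom": "animal", "has_beak": False, "habitat": "pond"},
    "org_d": {"kingdom": "fungus", "has_beak": False, "habitat": "pond"},
}


def suggest_feature(candidates, already_used=()):
    """Assumed heuristic: pick the unused feature whose largest group of
    candidates sharing one value is smallest (best worst-case narrowing)."""
    features = {f for feats in candidates.values() for f in feats}
    best, best_worst = None, None
    for feature in sorted(features - set(already_used)):
        counts = Counter(feats.get(feature) for feats in candidates.values())
        worst = max(counts.values())
        if best_worst is None or worst < best_worst:
            best, best_worst = feature, worst
    return best


print(sorted(infer([("beak_color", "blue")]), key=str))
print(suggest_feature(ORGANISMS))  # -> has_beak
```

In this toy data `has_beak` splits the four candidates two and two, while
`kingdom` and `habitat` each leave three candidates in the worst case, so
`has_beak` is the suggested next question.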

A prototype program called Taxy is available at
http://collectivesource.com/taxy/taxy.html.
The prototype is strictly text-based and requires a fair amount of
experience to use effectively. For the purpose of providing an example, the
demo database describes about 50 species of common fungi (mushrooms).
I chose this subject because, with over 70,000 different species, fungi
pose one of the most complex identification problems. Most of this work was
done as part of my 1994 master's thesis on using computers to help with
biological identification. A version of my thesis is available on the above
page.

My current intention is to create a set of general-purpose identification
tools that could be applied to any domain. They would provide easy
mechanisms for experts to create synoptic keys for any subject as well
as methods for easily comparing small groups of candidates. I believe
these tools could become as critical to the creation of a well-rounded web
site as relational databases such as Oracle are to the current web.

The most important feature the current prototype lacks is a graphical
user interface. Java is the logical choice for the front end.
The existing program would need to be reworked to act as the backend
database. The focus areas for that work would be efficient interprocess
communication and persistent storage. It would probably make the most sense
to essentially start from scratch, using the prototype for inspiration.
The data structures would stay essentially the same, as would the
algorithms for searching and selecting features. The best push for this
work would be a real-world application chosen from one of the targeted
market areas.

Once the basic implementation is solid and there is a useful working
database, additional features could be added such as integration with other
identification techniques, images, glossaries for domain-specific terms,
and domain-specific data types. Some of these may be part of the original
implementation depending on the targeted market.
Markets
ID would be useful in a number of commercial contexts. It could provide
a new feature for special interest sites to attract new customers and to
encourage repeat visits. For example, a movie/video site could use the
technology to help users identify which movie they should rent, or a book
site could help its users pick out the right book to buy. ID could also
allow a business to help its employees with difficult search tasks like
equipment problem diagnosis or unknown chemical identification.
Competition
Currently the distinction
between identification and searching is not well recognized in the Internet
world. On the Internet the existing search tools are the only tools
available to do identification. However, these tools generally overwhelm
the user with a collection of candidates that may or may not meet their
needs and give the user no easy way to sift through all this information.
Internet search tools are polarized between the very general text engines,
i.e., pattern-matching engines like AltaVista or Google, and the very
structured table engines, i.e., relational database engines like the
Internet Movie Database (imdb.com) or Amazon.com, based on industry-standard
databases like Oracle. Text engines are characterized by the automatic
processing of large amounts of essentially unstructured data. Table engines
are characterized by very structured data created by the database
designers.
There is also a largely unexplored middle ground between the unstructured
data of text engines and the highly structured data of table engines. The
few existing examples (Yahoo and AskJeeves being the most prominent) either
simply combine techniques from the more standard methods or provide only
rudimentary identification capability. Yahoo is an example of the first, in
that it extends a text engine with a database of keywords to find web
pages. AskJeeves, on the other hand, begins to set up a dialog with the
user, but it has a rather short attention span. AskJeeves generally
attempts to answer a particular question in a single attempt. In some cases
it will come back with a single follow-up question. The problem with this
approach is that many identification problems are too complex to be handled
in this way. Consider trying to diagnose an illness. If you ask AskJeeves
the question "What illness do I have?", it goes down one level to ask for
a particular symptom and then gives you a definition of that symptom and
a long list of possible causes. It does not engage you in further dialog to
actually help you decide between these possibilities.

Both text and table searching are insufficient for effective
identification. Text searching tools are hindered because there is no
effective way to reveal the structure within the results of a search. For
example, you might do a search for `furniture' near `livingroom'. However,
once you have that presumably very large set, what would you use to
organize the results? You might get lucky and stumble on a site that
provides that organization, but finding such a perfect web site is much
like finding the perfect wave for a traditional surfer.

Table searching at first glance seems like it might stand a better chance
of finding the perfect piece of furniture. However, table-based searches
are best suited for problems where there is a significant amount of
similarity between the things being compared. Classic examples are
addresses or bank charges. They tend to get confusing and difficult
to use when the things are harder to compare. In the furniture example,
a table-based search might contain information about whether a particular
type of lamp can use a three-way bulb. The person using the system might
still be trying to decide whether to put a lamp or a set of shelves in
a particular corner, but they know that if it's a lamp they want to use
a three-way bulb. Table-based systems have a difficult time with this type
of information and either end up being very inefficient or force the user
into making `high-level' decisions before the system allows them to give
more detailed information. Continuing the example, the database would
either have to say that shelves take neither three-way nor regular bulbs,
or the user would have to decide between shelves and lamps before
discussing the type of bulb. Neither of these approaches scales well as the
number of distinguishing features grows.
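
One way to picture an alternative to the fixed table is to store, for each
item, only the features that actually apply to it, and to let a feature an
item lacks leave that item in play. The sketch below assumes that
representation purely for illustration; the catalog entries and the
`matches` and `candidates` helpers are hypothetical and are not taken from
ID or Taxy.

```python
# Hypothetical catalog: each item lists only the features that apply to it.
CATALOG = {
    "floor lamp": {"type": "lamp", "bulb": "three-way"},
    "desk lamp": {"type": "lamp", "bulb": "regular"},
    "corner shelves": {"type": "shelves", "material": "oak"},
}


def matches(item, constraints):
    """An item is ruled out only if it has a constrained feature with a
    different value; features it does not have are simply ignored."""
    return all(item.get(f, v) == v for f, v in constraints.items())


def candidates(constraints):
    """Names of all catalog items still consistent with `constraints`."""
    return sorted(
        name for name, item in CATALOG.items() if matches(item, constraints)
    )


# The user states a bulb preference before choosing lamp versus shelves.
print(candidates({"bulb": "three-way"}))
# -> ['corner shelves', 'floor lamp']

# Deciding on a lamp later narrows the same set further.
print(candidates({"bulb": "three-way", "type": "lamp"}))
# -> ['floor lamp']
```

Here the shelves need no bulb column at all, and the user can give the
detailed preference first and make the `high-level' decision afterwards.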
Manufacturing
In general the manufacturing for ID would be a straightforward software
development process. One potential issue is that the original prototype has
been released as part of an open source project for biological
identification. As the author of this project I would not want to make any
intellectual property agreements that restricted my or anyone else's
ability to continue to work freely on this project for that purpose.
Last Modified: January 28, 2001