Using the Web to Find `It': Solving the Identification Problem

Copyright 2000, 2001 by Nathan Wilson, further information on this idea and the author is available at http://www.collectivesource.com.  Permission is given to copy and distribute this document in its entirety.  In particular, the copyright, reference URL and copy permission must be included.

 

Contents

History and Background
Products and Technology
Markets
Competition
Manufacturing

History and Background

I am a 13-year veteran of software development and education with a master's in computer science and a master's in cognitive psychology, currently leading development for Dreamworks. My expertise spans a wide range of applications, including artificial intelligence, robotics, computer languages, distributed intranet applications and network-based game engines. I am a published author with contributions to scientific literature, publication of photographs and repeated requests to participate in prominent amateur events.
 
An area of particular interest to me is accurately and efficiently choosing the best solutions to difficult search problems. Difficult search problems exist in healthcare, e-business, job placement, manufacturing, finance, and many, many other areas. For example, in healthcare, it can be critical that a doctor correctly identify the cause of an illness. In online retail, finding the exact product that will best meet a user's needs will make them more likely to buy and to return as a repeat customer. In job placement, the better a match a recruiter can make between a job opportunity and their candidates, the more successful they will be. In each of these cases, there is only one best solution.
 
Part of the problem is having efficient access to the information. However, an often-neglected problem is how to efficiently search and sort through all that information once it is available. As a professional software developer I have always found this lack of efficient access and searching very frustrating. The information should be available through a computer and it should be easily searchable. In 1992 I left my job in artificial intelligence and robotics to go back to school and study computer science. The master's thesis I wrote in 1994 addressed exactly this type of search problem and included an early version of the prototype discussed below. Since that time I have worked on and off on this problem as time has permitted.

With the growth of the Web, I have realized that there is a growing need for solving exactly the type of search problem my research addresses. One of the most common problems people try to solve with the Web is to identify a specific resource they need to solve a problem. They want to find the perfect piece of furniture, or the best book to introduce them to soccer, or the ideal stock broker, etc. The data structures and algorithms that I have developed could be easily adapted to solve any of these problems.
 

Products and Technology

I call the technology I have developed for solving identification problems ID (for Identification Database technology). It is based on techniques used by scientists to identify unfamiliar organisms. The most interesting of these techniques for computer-based systems are synoptic keys. Synoptic keys allow the user to identify the members of a group that have a particular value for some feature, e.g., a blue-colored beak or 24-hour customer service. By applying additional features to each successive result, the user eventually arrives at a solution. One interesting thing about synoptic keys is that the user is not forced to provide the features in any particular order. This differentiates the method from most databases, which use relational data. The other distinctive feature is that each successive search is done only over the members of the group that were selected in the last search. This separates the method from most standard Internet search engines.
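The narrowing process described above can be sketched in a few lines of Python. This is only a minimal illustration, not the ID implementation: the candidate data, the feature names and the `narrow` function are all hypothetical, invented for this example.

```python
from typing import Dict

Candidate = Dict[str, str]  # feature name -> value

# Hypothetical demo data, invented for illustration only.
CANDIDATES: Dict[str, Candidate] = {
    "species A": {"beak color": "blue", "habitat": "forest", "size": "small"},
    "species B": {"beak color": "blue", "habitat": "marsh", "size": "large"},
    "species C": {"beak color": "yellow", "habitat": "forest", "size": "small"},
    "species D": {"beak color": "blue", "habitat": "forest", "size": "large"},
}


def narrow(active: Dict[str, Candidate], feature: str, value: str) -> Dict[str, Candidate]:
    """Keep only the active candidates that have the given value for the feature."""
    return {name: feats for name, feats in active.items() if feats.get(feature) == value}


# Features may be supplied in any order; each step searches only the
# candidates that survived the previous step.
step1 = narrow(CANDIDATES, "habitat", "forest")
step2 = narrow(step1, "beak color", "blue")
step3 = narrow(step2, "size", "large")
print(sorted(step3))  # ['species D']
```

Applying "beak color" before "habitat" leaves the same two candidates after the second step, which is the order independence that sets synoptic keys apart.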
 
The biggest problem in using textual synoptic keys is the need to maintain the list of active candidates. ID relieves this burden completely. It goes even further by suggesting which features would best help the user reduce the set of candidates they are considering. With this addition, the computer/human interaction becomes an active dialog in which the user provides the information they care about and the computer helps direct the user's interest to find the best answer as quickly as possible. Finally, ID recognizes that features are inter-related. For example, if an organism has a blue-colored beak, it must have a beak and in fact must be an animal. Current database technology and search engines cannot make this inference. ID can do this type of inference automatically, allowing the user to accelerate the identification process.
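Both ideas, inferring related features and suggesting the next feature to ask about, can be illustrated with a short Python sketch. It rests on assumptions that do not come from ID itself: the rule table, the candidate data and the function names are hypothetical, and the suggestion heuristic (prefer the feature whose largest group of matching candidates is smallest) is just one plausible choice.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple

Fact = Tuple[str, str]  # a (feature, value) pair

# Hypothetical rules, invented for illustration: one fact implies others.
IMPLIES: Dict[Fact, Set[Fact]] = {
    ("beak color", "blue"): {("has beak", "yes")},
    ("has beak", "yes"): {("kingdom", "animal")},
}

# Hypothetical set of candidates still under consideration.
ACTIVE: Dict[str, Dict[str, str]] = {
    "species A": {"beak color": "blue", "habitat": "forest", "size": "small"},
    "species B": {"beak color": "blue", "habitat": "marsh", "size": "large"},
    "species C": {"beak color": "yellow", "habitat": "forest", "size": "small"},
    "species D": {"beak color": "blue", "habitat": "meadow", "size": "small"},
}


def infer(facts: Set[Fact]) -> Set[Fact]:
    """Return the facts plus everything they imply, directly or indirectly."""
    closed = set(facts)
    pending = list(facts)
    while pending:
        fact = pending.pop()
        for implied in IMPLIES.get(fact, set()):
            if implied not in closed:
                closed.add(implied)
                pending.append(implied)
    return closed


def suggest_feature(active: Dict[str, Dict[str, str]], used: Set[str]) -> Optional[str]:
    """Suggest the unused feature whose largest value group is smallest.

    Returns None when no unused feature would shrink the candidate set.
    """
    best, best_worst = None, len(active)
    unused = sorted({f for feats in active.values() for f in feats} - used)
    for feature in unused:
        counts = Counter(feats.get(feature, "unknown") for feats in active.values())
        worst = max(counts.values())  # candidates left in the worst case
        if worst < best_worst:
            best, best_worst = feature, worst
    return best


print(sorted(infer({("beak color", "blue")})))
print(suggest_feature(ACTIVE, used=set()))  # habitat
```

In this made-up data, asking about habitat leaves at most two candidates whatever the answer, while beak color or size could leave three, so habitat is suggested first.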
 
A prototype program called Taxy is available at http://collectivesource.com/taxy/taxy.html. The prototype is strictly text based and requires a fair amount of experience to use effectively. For the purpose of providing an example, the demo database describes about 50 species of common fungi (mushrooms). I chose this subject because, with over 70,000 different species, fungi present one of the most complex identification problems. Most of this work was done as part of my 1994 master's thesis on using computers to help with biological identification. A version of my thesis is available on the above page.

My current intention is to create a set of general-purpose identification tools that could be applied to any domain. They would provide easy mechanisms for experts to create synoptic keys for any subject as well as methods for easily comparing small groups of candidates. I believe these tools could become as critical to the creation of a well-rounded web site as relational databases such as Oracle are to the current web.

The most important feature the current prototype is lacking is a graphical user interface. Java is the logical choice for the front end. The existing program would need to be reworked to act as the backend database. The focus areas for that work would be efficient interprocess communication and persistent storage. It would probably make the most sense to essentially start from scratch, using the prototype for inspiration. The data structures would stay essentially the same, as would the algorithms for searching and selecting features. The best push for this work would be a real-world target application chosen from one of the targeted market areas.

Once the basic implementation is solid and there is a useful working database, additional features could be added such as integration with other identification techniques, images, glossaries for domain specific terms, and domain specific data types.  Some of these may be part of the original implementation depending on the targeted market.
 

Markets

ID would be useful in a number of commercial contexts. It could provide a new feature to various special interest sites to attract new customers and to encourage repeat visits. For example, a movie/video site could use the technology to help users identify which movie they should rent, or a book site could help its users pick out the right book to buy. ID could also allow a business to help its employees with difficult search tasks like equipment problem diagnosis or unknown chemical identification.

 

Competition

Currently the distinction between identification and searching is not well recognized in the Internet world. On the Internet the existing search tools are the only tools available to do identification. However, these tools generally overwhelm the user with a collection of candidates that may or may not meet their needs and give the user no easy way to sift through all this information. Internet search tools are polarized between the very general text engines, i.e., pattern-matching engines like AltaVista or Google, and the very structured table engines, i.e., relational database engines like the Internet Movie Database (imdb.com) or Amazon.com based on industry-standard databases like Oracle. Text engines are characterized by the automatic processing of large amounts of essentially unstructured data. Table engines are characterized by very structured data created by the database designers.

There is also a largely unexplored middle ground between the unstructured data of text engines and the highly structured data of table engines. The few existing examples (Yahoo and AskJeeves being the most prominent) either simply combine techniques from the more standard methods or provide only rudimentary identification capability. Yahoo is an example of the first, in that it extends a text engine with a database of keywords to find web pages. AskJeeves, on the other hand, begins to set up a dialog with the user, but it has a rather short attention span. AskJeeves generally attempts to answer a particular question in a single attempt. In some cases it will come back with a single followup question. The problem with this approach is that many identification problems are too complex to be handled in this way. Consider trying to diagnose an illness. If you ask AskJeeves the question "What illness do I have?", it goes down one level to ask for a particular symptom and then gives you a definition of that symptom and a long list of possible causes. It does not engage you in further dialog to actually help you decide between these possibilities.
 

Both text and table searching are insufficient for effective identification.  The text searching tools are hindered because there is no effective way to reveal the structure within the results of a search.  For example, you might do a search for `furniture' near `livingroom'.  However, once you had that presumably very large set, what would you use to organize the results?  You might get lucky and stumble on a site that provides that organization, but finding such a perfect web site is much like finding the perfect wave for a traditional surfer.
 
Table searching at first glance seems like it might stand a better chance of finding the perfect piece of furniture. However, table-based searches are best suited for problems where there is a significant amount of similarity between the things that are being compared. Classic examples are addresses or bank charges. They tend to get confusing and difficult to use when the things are more difficult to compare. In the furniture example, a table-based search might contain information about whether a particular type of lamp can use a three-way bulb. The person using the system might still be trying to decide whether they want to put a lamp or a set of shelves in a particular corner, but they know that if it's a lamp then they want to use a three-way bulb. Table-based systems have a difficult time with this type of information and either end up being very inefficient or force the user into making `high-level' decisions before the system allows them to give more detailed information. Continuing the example, the database would either have to say that shelves take neither three-way nor regular bulbs, or the user would have to decide between shelves or lamps before discussing the type of bulb. Neither of these approaches scales well as the number of distinguishing features grows.
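To make the lamp and shelves example concrete, here is a small Python sketch of one way a feature-based representation might sidestep the problem. It is an assumption made for illustration, not a description of how ID works: each item lists only the features that apply to it, and a constraint on a feature is ignored for items that do not have that feature. The catalog and the `narrow_if_applicable` function are hypothetical.

```python
from typing import Dict

# Hypothetical catalog: each item lists only the features that apply to it,
# so shelves carry no "bulb" entry at all.
CATALOG: Dict[str, Dict[str, str]] = {
    "floor lamp": {"kind": "lamp", "bulb": "three-way"},
    "desk lamp": {"kind": "lamp", "bulb": "regular"},
    "corner shelves": {"kind": "shelves", "shelf count": "4"},
}


def narrow_if_applicable(
    active: Dict[str, Dict[str, str]], feature: str, value: str
) -> Dict[str, Dict[str, str]]:
    """Keep items with the wanted value, plus items the feature does not apply to."""
    return {
        name: feats
        for name, feats in active.items()
        if feature not in feats or feats[feature] == value
    }


# "If it's a lamp, it must take a three-way bulb" without first
# choosing between lamps and shelves.
remaining = narrow_if_applicable(CATALOG, "bulb", "three-way")
print(sorted(remaining))  # ['corner shelves', 'floor lamp']
```

The desk lamp is ruled out, while the shelves stay in play because the bulb question simply does not apply to them.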

 

Manufacturing

In general the manufacturing for ID would be a straightforward software development process. One potential issue is that the original prototype has been released as part of an open source project for biological identification. As the author of this project I would not want to make any intellectual property agreements that restricted my or anyone's ability to continue to work freely on this project for that purpose.

 
 

Last Modified: January 28, 2001