SIGIA-L Mail Archives: RE: SIGIA-L: Suggestions for a taxonomy?
From: Paul Bryan (pbryan_at_sapient.com)
Date: Wed Feb 20 2002 - 11:46:43 EST
1. Who is doing the indexing?
2. What is the nature of the documents that need to be retrieved?
If the average user submitting a document is going to be doing the indexing,
there may be a QA problem that will completely destroy the precision and
accuracy of search results based on a controlled vocabulary.
If documents consist of unstructured data (i.e. articles with regular text
and images, but no inherent metadata), then a structured indexing approach
will not necessarily yield better results than a free text search (see
infosci research such as
The posts to this list about taxonomies that I've read seem to be focusing
on structured data sets that are indexed by professionals. If the situation
arises in which you need a taxonomy for a text database indexed by the
system's users, then you might want to consider the following quick and dirty
approach. (It requires a fairly robust text search tool, e.g. Verity.)
First, create a term-frequency list (the terms in the documents, in order
of frequency of appearance; a search tool's inverted index can supply the
counts) from a sample of about 10,000 docs. Scoop off the
top 500 or so terms, and create a flat keyword list. Let users filter the
terms according to subjective criteria. Do a test and iterate on the flat
list of terms. This list should then reflect the natural language of your
subject matter. In the screen for document submission, give the user an
option to quickly assign terms from your keyword list. In the search screen,
give the user the capability to add terms from the term list, and include
instructional copy about how these terms will be used in the search.
So shoot me. It's not scientific, but it's quick, free, and simple for users
to relate to.
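
The frequency-list step described above could be sketched roughly as follows.
This is only an illustration, assuming the sample documents are already
available as plain-text strings. The function names, the tiny stopword set,
and the sample documents are all hypothetical, and a real text search tool
would do its own tokenizing and indexing.

```python
import re
from collections import Counter

# Hypothetical stopword set; a real one would be longer and tuned to the collection.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}

def tokenize(text):
    """Lowercase a document and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(docs, n=500):
    """Return the n most frequent non-stopword terms across all docs."""
    counts = Counter()
    for text in docs:
        for word in tokenize(text):
            if word not in STOPWORDS and len(word) > 2:
                counts[word] += 1
    return [term for term, _ in counts.most_common(n)]

def suggest_terms(text, keyword_list):
    """Return the keywords from the flat list that occur in a submitted document."""
    words = set(tokenize(text))
    return [term for term in keyword_list if term in words]

# Made-up sample; the post suggests about 10,000 docs and the top 500 or so terms.
sample = [
    "Taxonomy design for learning content",
    "Learning content and search precision",
    "Search tools index learning content",
]
keywords = top_terms(sample, n=3)
suggested = suggest_terms("New search features for the portal", keywords)
```

The list returned by top_terms is only the raw candidate list; the user
filtering and test iterations described above would still be done by hand.
suggest_terms is one possible way to pre-select terms on the document
submission screen, leaving the final choice to the user.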
From: Arbing, Susan [mailto:susan.arbing_at_cyberplex.com]
Sent: Tuesday, February 19, 2002 4:46 PM
Subject: SIGIA-L: Suggestions for a taxonomy?
I am working on a project for a client who is in the learning business. One
of the things I am doing is establishing an initial hierarchy for them to
get them up and running with their content management system.
I'm looking for any recommendations for a good taxonomy to serve as the
basis of their hierarchy. The top-level of the hierarchy will be domains
such as Business, Science, Information Technology, etc. Does anyone have a
suggestion for a taxonomy that would address this area specifically?