Preliminary Isearch FAQ

This file is meant as a supplement to, not a replacement for, the other
documentation provided with Isearch. Please be sure to read the TUTORIAL and
README files carefully before proceeding.
----------------------------------------------------------------------------

I'm trying to build Isearch-cgi and can't find idb.hxx, even after searching
the whole file system. I'm compiling with g++ from gcc-2.7.0.
     Compiling Isearch-cgi requires that you first compile Isearch itself in
     a separate directory, and it needs to be the most recent release of
     Isearch. Newer versions of Isearch include Isearch-cgi, which is no
     longer distributed separately.

How can I make Iindex index single-character words?
     Remove them from the stop word list (sw.hxx).

I ran Iindex on a large set of files overnight, but didn't see any index
files created. How can I speed up indexing?
     Try using -m to increase the amount of memory Iindex uses for indexing.
     The higher you set -m, the faster indexing will be, until you reach the
     physical memory limits of the machine. Iindex uses about 3 or 4 times
     (depending on the number of document records) the amount of memory
     specified in -m. At a minimum, -m should be set to slightly larger than
     the size of the largest document record being indexed, but if you have
     the memory, you should set it higher for the sake of speed.
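
     The sizing rule above can be turned into a rough calculation. The
     sketch below is only an illustration: the function name, the default
     multiplier of 4 (the pessimistic end of the "3 or 4 times" estimate),
     and the assumption that every size is expressed in the same unit that
     -m uses are ours, not part of Iindex.

```python
def suggest_m(physical_memory, largest_record, overhead_factor=4):
    """Suggest a value for Iindex's -m option (hypothetical helper).

    All sizes are assumed to be in the same unit that -m uses.
    overhead_factor=4 is the pessimistic end of the "3 or 4 times" estimate.
    """
    # Largest -m whose real footprint (about overhead_factor * m) still
    # fits in physical memory.
    ceiling = physical_memory // overhead_factor
    # -m should be slightly larger than the largest document record.
    floor = largest_record + 1
    # Never go below the floor, even if that exceeds the ceiling; in that
    # case the machine simply does not have enough memory for fast indexing.
    return max(floor, ceiling)
```

     For example, suggest_m(64, 2) returns 16, while suggest_m(8, 5)
     returns 6 because the largest record sets the lower bound.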

I am trying to compile Isearch on my machine, but I'm running into problems.
Are pre-compiled binaries available?
     Pre-compiled binaries of Isearch and Isearch-cgi are available at:
     ftp://ftp.cnidr.org/pub/software/

I've indexed a set of web pages, and searching seems to work fine, except
that when I search for something like 9-12, I get no hits, even though it is
in the data. What's going on?
     I think the problem is the "-" character. Right now Isearch stops at
     the first non-alphanumeric character, so "9" and "12" are treated as
     separate words. You would have to search for '9 and 12' instead. Until
     we put in configurable stop characters, that's all you can do.
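
     As a workaround, a front end could rewrite such queries before passing
     them on. The sketch below is a guess at the behaviour described above,
     not Isearch's actual tokenizer: it assumes ASCII letters and digits are
     the only word characters, and both function names are hypothetical.

```python
import re

def split_terms(query):
    """Split a query at every non-alphanumeric character, dropping empties."""
    return [term for term in re.split(r"[^0-9A-Za-z]+", query) if term]

def rewrite_query(query):
    """Join the pieces with 'and', so '9-12' becomes '9 and 12'."""
    return " and ".join(split_terms(query))
```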

Do I really have to re-index the data every time a file is modified or
deleted?
     Yes. Like many full-text search engines, Isearch is based on the
     assumption that fast searching is more important than fast updating
     (indexing). (We will, however, be speeding up Isearch indexing
     significantly.) And it doesn't really make sense to search on an old
     index if the data have changed, because the results would no longer be
     meaningful. For example, if you removed all the occurrences of a word
     from a file, and the index still reported the word as being there, then
     you would get wrong results if you were to search on that word.
     If you need to offer searching while the data keep changing, the main
     restriction is that the files being modified must not be the same
     files that were indexed. Each night, just before indexing, you could
     make a copy of the most recent versions of the files, and index
     those static copies. As soon as the index is finished, you would alias
     the new index files and data as current, and delete the old data and
     index files. Of course you would do this in a script, and it would take
     some disk space; but it is the best way I know of to provide continuous
     search access to changing data with a search engine that uses an
     indexing phase.
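
     The nightly copy-index-swap routine described above could be scripted
     roughly as follows. This is a sketch under several assumptions: a
     POSIX file system (symbolic links, atomic rename), a symbolic link
     named "current" serving as the alias, and a caller-supplied
     build_index function that runs the indexer on the snapshot. All of
     these names are hypothetical; how you invoke Iindex is up to you.

```python
import os
import shutil
import time

def rotate_index(live_data, work_root, build_index, stamp=None,
                 current_link="current"):
    """Snapshot live_data, index the snapshot, then repoint the alias.

    build_index is a caller-supplied function that indexes the snapshot
    directory it is given (for example, by running Iindex on it).
    """
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    snapshot = os.path.join(work_root, "snapshot-" + stamp)
    shutil.copytree(live_data, snapshot)  # static copy; live files may change
    build_index(snapshot)                 # index only the static copy

    link = os.path.join(work_root, current_link)
    old = os.path.realpath(link) if os.path.islink(link) else None

    # Build the new alias beside the old one, then rename it into place so
    # searchers never see a missing "current" link.
    tmp_link = link + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(snapshot, tmp_link)
    os.replace(tmp_link, link)

    # Delete the previous data and index once the alias points elsewhere.
    if old and old != os.path.realpath(snapshot) and os.path.isdir(old):
        shutil.rmtree(old)
    return snapshot
```

     Searches would always be pointed at the "current" alias, so they keep
     using the old index until the new one is complete.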

Are there any plans for Isearch to support spatial searching like freeWAIS?
i.e. northernmost latitude etc.?
     Yes - Isearch-2.00 supports numeric, date and spatial data searching.

Does Iindex support indexing with headlines?
     Try using Iindex with "-t FIRSTLINE" if all you need is to return the
     first textual line.

How complex can boolean queries be?
     Isearch supports arbitrarily complex boolean queries, as long as they
     are phrased in RPN (reverse Polish notation) format. That is:

          Isearch -d STORIES -rpn cat dog and mouse or

     is equivalent to ((cat and dog) or mouse). We should be releasing a
     version of Isearch-cgi to enable more 'normal' boolean queries soon.

I've encountered several bugs with Isearch. What information should I
include in a bug report?
     In general, information needed to track bugs is:
       1. A detailed bug description, so that the bug can be duplicated!
       2. Operating system and revision information (showrev)
       3. Compiler version and revision information
       4. Hardware
             + Architecture (e.g., sun4m)
             + Model (e.g., SS20M71). Yes, some bugs show up on different
               machines of the same architecture.

Is there stemming support in Isearch?
     No, but we are planning on adding it.

I'm indexing an extremely large database (around 600 megs), and just killed
Iindex after allowing it to run over 4 days. What's going on?
     Isearch was started with an emphasis on design, and some performance
     issues (for example, large data sets) have taken a back seat to adding
     features in an extensible way. In time, Isearch will far exceed the
     features, speed, and large-database capacity of other search engines
     such as freeWAIS.

Is there any documentation (or would someone be willing to answer
questions) on the document types supported by Iindex v1.09?

I'm working on a database that we are trying to update about once a day.
From what I can tell, it seems to be re-indexing all the data. What's going
wrong?
     No, nothing is wrong. Index merging is very slow! We suggest having
     two databases: the main index and a second one for the incremental
     additions. The incremental database is always much smaller, so the
     append time is short (and well suited for a process thread).

Is it possible to search multiple databases with one Isearch command?
     Kevin Gamiel wrote a "virtual IDB" class that lets you treat multiple
     Isearch databases as a single database (it opens an array of IDB
     objects and searches across all of them and combines the results). If
     you ask him *really* nice, he may be willing to put up the code, but
     you will have to link it in and fiddle with it until it works the way
     you want it to. (E.g., modify Isearch to use VIDB instead of IDB, etc.)
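
     The idea behind searching several databases and combining the results
     can be sketched without any Isearch code at all. The example below is
     not the VIDB class; it assumes each database is represented by a
     hypothetical search function returning (doc_id, score) pairs, and that
     scores from different databases are directly comparable, which may not
     hold in practice.

```python
def search_all(databases, query):
    """Search every database and merge the hits, best score first.

    databases maps a database name to a search function that returns a
    list of (doc_id, score) pairs for the query.
    """
    combined = []
    for name, search in databases.items():
        for doc_id, score in search(query):
            combined.append((score, name, doc_id))
    # Highest score first; ties keep the order in which they were found.
    combined.sort(key=lambda hit: hit[0], reverse=True)
    return combined
```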

From what I can tell, all the index hits are located before any results are
provided. I'm working with a very large db, though, and I'd like Isearch to
produce output as it gets a hit so I can process it on the fly. Is something
like this possible?
     I don't really see how. Until the search finishes, we don't know the
     scores. We may know some of the hits early, but we don't know which are
     the top hits until we are done. With a large DB, returning hits as they
     are found would give the "wrong" result set most of the time. The model
     is not grep, where any hit will do; we want to know which records score
     highest for query sentences composed of multiple terms.

I've been trying everything to get Isearch to compile and run, but nothing
seems to work. I'm using the very latest version of gcc.
     We've received periodic reports of strange problems with the newest
     version of gcc. We used version 2.6.3 to compile the pre-compiled
     version available, and suggest trying 2.6.3 if other versions don't
     work.

I'm working with a large database of files which have many occurrences of
words like "hud", "house", and "home". A search on "house" seems to foul up
the search. Can I fix this by adding these words as stopwords in the sw.hxx
file?
     Searching for those words does not "foul up" the search, but a
     single-word query such as "house" is just too narrow to return anything
     reasonable. Try many words! Something will, or should, float to the
     top. Single-word queries are often (in large databases) not very
     interesting.
     Adding words to the stop list removes any reference to them. The
     question to ask is: does the word have no meaning in the context of the
     database? If so, add it; if not, don't. The question is not how
     commonly the word occurs but one of semantics.

If one indexed a set of documents and some of the documents are edited, does
that make the previous index totally invalid?
     YES. You must mark the old version deleted. Move it, change the MTD
     path to reflect the new path and THEN add the new version... Version
     control is critical to the functioning of the index! The Isearch model
     does not have a dictionary so any, even minor, change to any of the
     documents can invalidate the index.

I'm seeing inconsistencies in assigning relevance scores. The same file is
given a different relevance score for the same search terms depending on how
large a range I select with -startdoc and -enddoc. How are the relevance
scores assigned?
     The scores are scaled based on the result subset. So, if the same file
     shows up as part of a different result subset (selected by specifying
     starting and ending docs), it will be given a different score. This is
     a bug in the Isearch command line tool, and we're working on it.
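
     To see why subset scaling changes a file's score, consider the toy
     example below. The scaling formula is an assumption made for
     illustration (the best hit in the subset is scaled to 100 and the rest
     proportionally); it is not taken from the Isearch source, and the file
     names and raw scores are made up.

```python
def scale_scores(raw_scores, top=100):
    """Scale positive raw scores so the best hit in this subset gets `top`."""
    if not raw_scores:
        return {}
    best = max(raw_scores.values())
    return {doc: round(top * score / best)
            for doc, score in raw_scores.items()}

raw = {"a.txt": 8.0, "b.txt": 4.0, "c.txt": 2.0}

# Scaled against the full result set, b.txt scores 50.
full = scale_scores(raw)

# Scaled against a subset that excludes a.txt, the same file scores 100.
subset = scale_scores({doc: raw[doc] for doc in ("b.txt", "c.txt")})
```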

Can I index documents that repeat the same field several times?
     The Isearch engine does support repeating fields, but there is no
     interface for retrieving the individual instances, simply because I
     haven't gotten around to it yet. It is quite easy to add a simple
     method that lets you retrieve field contents based on a subscript
     (e.g., title[1], title[2]). If people need this feature, I will go
     ahead and add it for the next version.
     Also, when I designed the field classes, I did want to support
     hierarchical fields, but felt that the field spec classes were already
     too complicated. I finally decided to use a flat model, with plans to
     eventually add some kind of support methods to allow a directory-like
     structure within the field name string that would have the effect of
     hierarchical fields.

Has anyone developed a document type for MARC records in Iindex?
     Yes, that has been added for 1.09.10, along with some bug fixes.

I've just compiled Isearch 1.09.09 using gcc-2.7.2 and libg++-2.7.1. When I
search for a _single_ term, I always get zero results. If I search for
multiple terms, the results look fine. What's going on?
     Sounds like another gcc problem. We've had good results using gcc
     2.6.3, and recommend falling back on it if newer versions cause strange
     problems with Isearch.

Does Isearch support hierarchical fields?
     Isearch does not now have the code to access fields in this way, but
     the capability is there because the interface to the engine is "open"
     (i.e., the engine doesn't care whether fields are nested or not; all
     field access is via internal mappings). Some additional work needs to
     be done in the Isearch library to provide a context-sensitive interface
     to the field data. Once this is done, we can allow any combination of
     flat (context-independent) or context-dependent field specifications in
     the query.

Can I update indexes while users are searching the database?
     The current code does not handle updates during searching. Since
     all access to the database happens through IDB, a simple way to do this
     would be to lock the entire database in the constructor. But then your
     update would need to be pretty fast, and the current version does not
     have code for fast updates. You could update a secondary database, and
     then merge results during searching from the main and secondary
     databases.


> soundex (or other phonetic matching)

No, Nassib's code notwithstanding.  Implementing it (if I remember what
he said at the time) would require a rewrite of the way we do the
indexing.

> root/suffix matching, i.e. "child" should match "children", and more so,
> "index" should match "indices".

Isearch supports right truncation - "child*" will match child, children,
childless, etc.  It has to be explicitly requested, though - it doesn't
do it by default.  It also does not currently do left truncation or word
stemming.
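
To make the behaviour concrete, here is a small sketch of right truncation
over a sorted word list. It is not how Isearch implements it; the sorted
list, the function name, and the convention that a trailing "*" requests
truncation (as in "child*" above) are assumptions made for the example.

```python
import bisect

def right_truncate(sorted_words, pattern):
    """Match 'prefix*' against a sorted word list; otherwise match exactly."""
    if not pattern.endswith("*"):
        # No truncation requested: only an exact match counts.
        i = bisect.bisect_left(sorted_words, pattern)
        if i < len(sorted_words) and sorted_words[i] == pattern:
            return [pattern]
        return []
    prefix = pattern[:-1]
    # All words sharing the prefix are adjacent in a sorted list.
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = lo
    while hi < len(sorted_words) and sorted_words[hi].startswith(prefix):
        hi += 1
    return sorted_words[lo:hi]
```

With the list ["child", "childless", "children", "chili", "index"], the
pattern "child*" matches the first three words, while "child" alone matches
only "child", and "indices" matches nothing.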

> synonym matching, i.e. infant should match neonate.

We've talked about this.  In principle, it's not hard to implement but
getting a good list of synonyms is hard.  Roget's charges big bucks for
their thesaurus.  We'll probably implement this at some point with some
small, discipline-specific lists to see how it goes.

> similar concept matching, i.e. child should match baby, or even dog should
> match bitch, in the proper context (yeah, I know, getting pretty hairy).

No.  Right now, Isearch does no real content-analysis.