SWISH.C, VERSION 1.0
_________________________________________________________________

Contents

     * What is SWISH?
     * Great! How do I get started?
     * Searching with SWISH
     * Indexing with SWISH
     * Configuration file options
          + Basic index variables
          + Using ReplaceRules
          + Using file name rules
     * Usage
     * Command-line options
     * That's it!
_________________________________________________________________

What is SWISH?

   SWISH stands for Simple Web Indexing System for Humans. With it, you
   can index directories of files and search the generated indexes.

   For an example of what swish can do, try searching for the words
   "office and map" at EIT. All of the search databases you see there
   (with the exception of the WAIS Directory of Servers) were indexed by
   swish. When you do a search, it's the swish program that's doing the
   actual searching.

   SWISH was created to fill the need of the growing number of Web
   administrators on the Internet - many current indexing systems are not
   well documented, are hard to use and install, and are too complex for
   their own good. Here are some pros and cons regarding SWISH:

     * It's simple. I've tried to make SWISH as simple as possible while
       keeping some of the things that people look for in an indexer. The
       drawback is that you can't do many things that full-featured
       indexers and searching programs can do, such as stemming
       (searching for different versions of a word), use of synonyms, and
       complex boolean searches.
     * It's made for Web sites. In indexing HTML files, SWISH ignores
       data in tags and gives higher relevance to information in header
       and title tags. Titles are extracted from HTML files and appear in
       the search results. SWISH can automatically search your whole Web
       site for you in one pass, if it's under one directory.
     * It's fairly nice on disk space and is pretty fast. Index files
       consist of only one file, so they can be transported around and
       easily maintained.
       The SWISH source is only around 50k and generated indexes average
       out to around half the size of comparable WAIS indexes. Searching
       usually is as fast as using a non-commercial WAIS-based solution.
     * You can fix the source. I encourage people to send in patches and
       suggestions on how to make SWISH better. Although it's not in the
       public domain, I am always more than happy to integrate
       contributed code into the distribution.
_________________________________________________________________

Great! How do I get started?

   First, you need to grab the source code and related files at
   ftp://ftp.eit.com/web.software/swish/. There should be:

     * The source code (something like swish.c)
     * A sample index file
     * A sample configuration file

   Compile the source code with your favorite C compiler, linking against
   the math library. Everything was written in pretty strict ANSI C, so
   it should work just about anywhere. Try something like this first:

      cc swish.c -o swish -lm

   The swish program can go under /usr/local/bin - you may want to put
   other SWISH things somewhere such as /usr/local/httpd/swish, if you're
   using NCSA's httpd. You'll also want to create a directory to hold
   SWISH databases, somewhere like /usr/local/httpd/swish/sources. You
   can store the files anywhere you like, as long as you remember where
   they are!

   After you've compiled (and installed) SWISH, make sure the swish
   program is somewhere in your executable path (somewhere such as
   /usr/local/bin).
_________________________________________________________________

Searching with SWISH

   If you got the sample SWISH index (called sample.swish), you can do a
   simple search on it. Try typing this:

      swish -f sample.swish -w internet and resources and archie

   This will search the file sample.swish for files containing the words
   internet and resources and archie.
   You should get something back like this:

# SWISH format 1.0
search words: internet and resources and archie
1000 http://www.eit.com/web/www.guide/guide.15.html "Guide to Cyberspace 6.1: Index/Glossary" 11566
500 http://www.eit.com/web/netservices.html "Internet Resources List" 48391
.

   The results tell you:

    1. The format the results are in (so future versions of SWISH or
       other searching programs know this),
    2. The search words you used,
    3. A result line - this is made up of:
          + The relevance rank. This number is generated with each result
            and is the program's "best guess" as to how relevant it
            thinks the file is to your query. This rank number, which can
            range from 1 to 1000, depends on a number of factors, such as
            how many times your search word appears in the file, how many
            words are in the file, and if the word appears in a title or
            header tag (if it's an HTML file), among other factors.
          + The path name to the file. This may be an address, such as a
            URL, or a full path to the file.
          + The title of the file. If this is an HTML file, this is the
            title. This may also be the name of the file (if there is no
            title).
          + The size of the file. This size is always in bytes.
    4. A period. This signifies the end of the results. A line with a
       period always signifies the end of swish output.

   If there are errors, instead of the results list, you may get one of
   the following error lines. These lines will always be prefixed with
   err:.

     * err: no results
       There were no results of the search.
     * err: could not open index file
       Either the index file could not be found or it couldn't be opened.
     * err: no search words specified
       No words were specified for searching.
_________________________________________________________________

Indexing with SWISH

   SWISH has the capability to use configuration files in which you can
   specify all sorts of options for indexing. To use a configuration
   file, call it something such as swish.conf, and place it somewhere
   such as /usr/local/httpd/swish/.
   The configuration file below is an example of a typical SWISH
   configuration file:
_________________________________________________________________

# SWISH configuration file
# Lines beginning with hash marks (#) and
# blank lines are ignored.

IndexDir /usr/local/www
# This is the root directory of the Web tree you want to index.

IndexFile /usr/local/httpd/swish/sources/index.swish
# This is the name your SWISH index will be built as.

IndexOnly .html .txt .c .ps .gif .au .hqx .xbm .mpg .pict .tiff
# Only files with these suffixes will be indexed.

IndexVerbose yes
# Set this to yes to show indexing information as swish is working.

NoContents .ps .gif .au .hqx .xbm .mpg .pict .tiff
# Files with these suffixes won't have their contents indexed,
# only their file names.

ReplaceRules replace "/usr/local/www" "http://www.eit.com"
# ReplaceRules append ""
# ReplaceRules prepend ""
# ReplaceRules allow you to make changes to file pathnames
# before they're indexed.

# File names matching the following criteria will not be indexed.
pathname contains admin testing demo trash construction confidential
filename is index.html
filename contains \~ .bak .orig .old old.
title contains construction example pointers
_________________________________________________________________

   To index a site using the options in a configuration file, type:

      swish -c /usr/local/httpd/swish/swish.conf

   This runs swish and indexes your site. With the sample configuration
   above, you'd end up with one file called index.swish in the directory
   /usr/local/httpd/swish/sources. The name of the database you've just
   created is index.swish.
_________________________________________________________________

Configuration file options

   You can specify variables and values in the configuration file by
   typing the variable name (it's not case sensitive), a space (tabs are
   OK), and the value you want for the variable. If the value has spaces,
   you can enclose it in quotes to keep the spaces.
   If you want to specify multiple values, separate the values with a
   single space. In the configuration file, lines beginning with a hash
   mark (#) and blank lines are ignored.

BASIC INDEX VARIABLES

     * IndexDir directory
       The IndexDir variable tells swish what directory to index. This
       means that all files under that directory (and files in
       subdirectories) will be indexed. You can't specify a filename or
       more than one directory. This can be a full pathname or a simple
       directory name.
     * IndexFile indexfile
       The IndexFile variable tells swish what to save the indexed
       results as. Indexes generated by swish should have a suffix of
       .swish.
     * IndexOnly .suffix1 .suffix2 .suffix3 ...
       Only files with these suffixes will be indexed. If you omit this
       variable, swish will index every file it comes across.
     * IndexVerbose value
       This variable can have the values yes or no. If you specify yes,
       swish will tell you what's going on while it's indexing, printing
       out directory and file names, number of words indexed, and so on.
     * NoContents .suffix1 .suffix2 .suffix3 ...
       This variable lets you control what files will have their contents
       indexed. If a file with a suffix in this list is indexed, only its
       file name (and not any words in the file) will be indexed. This is
       useful because normally swish will try to index the contents of
       every file, even files without words (such as images or movies).

USING REPLACERULES

   When results are returned from swish searches, you may get a bunch of
   funny pathnames to files that you can't access. Using ReplaceRules,
   you can specify a series of operations to perform on the pathname
   result to change it into a URL and other things if you desire.

   There are three operations you can specify: replace, append, and
   prepend. They are applied to the pathname in the order you've typed
   these commands. More than one command and its arguments can appear on
   the same line, but it's easier to read when commands are broken up
   over a few lines.
   You can't put a command and its argument(s) on different lines,
   however. Here's the syntax:

      replace "the string you want replaced" "what to change it to"

   This replaces all occurrences of the old string with the new one.

      prepend "a string to add before the result"
      append "a string to add after the result"

   Study the above sample configuration file and try things out. You'll
   find that by having swish return URLs instead of pathnames, you can
   create interfaces to swish that can allow users to get to the search
   results over the World-Wide Web.

USING FILE NAME RULES

   You can specify certain file directives in the configuration file -
   any files or directories matching these criteria will be ignored and
   will not be indexed:

     * pathname contains string1 string2 string3 ...
       Any path names containing these strings, whether they be paths to
       directories or paths to files, will be ignored. Using this you can
       avoid indexing temporary directories or private material.
     * filename is filename
       Any file name exactly matching the specified file name will be
       ignored (this is case-sensitive). This cannot be a path.
     * filename contains string1 string2 string3 ...
       Any file name containing these strings will be ignored (this is
       not case-sensitive). This cannot be a path.
     * title contains string1 string2 string3 ...
       Any HTML file with a title that contains these strings will be
       ignored (this is case-insensitive).
_________________________________________________________________

Usage

  usage: swish -w "word1 word2 ..." [-m num] [-i dir] [-c conf] -f indexfile -v -V
options:
         -w : perform a search with words "word1 word2 ..."
         -m : the maximum number of results to return
         -i : create an index from the files in directory
         -c : configuration file to use for indexing
         -f : index file to create or search from
         -v : turns on verbose indexing
         -V : prints the current version
version: 1.0
   docs: http://www.eit.com/software/swish/swish.html

   To see the usage, run swish with a -z or -? option.
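
   If you want to call swish from another program, the usage above and
   the result format described under "Searching with SWISH" are all you
   need. The Python sketch below is not part of the SWISH distribution;
   it is only an illustration. The names Hit, build_search_command, and
   parse_swish_output are made up for this example, and the parser
   assumes the header, the search words, and each result sit on their own
   lines, as in the sample output. Placing -w last on the command line is
   a cautious choice modeled on the documented example, not a stated
   requirement.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    """One result line (hypothetical name, not defined by SWISH)."""
    rank: int   # relevance rank, documented as ranging from 1 to 1000
    path: str   # path name or URL of the file
    title: str  # HTML title, or the file name if there is no title
    size: int   # file size in bytes


# A result line is: rank, path, quoted title, size.
RESULT_LINE = re.compile(r'^(\d+)\s+(.+?)\s+"(.*)"\s+(\d+)$')


def build_search_command(index_file, words, max_results=None):
    """Build an argv list from the documented -f, -m and -w options."""
    argv = ["swish", "-f", index_file]
    if max_results is not None:
        argv += ["-m", str(max_results)]
    # Assumption: keep -w last, since it takes a variable number of words.
    argv += ["-w", *words]
    return argv


def parse_swish_output(text):
    """Return (hits, error); error is the text after 'err:' or None."""
    hits = []
    error = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == ".":  # a line with a period ends swish output
            break
        if line.startswith("err:"):
            error = line[len("err:"):].strip()
            continue
        # Skip blank lines, the format header, and the search words line.
        if not line or line.startswith("#") or line.startswith("search words:"):
            continue
        match = RESULT_LINE.match(line)
        if match:
            rank, path, title, size = match.groups()
            hits.append(Hit(int(rank), path, title, int(size)))
    return hits, error


# The sample output from "Searching with SWISH".
SAMPLE = """\
# SWISH format 1.0
search words: internet and resources and archie
1000 http://www.eit.com/web/www.guide/guide.15.html "Guide to Cyberspace 6.1: Index/Glossary" 11566
500 http://www.eit.com/web/netservices.html "Internet Resources List" 48391
.
"""

if __name__ == "__main__":
    hits, error = parse_swish_output(SAMPLE)
    for hit in hits:
        print(hit.rank, hit.path, hit.size)
```

   The list returned by build_search_command can be handed to something
   like Python's subprocess.run to launch swish, and the captured output
   passed to parse_swish_output. The path is matched non-greedily and the
   title greedily, so a title that itself contains quote marks should
   still parse.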
_________________________________________________________________

Command-line options

-w word1 word2 ... (search words)

   This performs a case-insensitive search using a number of keywords. If
   no index file to search is specified, swish will try to search a file
   called index.swish in the current directory. You don't need to put
   quotes around search words.

   You can use the booleans "and" and "or" in searching. Without these
   booleans, swish will assume you're anding the words together.
   Evaluation takes place from left to right only - parentheses will be
   ignored.

      example 1: swish -w john and doe OR jane
      example 2: swish -w john or (doe and jane)
      example 3: swish -w john doe and jane
      example 4: swish -w john doe jane

    1. This search evaluates the expression from left to right.
    2. This search will also be evaluated from left to right, despite the
       parentheses, which will be ignored. It is therefore treated as
       (john or doe) and jane.
    3. This is equivalent to john and doe and jane.
    4. This is equivalent to john and doe and jane.

-m number (number of results)

   While searching, this specifies the maximum number of results to
   return.

-i directory (directory to index)

   This specifies the directory to index. All files in it and in its
   subdirectories will be indexed.

-c configfile (configuration file)

   This specifies the configuration file to use for indexing. You can use
   this as the only option to swish to do automatic indexing, if all the
   necessary variables are set in the configuration file. If you specify
   a directory to index, an index file, or the verbose option on the
   command line, these values will override any specified in the
   configuration file.

      example 1: swish -c swish.conf
      example 2: swish -i /usr/local/www -f index.swish -v -c swish.conf

    1. The settings in the configuration file will be used to index a
       site.
    2. These command-line options will override anything in the
       configuration file.

-f indexfile (index file)

   If you are indexing, this specifies the file to save the generated
   index in.
   If you are searching, this specifies the index file to search from.
   The default index file is index.swish in the current directory.

-v, -V (verbose and version options)

   The -v option tells swish to print out a progress report as it's
   indexing files - directory and file names are printed, and you get the
   number of words indexed for each file as well as the total number of
   files and words indexed. The -V option makes swish print its version
   number.
_________________________________________________________________

That's it!

   As always, patches, improvements, suggestions, and corrections are
   gratefully accepted. Send 'em all to Kevin Hughes at kevinh@eit.com.
_________________________________________________________________

Last update: 11/4/94