UIUC DLI Testbed Database & Client Technologies

The Testbed Team has devoted considerable effort to evaluating the available SGML full-text database management and retrieval software packages. After careful study, the team chose the Open Text Corporation's Open Text Index search engine for indexing and accessing the DLI Project documents. The Open Text engine, originally developed at the University of Waterloo, is a robust and expandable system that supports phrase, Boolean, and proximity searching and is tailored to SGML processing and retrieval. The Open Text engine provides us with:

  1. efficient phrase indexing (through its use of Patricia Tree index structures) and the capability of hierarchical searching within SGML tags;
  2. the scalability to index large document stores (the Open Text Web Search Server indexes over one million Web sites and fifteen million links);
  3. a parallel execution monitor that supports simultaneous searching of multiple Open Text databases and merges the search results into a single result set;
  4. the availability of system API (Application Program Interface) hooks which allow the construction of client front-ends to the Open Text engine;
  5. the capability of normalizing, or aliasing, the heterogeneous publisher DTDs and SGML to allow us to perform, with a consistent search argument, searches across multiple DTDs and databases;
  6. the ability to define, extract, and store document-level metadata to provide more effective search/retrieval (with tags normalized and non-searchable characters removed) and short-entry display;
  7. the ability to utilize the metadata information in combination with the index files to fashion a document retrieval system that does not require the storage of the full-text and images on-site;
  8. the ability to utilize a pre-index filter to screen out non-significant characters and strings (from a search and retrieval perspective), while preserving correct byte offset pointers to all occurrences of the search words within the article.
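
Point 8 can be illustrated with a minimal sketch. This is not the Open Text filter itself; the tokenizer, the regular expressions, and the function name `index_words` are assumptions made for illustration. The idea shown is that markup is overwritten with spaces of equal length rather than deleted, so every recorded offset still points into the original document.

```python
import re

# Hypothetical rules: words are runs of ASCII letters or digits,
# and anything between "<" and ">" is treated as non-searchable markup.
WORD = re.compile(rb"[A-Za-z0-9]+")
TAG = re.compile(rb"<[^>]*>")

def index_words(raw):
    """Map each lowercased word to its byte offsets in the ORIGINAL document."""
    # Overwrite markup with the same number of spaces so offsets do not shift.
    masked = TAG.sub(lambda m: b" " * len(m.group()), raw)
    index = {}
    for m in WORD.finditer(masked):
        word = m.group().decode("ascii").lower()
        index.setdefault(word, []).append(m.start())
    return index

doc = b"<p>Digital <b>library</b> testbed; digital index.</p>"
idx = index_words(doc)
print(idx["digital"])   # offsets of both occurrences in the unfiltered bytes
print(idx["library"])
```

Because the filtered text has the same length as the input, slicing the original bytes at a stored offset returns the indexed word, and tag names such as `p` and `b` never enter the index.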

Following from points 5), 6), and 7) above, the Open Text software provides us with a framework for modeling a federated system of distributed publisher document repositories, in which individual publishers can mount and maintain their separate document repositories based on their specific DTD. In this model, any designated subset of these discrete repositories can be searched, and documents then retrieved, from an intelligent client based on expressed user needs and information requirements. Our expectation is that the client gateway function, which will link user information needs (via comprehensive indexing stores) to documents in designated repositories, will be an important component in retrieval across the Web. Retrieval enhancement techniques, such as vocabulary switching and co-occurrence table lists, would be placed on the client and indexing-store side of the process.
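
The federated model can be sketched as follows. This is an illustration under stated assumptions, not the Open Text API: the publisher names, the tag names (`ATL`, `ARTTITLE`, and so on), and the toy records are all hypothetical. Each repository keeps its own DTD-specific tags, an alias table maps one normalized field name onto each of them, and a thread pool stands in for the parallel execution monitor that searches the selected repositories simultaneously and merges the hits.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical alias tables: one normalized field name per publisher-specific tag.
ALIASES = {
    "publisher_a": {"title": "ATL", "author": "AU"},
    "publisher_b": {"title": "ARTTITLE", "author": "AUTHNAME"},
}

# Toy repositories; each record uses that publisher's own tag names.
REPOSITORIES = {
    "publisher_a": [
        {"ATL": "Patricia trees in practice", "AU": "Smith"},
        {"ATL": "Boolean retrieval revisited", "AU": "Lee"},
    ],
    "publisher_b": [
        {"ARTTITLE": "Indexing SGML with Patricia trees", "AUTHNAME": "Jones"},
    ],
}

def search_one(repo, field, term):
    """Search a single repository using its own tag for the normalized field."""
    tag = ALIASES[repo][field]
    return [
        (repo, doc_id, record[tag])
        for doc_id, record in enumerate(REPOSITORIES[repo])
        if term.lower() in record.get(tag, "").lower()
    ]

def federated_search(field, term, repos):
    """Query the designated repositories concurrently and merge the hits."""
    with ThreadPoolExecutor() as pool:
        per_repo = pool.map(lambda r: search_one(r, field, term), repos)
        return [hit for hits in per_repo for hit in hits]

hits = federated_search("title", "patricia", ["publisher_a", "publisher_b"])
for hit in hits:
    print(hit)
```

The caller supplies a single search argument (`title`), and the subset of repositories to consult is a parameter, which mirrors the idea of selecting any designated subset of publisher repositories from the client.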

The specifications for metadata generation and normalizing the seven heterogeneous publisher DTDs have been written, tested, and deployed. Our metadata objects reflect what we regard as critical retrieval and display needs, but will be carefully tested in an end-user searching environment. Recent studies by the American Physical Society, Elsevier's Project TULIP review team, and our DLI Evaluation group support the usefulness of the metadata elements we have selected. We will continue to modify our metadata specification based on user feedback and comments from the professional community.

All of the Open Text software modules have been tested, including the parallel execution monitor software used to search across multiple databases. The first journals are available for public access, and we expect to see all publishers in production within the next several months.

To study access techniques and retrieval effectiveness over these full-text journals, we have designed and implemented a prototype client written in Visual Basic 3.0 operating in a Microsoft Windows environment. This custom front-end has been designed, from the outset, as a demonstration system to study full-text retrieval and explore functions that pose problems in a Web environment. The present limitations in Web search capability include the inability to maintain state or hold open the connection to the database, the difficulties in dynamically updating forms, and bandwidth limitations that preclude dynamically updating word wheels and the like. In addition, the prototype DLI client is part of an overarching gateway client that provides access to remote and local information resources from a public workstation. Indeed, the integration of A&I service databases, online catalogs, locally mounted and remote periodical index databases, campus and Library maintained databases, and the full-text DLI Project data is an integral part of the UIUC comprehensive digital library system. Users typically want relevant information from multiple, sometimes disparate, resources, often in multiple formats. In particular, the linking of A&I service databases with full-text document stores is an area that demands attention and will be investigated in this project.

The custom client utilizes a TCP/IP connection to our Open Text and Ovid servers and employs both OLE (Object Linking and Embedding) and DDE (Dynamic Data Exchange) methods to converse with the SoftQuad Panorama and Netscape Navigator display and rendering applications. The actual SGML documents are stored on an HTTP server and are retrieved and displayed through the Panorama/Netscape CCI (Command Interface) linkage.

The custom client has been designed to serve as a rapid prototyping platform for exploring techniques to facilitate full-text document retrieval. The client employs intelligent multimedia interface techniques, such as voice synthesis, demonstration searches using successive screen capture and voice-over instruction, and full-motion video, to provide context-specific help and assistance. The interface is designed to assist end-users with search strategy formulation and navigation through the search process. The features implemented in the initial version of the custom client include:

  1. Author search for articles written by an author, articles citing a particular author, articles that mention an author's name in the body of the text, or any combination of the above;
  2. Author query-by-example entry form;
  3. Search shortcuts for limiting to Key Fields (Title, Section Title, Abstract), Title only, and Bibliography information only;
  4. Capability of specifying at the search term level the fields to be searched;
  5. Select any combination of the following fields for searching: Full-Text, Title, Section Title, Abstract, Figure Caption, Table Caption, Table Text, Body of the article, Organization, Author name, Cited Author name, Cited Article Title, Cited Journal Title;
  6. Search Tree modification mechanisms in which the user is given the on-the-fly option of expanding a search term result (e.g., by going from exact phrase match to searching for component words in the same paragraph) or limiting a search term result (with the number of hits threshold values modifiable);
  7. Scrolling Boolean input form with indications for adding synonyms or related terms and modifying previous results with AND or OR operators;
  8. Stemming or Exact Match searching;
  9. Multiword phrase proximity searching that allows exact matches, within same paragraph, within 'n' words of each other, and words appearing anywhere within the document;
  10. Short entry display including Author, Title, Source, Abstract, and optional Bibliography, Figure/Table Captions, and Author Affiliations with print and download capabilities;
  11. Full-text SGML rendering via SoftQuad Panorama viewer;
  12. Links to locally mounted Engineering Index database (1987--), Open Text Web Index via the Netscape Web browser, and locally mounted databases from within the DLI client.
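
Features 6 and 9 can be sketched together. The code below is an illustrative assumption, not the client's Visual Basic implementation: the function names, the mode names, and the tokenizer are hypothetical. It shows the four proximity levels (exact phrase, within 'n' words, same paragraph, anywhere in the document) and a simple expansion step that moves to the next broader level when a narrower one finds nothing.

```python
import re
from itertools import product

def tokens(text):
    """Lowercase word tokens; a stand-in for the engine's own tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(paragraphs, terms, mode, n=0):
    """Test one proximity level against a document given as a list of paragraphs."""
    terms = [t.lower() for t in terms]
    per_par = [tokens(p) for p in paragraphs]
    if mode == "document":
        all_words = {w for words in per_par for w in words}
        return all(t in all_words for t in terms)
    for words in per_par:
        if mode == "paragraph" and all(t in words for t in terms):
            return True
        if mode == "exact":
            k = len(terms)
            if any(words[i:i + k] == terms for i in range(len(words) - k + 1)):
                return True
        if mode == "near":
            # One position list per term; a hit needs one position from each
            # list with the whole group spanning at most n words.
            positions = [[i for i, w in enumerate(words) if w == t] for t in terms]
            if any(max(c) - min(c) <= n for c in product(*positions)):
                return True
    return False

def broaden(paragraphs, terms, n=3):
    """Return the narrowest level that matches, or None if nothing does."""
    for mode in ("exact", "near", "paragraph", "document"):
        if matches(paragraphs, terms, mode, n):
            return mode
    return None

article = [
    "Full-text retrieval of SGML documents is hard.",
    "Digital library users want retrieval across many documents.",
]
print(broaden(article, ["digital", "library"]))
print(broaden(article, ["library", "retrieval"]))
print(broaden(article, ["sgml", "library"]))
```

In the client, the user chooses whether to expand or limit a result; `broaden` automates only the expansion direction to keep the sketch short.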

One of the primary functions of the prototype client is to demonstrate interface and retrieval technologies that can be ported to the Web environment.

The Open Text engine does provide a word index generation capability, but because of its underlying phrase indexing structure, it does not provide the browsing and display of headings common to systems employing an inverted file indexing structure. To get around this problem, we have exported the word tables generated within Open Text into a Microsoft SQL Server database structure. These "word wheels" can then be used to show users a letter-by-letter match against the search argument as they enter it. This commonly used word wheel approach has been shown to reduce user spelling errors and to suggest alternate word forms that enhance retrieval.
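
A word wheel lookup of this kind can be sketched as a range query over a sorted word table. The example below is an assumption-laden illustration: SQLite (from the Python standard library) stands in for Microsoft SQL Server, and the table name, column names, and hit counts are invented toy data rather than the project's schema.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL Server word table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_wheel (word TEXT PRIMARY KEY, hits INTEGER)")
conn.executemany(
    "INSERT INTO word_wheel VALUES (?, ?)",
    [
        ("index", 120), ("indexing", 45), ("indices", 12),
        ("inverted", 8), ("retrieval", 97), ("sgml", 60),
    ],
)

def wheel(typed, size=3):
    """Return the next `size` index entries at or after what the user has typed."""
    rows = conn.execute(
        "SELECT word, hits FROM word_wheel WHERE word >= ? ORDER BY word LIMIT ?",
        (typed.lower(), size),
    )
    return rows.fetchall()

# Re-query after each keystroke to move the wheel letter by letter.
for typed in ("i", "ind", "indi"):
    print(typed, wheel(typed))
```

Each keystroke narrows the starting point in the sorted list, so the user sees the nearest indexed forms (for example `indices` after typing `indi`) before submitting the search.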
