Overview
The UIUC DLI testbed effort is focusing on designing and building a large-scale
multi-journal, multi-publisher production testbed of scientific journal
articles. This testbed is centered in the Grainger Engineering Library Information
Center, a $30 million facility which opened in March 1994 with the primary
charge of investigating emerging information technologies. The initial testbed
collection has been introduced on Grainger Library public workstations and
is in the process of being expanded into selected College of Engineering
Laboratories and other departmental libraries. To build the collection of
full-text electronic documents, collaborative agreements have been reached
with a number of major publishers in engineering and science to provide
the Project with materials prior to the articles reaching the print stage.
Each electronic article contains the complete text, graphics, images, tables,
and equations in SGML (Standard Generalized Markup Language) format, with
publisher-defined DTDs (Document Type Definitions) that define the document
tags that delineate the document structure (e.g. title, author surname,
section titles, figures, tables, equations, bibliography citations).

The testbed collection will, in the course of the four-year project, grow to
contain all of the articles published in approximately thirty major journals
(from 1995 forward) in the areas of computer science, electrical engineering,
physics, civil engineering, and aerospace engineering. There are approximately
3,000 articles presently available for searching and display to users at
public terminals in the Grainger Library. The publisher partners supplying
the Project with articles presently include: the IEEE Computer Society,
the IEEE (Institute of Electrical and Electronics Engineers), the American
Physical Society (APS), the American Institute of Physics (AIP), the American
Society of Civil Engineers (ASCE), the IEE (Institution of Electrical Engineers),
the American Society of Agricultural Engineers (ASAE), and the American
Institute of Aeronautics and Astronautics (AIAA). Additional commitments
have been obtained from several other professional societies, e.g. the American
Association for the Advancement of Science (AAAS - Science), and from commercial
publishers, e.g. Academic Press and John Wiley & Sons, to supply
us with articles in SGML format.

The DLI Testbed Team is studying the short-term
and long-term technical and policy issues connected with the procurement,
processing, indexing, searching, retrieval, transmission, and display of
full-text scientific articles. The testbed work is meant to provide a design
and methodological framework for the development of a model for providing
effective access to a federated system of World Wide Web-based distributed
document repositories that will contain the scientific journal literature
in electronic format. The primary focus of the Testbed Team is on developing
and testing a prototype system that can be used to study the computing and
information infrastructures necessary for the federated system.

In addition,
the project is focusing on mechanisms to provide intelligent front-end and
gateway functions to enable end-users to seamlessly access a host of heterogeneous
information resources. Indeed, one of the primary requirements for a functioning
digital library system is the ability to link and integrate a host of heterogeneous
information resources, in all their multiple formats, which will include
online catalog collections, full-text repositories, Web and Internet resources,
digital images, digital video, A&I (abstracting and indexing) document surrogate records, and a
myriad of locally maintained databases.

The Testbed Team is focusing on these
issues within the context of the present publisher-centered article submission,
editorial, review, and production environment. Our investigations are focusing
on enhancing a publishing environment in which the major change in emphasis
will be the conversion to Internet-based organizational, retrieval, and
distribution mechanisms for journal articles. However, our model of a federated
system of distributed document depositories will apply in any publishing
environment comprised of registered document depositories. The model uses
SGML as the standard for open document retrieval and delivery by assuming
that the full-text journal articles are made available in SGML format with
a standard Document Type Definition (DTD) specifying the form and content
of the SGML.

The primary goal of the testbed component of the DLI project
is to process, index, search, retrieve, and display full-text SGML articles
supplied to us by major publishers of engineering and physics journals.
In the past year, testbed activities have centered around:
1) streamlining document procurement, transfer, and processing procedures,
including writing software to convert publisher SGML/DTDs into database
load forms;
2) analyzing publisher SGML text and accompanying DTDs to construct a normalized set of metadata elements that will facilitate retrieval and display, in particular retrieval across the heterogeneous DTDs;
3) examining and testing the available full-text SGML database management
systems for applicability to the DLI project;
4) exploring options for SGML on-screen rendering;
5) designing and developing a customized Windows client
for search, retrieval, and display, including tailoring the client to Open
Text and SQL database management systems.
In particular, testbed activities have included:
setting up document servers for the APS, AIP, IEEE Computer Society, IEEE, IEE, and ASCE full-text SGML;
enhancing software procedures for more efficient SGML processing (in particular, to capture images, handle entities, and add necessary metadata to each document);
enhancing software that normalizes the heterogeneous publishers' DTDs for cross-publisher indexing, search, and retrieval;
modifying SGML indexing software (utilizing the ability of SGML to indicate the content and structure of a document) to insert the data into Open Text and SQL database structures;
writing communications software that allows the development of graphical user interfaces to the Open Text database engine;
testing the efficacy of the SQL database structure in producing word wheels for letter-by-letter search term entry;
designing and developing a custom Windows interface client to retrieve and display text from an Open Text database;
exploring search tree searching and retrieval models;
working with SoftQuad and the publisher partners to develop techniques and strategies to more accurately render mathematics and equations within SoftQuad's Panorama SGML viewer;
constructing style sheets for the SoftQuad Panorama viewer that are consistent with the published articles.
Personnel
The Testbed Team is headed by William Mischo, who is responsible for database and interface design and software, and Timothy Cole, who is primarily responsible for database processing and indexing software. In addition, two full-time research programmers--Robert Ferrer and Maria Pflaum--are responsible for SGML text processing, database loading, and interface development. Several one-half time Graduate Assistants--Donal O'Connor, Zing Zhao, John Isenhour--have assisted in text processing and database production work. Susan Harum is the Project Coordinator for External Relations and assists in SGML viewer style sheet design and other production needs. Mary Schlembach also assists in top-level interface design and the integration of DLI software into Grainger front-end software.
SGML
SGML is regarded as the standard for open document transmission and display. The far-reaching value of SGML lies in its capability to identify the fine-grained content and structure of a document. This allows sophisticated indexing and retrieval of documents, which is a necessity in a full-text retrieval environment. While SGML is becoming ubiquitous, it is still, for the most part, being generated by publishers as a byproduct, rather than an integral part, of their production process. For example, in many cases, we have been the first to actually display the SGML version of the published articles. For the display of the full-text articles, the DLI testbed team has been working with SoftQuad in the testing and evaluation of their Panorama SGML viewer. The Panorama viewer has been implemented in an Internet environment with Netscape Navigator using the CCI (Common Client Interface) capability. This configuration provides a means to access DLI testbed documents over the Internet using HTTP protocols with Panorama and a Web browser. Work continues with SoftQuad on the many configuration and style issues that arise when using Panorama for a heterogeneous collection of science-oriented documents. There have also been some major concerns with Panorama's ability to properly render mathematical equations and formulas. In particular, there have been problems with rendering special characters and diacritics; the alignment and kerning within formulas and equations; and fraction and matrix bar and overbar lengths. These have proven to be difficult problems and a number of solutions are being pursued, including rendering equations in MathType or TeX, and moving to PDF (Portable Document Format) for display purposes. 
While it is clear that SGML will play a major role in providing mechanisms for effective full-text indexing and retrieval, the Testbed Team continues to investigate the efficacy of SGML as a rendering and display tool, particularly in the scientific and engineering disciplines that rely heavily on mathematical equations. In these areas, current implementations of SGML viewers and renderers exhibit shortcomings, and it is not clear whether these are simply the inevitable "growing pains" of a new technology (akin to what TeX went through some 15 years ago) or symptomatic of a real problem.
Here is what we have learned. As the Project completes/enters year two, a number of issues have emerged and a number of critical questions have arisen as technical issues have been addressed. More robust Web database, multimedia, and search & retrieval technologies are needed. These technologies are being developed, and it is anticipated that the work being performed by Sun with Java, Netscape with plug-ins, and Microsoft with Internet-aware applications and Object Linking and Embedding will move forward the state of the art in Web information retrieval.
SGML is an extremely powerful mechanism for indexing and retrieval. Our chosen commercial search engine for SGML from Open Text has proven to be scalable and robust. However, the use of metadata (information either extracted from the document or added to the document proper) for normalization for retrieval and display is extremely important. Normalizing the heterogeneous SGML is necessary.
In the first phases of this project, we have developed procedures for generating collections of SGML materials. For the SGML itself, these include document gathering (file transfer and format conversion) and tag processing (handling entities and adding metadata). In addition, two attached files are needed to manipulate an SGML document effectively. The DTD (Document Type Definition) describes the tags used within the text to mark up the structural sections, e.g. begin/end chapter. The structure is independent of any actual display of the document. The style sheet provides a logical-to-physical mapping for the tags, including font and layout information, e.g. chapters are displayed in Times 18 bold centered. Effectively handling these attached files for scientific literature has provided challenging problems for the testbed collection.
The heterogeneous SGML received from the publishers must be processed to produce a federated repository of structured documents from the scientific literature. The tags differ from publisher to publisher. Some of these differences can be federated with simple syntactic transformations, such as AU or AUT or AUTHOR for the author tag. But others reflect semantic differences and conventions. Author tags are particularly messy: every publisher has several such tags and the tags differ across publishers, yet the user merely wants to issue a query for "author". We have settled on an extension of the ISO 12083 Article DTD standard for the canonical form DTD used by the project. There is a set of standard canonical tags that are used for indexing and retrieval. We have written heuristic software for each DTD that maps the publisher tags into our canonical set. This tag normalization is our approach to structure federation.
Note that SGML is not a prescriptive code for comprehensively describing a document; it is not akin to the AACR2 code for description and form of entries. Rather, it is a flexible template for marking the content and structure of a document based on publisher needs and conventions. There needs to be general agreement around the ISO 12083 SGML standard and, additionally, a commitment by publishers to follow certain conventions for author names, author affiliations, and bibliography entries.
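A minimal sketch of the purely syntactic part of this tag normalization follows, in Python. The publisher names, the tag tables, and the canonical names `author` and `title` are hypothetical placeholders; this report does not list the actual publisher tags or the project's canonical set, and the semantic differences noted above would need per-DTD heuristics that go beyond a lookup table.

```python
import re

# Hypothetical publisher-to-canonical tag tables (illustration only; the
# real publisher DTDs and canonical tag names are not given in this report).
TAG_MAPS = {
    "publisher_a": {"AU": "author", "TI": "title"},
    "publisher_b": {"AUT": "author", "ATL": "title"},
    "publisher_c": {"AUTHOR": "author", "ARTTITLE": "title"},
}

# Matches an SGML start or end tag; captures the optional slash, the tag
# name, and whatever follows the name (attributes, if any).
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)([^<>]*)>")


def normalize_tags(sgml, publisher):
    """Rename publisher tags to canonical tags; leave unknown tags as they are."""
    mapping = TAG_MAPS[publisher]

    def rename(match):
        slash, name, rest = match.groups()
        # Tag names are compared case-insensitively.
        canonical = mapping.get(name.upper(), name)
        return f"<{slash}{canonical}{rest}>"

    return TAG_RE.sub(rename, sgml)
```

With these assumed tables, `normalize_tags("<AU>J. Smith</AU>", "publisher_a")` and `normalize_tags("<AUT>J. Smith</AUT>", "publisher_b")` both produce `<author>J. Smith</author>`, so one query against the canonical author tag can cover every source.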
Database Management Software
The Testbed Team and other Project staff carefully evaluated the available
SGML full-text database management and retrieval software packages. Representatives
from EBT (Electronic Book Technologies), Open Text Corporation, Dataware/BRS,
OCLC, and Ovid Technologies made presentations to the Team. The Project
received the EBT software suite as part of a University Grant program. After
careful study, the testbed team chose the Open Text Corporation's Open Text
Index software for the Testbed production database management system. The
Open Text engine, originally developed at the University of Waterloo, is
an extremely robust and expandable full-text search engine that allows phrase,
Boolean, and proximity searching and is tailored to SGML processing and
retrieval. The Open Text engine provides:
1) efficient phrase indexing (through its use of Patricia Tree index structures)
and the capability of hierarchical searching within SGML tags;
2) the scalability to index large document stores (the Open Text Web Search Server indexes
over one million Web sites and fifteen million links);
3) a parallel execution monitor that supports combined simultaneous searching of multiple Open Text
databases and combines search results into a single result;
4) the availability of system API (Application Program Interface) hooks which allow the construction
of client front-ends to the Open Text engine;
5) the capability of normalizing, or aliasing, the heterogeneous publisher DTDs and SGML to allow us to perform,
with a consistent search argument, searches across multiple DTDs and databases;
6) the ability to define, extract, and store document-level metadata to provide
more effective search/retrieval (with tags normalized and non-searchable
characters removed) and short-entry display;
7) the ability to utilize the metadata information in combination with the index files to fashion a document
retrieval system that does not require the storage of the full-text and
images on-site; and
8) the ability to utilize a pre-index filter to screen
out non-significant characters and strings (from a search and retrieval
perspective), while preserving correct pointers to all search hits.
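To make point 3) concrete, the following Python sketch searches several stand-in databases concurrently and merges the hits into one ranked list. It is an illustration under stated assumptions and not the Open Text parallel execution monitor: the in-memory `DATABASES` dictionary, the word-count scoring, and the function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in "databases": each maps a document id to its text (made-up data).
DATABASES = {
    "aip": {"aip-1": "laser diode laser", "aip-2": "thin film growth"},
    "ieee": {"ieee-1": "laser printer driver", "ieee-2": "parallel laser laser laser"},
}


def search_one(db_name, term):
    """Return (db_name, doc_id, hit_count) for each document containing term."""
    results = []
    for doc_id, text in DATABASES[db_name].items():
        hits = text.split().count(term)
        if hits:
            results.append((db_name, doc_id, hits))
    return results


def search_all(term, db_names):
    """Search every named database concurrently and merge into one result list."""
    with ThreadPoolExecutor() as pool:
        per_db = list(pool.map(lambda name: search_one(name, term), db_names))
    merged = [row for rows in per_db for row in rows]
    # Rank by hit count (descending), then by document id for a stable order.
    merged.sort(key=lambda row: (-row[2], row[1]))
    return merged
```

Calling `search_all("laser", ["aip", "ieee"])` returns a single list in which documents from both stand-in databases are interleaved by hit count, which is the behavior a client would want from a combined search.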
Following from points 5), 6) and 7) above, the Open Text software provides
us with a framework for modeling a federated system of distributed publisher
document repositories, in which individual publishers can mount and maintain
their separate document repositories based on their specific DTD. In this
model, any designated subset of these discrete repositories can be accessed
and documents then retrieved from an intelligent client based on expressed
user needs and information requirements. Our expectation is that the client
gateway function which will link user information needs--via comprehensive
indexing stores--to documents in designated repositories will be an important
component in retrieval across the Web. The retrieval enhancement techniques,
such as vocabulary switching and co-occurrence table lists, would be placed
in the context of the client-indexing stores side of the process.

At this
point, the specifications for metadata generation and normalizing the seven
heterogeneous publisher DTDs have been written and tested. Our metadata
objects reflect what we regard as critical retrieval and display needs,
but will be carefully tested in an end-user searching environment. We expect
to modify our metadata specification based on user feedback and comments
from the professional community. All of the Open Text software modules have
been tested, including the parallel execution monitor software used to search
across multiple databases. The first journals are available for public access,
and we expect to see all publishers in production within the next several
months. The Open Text engine does provide a word index generation capability,
but because of its underlying phrase indexing structure, it does not provide
the browsing and display of headings common to systems employing an inverted
file indexing structure. To get around this problem, we have exported the
word tables generated within Open Text into a Microsoft SQL Server database
structure. These "word wheels" can then be used to display for
users a letter-by-letter match with user-entered search arguments. This
commonly used word wheel approach has been proven to minimize user spelling
errors and suggest alternate word forms that enhance retrieval.
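The letter-by-letter matching behind a word wheel can be sketched as a prefix lookup over a sorted word list. The Python example below uses an in-memory list and the standard `bisect` module in place of the SQL Server word tables described above; the function name, the sample vocabulary, and the `limit` parameter are assumptions made for illustration.

```python
import bisect


def word_wheel(sorted_words, prefix, limit=5):
    """Return up to `limit` words from a sorted list that start with `prefix`."""
    # Binary search for the first word that is >= the prefix.
    start = bisect.bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:start + limit]:
        if not word.startswith(prefix):
            break  # Words are sorted, so no later word can match.
        matches.append(word)
    return matches


# Made-up index vocabulary; a real wheel would be fed from the exported word tables.
WORDS = sorted(["laser", "lasers", "lasing", "lattice", "latency", "layer"])
```

As the user types "l", "la", "las", the client would call `word_wheel(WORDS, typed_text)` after each keystroke and redisplay the matching headings, which is how alternate word forms such as "laser", "lasers", and "lasing" come into view.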
Custom Client
To study access techniques and retrieval effectiveness in conjunction with these full-text journals, the testbed team has designed and implemented a prototype client-server system with a front-end client written in Visual Basic 3.0 for a Microsoft Windows environment operating over a set of Open Text databases. This custom front-end and server have been designed, from the outset, as a demonstration system to study full-text retrieval and explore functions that pose problems in a Web environment. The present limitations in Web search capability include the inability to maintain state or hold open the connection to the database, difficulties in dynamically updating forms, and bandwidth limitations that preclude dynamically updating word wheels and the like. However, in the past year, Web technology has moved forward dramatically, with developments such as Sun's HotJava, Netscape plug-in technologies, and Microsoft's embedded OCX browser controls. Web browsers have begun to take on the characteristics of Internet operating systems, allowing multimedia, database, and desktop technologies to be embedded or enabled from within the browser. As these Web technologies continue to evolve, we can expect that the shortcomings inherent in database and retrieval applications will be addressed.
In addition, the prototype DLI client is part of an overarching gateway client that provides access to remote and local information resources from a public workstation. Indeed, the integration of A&I service databases, online catalogs, locally mounted and remote periodical index databases, campus and Library maintained databases, and the full-text DLI project data is an integral part of the UIUC comprehensive digital library system. Users typically desire relevant information from multiple, sometimes disparate resources in what are often multiple formats.
In particular, the linking of A&I service databases with full-text document stores is an area that demands attention and will be investigated in this project.
The custom client utilizes a TCP/IP connection to our Open Text and Ovid servers and employs both OLE (Object Linking and Embedding) and DDE (Dynamic Data Exchange) methods to converse with the SoftQuad Panorama and Netscape Navigator display and rendering applications. The actual SGML documents are stored on an HTTP server and are retrieved and displayed through the Panorama/Netscape CCI (Common Client Interface) linkage.
The custom client has been designed to serve as a rapid prototyping platform for exploring techniques to facilitate full-text document retrieval. The client employs intelligent multimedia interface techniques, such as voice synthesis, demonstration searches using successive screen capture and voice-over instruction, and full-motion video, to provide context-specific help and assistance. The interface is designed to assist end-users with search strategy formulation and navigation through the search process.
The features implemented in the initial version of the custom client include:
1) Author search for articles written by an author, citing a particular author, or mentioning an author's name in the body of the text, or any combination of the above;
2) Author query-by-example entry form;
3) Search shortcuts for limiting to Key Fields (Title, Section Title, Abstract), Title only, and Bibliography information only;
4) Capability of specifying at the search term level the fields to be searched;
5) Selection of any combination of the following fields for searching: Full-Text, Title, Section Title, Abstract, Figure Caption, Table Caption, Table Text, Body of the article, Organization, Author name, Cited Author name, Cited Article Title, Cited Journal Title;
6) Search Tree modification mechanisms in which the user is given the on-the-fly option of expanding a search term result (e.g., by going from exact phrase match to searching for component words in the same paragraph) or limiting a search term result (with the number-of-hits threshold values modifiable);
7) Scrolling Boolean input form with indications for adding synonyms or related terms and modifying previous results with AND or OR operators;
8) Stemming or Exact Match searching;
9) Multiword phrase proximity searching that allows exact matches, words within the same paragraph, words within 'n' words of each other, and words appearing anywhere within the document;
10) Short entry display including Author, Title, Source, Abstract, and optional Bibliography, Figure/Table Captions, and Author Affiliations, with print and download capabilities;
11) Full-text SGML rendering via the SoftQuad Panorama viewer;
12) Links to the locally mounted Engineering Index database (1987--), the Open Text Web Index via the Netscape Web browser, and locally mounted databases from within the DLI client.
One of the primary functions of the prototype client is to demonstrate interface and retrieval technologies that can be ported to the Web environment.
Indeed, as Web server and browser software for the Windows operating system expands to accommodate application development feature sets, it is expected that some prototype modules will fold into a Web browser.
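The "within 'n' words of each other" option in the proximity searching feature above can be illustrated with a short Python sketch. It is not the Open Text implementation; the tokenization rule and the reading of "within n words" as a difference of at most n word positions are assumptions made for this example.

```python
import re


def within_n_words(text, term_a, term_b, n):
    """Return True if term_a and term_b occur at most n word positions apart."""
    # Lowercase and split into alphanumeric word tokens.
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    # Compare every pair of occurrences; fine for article-sized texts.
    return any(abs(i - j) <= n for i in positions_a for j in positions_b)


SAMPLE = "The laser was aligned with the optical cavity before measurement."
```

Here `within_n_words(SAMPLE, "laser", "cavity", 6)` is true because the two words are six positions apart, while the same query with `n` set to 3 fails. Widening `n` in steps is one simple way to model the Search Tree idea of expanding a search term result when an exact phrase match returns too few hits.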
Display of Full-Text
The Testbed client software will accommodate full-text display of SGML text and mathematics (through Panorama), TeX mathematics display (converted to bit-mapped images), and PDF display.
As described above, the SoftQuad Panorama viewer is the primary tool being used to render the SGML documents. A specific document with accompanying images, residing on an HTTP server, is selected for display by the user from client-displayed short entry search results. The client invokes the Panorama software, which displays the SGML and images using CCI links to the Netscape browser. As a work-around for the raw SGML rendering problems mentioned above, some publishers have indicated a strong interest in providing complex mathematics, etc. embedded in their SGML in some other format (such as TeX or Windows Metafiles). The Testbed team has been investigating this approach, and software has been adapted that will read TeX embedded in SGML and replace it with links to GIF files. While this approach does seem to have some potential, it can lead to an unwieldy number of GIF links in the final SGML, especially if TeX is used for all math, even simple math that can be rendered properly by SGML rendering systems. The Testbed team has developed software to bundle image files at the server and pass them to Panorama in one burst to reduce the time required for multiple HTTP connections. This has minimized the setup-to-display time required when viewing a full-text article.
The document figures, and sometimes tables, are being provided to the Project in a variety of formats by the publisher partners. We are using the ULEAD image viewer to display these bit-mapped figures and tables.
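The TeX-to-GIF substitution described in this section can be sketched in Python as follows. The sketch covers only the markup rewriting, not the rendering of TeX into images, and every name in it is an assumption: the `<tex>` wrapper element, the `<graphic href=...>` replacement, and the `eqN.gif` naming scheme are hypothetical and are not taken from any publisher DTD or from the project's software.

```python
import re

# Hypothetical markup: assume embedded TeX is wrapped in <tex>...</tex>.
TEX_RE = re.compile(r"<tex>(.*?)</tex>", re.DOTALL)


def replace_tex_with_links(sgml, base_url):
    """Replace each embedded TeX fragment with a link to a (future) GIF file.

    Returns the rewritten SGML and the list of extracted TeX fragments, in
    document order, so a separate step can render fragment k to eq<k>.gif.
    """
    fragments = []

    def to_link(match):
        fragments.append(match.group(1))
        return f'<graphic href="{base_url}/eq{len(fragments)}.gif">'

    return TEX_RE.sub(to_link, sgml), fragments
```

The length of the returned fragment list also gives the number of image fetches an article will need, which is a quick way to spot articles where bundling images at the server matters most.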
Authentication
The authentication of a digital document becomes an issue of great importance in a wide-area networked document retrieval environment centered around the delivery of SGML objects as described in this paper. A user of this system who has retrieved the full-text (with associated images and mathematics) of a document could, if they so desired, modify the document and then pass this altered version on to another user. Indeed, there is a great emphasis in today's document-centric application suite software packages, and will be additionally so in tomorrow's document-centric operating systems, on providing easy-to-apply workgroup tools and environments in which documents can be easily altered. This is a situation markedly different from today's document copying environment in which photocopied articles are passed to colleagues around the room or around the country. Clearly, a system needs to be put in place in which the recipient of a document can be certain that they have in their possession the unaltered, original version, or the author's original version with clearly differentiated annotations from either the author or subsequent reviewers or readers.
There is a pressing need to establish an internationally agreed upon document authentication system that incorporates a unique digital signature for every original source document. This system, as currently envisioned, would support a document repository protocol in which each document would be assigned a digital checksum along with a public encryption key. The checksum and key would then be checked against a designated authentication server on which a copy of the original document would reside. With this system in place, the recipient of any digital document would be able to verify the authenticity of the document and/or obtain an unaltered copy of the original document.
More problematic is the task of consistently indexing and retrieving the heterogeneous SGML generated from the different DTDs used by publishers.
SGML's greatest strength, and, at the same time, its greatest weakness, is its flexibility. This flexibility allows the coding of different tags to represent the same field and differing tag structures to express the same article element. Retrieval across different DTDs must therefore accommodate the differences in tags.
Several solutions to these problems are being explored. They involve normalization of the SGML into a single preferred form DTD; mapping of synonymous tags into a preferred form of the tag; and parallel searches carried out by an intelligent search agent with a merging of result sets.
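Returning to the authentication scheme envisioned earlier in this section, the checksum comparison alone can be sketched in Python as below. This is an assumption-laden illustration, not a specification: the report names no hash algorithm (SHA-256 is our choice here), the in-memory `REGISTRY` merely stands in for the designated authentication server, and the public-key signature part of the proposal is left out entirely.

```python
import hashlib
import hmac

# Stand-in for the authentication server: document id -> checksum of the original.
REGISTRY = {}


def checksum(document):
    """Return a hex digest of the document bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(document).hexdigest()


def register(doc_id, original):
    """Record the checksum of the original document under its identifier."""
    REGISTRY[doc_id] = checksum(original)


def is_authentic(doc_id, received):
    """Return True only if the received bytes match the registered original."""
    expected = REGISTRY.get(doc_id)
    if expected is None:
        return False  # Unknown document: nothing to verify against.
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(expected, checksum(received))
```

A recipient holding a copy of an article would call `is_authentic(doc_id, copy_bytes)`; any altered byte, including an inserted annotation, changes the digest and makes the check fail, at which point the recipient could request the unaltered original from the repository.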
Collaborations
We have also begun sharing information with the Stanford DLI project as part of the DLI Interoperability Experiment, which was the recipient of a DLI supplemental grant. The Stanford DLI personnel have visited us and we have formulated a plan for sharing information about our search engine and interface techniques. This will allow Stanford personnel to write an object wrapper and protocol translation software for the two projects. We also expect to utilize the Stanford proxy server within our interface software to provide access to the remote information resources available through the Stanford server.
The Testbed team has entered into a Memorandum of Understanding with the University of Michigan DLI Interface team to incorporate the Search Tree retrieval techniques first developed by Karen Drabenstott for the ASTUTE Project into the UIUC Testbed interface design. This agreement is expected to foster close cooperation between the Illinois and Michigan Testbed groups, and provide a means for testing the efficacy of various interface designs across several DLI projects.
Testbed personnel have also written an information retrieval system for the Museum Educational Site Licensing Project (MESL) which provides indexing, retrieval, and display of some 7,000 bit-mapped images (and accompanying textual descriptions) from Fowler Museum at UCLA, George Eastman House, Harvard University art museums, Houston Museum of Fine Arts, Library of Congress, National Gallery of Art, and the National Museum of American Art.
Publisher Issues
During the last year, the UIUC DLI Testbed team has conducted two large-scale meetings and several small group information sessions with the publisher partners regarding processing issues. The Primary Publisher Workshop, as budgeted in the grant, was held in May 1995. A hands-on workshop on processing SGML was held for all our partners (at their expense!) on November 16-17. A meeting on August 24 included representatives of the IEEE Computer Society, and a meeting on September 13-14 included representatives from the American Physical Society, the American Institute of Physics, Beacon Graphics, Los Alamos National Laboratory, and the Naval Research Laboratory (the latter three organizations are involved with the APS on related publishing and publication archive projects).
Appendix: Status of Publisher Material in Illinois DLI
AIAA
We have received the text in SGML from the May 95 issue of the AIAA Journal. The figures for the AIAA Journal will be available as TIFF files in January 96. Until then, we are going to scan the figures and insert them into the text files.
AIAA is moving from converting TeX files into SGML to producing the journal in SGML format. We are still working on converting the TeX (embedded math) in AIAA files. This is an issue with several of our publishing partners and has an impact on performance. See IEEE below.
AIP
The processing of AIP files is now at production level (we're receiving and processing materials on a regular schedule). We have processed over 2,000 articles from Applied Physics Letters, from January 2, 1995 to September 4, 1995.
APS
APS will begin sending us more issues of Physical Review Letters. The material will include the backlog of issues from the beginning of 1995.
ASCE
In addition to the Journal of Transportation Engineering, ASCE has sent us new SGML material from the following journals:
Journal of Aerospace Engineering
Journal of Computing in Civil Engineering
Journal of Materials in Civil Engineering
Journal of Construction Engineering and Management
Journal of Performance of Constructed Facilities
We have begun working on their files and establishing procedures for production.
IEEE
The back issues of IEEE Transactions have much TeX embedded in the SGML. We now have a program, written by one of our programmers, that converts the TeX to GIFs, leaving a marker that points to a URL. As would be expected, an article with numerous GIF files to bring over the Web is slow to retrieve, and some of the IEEE Transactions articles have as many as 450 GIFs. Some of our publishers are using SGML for math and using TeX for only the most complex equations.
IEEE Computer Society
In addition to sending us SGML material from Computer, IEEE Software, and IEEE Design and Test, IEEE Computer Society has begun sending us material from the following journals:
IEEE Computational Science and Engineering
IEEE Graphics
IEEE Expert: Intelligent Systems and their Applications
IEEE Micro: Chips, Systems, Software and Applications
IEEE Parallel and Distributed Technologies
