OBJECTIVE
We are developing the technology to effectively handle digital libraries of scientific literature on the Net. Particular attention is paid to federating repositories of structured documents. The technology includes a Testbed of searchable journals obtained in SGML format direct from publishers, Internet software for effective search of this Testbed, and Research to enable semantic federation across repositories in different subject areas. Results will include digital library software and sociological evaluation of its use with hundreds of users searching tens of thousands of documents.
APPROACH
The Testbed efforts include obtaining SGML of journal articles in a direct pipeline from major publishers in engineering and science, then indexing these into a single federated collection, which can be searched and displayed using the structure of the documents. The large-scale Testbed is operational within the context of a major engineering library and integrated with its other online services as a production facility.
The Internet client supports multiple views into term suggestion and full-text search indexes, with a transparent drag-and-drop interface between indexes. The term suggestion indexes are used to locate appropriate terms to search for in the full-text indexes. The suggestors use concept space indexes generated by the Research below. The full-text indexes are network connections to the Testbed above.
The Research efforts are developing automatic indexing technology for the content of documents instead of the structure. In particular, they are investigating vocabulary switching across subject domains in engineering and science, by computing concept spaces on large document collections using supercomputers. The protocols for concept spaces are also being embedded as the fundamental infrastructure of a new network information system, called the Interspace environment, which supports information analysis by correlation across repositories.
The Evaluation efforts study the users and usages of the technology developed above, with a variety of methodological techniques. Usability studies using the developed clients are done on sample populations. Contextual investigations study the total information usage of the user population beyond the text documents supported in the Testbed. Large-scale user studies, when the clients are propagated around university campuses, utilize instrumentation and surveys.
PROGRESS
The Testbed with production facilities is operational in the Grainger Library with continual streams from 5 publishers, currently comprising 18,000 complete SGML articles. The Internet client is operational and being connected to the production sources. The Research has computed concept spaces for 10,000,000 abstracts in 1000 subfields across all of engineering. The Evaluation has performed usability studies on several versions of several clients and interviewed several contextual focus groups.
RECENT ACCOMPLISHMENTS
Testbed: production pipeline from multiple publishers in physics (AIP, APS), civil engineering (ASCE), and electrical engineering (IEE, IEEE CS). All materials processed into canonical SGML and placed into a federated repository currently holding 18,000 articles. Custom client integrates SGML search into all information sources within the Grainger Engineering Library.
Internet: software prototype for multiple view interface developed with integrated displays for term suggestion (manual subject thesaurus and automatic concept space) and for full-text search (bibliographic abstracts, SGML documents).
Research: concept spaces generated from 10,000,000 abstracts (Compendex, Inspec) across 1000 subfields of engineering and integrated into a demonstration system with full-text search and category maps to investigate technology for vocabulary switching.
Evaluation: usability tests done for all 3 clients (Testbed, Internet, Web) and laboratory context studied for several engineering groups.
Publishers: annual Partner Workshop held to build collaborations. Specialty mathematics workshop held to discuss how to handle SGML mathematics. This resulted in a publisher consortium, led by our partners, which funded development by SoftQuad to enhance Panorama for display of equations in journal articles.
PLANS
Testbed: expanded depth and breadth of coverage in collection. This will include 2-3 years of electrical engineering (IEE) and civil engineering (ASCE) journals plus more physics (AIP), computer science (IEEE CS), aerospace engineering (AIAA), and possibly optical science (OSA) and mechanical engineering (ASME). Availability through Ovid server of complete Inspec and Compendex abstracts will supplement breadth of full-text. Low-end Web version will search these repositories from within browser.
Internet: prototype multiple view client will become usable system via connection to production Testbed databases. Term suggestion will include Inspec subject thesaurus. Concept spaces and full-text search will be available for Inspec abstracts (via Z39.50 connection to Ovid server) and for SGML articles (via socket connection to Opentext server). Visual Basic Windows PC version available for limited distribution.
Research: vocabulary switching experiments on documents (concept spaces across subject domains) using large collections generated in engineering disciplines. Multimedia semantic retrieval evaluated on images (textures within maps) linked to text (phrases within documents), in collaboration with UCSB DLI project.
Evaluation: large-scale usage study based on transaction log instrumentation of Web Testbed client. Small-scale usability studies on novel functionality of Internet client. Contextual ethnographic studies of information usage in engineering labs.
Publishers: annual Partner Workshop to discuss collaborations and technology transfer.
TECHNOLOGY TRANSITION
The Testbed repository will become an educational resource for the University of Illinois and other universities in the Big Ten. The University Library at Illinois is committed to continuing this resource as a campus facility beyond the DLI grant. The other CIC universities in the Big Ten have committed to using the Testbed on an experimental basis.
We are collaborating with our publishing partners to help them set up their own SGML repositories. In particular, the AIP (American Institute of Physics) is paying us to clone our software and hardware setup at their home site so they can maintain their own repositories of their own journals. Our publishers have requested much continuing support that we are unable to provide within the confines of our research project on the DLI grant. Thus we have established a private corporation for technology transfer between the DLI and our partners, called IODYNE Digital Library Technologies, which will provide service and software.
SIGNIFICANT EVENT
This past grant period saw the first major experiment in semantic interoperability, which showed that concept-based retrieval might be practical on real collections. Concept spaces for 1000 subfields of engineering were computed on 10,000,000 abstracts using 10 days of supercomputer time, to provide a testbed for investigating vocabulary switching. This represents the largest computation ever in information science, and a major step towards semantic interoperability, the Grand Challenge of Digital Libraries.
This experiment received extensive publicity. There was a full-page news article in Science (June 7, 1996 -- Digital Libraries: Computation Cracks Semantic Barriers Between Databases), a sidebar in the technology news of Business Week, and many news articles in the trade press for high-performance computing. Since the computation was done using special allocations on the new HP Convex Exemplar at NCSA, the experiment was featured in a press release on the Illinois DLI from our partner Hewlett-Packard. Finally, the experiment was featured as the conclusion to the cover lead article in Science by PI Bruce Schatz (January 17, 1997 -- Information Retrieval in Digital Libraries: Bringing Search to the Net). The article emphasized that this was the first major crack in the semantic barrier, realizing the grand visions of JCR Licklider in 1961 for concept-based retrieval across collections covering the entire literature of science.
The experiment established the technical feasibility of generating concept spaces for million-item databases. Three million abstracts from Compendex and Inspec were divided into 1000 subfield categories; since an abstract could fall into more than one subfield, the subfield collections together totaled 10 million abstracts. Each subfield collection represented a community repository, and a concept space was computed for each, consisting of a co-occurrence matrix giving the frequency with which terms occurred together within the collection.
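As a rough illustration of the co-occurrence matrix described above, the following Python sketch counts how often pairs of terms appear in the same abstract within one collection. It is a minimal sketch only: the function name `build_concept_space`, the toy abstracts, and the use of raw pair counts are assumptions made for illustration, and the project's actual term extraction and weighting are not specified here.

```python
from collections import Counter
from itertools import combinations


def build_concept_space(documents):
    """Count how often each unordered pair of terms shares a document.

    `documents` is an iterable of term lists, one list per abstract.
    Raw counts are an illustrative simplification, not the project's method.
    """
    cooccurrence = Counter()
    for terms in documents:
        # Deduplicate and sort so each pair is counted once per document
        # and always stored in the same (alphabetical) order.
        for a, b in combinations(sorted(set(terms)), 2):
            cooccurrence[(a, b)] += 1
    return cooccurrence


# Hypothetical toy collection: each abstract already reduced to its terms.
abstracts = [
    ["bridge", "load", "steel"],
    ["bridge", "steel", "corrosion"],
    ["load", "beam", "steel"],
]

space = build_concept_space(abstracts)
print(space[("bridge", "steel")])  # number of abstracts containing both terms
```

In this sketch, one such table would be built per subfield collection, which is why the computation scales with both the number of collections and the size of each.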
Earlier experiments had shown that concept spaces were effective for term suggestion for community repositories. That is, user search sessions were significantly enhanced by first navigating in the concept spaces to find appropriate terms actually in the collection, then utilizing those terms to retrieve desired documents with full-text search. This supercomputer simulation established the first testbed for investigating vocabulary switching, by providing an adequate set of document collections and semantic indexes to experiment with algorithms for navigating across concept spaces to support semantic interoperability.
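The two-step search session described above (navigate the concept space for terms, then run full-text search with them) might be sketched as follows. Everything here is hypothetical: the function names, the nested-dictionary layout of the concept space, and the sample documents are assumptions for illustration, not the project's client or server interfaces.

```python
def suggest_terms(concept_space, query_term, limit=3):
    """Return the terms that co-occur most often with query_term."""
    related = concept_space.get(query_term, {})
    # Highest count first; ties broken alphabetically for a stable order.
    ranked = sorted(related.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]


def full_text_search(documents, term):
    """Return ids of documents whose text contains term as a whole word."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if term in text.lower().split()
    ]


# Hypothetical pre-computed concept space: term -> {related term: count}.
concept_space = {
    "corrosion": {"steel": 5, "coating": 3, "bridge": 3, "fatigue": 1},
}

# Hypothetical document collection keyed by document id.
documents = {
    1: "Steel bridge inspection methods",
    2: "Protective coating for offshore platforms",
    3: "Fatigue in aluminum beams",
}

# Step 1: find terms actually present in the collection's concept space.
suggestions = suggest_terms(concept_space, "corrosion")
# Step 2: use each suggested term to retrieve documents.
hits = {term: full_text_search(documents, term) for term in suggestions}
print(suggestions, hits)
```

The point of the design is that the suggested terms come from the collection itself, so a searcher who starts with an unfamiliar or absent word is steered towards vocabulary that will actually match documents.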
