HPCwire

The Text-on-Demand E-zine for High Performance Computing
April 12, 1996: Vol. 5, No. 15 Circulation: 17,877

DIGITAL LIBRARY INITIATIVE TACKLES GRAND CHALLENGE OF INFO SCIENCE
by Alan Beck, managing editor

Using a week of dedicated computer time on NCSA's 64-node HP Convex Exemplar, researchers on the NSF/DARPA/NASA-funded Digital Library Initiative project were able to generate concept spaces for 10,000,000 journal abstracts across 1,000 subject areas.


QUOTE OF THE WEEK

"It's a good example of what HPC technology is good for. It lets you simulate functionality on the high-end which will be available for the masses when, in 10 years, the typical PC runs at the same speed as the Convex Exemplar."

-- Bruce Schatz, Research Scientist on the Illinois DLI project


Urbana-Champaign, Illinois -- Using NCSA supercomputers, researchers on the NSF/DARPA/NASA-funded Digital Library Initiative project have taken a first step toward effecting quick and efficient conceptual searches across widely heterogeneous sources. This task, often characterized as the Grand Challenge of Information Science, promises to be a vital key in expanding technical progress in the 21st century by allowing disparate disciplines to cross-fertilize one another.

The official May 1995 report on the Research Agenda for Digital Libraries from IITA (Information Infrastructure Technology and Applications, the highest-level technical committee for NII and a major part of the HPCC program) explicitly identified semantic interoperability as the Grand Challenge of Digital Libraries.

Although developments in one area of learning may have direct bearings on many others, different technical vocabularies often prevent salient concepts from being readily ported from field to field. Present search technologies, based on simple word identification and logical operations, do little to overcome this fundamental lack of semantic interoperability.

What is required is a way of searching rooted in concepts rather than words, since different words represent the same concept in different fields. One group of researchers believes the key to unlocking this problem lies in the implementation of a statistical technique known as "conceptual space" (i.e. spaces of intersecting concepts).

Conceptual space is produced via statistical evaluation based upon co-occurrence matrices, which record how frequently two terms occur together in the same context. Typically, two- or three-word noun phrases within the same sentence are targeted. A graph of all the terms representing all the concepts in a given subject domain is thereby generated, with links weighted by the statistical frequency of co-occurrence within the documents.

"Noise" words such as prepositions and conjunctions are filtered out, and individual words are canonicalized. Initially, the latter was accomplished through "stemming," e.g. chopping off plural and gerundive suffixes, although the present algorithm no longer approaches the task in this manner. The computation is automatic and domain independent. The resulting graph can be used for term suggestion, i.e. given a term, it lists others commonly occurring with it within the same collection.
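The steps described above (filtering noise words, counting co-occurrences within a sentence, and suggesting related terms) can be illustrated with a minimal Python sketch. It is not the project's algorithm: it works on single words rather than two- or three-word noun phrases, it skips canonicalization, and the stop list, function names and sample sentences are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical stop list; the article does not say which words were filtered.
NOISE_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "for", "to", "with"}

def extract_terms(sentence):
    """Lowercase a sentence, strip punctuation, and drop noise words."""
    words = (w.strip(".,;:()").lower() for w in sentence.split())
    return {w for w in words if w and w not in NOISE_WORDS}

def build_concept_space(sentences):
    """Count how often each pair of terms appears in the same sentence."""
    cooccur = defaultdict(Counter)
    for sentence in sentences:
        for a, b in combinations(sorted(extract_terms(sentence)), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1
    return cooccur

def suggest_terms(cooccur, term, k=3):
    """Return the k terms that most often co-occur with `term`."""
    neighbors = cooccur.get(term, Counter())
    ranked = sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

# Made-up sample sentences, for illustration only.
sentences = [
    "Stress analysis of the bridge girder.",
    "Fatigue and stress in the girder.",
    "Fatigue in welded joints.",
]
space = build_concept_space(sentences)
print(suggest_terms(space, "stress"))  # ['girder', 'analysis', 'bridge']
```

Here "girder" ranks first because it shares two sentences with "stress"; the remaining terms tie at one co-occurrence and are ordered alphabetically.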

This constitutes a practical enhancement to information retrieval systems, since the major problem users face is determining which terms to search for to locate desired concepts. Multiple concept spaces in multiple subject areas can be used for vocabulary switching, i.e. mapping terms in one subject area to similar terms in another.
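The article does not describe how terms are mapped between concept spaces. One plausible approach, shown here purely as an assumption, is to score each term in the target subject area by how many co-occurring neighbors it shares with the query term in the source area. The function name, scoring rule and toy data below are hypothetical.

```python
def switch_vocabulary(source_space, target_space, term, k=3):
    """Rank target-area terms by neighbors shared with `term` in the source area.

    Each space maps a term to a dict of {neighbor: co-occurrence count}.
    The score is the sum, over shared neighbors, of the smaller of the two counts.
    """
    source_neighbors = source_space.get(term, {})
    scores = {}
    for candidate, target_neighbors in target_space.items():
        shared = source_neighbors.keys() & target_neighbors.keys()
        score = sum(min(source_neighbors[t], target_neighbors[t]) for t in shared)
        if score > 0:
            scores[candidate] = score
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [c for c, _ in ranked[:k]]

# Made-up concept spaces for two subject areas.
civil = {"fatigue": {"crack": 4, "load": 3, "cycle": 2}}
materials = {
    "embrittlement": {"crack": 5, "hydrogen": 3},
    "creep": {"load": 2, "temperature": 4},
    "annealing": {"temperature": 6},
}
print(switch_vocabulary(civil, materials, "fatigue"))  # ['embrittlement', 'creep']
```

In this toy data a search for "fatigue" in the first area suggests "embrittlement" and "creep" in the second, because those terms co-occur with "crack" and "load" respectively.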

A large-scale simulation of vocabulary switching -- utilization of the terminology inherent in one subject to search a different one -- relying upon conceptual space was recently carried out at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign by researchers Hsinchun Chen and Bruce Schatz on the Illinois DLI (Digital Library Initiative) project. Schatz is the Research Scientist at NCSA responsible for helping build and support the national HPC community in information science, of which Chen at the University of Arizona is one of the first members.

Using a week of dedicated computer time during the testing period of NCSA's new 64-node HP Convex Exemplar (10 days of CPU time overall on NCSA's HP Convex Exemplar and SGI Power Challenge), the researchers generated concept spaces for 10,000,000 journal abstracts across 1,000 subject areas in engineering and science. The raw input was approximately 1 million abstracts from Compendex (all of engineering) and about 2 million abstracts from Inspec (computer science, electrical engineering and physics). Partitioning the abstracts into 1,000 subject-area collections to simulate community repositories, with each document appearing in multiple subject areas, yielded about 10 million documents.
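The growth from roughly 3 million raw abstracts to about 10 million documents follows from each abstract being counted once per subject area it belongs to, an average of a little over three areas per abstract. A small Python sketch of that partitioning is given below; the subject labels and the way abstracts are assigned to areas are assumptions, since the article does not say how the assignment was made.

```python
from collections import defaultdict

def partition_by_subject(abstracts):
    """Build one collection per subject area; an abstract may land in several."""
    collections = defaultdict(list)
    for abstract_id, subject_areas in abstracts.items():
        for area in subject_areas:
            collections[area].append(abstract_id)
    return collections

# Made-up abstracts, each tagged with one or more hypothetical subject areas.
abstracts = {
    "a1": ["optics", "lasers"],
    "a2": ["lasers"],
    "a3": ["optics", "lasers", "plasma"],
}
collections = partition_by_subject(abstracts)
total_documents = sum(len(ids) for ids in collections.values())
print(len(abstracts), total_documents)  # 3 6
```

Three abstracts become six documents across three collections, the same multiplication effect the project saw at a much larger scale.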

This represents the largest computation thus far in information science and one of the largest ever on a supercomputer at NCSA. "Information retrieval is the most important problem in the Net now," Schatz told HPCwire. "And there's been no progress in 30 years. What's being used now is exactly what was used 30 years ago... There was no technology that scaled, that did anything better than word matching. But here's something which does work for large collections, that will actually give better functionality."

"It's a good example of what HPC technology is good for. It lets you simulate functionality on the high-end which will be available for the masses when, in 10 years, the typical PC runs at the same speed as the Convex Exemplar."
