Program Plan for Period 4
March 1998 to August 1998
DLI project, University of Illinois
Federating Repositories of Scientific Literature
Bruce Schatz, PI, schatz@uiuc.edu
contact: dli@uiuc.edu, http://dli.grainger.uiuc.edu
Research Plans
Testbed
The final phase of the design and implementation of the Web client (http://dli.grainger.uiuc.edu/deliver/) was completed, and campus-wide distribution commenced October 15, 1997. Since its distribution, usage of DeLIver (the testbed) has increased to a total of 700 registered users, including 200 new users within the first six weeks of distribution, which indicates that the Web client has had a significant impact. DeLIver usage information will be made available via the Web this year.
The Testbed will continue to increase both the depth and breadth of the digital collection. We currently have a full SGML collection of 40,000 articles from 54 journals from 5 publishers, and we are adding about 2,000 articles per month to the system. For a current list of what is in the system, see http://dli.grainger.uiuc.edu/journals/.
For the testbed records, we are in the process of providing forward and backward links from bibliographic citations to other items in the testbed. We are also providing links from testbed record bibliographic citations to the INSPEC and Compendex databases.
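The forward and backward linking described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the article identifiers, the `build_links` function, and the dictionary layout are hypothetical and are not taken from the testbed software. A forward link goes from an article to a testbed item it cites; a backward link goes from an item to the testbed articles that cite it.

```python
from collections import defaultdict

def build_links(citations):
    """Build forward and backward citation links.

    citations: dict mapping an article id to the list of ids it cites
    (hypothetical layout). Only cited items that are themselves in the
    collection receive links.
    """
    forward = {}                   # article -> collection items it cites
    backward = defaultdict(list)   # article -> collection items citing it
    for article, cited in citations.items():
        in_collection = [c for c in cited if c in citations]
        forward[article] = in_collection
        for c in in_collection:
            backward[c].append(article)
    return forward, dict(backward)

# Hypothetical sample: "X" is cited but is not held in the collection.
sample = {
    "A": ["B", "C", "X"],
    "B": ["C"],
    "C": [],
}
forward, backward = build_links(sample)
print(forward["A"])    # items A links forward to
print(backward["C"])   # items that link back to C
```

Citations to items outside the collection (such as "X" here) are the ones that would instead be candidates for links to external databases.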
In addition to this work on dynamic linking, there are a number of other interesting areas that we will continue to investigate relating to indexing, searching, and displaying of the heterogeneous full-text repositories in our testbed. These include: the conversion of SGML to HTML 4.0 (including the use of cascading style sheets, distributed font sets, and dynamic HTML); the conversion of SGML to XML and the use of the Mathematics ML specification; the implementation in Web browsers of some of the dynamic retrieval techniques used in the custom client such as dictionary word wheels and dynamic multi-directional links; dynamic links to full-text repositories from A & I Service bibliographic databases and online catalogs; and the implementation of simultaneous multiple repository searching and advanced user navigation and gateway techniques.
The UIUC DLI will be a recipient of funding to continue the testbed. DARPA will be providing $100K per year for three years to support the continuation of the UIUC DLI testbed. The follow-on project will open the testbed to researchers working in these areas and we will continue to use the University of Illinois population as the test group. We will be asking our present partners to continue to supply us with SGML in order to attain the 'critical mass,' the depth and breadth of collection needed for some of the studies. We also plan on adding other publishers and other types of publications to the testbed, allowing broader studies and more accurately simulating the digital library.
Leads: Bill Mischo, Tim Cole
Internet
IODyne retrieval client development has been focusing on adaptation of the Z39.58 common command language syntax for user queries. This will make query construction more flexible and allow the minor query dialogs (such as those for title and author) to be eliminated. It will also tie user interaction more closely to the IODyne search document and allow tighter association of keyword display with the repositories currently attached to the search document.
Most significantly, the new query syntax and user interaction tools will allow for on-the-fly downloading and adaptation of repository configuration objects to give users maximum flexibility and control when using the same query with multiple repositories. This will be especially advantageous when using the new retrieval gateway co-developed with NCSA, Gazebo, which uses XML encoding for all aspects of client-server retrieval interaction, as well as a modular, hierarchical attribute space to enable servers to "intelligently" handle queries that they otherwise would not be able to handle. IODyne will still work with gateways built on current standards, such as Z39.50.
Design work continues on IODyne repository configuration objects (RCOs), which will allow IODyne clients to work with distributed resources in a unified way. Digital librarians will be able to assemble collections of RCOs of interest to the research communities they serve, and deliver these collections dynamically to IODyne clients over the Internet.
Supported by the DLI, the Emerge group at NCSA has worked with Eric Johnson to develop the Gazebo search gateway. Gazebo allows clients to search collections of remote data sources simultaneously without having to reformulate the queries for each data source's protocol or configuration. Currently, Z39.50 data sources are supported and Opentext support is on the way. Gazebo translates abstract queries into these underlying protocols, submits them, and returns results to its client asynchronously. It will enable IODyne to dynamically discover the attribute space that a gateway is configured for, so IODyne will be able to perform searches on previously unknown data sources.
Mapping from the abstract attribute space into particular data source protocol parameters is currently the only supported query translation method Gazebo employs, but we're working to extend it to support additional kinds of query translation. This could include performing arbitrary computations on query terms such as vocabulary switching or coordinate translation, or structure-altering transformations such as query expansion. Since many of these services are domain-specific, design work needs to be done to see what kind of general framework these might fit into.
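The attribute mapping described above can be sketched as a table lookup. This is a minimal illustration, not Gazebo's actual design: the abstract attribute names, the `translate` function, and the "fielded_text" source with its field codes are hypothetical. The Z39.50 numbers are Bib-1 Use attribute values (4 = title, 21 = subject heading, 1003 = author).

```python
# Per-source mapping tables from an abstract attribute space (hypothetical
# names) to the parameters each data source understands.
ATTRIBUTE_MAPS = {
    # Bib-1 Use attribute numbers for a Z39.50 source.
    "z3950": {"title": 4, "author": 1003, "subject": 21},
    # Field codes for an invented fielded-text engine (assumption only).
    "fielded_text": {"title": "TI", "author": "AU", "subject": "KW"},
}

def translate(query, source):
    """Translate an abstract query for one data source.

    query: list of (abstract_attribute, term) pairs.
    Returns a list of (source_parameter, term) pairs.
    """
    mapping = ATTRIBUTE_MAPS[source]
    translated = []
    for attribute, term in query:
        if attribute not in mapping:
            # The source cannot express this attribute; a richer gateway
            # might fall back to a more general attribute instead.
            raise KeyError(f"{source} has no mapping for {attribute!r}")
        translated.append((mapping[attribute], term))
    return translated

abstract_query = [("author", "Schatz"), ("title", "federating repositories")]
z_query = translate(abstract_query, "z3950")
text_query = translate(abstract_query, "fielded_text")
print(z_query)
print(text_query)
```

The same abstract query is written once and translated per source. Transformations such as vocabulary switching or query expansion would change the terms or the query structure, not just the parameter names, which is why they need more than a lookup table.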
Emerge's distributed search architecture, which is also being adapted for use in NASA Project 30, is written up in a whitepaper by J. Futrelle, R. McGrath, and R. Plante, which can be found at http://monet.astro.uiuc.edu/~rplante/topics/P30/sysmodel.html
Leads: Eric Johnson, Bill Wendling
Research
I. Geographic Information Systems:
(1) System Integration: Significant work is under way to provide a tight integration of the initial image/text/number GIS server (demonstrated at the DARPA IM meeting). Both the breadth (collection sizes and formats) and depth (integration and display) will be significantly enhanced in the second release of the GIS prototype.
(2) Image: The Artificial Intelligence Group has continued to refine its visual thesaurus techniques. The AI Lab expects to be able to incorporate 1000 images into the thesaurus by the end of April 1998. Image analysis techniques will be optimized and parallelized on the SGI Origin2000 supercomputer at NCSA and at the AI Lab (a new 8-node server was purchased for the AI Lab's international experimentation and local needs). (A user study was completed in December comparing image classification by the AI Lab's Visual Thesaurus to the classification made by human subjects. Three tasks were assessed: 1) similarity analysis, 2) categorization, and 3) semantic segmentation. The study found the AI Lab system was able to perform as well as the human subjects at these three tasks.)
(3) Georeferenced and Numeric data: Channels 2 and 3 of the AVHRR data from NASA's Pathfinder Satellite and data from the United States Geological Survey's GNIS gazetteer have been incorporated into a Vegetation and Temperature thesaurus for California. The AVHRR testbed will be expanded in the Spring of 1998 to include the Digital Elevation Model (DEM) data and Digital Line Graph (DLG) data representing linear forms such as rivers and highways. This data will be used in the Temperature Thesaurus and the Visual Thesaurus to improve their geoscience-specific functionality.
The above new geoscience data types will be displayed within the original Java-based demo system. Several 3-D visualization techniques are under investigation for future display of 3-D DEM and DLG data.
(4) Semantic Interoperability: Dorbin Ng will be completing his Ph.D. dissertation in the next six months. The focus of his dissertation will be the semantic integration of multiple geoscience collections under a single interface, the Knowledge Manager. The Knowledge Manager allows the user to search several applications simultaneously and uses co-occurrence and similarity weights to weight and rank the output. A prototype of the geoscience Knowledge Manager was developed in October 1997, and testing and modifications will continue over the next six months. Components of the Knowledge Manager are the Textual Knowledge Source and the Tile Knowledge Source. The Textual Knowledge Source allows simultaneous searching of Petroleum Abstracts, GeoRef Thesaurus, and the GeoRef and GeoAbs collections. The Tile Knowledge Source allows access to the Visual Thesaurus.
II. Algorithms and Analysis:
(1) Ward's Statistical Clustering: A multi-link statistical clustering algorithm is under development. The technique is believed to be able to produce higher precision than the SOM method that we developed earlier. Ward's algorithm will also be able to produce a hierarchical clustering tree for easy browsing. System fine-tuning and a user study of sample clustering results are under way. Ward's clustering is planned to be a technique used in user-centered analysis and clustering for future DL projects.
(2) MDS Clustering/Display: A 2-D MDS placement algorithm and a 1-D MDS clustering algorithm have been under development. The techniques are believed to be able to generate an intuitive and graphical summary of document clusters. A comparison of results from MDS and SOM is under way. MDS is planned to be a technique used in user-centered analysis and clustering for future DL projects.
(3) Noun phrasing and type tagging: The capabilities of the Arizona Noun Phraser will be expanded to include verb phrasing, incorporation of partial noun phrases, code optimization, and type tagging. In particular, medical and geoscience specific lexicons (e.g., NLM's SPECIALIST lexicon, and GNIS georeferenced gazetteer) will be used to help tag domain-specific concepts.
(4) Concept Space Evaluation and Integration: Andrea Houston's Ph.D. dissertation will include a qualitative and quantitative user evaluation of the AI Lab concept space techniques including recently developed JAVA-based graphical applications. The goal is to collect information from 50 subjects in the health sciences by April 1998. Subjects include cancer researchers, physicians, graduate students in the health sciences, and librarians in the Arizona Health Sciences Library. Qualitative feedback will be used in the design of a new JAVA-based interface.
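Items (1) and (2) above name Ward's clustering and MDS without giving details. The sketch below is a rough illustration only: it uses the textbook Ward minimum-variance merge rule and classical (Torgerson) 1-D MDS with power iteration, and it is not the AI Lab's implementation. The function names, the toy points, and the choice of MDS variant are all assumptions.

```python
import math

def ward_cluster(points):
    """Textbook Ward agglomeration; returns merges as (members_a, members_b, cost)."""
    clusters = [([i], list(p)) for i, p in enumerate(points)]  # (members, centroid)
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[i], clusters[j]
                na, nb = len(ma), len(mb)
                dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * dist2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        merges.append((sorted(ma), sorted(mb), cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((ma + mb, centroid))
    return merges

def classical_mds_1d(dist, iterations=200):
    """Classical 1-D MDS: one coordinate per item from a distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double-centered matrix B = -1/2 * J * D^2 * J (dist assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):           # power iteration for the top eigenvector
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in v]

# Toy "documents" as 2-D points: two tight pairs that are far apart.
docs = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
history = ward_cluster(docs)

# Three items on a line at positions 0, 1, and 3.
coords = classical_mds_1d([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
```

The merge history is itself the hierarchical tree mentioned in item (1): each entry joins two subtrees, and the last entry is the root. The MDS coordinates are determined only up to sign and shift, so it is the distances between them that should be compared with the input.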
Leads: Hsinchun Chen, Bruce Schatz
Evaluation
During the last year, the Social Science Team's work has fallen primarily into five major areas: 1) collaborative design of DeLIver with the Testbed Team; 2) research on various aspects of work practices in the changing digital environment; 3) implementing mechanisms for capturing online data related to testbed use; 4) developing methods for producing summative reports of testbed use; and 5) building a research community focusing on social aspects of DL design and evaluation.
The Social Science Team collaborated throughout the year with the Testbed Team in designing DeLIver, our testbed's web-based interface. The Social Science Team conducted a focus group interview with science and engineering librarians to obtain their views on the design of the system, an appropriate marketing strategy, and the manner in which DeLIver would fit with current UIUC library practices. Team members conducted usability tests for DeLIver and implemented an online survey to gather user feedback on an early web interface. They also interacted with the Testbed Team much more intensively than they had before, meeting at times on a weekly basis to discuss design decisions. Social Science Team members also produced online help documents for DeLIver.
Social Science Team members conducted research on a number of problems related to the use of digital information infrastructure. Bishop initiated a study of how people use the individual components of journal articles. Other studies look at issues related to the use of federated systems. Neumann and Star studied the phenomenon of information convergence. Star, with Karen Ruhleder, is studying how people organize material in their workspaces. Neumann and Ignacio began a study of how people learn to use the disparate set of computer tools that confront them. Michael Twidale launched a study of collaborative practices among students in learning to use DeLIver.
A great deal of effort has been spent in the last year to instrument DeLIver in order to authenticate, register, and log the behavior of system users. A registration form to collect demographic data about users went through several design iterations. Appropriate processing for limiting access to eligible users has also gone through several iterations (work primarily accomplished by the Testbed Team, but with assistance from the Social Science Team). Transaction logging for the custom client was stabilized so that monthly reports on use can be produced; transaction logging for DeLIver has been implemented by the Testbed Team and now the Testbed and Social Science Teams are working to develop effective processing mechanisms for storing and analyzing that data. The move from the custom client to the web client has necessitated a great deal of data migration work related to information gathered about users.
Another area of extensive work for the Social Science Team has been the preliminary design of procedures and instruments for gathering summative user evaluation data. Surveys on use and satisfaction have been drafted, and tentative plans for studying use in context have been formulated.
The final major area of work this past year encompasses efforts by Social Science Team members to contribute to the development of a research community devoted to studying social aspects of DL design and use. Team members have been active participants in various workshops and conferences, and have tried to encourage and facilitate discussion and collaboration related to human-centered design and evaluation across the six DLI projects. Ann Bishop is working with Nancy Van House (Berkeley) and Barbara Buttenfield (Santa Barbara) to produce an edited volume related to this area of study, with seed funds from the UIUC DLI.
In the coming year, the Social Science Team will collect summative evaluation data from DLI testbed users through a survey and other means of gathering feedback from users (such as focus groups and observations of use in context). They will continue to work closely with the Testbed Team to capture and analyze records of online user activities. Social Science Team members will produce final reports describing and discussing use of our DLI testbed. They will also continue their research into work practices in the digital environment and prepare papers on their findings.
Leads: Ann Bishop, Leigh Star
Financial Report
See attached budget and itemizations for the main grant to the University of Illinois and for the subcontract to the University of Arizona.
Management Report
Organization Chart
Principal Investigator: Bruce Schatz (CANIS)
Testbed:
Search Indexing: Bill Mischo (Grainger Library)
Collection Development: Tim Cole (Grainger Library)
Internet:
Multiview Client: Eric Johnson (CANIS)
Multiprotocol Gateway: Joe Futrelle (NCSA)
Research:
Semantic Retrieval: Hsinchun Chen (Univ. of Arizona)
Interspace Environment: Bruce Schatz (CANIS)
Performance Evaluation: Roy Campbell (CS)
Evaluation:
User Studies: Ann Bishop (GSLIS)
Context Studies: Leigh Star (GSLIS)
Testbed:
Programmer (Collection): Bob Ferrer (Grainger)
Programmer (Search): Tom Habing (Grainger)
Programmer (Display): Donal O'Connor (Grainger)
Students: Han Wen Hsiao (Grainger)
Internet:
Programmer (Client): Eric Johnson (CANIS)
Programmer (Gateway): Bill Wendling (NCSA)
Research:
Programmer (Algorithms): Dorbin Ng (Arizona)
Programmer (Environments): Kevin Powell (CANIS)
Programmer (Performance): Bob McGrath (NCSA)
Students (Information Science): Bill Pottenger, Conrad Chang (CANIS)
Students (Computer Science): Yongchen Li (CS)
Evaluation:
Students: Eric Larsen, Laura Neumann, Cecelia Merkel, Emily Ignacio, Bob Sandusky (all GSLIS)
Coordination:
Partners: Susan Harum (CANIS)
NSF: Ben Gross (CANIS)
CANIS = Community Architectures of Networked Information Systems, Graduate School of Library and Information Science (GSLIS)
Grainger = Grainger Engineering Library Information Center, University Library
NCSA = National Center for Supercomputing Applications
GSLIS = Graduate School of Library and Information Science
CS = Department of Computer Science, University of Illinois at Urbana-Champaign
Partners List
Publishers:
AIP American Institute of Physics (Applied Physics)
APS American Physical Society (Theoretical Physics)
AAS American Astronomical Society
ASCE American Society of Civil Engineers
ASAE American Society of Agricultural Engineers
ASME American Society of Mechanical Engineers
AIAA American Institute of Aeronautics and Astronautics
IEEE Institute of Electrical and Electronics Engineers
IEEE CS IEEE Computer Society
IEE Institution of Electrical Engineers (British)
EI Engineering Information (Compendex)
OSA Optical Society of America
John Wiley
Elsevier Science
Academic Press
AAAS American Association for the Advancement of Science
Software and Hardware:
SoftQuad
OpenText
Hewlett-Packard
Microsoft
OCLC
DLI Projects:
Santa Barbara (GIS semantic retrieval)
Carnegie-Mellon (NetBill charging software)
Stanford (Search Interoperability)
Michigan (User Interfaces)