Quarterly Report for 4th Quarter 1997 (November '97 - February '98)
Digital Libraries Initiative
University of Illinois at Urbana-Champaign
Federating Repositories of Scientific Literature
Bruce Schatz, PI, schatz@csl.ncsa.uiuc.edu
contact: thabing@uiuc.edu, http://dli.grainger.uiuc.edu/
SIGNIFICANT EVENTS
TESTBED
OVERALL
Usage of DeLIver (the testbed) has increased to a total of 700 registered users, indicating that the web client has had a significant impact (200 new users within six weeks of the initial distribution). We are also experimenting with ways to encourage use with registration. We will make DeLIver usage information available from a web page soon.
The Testbed will continue to increase both the depth and breadth of the digital collection. We currently have a full SGML collection of 40,000 articles from 54 journals from 5 publishers, and we are adding about 2,000 articles per month to the system. For a current list of what's in the system, see: http://dli.grainger.uiuc.edu/pubs/.
The UIUC DLI will receive funding to continue the testbed: DARPA will provide $100K per year for three years to support its continuation.
SPECIFIC ACTIVITIES
For the testbed records, we are in the process of providing forward and backward links from bibliographic citations to other items in the testbed. We are also providing links from testbed record bibliographic citations to the INSPEC and Compendex databases.
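The linking described above can be illustrated with a minimal sketch. It assumes a hypothetical input format (a mapping from each article id to the ids it cites) and reads "backward links" as pointing to cited items and "forward links" as pointing to citing items; neither the data format nor the function name comes from the testbed itself.

```python
from collections import defaultdict


def build_citation_links(citations):
    """Return (backward, forward) link maps for articles held in the testbed.

    citations maps an article id to the ids it cites (hypothetical format).
    """
    in_testbed = set(citations)
    # Backward links: from an article to the testbed items it cites.
    backward = {
        article: [cited for cited in cited_ids if cited in in_testbed]
        for article, cited_ids in citations.items()
    }
    # Forward links: from an article to the testbed items that cite it.
    forward = defaultdict(list)
    for article, cited_ids in backward.items():
        for cited in cited_ids:
            forward[cited].append(article)
    return backward, dict(forward)


citations = {
    "A": ["B", "X"],  # "X" is not in the testbed, so no link is made
    "B": ["C"],
    "C": [],
}
backward, forward = build_citation_links(citations)
```

Citations to items outside the collection are simply dropped in this sketch; links to external databases would need a separate lookup step.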
EVALUATION
OVERALL ACTIVITIES
During the past quarter, the Social Science Team has largely continued the activities begun last quarter, as opposed to initiating any new data collection activities. Team members have been involved in troubleshooting usability problems with DeLIver; continuing their analysis of data collected earlier in the project; the iterative development of user registration, authentication, and transaction logging procedures for DeLIver; the development of a summative user survey instrument; and ongoing research related to work practices in the digital era.
SPECIFIC ACTIVITIES
This quarter saw the first significant public use of the testbed, via the web-based DeLIver service. Migration to public access on the web led to some unforeseen programming work. The registration method changed at about the time of the rollout, which necessitated extensive re-coding. In addition, a security problem in the registration web server was uncovered and has been addressed.
The public launch in October 1997 gave us our first opportunity to diagnose and address access problems experienced by web users of the DLI testbed. The Testbed and Social Science Teams collaborated in troubleshooting usability problems--that is, in diagnosing and addressing access barriers--associated with DeLIver. Upon discovering that a significant number of potential DeLIver users abandoned their access attempts when confronted with our user authentication and registration forms, the teams worked together to identify probable reasons and remedies. The result was iterative changes to our registration and login forms, additions to our marketing strategy, and the development of new insights into how to measure and understand the nature of use of experimental DL systems.
Team members again re-worked the registration form that all public users are required to fill out before using DeLIver. The layout of the registration form has been streamlined and made more appealing, though the content has changed very little. Nonetheless, this action forces a data migration event between old and new formats; this migration work began this quarter.
Team members continued working with the Testbed Team to redesign the user authentication procedures represented by our log-in form. The two teams also continued to set up storage and analysis procedures for DeLIver transaction logs.
In addition to the on-the-fly analysis of user behavior represented by our work to identify access barriers to DeLIver, two researchers continued their own user studies as independent affiliates of the DLI Social Science Team. Michael Twidale continued a study, which draws students from courses in which DeLIver has been demonstrated. Students are invited to sign up for individual help sessions with Twidale, during which they try using the system to retrieve literature for some current assignment. These usability sessions reveal basic problems with DeLIver and also support Twidale's own development of new interfaces that include visualizations of search. Najmuddin Shaik, a PhD student in Educational Psychology, concluded his preliminary usability tests (employing the cognitive walkthrough methodology) for the DLI IODyne client. The longer-term aim of this study is to produce a new mock interface that explicitly represents typical search paths in the system.
Team members revised preliminary drafts of user survey instruments that will provide summative data on the extent and nature of testbed use, the level of user satisfaction with the system, and reasons for use and nonuse. Ann Bishop and Emily Ignacio consulted with Barbara Buttenfield and Linda Hill of the Alexandria Digital Library about developing reliable registration and survey instruments, and devising means of encouraging users to complete them. These conversations resulted in changes to instruments employed by both the Illinois and Santa Barbara groups.
In this quarter, Social Science Team members continued work on their studies of work practices and the changing nature of digital information infrastructure. These include research on the use of document components, the organization of office workspaces, the manner in which people deal with new computer-based information systems, and information convergence. Several of these studies have produced abstracts or papers submitted to the 1998 Annual Meeting of the American Society for Information Science, ACM's DL 98 conference, and IEEE's Socioeconomic Dimensions of Electronic Publishing workshop.
In addition, Ann Bishop continued her work with Barbara Buttenfield (Alexandria Digital Library) and Nancy Van House (Berkeley DLI project) to produce an edited volume tentatively called The Use of Digital Libraries: Social Aspects of Design and Evaluation. Five authors have committed chapters and four have made tentative commitments. Authors and those providing commentary will meet at the DL 98 conference to discuss draft chapters; in addition, a DL 98 panel presentation related to the book has been proposed. The Social Science Team also revamped its project web pages this semester, in an effort to improve access to our work for other interested practitioners and researchers.
RESEARCH
OVERALL ACTIVITIES
The UA/MIS Artificial Intelligence Lab headed by Dr. Hsinchun Chen will be getting an SGI Origin 2000 supercomputer in the next few months. This piece of equipment will facilitate the processing of large collections by the AI Lab.
The AVHRR testbed will be expanded to include the integration of Digital Elevation Model (DEM) data and Digital Line Graph (DLG) data. DLG represents linear forms such as rivers and highways as lines from point to point.
SPECIFIC ACTIVITIES
The AI Lab's Visual Thesaurus project completed a user study in December comparing the classification made by the system with image classification by human subjects. Three tasks were assessed: 1) similarity analysis, 2) categorization, and 3) semantic segmentation. An expert with three years' experience in analyzing remote sensing data evaluated the results. The study found the AI Lab system was able to perform as well as the human subjects at these three tasks.
Analysis and Visualization:
1) WARD'S ALGORITHM - A preliminary test of the ability of Ward's Algorithm to cluster similar objects in a collection was conducted in December. Both the size of the collection used and the number of subjects studied will be expanded. The experiment should be completed this spring.
2) MDS - Sixty subjects, mostly MIS and ECE students (both graduate and undergraduate), were used to explore the differences between a Multi-dimensional Scaling (MDS) display and a randomly generated display. Three important (and statistically significant) differences were found. We concluded that MDS speeds up manual classification of documents and improves its quality, and that the MDS display agrees with subjects' perceptions of which documents are similar and should be displayed together.
3) DIMENSIONAL SPACE TESTING - The usefulness of, and user preferences for, 1D (Linear), 2D (Flat), and 3D (VRML) SOM displays are currently being evaluated. Users seem to prefer the 1D and 2D displays. The 1D display is similar to a list, and the 2D display shows users the relationships between categories. Some users found navigation in the 3D space difficult, while others liked the futuristic display and were able to appreciate its potential.
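For readers unfamiliar with Ward's Algorithm (item 1 above), the following is a minimal sketch of the general technique, not the AI Lab's implementation. It repeatedly merges the two clusters whose union gives the smallest increase in within-cluster variance. The toy 2-D points stand in for document feature vectors, and the function names are illustrative assumptions.

```python
from itertools import combinations


def centroid(points):
    """Mean vector of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def ward_cluster(points, k):
    """Agglomerate points into k clusters using Ward's merge criterion."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Pick the pair (i < j) whose merge is cheapest under Ward's criterion.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = ward_cluster(points, 2)
```

This brute-force version re-evaluates every pair at each step, which is fine for illustration but would be too slow for a large collection.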
Development and Evaluation of Noun-phrase Parsers
Research Assistant Qin He and FTEs Nuala Bennett and Conrad Chang summarized the results of their noun-phrase parsing experiments in CANIS Technical Report 1.
Automatic Concept Space Validation
Research Assistant Dmitry Zelenko finished development of a methodology for Concept Space comparison and validation. He then conducted experiments comparing Concept Spaces that were generated from the same collection of documents using two different algorithms. The experiments, in turn, have shown how to validate algorithms for Concept Space generation. Dmitry has also outlined a methodology for semi-automatic validation, which attempts to bring together the best of automatic validation and precision/recall studies. Dmitry has summarized his work in a report.
Semantic Interoperability GIS
Research Assistants Dorbin Ng, Bin Zhu, and Marshall Ramsey at the University of Arizona designed and implemented a Geographic Information System (GIS) that demonstrates information retrieval across multimedia sources, including text, images, and structured numerical data.
The GIS is a web-based system using Java applets in the user interface. The Arizona team built 14 servers in C to support data retrieval through common gateway interfaces (CGIs). Two web sites, one in Arizona and a mirror in Illinois, have been made available:
http://ai.bpa.arizona.edu/gis/demo/
http://geox.canis.uiuc.edu/~tng/gis/
Three different types of information were analyzed: text, images, and structured numerical data. For the textual collections, four Concept Spaces were generated. In addition, a thesaurus generated by human experts was converted to an online searchable format. The textual data were obtained from two organizations: GeoRef Information Service (GeoRef) at the American Geological Institute (AGI), and Petroleum Abstracts (PA) from the Petroleum Abstracts Service at the University of Tulsa. One of the main characteristics of the GeoRef collection is that geographic coordinates are recorded. GeoRef has two sub-collections: GeoRef/Ref (200,000 records) and GeoRef/Abs (30,000 records). Three Concept Spaces were computed from GeoRef: one from each of the two sub-collections and one from the entire collection. As noted, the GeoRef Thesaurus, consisting of approximately 26,000 geological terms, was converted to an online form. The fourth Concept Space was computed from Petroleum Abstracts, which has approximately 5 million records. The total size of the textual servers came to approximately 2 GBytes.
To create an index on a content-addressable image collection, Kohonen's self-organizing map (SOM) neural network algorithm was used to automatically categorize approximately 180,000 tiles (sections of images 128x128 pixels in size) from 28 aerial photographs (each photo ~32 MBytes in size). All tiles were categorized into one of 200 classes pre-defined by a 10x20 grid. Representative tiles were selected automatically and used in the Java interface. The total size of the image servers and data is 4 GBytes.
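As an illustration of how a Kohonen map assigns items to one of 200 classes on a 10x20 grid, here is a minimal sketch of the general SOM technique. It is not the Arizona implementation: the three-element feature vectors, the learning-rate and neighborhood schedules, and all function names are assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the grid node whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda n: sum((wi - xi) ** 2 for wi, xi in zip(weights[n], x)),
    )


def train_som(samples, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map; node n sits at grid position divmod(n, cols)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    sigma0 = max(rows, cols) / 2.0
    total = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        for x in samples:
            frac = 1.0 - step / total          # decays from 1 toward 0
            lr = lr0 * frac                    # shrinking learning rate
            sigma = max(sigma0 * frac, 0.5)    # shrinking neighborhood radius
            br, bc = divmod(best_matching_unit(weights, x), cols)
            for n, w in enumerate(weights):
                r, c = divmod(n, cols)
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
            step += 1
    return weights


# Toy feature vectors standing in for per-tile texture features (assumed).
samples = [
    [0.1, 0.1, 0.2], [0.2, 0.1, 0.1], [0.1, 0.2, 0.1],
    [0.9, 0.8, 0.9], [0.8, 0.9, 0.9], [0.9, 0.9, 0.8],
]
weights = train_som(samples, rows=10, cols=20)
classes = [best_matching_unit(weights, s) for s in samples]
```

Each entry in `classes` is a node index from 0 to 199, which plays the role of the tile's category; a representative tile for a class could then be chosen as the sample closest to that node's weight vector.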
To create an index on numerical data from an Advanced Very High Resolution Radiometer (AVHRR), Kohonen's self-organizing map (SOM) algorithm was again employed to automatically categorize four channels of data. The resulting categories were then labeled with temperature and vegetation density, and these labels were used in turn to form queries. Search results are displayed on a map of California using pink dots to identify locations with the given combination of temperature and vegetation density. The size of this server is 0.5 MBytes.
After formulating queries in each of the above three areas (text, image, numerical data), the GIS supports the integration of all three queries into a single, multimedia query. The integration of these queries results in the identification of locations in the area of California covered by the current databases in which all three conditions are met: textual match of the selected terms, image match of the chosen tile from an aerial photograph, and vegetation & temperature match from the chosen ranges in the AVHRR data. The system thus demonstrates true semantic interoperability across textual, image, and numerical data.
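Conceptually, the integration step amounts to intersecting the sets of locations returned by the three individual queries. The sketch below illustrates that idea only; the location identifiers and the function name are hypothetical, and the report does not say how the GIS represents locations internally.

```python
def integrate_queries(text_hits, image_hits, avhrr_hits):
    """Return the locations that satisfy all three queries."""
    return set(text_hits) & set(image_hits) & set(avhrr_hits)


# Hypothetical location identifiers returned by each query.
text_hits = {"cell-12", "cell-40", "cell-77"}
image_hits = {"cell-40", "cell-77", "cell-90"}
avhrr_hits = {"cell-05", "cell-40"}

matches = integrate_queries(text_hits, image_hits, avhrr_hits)
```

Only locations present in all three result sets survive, which mirrors the requirement that the textual, image, and AVHRR conditions all be met.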
In order to validate the performance of the automatic categorization of image tiles in the GIS, approximately 50 subjects participated in an experiment to perform similarity analysis, segmentation, and categorization of image tiles drawn from 20 different aerial photographs. The similarity analysis involved locating tiles in an image similar to given reference tiles. The segmentation process involved grouping adjacent similar tiles, and the categorization involved a global grouping of similar tiles. An expert in the remote sensing field evaluated the results of the experiment and concluded that the GIS did at least as well as human subjects in accomplishing these three tasks. A technical summary of the techniques employed in creating the visual thesauri is also available.
INTERNET
IODyne retrieval client development continues. The current focus is on adapting the Z39.58 common command language syntax for user queries. This will make query construction more flexible and allow the minor query dialogs (such as those for title and author) to be eliminated. It will also tie user interaction more closely to the IODyne search document and allow tighter association of the keyword display with the repositories currently attached to the search document.
Most significantly, the new query syntax and user interaction tools will allow for on-the-fly downloading and adaptation of repository configuration objects to give users maximum flexibility and control when using the same query with multiple repositories. This will be especially advantageous when using the new retrieval gateway co-developed with NCSA, Gazebo, which uses XML encoding for all aspects of client-server retrieval interaction, as well as a modular, hierarchical attribute space to enable servers to "intelligently" handle queries that they otherwise would not be able to handle. IODyne will still work with gateways built on current standards, such as Z39.50.
Design work continues on IODyne repository configuration objects (RCOs), which will allow IODyne clients to work with distributed resources in a unified way. Digital librarians will be able to assemble collections of RCOs of interest to the research communities they serve, and deliver these collections dynamically to IODyne clients over the Internet.
PUBLICATIONS
Elisabeth Bayle, Rachel Bellamy, George Casaday, Thomas Erickson, Sally Fincher, Beki Grinter, Ben Gross, Diane Lehder, Hans Marmolin, Brian Moore, Colin Potts, Grant Skousen, John Thomas. "Putting It All Together: Pattern Languages for Interaction Design", workshop at CHI 97: Human Factors in Computing Systems, Atlanta, GA, March 22-27, 1997. Reported by Bayle et al. in SIGCHI Bulletin, January 1998.
Nuala A. Bennett, Qin He, Conrad Chang, Bruce R. Schatz. "Concept Extraction in the Interspace Prototype", technical report.
Bishop, Ann P. Digital Libraries and Knowledge Disaggregation: The Use of Journal Article Components. Submitted to ACM's DL 98 conference.
Bowker, Geoffrey and Susan Leigh Star. How Classifications Work. Cambridge, MA: MIT Press, in press.
H. Chen, Y. Chung, M. Ramsey, C. Yang, P. Ma, and J. Yen, "Intelligent Agents on the Internet," Proceedings of the 30th Annual Hawaii International Conference on System Sciences (HICSS-30), Maui, Hawaii, January 7-10, 1997.
H. Chen, T. Smith, M. Larsgaard, L. Hill, M. Ramsey. "A Geographic Knowledge Representation System for Multimedia Geospatial Retrieval and Analysis", accepted by International Journal of Digital Libraries.
H. Chen, M. Ramsey and B. Zhu. "An Advanced Visual Thesaurus for Browsing Large Collections of Geographic Images", submitted to JASIS Perspectives Issue on Visual Information Retrieval Interfaces.
Chung, Yi-Ming, Pottenger, W.M., & Schatz, B.R. "Automatic Subject Indexing using An Associative Neural Network", technical report.
Plant, McGrath, Futrelle, "A Model for Cross-Database Searching of Distributed Astronomical Information Resources."
Plant, McGrath, Futrelle, "A Prototype Profile Set for Astronomy"
Robert E. McGrath, "HDF Java Overview", Mini-Workshop on Java Tools for Science Analysis, JPL, 19 November 1997.
Sandusky, R.J., & Powell, K.R.: Design for Collaboration in Networked Information Retrieval, AAAS-98, Digital Libraries Session, Philadelphia, February 1998.
Schatz, Bruce. "The Interspace of the Twenty-first Century: Information Analysis in the Net" in Proceedings for the International Symposium on Research, Development and Practice in Digital Libraries: ISDL'97, Tsukuba Science City, Japan, November 18 - 21, 1997.
Schnase, John L., Meredith A. Lane, Geoffrey C. Bowker, Susan Leigh Star and Abraham Silberschatz, "Building The Next Generation Biological Information Infrastructure," Second National Academy Of Sciences Forum On Biodiversity, In Press.
Star, Susan Leigh, ed. Social Science, Information Systems and Cooperative Work: Beyond the Great Divide (with Geoffrey Bowker, Les Gasser, and William Turner), Lawrence Erlbaum Associates, 1997.
Star, Susan Leigh, Geoffrey Bowker, and Laura Neumann, "Transparency At Different Levels of Scale: Convergence between Information Artifacts and Social Worlds," Journal of the American Society for Information Science, in press.
Star, Susan Leigh, "Grounded Classifications: Grounded Theory and Faceted Classifications," Library Trends, to appear.
PRESENTATIONS
Gross, Ben. "The Interspace of the Twenty-first Century: Information Analysis in the Net" presentation for the International Symposium on Research, Development and Practice in Digital Libraries: ISDL'97, Tsukuba Science City, Japan, November 18 - 21, 1997.
Ignacio, Emily and Bishop, Ann P. "From Usability to Use: Measuring the Success of Testbeds in the Real World", 6, 1998.
Kantor, Paul. "Observation and Measurement in Evaluating Digital Libraries", Emerging Technologies in Digital Libraries Seminar, UIUC Department of Computer Science, February 12, 1998. http://www.canis.uiuc.edu/seminar.html (Spring 1998, lecture 3).
Ng, Dorbin. "Semantic Interoperability for Geographic Information Systems", DLI '98 Berkeley All Project Meeting, January 4-5, 1998.
Mischo, Wm. "A Production Focus: User Perspectives", Symposium on Building Digital Collections, University of Iowa Libraries, December 12, 1997. http://magni.grainger.uiuc.edu/uiowadl/agenda.htm
Pottenger, Wm. and Dmitry Zelenko. "Automatic Methods for Determining the Semantic Difference between Collections of Documents", Emerging Technologies in Digital Libraries Seminar, UIUC Department of Computer Science, February 19, 1998.
Sandusky, Robert J. "Infrastructure Management as Cooperative Work: Implications for Systems Design", presented at GROUP '97, the International ACM SIGGROUP Conference on Supporting Group Work, November 16-19, 1997, Phoenix, AZ.
Schatz, Bruce. "UIUC Digital Libraries Initiative Project Status and Retrospective", AAAS-98, Digital Libraries Session, Philadelphia, February 1998.
Schatz, Bruce. "UIUC Digital Libraries Initiative Retrospective - We Did What We Promised", DLI Project-Wide Meeting, Berkeley, CA, January 5, 1998.
Star, Leigh. "Classification and Infrastructure", Paper presented to Colloquium, School of Information Management Studies, University of California, Berkeley, December, 1997.
Star, Leigh. "Things Perceived as Real Are Real in their Consequences," Paper presented to the Society for the Social Studies of Science, Tucson, October, 1997.
Star, Leigh. "Filiations, Infrastructure and Nursing Work," Presented to the American Society for Information Science, Washington, D.C., November, 1997.
VISITORS
November 15, 1997
Mark Lester
Global Marketing Manager
Engineering & Technology
Elsevier
November 25, 1997
Alla Aslitdinova
Academy of Sciences of Tajikistan
Dushanbe, Tajikistan
Sugunavathy Chinnakalappagari
ANGR Agricultural University
Hyderabad, India
Alexei Falaleev
Vladivostok State University of Economics
Vladivostok, Russia
Eun-Gyoung Han
Hansung University
Seoul, Korea
Barchinoy Kholbekova
Educational Advising Center
Tashkent, Uzbekistan
Suk-Young Kim
Korea Institute of Industry & Technology Information
Seoul, Korea
Md. Abdul Matin
Bangladesh Public Administration Training Center
Savar Dhaka, Bangladesh
Ilonka Matute
Empresa Electrica de Guatemala
Guatemala City, Guatemala
Abdullahi Musa Ibrahi
Bayero University Kano
Kano, Nigeria
February 3, 1998
Lourdes Oliva Callis, Universitat de Girona, Biblioteca Proces Tecnic - Catalogacio, Girona-Catalunya, Spain
Elif Kaynak, National Academic Network and Information Centre, Turkey
February 13, 1998
Dr. Paul Kantor, SCILS, Rutgers, the State University of New Jersey
Appendix A
As of this month, we have approximately 40,000 articles from five of our publishing partners processed and indexed, and we are adding about 2,000 articles per month to the system. For a current list of what's in the system, see: http://dli.grainger.uiuc.edu/journals/.