Quarterly Report for 1st Quarter 1998 (March '98 - May '98)
Digital Libraries Initiative
University of Illinois at Urbana-Champaign

 

Federating Repositories of Scientific Literature
Bruce Schatz, PI, schatz@uiuc.edu
contact: dli@uiuc.edu

http://dli.grainger.uiuc.edu/

 

SIGNIFICANT EVENTS

TESTBED

This quarter we held our fourth annual Partner's Workshop in conjunction with the GSLIS Annual Workshop. This year's topic was the successes and failures of the UIUC DLI, with additional presentations by representatives from the National Science Foundation, Xerox PARC, OCLC, and Virginia Tech. A Collaborative Partner's Program was proposed between the Testbed team and DLI partners for technology transfer and the continuation of the Testbed beyond the end of the DLI. Details of the program were agreed upon, and current DLI partners demonstrated their support. Presentations from the workshop, as well as some full-text papers, can be found at: http://dli.grainger.uiuc.edu/spring98workshop/.

The Testbed team installed and brought into operation a second version of the AIP repository (with index) on AIP equipment in Long Island. Next quarter the Testbed team will customize the AIP copy of the repository and index to differentiate it from the UIUC copy, and will test various multiple-repository configurations.

Links were incorporated in DeLIver (http://dli.grainger.uiuc.edu/deliver/) to AIP and APS serial link management software (links to the INSPEC database that AIP maintains and the APSSPIN database). In addition, the Testbed team wrote and tested detailed transaction log software that combines logs from both database servers and web servers to provide session-level information on user searching behavior. Detailed information on user searches has been derived from this transaction log data.
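The idea of combining database-server and web-server logs into session-level records can be sketched as follows. This is only an illustration, not the Testbed team's software: the `LogEvent` fields, the `build_sessions` function, and the 30-minute inactivity timeout are all assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import groupby
from operator import attrgetter


@dataclass(frozen=True)
class LogEvent:
    """One log record; all field names are hypothetical."""
    user_id: str      # identifier assumed to be shared by both log sources
    timestamp: float  # seconds since some common epoch
    source: str       # "web" or "db"
    detail: str       # e.g. requested URL or query text


def build_sessions(web_events, db_events, timeout=1800.0):
    """Merge both logs and split each user's events into sessions.

    A new session starts whenever the gap between two consecutive events
    from the same user exceeds `timeout` seconds (an assumed threshold).
    """
    merged = sorted(list(web_events) + list(db_events),
                    key=attrgetter("user_id", "timestamp"))
    sessions = []
    for _, events in groupby(merged, key=attrgetter("user_id")):
        current = []
        for event in events:
            if current and event.timestamp - current[-1].timestamp > timeout:
                sessions.append(current)
                current = []
            current.append(event)
        if current:
            sessions.append(current)
    return sessions
```

Each returned session is a time-ordered list of events from one user, so web requests and the database queries they triggered appear side by side.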

Currently in DeLIver: http://dli.grainger.uiuc.edu/journals/

DeLIver usage statistics: http://dli.grainger.uiuc.edu/statistics/

EVALUATION

During the past quarter, the Social Science Team largely continued the activities begun last quarter and initiated new data collection activities. Team members have been involved in the development of user registration, authentication, and transaction logging procedures for DeLIver. This quarter, we finished developing a summative user survey instrument and distributed it to registered users. We also conducted hybrid interviews with registered users, covering both usability issues of DeLIver and how DeLIver fits into their work practices.

Public use of the Testbed has continued to grow, and the initial login and registration issues appear to have been resolved. In cooperation with the Testbed Team, work has continued on developing a useful SQL database for storing and querying transaction log and registration data, and on learning to access and manipulate this data. A preliminary list of research questions has been developed, and several reports have been generated from the transaction log data.

Team members finalized a user survey instrument that will provide summative data on the extent and nature of testbed use, the level of user satisfaction with the system, and reasons for use and nonuse. The survey was sent in two waves, to a total of 951 people. We are pleased with the 23.8% response rate. The data will be entered and the primary analysis completed this summer, but a preliminary look at completed surveys has already proved fruitful.

An exit poll was also developed and implemented this quarter. It appears after a specified interval during search sessions and asks users about their impressions of the system, what they are currently doing with it, and how they are progressing with their task.

One student cooperated with our Team this semester on a set of "situated usability" interviews. A total of 12 DeLIver users discussed their impressions of the system, any usability issues they had noticed, and how DeLIver fit with their work practices. These interviews provided an opportunity to gather in-depth information about how and why DeLIver was or was not being used by a wide variety of users from different targeted disciplines. These data were analyzed and synthesized into a report, which was circulated to different members of the DLI project, including the Testbed Team.

Ann Bishop continued her work with Barbara Buttenfield (Alexandria Digital Library) and Nancy Van House (Berkeley DLI project) to produce an edited volume titled The Use of Digital Libraries: Social Aspects of Design and Evaluation. Ten authors have committed chapters and submitted abstracts, outlines, or drafts. Investigator Leigh Star is one of the contributing authors. A number of the authors will be meeting at DL 98 to discuss further progress.

Our team has been quite successful this quarter in furthering and presenting to the public our basic research on article component use and on how people interact with new, hybrid information systems like our DL. The Social Science Team's web site is up to date and has added links to several of our papers and presentations.

Laura Neumann and Ann Bishop presented a talk titled "From Usability to Use: Measuring Success of Testbeds in the Real World" at the 35th Annual GSLIS Clinic held at the University of Illinois in March. Ann Bishop presented "Studying Use and Users: Approach, Findings and Lessons" at the Advances in Digital Libraries Conference and "Understanding Use in the Real World" at the IEEE Socioeconomic Dimensions of Electronic Publishing Workshop in Santa Barbara in April. Neumann and Ignacio also had a paper accepted for the ASIS 98 conference. It is titled "Trial and Error as a Learning Strategy in System Use," and is about how new users learn to use digital libraries.

Leigh Star presented a talk in Paris, and presented an overview of digital library and LIS programs in the US to Jean Ganacio, Cognitive Science Director of CNRS, and representatives of France Telecom, LIMSI (the major HCI group in France) and the GIS group at the Universite de Paris. She met extensively with Bill Turner, LIMSI/CNRS information directorate, about future collaboration between CNRS and UIUC on digital libraries and the development of information science projects. We have discussed the creation of a final report from our team and plan to work primarily on synthesizing data for it over the summer.

RESEARCH

 

Development and Evaluation of Noun-phrase Parsers

The Arizona noun phraser (AZ Phraser) was tested for accuracy. The tests were performed on MEDLINE abstracts, and the results of the AZ Phraser were compared with those of Lingsoft's NPtool. Many improvements have been made to the AZ Phraser since it was last tested. It is considerably faster than previous versions, and it now produces interim phrases. The extracted noun phrases have improved precision, although NPtool still has better recall.

 

Multimedia Concept Extractor

The Multimedia Concept Extractor (MCE) continued to be developed and debugged. In particular, development tools were employed to eradicate memory leaks, and portions of the MCE were re-implemented to improve memory management when parsing large collections of documents. The AZ Phraser was also integrated into the MCE, bringing the number of noun-phrase extraction algorithms implemented in the Interspace Prototype to three. Extraction efficiency was improved by optimizing routines to perform input at the collection level rather than processing documents field by field. Finally, in preparation for integrating concept extraction from images into the MCE, we began investigating the image segmentation algorithm employed by our Arizona partner.

 

Category Map Service

The Category Map service continued to be updated with new optimizations, including the following: neighborhood identification optimized using a lookup table; refactoring of the object model; and optimizations in memory allocation and deallocation. Stubs were added to support texture processing, and initial preparations were made for the National Computational Science Alliance conference. Development of advanced cluster identification algorithms continued, and cMap was successfully integrated with the Interspace Prototype Domain Manager.
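One way a lookup table can speed up neighborhood identification on a category map is to precompute each node's neighbors once instead of recomputing grid distances on every update. The sketch below is an assumption about how such a table might look, not the Category Map service's actual code; the function name, the rectangular grid, and the square (Chebyshev) neighborhood are all hypothetical choices.

```python
def build_neighborhood_table(rows, cols, radius):
    """Precompute, for every map node, the nodes within `radius` grid steps.

    Uses square neighborhoods on a rectangular grid, which is an
    assumption; the report does not state the map topology.
    """
    table = {}
    for r in range(rows):
        for c in range(cols):
            table[(r, c)] = [
                (nr, nc)
                for nr in range(max(0, r - radius), min(rows, r + radius + 1))
                for nc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
    return table


# Usage: build the table once, then look neighbors up in constant time.
neighbors = build_neighborhood_table(rows=10, cols=10, radius=1)
print(neighbors[(0, 0)])
```

The table trades memory for time: it is built once per radius, and every later neighborhood query becomes a dictionary lookup.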

 

Tier 1 Analysis Environment

Development of the Interspace Services Technology Demonstrator (ISTD) continued. ISTD is now operational and accesses results of Interspace Services computations that have been imported into the Versant object store. At this time ISTD functionality is primarily a reflection of inherent Interspace Service object characteristics. Value-added capabilities such as experimental forms of Vocabulary Switching will be added in the final quarter to complete Phase I of ISTD development. With Phase I nearing completion, planning for Phase II ISTD development is also under way.

 

Concept Space Service Development

This quarter saw the first successful link of the Concept Space generation service, the MCE, the Category Map generation service, and the Domain Manager.

 

Domain Manager Development

Initial implementation of a database location service, which will be used by the Interspace Kernel, was completed. A series of experiments was performed to characterize the performance of the Prototype as a whole using collections of various sizes as input. Based on the results of this performance characterization, the object model was extensively optimized, significantly improving performance. Highlights of the optimization include reduced complexity of data structures, use of red-black trees, and optimization of memory management in Versant by "pinning" objects in (and "unpinning" objects from) physical memory based on temporal locality of reference.
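Pinning by temporal locality can be illustrated with a small least-recently-used policy: recently touched objects stay pinned in memory, and the object that has gone unused longest is unpinned when space runs out. This sketch does not use the Versant API; the `PinnedObjectCache` class, its `load` callable, and the fixed capacity are assumptions made purely for illustration.

```python
from collections import OrderedDict


class PinnedObjectCache:
    """Keeps the most recently used objects pinned (illustrative only)."""

    def __init__(self, capacity, load):
        self.capacity = capacity      # maximum number of pinned objects
        self.load = load              # hypothetical callable: id -> object
        self._pinned = OrderedDict()  # ordered oldest use -> newest use

    def get(self, object_id):
        if object_id in self._pinned:
            # Recently referenced objects are likely to be needed again,
            # so refresh this object's position instead of reloading it.
            self._pinned.move_to_end(object_id)
            return self._pinned[object_id]
        obj = self.load(object_id)    # fetch from the object store and pin
        self._pinned[object_id] = obj
        if len(self._pinned) > self.capacity:
            self._pinned.popitem(last=False)  # unpin least recently used
        return obj

    def pinned_ids(self):
        return list(self._pinned)
```

A real object store would also need to handle dirty objects and concurrent access, which this sketch leaves out.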

 

In terms of the Analysis Environment, indexes were also implemented on the representation of concepts and on the field types associated with a concept in a given document.

 

This quarter also saw the first complete domains generated fully automatically by the Domain Manager.

 

Concept Assigner

Development of the Concept Assigner, based on Versant C++, continued. Work this past quarter included the integration of the modified Interspace Services Object Model into the Concept Assigner, and integration of the Assigner with the Interspace Prototype Domain Manager.

 

In order to evaluate the performance of the Concept Assigner, a precision/recall study was conducted. Fourteen graduate students, half from Computer Science and half from Library and Information Science, participated in the evaluation. The document database consisted of records drawn from the Compendex 723.1.1 classification, Programming Languages. A total of 76 documents were evaluated in the experiment.

 

Participants in the study were given freedom to choose a subject domain of expertise within Programming Languages. A minimum of five documents dealing with the subject of their choice were evaluated. For each document, subjects were asked to first examine the title and abstract carefully and, based on their domain expertise, provide up to 20 index terms appropriate to the abstract. Subjects were free to choose such terms based on their own recall of relevant terms, the text of the title or abstract, etc.

 

Secondly, subjects were asked to evaluate a lexically ordered list of index terms consisting of terms assigned by Compendex professionals combined with terms generated automatically by the Concept Assigner. For each indexing term, subjects were required to judge the term as highly relevant, relevant, neutral, or not relevant to the "aboutness" of the document.

 

Term precision and term recall were used to measure the quality of each of four indexes: Controlled Vocabulary (CV), Free Text (FL), Concept Assigner (AUTO), and user generated (USER). Term precision is defined as the percentage of terms judged relevant out of the total number of terms in the given index. Term recall is defined as the percentage of terms judged relevant out of the total relevant terms. For the purposes of this study, we have defined total relevant terms as the sum of the relevant terms from each of the four indexes CV, FL, AUTO, and USER. Table 1 below summarizes the results of this study.

 

Table 1. Term precision and term recall for each index

            CV      FL      AUTO    USER
Precision   0.610   0.728   0.495   1.000
Recall      0.177   0.183   0.584   0.319
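The definitions of term precision and term recall given before Table 1 can be sketched in a few lines. The sample terms and judgments below are invented for illustration and are not data from the study; collapsing the four-level relevance scale into a single relevant / not relevant flag is also an assumption.

```python
def term_precision(judgments):
    """Fraction of an index's terms that were judged relevant.

    `judgments` maps each term in the index to True (judged relevant)
    or False (not relevant).
    """
    if not judgments:
        return 0.0
    return sum(judgments.values()) / len(judgments)


def term_recall(index_name, indexes):
    """Relevant terms in one index divided by total relevant terms.

    Following the study's definition, the total is the sum of the
    relevant-term counts over all indexes.
    """
    total_relevant = sum(sum(j.values()) for j in indexes.values())
    if total_relevant == 0:
        return 0.0
    return sum(indexes[index_name].values()) / total_relevant


# Hypothetical judgments for a single document (not study data).
sample = {
    "CV": {"compilers": True, "software": False},
    "FL": {"parsing": True},
    "AUTO": {"parsing": True, "lexical analysis": True,
             "paper": False, "method": False},
    "USER": {"parser generators": True},
}
for name in sample:
    print(name, term_precision(sample[name]), term_recall(name, sample))
```

In this toy case AUTO contributes the most relevant terms, so it has the highest recall, but it also carries irrelevant terms, so its precision is lower than that of the user-chosen terms.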

The fourth column in Table 1 represents user-selected keywords, and thus has the highest possible precision (100%). Of the remaining indexes, terms in the FL field were rated more precise (73%) than those in the CV field (61%). The automatically generated index had the lowest precision of the four (50%).

 

The three precision results for the CV, FL, and AUTO indexes are within the range expected for good retrieval performance [Jones 1981]. In addition, we note that the precision of the CV and FL indexes is indicative of the competitive performance of natural language terms in indexing.

 

As expected, the recall of the Concept Assigner is higher than that of the three human-generated indexes (CV, FL, USER). A recall of 58% is more than three times the recall of the professionally generated CV and FL indexes, and about twice that of the domain expert subjects. Of the four indexes, only the automatically generated index falls within the recall range expected for good retrieval performance [Jones 1981].
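The recall comparison can be checked directly against the recall row of Table 1. The short calculation below uses only the reported values; the variable names are ours.

```python
# Recall values as reported in Table 1.
recall = {"CV": 0.177, "FL": 0.183, "AUTO": 0.584, "USER": 0.319}

# Ratio of the Concept Assigner's recall to each human-generated index.
ratios = {
    name: round(recall["AUTO"] / value, 2)
    for name, value in recall.items()
    if name != "AUTO"
}
print(ratios)  # {'CV': 3.3, 'FL': 3.19, 'USER': 1.83}
```

The ratios of about 3.3 and 3.2 against CV and FL, and about 1.8 against USER, match the "more than three times" and "about twice" statements.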

 

A full report on the Concept Assigner is available on the CANIS Web pages.

 

[Jones 1981] K. Sparck Jones, ed. Information Retrieval Experiment. London: Butterworths, 1981.

 

Semantic Interoperability GIS

The AI Lab at the University of Arizona continued development of the Geographic Information System (GIS). The GIS demonstrates semantic interoperability across multimedia sources including text, images, and structured numerical data.

 

This past quarter, the system has been redesigned with two major improvements. The first is to the user interface (UI). The new design calls for an integrated interface for users to input data of various types. Issues in consistency and ease of use are also addressed in the new design.

 

The second improvement is in the integration of knowledge sources. Tight integration and interoperation among the underlying knowledge sources (text, image, and numeric data) is considered desirable for the extensibility and maintainability of the system. In the proposed design, semantic interoperation between media types will be transparent to GIS users. Within this framework, however, users retain control over the media type in which results are displayed.

 

In terms of current capabilities for visualizing aerial photos, a number of performance tests were conducted using thumbnails of image tiles at varying resolutions and dimensions. Experiments have indicated that it is possible to reduce the size from the order of 10 MB to the order of 10 KB while still preserving viewing clarity.

 

In the area of automatic categorization of images, work continued on the optimization of feature extraction. The most time-consuming step in the process of image analysis has been feature extraction. Attempts to optimize the implementation of the algorithm resulted in no increases in efficiency. The current implementation analyzes approximately 100 images per day (aerial photos ~32 MBytes each) on a 32-node SGI Origin2000 supercomputer. In order to improve performance, the application has been ported to and is currently being tested on a 312-node Cray T3E at a DoD MSRC. It is projected that the feature extraction process will be at least 20 times faster on this architecture.
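The reported throughput and the projected speedup imply the following rough figures. This is back-of-the-envelope arithmetic only; it assumes the projected factor of 20 applies uniformly to the whole feature extraction run, which the report does not state, and the function names are ours.

```python
IMAGES_PER_DAY = 100     # reported throughput on the 32-node SGI Origin2000
IMAGE_SIZE_MB = 32       # approximate size of one aerial photo
PROJECTED_SPEEDUP = 20   # reported lower bound projected for the Cray T3E
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_image(images_per_day):
    """Average wall-clock time spent on one image."""
    return SECONDS_PER_DAY / images_per_day


def projected_images_per_day(images_per_day, speedup):
    """Throughput if the whole process speeds up by `speedup`."""
    return images_per_day * speedup


current_time = seconds_per_image(IMAGES_PER_DAY)
projected_rate = projected_images_per_day(IMAGES_PER_DAY, PROJECTED_SPEEDUP)
projected_time = seconds_per_image(projected_rate)

print(f"current:   {current_time:.1f} s per image, "
      f"{IMAGES_PER_DAY * IMAGE_SIZE_MB} MB per day")
print(f"projected: {projected_time:.1f} s per image, "
      f"{projected_rate * IMAGE_SIZE_MB} MB per day")
```

Under these assumptions the average time per image would drop from about 864 seconds to about 43 seconds, or at least 2,000 images per day.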

 

The normalization algorithm for analyzing a set of image tiles was also improved. This algorithm is applied to the feature values of each tile. Initial results show that the new algorithm improves the automatic categorization of image tiles when using Kohonen's SOM.

 

Finally, we began investigating the use of the NCSA Hierarchical Data Format (HDF) as a basis for creating a persistent object store for the various knowledge sources. HDF has been used extensively by NASA, among other organizations, for storage and retrieval of large image datasets, and we expect to benefit from the High Performance HDF 5 Persistent Object Store developed at the CANIS Lab.

INTERNET

Changes to the architecture of IODyne were planned to allow for use of Z39.58 Common Command Language search group syntax by searchers and for use of Gazebo, the new retrieval gateway implemented by the NCSA EMERGE group. Changes were also planned to allow for simultaneous, asynchronous communication with multiple servers, including Gazebo, Z39.50, and OpenText.
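Simultaneous, asynchronous communication with several servers can be sketched with coroutines that are all started before any result is awaited. This is not IODyne code: `query_server` is a stand-in that only simulates network latency, and the delays and function names are assumptions.

```python
import asyncio


async def query_server(name, query, delay):
    """Stand-in for a real protocol client; `delay` simulates latency."""
    await asyncio.sleep(delay)
    return name, f"results for {query!r} from {name}"


async def search_all(query, servers):
    """Send the query to every server at once and collect all replies."""
    tasks = [query_server(name, query, delay)
             for name, delay in servers.items()]
    # gather() runs the coroutines concurrently, so the total wait is
    # close to the slowest server rather than the sum of all delays.
    return dict(await asyncio.gather(*tasks))


if __name__ == "__main__":
    # Hypothetical latencies in seconds for the three kinds of server.
    servers = {"Gazebo": 0.03, "Z39.50": 0.01, "OpenText": 0.02}
    print(asyncio.run(search_all("digital libraries", servers)))
```

A production client would also need per-server timeouts and error handling so that one unresponsive server does not hold up the others.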

PUBLICATIONS

Bishop, Ann Peterson. (in press). "Understanding Use in the Real World." In Proceedings of the IEEE Socioeconomic Dimensions of Electronic Publishing Workshop, Santa Barbara, CA. Piscataway, NJ: IEEE.

Bishop, Ann Peterson. (in press). "Digital Libraries and Knowledge Disaggregation: The Use of Journal Article Components." In DL '98: Proceedings of the 3rd ACM International Conference on Digital Libraries. New York: ACM.

H. Chen, J. Martinez, A. Kirchhoff, T. D. Ng, and B. R. Schatz, "Alleviating Search Uncertainty Through Concept Associations: Automatic Indexing, Co-occurrence Analysis, and Parallel Computing," Journal of the American Society for Information Science, Special Issue on "Management of Imprecision and Uncertainty in Information Retrieval and Database Management Systems," Volume 49, Number 3, Pages 206-216, 1998.

Neumann, L. J. and E. Ignacio. (in press). "Trial and Error as a Learning Strategy in System Use," to appear in the American Society for Information Science, Annual Conference, October 26-29, Pittsburgh, PA.

Neumann, L. J. and A. P. Bishop. (in press). "From Usability to Use: Measuring the Success of Testbeds in the Real World," in Proceedings of the 35th Annual GSLIS Clinic, March 22-24, 1998. Urbana, IL.

M. Ramsey, T. Ong, and H. Chen, "Multilingual Input System for the Web -- an Open Multimedia Approach of Keyboard and Handwritten Recognition for Chinese and Japanese," Proceedings of IEEE Advances in Digital Libraries Conference (ADL '98), Santa Barbara, CA, April 22-24, 1998.

GRANTS AWARDED

Chen, H. NSF award of $274,163 for the grant "An Intelligent CSCW Workbench: Analysis, Visualization, and Agents," June 1, 1998 to May 31, 2001.

PRESENTATIONS

Bishop, Ann P. "Studying Use and Users: Approach, Findings and Lessons", Advances in Digital Libraries, IEEE, Santa Barbara, CA, Apr. 24, 1998.

Bishop, Ann P. "Understanding Use in the Real World", Socioeconomic Dimensions of Electronic Publishing Workshop, IEEE, Santa Barbara, CA, Apr. 24, 1998.

Chen, Hsinchun. "Semantic Retrieval Issues for Digital Libraries", UIUC DLI Partners Workshop/GSLIS 35th Annual Workshop: Successes and Failures of Digital Libraries, Urbana, IL, March 23, 1998.

Chen, Hsinchun. "Intellectual Capital and Knowledge Management: A Perpetual Self-Organizing (PSO) Approach", Creative Approaches in Project Management NASA Meeting, Hagerstown, MD, April 20-24, 1998.

Chen, Hsinchun. "Multilingual Input Systems for the Web -- an Open Multimedia Approach of Keyboard and Handwriting Recognition for Chinese and Japanese", Advances in Digital Libraries '98, Santa Barbara, CA, April 22-24, 1998.

Gross, Ben. "The History of the First Digital Library Initiative: Reflections and Implications for the DLI Phase II Program", March 13, 1998, Rutgers University.

Gross, Ben. "Information Analysis Environments for Digital Libraries", Workshops on High Bandwidth Applications Using Internet 2 / vBNS: Workshop on Digital Libraries, Ohio Supercomputing Center, Columbus, OH, March 26, 1998.

Harum, Susan. "Establishing Partnering Relationships for Digital Libraries", UIUC DLI Partners Workshop/GSLIS 35th Annual Workshop: Successes and Failures of Digital Libraries, March 23, 1998, Urbana, IL.

Johnson, Eric. "IODyne: an Interface for Distributed Repositories", NCSA Campus Day, May 7, 1998, Champaign, IL.

Mischo, William. "UIUC DLI Project -- Testbed Overview", Common Solutions Group, Pennsylvania State University, May 25, 1998.

Mischo, William and Timothy Cole. "UIUC DLI Project -- Testbed Overview", Notre Dame University, Notre Dame, IN, May 4, 1998.

Mischo, William and Timothy Cole. "UIUC DLI Project -- Recap & Lessons Learned", Notre Dame University, Notre Dame, IN, May 5, 1998.

Mischo, William and Timothy Cole. "Trends in Academic Librarianship & Digital Library Technology", Notre Dame University, Notre Dame, IN, May 5, 1998.

Mischo, William. "UIUC DLI Project: Testbed Overview", Illinois Association of College Research Libraries (IACRL), April 23, 1998, Matteson, IL.

Mischo, William. "Processing and Access Issues for Full-text Journals", UIUC DLI Partners Workshop/GSLIS 35th Annual Workshop: Successes and Failures of Digital Libraries, March 23, 1998, Urbana, IL.

Mischo, William. "UIUC DLI Project: Testbed Overview", Oxford University Press, April 2, 1998, Oxford, UK.

Mischo, William. "UIUC DLI Project: Testbed Overview", Workshop on Scientific Publishing, International Council for Science (ICSU), March 30, 1998, Oxford, UK.

Mischo, William, and Schlembach, Mary. "UIUC DLI Project: Testbed Overview", Northern Illinois Library System, Feb. 20, 1998, DeKalb, IL.

Mischo, William, and Cole, Timothy. "A Production Focus - User Perspectives", Symposium on Building Digital Collections, University of Iowa Libraries, December 12, 1997, Iowa City, Iowa.

Neumann, Laura. "From Usability to Use: Measuring the Success of Testbeds in the Real World", UIUC DLI Partners Workshop/GSLIS 35th Annual Workshop: Successes and Failures of Digital Libraries, March 23, 1998, Urbana, IL.

Pottenger, Wm. "Computing the Future of Medical Informatics", Alliance '98, April 27, 1998, Champaign, IL.

Star, S.L. "Infrastructure and Practice," to the Maison de Sciences de l'Homme Information Science Ph.D. Colloquium, Paris, May 11, 1998.

Star, S.L. Lecture series to Program in Technology and Social Change, Tema Institute, Linkoping University, Linkoping, Sweden, April, 1998.

Star, S.L. "Things Perceived as Real are Real in their Consequences: Categories as Technology," Inaugural Campuswide Lecture, Science, Technology and Society, and "Layers of Silence, Arenas of Voice: The Ecology of Visible and Invisible Work," to the Science, Technology and Society Faculty, Iowa State University, Ames, March, 1998.

Schatz, Bruce. "The Technologies of Scalable Semantics: Digital Libraries in the 21st Century", Alliance '98, April 27, 1998, Champaign, IL.

Schatz, Bruce. "UIUC DLI Project Retrospective", UIUC DLI Partners Workshop/GSLIS 35th Annual Workshop: Successes and Failures of Digital Libraries, March 23, 1998, Urbana, IL.

Wedgeworth, Robert. "Transferring Digital Technologies to the Field: Legal, Technical and Organizational Issues", UIUC DLI Partners Workshop/GSLIS 35th Annual Workshop: Successes and Failures of Digital Libraries, March 23, 1998, Urbana, IL.

VISITORS

April 10

Marina Tyapkina, Information Specialist, Catholic University of America, Russia.
Elif Aytek Kaynak, Information Specialist, Scientific and Technical Research Council of Turkey, National Academic Network and Information Center

April 23

Anna Abrahamian, American University of Armenia, Armenia
Galina Abramova, MIRAS University Library System, Kazakhstan
Inna Babayeva, Small & Medium Enterprise Development Agency, Azerbaijan
Nada Bezic, Croatian Music Institute, Croatia
Lenka Danevska, Central Medical Library, Macedonia
Gansukh Ganjav, National Library of Mongolia, Mongolia
Irina Kuznetsova, Samara Region Research Library, Russia
Miroslawa Modrzewska, Gdansk Technical University, Poland
Miriam Peknikova, Comenius University, Slovakia
Zijad Sarajic, Public and University Library of Tuzla, Bosnia-Herzegovina
Merita Selmanllari, Scientific University Library, Albania
Maia Vasadze, Tbilisi State University Library, Georgia
Alexei Falaleev, Vladivostok State University of Economics, Russia
Marta Hernandez, Supreme Court Library, El Salvador
Elif Aytek Kaynak, ULABKBIM - National Academic Network and Information Centre, Turkey
Barchinoy Kholbekova, Educational Advising Center, Uzbekistan
Lybny Oziel Mejia Romero, Instituto Guatemalteco Americano, Guatemala
Hasnaa Mahgoub, Menoufia University, Egypt
Abdullahi Musa Ibrahim, Bayero University Kano, Nigeria
Maria Teresa Norori Paniagua, Instituto Nicaraguense de Fomento Municipal, Nicaragua
Priya Prak, Royal University of Phnom Penh, Cambodia

May 5

Ben Jeffryes, research manager, Schlumberger

May 11

Mike Casey, Head of Electronic Publishing, Kluwer