Our project continues on its goal of bringing "Search to the Net". The strategy is to develop technology in the testbed, evolve it into an effective digital library with a large collection and many users, perform sociological evaluation of the usage and its context to understand what is effective, then develop the infrastructure to propagate a generic version of this software into the Net. Concurrently, longer-term research is developing technology with deeper semantic retrieval and prototyping embedding this into complete new information systems.
We have refined this goal into a more concrete plan of developing "Semantic Federation of Distributed Repositories of Scientific Literature". The digital library consists of articles from engineering and science journals and magazines. The articles are completely marked up in SGML and this document structure is used to improve the search. The repositories are obtained in a continuous pipeline directly from the publishers into a production library setup. They are completely indexed to support full-text search. Currently, all repositories are maintained within our testbed setup, but we have already begun training the publishers to directly maintain their own repositories.
The semantic federation in the testbed is achieved by integration of search and of display. For search, we have defined a common set of tags into which the tags from each publisher are transformed (normalization into a canonical set). For display, we support multiple views to enable search comparing different indexes and different collections simultaneously. The technology research is investigating deeper semantics, by processing the content instead of the structure. To develop scalable technology, our efforts are concentrating on co-occurrence frequency to record the context of terms within documents in the collections.
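The normalization step described above can be sketched as a per-publisher lookup table. This is a minimal illustration only: the publisher tag names and the canonical tag names below are hypothetical, since the actual DTD elements and canonical set are not listed in this report.

```python
# Hypothetical per-publisher tag tables; the real DTD element names and the
# project's canonical tag set are not given in this report.
CANONICAL_TAGS = {
    "aps": {"ti": "title", "au": "author", "abs": "abstract"},
    "ieee": {"atl": "title", "aug": "author", "abst": "abstract"},
}


def normalize(publisher, fields):
    """Map (tag, text) pairs from one publisher into canonical (tag, text) pairs."""
    table = CANONICAL_TAGS[publisher]
    # Tags with no canonical equivalent are left out of the shared index.
    return [(table[tag], text) for tag, text in fields if tag in table]
```

With tables of this kind, a single query on the canonical tag "title" can be run against articles from every publisher, whatever tag each publisher originally used.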
The more detailed descriptions below give the status for the Testbed (collection and infrastructure), the Technology (research), and the Evaluation (sociology). The appendixes give the status of our partners and our papers and presentations for this quarter.
We (Hsinchun Chen and Bruce Schatz) are editing a special issue of IEEE Computer on Large-Scale Digital Libraries, which we expect to become a standard external package for explaining the field in general and the DLI in particular. The goal is to include the DLI projects and a few others (such as the CSTR projects) with self-contained articles. The submissions are due September 1 with the issue to appear May 1996. We have been in contact with all the other DLI projects and expect papers from them. Since the IEEE Computer Society is one of our principal publisher partners, we also expect to provide a significant electronic demonstration component in addition to the print version.
Testbed activities have continued to focus on: 1) streamlining document transfer procedures and setting up document servers for the APS, AIP, IEEE Computer Society, IEEE and ASCE full-text SGML; 2) enhancing software procedures for more efficient SGML processing (in particular, to capture images, handle entities, and add necessary metadata to each document); 3) enhancing software that normalizes the heterogeneous publishers' DTDs for cross-publisher indexing, search, and retrieval; 4) modifying SGML indexing software (utilizing the ability of SGML to indicate content and structure of a document) and inserting the data into an SQL database structure; 5) testing the efficacy of the SQL database structure; 6) designing and testing user interface software modules for full-text retrieval; 7) exploring search tree searching and retrieval models; and 8) constructing style sheets for the SoftQuad Panorama viewer that are consistent with the published articles. We have hired a full-time database administrator, Robert Ferrer, to maintain the publisher pipeline. The testbed principals continue to be Bill Mischo (search) and Tim Cole (collection).
Concentration of the work in the Grainger Library continues to be on establishing the indexed SGML database for the digital library collection. The current prototype, which will shortly begin limited production, is based on a full-text engine built on top of the Microsoft SQL server running on an IBM RS6000. This is being used on an experimental basis.
The production system will use OpenText, which was chosen after careful consideration of all available large-scale full-text SGML search packages, largely due to its superior APIs for handling raw SGML. The OpenText database management software has been purchased as a production system for the full-text journal testbed. Work has begun on database loading and client configuration. We expect to utilize the OpenText APIs in a client-server retrieval environment.
The production machines will be computers purchased as part of a $1M equipment grant to the DLI project from Hewlett-Packard. The indexes and search engine will reside on a J200 server running OpenText, and the documents themselves will be stored on 735/100 workstations running http servers.
The DLI Project has also become a member of the Electronic Book Technologies (EBT) University Grant program. We will be receiving the full EBT software suite at no charge and will be evaluating its effectiveness for retrieval and display of heterogeneous SGML documents.
The Testbed team has entered into a Memorandum of Understanding with the University of Michigan DLI Interface team to incorporate the Search Tree retrieval techniques first developed by Karen Drabenstott for the ASTUTE Project into the UIUC Testbed interface design. This agreement is expected to foster close cooperation between the Illinois and Michigan Testbed groups, and provide a means for testing the efficacy of various interface designs across several DLI projects.
In a separate collaboration with other DLI projects, we have become one of the partners in the Interoperability experiment, which was the subject of a DLI supplement grant to Stanford. We wrote a letter of support for this grant proposal and have had several extended conversations at the PI level about what is involved. Our role is to provide information about our search engine interface so that Stanford can write an object wrapper for it and enable syntactic protocol translation among Stanford, ourselves, and the other participant, the Michigan DLI.
Testbed personnel have also become involved in the Museum Educational Site Licensing Project (MESL), funded by the Getty Foundation, which will provide indexing, retrieval, and display of some 7,000 bit-mapped images (and accompanying textual descriptions) from six major art museums across the country. This is in support of digital library projects in the arts and humanities at the University of Illinois, which hope to use the technology being developed by the DLI for engineering and science. The MESL project will also be the focus of the image processing research to be performed as part of the DLI supplement grant to our project (principal investigator Tom Huang in the Beckman Institute in collaboration with Beth Sandore in the University Library).
Work on building a Net interface to the digital library testbed continued in design and experiment phases. This work consists of a multiple view user interface with session control being developed by Eric Johnson in the Library and a stateful gateway for SQL/Z39.50 being developed by Jason Ng in NCSA. For the former interface, design is continuing on an architecture for multiple simultaneous connections to distributed repositories. Much of this work has involved genericizing the current interface prototype so that it can submit queries appropriate to each respective repository. An HTML forms client is still being planned, but the major thrust of the work will now be toward development of the "Java" client, which will support multithreaded multiple connections with multiple views on all major platforms (PCs, Macs, X Windows). For the latter gateway, preliminary connections to the SQL server and to the OpenText engine have been written and work is continuing to design and implement a complete state saving and session recording facility.
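The idea of multiple simultaneous connections to distributed repositories can be illustrated with a short sketch. It is written in Python for brevity rather than as the planned client, and it assumes each repository is represented by a hypothetical search function that takes a query string and returns a list of results.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(repositories, query):
    """Send one query to every repository concurrently.

    `repositories` maps a repository name to a search callable (hypothetical
    stand-ins for real connections). Returns results keyed by repository name.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(repositories))) as pool:
        # One thread per repository, so a slow repository does not delay the others.
        futures = {name: pool.submit(search, query) for name, search in repositories.items()}
        return {name: future.result() for name, future in futures.items()}
```

Keeping the results separated by repository name is what lets an interface present them in multiple views side by side.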
This quarter also saw the initiation of an active discussion group on generic repository design. This group meets weekly to discuss issues, analyze papers, and host visitors. The group is hosted by Dan LaLiberte in NCSA and regular attendees include members of all technology parts of the DLI and related local projects, including the testbed collection (Cole), Net interfaces (Johnson), Net gateways (Ng), Web servers (Frank), object stores (McGrath), analysis environments (Powell), among others. This activity represents the initial stages of defining, designing, and implementing a 'repository' that extends the notion of a server to include all the associated services that go along with serving documents. These services include depositing, verifying, registering, indexing and searching documents, and notifying people or agents when events of interest occur. The architecture will have public protocols between modules so that independent developers and vendors can supply value-added replacement implementations of each module. Upon this infrastructure, we can begin to build many additional inter-repository services, such as link integrity checking, shared or delegated responsibilities for indexing and searching, and server-directed replication for scalability and reliability.
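As a rough illustration of a repository that bundles several of these services, the sketch below implements depositing, indexing, searching, and notification in memory. The class and method names are hypothetical and are not taken from the group's design, which is still being defined.

```python
class Repository:
    """Toy in-memory repository: deposit, index, search, and notify."""

    def __init__(self):
        self.documents = {}    # document id -> text
        self.index = {}        # word -> set of document ids
        self.subscribers = []  # callables invoked with the id of each new deposit

    def subscribe(self, callback):
        """Register an agent to be notified when a document is deposited."""
        self.subscribers.append(callback)

    def deposit(self, doc_id, text):
        """Store a document, index its words, and notify subscribers."""
        self.documents[doc_id] = text
        for word in text.lower().split():
            self.index.setdefault(word, set()).add(doc_id)
        for callback in self.subscribers:
            callback(doc_id)

    def search(self, word):
        """Return the ids of documents containing the word, in sorted order."""
        return sorted(self.index.get(word.lower(), set()))
```

In an architecture with public protocols between modules, each of these methods would correspond to a separately replaceable service rather than a method on one class.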
The Technology Research efforts continued as a number of independent longer-term activities intended to develop fundamental new infrastructure which can be used in next generation digital libraries, including the testbed as appropriate. These include the major issues of future digital library research as defined by the IITA workshop report, for which PI Schatz served as the chair of the group on the library perspective. In particular, the focus is on scalable semantic retrieval (co-PIs Hsinchun Chen and Bruce Schatz) and on scalable distributed objects (co-PIs Roy Campbell and Charlie Catlett).
** Semantic Retrieval **
The semantic retrieval research based at the University of Arizona (Hsinchun Chen) continued to use the NCSA supercomputers to generate concept spaces. The major push is towards fully automatic procedures that work in multiple subject domains, including non-technical ones.
A computer science concept space was created using NCSA's SGI Power Challenge and 400,000 INSPEC computer engineering abstracts. Several papers on this work are currently under review and a WWW server called CSQuest was created, at http://ai.bpa.arizona.edu/cgi-bin/csquest . CSQuest contains a searchable computer engineering concept space (thesaurus) generated automatically using the NCSA 16-node SGI Power Challenge supercomputer. It includes about 280,000 computer engineering terms and 10M links. Users can use this server to identify other relevant search terms when searching Internet computer engineering servers or homepages (e.g. CS Technical Report sites).
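To make the notion of terms and links concrete, here is a minimal sketch of a co-occurrence count over a document collection. It is a plain document-level count, not the project's concept space algorithm, whose term weighting and filtering are not detailed in this report; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_links(documents):
    """Count, for each pair of terms, how many documents contain both.

    `documents` is a list of term lists, one per document.
    """
    links = Counter()
    for terms in documents:
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(set(terms)), 2):
            links[(a, b)] += 1
    return links


def related_terms(links, term, top=5):
    """Return terms that co-occur with `term`, most frequent first."""
    scores = Counter()
    for (a, b), count in links.items():
        if a == term:
            scores[b] += count
        elif b == term:
            scores[a] += count
    return [other for other, _ in scores.most_common(top)]
```

A lookup such as `related_terms(links, "neural")` then plays the role of suggesting other relevant search terms for a query.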
To push towards scalable semantics, the same 400,000 INSPEC abstracts have been processed using automatic indexing only (no object filtering). Initial results are promising and a user evaluation experiment is under way. An incremental version of the concept space algorithms is under design and will be tested in September, 1995. The goal is to do fully automatic processing with no domain-specific knowledge.
To build a concept space for the remainder of our SGML document collection (the above was computer science and electrical engineering), a collection of 300,000 physics abstracts acquired from INSPEC is nearly ready for processing. A physics concept space will be created in October 1995.
Other concept spaces were generated to test the algorithms and processing. A collection of about 40,000 full-text CS technical reports was obtained from the CSTR sites. A CSTR concept space will be created and evaluated in August/September, 1995. A collection of about 13,000 full-text, tagged PTO (Patent and Trademark Office) computer-related patents (no copyright issue) was obtained. A CS-PTO concept space will be created and evaluated in September, 1995.
Finally, to test whether the concept space functionality is also effective in less structured domains, we are building concept spaces for Web documents on the Net (arbitrary topics and terminology and quality, unlike the journal articles in the engineering collection).
Using 10,000 entertainment homepages acquired from Lycos, we developed a prototype server which can potentially organize all homepages (URLs) on the Net. We are in the process of acquiring a 1+M URL collection and developing a complete and automatically created/updated Internet subject directory using the NCSA Convex Exemplar (see http://ai.bpa.arizona.edu/entertainment2/).
ET-R-US is a prototype WWW server which adopts a neural network approach to organizing Internet entertainment homepages (10,000 URLs). Our system indexes Internet homepages (like Lycos) and provides a graphical, multi-layered display of subject categories (like Yahoo). The server demonstrates the feasibility of designing a concept-based automatic categorizer suitable for organizing all Internet homepages (5.6+M). Our ongoing effort involves using NCSA's Convex Exemplar supercomputer to organize 1+M URLs on the Net.
A genetic-algorithm-based "intelligent" spider is under development. The spider will actively search Internet resources based on a searcher's preferences. The search will be heuristics-based and goal-driven and will consume significantly fewer resources on the Net (unlike the current "dumb" BFS/DFS-based spiders on the Net).
** Analysis Environments **
The semantic retrieval research based at the University of Illinois continued its design of a prototype Interspace environment (Bruce Schatz).
The encouraging results with concept spaces lead us to believe that a complete information system can be constructed with semantic retrieval at its foundations. Since supercomputers can be used as a "time machine" for everyday future processing, ordinary personal computers will be able to generate similar concept spaces 5-10 years hence, which will be essential infrastructure for the information systems possible in the Net of the twenty-first century. We are designing prototype systems for community repositories on the Net that can be readily accessed by researchers outside the community. These prototypes will demonstrate that a practical solution exists to building "analysis environments", where researchers can solve problems by correlating information from multiple sources across the network.
Information systems in the twenty-first century will directly support correlation of information across community repositories. Thus a user will deal with the Interspace rather than the Internet. The fundamental interaction will be in intersecting concept spaces of related terms across subject domains, which are extracted from information spaces of interlinked objects comprising community repositories. The Net will then enable Information Analysis rather than merely Document Transfer as at present. Every individual will have their own spaces and every community as well. (The term Interspace indicates interconnection of spaces, just as Internet indicates interconnection of networks.)
Schatz had a white paper published describing this Interspace vision at the CIC Forum on America in the Age of Information (http://interspace.grainger.uiuc.edu/america21.html). The CIC (Committee on Information and Communications) is the highest-level committee on NII for the federal government (one of nine that set science policy) and the parent committee of the IITA that produced the report describing the future agenda of digital library research. (Schatz chaired the working group on the Library Perspective for the IITA and was an active participant in producing the report.) He also gave an invited talk at the American Chemical Society workshop on the Future of Scientific Information, specifically on how concept spaces embedded into information systems will enable ordinary people to perform their own indexing so that their personal information can be located by others on the Net (A&I for the masses). (The viewgraphs are available at http://interspace.grainger.uiuc.edu.)
The prototype Interspace environment for the DLI will embed concept spaces into the infrastructure of an information system. The basic retrieval will be semantic matching to support information analysis. Navigation paths of relevant objects will be selected by the user and recorded by the system, then matched to related paths across community repositories using semantic retrieval on concept spaces.
An architecture document describing the goals of the Interspace environment (see http://interspace.grainger.uiuc.edu) has been written by Kevin Powell, supported on the DLI grant. The architecture of the Interspace environment is based fundamentally on objects and the grouping of objects into semantically related collections. The main body of the architecture deals with grouping objects in the space -- user classification, semantic matching, recording paths, analyzing collections.
We have begun the planning for the first prototype implementation of this architecture. This implementation is intended to be the Information Systems Research for both the DLI grant and for the Illinois NASA CAN grant (Schatz has a similar co-PI position on each of these grants). Charles Herring has been hired on the CAN grant to be the project leader for the implementation. He was previously a senior scientist at CERL and is an expert on object-oriented languages and environments. We are currently investigating the underlying tools (leaning towards Smalltalk and ObjectStore for the object environment and store respectively).
Our prototype will only implement a part of the full architecture. We will actively collaborate with other researchers on next generation research infrastructure to ensure that all the parts are covered. For example, taking the existing world of the Net and encapsulating it into uniform objects with uniform operations, creating wrappers for legacy applications, is being done in collaboration with the Infobus interoperability effort of the Stanford DLI project. We are concentrating on correlation of objects rather than objects per se. Thus the provision of access to data and programs across the Net to support transparent fetching and safe execution is being addressed through the project described below on object repositories. This project is being executed through arrangement with co-PI Catlett by Bob McGrath and Nancy Yeager at NCSA, who are working in collaboration with CNRI and Cornell on building a secure digital object store, satisfactory to support real-world terms and conditions. (See details below.)
** Computer Science Research **
In addition to the above information science research, we have also been undertaking computer science research into digital libraries. The goal here is to establish a sufficient base for objects in the Net so that the semantics can then be embedded on top. The IITA workshop identified objects and semantics as the major research topics in the digital library agenda. Our information science research above is concentrating on semantics, while our computer science research below is concentrating on objects. As with the information science research, our computer science research has a theoretical component to understand new algorithms and a practical component to understand embedding these algorithms into full-scale research prototypes.
On the computer science research, the theoretical component is supervised by co-PI Roy Campbell, a senior professor in the Computer Science Department at the University of Illinois, who has extensive experience with object-oriented operating systems and distributed networks. Initial reports on the progress of graduate students supported by the DLI are given below. The practical component is supervised by co-PI Charlie Catlett, manager of the Computing and Communications Group at NCSA. Initial reports on the progress of research programmers associated with the DLI are given below. The DLI is fortunate to gain the participation, funded by NCSA, of programmers who are experts in scalable WWW servers, Bob McGrath and Nancy Yeager, authors of a forthcoming book on the principles and practice of maintaining web servers.
* Distributed Object Infrastructure *
Research on Distributed Naming, the Web, and Digital Libraries is being carried out by graduate student Varna Puvvada (under co-PI Campbell). This research studies a distributed naming scheme for Digital Libraries based on naming objects on the Web using Uniform Resource Names (URNs). There are several schemes proposed in the literature and only one scheme (The Handle System by CNRI) implemented so far. Varna worked at CNRI this summer (1995) on a new implementation of the handle system.
The client libraries and server have been redone (see below). Changes to the caching server are in progress. The Digital Library applications require a solution to the problem of reverse mapping, and this needs further study.
CNRI (Corporation for National Research Initiatives) is sharing the source code of the entire handle system with NCSA and our research project in order to enable completion of the work on this system. This is part of an extensive collaboration between CNRI and the Illinois DLI project, as discussed further below.
The CNRI Handle System originally supported only UDP communication. The choice of UDP was a technical decision for improved performance. With this choice there was a problem with firewalls. Now the Handle System supports both TCP and UDP communications at the server side in such a way that the performance of the server is not deteriorated by the TCP query requests. The routing in this system is implemented by a hash table. The format of the hash table is now changed to support secondary servers and local handle servers. New data types were introduced for the system to be able to support local handle servers. These local handle servers provide autonomy as well as delegation of authority to organizations using the handle system for naming their objects.
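Hash-based routing with secondary servers can be sketched as follows. This is an illustration under stated assumptions, not the Handle System's actual table format or hash function, neither of which is described here; the table layout, the use of CRC-32, and the server names in the usage note are all hypothetical.

```python
import zlib


def route(handle, servers):
    """Pick the primary server and its secondaries for a handle by hashing its name.

    `servers` is a list of table entries, each a dict with a "primary" server
    name and a list of "secondaries" (a hypothetical layout).
    """
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for strings.
    slot = zlib.crc32(handle.encode("utf-8")) % len(servers)
    entry = servers[slot]
    return entry["primary"], entry["secondaries"]
```

For example, with a two-entry table such as `[{"primary": "hs1", "secondaries": ["hs1b"]}, {"primary": "hs2", "secondaries": []}]`, every client that hashes the same handle reaches the same entry, and can fall back to a listed secondary if the primary does not answer.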
Initially the system supported data lengths which would fit only a single UDP packet whose length was chosen to be 512 bytes (which is the minimum that any network should support). This was a very unreasonable constraint on the size of a data item which would easily cross this boundary as the system found more and more applications. Changes were made to the data format to support arbitrary length of a data item.
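One common way to carry a data item of arbitrary length over fixed-size packets is to split it into numbered fragments. The sketch below shows that general technique only; the report does not say how the revised data format works, and the four-byte header used here (fragment index and fragment count) is a hypothetical choice.

```python
import struct

PACKET_SIZE = 512               # packet size cited in the text
HEADER = struct.Struct("!HH")   # hypothetical header: fragment index, fragment count
PAYLOAD_SIZE = PACKET_SIZE - HEADER.size


def fragment(data):
    """Split data into packets of at most PACKET_SIZE bytes, each with a header."""
    chunks = [data[i:i + PAYLOAD_SIZE] for i in range(0, len(data), PAYLOAD_SIZE)] or [b""]
    return [HEADER.pack(index, len(chunks)) + chunk for index, chunk in enumerate(chunks)]


def reassemble(packets):
    """Rebuild the original data from packets received in any order."""
    if not packets:
        raise ValueError("no packets")
    parts = {}
    count = 0
    for packet in packets:
        index, count = HEADER.unpack(packet[:HEADER.size])
        parts[index] = packet[HEADER.size:]
    if len(parts) != count:
        raise ValueError("missing fragments")
    return b"".join(parts[i] for i in range(count))
```

Because each packet carries its own index, the receiver does not depend on packets arriving in order.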
* Object Retrieval Performance *
Research on Client-Servers, Performance and Retrieval in Digital Library Systems is being carried out by graduate student Y. Li (under co-PI Campbell). This work has already produced two papers at major conferences. The focus is on the design of high performance digital library systems and the retrieval of digital objects. We have investigated the issues of digital object representation and the representation's impact on the delivery scheme [conference on information and knowledge management]. We propose a new process/thread scheduling scheme for the digital repository server [conference on parallel and distributed processing techniques and applications]. Currently we are working on the operating system side. We are now trying to design and implement an extensible kernel that can easily be tuned to the specific needs of digital library applications.
An abstract of the work on Dynamic Retrieval of Remote Digital Objects is as follows. A hierarchical representation and dynamic retrieval scheme for digital objects is presented. When the server load and network traffic are heavy in a digital library system, clients have to wait a long time to view the document because of transmission time, especially if the document is large. In order to reduce the client waiting time, we present a new representation and retrieval scheme. It is based on the fact that during a given period of time, the client only focuses on a small part of the document. We store and deliver the document in a certain order according to the client request. The client can view a part of the document as soon as it arrives, while the remaining parts are being transmitted. When the client wants to view a part that has not yet arrived, the server delivers this part immediately so that the client always encounters a short waiting time. This scheme provides the server with the ability to serve more clients without much performance degradation. Experiments show that the client waiting time is greatly reduced using our scheme when server load is heavy.
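The delivery order implied by this scheme can be sketched in a few lines. This is a simplified model rather than the authors' implementation: the document is assumed to be split into numbered parts, one part is sent per step, and `requests` is a hypothetical map from step number to the part the client asks for at that step.

```python
def delivery_order(num_parts, requests):
    """Return the order in which the parts of a document are sent.

    Parts are sent in their default order unless the client asks for a part
    that has not yet been sent, in which case that part goes next.
    """
    pending = list(range(num_parts))
    order = []
    step = 0
    while pending:
        wanted = requests.get(step)
        # Serve the requested part if it is still outstanding, else continue in order.
        part = wanted if wanted in pending else pending[0]
        pending.remove(part)
        order.append(part)
        step += 1
    return order
```

For a five-part document where the client jumps to part 3 at step 1, `delivery_order(5, {1: 3})` yields the order 0, 3, 1, 2, 4: the requested part is served at once and the remaining parts follow in their default order.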
An abstract of the work on A Dynamic Priority-based Scheduling Method in Distributed Systems is as follows. A dynamic priority-based scheduling method (DPSM) is presented to improve the performance of client/server distributed systems. The DPSM is based on a hierarchical representation and dynamic retrieval scheme whose goal is to reduce client waiting time. The basic idea of the hierarchical representation and dynamic retrieval scheme is to partition a document into parts and always deliver the part the client requires most urgently. Thus there are two kinds of events at the server site. The first involves delivering the part of a document that a client is waiting on. The other is delivering a part the client is not waiting on but may want later. We distinguish these two kinds of events and give the former a higher priority in the scheduling of CPU, disk, and network devices. We performed various simulations to test the DPSM. Simulation experiments show that DPSM always outperforms non-DPSM. The maximum reduction in total client waiting time reaches 99% with an average reduction of around 50%.
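The distinction between the two kinds of events can be illustrated with a two-level priority queue. This sketch captures only that distinction: it uses two fixed priority classes and a single queue, whereas the DPSM itself adjusts priorities dynamically and schedules CPU, disk, and network devices. The class and constant names are hypothetical.

```python
import heapq
import itertools

WAITING, PREFETCH = 0, 1  # lower value = higher priority


class Scheduler:
    """Deliver parts a client is waiting on before parts sent ahead of need."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, priority, client, part):
        """Queue the delivery of one document part to one client."""
        heapq.heappush(self._heap, (priority, next(self._order), client, part))

    def next_delivery(self):
        """Pop the most urgent delivery as a (client, part) pair."""
        _, _, client, part = heapq.heappop(self._heap)
        return client, part
```

A delivery submitted with `WAITING` is always popped before any queued `PREFETCH` delivery, regardless of the order in which the two were submitted.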
* Secure Object Repositories *
The practical complement to the above research in our computer science research on objects is a design and implementation effort to build a secure object repository infrastructure. This project is a collaboration between CNRI (the handles group), Cornell (the Dienst group), and NCSA (the Web server group). The NCSA component is partially supported by the Illinois DLI grant. Our expertise is in scalability and security. The responsible programmers on our end, Bob McGrath and Nancy Yeager, managed the NCSA WWW server during its super-exponential growth period and are the authors of a forthcoming book on the Principles of WWW Servers. This project thus attempts to design and implement the fundamental object infrastructure for the terms and conditions required for large-scale digital libraries. It draws on the experience of CNRI (building from the Kahn/Wilensky paper and the Handle System), Cornell (building from the CSTR project and Dienst), and NCSA (building from the WWW Servers and security issues).
The World-Wide Web made browsing the Internet universally available. While service providers are working through the wave of change wrought by this innovation, we must move on to the challenges of the next necessary innovations, which will be met by ubiquitous interoperating repositories.
A repository at the highest level is composed of a wide variety of services. Within this set of services there is a core set of infrastructure services that all repositories must have: Storage and Resource Management (e.g. deposit, replicate), a scheme for Unique Permanent Naming of Documents, Authentication services, and Accounting and Payment services.
Once the basic repository is in place, it is anticipated that others will be able to build higher level repository services on top of the existing infrastructure. These higher level services will include Indexing and Search. These facilities must scale up to accommodate billions of objects in millions of repositories spread across the Internet.
Working in collaboration with CNRI, the CS department of Cornell and Xerox Corporation (at Cornell), NCSA will design and implement prototypes of three key components of a repository:
* A Name Service based on the CNRI Handle server.
* An Interoperating Secure Object Store (ISOS), based on the Kahn/Wilensky paper.
* A prototype Certificate Authority to provide the needed authentication services.
This system will store digital objects of all types and ensure that contractual "Terms and Conditions", including copyrights and payment, are correctly enforced.
Varna Puvvada, a DLI funded graduate student, worked as a summer intern at CNRI. She developed software to promote the scalability of information servers: in particular, CNRI's Handle server.
The handle server system is a name service which 1) assigns permanent, unique, location-independent names (Uniform Resource Names, or URNs) to documents on the Internet, and 2) resolves URNs to location-dependent identifiers, Uniform Resource Locators (URLs).
Varna's contribution this summer was to make the handle server easier to use from behind firewalls. She developed a TCP/IP interface to one of the "handle" server daemons. The handle server was previously only accessible via UDP.
This coming year Varna will be working on a "reverse lookup" administrative tool for handles. This tool will, when given a URL, find its URN.
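The relationship between resolution and reverse lookup can be shown with a toy name table. This is not the Handle System's interface: the class, its methods, and the handle and URL values in the usage note are hypothetical, and a real reverse lookup would have to work across distributed servers rather than one in-memory dictionary.

```python
class HandleTable:
    """Toy name service: resolve a handle (URN) to URLs, and find the handle for a URL."""

    def __init__(self):
        self._forward = {}  # handle -> list of URLs
        self._reverse = {}  # URL -> handle

    def register(self, handle, url):
        """Record that the named object is available at the given location."""
        self._forward.setdefault(handle, []).append(url)
        self._reverse[url] = handle

    def resolve(self, handle):
        """Return every known location for a handle (empty list if unknown)."""
        return list(self._forward.get(handle, []))

    def reverse_lookup(self, url):
        """Return the handle registered for a URL, or None if there is none."""
        return self._reverse.get(url)
```

After `table.register("example/doc-1", "http://a.example.org/doc1")`, resolving the handle returns the URL and `reverse_lookup` on the URL returns the handle. Maintaining the second mapping at registration time is what makes the reverse direction cheap.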
Faculty and students in the Graduate School of Library and Information Science, Sociology, and Economics are continuing with their user-based research related to the DLI. The main activities in this quarter have been some initial development of a conceptual framework to serve as a theoretical base for their work, the completion of an online user registration form, the development of specific procedures for capturing and analyzing Mosaic transaction log data, observations of the use of existing digital resources at the Grainger library, and progress in the integrated analysis of findings from all of their activities.
One of the major conceptual challenges that the evaluation team has been working on over the last quarter is triangulation. Triangulation means taking multiple views of the complex phenomenon of building infrastructure, and using multiple methods to do so. But the project itself involves a kind of triangulation of different concerns in order to make a workable system, including those of designers, librarians, users, and publishers. Triangulation is not simply a matter of adding views or data points together. Data deriving from different methods, or design decisions taken from different viewpoints, or usage by those with different needs, always means a negotiated process. Rather than try to resolve these difficult questions by fiat or formula, the evaluation team has begun to conceptualize its own work in terms of different viewpoints, each with its own epistemological focus. These are the viewpoints of users' work, the analysis framework, and study methods, with a fourth "meta view" as the process of integrating and triangulating these views, which are described briefly below. Each of these views also scales, from focusing on digital library use at the individual level (encompassing more cognitive tasks), to wider net- or Web-wide phenomena.
At the individual work level, tasks such as actual browsing of information and information retrieval are important. Analytically here, the important questions concern those cognitive/conceptual changes for users in moving to digital form, including such anticipated changes as different metaphors and images (e.g., "navigating" in cyberspace vs. "wandering" the stacks). Moving from the individual in front of the screen, the next level we encounter is that of the flow of work, where the digital library is embedded within a work space. Analytically, the questions move here to understanding the links between an individual's workspace and work flow and the features of the system. One example of this sort of question concerns how an individual's "ethno-classification" and personal library and filing system is affected by the digital library. In addition to work spaces and work groups, our project has interesting institutional implications, including occupations and the institutions of the extant physical libraries affected. Analytically, the challenge for this level of focus is to understand the changing distribution of skills posed by the virtual library environment and to understand the nature of organizational transformations. What will happen to the ways librarians, engineers, and publishers currently organize their work and operations at the institutional and professional levels? At the widest level of scale with which we are working in this project, the digital library interfaces with the World Wide Web, and involves a large number of people (whose ties to each other and to the information in the system may be looser than those influenced by shared proximity, tasks, goals, or institutions) in complex cognitive tasks such as information retrieval. 
Analytically, here, there are many questions in the realm of what we call "sociology of infrastructure," that is, what is the nature of large-scale changes in work and cognition afforded as the entire information infrastructure begins to change?
The evaluation team is currently developing two primary mechanisms for automatically collecting DLI usage/user data. First, many of the DLI software components are being instrumented to collect detailed transaction logs of each user session. Second, an automated DLI user registration process will collect demographic information about each DLI user and provide a confidential mechanism to link the data in the transaction logs to individual DLI users. These data are being collected to serve two primary purposes. The first purpose is to provide various sorts of management data. This includes summary data for project management and the outside world concerning the number of users and their aggregate behavior, as well as system performance data to ensure that the DLI is operating within acceptable tolerances. The second purpose is to collect detailed data on individual DLI users and their behavior. During this quarter, the researchers developed and began pilot testing a World Wide Web user registration form. The form collects contact information as well as information about the user's professional background and the extent of the user's familiarity with common computing and communications systems.
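One way a confidential link between registration records and transaction logs could work is sketched below in Python. This is a minimal illustration under stated assumptions, not the project's actual design: the keyed-hash pseudonym, the `pseudonym` and `log_event` helpers, the field names, the sample user, and the secret key are all hypothetical and introduced here only for illustration.

```python
import hashlib
import hmac

# Hypothetical secret held only by the evaluation team; not from the report.
LINK_KEY = b"evaluation-team-secret"


def pseudonym(user_id: str, key: bytes = LINK_KEY) -> str:
    """Derive a stable pseudonym so logs never carry the real user ID."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def log_event(user_id: str, component: str, action: str) -> dict:
    """Build one transaction-log record keyed by pseudonym (assumed fields)."""
    return {"user": pseudonym(user_id), "component": component, "action": action}


# Registration data (demographics) is stored separately, keyed the same way.
registration = {
    pseudonym("jdoe"): {"background": "graduate student", "web_experience": "high"},
}

event = log_event("jdoe", "search", "query submitted")
profile = registration[event["user"]]  # link the log record back to demographics
```

Because the same key always yields the same pseudonym, aggregate management reports can be produced from the logs alone, while only someone holding the key and the registration table can tie a session to an individual.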
Work in system instrumentation this quarter has resulted in the completion of specifications for collecting and analyzing transaction log data from Mosaic. The evaluation team is currently working with NCSA staff to debug the instrumented version of Mosaic prepared according to their mutually agreed-upon specifications. Next quarter, they aim to begin user testing with an instrumented Mosaic. The goal will be to gain experience with collecting these data before incorporating an instrumented Web browser into the DLI prototype. In addition, they will use the instrumented Mosaic to begin collecting data on general Web use by members of the engineering community. The evaluation researchers have also been working on plans for instrumenting each of the other software components (e.g., database management systems, database search engines, thesauri, and SGML and other data format viewers) that together will comprise the DLI system. During any single DLI session, several loosely coupled systems are brought into play in order to search for, retrieve, and display documents for the user. By logging the user's interactions with each of these systems, the researchers can focus on one of the systems in isolation or on all of the systems as components of a digital library. A higher-level methodological question they hope to address is the extent to which these types of transaction logs can be used effectively to understand the behavior, needs, etc., of individual users instead of using methods such as interviews and field observations.
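The idea of logging each loosely coupled system separately, then studying either one system or the whole session, can be sketched as follows. This Python fragment is only an illustration: the `LogRecord` fields, the shared session ID, the component names, and the sample records are assumptions, not the specifications agreed with NCSA.

```python
from collections import defaultdict
from typing import NamedTuple


class LogRecord(NamedTuple):
    session_id: str   # assumed: every component logs a shared session ID
    timestamp: float  # seconds since session start (assumed unit)
    component: str    # e.g. "browser", "search", "viewer"
    action: str


def merge_sessions(*component_logs):
    """Merge per-component logs into one time-ordered list per session."""
    sessions = defaultdict(list)
    for log in component_logs:
        for record in log:
            sessions[record.session_id].append(record)
    for records in sessions.values():
        records.sort(key=lambda r: r.timestamp)
    return dict(sessions)


def only_component(records, component):
    """View a single system in isolation within a merged session."""
    return [r for r in records if r.component == component]


# Hypothetical sample records, one list per instrumented component.
browser_log = [LogRecord("s1", 0.0, "browser", "open search form"),
               LogRecord("s1", 9.5, "browser", "follow result link")]
search_log = [LogRecord("s1", 4.2, "search", "query: fatigue cracks")]
viewer_log = [LogRecord("s1", 11.0, "viewer", "display SGML article")]

sessions = merge_sessions(browser_log, search_log, viewer_log)
timeline = [r.component for r in sessions["s1"]]
```

Here `sessions["s1"]` gives the whole-library view of one session, while `only_component(sessions["s1"], "browser")` isolates a single system.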
The major data collection effort of the evaluation team this past quarter has been to observe patrons using existing computer systems related to the retrieval of fulltext material: Mosaic, Engineering Index on CD-ROM, the expanded online catalog, and IEEE's fulltext journal system on CD-ROM. The purpose of these observations is to provide information requested by the DLI testbed team, who noted that some of their design decisions would be influenced by knowing such things as:
--What search strategies/keys are used and why?
--What mistakes are made? What are the biggest problems/barriers to use?
--How do users get help? What do they do when they're stuck?
--What content is sought? (what journals, fields)
--When and why is fulltext retrieved?
--When do people print? Do they print pieces or entire articles?
--What will people do with the material they've retrieved?
--How satisfied are users? What do they see as advantages?
--How much do people understand about the system and its use (how do they conceptualize the system?)
--What work tasks is the system supporting?
Answers to these questions will also contribute to longer term sociological research on the changing nature of information infrastructure in engineering. Members of the evaluation team have observed a number of Grainger patrons and transcribed the observation logs of their search sessions. Each observation session also includes asking each system user the following questions: a) What was your purpose in using the system? What will you do with the results? b) What did you like best about the system? c) What didn't you like? What problems did you encounter? These observations were begun recently and the researchers have not yet collected a great deal of data. They have developed a content analysis scheme and begun coding their initial results. Preliminary analysis reinforces the findings from earlier focus group interviews, individual interviews, and reference desk observations: patrons do not usually have a good grasp of either the content or nature of the systems that they are using; they employ a more or less random searching strategy with general information (i.e., subject searches with uncontrolled vocabulary); and they have little patience for reading instructions or asking for help when they get stuck. When patrons did not know what a database contained, they would simply enter their subject search terms and see what was returned rather than trying to find a source of information to tell them if the contents of the database were relevant to their search subject.
Another major area of social science research activity this quarter has been in the area of data analysis. Members of the evaluation team have begun the integrated analysis of data from all of their research efforts, using the grounded theory approach. Grounded theory involves reviewing all sources of data, coding the most frequent occurrences of actions and perceptions within the data and asking under what conditions these actions emerge. These coding schemes are then abstracted into memos that identify and discuss emerging themes and concepts, which are then applied to data sources (previous and new) and re-examined. This is an iterative process directed at inductively generating concepts and theories directly from the data collected.
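The first pass of this coding process (tallying coded actions and the conditions under which they appear) can be illustrated with a small Python sketch. The code labels, conditions, and counts below are invented for illustration and are not the team's data or coding scheme; grounded theory itself remains an interpretive, iterative process that a tally only supports.

```python
from collections import Counter, defaultdict

# Hypothetical coded segments: (code, condition under which it was observed).
coded_segments = [
    ("uncontrolled subject search", "database contents unknown"),
    ("uncontrolled subject search", "database contents unknown"),
    ("skips instructions", "stuck on query syntax"),
    ("uncontrolled subject search", "first use of system"),
    ("skips instructions", "first use of system"),
]

# Step 1: which coded actions occur most often?
code_counts = Counter(code for code, _ in coded_segments)

# Step 2: under what conditions does each action emerge?
conditions = defaultdict(Counter)
for code, condition in coded_segments:
    conditions[code][condition] += 1

top_code, top_count = code_counts.most_common(1)[0]
top_condition = conditions[top_code].most_common(1)[0][0]
```

Summaries like `top_code` and `top_condition` would feed the memos described above, which are then checked against previous and new data sources.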
The short-term goal of the evaluation team's work is not only to provide feedback to the testbed team on issues that they specifically ask about, but also to watch for "the unexpected." For those observations and interviews most directly related to user interactions with library systems, summaries of coded data are given to the testbed engineers immediately so that they benefit from direct user input as they design the digital library testbed. The coding schemes used to produce these summary reports, such as the content analysis scheme developed for the Grainger observations, are tuned to the specific goals and results of that particular data collection activity. As more data are collected (from interviews, observations, transaction logs, user surveys, etc.), individual coding schemes are expanded, refined, and integrated and can be applied to other data sources as well, and more integrated and analytic feedback can be presented to system designers. Further, the integration of all of the evaluation data in this manner helps the researchers generate theories about information infrastructure, knowledge gathering and sensemaking, and communication in the engineering community.
A second major data analysis activity pursued during this quarter is the exploration of Hyper-G as a tool to support the researchers' grounded theory work. Other standard software packages for qualitative data analysis do not seem to provide the flexibility and power desired. Further, they are inadequate for dealing with the multimedia data the team is generating.
The evaluation team has continued planning for the first stage of deployment of the prototype DLI testbed, which will provide the opportunity to collect data on system use through both controlled usability tests and more naturalistic observations of use. As noted in last quarter's report, several public sites for deployment at the University of Illinois have been identified. The sites have been selected to allow the study of system use within both library and workteam environments, and across several engineering disciplines, by various segments of the academic engineering community, such as faculty members, graduate and undergraduate students, and librarians. We hope that the next two quarters will see both usability testing of the prototype along with the collection of data about system usage in these actual work and learning situations through observations, interviews, and the solicitation of user feedback.
Co-PI Bishop continues as the evaluation team leader and Star continues as the technical lead for ethnography and research. In addition to their responsibility for evaluation research related to the University of Illinois project, they also play an important role in promoting evaluation efforts across the other DLI projects. Bishop serves as chairperson of the DLI-project-wide working group on user-centered evaluation and of the upcoming Allerton Institute on "How We Do User-Centered Design and Analysis of Digital Libraries: A Methodological Forum." As hoped, this small invited meeting has attracted participants from the six DLI projects and from other major digital library projects, as well as other renowned researchers from sociology, anthropology, psychology, computer science, and library and information science. Bishop also serves on the program committee of DL96, an interdisciplinary conference sponsored by a number of organizations, including the ACM. During this quarter, Bishop participated in a panel on the DLI initiative at the ACM's Computer Human Interaction (CHI-95) conference, conducted a session on digital libraries as part of a faculty institute at Skidmore College in Saratoga Springs, New York, and hosted an expert forum on user research at the DL95 conference in Austin. Emily Ignacio and Laura Neumann, research assistants on the evaluation team, participated in DL95 and are collaborating with Bishop and Star on Allerton activities. Neumann also took the lead in developing a homepage for the DLI evaluation team that facilitates the dissemination of information related to their work to the public. Star serves on the steering committee for the Allerton Institute. She has participated in several national meetings related to furthering humanities and social science efforts for digital libraries.
Bishop, A., et al. (in press). Building a Digital Library for the Academic Engineering Community: Implications of User Research for Higher Education. In Higher Education and the NII: Proceedings. Washington, DC: Coalition for Networked Information.
H. Chen and T. Ng, "An Algorithmic Approach to Concept Exploration in a Large Knowledge Network (Automatic Thesaurus Consultation): Symbolic Branch-and-bound Search vs. Connectionist Hopfield Net Activation," Journal of the American Society for Information Science, Volume 46, Number 5, Pages 348-369, June 1995.
H. Chen and B. R. Schatz, "A Path to Concept-based Information Access: From National Collaboratories to Digital Libraries," Editors: G. M. Olson, J. B. Smith, and T. W. Malone, in Coordination Theory and Collaboration Technology, 1995. (Accepted with minor revision)
H. Chen, C. Schuffels, and R. Orwig, "Internet Categorization and Search: A Machine Learning Approach," Journal of Visual Communication and Image Representation, Special Issue on Digital Libraries, 1995. (Accepted with minor revision)
H. Chen, A. Houston, J. Yen, and J. F. Nunamaker, "Intelligent Meeting Facilitation Agents: An Example on GroupSystems," IEEE Computer, 1995. (Accepted with minor revision)
H. Chen, J. Martinez, T. D. Ng, and B. R. Schatz, "A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System," Journal of the American Society for Information Science, 1995. (Under re-review after minor revision)
H. Chen, B. R. Schatz, T. D. Ng, J. P. Martinez, A. J. Kirchhoff, C. Lin, "A Parallel Computing Approach to Creating Engineering Concept Spaces for Semantic Retrieval: The Illinois Digital Library Initiative Project," submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, Special Issue on Digital Libraries: Representation and Retrieval, 1995.
R. E. Orwig, H. Chen, and J. F. Nunamaker, "A Graphical, Self-Organizing Approach to Classifying Electronic Meeting Output," submitted to Journal of the American Society for Information Science, 1995.
H. Chen, L. She, A. Iyer, and G. Shankaranarayanan, "A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing," submitted to IEEE Transactions on Knowledge and Data Engineering, 1995.
H. Chen, B. R. Schatz, C. Lin, "Concept Classification and Search on Internet Using Machine Learning and Parallel Computing Techniques," submitted to World Wide Web Conference '95 (WWW-4), 1995.
Cole, T. "Digital Library Projects", LITA Newsletter, 16(2): 25-27, Spring 1995.
Johnson, E. H. and Cochrane, P. A. "A Hypertextual Interface for a Searcher's Thesaurus", Editors: Shipman, F., Furuta, R. and David M. Levy, in Digital Libraries '95 (Austin, June 11-13, 1995). http://csdl.tamu.edu/DL95/papers/johncoch/johncoch.html
Y. Li and R. Campbell. A Dynamic Priority-based Scheduling Method in Distributed Systems. International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'95), Georgia, November 3-4, 1995.
Y. Li and V. Puvvada and R. Campbell, "Dynamic Retrieval of Remote Digital Objects", Proc. of Fourth International Conference on Information and Knowledge Management (CIKM'95), Univ Maryland, Nov 1995.
B. Schatz. Information Analysis in the Net: The Interspace of the Twenty-First Century. Refereed White Paper for "America in the Age of Information: A Forum on Federal Information and Communications R&D", July 6-7, National Library of Medicine. Sponsored by CIC (Committee on Information and Communications), reporting to the Science Advisor to the President of the United States. http://interspace.grainger.uiuc.edu/america21.html
L. Star, editor, "Cultures of Computing" (Sociological Review Monograph). Oxford: Basil Blackwell, 1995.
L. Star and K. Ruhleder, "Steps toward an Ecology of Infrastructure", Proceedings of CSCW 94. New York: ACM Press. Pp. 253-264. Revised version accepted for Information Systems Research, special issue on Organizational Transformation, edited by JoAnne Yates and John Van Maanen.
L. Star and G. Bowker, "Work and Infrastructure," (short contribution), Communications of the ACM (Sept. 1995) 38: 41.
Bishop, A. Presentation as part of a panel on the DLI. ACM's Computer-Human Interaction Conference 1995 (SIGCHI), Denver, Colorado, May 10.
Bishop, A. Seminar on the DLI and other current digital library activities. Presented at the Skidmore Faculty Institute on information technology, Saratoga Springs, New York, May 26.
Bishop, A. Hosted expert forum on user-centered digital library research. Digital Libraries 95, Austin, Texas, June 13.
Bishop, A. and Joseph Squier. Artists on the Internet. Annual Meeting of the Internet Society, Honolulu, Hawaii, June 27.
Estabrook, L. Presentation on DLI social science research, as part of a panel at the 1995 Mid-Year Meeting of the American Society for Information Science, Chicago, Illinois, May 25.
Hardin, J., Mischo, W. and B. Schatz. NCSA WWW Colloquium Series, The Illinois Digital Library project, May 10, 1995.
Johnson, E. H. and Cochrane, P. A. A Hypertextual Interface for a Searcher's Thesaurus. Digital Libraries '95, Austin, Texas, June 11.
Mischo, W. Illinois Digital Library Initiative. Illinois Library Association Annual Conference, Peoria, IL, May 5.
Mischo, W. NSF/NASA/ARPA Funded Digital Library Projects: Research Designed to Create Tomorrow's Libraries. American Library Association Annual Conference, Chicago, IL, June 24.
Mischo, W. Toward the Digital Library: Academic Library Futures. American Library Association Annual Conference, Chicago, IL, June 25.
Schatz, B. Digital Libraries and Infrastructure: Lessons from Traditional Libraries. IITA Digital Libraries Workshop, Falls Church, VA, May 18.
Schatz, B. Problem Solving in the Net: A&I in the Twenty-First Century. ACS Chemical Abstracts Services Futures workshop, Columbus, OH, May 24.
Schatz, B. Invited panel speaker on the Future of Network Infrastructure and session chair for Information Space Environments, the Annual Meeting of the Internet Society, Honolulu, Hawaii, June 27.
Schatz, B. Information Analysis in the Net: The Interspace of the Twenty-First Century. White Paper for "America in the Age of Information: A Forum on Federal Information and Communications R&D", July 6-7, National Library of Medicine. Sponsored by CIC (Committee on Information and Communications).
Schatz, B. University of Illinois Digital Library Initiative poster for the ARPA Principal Investigator Meeting, Fort Lauderdale, Florida, July 10-13.
Schatz, B. The Illinois Digital Library Project: Towards Search in the Net. NSF/NCSA WWW Federal Consortium Annual Meeting, Urbana, IL, August 1, 1995.
L. Star, Plenary speaker, "Infrastructure and Work," to Conference on Mediated Activity in Organizational Contexts, School of Education, University of Helsinki, Finland, January 1995.
L. Star, invited attendee to NSF special meeting at Stanford, April, 1995, on digital libraries and science and technology studies.
L. Star, "The Illinois Digital Library Project" and "Standardization" to the Institute for Informatics and Norwegian Computing Centre, University of Oslo, Norway, June, 1995. Also participated in a seminar there on standards and the Internet, and made a presentation on standards and communities.
In May the hypertextual thesaurus browser was demonstrated to the Librarian of the Institute for Industrial and Labor Relations, UIUC, Margaret Chaplan, who has just completed a vocabulary switching study between LCSH and the Laborline Thesaurus. Also present was Diane Rothenburg, Assoc. Director of the ERIC Clearinghouse on Early Childhood Education. Both of these experienced thesaurus makers and researchers were impressed with the capabilities of the thesaurus and expressed an interest in such software if arrangements could be made for its use outside the DLI project. (Letters documenting this interest are on file.)
Testbed personnel have provided demonstrations of the UIUC DLI software at the Grainger Engineering Library Information Center Testbed site for the following groups:
5/1 CIC Symposium on Learning Technologies (Provosts and Chief Information Officers of 13 universities)
5/3 OpenText Corporation
5/9 University of New Zealand
5/9 Tribune Broadcasting Company Board of Directors
5/10 NCSA WWW Colloquium
5/16 Center for Human Resource Management Seminar
5/18 University of Minnesota-Duluth
5/24 Librarians (41) from the University of Tennessee and University of Kentucky
6/2 Sybase Corporation
7/15 Hewlett Packard - equipment partnership
7/18 Caterpillar, Inc. (10 librarians)
7/20 University of Library and Information Science, Tsukuba, Japan
7/21 Universiti Sains Malaysia and Universiti Teknologi Malaysia
7/25 LG Information and Communications, Ltd. of South Korea
H. Chen, Principal investigator (PI), National Science Foundation, CISE, IRIS, "Concept-based Categorization and Search on Internet: A Machine Learning, Parallel Computing Approach," $200,755, September 1995-August 1998.
H. Chen, Principal investigator (PI), National Center for Supercomputing Applications (NCSA), "Information Analysis and Knowledge Discovery for Digital Libraries," High-performance Computing Resources Grants, on SGI Power Challenge Array, Cray CS6400, and Convex Exemplar, July 1995-August 1996.
H. Chen, Co-Investigator (Co-I, PI: N. Strausfeld, University of Arizona), National Science Foundation, Database Activities Program -- Division of Biological Instrumentation and Resources, Database Activities Relating to Identifiable Neurons, "FLYBRAIN, The First in a Federation of Databases for Insect Neurology," $896,424, September 1995-August 1998.
B. Schatz, Principal investigator (PI), and C. Jamison, National Science Foundation, Database Activities Program -- Division of Biological Instrumentation and Resources, "Building an Electronic Scientific Community", $377,532, original grant BIR-9319844, 2nd year continuation BIR-9503547, September 1994 - August 1996.
Our external relations coordinator, Tom Habing, thabing@uiuc.edu, is responsible for interacting with these partners.
AIAA
We have received SGML text only (no graphics) for some articles from AIAA Journal. We are still working out how to process their DTD, which differs from the other DTDs we have worked with.
AIP
We have 34 issues of Applied Physics Letters up on the Web and they are in good shape. We are in the process of working with SoftQuad to learn how to solve rendering problems with Panorama.
APS
We have the January-July 1995 issues of Physical Review Letters processed but, again, need to solve some rendering problems with Panorama.
ASCE
We have received two issues of the Journal of Transportation. The SGML is in good shape and we are currently working on resolving problems with graphics formatting.
ASCE has also committed to sending us the following journal titles in SGML:
Journal of Aerospace Engineering
Journal of Computing in Civil Engineering
Journal of Construction Engineering and Management
Journal of Materials in Civil Engineering
Journal of Performance of Constructed Facilities
IEEE
We're receiving back issues of IEEE Transactions and are currently processing them. The Transactions cover a five year time period and there are multiple DTDs, so this will take some time.
IEEE Computer Society
We've received samples of text in SGML from the following journals: Computer, Computational Science and Engineering, Design and Test of Computers, and Software. IEEE Computer Society allowed us to put a sample of the materials we've marked up on the Web as an example of what the DLI is all about. The public URL is: http://morrigan.grainger.uiuc.edu/
SoftQuad
We are working with SoftQuad to resolve some of Panorama's problems with math rendering. They have communicated that they hope to have the problems solved this year.
