Quarterly Report for 2nd Quarter 1997
(May - July ‘97)
DLI Project, University of Illinois

Federating Repositories of Scientific Literature
Bruce Schatz, PI,
schatz@uiuc.edu
contact:
dli@uiuc.edu

 

SIGNIFICANT EVENTS:

The final phase of the design and implementation of the Web client has been completed. The Web client will now be tested by select user groups to ensure the usability of the interface and the performance of Livelink. Campus-wide distribution of the client will commence October 15.

The design of a new information retrieval protocol, called IRTP (for Information Retrieval Transfer Protocol), has been initiated. A major feature of IRTP will be a hierarchical attribute space that will allow a server to handle a search in a reasonable way even when it does not support the particular kind of search requested by a client. IRTP will be easy to implement but also flexible enough to be used in a sessionless as well as a session-oriented way.

TESTBED

This past quarter the Testbed team concentrated on preparations for a limited distribution of a Web-based client. This included working closely with the Social Science team on the interface, which will provide simulated stateful connections. The Web client (Livelink, OpenText's new version of the search and indexing engine for the World Wide Web) supports searching across multiple repositories and more complex Boolean searching.

In the upcoming quarter, the Testbed team will be working on improving the performance and functionality of the Web-based client. A research programmer was recently hired to carry out this work. In addition, the Testbed team will be working closely with Engineering faculty and staff at UIUC to promote usage of the Testbed.

The Testbed team also hosted the American Society of Agricultural Engineers and Oxford University Press on separate occasions to review the current status of the Testbed and plans for its continuation after the end of the project.

EVALUATION

During the past quarter, the Social Science Team has been involved primarily in system development and conducting research related to digital infrastructure and scientific and engineering work. Team members were heavily involved in the final phase of design and implementation for our web client. They also continued producing more detailed analyses of registration and transaction log data. Work continued on studies of the work and information practices of an engineering workgroup, the use of document structure, and office classification practices. The Social Science Team also continued its efforts to foster the emerging research community in the social informatics of DLs through presentations, publications, and preliminary work on a monograph devoted to human-centered design and analysis of DLs.

The Social Science Team worked closely with both designers and intended users on the development of our new web client. Usability tests with science and engineering librarians provided suggestions for improving the web interface and functionality. The Social Science Team met on a biweekly basis with web client designers, bringing their expertise on user needs and preferences to bear on the design of system features and promotional material. They also created online help files, a tutorial, and a website on engineering material to link to the testbed. Team members also developed a marketing strategy, an online survey, and a usability test plan for gaining feedback from users during the trial roll-out of the web client.

Social Science Team members worked with the Testbed Team to design the transaction logging and user authentication process modifications needed for support of the web testbed client. They also worked with the transaction log data and the reports of other DL systems to identify how data from our clients could be integrated to produce desired reports on the extent and nature of testbed use.

One team member continued interviews and observations with a research group in mechanical and industrial engineering related to their work practices. This study will contribute to understanding group and individual work practices and information use. Another team member continued work on a study of how researchers use the individual components of journal articles by conducting a small series of interviews with members of the research group mentioned above. In the third significant study currently under way, another team member continued exploring the organization of office spaces through launching a web site that invites people to describe their work settings. The methodology of this study represents a significant attempt to develop new online approaches for gathering qualitative data on work practices.

Social Science Team members continued their efforts to foster a research community focused on human-centered studies of information systems. Team members participated in, for example: an NSF-sponsored workshop devoted to the development of a new funding initiative in knowledge networking; a workshop on knowledge work sponsored by Case Western; the ACM DL 1997 conference; and a workshop on the ecology of infrastructure. Team members revised papers on their DL studies for publication and continued work on the planned monograph on human-centered design and evaluation of DLs by contacting potential authors and publishers and helping to develop a book prospectus produced by Nancy Van House of the Berkeley DLI project.

Finally, team members welcomed a new research assistant, Eric Larson, and spent a significant amount of time planning study activities for the final year of DLI. They developed a plan for collecting, integrating, and reporting data on the extent and nature of testbed use and on the nature of work and communication in the digital era.

INTERNET

We are preparing to finally take IODyne into the outside world. The design is stable and we are working on improving the internals and enhancing the drag-and-drop functionality as well as the general robustness of IODyne. Current plans are to reimplement IODyne in Java, using the new Java Foundation Classes. Although Java performance is still somewhat of a concern, with the new just-in-time compilers becoming available this should not be an issue by the time we are ready for a 1.0 release of IODyne early in 1998.

Search documents, the objects which encapsulate abstract queries and hold files (small databases of search terms and bibliographic records), have now become truly persistent objects, and are savable and openable just like documents in other applications. It is also possible to open IODyne by double-clicking on an IODyne search document icon in the file explorer or on the desktop. You can open multiple search documents within IODyne.

This use of documents also allows IODyne to be launched from other applications, such as Web browsers and email clients that use MIME typing. For example, an IODyne document stored on a Web site can launch IODyne as a helper app when you click on the icon for the document.

The big change in the use of search documents is how repositories get associated with them. In previous versions of IODyne, you connected to a repository in the global application environment. Now you "attach" repositories to individual search documents by dragging the label from the repository display window and dropping it onto the search document. This has three distinct advantages over the prior design. First, different sets of repositories can be associated with different search documents; it is also absolutely clear which repositories are associated with which search documents. Second, the explicit association of repositories with a search document makes the repository references a persistent part of the search document; when you open a search document, IODyne can automatically connect to and search the repositories referred to in it. Third, the new connection model is more robust, with IODyne now able to actively attempt reconnection to servers that have disconnected. These advantages, coupled with the encapsulation of repository configuration data, make stored searches very easy to create, reuse, and share with others.
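As a rough illustration of the attach-and-persist model described above, the following Python sketch treats a search document as a small record that carries its own repository references. The class name, the field names, the repository labels, and the JSON encoding are all assumptions made for this example; the report does not specify IODyne's actual file format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class SearchDocument:
    """Hypothetical stand-in for an IODyne search document."""
    query: str
    # Repository references attached to this document (hypothetical labels).
    repositories: List[str] = field(default_factory=list)

    def attach(self, repository_ref: str) -> None:
        # Mirrors dropping a repository label onto the search document.
        if repository_ref not in self.repositories:
            self.repositories.append(repository_ref)

    def dumps(self) -> str:
        # The repository references are saved together with the query.
        return json.dumps(asdict(self))

    @classmethod
    def loads(cls, text: str) -> "SearchDocument":
        return cls(**json.loads(text))


doc = SearchDocument("parallel loops")
doc.attach("inspec")
doc.attach("compendex")
restored = SearchDocument.loads(doc.dumps())
print(restored.repositories)
```

Because the references travel with the document, a reopened document knows which repositories to reconnect to without any global application state.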

Repository configuration data are encapsulated by Repository Configuration Objects (RCOs). RCOs can be distributed around the Internet, and collections of pointers to RCOs can be assembled by online collection managers to create virtual collections. IODyne uses its own HTTP sockets to fetch RCOs and RCO collections. The particular ways in which repositories will be identified and organized throughout the Internet are still being discussed; repositories need to be uniquely identifiable, calling for some sort of registry system (though something based on URNs would work), and yet the architecture must be open enough to allow anyone to build repositories, RCOs, and collections of pointers to RCOs.

We are also designing a new information retrieval protocol, called IRTP (for Information Retrieval Transfer Protocol). This started out as a design for the Abstract Retrieval Gateway (ARG), a gateway designed to optimize the use of IODyne clients by handling a lot of the query and result translation now done by the IODyne client itself. We realized early in the design process that we could do a lot more than just build a gateway for IODyne clients, and that we could build something to support access through Web server gateways and other bibliographic servers as well. We finally decided to implement the IRTP protocol when we realized that no easily implementable information retrieval protocols existed. We will take implementation experience from Z39.50 and insights gained from other protocols implemented since that time to build a protocol which is not only easy to implement but is also flexible enough to be used in a sessionless as well as a session-oriented way. A major feature of IRTP will be a hierarchical attribute space that will allow a server to handle a search in a reasonable way should it not support the particular kind of search requested by a client. Responses from servers will also include optional explanations of how searches were done. This will be particularly useful should the search have to be done in a different way than the user expected. It is up to the client program to extract and display the explanation in a non-obtrusive way (we have already designed this feature for IODyne).
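The fallback behavior of a hierarchical attribute space can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slash-separated attribute paths, the SUPPORTED set, and the resolve_attribute function are hypothetical, since the report does not define IRTP's attribute names or message format.

```python
from typing import Optional, Set, Tuple

# Hypothetical attribute paths that one server supports.
SUPPORTED = {"any", "any/subject", "any/author"}


def resolve_attribute(requested: str, supported: Set[str]) -> Tuple[Optional[str], str]:
    """Walk up the attribute hierarchy until a supported level is found.

    Returns the attribute actually used (or None) together with an
    explanation that a client could display to the user.
    """
    parts = requested.split("/")
    while parts:
        candidate = "/".join(parts)
        if candidate in supported:
            if candidate == requested:
                return candidate, "searched as requested"
            return candidate, f"'{requested}' not supported; searched '{candidate}' instead"
        parts.pop()  # drop the most specific level and retry with its parent
    return None, f"no supported ancestor of '{requested}'"


used, why = resolve_attribute("any/subject/thesaurus-term", SUPPORTED)
print(used, "|", why)
```

The second element of the result plays the role of the optional explanation that a server response could carry when the search was done differently than requested.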

The Term Suggestion Service (TSS) layer has demonstrated its robustness and flexibility and will be retained and extended for the next stage of development. This is the part of IODyne that retrieves records from term suggestion databases (thesaurus, KWIC, KWOC, concept space, etc.) that are specially prepared for IODyne through off-line processes. It is a generalized database framework based on tables of objects with unique primary keys. Objects are stored in memo-type fields using SGML-style tags to describe their structure, which can be arbitrarily complex. When the TSS layer fetches an object from a term suggestion database, it passes the object to the term suggestion tool (e.g. thesaurus browser, KWIC window) that requested it. That tool then renders the object. The simplicity of the database structure used by the TSS layer means that term suggestion databases can be stored on any number of different database servers. Any ODBC database server can store TSS databases, as can an HTTP server with a CGI script, or a Microsoft Internet Information Server with Active Server Pages.
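A minimal sketch of this fetch-then-render flow is shown below in Python. It assumes an in-memory SQLite table in place of an ODBC or HTTP back end, and it assumes the stored markup is well formed enough to be parsed as XML; the table name, key format, and tag names (term, name, bt, nt) are invented for the example and are not taken from IODyne.

```python
import sqlite3
import xml.etree.ElementTree as ET

# One table of objects with a unique primary key and a text ("memo") body.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tss_objects (key TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.execute(
    "INSERT INTO tss_objects VALUES (?, ?)",
    (
        "thesaurus:loops",
        "<term><name>loops</name><bt>control structures</bt>"
        "<nt>parallel loops</nt></term>",
    ),
)


def fetch_object(connection, key):
    """TSS-layer role: fetch one tagged object by primary key, or None."""
    row = connection.execute(
        "SELECT body FROM tss_objects WHERE key = ?", (key,)
    ).fetchone()
    return None if row is None else ET.fromstring(row[0])


def render_thesaurus_term(obj):
    """Tool role: the requesting tool decides how to display the object."""
    name = obj.findtext("name")
    broader = [e.text for e in obj.findall("bt")]
    narrower = [e.text for e in obj.findall("nt")]
    return f"{name} (BT: {', '.join(broader)}; NT: {', '.join(narrower)})"


print(render_thesaurus_term(fetch_object(conn, "thesaurus:loops")))
```

Keeping the storage side to a single keyed lookup is what lets the same objects sit behind any server that can return a text field for a key.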

The use of the TSS layer has prompted the reimplementation of all the term suggestion services, as they were previously implemented on a "proof of concept" basis and each had its own database structure. The concept space browser has been the easiest to reimplement, as it retrieves relatively small records that can easily be rendered using RDL (record display language). The KWIC tool has been reimplemented but still has minor problems with large KWIC records; a slight change to the indexing algorithm and to the markup scheme will remedy that. The KWOC tool has yet to be reimplemented; a major issue here is how to optimize the special indexing table used to fetch TSS KWOC objects. Although the thesaurus browser was the first term suggestion service reimplemented to use the TSS layer, work with larger and more complicated thesauri has forced us to redesign the browser and the structure of TSS thesaurus objects. However, this has also created an opportunity to unify how thesauri and classification schemes are encoded, enabling the thesaurus browser to navigate classification schemes (e.g. INSPEC class codes, Dewey) and simplifying the user interface by removing the need for a separate classification navigator. The major issue here is how to optimize storage of large hierarchies, which is not a trivial problem. The other side of this problem is how to manage the rendering of large hierarchies in the browser, and especially how to discard the least recently used parts of a hierarchy when there are many components to display. This must be implemented in a way that at no time causes a disorienting change to the display.
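One common way to discard the least recently used parts of a hierarchy is a small LRU cache, sketched here in Python. The HierarchyCache class and its loader callback are hypothetical names for this example, and the sketch deliberately ignores the harder requirement stated above: it does not protect nodes that are currently visible, so a real browser would need an additional rule to avoid disorienting changes to the display.

```python
from collections import OrderedDict


class HierarchyCache:
    """Keep at most `capacity` hierarchy nodes, evicting the least recently used."""

    def __init__(self, capacity, loader):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader         # hypothetical callback that fetches a node by key
        self._nodes = OrderedDict()  # least recently used entries come first

    def get(self, key):
        if key in self._nodes:
            self._nodes.move_to_end(key)         # mark as most recently used
        else:
            self._nodes[key] = self.loader(key)  # fetch and remember the node
            if len(self._nodes) > self.capacity:
                self._nodes.popitem(last=False)  # evict the least recently used
        return self._nodes[key]

    def cached_keys(self):
        return list(self._nodes)


cache = HierarchyCache(2, lambda key: key.upper())
for key in ["a", "b", "a", "c"]:
    cache.get(key)
print(cache.cached_keys())
```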

For the new implementation of the thesaurus/classification browser there is an opportunity to expand the use of RDL (record display language). RDL was developed especially for use in the IODyne client to allow for configuration of repositories. IODyne uses RDL to display full records (e.g. A&I records) as well as certain TSS tools such as the concept space navigator and part of the thesaurus navigator. More complicated record formats, such as MARC records, present a more complicated rendering problem and call for enhancement of RDL. In a different way, rendering complicated thesaurus and classification hierarchies requires the enhancement of RDL as well. The other important task of RDL is to bind displayed objects to search methods, which is how IODyne "knows" that an information object dragged from a thesaurus hierarchy is a subject search, for example. Enhancement is needed in this area to extend the kinds of attributes that can be bound to an object.

RESEARCH

Last quarter we reported on the discovery and application of new parallelization techniques based in part on the study of cSpace, a parallel C++ application. The parallelization of the outermost loop in cSpace was accomplished based on a transformation involving the re-association of a non-commutative, associative coalescing loop operator.

Research on the parallelization of computer applications continues, and recent developments in the area of coalescing loop operators have led to the determination that 'gather' operations are associative and can be parallelized based on a transformation similar to that presented in section 2.3.8 of [Pottenger 1997].

The following depicts a generic gather operation:

    for (int i=0,j=0; i<n; ++i)
        if (is_true(a[i]))
            // gather
            a[j++] = i;

Conceptually, this pattern is quite common in many sparse and symbolic codes such as those under development in the DLI at Illinois.

The difficulty in parallelizing such a loop occurs because the initial value of j for a given iteration depends on how often a[i] was true in preceding iterations. However, because gather is an associative coalescing operation, this loop can be transformed into parallel form.
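The associativity claim can be checked on a small example. The Python sketch below is an illustration only (the gather helper and the sample flags are made up for this purpose, and indices are 0-based): it gathers three consecutive pieces of an array separately and shows that any grouping of the in-order concatenation reproduces the serial result, while reordering the pieces does not.

```python
def gather(flags, offset=0):
    """Return the indices (shifted by `offset`) whose flag is true, in order."""
    return [offset + i for i, flag in enumerate(flags) if flag]


flags = [0, 1, 1, 0, 1, 0, 0, 1]
x, y, z = flags[:3], flags[3:5], flags[5:]

gx = gather(x)
gy = gather(y, offset=3)
gz = gather(z, offset=5)

# Associative: how the pieces are grouped does not matter ...
left = (gx + gy) + gz
right = gx + (gy + gz)

# ... but non-commutative: swapping two pieces changes the result.
swapped = gy + gx + gz

print(left == right == gather(flags), swapped == left)
```

Each piece can therefore be gathered independently, provided the pieces are later combined in their original order.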

The following is a simple (serial) Fortran version of a gather operation:

      program gather
      integer a(1000)
C     initialize with sample scatter pattern
      do j = 1, 1000
        if (mod(j,2) .eq. 0) then
          a(j) = 1
        else
          a(j) = 0
        endif
      enddo
C     gather: a(j) receives the index of the j-th true entry
      j = 0
      do i = 1, 1000
        if (a(i) .eq. 1) then
          j = j + 1
          a(j) = i
        endif
      enddo
      print *, j, a
      end

Parallelizing the 'do i' loop above can be accomplished based solely on the associativity of the 'do i' loop represented as a coalescing loop operator (please see section 2.3.3 in [Pottenger 1997] for a definition of an associative coalescing loop operator), and takes the following form:

      program parallel_gather
C     size and procs are declared integer explicitly; under implicit
C     typing they would be real and could not serve as array bounds
      integer size, procs
      parameter (size=1000,procs=4)
      integer a(size)
      integer a_p(size/procs,procs), j_p(procs)
      integer start_p(procs), end_p(procs)
C     initialize with sample scatter pattern
      do i = 1, size
        if (mod(i,2) .eq. 0) then
          a(i) = 1
        else
          a(i) = 0
        endif
      enddo
C     initialize privatized-thru-expansion induction variable j
      do i = 1, procs
        j_p(i) = 0
      enddo
C     gather into privatized-thru-expansion local arrays a_p
C$DOACROSS local(i), share(a,j_p,a_p)
      do i = 1, size
        if (a(i) .eq. 1) then
          j_p(mp_my_threadnum()+1) = j_p(mp_my_threadnum()+1) + 1
          a_p(j_p(mp_my_threadnum()+1),mp_my_threadnum()+1) = i
        endif
      enddo
C     serially determine bounds for each locally gathered a_p
C     (per step 3 in "Transformation of op&append", thesis section 2.3.8)
      start_p(1) = 1
      do i = 2, procs
        start_p(i) = start_p(i-1) + j_p(i-1)
      enddo
      do i = 1, procs
        end_p(i) = start_p(i) + j_p(i) - 1
        j_p(i) = 0
      enddo
C     coalesce processor-private slices a_p into one conglomerate a
C$DOACROSS local(i,j)
      do i = 1, procs
        do j = start_p(mp_my_threadnum()+1), end_p(mp_my_threadnum()+1)
          j_p(mp_my_threadnum()+1) = j_p(mp_my_threadnum()+1) + 1
          a(j) = a_p(j_p(mp_my_threadnum()+1),mp_my_threadnum()+1)
        enddo
      enddo
      print *,j_p
      print *,a_p
      print *,a
      end

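For simplicity we have assumed that the number of threads divides the array length (n in the generic loop, size in the Fortran listing) evenly. The listing also relies on each thread executing one contiguous block of iterations, so that the private sections preserve the original order. In the above transformation, mp_my_threadnum() is the thread-id of a given parallel thread. The array a is first initialized with a pattern of true-false values (0's and 1's) representing a 'scattered' array. The induction variable j and the array a are then privatized through expansion and j_p is initialized.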

The main 'do i = 1, size' loop performs a gather operation on each private section a_p of the array a (this corresponds to the 'doevery' loop in the transformation presented in section 2.3.4 of [Pottenger 1997]). This loop is executed in parallel as a 'doall' (dependence free) loop.

The next section can be performed either serially or in parallel, and computes the size of each gathered section a_p. (This corresponds to the third step of the transformation discussed in section 2.3.8 in [Pottenger 1997].) Following this, the privatized sections are coalesced into the global conglomerate a based on associativity alone (i.e., the operands a_p are not re-ordered).

The entire transformation is based on the fact that 'gather' is an associative, non-commutative coalescing operation.
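The three steps described above (private gathers, serial computation of bounds, in-order coalescing) can also be sketched in Python. This is an illustrative re-expression under stated assumptions, not the transformation as implemented for the DLI codes: parallel_gather, local_gather, and the use of a thread pool are choices made for this example, indices are 0-based, and the final copy is written serially here even though the slices are disjoint and could be copied in parallel as in the Fortran listing.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def parallel_gather(flags, workers=4):
    """Gather the indices of true flags using per-worker private buffers."""
    n = len(flags)
    # Contiguous block partition; unlike the Fortran listing, the block
    # sizes may differ when workers does not divide n evenly.
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]

    def local_gather(block):
        lo, hi = block
        # Private result for this block (plays the role of a_p and j_p).
        return [i for i in range(lo, hi) if flags[i]]

    # Step 1: independent local gathers, one block per task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        private = list(pool.map(local_gather, bounds))

    # Step 2: a serial exclusive prefix sum of the local counts gives
    # each block's starting position in the result (start_p).
    counts = [len(p) for p in private]
    starts = [0] + list(accumulate(counts))[:-1]

    # Step 3: copy each private buffer into its slot; the buffers are
    # combined in their original order and never re-ordered.
    out = [None] * sum(counts)
    for start, p in zip(starts, private):
        out[start:start + len(p)] = p
    return out


flags = [i % 2 == 1 for i in range(1000)]
result = parallel_gather(flags)
print(len(result), result[:5])
```

The only serial work that grows with the problem is the prefix sum over one count per worker, which is why the dependence on j does not prevent the main loop from running in parallel.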

References

[Pottenger 1997] William Morton Pottenger. Theory, Techniques, and Experiments in Solving Recurrences in Computer Programs. Ph.D. thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, May 1997. www.ncsa.uiuc.edu/People/billp/billp-thesis.ps

The Artificial Intelligence Group at the University of Arizona has continued to refine its visual thesaurus techniques and is now able to extract information from more than one aerial photograph. The system has been enhanced to allow generation of visual SOM based on photographic region similarity. A similarity search feature has been added that allows the user to select an image from the SOM and see the associated regions. The user may then select one of those regions and return to the original photograph where the region is displayed in red and all SOM associated regions on that photograph are displayed in blue.

Analysis of the effects of stop words and stop phrases on the Noun Phraser concept space is currently underway. The AI Group is also exploring the application of a Java-based Scalable Self-organizing Map to retrieved documents, giving the user a graphic overview of large sets of retrieved documents. A Java graphical display for the concept space terms is also under development.

PUBLICATIONS

H. Chen, T. Smith, M. Larsgaard, L. Hill, M. Ramsey, "A Geographic Knowledge Representation System for Multimedia Geospatial Retrieval and Analysis," accepted by International Journal of Digital Libraries.

H. Chen, M. Ramsey and B. Zhu, "An Advanced Visual Thesaurus for Browsing Large Collections of Geographic Images," submitted to JASIS Perspectives Issue on Visual Information Retrieval Interfaces.

Laura Neumann, Geoffrey Bowker and Susan Leigh Star, "Things Come Together: Information Convergence," Journal of the American Society for Information Science, in press.

Sandusky, Robert. Book Review of "Information Systems Development and Data Modeling: Conceptual and Philosophical Foundations," by Rudy Hirschheim, H. K. Klein, and K. Lyytinen, Cambridge University Press, Cambridge, 1995. The Information Society: An International Journal, Volume 13, Number 2, April-June 1997.

Susan Leigh Star and James Griesemer, "Institutional Ecology, 'Translations,' and Coherence: Amateurs and Professionals in Berkeley's Museum of Vertebrate Zoology, 1907-1939," Science Studies Reader, Mario Biagioli, ed. Cambridge, MA: Harvard University Press, in press (reprint of 1989 article).

Susan Leigh Star and Anselm Strauss, "The Dialogues between Visible and Invisible Work: Layers of Silence, Arenas of Voice," Computer-Supported Cooperative Work: An International Journal, in press.

Susan Leigh Star. (White paper - national policy document): Report of our NSF Workshop on Human-Centered Systems: Information, Interactivity, Intelligence. http://www.ifp.uiuc.edu/nsfhcs/

Stefan Timmermans, Geoffrey Bowker and Susan Leigh Star, "The Architecture of Difference: Visibility, Control and Comparability in Building a Nursing Interventions Classification," Differences in Medicine, ed. Marc Berg and Annemarie Mol, Raleigh, NC: Duke University Press, in press.

"NCSA Claims Parallel Code Breakthrough" covers the essence of the HPCWire report, and appeared in High Performance Computing And Communications Week, Volume 6, Number 28, July 28, 1997.

PRESENTATIONS

Bishop, Ann P. "Digital Libraries and the Disaggregation of Knowledge: Use of Article Components," Digital Libraries Initiative national meeting, Pittsburgh, PA, May 5-6, 1997.

Mischo, Wm. "The UIUC Digital Libraries Initiative": Presentation to the UIUC Gateway Committee, University of Illinois Library, Urbana, IL, May 15, 1997.

Sandusky, Robert. "Retrieving the User: Surveillance in Digital Libraries", American Society for Information Science Mid-Year Conference, Scottsdale, AZ, June 1-4, 1997.

Star, S. Leigh. Guest Lecturer, Invitational Workshop on "The Ecology of Infrastructure," Multimedia Laboratory, Department of Computer Science, University of Jyväskylä, Finland, May, 1997.

Star, S. Leigh. Organizer and presenter, International workshop on "Knowledge Work," Case Western Reserve University, Cleveland, June, 1997.

Twidale, Michael, D. Nichols, J. O'Brien, and R. Sandusky. "Collaboration in the Digital Library", Workshop at ACM Digital Libraries '97, Philadelphia, PA, July 23-26, 1997.

PROFESSIONAL ACTIVITIES

Ann Bishop participated in NSF's invited workshop on "Human Dimensions of Knowledge Networking: Access, Usability, Impact," held in Santa Barbara, CA, June 18-20.

Star, S. Leigh. Participated in IGPA, discussion group organized by Dan Alpert, "The Future of the University in a Digital Age."

VISITORS

May 20

Claire Vishik, Research Scientist, Schlumberger

May 21

Carolyn Snyder, Director of Libraries, Southern Illinois University

May 22

Louise Zipp, Assistant Director for Collection Development, Iowa State University

Robert Allen, Purdue University, Physics Librarian

May 29

Richard Griscom, Music Librarian, University of Louisville

June 9

Patsy Hulse, Queenslands University, New Zealand

June 17

Dr. Mohammed J. Ashoor, Dean of Library Affairs, King Fahd University of Petroleum and Minerals

Dr. Elmar Krause, Library Services, University of Humboldt, Berlin

July 17

Mike Stout, Head of Electronic Journals, Oxford University Press

July 18

Donna Hull, Director of Books and Journals, American Society of Agricultural Engineers (ASAE)

July 25

Prem Singh, University Librarian and Director of Publications, CCS Haryana Agricultural University, Haryana, India

July 31

Sandy Wilcox, Head of the University of Wisconsin Foundation