To enhance the usability of the testbed materials and databases and address initial security concerns, we've implemented several locally-developed procedures and processes. Five of these customizations (pre-index filtering, metadata creation, tag aliasing, the handling of embedded TeX math, and preliminary digital signature implementation) are summarized below. From a research and development standpoint, the specific techniques used aren't as important as the issues dealt with through these customizations; there are in fact other approaches to dealing with many of these issues being considered and researched as part of the overall UIUC DLI project.
I. Filtering During Database Builds:
While support in SGML for character entities and system-specific processing instructions can facilitate accurate rendering of materials, the use of such features can complicate indexing and searching of SGML document instances. For example, consider author last name "O'Grady," author last name "Özbay," the text "...Multiple wavelength Fabry-Pérot lasers ...," and the case where the text "... regions resemble the usual Voronoi polyhedra ..." requires special justification by the typesetting system during original print publication. In SGML these 4 instances (based on actual materials received from our publishing partners) may be represented as:
<SURNAME>O&rsquo;Grady</SURNAME>
<SURNAME>&Ouml;zbay</SURNAME>
...Multiple wavelength Fabry&ndash;P&eacute;rot lasers...
...regions resemble<?Pub Tag justify>the usual Voronoi polyhedra...
Of course, this is not how end-user searchers would search for these names and phrases. Users would search for author "Ogrady" or perhaps "O'Grady." Unless a special interface facility were created for entering diacritics, they'd search for author "Ozbay." They'd search for "Fabry-Perot lasers" or possibly "Fabry Perot lasers." They'd search for "region" near "Voronoi polyhedra" (where near is an implicitly or explicitly entered adjacency operator meaning within a specified number of words or characters).
To handle cases such as these, it's essential to build into the search database system and/or the search client the capability to translate between user search strings and SGML mark-up. Unfortunately, a complete solution is not possible using a client-only approach, both because information may be missing from the submitted search query (e.g., diacritics) and because there are too many different ways to code text in SGML for such an approach to be practical. Most of the work to eliminate the undesirable impact of SGML mark-up on search and retrieval must therefore be done at the indexing stage.
While the OpenText query language does not have built-in facilities for eliminating the impact of SGML entities and processing instructions on phrase and adjacency searching (nor does any other query language we've found to date), it does have the built-in capability to filter SGML documents immediately prior to indexing. This filtering is done "on the fly" as SGML document instances are being indexed. No alteration of the stored SGML files is made (i.e., during retrieval you can still get the original, unfiltered SGML). OpenText also has features that allow it to ignore specific characters, to map specific characters to different characters for search and retrieval (e.g., to map 'A' to 'a' and '-' to ' '), and to treat strings of multiple spaces as a single space for indexing purposes. Positional attributes of words and phrases within the SGML document instances (i.e., the byte count of where in the file a word or phrase occurs) are preserved, but from a search and retrieval standpoint, even for adjacency searching, it's as if the multiple spaces or ignored characters weren't present.
Thus, we were able to construct a customized filter that turns specified SGML mark-up into either spaces or ignored characters (e.g., apostrophes). Selected ISO entities and publisher-specific entities are translated either to all apostrophes, all spaces, or single characters embedded in a string of apostrophes, as appropriate (a listing of the ISO entity translations currently used is presented in Attachment 1 of this note). All processing instruction tags (i.e., tags beginning "<?Pub") are translated to spaces. In each case the replacement is the same length as the mark-up it replaces, so byte positions within the file are unchanged. When filtered, the above SGML samples go into the OpenText indexer looking like:
<SURNAME>O'''''''Grady</SURNAME>
<SURNAME>'O''''zbay</SURNAME>
Multiple wavelength Fabry       P'e''''''rot lasers
regions resemble                  the usual Voronoi polyhedra.
Because we told OpenText to map apostrophes to nulls, to equate uppercase and lowercase characters, and to map hyphens to spaces, and because OpenText automatically ignores redundant spaces, these examples can all be successfully retrieved using the example end-user searches given above.
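The filtering and character mapping described above can be sketched in a few lines of Python. This is only an illustration, not the filter used in the testbed: the four-entry ENTITY_MAP is a hypothetical stand-in for the translations listed in Attachment 1, the function names are our own, and index_key only imitates the OpenText character mappings (it is not an OpenText API).

```python
import re

# Hypothetical subset of the entity translations (the real list is in Attachment 1).
# ""   -> entity becomes all apostrophes (ignored characters)
# None -> entity becomes all spaces
# "X"  -> entity becomes the single character X embedded in apostrophes
ENTITY_MAP = {"rsquo": "", "Ouml": "O", "eacute": "e", "ndash": None}

ENTITY_RE = re.compile(r"&([A-Za-z0-9]+);")
PI_RE = re.compile(r"<\?Pub[^>]*>")  # processing instructions beginning "<?Pub"


def _filter_entity(match):
    whole, name = match.group(0), match.group(1)
    if name not in ENTITY_MAP:
        return whole  # unknown entities are left untouched
    repl = ENTITY_MAP[name]
    if repl is None:
        return " " * len(whole)
    # Pad with apostrophes so the replacement keeps the entity's length.
    return "'" + repl + "'" * (len(whole) - 1 - len(repl))


def pre_index_filter(sgml):
    """Length-preserving filter applied just before indexing."""
    sgml = PI_RE.sub(lambda m: " " * len(m.group(0)), sgml)
    return ENTITY_RE.sub(_filter_entity, sgml)


def index_key(text):
    """Imitate the index-time mappings: drop apostrophes, fold case,
    map hyphens to spaces, and collapse runs of spaces."""
    text = text.replace("'", "").replace("-", " ").lower()
    return " ".join(text.split())


filtered = pre_index_filter("Fabry&ndash;P&eacute;rot lasers")
# The filtered text and the user's query reduce to the same key.
same = index_key(filtered) == index_key("Fabry-Perot lasers")
```

Because every replacement has the same length as the mark-up it replaces, byte offsets computed on the filtered text still point into the original, unfiltered SGML file.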
Finally, we also used the same filtering process to filter out document instance SGML DOCTYPE and file entity declarations, which are needed by SoftQuad's Panorama for proper rendering but are not allowed in input SGML by the OpenText indexing system.
II. External MetaData:
While the concept of digital object metadata as a 'good thing' has broad acceptance, the exact definition of what should comprise metadata remains nebulous. Among other things, we see metadata as a means to help meet 2 immediate needs:
1) By definition, documents in a distributed federated document system will be distributed on different servers. Index databases will also be distributed, but it should not be assumed that the distribution of index databases will exactly match or parallel the distribution of documents -- i.e., we must allow for an index database, maintained say at a library, to index and point to documents maintained remotely, say at a publisher. This should be true even if the index supports full-text searching. However, it is also desirable that more than just a pointer to the full document should be returned from such an index system in response to a successful search query. Some minimal information about the document, such as its title, an abstract, a statement of responsibility, rights and restrictions, etc. -- e.g., in library terms, a short bibliographic citation -- should be available quickly and easily from the index database system. Additionally, it would be desirable to have readily available information about external linkages associated with the retrieved document -- e.g., to other works cited, to associated figures, etc. This latter information is useful both to provide more retrieval options and also to facilitate establishment and maintenance of linkages into and out of documents.
2) Additionally, we also have found it necessary to do some of the more complex normalization of publisher DTDs by incorporating certain information in a standard format within document metadata. This tends to increase the size of the metadata for each document, but appears to be the optimum approach where information is not included directly in document instances, or is included in such diverse forms as to make normalization through tag and region aliasing (see section III of this note) difficult or inefficient.
In the absence of clear community consensus as to exactly what should be contained in a document's metadata, we've developed a working metadata structure for the purposes of the UIUC DLI testbed. The current <uimeta> hierarchy is given in Attachment 2 of this note. We currently generate this metadata for each SGML document instance we receive from our publisher partners using customized software. The OpenText indexing system has the ability to incorporate this metadata structure at the time of index build -- thus allowing searching of both metadata and documents together. Moreover, OpenText allows for retrieving metadata separate from retrieving document content -- thus, we can delete (or move) document instances after indexing while still retaining the ability to not only perform complete full-text searches and retrieve pointers to full content, but also to retrieve document metadata information (of course, once document instances are removed from the index server, we can no longer retrieve non-metadata document content directly from the OpenText index system). We anticipate that the definition of metadata will continue to evolve over the course of the DLI project.
II A. Internal MetaData
Prior to the implementation of external document metadata as described above, we were using an alternate scheme that involved the addition of metadata-like constructs to each individual SGML document instance. Though the need for much of this internal 'metadata' has lessened with the implementation of external metadata in the DLI testbed, we continue to add metadata-like constructs to testbed materials to meet 2 needs.
First, to accommodate certain document integrity/authentication schemes we're investigating, it is useful to include, internal to the SGML, document instance URL and digital signature information. Specific SGML tags have been defined to support the addition of this information. Details of this implementation are described in section V of this note.
Second, in translating articles formatted for print directly to SGML, information desired in stand-alone SGML format may be missing from files being converted to SGML, or information may be ordered differently than desired for presentation by SGML rendering clients. The internal metadata-like constructs we've added both ensure the presence of desired information (e.g., copyright statements, item bibliographic information, etc.) and allow for implementation of appropriately ordered print and display headers. The current set of metadata-like SGML elements we add to testbed documents is described in Attachment 2A of this note.
Note, as with the external metadata structures described in Attachment 2, the structures described in Attachment 2A are largely arbitrary and tailored to the needs of our particular testbed configuration. Many of the functions accomplished by these structures, on the other hand, are generic and necessary. One lesson learned so far in the processing of materials provided by our publishers is that in translating materials formatted for print into SGML to be distributed in stand-alone fashion over the Internet, the rearrangement of information and the addition of information is typically required.
III. Tag Aliasing/DTD Normalization:
The inherent flexibility of SGML, while having numerous benefits, adds significant complexities to effective and efficient search and retrieval across a federated system of document repositories. To date we've received SGML materials from 7 publishers. Each publisher is producing SGML conforming to a different and apparently unique SGML Document Type Definition. Fortunately, there is a degree of logical congruence between the various DTDs, roughly equivalent to the degree of congruence between the logical structure of the materials. All the materials in the UIUC DLI testbed can generically be described as journal article literature, with all that implies -- e.g., an article title, article authors, a body containing paragraph structures and in many cases section and subsection structures, citations to external works, equations, figures, tables, etc. Thus, though the nomenclature varies, the DTDs have generally analogous structures.
This has allowed us to effectively "normalize" document structure for our federated collection of scientific journal literature in SGML for searching purposes. A set of key SGML tag structures has been identified, and normalized Data Dictionaries (the files which control OpenText search and retrieval processes) have been built (or designed) for 6 of the DTDs provided by our publishing partners to date. The current "DLI standard" set of tags is shown in Attachment 3 of this note, along with the corresponding tags or pseudo-tags for each of the different publisher DTDs.
One of the main advantages of this approach is that the normalization is done at index build, thereby making it unnecessary to build such normalization into the client with the associated performance penalties at the time of search and retrieval. (Such an approach would be appropriate if trying to search across dissimilar federated collections -- e.g., searching simultaneously an image database and a journal literature database. Not only index structure but protocols might vary in such cases. For that reason research into client and gateway normalization approaches is being conducted in other parts of the UIUC DLI project.)
A limitation of the current normalization approach is that compromises in naming conventions were necessary. Because some of our DTDs used the same tag name (<TITLE>) for article title, figure caption, cited article title, etc., all the normalized data dictionaries had to follow suit. Thus the <ATL> (cited article title) and <LEGEND> (figure caption) tags of another publisher were combined and aliased together into a single tag name in their normalized Data Dictionary. However, because OpenText searching of SGML structures takes into account hierarchy as well as tag names, this does not reduce search capability -- e.g., a search for "region <LEGEND> including the word nanostructure" becomes "(region <TITLE> including the word nanostructure) within region <FIGURE>." The same results are obtained. Additionally, unique structures are maintained in each publisher's database segment, allowing for more specialized searching over a publisher-specific domain.
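The aliasing idea can be illustrated with a small Python sketch. It is an assumption-laden toy, not the OpenText Data Dictionary format: the publisher names, the alias tables, the <FIG> and <BIB> tags, and the normalize and search helpers are all hypothetical, and a document is reduced to a list of (tag path, text) regions.

```python
# Hypothetical alias tables: publisher tag -> normalized "DLI standard" tag.
ALIASES = {
    "publisher_a": {"TITLE": "TITLE", "FIGURE": "FIGURE"},
    "publisher_b": {"ATL": "TITLE", "LEGEND": "TITLE", "FIG": "FIGURE"},
}


def normalize(publisher, regions):
    """Rename every tag in every region path using the publisher's table.

    regions is a list of (path, text) pairs, where path is a tuple of
    enclosing tag names, outermost first. Unlisted tags are kept as-is.
    """
    table = ALIASES[publisher]
    return [(tuple(table.get(tag, tag) for tag in path), text)
            for path, text in regions]


def search(regions, tag, word, within=None):
    """Find regions whose innermost tag is `tag` and that contain `word`,
    optionally requiring an enclosing `within` region."""
    hits = []
    for path, text in regions:
        if path[-1] != tag:
            continue
        if within is not None and within not in path[:-1]:
            continue
        if word.lower() in text.lower().split():
            hits.append(text)
    return hits


publisher_b_doc = [
    (("ARTICLE", "BIB", "ATL"), "Growth of nanostructure arrays"),
    (("ARTICLE", "FIG", "LEGEND"), "Cross section of a nanostructure"),
]
normalized = normalize("publisher_b", publisher_b_doc)

# "(region TITLE including the word nanostructure) within region FIGURE"
figure_captions = search(normalized, "TITLE", "nanostructure", within="FIGURE")
```

Here the cited article title and the figure caption both end up under TITLE, yet the hierarchy-aware query still returns only the caption.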
IV. Handling of Embedded TeX:
In digitizing scientific literature, an immediate and prominent concern is how best to handle mathematics. The correct rendering of mathematics and similar forms of notation is critical to ensure proper meaning. Though considerable progress has been made, the SGML community has not yet fully resolved the issues related to math rendering. The emphasis of SGML on content over appearance means that it is sometimes difficult for currently available SGML rendering systems to properly render SGML-encoded math.
Some of our publisher partners, anticipating that these issues will be resolved in the near term, are marking up the mathematics in their articles in SGML, using the best available SGML math mark-up schemes. Other publishers have chosen to embed fragments of TeX mark-up within their SGML documents. This latter approach has necessitated that we create a series of customized pre-indexing procedures for such materials:
1) we automatically generate GIF image files for every embedded TeX occurrence (unless the process recognizes that the TeX fragment can be easily coded directly into SGML -- e.g., a TeX instance that is simply a string in italics);
2) a "HIDE=TRUE" attribute is added to the SGML tag introducing the TeX fragment (a corresponding change is made in the publisher DTD to allow this attribute);
3) at each point where a TeX fragment occurs, an empty SGML tag (<uie> for display mathematics and <uii> for inline TeX instances) with an entity attribute pointing to the name of the GIF file created for that TeX fragment is added to the SGML instance (again with corresponding changes in the DTD);
4) information is added to the style sheet used in Panorama for rendering that publisher's materials to ensure that the original TeX content is hidden and the GIF version of the information is displayed (by hiding rather than eliminating the TeX fragment, the fragment can be indexed for possible search and retrieval); and
5) a "packed" file is created for each document containing the SGML file plus all GIF figures created from TeX instances within that SGML file.
The last step is necessary because of the large number of TeX instances that can occur in a single article (we've had samples from our publisher partners with over 400 TeX instances in a single article SGML file). While Panorama will fetch such GIF files over the network as they are needed, it does so on a one-at-a-time basis. This is unacceptable for anything more than very infrequent use. By packing the SGML and the GIFs together and including an unpack process on the client side, we make it unnecessary for Panorama to make all these separate network fetches. Again, as with the metadata work described above, this is largely a stopgap approach. We expect that either the community will settle on an appropriate multi-part MIME type solution along the lines of our "packed" file approach, or that renderers such as Panorama will learn how to use such features as HTTP "keep-alive" to more efficiently fetch multi-part digital objects.
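Steps 2, 3, and 5 of the procedure above can be sketched in Python as follows. This is a minimal illustration under several assumptions: the <DTEX> and <ITEX> tag names, the ENTITY attribute name, the GIF naming scheme, and the use of a tar archive as the "packed" format are all hypothetical, and the GIF rendering of step 1 is replaced by placeholder bytes.

```python
import io
import re
import tarfile

# Hypothetical publisher tags: <DTEX> for display TeX, <ITEX> for inline TeX.
TEX_RE = re.compile(r"<(DTEX|ITEX)>(.*?)</\1>", re.DOTALL)


def mark_tex(sgml, doc_id):
    """Hide each TeX fragment and add an empty tag pointing at its GIF.

    Returns the rewritten SGML and a list of (gif_name, tex_source) pairs.
    """
    fragments = []

    def rewrite(match):
        tag, tex = match.group(1), match.group(2)
        gif_name = f"{doc_id}-eq{len(fragments) + 1:03d}"
        fragments.append((gif_name, tex))
        pointer = "uie" if tag == "DTEX" else "uii"
        return f'<{tag} HIDE="TRUE">{tex}</{tag}><{pointer} ENTITY="{gif_name}">'

    return TEX_RE.sub(rewrite, sgml), fragments


def pack(doc_id, sgml, gifs):
    """Bundle the SGML file and its GIFs (name -> bytes) into one archive."""
    members = [(f"{doc_id}.sgm", sgml.encode("utf-8"))]
    members += [(f"{name}.gif", data) for name, data in gifs.items()]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack(blob):
    """Client-side counterpart of pack(): returns {file name: bytes}."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


source = "<P>Let <ITEX>$x$</ITEX> satisfy <DTEX>$$x^2=1$$</DTEX></P>"
rewritten, fragments = mark_tex(source, "art1")
placeholder_gifs = {name: b"GIF89a" for name, _ in fragments}  # not real images
blob = pack("art1", rewritten, placeholder_gifs)
```

With a single archive the client makes one network fetch and unpacks locally, instead of one fetch per equation image.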
V. DLI-related Network Security:
In providing copyrighted materials via the WWW, security concerns arise. These concerns fall primarily into 4 overlapping categories:
1. Permissions & user authentication: defining who's allowed access to materials and making sure only they are allowed access to those materials.
2. Material integrity & authentication: ensuring that the materials obtained have not been tampered with or corrupted since publication, and have been provided by an authorized source.
3. Viable commercial transactions: ensuring a user paying for materials gets what was paid for in a manner such that neither buyer nor seller can later repudiate transaction success.
4. Privacy & in-transit anti-theft: ensuring that transmittals are private and can't be intercepted in transit for use by an unauthorized user.
All these issues are being addressed by the WWW community at large, but across-the-board, community-wide standard resolution of these issues is some time away. For the UIUC DLI testbed, we are currently dealing with issues 1 and 2. The approaches we're using are specific to DLI in detail, but generally follow our best estimate of current community trends. Issue 3 may be of interest later in the DLI project (assuming the commercial/NetBill trial now under consideration goes forward). Issue 4 will not be addressed directly by the UIUC DLI testbed.
Permissions & User Authentication:
To address issue 1, we have already implemented standard HTTP IP address checking and unencrypted (uuencoded) login/password security on HTTP gateways and document repository servers. We've also implemented standard unencrypted UNIX login/password security on index servers -- maintaining separate accounts and groups for gateways, in-library clients, on-campus clients, and off-campus clients (currently limited to sponsor agencies and publisher partners).
A number of more sophisticated techniques are now being implemented (e.g., the authentication technologies inherent in the SSL and SHTTP protocols), and some of these will be investigated during the course of the DLI project. Long term, of course, responsibility for managing permissions and limiting access to servers and gateways will be distributed along with the document repositories. It is anticipated that a variety of techniques will be used, ranging from the simple approaches described above to techniques much more difficult to defeat.
Material Integrity & Authentication:
Our primary objective in regard to issue 2 is to discourage the inadvertent or intentional modification of testbed documents or the unauthorized mirroring of testbed servers. Though it is presently impossible to absolutely prevent such actions, they can be discouraged by providing easy and convenient mechanisms by which digital documents (and their source) can be validated and by providing a measure of traceability. While rigorous end-user authentication of the servers to which they connect (through such protocols as SSL and SHTTP) will help do this to some extent, it presently appears that digital signature technologies will be required to more fully protect publisher interests. The legal value of digital signatures has been recognized by the ABA and is now being written into law in some states (notably Utah and California).
To model and investigate issue 2 in the context of DLI, we are currently testing various digital signature strategies, using PGP 2.6.2 digital signature/encryption software to generate the signatures. While there remain issues about how to reliably distribute valid public decryption keys widely, private key/public key digital signature utilities such as PGP appear to offer a convenient and effective scheme to verify the authenticity of digital documents.
There are different approaches to implementing digital signatures. Digital signatures that attest to the authenticity of a document can be prepended or appended onto the file they authenticate as part of a file wrapper structure, they can be embedded at a well-defined convenient point elsewhere within the file they authenticate, or they can be created as entirely separate files. Each approach has inherent advantages and disadvantages.
Wrapping a file with delimiters and appending the digital signature at the end of the file is convenient -- both the signature and contents travel together -- and algorithms to sign and check signatures are simple and efficient. Conversely, wrapper and signature will typically have to be removed before the file can be used as originally intended. For instance, Panorama won't open an otherwise correct SGML file that includes a standard PGP wrapper and digital signature. For this reason, signature validation typically only occurs when the document is retrieved. If the document is later passed on or moved (presumably within limitations of copyright), it may be difficult to re-validate. (Presumably this limitation will lessen in time as more flexible clients and/or standard techniques for retaining detached signatures evolve.)
SGML, on the other hand, offers an opportunity to embed digital signature blocks within documents in such a way that they don't interfere with normal utility of the file. (It's important when doing so that some encoding scheme such as PGP's radix-64 'ASCII armor' be used to ensure that no characters in the signature are confused with SGML markup delimiters.) SGML documents could contain a digital signature but still be fully compliant with SGML standards, and therefore usable by Panorama, etc. Document authentication could be done during retrieval, each time the document is rendered, or at any other arbitrary time post-receipt. The SGML document could also store (in theory) the signatures of associated graphic files, which can't store signatures internally due to the more restricted nature of most graphic file formats. The greatest limitations of such scenarios are a lack of current standardization about how (where) to integrate signatures into an SGML document, and the added complexity of algorithms to authenticate documents signed in such a way.
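The embedded-signature idea can be sketched in Python as follows. Two assumptions are important. First, the sketch uses a keyed HMAC-SHA256 from the standard library as a stand-in for a PGP signature; HMAC is a shared-secret scheme, not a private key/public key signature, and is used here only so the example is self-contained. Second, the <uisig> element name and the convention of signing the document with that element left empty are hypothetical, not the tags defined for the testbed.

```python
import base64
import hashlib
import hmac
import re

# Hypothetical element reserved in the document for the signature block.
SIG_RE = re.compile(r"<uisig>(.*?)</uisig>", re.DOTALL)
EMPTY_SIG = "<uisig></uisig>"


def _armored_mac(key, unsigned_sgml):
    """Stand-in for a PGP signature: HMAC-SHA256, base64 encoded so the
    result contains no SGML markup delimiters such as '<' or '&'."""
    mac = hmac.new(key, unsigned_sgml.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(mac).decode("ascii")


def sign(key, sgml):
    """Fill the document's <uisig> element with a signature computed over
    the document with that element left empty."""
    blank = SIG_RE.sub(EMPTY_SIG, sgml, count=1)
    signature = _armored_mac(key, blank)
    return SIG_RE.sub(lambda m: f"<uisig>{signature}</uisig>", blank, count=1)


def verify(key, sgml):
    """Recompute the signature over the blanked document and compare."""
    match = SIG_RE.search(sgml)
    if match is None:
        return False  # no signature element present
    blank = SIG_RE.sub(EMPTY_SIG, sgml, count=1)
    expected = _armored_mac(key, blank).encode("ascii")
    return hmac.compare_digest(match.group(1).encode("utf-8"), expected)


document = "<ARTICLE><uisig></uisig><TITLE>Sample</TITLE></ARTICLE>"
signed = sign(b"demo key", document)
still_valid = verify(b"demo key", signed)
tampered_valid = verify(b"demo key", signed.replace("Sample", "Altered"))
```

The signed document remains ordinary SGML, so it can be rendered as-is and re-verified at any later time by blanking the signature element and recomputing.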
Finally, maintaining signatures separately on the server from the digital documents they sign also has the advantage of supporting document authentication at any arbitrary time in a simple and well understood way, but the decided disadvantage of decoupling the signature from the document. For instance, if a digital document is retrieved, authenticated initially, used, and then later on a user wants to authenticate the document before re-use, it may or may not be obvious where to obtain the signature file, particularly if the document and associated signature file are no longer being served where previously available (or more likely yet have been legitimately changed or updated since the original retrieval).
Once the decision is made how (where) to sign a digital document, there remain the issues of who and when. In delivery of a digital document across the Internet there are typically two provider organizations of interest: the creator (publisher or author) of the document, and the document distributor (organization serving the document). In many cases they'll be synonymous, but the possibility of authorized agents that are allowed to distribute documents of organizationally separate and distinct publishers must be accommodated. The digital signature of the creator, typically made prior to deposit of the document with the distributor, authenticates the content of the digital document; the digital signature of the distributor, typically made at the time the document is served, authenticates the distribution of the document from a particular distribution point. Ideally there should be corroboration possible between the 2 signatures (e.g., the distribution point URL or equivalent is embedded in the document signed by the publisher, implying that the distributor is authorized by the publisher to distribute the document).
Preliminary Testing of Digital Signatures in UIUC DLI Testbed
It's unclear at this time exactly how digital signatures will be implemented on a large scale to support web publishing of the sort being studied in this project. It should also be noted that the techniques and issues discussed above overlap and are in large measure a subset of the technologies that will likely be used to support certain kinds of commercial transactions. The UIUC DLI testbed is currently testing several simplified implementations of digital signatures to better understand and investigate the implications of digital signature technologies in the context of web publishing, and to provide an acceptable initial level of security for the integrity of testbed materials. These are very preliminary, proof-of-concept investigations at this time, pending more standardized, community-generated solutions in the future. Among the techniques being studied:
- Integrating 'creator' and 'distributor' signatures into the body of testbed SGML documents. The documents remain fully SGML compliant and viewable in Panorama. Document integrity can be verified (via an appropriate customized client application) not only on receipt but also at any later time (even while viewing the document in Panorama).
- Implementing 'wrapper-style' digital signatures on non-SGML graphic files. It is necessary to 'unsign' such files before use, but they can be unsigned using standard, widely available PGP software. A simple (but at present still custom) client is necessary to do this unsigning in conjunction with SGML rendering systems such as Panorama.
- Implementing 'wrapper-style' digital signatures on the 'packed' SGML/equation GIF files described in Section IV of this note. The SGML files involved may have internal signatures as well. In this case files must be both 'unsigned' and 'unpacked' before use. A custom Web Browser helper application is required in this case.