In order to address these issues, the group proposes that the software applications comprising the Digital Library be instrumented to collect data on DL usage. This proposal contains two different approaches to data collection and analysis:
Transaction logs will be examined on an individual session basis.
Transaction logs contain much more local detail about usage than the statistical summaries. Both methods will be supplemented with other social scientific methods such as observations and user surveys. Transaction logging has already been built into the prototype full text search being developed by Bill Mischo. The transaction logs will be collected only during specified test periods, while the statistical summaries will be generated on an ongoing basis.
Each user is required to register for each session of the DL to participate in the logging study and no unregistered user can access a document in the DL.
Each session begins by asking the user to either:
Registration will explain that all users must participate in the study, and that data will be used as feedback for further developments in the system.
The following demographic and contact data will be collected from the user to describe the user and to allow follow-up surveys:
A user name, password, and participant ID (sequential) will be assigned from the server at registration.
User name, name, contact information and participant ID will be securely stored in one place.
Participant ID and demographics will be stored in a different place and will be relatively open for analysis.
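The split between the two stores can be sketched as follows. This is only an illustration under stated assumptions: the class name `Registry`, its field names, and the demographic answer `status` are hypothetical, and in-memory dictionaries stand in for whatever storage the server actually uses.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Registry:
    """Hypothetical registration store keeping identity and demographics apart."""
    identity: dict = field(default_factory=dict)      # restricted: names, contact data
    demographics: dict = field(default_factory=dict)  # open for analysis: no names
    _next_id: count = field(default_factory=lambda: count(1))

    def register(self, user_name, full_name, contact, demographic_answers):
        # Participant IDs are sequential and assigned by the server.
        participant_id = next(self._next_id)
        self.identity[participant_id] = {
            "user_name": user_name,
            "name": full_name,
            "contact": contact,
        }
        # Only the participant ID links the two stores.
        self.demographics[participant_id] = dict(demographic_answers)
        return participant_id


registry = Registry()
pid = registry.register("jdoe", "J. Doe", "jdoe@example.edu",
                        {"status": "graduate student"})
print(pid, registry.demographics[pid])
```

Because the demographics store holds nothing but the participant ID and the survey answers, it could be handed to analysts without exposing names or contact information.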
**Alternatives:
Use logins only, with no password. Users sometimes dislike new passwords for each system. Perhaps use social security number only. Do we need that much security? Use login as participant ID.
Or, ID cards can be given to users, who would scan in the card at the beginning of the session. The card is returned at the end of a session. This would make logging a session relatively easy.
This type of registration and administrative information will be sent to the user from the server, via the DL home page in Mosaic. There will be no option for the user to change the default home page in the DL version of NCSA Mosaic.
There has been discussion of occasional online surveys during the project, each set up so that the user must complete it before accessing the DL. This might be accomplished by replacing the home page with the survey. However, the open URL feature would allow the user to avoid the survey.
An invisible stamp is being considered to attach the user id to each request made from a client while a user is logged in. Would this invisible stamp be ignored by servers outside the digital library so that users could retrieve other documents from the Internet?
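One possible form of such a stamp is an extra HTTP request header, sketched below. The header name `X-DL-User-ID` and the host set `DL_HOSTS` are assumptions made for illustration, not part of any existing Mosaic feature. HTTP servers generally ignore header fields they do not recognize, but this sketch sidesteps the question by attaching the stamp only to requests addressed to DL hosts.

```python
from urllib.parse import urlsplit
from urllib.request import Request

DL_HOSTS = {"www.machine.name"}   # assumption: hosts that belong to the DL
STAMP_HEADER = "X-DL-User-ID"     # hypothetical header name, not a standard


def build_request(url, user_id):
    """Return a request that carries the user ID only when it targets a DL host."""
    request = Request(url)
    if urlsplit(url).hostname in DL_HOSTS:
        request.add_header(STAMP_HEADER, str(user_id))
    return request


inside = build_request("http://www.machine.name/dli/doc", 17)
outside = build_request("http://example.org/page", 17)
print(inside.header_items())
print(outside.header_items())
```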
Data will be collected to examine the use of the alpha version of the Digital Library. Location of these data collection sites is still to be decided. At least two stages of instrumentation/implementation are planned. It is expected that the alpha version will be implemented in 'public' Grainger sites and on a small number of individual office computers. The beta version will be installed on many private desktops as well as in public sites in Grainger and perhaps other locations. The instrumentation and the DL beta version will differ from the alpha version based on previous findings. This way there will be two distinct deployments of the instrumented systems, and usage can be compared at least within version.
Data will be collected primarily by instrumenting the NCSA Mosaic client software; some data will be collected by instrumenting the NCSA Mosaic server software. Data will allow analysis of:
*document refers to 'file', 'URL', or 'part of article' accessible to user in one piece.
Data will be collected and scripts will be written to automatically generate reports about the following (per week, or month)
* type of document, or part of article, will be defined by the fields in SGML: title, section titles/Headings, index terms/keywords, subject descriptors, abstract terms, figure captions, text of an article, author names, author addresses, author biographical information, organization names, bibliography within an article, hypertext links to other documents.
The problem is that right now it is very memory and labor intensive to give the user an option of which fields are displayed or retrieved. The Panorama viewer displays an outline next to the full article. If we are allowed to modify the Panorama code, we can record which fields in the outline a user retrieves.
In order to obtain information about how useful different parts of an article are to the DL user, for the alpha version, 'fields searched' rather than 'fields retrieved or displayed' may have to be a proxy for usefulness of field.
* the transaction logs built into the full text searches will allow analysis of document access in terms of which parts of an article (field) are searched; number of words entered; word relationships used to search; number of articles returned for a search before user displays one article; number of searches that return zero articles; etc.
(** Many of these measures may be limited to documents within the DL, or DL documents may be compared with outside documents. For instance, 'Number of documents in DL accessed' might be changed to 'Number of documents accessed: overall, within the Digital Library, and from outside on the Internet', etc.)
In order to automatically analyze the usage data, each record of a transaction, and the URLs or filenames of each document, will have to be labelled in a standard way.
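A report script for the inside/outside comparison might look like the sketch below. It assumes that DL documents can be recognized by a 'dli' path segment in the URL, as proposed elsewhere in this document, and that each access is available as a (date, URL) pair. The function names and the sample accesses are invented for illustration.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlsplit


def is_dl_url(url):
    """Assumption: DL documents carry a 'dli' path segment."""
    return "dli" in urlsplit(url).path.split("/")


def weekly_access_counts(accesses):
    """Count document accesses per ISO week, split into DL and outside documents."""
    counts = Counter()
    for day, url in accesses:
        iso = day.isocalendar()
        year, week = iso[0], iso[1]
        source = "DL" if is_dl_url(url) else "outside"
        counts[(year, week, source)] += 1
    return counts


# Made-up sample accesses, for illustration only.
sample = [
    (date(1994, 11, 7), "http://www.machine.name/dli/12/abstract.sgml"),
    (date(1994, 11, 8), "http://www.machine.name/dli/12/fig1.caption.sgml"),
    (date(1994, 11, 9), "http://example.org/index.html"),
]
for key, n in sorted(weekly_access_counts(sample).items()):
    print(key, n)
```

Grouping by month instead of week would only require changing the key to `(day.year, day.month, source)`.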
Each Digital Library session begins when the user logs in and ends when the user logs off. A logoff option will be included, or the 'quit' command can function as a logoff. In addition, a timeout will be implemented based on mouse movements: after a specified period with no mouse movement, the user is prompted to confirm that the session should continue. An answer of 'no', or no answer, will automatically log the user off. Note that a 'session' is an artificial construct here, because NCSA Mosaic treats each transaction separately rather than as part of a session. The artificially constructed DL session is used to assign an ID number to the user and to allow record keeping.
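The session rules above can be sketched as a small state holder. This is an illustration only: the class and method names are hypothetical, the 600-second idle limit is a placeholder because the actual period has not been specified, and times are passed in as plain seconds so the logic can be exercised without a real clock or mouse.

```python
class Session:
    """Client-side session sketch: login, activity tracking, idle timeout."""

    def __init__(self, participant_id, login_time, idle_limit=600.0):
        self.participant_id = participant_id
        self.login_time = login_time
        self.last_activity = login_time
        self.idle_limit = idle_limit  # seconds; placeholder value
        self.logoff_time = None

    def note_mouse_movement(self, now):
        """Record activity so the idle period starts over."""
        self.last_activity = now

    def needs_prompt(self, now):
        """True when the user has been idle long enough to be asked to continue."""
        return self.logoff_time is None and now - self.last_activity >= self.idle_limit

    def answer_prompt(self, now, answer):
        """answer is 'yes', 'no', or None when the user did not respond."""
        if answer == "yes":
            self.last_activity = now
        else:
            # 'no' and no answer both end the session.
            self.logoff_time = now


session = Session(participant_id=17, login_time=0.0)
session.note_mouse_movement(50.0)
if session.needs_prompt(700.0):
    session.answer_prompt(700.0, None)
print(session.logoff_time)
```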
The full text search engine prototype includes code for recording transactions. The transaction log of a session includes start and finish times for the session, all options selected, words entered and commands executed by the user, as well as the number of times each word occurs in the DL, and the number of articles resulting from the search. New searches are also noted.
**Will all search engines be run through Mosaic? If yes, then should the search log be captured through Mosaic? How to keep detailed transaction log data separate from the data for statistical analysis?
Transactions conducted through Mosaic during a DL 'session' may be recorded as one entry per transaction and recorded on the client and sent to some "DATA STORAGE CENTRAL" server.
For instance, each record might include
User ID, host IP, date, time, command, document URL
Commands:
** When will the records be sent to data storage central?
After each transaction? At the end of a session when a user logs off, or is logged off? If the records are sent at the end of a session, the session might be marked with login time and logoff time at the beginning and end of the session. Or a record could be sent indicating login and logoff with times.
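To make the options concrete, the sketch below shows one of them: records are buffered on the client and sent as a batch when the session ends, with login and logoff written as ordinary records. The record fields follow the list given above; the class names, the comma-separated format, and the command names 'login', 'open', and 'logoff' are assumptions chosen for illustration.

```python
import csv
import io
from dataclasses import astuple, dataclass


@dataclass
class TransactionRecord:
    user_id: int
    host_ip: str
    date: str
    time: str
    command: str        # hypothetical command names are used below
    document_url: str


class SessionLog:
    """Buffers records on the client and serializes them when the session ends."""

    def __init__(self):
        self.records = []

    def add(self, record):
        self.records.append(record)

    def flush(self):
        """Return the buffered records as CSV text and clear the buffer."""
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        for record in self.records:
            writer.writerow(astuple(record))
        self.records.clear()
        return out.getvalue()


log = SessionLog()
log.add(TransactionRecord(17, "192.0.2.10", "1994-11-07", "09:15:00", "login", ""))
log.add(TransactionRecord(17, "192.0.2.10", "1994-11-07", "09:20:31", "open",
                          "http://www.machine.name/dli/12/abstract.sgml"))
log.add(TransactionRecord(17, "192.0.2.10", "1994-11-07", "09:42:05", "logoff", ""))
payload = log.flush()  # text that would be sent to the central data store
print(payload)
```

Sending after every transaction instead would amount to calling `flush()` after each `add()`, at the cost of one network message per transaction.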
**Also, the window history/global history from Mosaic should be sent to a file on the server before being deleted, as per request by HCChen.
URLs of documents in the DL should all include 'dli' or something like that to differentiate from URLs outside of the DL.
The file names, or URL, should be standardized to always reflect the type of document, or part of article, or 'field' [in an extension]. This way, a new type of file can be added to the digital library and indicated with its name, without requiring additional client instrumentation code changes.
In addition to the fields used by the testbed development team, the social science team identified the following possible 'types of documents' or 'parts of articles' or 'fields':
*Look into including physical simulations or animations; a database was mentioned on 10/28 (animation, simulation, video, sounds?)
The document names, or URLs should include information about the source journal.
e.g. http://www.machine.name/dli/journal.name.year/vol.article#.document.label.document.type.extension
[or all articles will have an ID number, e.g. /dli/##/doc.label.doc.type.extension]
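A parser for the ID-number form of the naming scheme might look like the following sketch. It rests on assumptions that the proposal has not yet fixed: the path has exactly three segments (/dli/ID/filename), and within the dot-separated filename the last component is the file extension, the one before it is the document type, and everything earlier is the document label. The function name and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def parse_dl_url(url):
    """Return (article_id, label, doc_type, extension), or None for other URLs."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) != 3 or segments[0] != "dli":
        return None  # not a DL document under the assumed layout
    article_id, filename = segments[1], segments[2]
    parts = filename.split(".")
    if len(parts) < 3:
        return None  # label, type, and extension are all required
    *label, doc_type, extension = parts
    return article_id, ".".join(label), doc_type, extension


print(parse_dl_url("http://www.machine.name/dli/12/fig1.caption.sgml"))
print(parse_dl_url("http://example.org/index.html"))
```

With a convention like this, a new document type only needs a new name in the type position; the client instrumentation code would not have to change.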