Semantic Federation from Distributed Repositories of
Scientific Literature
Bruce Schatz, Principal Investigator
University of Illinois DLI project
thabing@uiuc.edu
DLI Project-Wide Workshop
November 10, 1995 Santa Barbara, CA
Levels of Federation
- Syntactic
- connection protocols (translation gateways)
- Structural
- field names (query normalization)
- field values (tag normalization)
- Semantic
- context (term co-occurrence)
- meaning (content parsing)
Testbed Federation
- Index with Document Structure
Tag normalization for field values
- Deposit with common tags after transform
problems with sections and with authors
- Search across multiple repositories
Query normalization for field names
- Gateway maps multiple protocols
problems with distribution and definition
- Display integrates multiple views
multiple sources at multiple levels
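The two structural steps above (common tags at deposit time, query normalization at search time) can be sketched as follows. This is a minimal illustration, not the testbed's actual implementation: the repository names, the field names, and the `FIELD_MAP` table are all hypothetical.

```python
# Hypothetical mapping from each repository's local field names to common tags.
FIELD_MAP = {
    "repo_a": {"au": "author", "ti": "title", "ab": "abstract"},
    "repo_b": {"creator": "author", "headline": "title", "summary": "abstract"},
}


def normalize_record(repo, record):
    """Deposit-time transform: rename a record's local fields to the common tags."""
    mapping = FIELD_MAP[repo]
    return {mapping.get(field, field): value for field, value in record.items()}


def translate_query(repo, query):
    """Search-time transform: rewrite common field names into a repository's own."""
    reverse = {common: local for local, common in FIELD_MAP[repo].items()}
    return {reverse.get(field, field): value for field, value in query.items()}


if __name__ == "__main__":
    print(normalize_record("repo_a", {"au": "Schatz", "ti": "Semantic Federation"}))
    print(translate_query("repo_b", {"author": "Schatz"}))
```

Fields with no entry in the table pass through unchanged, which is one simple way to tolerate the section and author mismatches noted above.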
Semantic Retrieval
- automatic indexing of concepts
- find context of phrases within documents
- generates a concept space based on term frequency
- useful for interactive searching
- given a term, can suggest other terms
- merging concept spaces supports vocabulary switching
- concepts require supercomputing
- concept space for INSPEC took 1 day on SGI Challenge
- co-occurrence matrix for 400K abstracts
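A toy version of the idea above, building a concept space from term co-occurrence and using it for term suggestion, might look like this. The weighting (co-occurrence count divided by the document frequency of the source term) is an assumption chosen for simplicity; the slides do not specify the formula actually used, and the sample documents are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(documents):
    """Map each term to its co-occurring terms with an asymmetric weight.

    documents: iterable of term lists, one list per document.
    Weight of b given a = (documents containing both) / (documents containing a).
    This weighting is an illustrative assumption.
    """
    doc_freq = Counter()
    pair_freq = defaultdict(Counter)
    for terms in documents:
        unique = sorted(set(terms))
        doc_freq.update(unique)
        for a, b in combinations(unique, 2):
            pair_freq[a][b] += 1
            pair_freq[b][a] += 1
    return {
        a: {b: n / doc_freq[a] for b, n in neighbors.items()}
        for a, neighbors in pair_freq.items()
    }


def suggest(space, term, k=3):
    """Given a term, suggest up to k other terms, strongest first."""
    neighbors = space.get(term, {})
    return sorted(neighbors, key=lambda b: (-neighbors[b], b))[:k]


if __name__ == "__main__":
    docs = [["worm", "gene", "sperm"], ["worm", "gene"], ["fly", "gene", "sperm"]]
    space = build_concept_space(docs)
    print(suggest(space, "worm"))
```

The pairwise loop is quadratic in the number of distinct terms per document, which hints at why the full computation over hundreds of thousands of abstracts was expensive.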
Publishing Cycle
USER: request
LIBRARY: reference
INDEXER: classify
PUBLISHER: quality
AUTHOR: generate
- users are authors, computers are publishers
- every community has a repository
- a billion repositories on the Net !!
Vocabulary Switching
- fine-grained concept spaces
for every community and subcommunity
- user and collection modeling
choose domains for user and for search
- interactive vocabulary switching
intersect at common terms to suggest across domains
- supercomputers as time machines
personal computers will do the same computations in 5-10 years
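Intersecting two concept spaces at common terms can be sketched as below. This assumes each concept space is a dictionary mapping a term to its weighted related terms; the scoring rule (product of the two link weights) and all sample terms and weights are invented for illustration.

```python
def switch_vocabulary(source_space, target_space, term, k=3):
    """Suggest target-domain terms for a source-domain term.

    Bridge terms are neighbors of `term` in the source space that also
    appear in the target space. Each bridge contributes its target-side
    neighbors, scored by the product of the two link weights (an
    illustrative assumption, not a published formula).
    """
    source_neighbors = source_space.get(term, {})
    bridges = set(source_neighbors) & set(target_space)
    scores = {}
    for bridge in bridges:
        for candidate, weight in target_space[bridge].items():
            if candidate in source_space:
                continue  # already part of the user's own vocabulary
            score = source_neighbors[bridge] * weight
            scores[candidate] = max(scores.get(candidate, 0.0), score)
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]


if __name__ == "__main__":
    # Toy spaces with made-up terms and weights; "sperm" is the shared term.
    worm_space = {
        "worm_term_x": {"sperm": 0.8, "worm_term_y": 0.6},
        "sperm": {"worm_term_x": 0.5},
        "worm_term_y": {"worm_term_x": 0.4},
    }
    fly_space = {
        "sperm": {"fly_term_a": 0.9, "fly_term_b": 0.7},
        "fly_term_a": {"sperm": 0.6},
        "fly_term_b": {"sperm": 0.5},
    }
    print(switch_vocabulary(worm_space, fly_space, "worm_term_x"))
```

A term with no neighbor shared between the two spaces yields no suggestions, so the quality of switching depends on how many connection terms the domains have in common.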
Switching Experiments
small-scale in molecular biology (JASIS)
- worms and flies
- 5000 documents generate each space
- 10 hours per space on a workstation
- sperm as connection term
large-scale in engineering (in progress)
- 3M abstracts from Compendex and Inspec
- 15 large domains of engineering (200K per space)
- 10 hours per space on a supercomputer
- fluid dynamics as connection term
Community Repositories
- User-driven Community Searching
- choose topics (repositories) you know
- choose topics you want to know
- vocabulary switch across domains
- community specific term suggestion
- Interspace simulation with 1000 communities
- Compendex partitioned by class codes
- 3M abstracts and 1K spaces on Convex Exemplar
first crack at large-scale semantic retrieval
Computer-Assisted Indexing
- domain experts but classification amateurs
- large-community A&I is too general and out of date
- small-community A&I is inconsistent and labor-intensive
- useful for interactive subject classification
- automatic suggestions for potential classifications
- domain expert culls list from controlled vocabulary
- semi-automatic support via concept spaces
- concept dictionary of tag words from co-occurrence
- tag frequency in documents determines classification
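The semi-automatic step above, where tag frequency in a document drives suggested classifications for a domain expert to cull, could be sketched as follows. The concept dictionary here is hand-written and hypothetical; in the approach described it would be derived from co-occurrence.

```python
from collections import Counter


def suggest_classifications(document_terms, concept_dictionary, k=3):
    """Rank classifications by how often their tag words occur in a document.

    concept_dictionary: classification label -> set of tag words
    (hypothetical entries; the real dictionary would come from co-occurrence).
    Returns up to k labels with a nonzero score, highest score first.
    """
    counts = Counter(document_terms)
    scores = {}
    for label, tag_words in concept_dictionary.items():
        score = sum(counts[word] for word in tag_words)
        if score:
            scores[label] = score
    return sorted(scores, key=lambda label: (-scores[label], label))[:k]


if __name__ == "__main__":
    dictionary = {
        "fluid dynamics": {"flow", "turbulence", "viscosity"},
        "heat transfer": {"convection", "conduction", "flow"},
        "control systems": {"feedback", "controller"},
    }
    doc = ["flow", "turbulence", "flow", "viscosity", "convection"]
    print(suggest_classifications(doc, dictionary))
```

The output is only a candidate list: the expert still accepts or rejects each suggestion against the controlled vocabulary.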
Building the Interspace
- every machine has its own information space
- every machine has its own concept space
- spaces for every user and every community
- search is matching selected objects
- relies on computer-assisted A&I
- analysis is merging community spaces
- vocabulary switch through graph intersect
The 21st Century: Analysis
- Beyond Search to Analysis
- Cross-Correlating Information from many sources across the Net
- The Net solves problems
- Every community has its own special library
- Every community & every person does A&I!!