Desktop Libraries: Personal and Group Information Management
Bob Fleischer
14 October 1987
SUMMARY
Professional, managerial, and most white collar workers maintain personal
and group collections of information. I propose that Digital undertake
a program to understand, develop, and apply technologies for the organization,
management, and retrieval of information in group and personal libraries.
This proposal includes a three-person, three-year program of research,
plus several smaller advanced development and prototyping projects, all
focused on building a base for information organization and retrieval products.
The "desktop library" is a collection of files, notes, correspondence,
images, and other information managed by computer. This management includes
not only storage but also services and structures which support
the organization of that information. The ultimate goal of the "desktop
library" is to facilitate retrieval from possibly large, complex, and shared
information spaces.
With appropriate computer support, the retrieval function can be served
far better than it ever is manually. The available technologies,
and corresponding internal developments, are described in this paper. Digital
is in a uniquely strong position to serve the information needs of knowledge
workers. Through this program these technologies may be investigated and,
in stages, incorporated into a comprehensive information management system.
THE OPPORTUNITY
Every manager and professional, and nearly every white collar worker, gathers
and maintains collections of information either for their own use, or for
use by their group. This collection of information, which I will call the
"personal library" (or "group library") for lack of a better term, consists
of many kinds of information-bearing objects.
Among these kinds of information are:
- informal notes,
- correspondence of special interest, both sent and received,
- drawings, pictures, and other forms of illustration,
- files of structured information such as address and telephone number
  lists,
- formal documents both in preparation and in final form,
- cross references and bibliographies,
- reports, magazines, and books,
- instructional and reference material.
Traditionally this information has been kept in file cabinets, on shelves,
and in index card boxes (and in piles on desks!), but increasingly this
information can be generated and maintained in electronic form. And the
introduction of the computer to the desktop has introduced additional kinds
of information to be maintained, e.g., spreadsheet files and application
programs.
Some of this information is generated by the individual, but most of
it is generated by colleagues, other parts of the organization, or by external
sources. And in many cases the generation and maintenance of these objects
is not a personal task, but actually a collaborative task.
The desktop computer has done much to allow the storing of many of
these kinds of information. And first the shared computer, and now the
networked desktop, have done much to facilitate the sharing of information
objects among colleagues. But today's computer systems still do very little
to support the organization of, and access to, this information.
(With computer-based files, owing to the regular increase of storage
densities, the number of files managed by a user may increase by an order
of magnitude, in a few years' time, without any corresponding increase
in cost or physical space! The system gets "logically" unmanageable even
though it never gets physically out of hand.)
"Does your hard disk drive resemble a zoo? You may have created all
of those subdirectories with an eye towards organization, but now, with
files tucked away in every nook and cranny, you spend hours every month
just cleaning up after them.
"Zoo Keeper by Polaris Software is intended to put you back in control
of your menagerie. It's not a directory program or a DOS shell, but a file
locator." -from a software review in InfoWorld
Emphasis On Retrieval
It must be recognized that the principal objective in organization of information
is to facilitate its subsequent use, i.e., retrieval. When we use the term
"filing" system, or "file cabinet", we are using a term that stresses the
storage function. It might be more useful to think in terms of "retrieval"
systems, for that is the function that ultimately serves the user's ends.
There are traditional approaches to organizing information collections
for retrieval, but they are very labor-intensive. The traditional approach
for organizing large collections of formal documents is the library. A
librarian performs the maintenance tasks of storing, indexing, and pruning
the collection. A librarian also performs the very important task of advising
users about how to conduct their searches.
The smaller collections of information rarely get a dedicated librarian.
A group's or manager's active files, for example, will be maintained by
secretarial staff as only one of many tasks. Indexing, other than the selection
of file drawers and folders, is spotty if done at all. Retrieval often
depends upon the fact that the person who made the file and put the document
into it is also there to retrieve it; and often it depends upon the fact
that the collection is not too large for exhaustive search.
Individual collections are usually poorly maintained. Many people tend
to regard their personal files as a kind of "black hole", into which things
go but never come out.
The computer technology we are developing today will allow more and
more of the information that an individual, group, or corporation uses
to be computer-based. But today's computer-based filing systems are even
more primitive in terms of their basic capabilities than the rather simple
facilities of a library. They do almost nothing to aid retrieval other
than allowing the naming of the storage locations. They could, and should,
do much more. As the size of computer storage systems grows, the situation
only gets worse.
It is my belief that a major new opportunity for computer software and
associated systems lies in the management of group and personal information
collections. The "desktop library" could be as significant as "desktop
publishing" or the introduction of the spreadsheet. Some of what should
be done is so fundamental to the task of the professional and "knowledge-worker"
that it rightly belongs in the realm of system and network services rather
than "applications". As such, these services should be integrated into Digital's plans
and offerings for general-purpose workstations.
"It is thus possible to imagine that the standard word processing system
of the future will come with an attached IR [information retrieval] system,
and that it will process the collection of documents on the WP system."
- Michael Lesk, Bell Communications Research and University College London
AVAILABLE TECHNOLOGIES
There are several available technologies which could be brought together
to address this problem area. Some of these technologies have been available
for many years, but only now is the need great enough and the power of
economical systems high enough to support their commercial application.
Text Retrieval
Document indexing systems allow multiple descriptors, typically keywords
and key-phrases, to be associated with objects in storage. These descriptive
elements can be used singly or in combination to specify selection of a
single document, or a range of documents, for further consideration. Techniques
for maintaining and applying thesauruses are well known and help to allow
the matching of synonymous terms. In addition, spelling errors can be compensated
for by special matching algorithms based on known error characteristics.
Phonetically-based algorithms can help to match terms with similar sounds,
a function that is especially important when trying to match proper names.
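As a rough illustration of phonetically-based matching, the sketch below implements the classic Soundex code in Python. Soundex is chosen here only as a familiar example; the text above does not name a particular algorithm, and the function names `soundex` and `phonetic_matches` are hypothetical.

```python
# Minimal sketch of phonetic name matching using the classic Soundex code.
# Assumption: Soundex stands in for "phonetically-based algorithms" in general.

# Consonant groups that share a Soundex digit.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name):
    """Return the four-character Soundex code for a name ("" if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def phonetic_matches(query, names):
    """Return the names whose Soundex code equals that of the query."""
    target = soundex(query)
    return [n for n in names if soundex(n) == target]


if __name__ == "__main__":
    directory = ["Fleischer", "Fletcher", "Fleisher", "Malone"]
    print(phonetic_matches("Flysher", directory))
```

A query spelled as it sounds can then find a proper name whose exact spelling the user does not remember, at the cost of some false matches that the user must weed out.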
Indexing systems can be classified along several orthogonal dimensions.
Index terms can be derived either directly from the content of the text
object, or from external sources. All significant words of the document
could be indexed (full-text indexing), or only keywords. Keywords and descriptors
can be obtained directly from the human user, algorithmically generated
from the content of the object, or implied from the context in which the
object is created or used.
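The dimensions above can be made concrete with a small sketch. The Python below is an illustration only, under stated assumptions: the class name `DescriptorIndex`, the stopword list, and the sample thesaurus are all hypothetical. It indexes every significant word of the text (full-text indexing), accepts extra keywords supplied by the user, maps synonyms to a preferred term, and answers queries that combine several descriptors.

```python
# Minimal sketch of a descriptor index: full-text terms plus user keywords,
# with a thesaurus that maps synonyms onto one preferred term.
import re
from collections import defaultdict

# Assumption: a tiny illustrative stopword list, not a recommended one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on"}


class DescriptorIndex:
    def __init__(self, thesaurus=None):
        self.postings = defaultdict(set)  # term -> ids of objects carrying it
        self.thesaurus = thesaurus or {}  # synonym -> preferred term

    def _normalize(self, term):
        term = term.lower()
        return self.thesaurus.get(term, term)

    def add(self, doc_id, text="", keywords=()):
        """Index the significant words of the text and any supplied keywords."""
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOPWORDS:
                self.postings[self._normalize(word)].add(doc_id)
        for keyword in keywords:
            self.postings[self._normalize(keyword)].add(doc_id)

    def search(self, *terms):
        """Return ids of the objects that carry every one of the given terms."""
        sets = [self.postings.get(self._normalize(t), set()) for t in terms]
        return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    index = DescriptorIndex(thesaurus={"memo": "memorandum"})
    index.add("doc1", text="Memo on the imaging budget", keywords=["finance"])
    index.add("doc2", text="Letter to the imaging vendor")
    index.add("doc3", text="Budget memorandum for networking")
    print(sorted(index.search("memo", "budget")))
```

Restricting `add` to the `keywords` argument gives keyword-only indexing; the third source of descriptors, the context of creation or use, is not shown here.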
Most commercial development of text retrieval technology has centered
around the problems of very large reference data bases, typified by the
Medlars system for medical data or the Lexis system for legal information.
The emphasis of the "desktop library" program, however, should be to support
the smaller collections that groups and individuals generate in their day
to day operation. The practical constraints on these data bases will be
different from those for large systems. In particular, the maintenance
of these collections must not require dedicated support personnel, nor
may it impose significant overhead on the users.
STARS - The Information Services Group (ISG) at the Customer Support
Center, Colorado Springs (CSC/CS) has developed an application that they
use for doing full text searches of databases. The databases they have
created contain reference information and solutions to problems they have
encountered and solved through the telephone support business. These databases
are shared world wide.
The application is known as the STorage And Retrieval System (STARS).
Currently STARS offers full-text searching together with a keyword
capability. It was developed with emphasis on speed, both in adding
information to the databases and in retrieving related information.
Implicit Descriptive (Context) Indexing
An interesting addition to the technology for using indexed storage came
out of a small prototyping effort at MCC. Bill Jones developed a personal
filing system called the Memory Extender (ME), a continuation of work he
began at Bell Labs. The ME system considers a set of descriptors to be
a "context" for further operations on the filing system. When objects are
created their initial descriptors can be taken from the current context,
thus eliminating the burden of manual indexing for every object. In addition,
when an object is retrieved into a context, its descriptors, which are
weighted, are updated in light of the receiving context.
The context, which at any time can be manually updated, stored, or replaced
by another context, also sets the initial conditions for retrieval operations.
Retrieval in this windowing-based system is accomplished by using the context
to develop a ranked list of objects in the filing system, ranked in order
of similarity to the context. The user then selects the object from that
list.
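The context mechanism can be sketched in a few lines of Python. This is not the Memory Extender's actual algorithm, which is not given here; it is a minimal illustration assuming cosine similarity for ranking and a simple weighted blend for updating descriptors. The names `ContextFiler`, `similarity`, and `learning_rate` are hypothetical.

```python
# Minimal sketch of context-based filing: objects inherit descriptors from
# the current context, retrieval ranks objects by similarity to the context,
# and a retrieved object's weights drift toward the receiving context.
import math


def similarity(a, b):
    """Cosine similarity between two {descriptor: weight} mappings."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ContextFiler:
    def __init__(self, learning_rate=0.25):
        self.objects = {}  # object name -> {descriptor: weight}
        self.context = {}  # the current context, also weighted descriptors
        self.learning_rate = learning_rate  # assumed update strength

    def set_context(self, weights):
        self.context = dict(weights)

    def file(self, name):
        """Create an object whose initial descriptors come from the context."""
        self.objects[name] = dict(self.context)

    def candidates(self):
        """Return object names ranked by similarity to the current context."""
        return sorted(
            self.objects,
            key=lambda n: similarity(self.objects[n], self.context),
            reverse=True,
        )

    def retrieve(self, name):
        """Select an object and move its weights toward the current context."""
        descriptors = self.objects[name]
        for term in set(descriptors) | set(self.context):
            old = descriptors.get(term, 0.0)
            target = self.context.get(term, 0.0)
            descriptors[term] = old + self.learning_rate * (target - old)
        return name


if __name__ == "__main__":
    filer = ContextFiler()
    filer.set_context({"budget": 1.0, "imaging": 0.5})
    filer.file("imaging budget memo")
    filer.set_context({"hypertext": 1.0, "memex": 0.8})
    filer.file("memex notes")
    filer.set_context({"budget": 1.0})
    print(filer.candidates())
```

Note that the user never indexes anything by hand in this sketch: descriptors arrive at filing time from the context and are refined each time an object is retrieved.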
Hypertext
Hypertext and hypermedia systems are another important technology for the
organization and retrieval of information. The basic idea behind a hypertext
system is that things that we traditionally think of as one object, such
as a memo, a mail message, a drawing, a phone number entry, or a note,
can be incorporated into a larger structure by connections called "links".
Links are pointers from one object to another (or from one location in
one object to a particular location in another). For example, the traditional
cross-reference "see also Chapter xx" is a link.
There is great utility in the ability to set links between objects.
These links in themselves can convey information by conveying relationships.
In a filing context, links can help to facilitate access to related information
-- information that might be overlooked because in conventional filing
systems it remains unseen. Structures of links can be used to organize
the management of information, e.g., I could send you "all the information
I have on a particular topic" by asking the mailer to send all the objects
linked to a particular starting node (or perhaps I only have to mail you
the link).
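Gathering "all the information I have on a particular topic" amounts to walking the links outward from a starting node. The Python sketch below shows one way this could work, assuming links are stored as labeled pointers between object identifiers; the class `LinkStore`, its methods, and the sample identifiers are hypothetical and do not describe any existing hypertext system.

```python
# Minimal sketch of a link store: objects are identified by name, links are
# labeled pointers, and a breadth-first walk collects everything reachable
# from a starting node.
from collections import deque


class LinkStore:
    def __init__(self):
        self.links = {}  # object id -> list of (label, target id)

    def add_link(self, source, target, label="see also"):
        self.links.setdefault(source, []).append((label, target))
        self.links.setdefault(target, [])  # make sure the target is known

    def reachable(self, start, max_hops=None):
        """Return start plus every object linked from it, nearest first."""
        seen = {start}
        order = [start]
        queue = deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            if max_hops is not None and hops >= max_hops:
                continue  # do not follow links beyond the hop limit
            for _label, target in self.links.get(node, []):
                if target not in seen:  # guards against cycles of links
                    seen.add(target)
                    order.append(target)
                    queue.append((target, hops + 1))
        return order


if __name__ == "__main__":
    store = LinkStore()
    store.add_link("topic: imaging", "memo 17")
    store.add_link("topic: imaging", "drawing 3", label="illustrates")
    store.add_link("memo 17", "phone: vendor")
    store.add_link("phone: vendor", "topic: imaging")
    print(store.reachable("topic: imaging"))
```

A mailer could send the objects in the returned list, or, where sender and recipient share the store, send only the starting node and let the recipient follow the links.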
Traditional links must be followed by hand, i.e., the computer does
not manage or follow them for you. But a hypertext system is essentially
a system for managing links between otherwise familiar objects (ASCII text,
bitmap image, graphics metafile, etc.).
Digital is beginning to develop a hypermedia system (hypermedia is multimedia
hypertext, i.e., the objects are not confined to be textual objects). This
effort, still in Phase 0, is the Memex project in Valbonne.
The "Information Lens"
Probably the best work I've seen in developing helpful mail interfaces
is the "Information Lens" work at MIT, led by Tom Malone. (A paper on this
appears in the May "Communications of the ACM".) This work shows how, through
a simple structuring of messages, the mail system can in turn derive more
information from the messages, which can then be used to sort the mail
to users and to particular in-folders. The structuring of message types
is hierarchical, as in object-oriented programming classes, with inheritance.
The sorting is rule-based -- similar to a lot of expert systems programming.
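The combination of hierarchical message types and rule-based sorting can be sketched as follows. This Python is an illustration only: the message types, their fields, the rules, and the folder names are all hypothetical and are not taken from the Information Lens itself.

```python
# Minimal sketch of structured messages with inherited fields, sorted into
# folders by an ordered list of rules; the first rule that matches wins.
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str
    body: str = ""


@dataclass
class Request(Message):
    deadline: str = ""


@dataclass
class MeetingAnnouncement(Message):
    date: str = ""
    place: str = ""


@dataclass
class SeminarAnnouncement(MeetingAnnouncement):  # inherits date and place
    speaker: str = ""


# Each rule pairs a test on the structured message with a destination folder.
RULES = [
    (lambda m: isinstance(m, SeminarAnnouncement)
     and "retrieval" in m.subject.lower(), "seminars to attend"),
    (lambda m: isinstance(m, MeetingAnnouncement), "calendar"),
    (lambda m: isinstance(m, Request) and bool(m.deadline), "urgent"),
]


def sort_message(message, rules=RULES, default="inbox"):
    """Return the folder named by the first matching rule, else the default."""
    for condition, folder in rules:
        if condition(message):
            return folder
    return default


if __name__ == "__main__":
    talk = SeminarAnnouncement(sender="CRL", subject="Text Retrieval Seminar",
                               date="Friday", speaker="(to be announced)")
    print(sort_message(talk))
```

Because a seminar announcement is also a meeting announcement, a rule written for the general type applies to the specialized type as well; this is the benefit of making the type hierarchy carry inheritance.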
The Information Lens is primarily concerned with coping with information
overload in rich mail environments, rather than retrieval from long-term
storage, but the same techniques could be applicable there. The ability
that the Information Lens gives to broadcast a message and allow intelligent
filtering by prospective recipients is essentially an intelligent, although
short-term, filing system.
Supporting Technologies
Digital's networking technology will play a key role in enabling us to
offer far better support to group and, ultimately, corporate libraries
than the competition. The problems of managing files -- and retrieval --
only become more acute when they are shared by multiple workers. We could
make a significant advance in the support of collaborative work through
the support of group information systems.
Digital's emergent imaging technology will be vital to the success of
any information management products. Practical personal and group information
systems MUST support images. Too much of the world's information is in
the form of pictures, or is in a form that can be most easily captured
by aiming a camera at it, to try to offer a "desktop library" system without
integral image support. (Lest there be any doubt about this, just look
at the HyperCard demonstration "stackware" -- it contains a lot of images.)
Digital now is developing imaging technology that can support such uses;
and early exploration of such uses will do much to verify and strengthen
the design of Digital's imaging architectures.
A PROPOSAL
I propose that Digital undertake a major program to understand, develop,
and apply technologies for the organization, management, and retrieval
of information in group and personal libraries.
Research And Investigation
The first step in this program would be to establish a solid foundation
in the relevant technologies. This would include doing a thorough literature
search to learn from others and avoid re-inventing any wheels. It would
also be necessary to gather together a small team with expertise in this
area to provide architectural guidance and to conduct applied research
into structures, user interfaces, and applications. It is important to
realize that this program is chiefly a pulling-together of technologies,
as opposed to major development of technologies.
Some of the questions to which we need answers are:
- Is there a paradigm that unifies descriptive indexing and hypertext?
They both are techniques that "associate" information objects; can they
be combined into a single user interface in a way that is approachable
for casual use and yet sufficiently powerful?
- Is there a role for the current state of the art in natural language
processing? It is obvious that "perfect" natural language understanding
would be wonderful for information retrieval, but can today's less-than-perfect
understanding yield useful results?
- How should information retrieval mechanisms be incorporated into computing
systems? Should they be layered as separate capabilities, or should they
be integrated into, for example, all file access mechanisms? Can we design
an interface such that a "personal librarian" is always waiting in the
wings, but is totally unobtrusive except when summoned? How can the characteristics
of windowed workstations be exploited in the user interface to these functions?
- How should information spaces and retrieval systems handle multiple users?
Individuals have unique vocabularies, for example. A filing system, especially
an implicitly indexed one like the MCC Memory Extender, tailors itself
to the terms and combinations of terms used by the individual. How can
two or more persons share such a filing space? Should the system maintain
separate indexes? How should externally-obtained information spaces and
indexes be incorporated into the access structures?
- How well can we distribute information spaces over a network? What are
the response time constraints, and what limits do they dictate for network
size? Are special algorithms required to maintain the distributed structures?
- How well do structured message techniques, as in the "Information Lens",
work? How could they be expanded to cover general filing as well as messaging?
- How can images be indexed for retrieval? Is any form of automatic matching
possible and useful?
In addition to the above questions, we will need answers to the "traditional"
information retrieval questions such as indexing method, matching and ranking
algorithms, generation and use of thesauruses, search specification, etc.
We would probably do well to draw upon the best available research in this
area before contemplating new investigations.
Prototyping And Building Simple Products
The second step in this program -- one which actually could be conducted
concurrently with the first effort -- is to take practical steps to apply
existing technology, including products from other vendors, to information
management problems. Some of these would be done only as prototypes; but
others might be produced as low-volume products for niche markets.
For example, the Apple "HyperCard" system could be used, in conjunction
with VAX-based servers, to explore the application of rich structures to
the organization of mail, notes, and files. Such an effort could lead to
attractive PC integration products as well as add to our understanding
of the total problem.
As the Memex Hyperinformation management tools become available, we
should immediately apply their power to the management of personal and
group information.
Another practical learning step would be to explore the integration
of imaging technology into such prototypes and products.
A Comprehensive Solution
The third step in the program is to define and build upon an architecture
for personal, group, and corporate information management. This will build
upon the investigation, research, prototyping, and early products coming
from the first two steps. It will have to be integrated with corporate
architectures for mail, conferencing, imaging, compound documents, "hyperinformation",
file storage, and user interfaces. In fact, it will be a kind of showcase
for these technologies, as well as the underlying corporate networking
and system technologies. And it will serve an important need of managers,
professionals, and office workers of many types.
ORGANIZATION
The immediate work, the work of the first two steps of this program, should
be carried out in our research organizations and in various advanced development
groups throughout the company. The Cambridge Research Laboratory (CRL)
should certainly be involved in the research questions. This could, in
fact, form the charter of a research team in CRL.
The prototyping could be done by CRL, but should also involve A/D groups
connected with the supporting technologies, of which there are many. Imaging-related
prototyping, for example, might suitably be run from the Quantum lab in
Maynard or the Southwest Engineering lab in Albuquerque. Prototyping based
upon Macintosh HyperCard as a front-end would be appropriate for the MSD
A/D lab in Merrimack. The CASE team in Valbonne would be a natural home
for Memex-based prototyping. The Colorado Springs group that developed
STARS could be the site of further investigation into text retrieval.
In any event, concrete steps must be taken to ensure successful technology
transfer from the lab to prototypes and from prototypes to the development
groups. The research team must work in close cooperation with the A/D teams.
Personnel rotation between the lab, A/D groups, and product groups should
be encouraged.
It is too early to determine in which organization the ultimate product
responsibility should lie. Any products resulting from this effort would
be applicable in the broadest possible range of workstation applications.
And they would not be evolutionary developments of current product families.
It would be inappropriate to associate this program with one product group
at this time.
COSTS AND SCHEDULING
A comprehensive research effort of the type described above, focused on
providing a firm foundation for a class of products later on, could be
conducted by a team of 3 researchers. I would expect usable results in
about one year's time, with the bulk of the topics mentioned above investigated
within 3 years. Of course, I am assuming a reasonable amount of support
personnel and capital equipment (as would be typical for research teams
at CRL), and the assignment of researchers with appropriate backgrounds
for this task.
The staffing of prototyping projects will vary according to the scope
of the project, but I'd expect that most if not all would be one man-year
efforts. For example, adapting Macintosh HyperCard to access and display
data bases resident on VAX systems would take two people about one-half
of a year.
EPILOGUE
From the "Distributed Systems Handbook", March 1978:
"Digital Equipment Corporation is proud to have been part, along with
our customers, of many of the most innovative and foresighted uses of computers.
From the beginning we have believed that computers should be tools that
can be used by people who need information to do their jobs. We have promoted
the design of interactive computer systems that can be placed where they
are needed. We see the trend toward the increased use of interactive, distributed
computer systems as confirmation of our basic philosophy." - Kenneth H.
Olsen