Desktop Libraries: Personal and Group Information Management

Bob Fleischer
14 October 1987

SUMMARY

Professional, managerial, and most white collar workers maintain personal and group collections of information. I propose that Digital undertake a program to understand, develop, and apply technologies for the organization, management, and retrieval of information in group and personal libraries.

This proposal includes a three-person, three-year program of research, plus several smaller advanced development and prototyping projects, all focused on building a base for information organization and retrieval products.

The "desktop library" is a collection of files, notes, correspondence, images, and other information managed by computer. This management includes not only storage but also services and structures which support the organization of that information. The ultimate goal of the "desktop library" is to facilitate retrieval from possibly large, complex, and shared information spaces.

With appropriate computer support, the retrieval function can be served far better than it ever is manually. The available technologies, and corresponding internal developments, are described in this paper. Digital is in a uniquely strong position to serve the information needs of knowledge workers. Through this program these technologies may be investigated and, in stages, incorporated into a comprehensive information management system.
 

THE OPPORTUNITY

Every manager and professional, and nearly every white collar worker, gathers and maintains collections of information either for their own use, or for use by their group. This collection of information, which I will call the "personal library" (or "group library") for lack of a better term, consists of many kinds of information-bearing objects.

Among these kinds of information are:

Traditionally this information has been kept in file cabinets, on shelves, and in index card boxes (and in piles on desks!), but increasingly this information can be generated and maintained in electronic form. And the introduction of the computer to the desktop has introduced additional kinds of information to be maintained, e.g., spreadsheet files and application programs.

Some of this information is generated by the individual, but most of it is generated by colleagues, other parts of the organization, or by external sources. And in many cases the generation and maintenance of these objects is not a personal task, but actually a collaborative task.

The desktop computer has done a lot to allow the storing of many of these kinds of information. And first the shared computer, and now the networked desktop, has done a lot to facilitate the sharing of information objects among colleagues. But today's computer systems still do very little to support the organization of, and access to, this information.

(With computer-based files, owing to the regular increase of storage densities, the number of files managed by a user may increase by an order of magnitude in a few years' time without any corresponding increase in cost or physical space! The system gets "logically" unmanageable even though it never gets physically out of hand.)

 

Emphasis On Retrieval

It must be recognized that the principal objective in organization of information is to facilitate its subsequent use, i.e., retrieval. When we use the term "filing" system, or "file cabinet", we are using a term that stresses the storage function. It might be more useful to think in terms of "retrieval" systems, for that is the function that ultimately serves the user's ends.

There are traditional approaches to organizing information collections for retrieval, but they are very labor-intensive. The traditional approach for organizing large collections of formal documents is the library. A librarian performs the maintenance tasks of storing, indexing, and pruning the collection. A librarian also performs the very important task of advising users about how to conduct their searches.

The smaller collections of information rarely get a dedicated librarian. A group's or manager's active files, for example, will be maintained by secretarial staff as only one of many tasks. Indexing, other than the selection of file drawers and folders, is spotty if done at all. Retrieval often depends upon the fact that the person who made the file and put the document into it is also there to retrieve it; and often it depends upon the fact that the collection is not too large for exhaustive search.

Individual collections are usually poorly maintained. Many people tend to regard their personal files as a kind of "black hole", into which things go but never come out.

The computer technology we are developing today will allow more and more of the information that an individual, group, or corporation uses to be computer-based. But today's computer-based filing systems are even more primitive in terms of their basic capabilities than the rather simple facilities of a library. They do almost nothing to aid retrieval other than allowing the naming of the storage locations. They could, and should, do much more. As the size of computer storage systems grows, the situation only gets worse.

It is my belief that a major new opportunity for computer software and associated systems lies in the management of group and personal information collections. The "desktop library" could be as significant as "desktop publishing" or the introduction of the spreadsheet. Some of what should be done is so fundamental to the task of the professional and "knowledge-worker" that it rightly belongs in the realm of system and network services rather than "applications". As such, these capabilities should be integrated into Digital's plans and offerings for general-purpose workstations.

 

AVAILABLE TECHNOLOGIES

There are several available technologies which could be brought together to address this problem area. Some of these technologies have been available for many years, but only now is the need great enough and the power of economical systems high enough to support their commercial application.
 

Text Retrieval

Document indexing systems allow multiple descriptors, typically keywords and key-phrases, to be associated with objects in storage. These descriptive elements can be used singly or in combination to specify selection of a single document, or a range of documents, for further consideration. Techniques for maintaining and applying thesauruses are well known and help to allow the matching of synonymous terms. In addition, spelling errors can be compensated for by special matching algorithms based on known error characteristics. Phonetically-based algorithms can help to match terms with similar sounds, a function that is especially important when trying to match proper names.
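
As a minimal sketch of these ideas, the following Python fragment associates descriptors with stored objects, maps synonyms onto a preferred term through a small thesaurus, and matches proper names phonetically with the classic Soundex code. The thesaurus entries, the DescriptorIndex class, and the object names are hypothetical, chosen only for illustration; matching based on known spelling-error characteristics is not shown.

```python
from collections import defaultdict

# Hypothetical thesaurus: each variant term maps to one preferred term.
THESAURUS = {"memo": "memorandum", "note": "memorandum", "invoice": "bill"}

# Classic Soundex digit for each consonant; vowels, h, w, and y have none.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(name):
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


class DescriptorIndex:
    """Illustrative index from descriptors to the objects they describe."""

    def __init__(self):
        self.by_term = defaultdict(set)   # preferred term -> object names
        self.by_sound = defaultdict(set)  # Soundex code -> object names

    def add(self, obj, descriptors):
        for term in descriptors:
            term = term.lower()
            self.by_term[THESAURUS.get(term, term)].add(obj)
            self.by_sound[soundex(term)].add(obj)

    def lookup(self, term, phonetic=False):
        """Objects described by one term, after thesaurus normalization."""
        term = term.lower()
        found = set(self.by_term.get(THESAURUS.get(term, term), set()))
        if phonetic:
            found |= self.by_sound.get(soundex(term), set())
        return found

    def select(self, *terms, phonetic=False):
        """Objects matching every one of the given terms."""
        matches = [self.lookup(t, phonetic) for t in terms]
        return set.intersection(*matches) if matches else set()


index = DescriptorIndex()
index.add("budget.txt", ["memo", "Fleischer", "budget"])
index.add("trip.txt", ["note", "travel"])

print(index.lookup("memorandum"))                # both objects, via the thesaurus
print(index.lookup("Fleisher", phonetic=True))   # misspelled name still matches
print(index.select("memorandum", "budget"))      # descriptors used in combination
```

Keeping the thesaurus and the phonetic table separate from the index itself means that either can be replaced without re-filing the objects' original descriptors, although the index would have to be rebuilt.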

Indexing systems can be classified along several orthogonal dimensions. Index terms can be derived either directly from the content of the text object, or from external sources. All significant words of the document could be indexed (full-text indexing), or only keywords. Keywords and descriptors can be obtained directly from the human user, algorithmically generated from the content of the object, or implied from the context in which the object is created or used.
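
The dimensions above can be illustrated with a short Python sketch that derives index terms three ways: from every significant word (full-text indexing), from algorithmically generated keywords, and from terms supplied by the user or implied by context. The stop list, the frequency-based keyword rule, and all function names are assumptions made for this example; practical systems use more careful methods.

```python
import re
from collections import Counter

# Hypothetical stop list of words too common to be useful index terms.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def significant_words(text):
    """Lower-cased words of the text, with stop words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def full_text_terms(text):
    """Full-text indexing: every significant word becomes an index term."""
    return set(significant_words(text))


def generated_keywords(text, count=3):
    """Keyword indexing: the most frequent significant words.

    Ties are broken alphabetically so the result is repeatable.
    """
    counts = Counter(significant_words(text))
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:count]


def index_terms(text, user_keywords=(), context=(), count=3):
    """Combine generated keywords with user-supplied and contextual terms."""
    terms = set(generated_keywords(text, count))
    terms |= {k.lower() for k in user_keywords}
    terms |= {c.lower() for c in context}
    return terms


text = "The budget for the imaging lab. Imaging budget is due in March."
print(full_text_terms(text))
print(generated_keywords(text, 2))
print(index_terms(text, user_keywords=["FY88"], context=["Quantum"], count=2))
```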

Most commercial development of text retrieval technology has centered around the problems of very large reference data bases, typified by the Medlars system for medical data or the Lexis system for legal information. The emphasis of the "desktop library" program, however, should be to support the smaller collections that groups and individuals generate in their day-to-day operation. The practical constraints on these data bases will be different from those for large systems. In particular, the maintenance of these collections must not require dedicated support personnel, nor must the collections impose significant overhead on the users.

STARS - The Information Services Group (ISG) at the Customer Support Center, Colorado Springs (CSC/CS) has developed an application that they use for doing full text searches of databases. The databases they have created contain reference information and solutions to problems they have encountered and solved through the telephone support business. These databases are shared world wide.

The full text application is known as STorage And Retrieval System (STARS). Currently STARS does full text searches and also has a keyword capability. STARS was developed with emphasis on speed in adding information to the databases and in retrieving related information.
 

Implicit Descriptive (Context) Indexing

An interesting addition to the technology for using indexed storage came out of a small prototyping effort at MCC. Bill Jones developed a personal filing system called the Memory Extender (ME), a continuation of work he began at Bell Labs. The ME system considers a set of descriptors to be a "context" for further operations on the filing system. When objects are created their initial descriptors can be taken from the current context, thus eliminating the burden of manual indexing for every object. In addition, when an object is retrieved into a context, its descriptors, which are weighted, are updated in light of the receiving context.

The context, which at any time can be manually updated, stored, or replaced by another context, also sets the initial conditions for retrieval operations. Retrieval in this windowing-based system is accomplished by using the context to develop a ranked list of objects in the filing system, ranked in order of similarity to the context. The user then selects the object from that list.
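
The context mechanism can be sketched in Python as follows. This is only an illustration of the idea, not the ME implementation: the cosine similarity measure, the additive update rule with its learning_rate parameter, and the ContextFiler class are all assumptions introduced here.

```python
import math


def similarity(a, b):
    """Cosine similarity between two {descriptor: weight} dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ContextFiler:
    """Illustrative filing system driven by a current context."""

    def __init__(self, learning_rate=0.25):
        self.objects = {}   # object name -> {descriptor: weight}
        self.context = {}   # current context: {descriptor: weight}
        self.learning_rate = learning_rate  # assumed strength of the update

    def set_context(self, descriptors):
        """Replace the current context (a manual update by the user)."""
        self.context = dict(descriptors)

    def create(self, name):
        """File a new object; its initial descriptors come from the context."""
        self.objects[name] = dict(self.context)

    def rank(self):
        """Object names, most similar to the current context first."""
        scored = [(similarity(self.context, d), name)
                  for name, d in self.objects.items()]
        return [name for _score, name in
                sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

    def retrieve(self, name):
        """Fetch an object and shift its weights toward the receiving context."""
        descriptors = self.objects[name]
        for term, weight in self.context.items():
            descriptors[term] = (descriptors.get(term, 0.0)
                                 + self.learning_rate * weight)
        return name


filer = ContextFiler()
filer.set_context({"budget": 1.0, "1988": 1.0})
filer.create("budget-memo")            # indexed with no manual effort
filer.set_context({"travel": 1.0, "paris": 1.0})
filer.create("trip-report")

filer.set_context({"budget": 1.0, "travel": 0.5})
print(filer.rank())                    # the user picks from this ranked list
```

Calling filer.retrieve("trip-report") in the last context would add a small "budget" weight to that object, so that it begins to surface in later budget-related retrievals.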
 

Hypertext

Hypertext and hypermedia systems are another important technology for the organization and retrieval of information. The basic idea behind a hypertext system is that things that we traditionally think of as one object, such as a memo, a mail message, a drawing, a phone number entry, or a note, can be incorporated into a larger structure by connections called "links". Links are pointers from one object to another (or from one location in one object to a particular location in another). For example, the traditional cross-reference "see also Chapter xx" is a link.

There is great utility in the ability to set links between objects. These links in themselves can convey information by conveying relationships. In a filing context, links can help to facilitate access to related information -- information that might be overlooked because in conventional filing systems it remains unseen. Structures of links can be used to organize the management of information, e.g., I could send you "all the information I have on a particular topic" by asking the mailer to send all the objects linked to a particular starting node (or perhaps I only have to mail you the link).

Traditional links must be followed by hand, i.e., the computer does not manage or follow them for you. But a hypertext system is essentially a system for managing links between otherwise familiar objects (ASCII text, bitmap image, graphics metafile, etc.).
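
A minimal Python sketch of such link management follows. It records labeled links between named objects and collects everything reachable from a starting node, as in the "all the information I have on a particular topic" example. The LinkStore class and the object names are hypothetical, and links are kept at the level of whole objects rather than locations within them.

```python
from collections import defaultdict, deque


class LinkStore:
    """Illustrative store of labeled, directed links between objects."""

    def __init__(self):
        self.links = defaultdict(list)  # source object -> [(label, target)]

    def link(self, source, target, label="see also"):
        """Record a link from one object to another."""
        self.links[source].append((label, target))

    def reachable(self, start):
        """Objects reachable from start by following links, breadth first.

        Each object appears once, so cycles of links are harmless.
        """
        seen = {start}
        order = [start]
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for _label, target in self.links.get(node, []):
                if target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


store = LinkStore()
store.link("project-memo", "budget-sheet")
store.link("project-memo", "site-drawing", label="illustrated by")
store.link("budget-sheet", "vendor-phone-entry")
store.link("vendor-phone-entry", "project-memo")   # a cycle back to the start

# Everything a mailer would send for "all I have on this project".
print(store.reachable("project-memo"))
```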

Digital is beginning to develop a hypermedia system (hypermedia is multimedia hypertext, i.e., the objects are not confined to be textual objects). This effort, still in Phase 0, is the Memex project in Valbonne.
 

The "Information Lens"

Probably the best work I've seen in developing helpful mail interfaces is the "Information Lens" work at MIT, led by Tom Malone. (A paper on this appears in the May "Communications of the ACM".) This work shows how, through a simple structuring of messages, the mail system can in turn derive more information from the messages, which can then be used to sort the mail to users and to particular in-folders. The structuring of message types is hierarchical, as in object-oriented programming classes, with inheritance. The sorting is rule-based -- similar to a lot of expert systems programming.

The Information Lens is primarily concerned with coping with information overload in rich mail environments, rather than retrieval from long-term storage, but the same techniques could be applicable there. The ability that the Information Lens gives to broadcast a message and allow intelligent filtering by prospective recipients is essentially an intelligent, although short-term, filing system.
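
The combination of hierarchical message types and rule-based sorting can be sketched in Python as below. This is not the Information Lens design itself: the message types, their fields, the folder names, and the first-match rule ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    sender: str
    subject: str


@dataclass
class Notice(Message):            # a Notice inherits the Message fields
    topic: str = ""


@dataclass
class MeetingAnnouncement(Notice):  # and adds one of its own
    day: str = ""


@dataclass
class Rule:
    message_type: type                  # rule applies to this type and subtypes
    test: Callable[[Message], bool]     # condition on the message's fields
    folder: str                         # destination when the rule fires


def sort_message(message, rules, default="inbox"):
    """Return the folder of the first rule whose type and test both match."""
    for rule in rules:
        if isinstance(message, rule.message_type) and rule.test(message):
            return rule.folder
    return default


# Hypothetical rules for one recipient, most specific first.
RULES = [
    Rule(MeetingAnnouncement, lambda m: m.day == "Friday", "friday-meetings"),
    Rule(Notice, lambda m: m.topic == "imaging", "imaging"),
    Rule(Message, lambda m: m.sender == "boss", "urgent"),
]

review = MeetingAnnouncement("ann", "Design review", "imaging", "Monday")
print(sort_message(review, RULES))   # caught by the rule written for Notice
```

Because a rule written for a general type also applies to its subtypes, a recipient needs only a few rules to filter a broadcast message without the sender knowing who will keep it.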

Supporting Technologies

Digital's networking technology will play a key role in enabling us to offer far better support to group and, ultimately, corporate libraries than the competition. The problems of managing files -- and retrieval -- only become more acute when they are shared by multiple workers. We could make a significant advance in the support of collaborative work through the support of group information systems.

Digital's emergent imaging technology will be vital to the success of any information management products. Practical personal and group information systems MUST support images. Too much of the world's information is in the form of pictures, or is in a form that can be most easily captured by aiming a camera at it, for us to offer a "desktop library" system without integral image support. (Lest there be any doubt about this, just look at the HyperCard demonstration "stackware" -- it contains a lot of images.) Digital is now developing imaging technology that can support such uses; and early exploration of such uses will do much to verify and strengthen the design of Digital's imaging architectures.
 

A PROPOSAL

I propose that Digital undertake a major program to understand, develop, and apply technologies for the organization, management, and retrieval of information in group and personal libraries.
 

Research And Investigation

The first step in this program would be to establish a solid foundation in the relevant technologies. This would include doing a thorough literature search to learn from others and avoid re-inventing any wheels. It would also be necessary to gather together a small team with expertise in this area to provide architectural guidance and to conduct applied research into structures, user interfaces, and applications. It is important to realize that this program is chiefly a pulling-together of technologies, as opposed to major development of technologies.

Some of the questions to which we need answers are:

In addition to the above questions, we will need answers to the "traditional" information retrieval questions such as indexing method, matching and ranking algorithms, generation and use of thesauruses, search specification, etc. We would probably do well to draw upon the best available research in this area before contemplating new investigations.
 

Prototyping And Building Simple Products

The second step in this program -- one which actually could be conducted concurrently with the first effort -- is to take practical steps to apply existing technology, including products from other vendors, to information management problems. Some of these would be done only as prototypes; but others might be produced as low-volume products for niche markets.

For example, the Apple "HyperCard" system could be used, in conjunction with VAX-based servers, to explore the application of rich structures to the organization of mail, notes, and files. Such an effort could lead to attractive PC integration products as well as add to our understanding of the total problem.

As the Memex Hyperinformation management tools become available, we should immediately apply their power to the management of personal and group information.

Another practical learning step would be to explore the integration of imaging technology into such prototypes and products.
 

A Comprehensive Solution

The third step in the program is to define and build upon an architecture for personal, group, and corporate information management. This will build upon the investigation, research, prototyping, and early products coming from the first two steps. It will have to be integrated with corporate architectures for mail, conferencing, imaging, compound documents, "hyperinformation", file storage, and user interfaces. In fact it will be a kind of a showcase for these technologies, as well as the underlying corporate networking and system technologies. And it will serve an important need of managers, professionals, and office workers of many types.
 

ORGANIZATION

The immediate work, the work of the first two steps of this program, should be carried out in our research organizations and in various advanced development groups throughout the company. The Cambridge Research Laboratory (CRL) should certainly be involved in the research questions. This could, in fact, form the charter of a research team in CRL.

The prototyping could be done by CRL, but should also involve A/D groups connected with the supporting technologies, of which there are many. Imaging-related prototyping, for example, might suitably be run from the Quantum lab in Maynard or the Southwest Engineering lab in Albuquerque. Prototyping based upon Macintosh HyperCard as a front-end would be appropriate for the MSD A/D lab in Merrimack. The CASE team in Valbonne would be a natural home for Memex-based prototyping. The Colorado Springs group that developed STARS could be the site of further investigation into text retrieval.

In any event, concrete steps must be taken to ensure successful technology transfer from the lab to prototypes and from prototypes to the development groups. The research team must work in close cooperation with the A/D teams. Personnel rotation between the lab, A/D groups, and product groups should be encouraged.

It is too early to determine in which organization the ultimate product responsibility should lie. Any products resulting from this effort would be applicable in the broadest possible range of workstation applications. And they would not be evolutionary developments of current product families. It would be inappropriate to associate this program with one product group at this time.
 

COSTS AND SCHEDULING

A comprehensive research effort of the type described above, focused on providing a firm foundation for a class of products later on, could be conducted by a team of 3 researchers. I would expect usable results in about one year's time, with the bulk of the topics mentioned above investigated within 3 years. Of course, I am assuming a reasonable amount of support personnel and capital equipment (as would be typical for research teams at CRL), and the assignment of researchers with appropriate backgrounds for this task.

The staffing of prototyping projects will vary according to the scope of the project, but I'd expect that most if not all would be one man-year efforts. For example, adapting Macintosh HyperCard to access and display data bases resident on VAX systems would take two people about one-half of a year.
 

EPILOGUE