Joint ICSU Press/UNESCO Expert Conference on ELECTRONIC PUBLISHING IN SCIENCE

UNESCO, Paris, 19-23 February 1996

ISSUES AFFECTING THE DEVELOPMENT OF
DIGITAL LIBRARIES IN SCIENCE AND TECHNOLOGY

By:
Robert Wedgeworth, University Librarian and Professor of Library Administration
William H. Mischo, Engineering Librarian and Professor of Library Administration
University of Illinois at Urbana-Champaign, IL U.S.A


Table of Contents


Abstract

This paper will examine some of the technical and policy issues associated with electronic journal systems currently under investigation in the 4 year, $4 million, NSF- sponsored Digital Library Initiative(DLI)project at the University of Illinois at Urbana- Champaign. It will describe the project testbed in terms of the nature and types of materials used, and will outline research on the economics, technologies and the behaviors of the researchers using the testbed. Preliminary technical results focus on the efficacy of the Standard Generalized Markup Language(SGML) as a mechanism for indexing and retrieval and the Open Text Index as a search engine for the indexing and retrieval of DLI documents.

Return to the Table of Contents


INTRODUCTION

The difference between the promise and the reality of scholarly journals in electronic formats has been the subject of much discussion and conjecture, particularly the movement of scientific journal literature from primarily a print medium to an electronic environment(Okerson ARL 1995). Indeed, there is the general assumption that the scholarly literature (as distinct from the trade and popular literature) with its emphasis on the rapid dissemination of articles to be read by scholars in the field is most likely to undergo early conversion to electronic and online formats (Lavagnino LITA 1996, Okerson ARL 1993). While the demise of the printed page has been discussed since Licklider at LC and Project Intrex at MIT in the 1960's and more recently by Lancaster, the last several years have witnessed an accelerated frenzy of activity in the area of electronic publishing. Rapid development and introduction of wide-area networking and digital depository technologies (including the World-Wide Web and the Internet) have shown the potential for dramatic change in the way scientific information is transmitted and received. Growing dissatisfaction with the pricing policies of print journal publishers have stimulated much interest in exploring changes in the respective roles of authors, publishers, A& I services and libraries (Stoller C& RL 1996). Specifically, the processes for the production, review, access, distribution and display of scientific information are responding directly to changes in information technologies that suggest alternative models for the digital library of the future.

In an effort to understand better the technical, economic, legal, social and political issues surrounding these rapidly evolving technologies, a large number of electronic publishing demonstration projects--both commercial and research-oriented have been undertaken. Many more such projects are being developed. These projects (and other studies) are attempting to define and clarify the implementation issues associated with electronic publishing and full- text document storage, retrieval and dissemination.

What then, are the key barriers to the establishment of an economically viable and technologically effective electronic publishing system for scientific journal literature? How does our current publisher-centric model for the submittal, review, production and dissemination of scientific information translate into an electronic environment? What other models might serve the scholarly community better in terms of enhanced points of access and more timely availability?

Return to the Table of Contents


UIUC DLI Project Overview

This paper will examine some of the technical and non-technical issues associated with electronic journal systems in the context of work performed under a Digital Library Initiative (DLI) project underway at the University of Illinois at Urbana-Champaign (UIUC). The Illinois DLI project is designed to provide models for the electronic distribution and retrieval of scientific journal literature. This project is one of six federally funded Digital Library Initiative (DLI) grant projects jointly sponsored by three U.S. government funding bodies--the National Science Foundation (NSF), the National Aeronautics and Space Administration (NASA), and ARPA (Advanced Research Projects Agency). The other projects are centered at:

*Carnegie Mellon University (Pittsburgh), $4.8 million for the Informedia interactive online digital video library system in collaboration with WQED/Pittsburgh television station that will enable users to access, explore and retrieve science and mathematics materials from video archives.

*University of California, Berkeley, $4 million to produce a prototype digital library with a focus on environmental information.

*University of Michigan, $4 million to conduct coordinated research and development to create, operate, use and evaluate a testbed of a large-scale, continually evolving digital library with a content focus on earth and space sciences.

*University of California, Santa Barbara, $4 million for Project Alexandria that will develop a digital library providing easy access to large and diverse collections of maps, images and pictorial materials as well as a full range of electronic library services.

*Stanford University (Palo Alto, California), $3.6 million for the Stanford Integrated Digital Library Project that will develop the enabling technologies for a single, integrated “virtual” library that will provide uniform access to the large number of emerging networked information sources and collections--both online versions of pre-existing works and new works that will become available in the future.

The UIUC DLI project is a four year $4 million project focusing on building a large-scale testbed comprised of journal articles supplied to the project in SGML (Standard Generalized Markup Language) format by several professional society and commercial publishers. The DLI testbed effort is focusing on building a large-scale multi-journal, multi-publisher production testbed of scientific journal articles. This testbed is housed in the Grainger Engineering Library Information Center, a $30 million facility which opened in March 1994. Although the Grainger Library provides a full range of library and information services in the fields of engineering and physical sciences, it also serves as the research laboratory for the University Library with the primary charge of investigating emerging information technologies. Collaborative agreements have been reached with a number of major publishers in engineering and science to provide the Project with electronic copies of their materials prior to the articles reaching the print stage. Each electronic article contains the complete text, graphics, images, tables, and equations in SGML format, with publisher-defined document tags that delineate the document structure (e.g. title, author, section, figures, tables, equations, bibliography).

The testbed collection will, in the course of the four year project, grow to contain all of the articles published in approximately thirty major journals (from 1995 forward) in the areas of computer science, electrical engineering, physics, civil engineering, and aerospace engineering. There are approximately 3,000 articles presently available for searching and display to users at public terminals in the Grainger Library. The publisher partners supplying the Project with articles presently include: the IEEE Computer Society, the IEEE (Institute of Electrical and Electronic Engineers), the American Physical Society (APS), the American Institute of Physics (AIP), the American Society of Civil Engineers (ASCE), the American Society of Agricultural Engineers (ASAE), and the American Institute of Aeronautics and Astronautics (AIAA). Additional commitments have been obtained from several other professional societies, e.g. the American Association for Advancement of Science (AAAS - Science) and commercial publishers, e.g. John Wiley & Sons, to also supply us with articles in SGML format.

The UIUC DLI Project is studying the short-term and long-term technical and policy issues connected with the procurement, processing, indexing, searching, retrieval, transmission, and display of full-text scientific articles. While the project focus is on the computing and information infrastructures necessary for the system, the social and economic issues are also being examined. The primary goal of the project is to design and develop a model for providing effective access to a federated system of World Wide Web-based distributed document repositories that will contain the scientific journal literature in electronic format. In addition, the project is focusing on mechanisms to provide intelligent front-end and gateway functions to enable end-users to seamlessly access a host of heterogeneous information resources. Indeed, one of the primary requirements for a functioning digital library system is the ability to link and integrate a host of heterogeneous information resources, in all their multiple formats, which will include document surrogate records from online catalogs and A& I services, full-text repositories, Web and Internet resources, digital images, digital video, and a myriad of locally maintained databases.

The UIUC project is focusing on these issues within the context of the present publisher- centered article submission, editorial, review, and production environment. Our investigations are focusing on a publishing environment in which the major change in emphasis will be the conversion from a print marketing and distribution system to Internet-based organizational, retrieval, and distribution mechanisms for journal articles. However, our model of a federated system of distributed document depositories will apply in any publishing environment comprised of registered document depositories. The model uses SGML as the standard for open document retrieval and delivery by assuming that the full-text journal articles are made available in SGML format with a standard Document Type Definition (DTD) specifying the form and content of the SGML.

SGML is regarded as the standard for open document transmission and display. The far- reaching value of SGML lies in its capability to identify the fine-grained content and structure of a document. This allows sophisticated indexing and retrieval of documents, which is a necessity in a full-text retrieval environment. While SGML is becoming ubiquitous, it is still, for the most part, being generated by publishers as a byproduct rather than an integral part of their production process. For example, in many cases, we have been the first to actually display the SGML version of the published articles. While it is clear that SGML plays a major role in providing mechanisms for effective full-text retrieval, the Project is investigating the efficacy of SGML as a rendering and display tool, particularly in the scientific and engineering disciplines that rely heavily on mathematical equations. In these areas, current implementations of SGML viewers and renderers exhibit some shortcomings.

For the display of the journal articles, the DLI testbed team has been working with SoftQuad in the testing and evaluation of their Panorama SGML viewer. The Panorama viewer has been implemented in an Internet environment with Netscape Navigator using the CCI (Common Client Interface) capability. This configuration provides a means to access DLI testbed documents over the Internet using the HTTP (HyperText Transport Protocol) protocols with Panorama and a Web browser. Work continues with SoftQuad on the many configuration and style issues that arise when using Panorama for a heterogeneous collection of science-oriented documents. There have been some major concerns with Panorama’s ability to properly render mathematical equations and formulas. These have proven to be difficult problems and a number of solutions are being pursued.

System retrieval performance monitoring and end-user studies will be used to evaluate retrieval performance in the UIUC DLI testbed. Long-term research on the infrastructures of distributed information objects in what will become the Internet and World Wide Web of the future is also being conducted. The technological and evaluative information gleaned in these studies will be used to advance a model for a federated system of distributed document repositories containing the full-text of scientific articles. This model of a federated repository system will provide mechanisms for searching and fetching full-text documents. In this model, publishers will mount and register articles on servers at their local sites. These documents will be indexed and their meta-data extracted and mounted on index/meta-data databases for retrieval and display independent of the full text of the articles. These distributed index/meta-data databases may be maintained and accessed at the publisher’s server site, reflected to a central location, or associated with A& I services or other information resources. Libraries and/or other information brokers will then provide intelligent front-end gateway software that matches user interest criteria with the characteristics of certain index/meta-data databases. This front-end software will simultaneously search multiple meta-data/index databases and retrieve pointers to the distributed full-text document repositories. The result is a set of full-text articles that match the identified user search criteria.

Return to the Table of Contents


Preliminary Findings

As the DLI Project completes its second year, a number of important technical issues and concerns have been identified. At this stage the preliminary findings are limited to technical issues since the first stage of the Project requires the building of the testbed with only planning for the economic and sociological studies having taken place.

It is clear that SGML provides a powerful indexing and retrieval mechanism for full-text documents. In addition, the Open Text search engine has proven to be a robust retrieval engine for SGML documents. However, the complexities of the Open Text command language and the nuances of SGML have dictated the development of a user-friendly front- end to assist in search and navigation. Research on full-text search and retrieval clearly shows that sophisticated search techniques such as term proximity and search field limitations are necessary for effective retrieval. The richly coded SGML facilitates this capability. Yet, SGML has some serious limitations in its ability to express mathematics adequately. Also, currently available raw SGML renderers have difficulty displaying mathematical equations and formulas accurately. Many scientific publishers are accustomed to the use of TeX for mathematical equations. Our work to date suggests that a hybrid SGML/TeX display may prove advantageous. Other SGML unresolved SGML issues involve the accurate expression of diacritics such as umlauts.

SGML alone, is not a prescriptive code for describing a document comprehensively. Rather, it is a flexible template for marking the structure and content of a document based on publisher-defined DTDs that reflect publisher needs and conventions.. Therefore, the DTDs for individual publishers developed to describe their SGML are different--in some cases significantly different-- and this has a dramatic effect on indexing and retrieval. In order to provide consistent retrieval across separate DTDs, the Project has identified a set of meta-data elements which are extracted from the document, added to the document proper, or constructed from parts of SGML data. The meta-data elements are used for searching and short display. Also, tag normalization (converting tags and tag structures to a normalized form) provides consistency for searching across DTDs. These normalization activities argue strongly for a general agreement based on the ISO 12083 standard. Agreement by publishers to follow certain conventions for author names, author affiliations and bibliographical entries would also be a progressive step.

The World Wide Web environment has shown some serious shortcomings for the SGML- based publishing and retrieval. The Web HTTP server protocols do not make allowances for holding open database connections in order to transmit multiple search arguments or other complex iterative searches. The Web databases are significantly different from modern information retrieval systems that operate in a dynamic network environment with a stateful connection to a database. Other significant problems in Web sessions include difficulties in converting SGML to HTML for viewing. In the HTML environment almost all formulas and equations need to be converted to bit-mapped images. For scientific articles, there can significant delays in transmitting and displaying pages as the Web browser goes back to the HTTP server to get each individual image. However, we are aware that the several Web database, display and search/retrieval technologies are evolving rapidly and we anticipate advances in the state-of-the-art during the course of the Project.

Finally, our overall experience with the process of mounting a production full-text retrieval system indicates that it is a very labor intensive endeavor. Procuring, processing and mounting a large body of full-text documents places great demands on staff and computing facilities. The implications are daunting for individual publishers, libraries and other information service organizations who may attempt to develop their own full-text document repositories, given the requisite staff expertise and investments in computer hardware and software.

Return to the Table of Contents


Retrieval

The testbed team has expended a great deal of effort in evaluating the available SGML full-text database management and retrieval software packages. After careful study, the testbed team chose the Open Text Corporation’s Open Text Index search engine for indexing and accessing the DLI project documents. The Open TextM engine, originally developed at the University of Waterloo, is an extremely robust and expandable system that allows phrase, Boolean, and proximity searching and is tailored to SGML processing and retrieval.

To study access techniques and retrieval effectiveness in conjunction with these full-text journals, the testbed team has designed and implemented a prototype client-server system with a front-end client written in Visual Basic 3.0 for a Microsoft Windows environment operating over a set of Open Text databases. This custom front-end and server have been designed, from the onset, as a demonstration system to study full-text retrieval and explore functions that pose problems in a Web environment. The present limitations in Web search capability include the inability to maintain state or hold open the connection to the database, difficulties in dynamically updating forms, and bandwidth limitations that preclude dynamically updating word wheels and the like. However, in the past year, Web technology has advanced at a dizzying pace. Web browsers have begun to take on the characteristics of Internet operating systems, allowing multimedia, database, and desktop technologies to be embedded or enabled from within the browser. As these Web technologies continue to evolve, we can expect that the shortcomings inherent in database and retrieval applications will be addressed and resolved. It is clear that the Web will continue to evolve and grow and will play an increasingly important role in information applications.

In addition, the prototype DLI client is part of an overarching gateway client that provides access to remote and local information resources from a public workstation. Indeed, the integration of A& I service databases, online catalogs, locally mounted and remote periodical index databases, campus and Library maintained databases, and the full-text DLI project data is an integral part of the UIUC comprehensive digital library system. Users typically desire relevant information from multiple, sometimes disparate resources in what are often multiple formats. In particular, the linking of A& I service databases with full-text document stores is an area that demands attention and will be investigated in this project.

The ubiquity of the Internet combined with the rise of sophisticated Internet authoring and publishing tools will lead, it is anticipated, to a changing publishing paradigm for the science journal literature. The model put forth in this project anticipates the rise of a federated system of publisher document repositories that will contain the full-text of their journals, magazines, and newsletters along with value-added features that make use of the Internet’s dynamic linking capabilities. This is a natural evolutionary step from today’s Internet and World Wide Web environment. Indeed, numerous publishers are experimenting with Web document repositories and building prototype systems for test and evaluation. What is required technically for the vision of a ‘federated’ repository to be realized is an overarching set of retrieval standards and protocols and a willingness on the part of publishers to work together to promote these requisite standards. The importance of retrieval capabilities cannot be over- emphasized.

Return to the Table of Contents


Authentication

The authentication of a digital document becomes an issue of great importance in a wide-area networked document retrieval environment centered around the delivery of SGML objects as described in this paper. A user of this system who has retrieved the full-text (with associated images and mathematics) of a document could, if they so desired, modify the document and then pass this altered version on to another user. Indeed, there is a great emphasis in today’s applications suite software packages, and will be additionally so in tomorrow’s operating systems, to provide easy-to-apply work group tools and environments in which documents can be altered easily. This is a situation markedly different from today’s document copying environment in which photocopied articles are passed to colleagues around the room or around the county. Clearly, there needs to be put in place a system in which the recipient of a document can be certain that they have in their possession the unaltered, original version, or the author’s original version with clearly differentiated annotations from either the author or subsequent reviewers or readers.

There is a pressing need to establish an internationally agreed upon document authentication system that incorporates a unique digital signature for every original source document. This system, as currently envisioned, would support a document repository protocol in which each document would be assigned a digital checksum along with a public encryption key. The checksum and key would then be used against a designated authentication server on which a copy of the original document would reside. With this system in place, the recipient of any digital document would be able to verify the authenticity of the document and/or obtain an unaltered copy of the original document.

Return to the Table of Contents


Scalability

In order for a federated system of distributed document repositories with standardized retrieval techniques and access mechanisms to become reality, some major questions of scalablity must be resolved. To permit faster and more effective retrieval and display, enhancements to Internet and World Wide Web transport protocols need to be achieved. Bandwidths need to be increased to accommodate the transfer of large, high-quality images and full-motion video. Current work in data compression and advances in high-speed networking are likely to address these problems. Not as easy to solve are the problems connected with retrieving relevant documents from the multitude of heterogeneous full-text document repositories that will be maintained by a wide range of publishers, document centers, organizations, and commercial entities.

There is no single aggregate periodical index database, be it a public entity or commercial enterprise, that provides comprehensive access to the periodical literature in, for example, the physical sciences and engineering. We do not presently have, via the commercial Abstracting and Indexing (A& I) services, universal bibliographic control over the journal literature in the physical sciences and engineering. A comprehensive search for materials in these disciplines involves searching a number of A& I databases, including Chemical Abstracts, Compendex (Engineering Index), INSPEC, Current Contents, NTIS, Aerospace Database, Metadex, Energy Database, Ceramic Abstracts, etc. Adding the life sciences to the equation would necessitate additional searching of BIOSIS, Medline, Excerpta Medica, Psychlit, and others.

In the same manner, it is hard to conceive of any public or commercial entity generating and maintaining a comprehensive index to all of the full-text periodical literature, across all disciplines and for a period of many years, being put up on an Internet system such as the World Wide Web, or its successor. However, we have witnessed, in the last six months, the introduction of retrieval systems that are attempting to index all of the HTML documents on the WWW.

The Illinois DLI project is using the Open Text search engine as the basic database management system for the full-text SGML articles it is receiving from participating publishers. The Illinois project will eventually reach 100,000 articles from some 15 publishers. The Open Text software has clearly demonstrated, with its comprehensive Web Index, the ability to scale to the level necessary to accommodate a very large number of small full-text articles. However, an exhaustive index of SGML full-text articles is more complicated (and much larger) than an index of HTML documents.

Return to the Table of Contents


Affordability

The current publisher-centric scientific journal marketing and distribution system offers little flexibility in gaining access to its products. Many scientific communities, especially in the developing countries are effectively excluded from access to these materials due to high prices. Academic libraries, one of the principal subscribers to scientific journals, have experienced a steady decline in their ability to acquire substantial numbers of scientific journals. Studies of five midwestern U.S. research libraries, including UIUC, showed that between 1988-1992 these libraries canceled 13,021 titles across all disciplines. These cancellations represented a 5.7% decrease in the total titles owned . Yet these libraries paid 30.5% more for 5.7% fewer titles (Chrzastowski 1995). Although double digit increases in journal prices have been common in recent years, even at more modest price increases the long term trend is for academic libraries to own fewer and fewer journal subscriptions. Since scientific journals tend to be more expensive than other titles, they have been frequent targets for cancellation. Prospects for electronic journal distribution are not encouraging as prices for electronic publishing products are considerably more expensive than their print counterparts. The UIUC DLI will study the economics of a distributed model using both end-user payment schemes as well as multiple subscriber cooperative economic schemes.

Return to the Table of Contents


Accessibility

Understanding how digital library systems accommodate users, and how users behave while using large-scale digital library collections is a major focus of the UIUC DLI. Although sociological research is not an integral part of all of the DLI projects, the socio- technical aspects of the six DLI projects are so compelling that a coordinated DLI-wide User Research Working Group has been established (Bishop 1995). The UIUC sociological research is in its early planning stages, but contemplates studies of both groups of users and individuals. A range of methodologies will be used to evaluate the complexities of online digital library use and its effects on users. By the end of the project it is anticipated that as many as 100,000 users will have access to over 100,000 documents in the UIUC DLI testbed.

A frequently overlooked topic in the development of systems for distributing electronic files is the management of copyright restrictions. An essential metadata/database repository development will be those intended for rights management in order to facilitate rapid and authorized access to copyrighted materials. The proposed distributed model for access to scientific journals under development at UIUC assumes the existence of such meta- data/databases as the proposed Library of Congress Electronic Copyright Management System.

Return to the Table of Contents


Conclusions

A substantial amount of work remains to be done in order to facilitate the development of electronic publishing in science. Although we have reported here on one of six coordinated projects focused on the digitization of many different types of materials, there are many more research and development projects that are, or should be, addressing these problems. Based on the work completed to date in the UIUC DLI project it seems clear that the entire system for disseminating scientific information is in the process of dramatic and dynamic change. Some observations that we would be inclined to make at this time include:

*Development of industry-wide standards for the use of SGML.

The adoption of the ISO 12083 DTD standard for journal article should be facilitated. Additionally, a standard for document meta-data should be developed in a cooperative setting involving publishers, database vendors, librarians and other relevant parties.

*Improved software to enhance retrieval and display performance.

Software that allows for the simultaneous searching of multiple document repositories based on comprehensive studies of full text document retrieval systems needs to be developed. Studies of user behavior should also be the basis for the development of improved front-end gateway software intended for end-user searching of multiple, heterogeneous databases.

*Labor intensiveness of SGML coding may require more collaborative efforts to develop a large number of document repositories.

Since many individual publishers, libraries and other information providers will not have the requisite expertise and financial resources to engage in developing and maintaining their own SGML document retrieval servers there are likely to be opportunities for third party or consortial enterprises.

*Need for more robust display software for science literature.

The difficulties encountered in the SGML full-text display, particularly in rendering mathematical equations, formulas and diacritics cannot be overemphasized at this point. Software improvements in this area are imperative as are more standardized methods of coding for publishers.

*Internationally accessible rights management databases.

Rapid access to heterogeneous document databases many of which will contain copyrighted material focuses greater attention on the need for rights management databases more suited to the needs of science and education. The integrity of science will also dictate resolution to the difficult problem of authentication of documents in a dynamic, Web-based environment.

*Redefinition of the roles of authors, publishers, librarians and others.

The anticipation of the development of a federated worldwide system of document repositories described in this paper assumes major changes in the respective roles of publishers, libraries and other information service providers in science. Libraries, individually and in consortia, are likely to become heavily involved in serving as gateways to the document repositories for scientific information, for example.

This paper outlines a substantial agenda of issues and problems of major interest to the ICSU Press and UNESCO constituencies. Many of the developments needed in order to progress toward reliable and effective electronic publishing in science can be influenced significantly by the members of these organizations. However, we should bear in mind that most of the observations and conclusions presented here are tentative. As we move toward the beginning of the second half of the UIUC DLI Project, we look forward to sharing more definitive experiences and observations of real users performing real searches using the testbed which will continue under development.

Return to the Table of Contents

FIGURES


REFERENCES

Okerson, Ann S. and James J. O’Donnell, eds. Scholarly Journals at the Crossroads: A Subversive Proposal for Electronic Publishing. Association of Research Libraries, 1995.

Lavagnino, Merri Beth, comp. “The Life (and Death?) of Print: A Debate,” LITA Newsletter 17(1): 18-19 Winter 1996.

Okerson, Ann S., ed. Scholarly Publishing on the Electronic Networks, The New Generation: Visions and Opportunities in Not-For-Profit Publishing. Association of Research Libraries, 1993.

Stoller, Michael A., Robert Christopherson and Michael Miranda, “The Economics of Journal Pricing,” College & Research Libraries 57(1): 9-21 January 1996.

Chrzastowski, Tina E. and Karen A. Schmidt, “Collections at Risk: Revisiting Serial Cancellations in Academic Libraries,” College & Research Libraries (in prep.)

Bishop, Ann Peterson, “Working Towards an Understanding of Digital Library Use,” D-Lib Magazine (October 1995).

February 1996


Last updated : April 03 1996 Copyright © 1995-1996
ICSU Press and individual authors. All rights reserved

Listing By Title

Listing by Author's Name, in alphabetical order

Return to the ICSU Press/UNESCO Conference Programme Homepage


University of Illinois at Urbana-Champaign
The Library of the University of Illinois at Urbana-Champaign
Comments to: Tim Cole
08.11.98 RA