Manuscriptorium - Linguistic and Cultural Diversity in Cyberspace

Adolf KNOLL Secretary for Science, Research and International Cooperation,

National Library of the Czech Republic (Prague, Czech Republic)

has been demonstrated by not only institutions of our country but some foreign libraries as well. It is worth noting that the borders of the countries in Central Europe were subject to frequent changes, and many countries were part of the Holy Roman Empire, on whose territory one may observe the development of similar approaches to the creation of manuscripts and a broad cultural exchange. The territory of the Holy Roman Empire embraced, at different times, that of contemporary Germany, Austria, Czechia, Slovenia, Switzerland and Luxembourg, Northern Italy and Western Poland (Silesia).

Moreover, Western Christianity served as a medium for cultural similarity, which allows us to speak of the Western European manuscript tradition as opposed to the Slavic (Slavic manuscripts in the Cyrillic script), Greek and Jewish traditions. Over the centuries, many of these manuscripts have changed their location due to warfare and other events, which makes it especially important to unite them at least virtually within a unified digital library.

Additionally, European libraries may hold non-European manuscripts and old books. The National Library of the Czech Republic, for example, holds interesting Arabic, Persian, Ottoman and Indian manuscripts.

These are the reasons why some foreign libraries started joining Manuscriptorium. The first among them was the Wroclaw University Library in Poland. Aggregation from abroad developed within the European ENRICH project, which was led by our library in 2007–2009 and included 18 partners from several European countries. Manuscriptorium has grown since then thanks to subsequent European projects within the CULTURE programme and to bilateral initiatives.

Metadata Agenda

In 1995, we decided to introduce routine digitization of manuscripts. It immediately became clear that we needed to make the digital content as independent as possible of any specific software. At that time, there were few approaches in the world that could be considered reliable and comprehensive enough for our task. So, we decided to take our own route and design our own SGML-based approach. That was how a new language appeared. We called it DOBM and used it until 2002. Any approach of ours had to contain a document description and reflect the document structure. It had to represent each document by means of digital images, which we decided to supplement with some technical information in order to obtain a faithful rendering of the digital copy in a computer environment, so that each manuscript could be displayed in the form it had at the time of digitization. In 1999, we produced a CD presenting our approach to the digitization of various documents and demonstrated that it was possible to use this approach for processing audio records. Our approach was recommended by UNESCO as a model for the UNESCO Memory of the World Programme⁸³.

In 2002, we switched the platform to TEI. This became possible thanks to the development of a new approach to the electronic description of manuscripts.

This approach was a product of the European MASTER project, in which we also took part. To the so-called master.dtd⁸⁴ we added some structural elements and some elements for keeping technical information about digitization and about the digital images themselves. The extended standard was called masterx.dtd⁸⁵. It was used virtually as a national manuscript digitization standard until 2009, when it was replaced by enrich.dtd⁸⁶, a standard developed within the ENRICH project on the TEI P5 platform.

Unlike previous standards, enrich.dtd was created as a data exchange standard or, to be more precise, as a unified platform for the aggregation of digitized manuscripts and old printed books. Therefore, enrich.dtd supports only the encoding of the manuscript description and structure, including references to the images through which the structure is realized. To meet the needs of the ENRICH partners, this standard was also implemented as the internal standard of Manuscriptorium, i.e. not only as a data exchange format but also as a data aggregation format within a unified database. This became possible because, while designing the standard, we managed to meet the requirements of the two main approaches to manuscript description in the electronic environment: that of libraries and that of researchers.
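To make the idea of a record that carries both a description and a structure more concrete, here is a minimal sketch in Python. It assumes a heavily simplified TEI P5 document that uses the TEI elements msDesc, msIdentifier, facsimile, surface and graphic. The sample record, its identifier and its image URLs are invented for illustration, the record is not schema-valid, and real ENRICH records are far richer and may be organized differently.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified record: all values below are invented.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <msDesc>
          <msIdentifier>
            <settlement>Prague</settlement>
            <repository>Example Library</repository>
            <idno>MS-EXAMPLE-1</idno>
          </msIdentifier>
        </msDesc>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <facsimile>
    <surface n="1r"><graphic url="http://images.example.org/ms1/0001.jpg"/></surface>
    <surface n="1v"><graphic url="http://images.example.org/ms1/0002.jpg"/></surface>
  </facsimile>
</TEI>"""


def read_record(xml_text):
    """Return (description, pages) extracted from a simplified TEI record."""
    root = ET.fromstring(xml_text)

    # Description: where the manuscript is kept and under which shelfmark.
    ident = root.find(".//tei:msDesc/tei:msIdentifier", TEI_NS)
    description = {
        "settlement": ident.findtext("tei:settlement", namespaces=TEI_NS),
        "repository": ident.findtext("tei:repository", namespaces=TEI_NS),
        "idno": ident.findtext("tei:idno", namespaces=TEI_NS),
    }

    # Structure: ordered page labels, each pointing at an image file.
    pages = []
    for surface in root.findall(".//tei:facsimile/tei:surface", TEI_NS):
        graphic = surface.find("tei:graphic", TEI_NS)
        pages.append((surface.get("n"), graphic.get("url")))
    return description, pages


description, pages = read_record(SAMPLE)
```

The description part answers the cataloguing question (what the item is and where it is held), while the list of pages is what a viewer needs in order to display the manuscript in the right order.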

As far as the description of manuscripts is concerned, the library approach, which is based mainly on the MARC format and its variants, is too elementary. Therefore, the first attempts to use the TEI language appeared at the end of the 1990s. Unfortunately, the first project produced in this language (master.dtd) did not take into account some specifics of the MARC formats, especially their formal approach to recording personal names, although it offered a detailed and even hierarchical description of the manuscript content. Between them, the two approaches covered all necessary operations with manuscripts in two fields: their elementary recording in libraries and their scholarly description by researchers. However, when we decided to combine these approaches within a unified database, we discovered some loss of information. Since MARC allows neither a sufficiently deep and detailed description nor flexible extension, we found a solution based on the TEI platform.

83 Digitization of Rare Library Materials: Storage and Access to Data / Project Management by Adolf Knoll and Stanislav Psohlavec. Authors: Adolf Knoll, Stanislav Psohlavec, Jan Mottl, Jan Vomlel, Tomáš Mayer. Prague, National Library - Albertina icome Praha, 1999. Memoriae Mundi Series Bohemica, also online at: http://digit.nkp.cz/knihcin/digit/WWW/ENTER.HTM.

84 http://www.tei-c.org/About/Archive_new/Master/Reference/oldindex.html.

85 http://digit.nkp.cz/MMSB/1.1/msnkaip.xsd.

86 http://projects.oucs.ox.ac.uk/ENRICH/.

In 2012–2013, in order to produce data ourselves, we worked out detailed requirements for the creation of the information batch (a definition of a digital document with regard to access and permanent storage requirements within the VISK6 subprogramme)⁸⁷. We determined the structure of the batch (i.e. what should be located where). Its essential constituent was, of course, the enrich.dtd format, which was supplemented by some other features: the rules for assigning unique file names, which are part of the manuscript’s digital format; the rules for recording technical data about digital images and other kinds of technical data such as ICC profiles; tool calibration data; and the requirement to include two specific calibration matrices and to record the colorimetric values of the colors represented in them.

We had to take into account that each change in the manuscript format resulted in a complex migration of the entire digital library content and changes in the “environment”, i.e. the programmes and tools which support document production, indexing and displaying, to say nothing of handling them at the user’s end.

Aggregation Principle

The intention to concentrate all requested information about resources in the digital environment appeared right after the emergence of the Internet. The solutions may differ, but they all share common features: they accumulate information from various resources, index it and let users search it in order to meet their information needs. Search engines such as Google, portals and even digital libraries appeared for this purpose. When the latter are referred to, classification problems arise: for instance, portals are often called digital libraries, though they are nothing of the kind. Internet search engines work with everything that is accessible on the Internet, while portals aggregate information from collaborating resources only. In both cases, users find meta-information which, if necessary, links them to the required resource, where the material itself can be accessed. In the cultural environment, such a portal is EUROPEANA. From the standpoint of the user, who needs to spend time and effort to access the required resource, the portal replaces a physical journey from one library to another by a virtual journey from one resource to another. The work of librarians or collection curators is replaced by the nature of the resource, specifically its interface. In any case, libraries and interfaces may have their own specific features, and each demands a special user approach.

87 Definice digitálního dokumentu pro potřeby zpřístupnění a trvalého uložení v podprogramu VISK6 / Olga Čiperová, Štěpán Černohorský, Tomáš Klimek, Tomáš Psohlavec, online at: http://www.manuscriptorium.com/sites/default/files/docs/manuscriptorium_visk6_definice.pdf.

To simplify the work of our users, we strive to aggregate not only meta-information/metadata but also the information as such, i.e. data or files with texts, images, audio and video records. The simplest way to do this is to concentrate all information in one databank, one physical digital storage, but for this we must come to an agreement with the owners, transform every document we get from them into the format of our resource and load it in full into our digital storage. This is how the World Digital Library of the Library of Congress works. Because this process is so labor-intensive, the Library of Congress can obtain only a small share of the digital collections of its partners.

Another solution is aggregation of those metadata which allow for not only obtaining information about the nature and location of the required document but also understanding how it is structured in the partner’s digital storage whose main purpose is to meet the requirements of the partner’s digital library. Each digital library has (1) a metadata database in which users can search and which contains information about the whereabouts of the data in question (in our case, these are texts or images of the manuscript pages), and (2) a digital storage of these data, from which users obtain the images or texts of the required manuscripts.

If, in the partner’s storage, image files are given fixed, invariable addresses, as is the case in the vast majority of instances, these image files may be requested by various data presentation systems, for example, by various digital libraries which hold data about the location of these image files and about the way they are assembled into a manuscript.
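The division of labor described above, a searchable metadata database on one side and image files at fixed addresses in the partner's storage on the other, can be sketched as a small data structure. This is only an illustrative model in Python: the class names, field names and the word-based index are hypothetical and say nothing about how Manuscriptorium is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    label: str      # e.g. folio number
    image_url: str  # fixed address of the image in the partner's storage


@dataclass
class AggregatedDocument:
    identifier: str
    title: str
    provider: str
    pages: list = field(default_factory=list)  # ordered list of Page objects

    def image_for(self, label):
        """Return the remote address of the image for a given page label."""
        for page in self.pages:
            if page.label == label:
                return page.image_url
        raise KeyError(label)


def build_index(documents):
    """Map each lowercase title word to the identifiers of matching documents."""
    index = {}
    for doc in documents:
        for word in doc.title.lower().split():
            index.setdefault(word, []).append(doc.identifier)
    return index
```

In this model the aggregator stores only the description and the page-to-address mapping. The images themselves stay with the partner and are requested from their fixed addresses when a user opens a page.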

Aggregation in Manuscriptorium is based on this principle. It allows users to work in real time with data from different digital libraries within Manuscriptorium’s unified interface. Users may get the impression that all data are located in one place, whereas in fact Manuscriptorium goes from one digital storage to another to fetch them. Such aggregation of data from various resources is called seamless aggregation, and it rests on comprehensive agreements with the partners.

Manuscriptorium prefers to obtain the metadata required for aggregation automatically via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). It requires that the OAI profile include a document description and structure with references to the files from which the document can be represented in the digital environment. At our end, we offer every partner special tools that convert these metadata from the partner’s format into the unifying internal format of Manuscriptorium. In our terminology, the set of tools corresponding to a certain partner is called a connector. The advantage of our approach is that a connector is developed only once and that metadata are harvested via OAI on a regular basis. As a result, when a partner adds new documents/manuscripts to his/her digital library, they are added to Manuscriptorium’s virtual collection after the necessary procedures (harvesting and processing for Manuscriptorium).
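The harvesting step can be illustrated with a short Python sketch of the standard OAI-PMH ListRecords exchange. It builds a request URL and reads record identifiers and the resumption token from a response. The base URL, the metadata prefix "tei_example" and the sample response are hypothetical, and the sketch parses an embedded string rather than calling a real partner repository. It does not reproduce Manuscriptorium's actual connectors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request; a resumption token replaces the prefix."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.findall(".//oai:record/oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return identifiers, (token.strip() or None)


# Hypothetical response from a partner repository (all values invented).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:partner.example.org:ms-1</identifier>
        <datestamp>2013-01-01</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:partner.example.org:ms-2</identifier>
        <datestamp>2013-01-02</datestamp>
      </header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://partner.example.org/oai", metadata_prefix="tei_example")
identifiers, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("http://partner.example.org/oai", resumption_token=token)
```

A regular harvest would repeat the request with each returned token until none is left, and then pass every harvested record to the partner's connector for conversion into the internal format.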

Of course, there are other methods of cooperation for cases when a partner can neither work with OAI nor has a digital library of his/her own. For such cases, Manuscriptorium offers a set of online tools that support the production of descriptions in Manuscriptorium’s internal format, their maintenance and their upload to Manuscriptorium.

User Environment Content and Customization

Manuscriptorium contains about 331,000 descriptions, more than 25,000 of which refer to digitized documents which are accompanied by more than 600 full texts. The total number of pure full texts exceeds 2,000 because some documents are represented by texts only, without any images.

Manuscripts are the main part of the digital library, but there are also extremely rare printed editions, including geographical maps. The data in Manuscriptorium have been provided by some 120 institutions, 55 of which are from the Czech Republic. Almost 70% of the fully digitized documents are provided by our foreign partners. Our library is the major provider of documents (over 3,500); other big partners in our country are the Moravian Library in Brno, the Strahov Monastery in Prague and the National Museum Library; they are followed by many organizations of various kinds, including the Kynžvart Castle Library (Königswart in Western Czechia), i.e. the library of Prince Metternich, Foreign Minister of the Austrian Empire. Among foreign partners, the biggest content providers (by the number of documents provided) are the Complutense University of Madrid (Spain), the Holy Trinity-St. Sergius Lavra (Russia), the Wroclaw University Library (Poland), the National Libraries of Italy (Florence), Spain, Iceland and Romania, the University Libraries of Vilnius (Lithuania), Heidelberg (Germany), Bratislava (Slovakia) and Zielona Góra (Poland), and many others. In some cases, our partners administer a group of participating libraries themselves (Cologne University, Germany).

In: Linguistic and Cultural Diversity in Cyberspace (pp. 194-200)