Digital editions survival kit

2021-10-26, Magdalena Turska

Reconstructing an edition

Computer systems are not meant to last. On the contrary: not only do they require regular maintenance, but we also need to take into account the unavoidable cycle of major refurbishments. This paper, just presented at the virtual TEI conference, aims to demonstrate how critically important aspects of an edition can be reconstructed from a rather minimal data set, and how such a survival kit can be useful not only for disaster recovery but also as a sustainable approach to the maintenance of scholarly publications.

The key components of the survival kit are:

  • Source documents
  • Document encoding and transformation scheme
  • Layout templates
  • Interoperable metadata mapping specification

TEI is the perfect archival format for text-centric data: it is human readable and easy to process. As long as we have the capacity to read text files, we can recover the information from a repository of TEI texts. TEI-encoded documents with an associated ODD schema and documentation already form a solid basis for reconstruction, even if the ODD says nothing about the final form of the publication as intended by the editors.

TEI source and rendition via the Processing Model

The TEI Processing Model covers part of this territory, describing how a source document should be transformed for publication. Nevertheless, in the virtual realm, the document is always accompanied by a certain context on the page: controls to zoom in or out, a facing facsimile image, or a switch between normalized and original spellings, to name just a few options. To define such a context explicitly and specify how a publication page should look and behave, we can rely on HTML5 layout templates. A modern approach to website design based on web components gives us a beautifully simple and expressive method of assembling web pages from the virtual equivalent of Lego blocks.

HTML5 page layout using web components

The last missing piece is to document how abstract concepts, e.g. author or date of creation, are realized in the encoding, so that we can recover them and use them for queries within the publication as well as for data interchange with other systems. Given the richness of TEI, it is impossible to prescribe what metadata needs to be gathered and how exactly it should be encoded in any given project. On the other hand, it is rather simple to express the mapping in XML, e.g. with an index configuration syntax.

Sample index configuration with fields and facets for a TEI document
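
The idea of such a mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the index configuration syntax itself: the field names, the paths in FIELD_MAP and the sample document are all hypothetical, and a real project would point each field at wherever its own encoding stores the information.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical mapping: abstract field name -> path into this project's encoding.
FIELD_MAP = {
    "title": ".//tei:titleStmt/tei:title",
    "author": ".//tei:titleStmt/tei:author",
    "date": ".//tei:sourceDesc/tei:bibl/tei:date",
}

# Hypothetical sample document, kept minimal for the sketch.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Letter to a friend</title>
        <author>Jane Doe</author>
      </titleStmt>
      <publicationStmt><p>Unpublished sample</p></publicationStmt>
      <sourceDesc>
        <bibl><date when="1989-11-09">9 November 1989</date></bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Sample text</p></body></text>
</TEI>"""


def extract_fields(xml_text, field_map=FIELD_MAP):
    """Return {field name: [values]} for one TEI document."""
    root = ET.fromstring(xml_text)
    result = {}
    for name, path in field_map.items():
        values = []
        for node in root.findall(path, TEI_NS):
            # Prefer a normalized @when value where the encoder supplied one,
            # otherwise fall back to the text content of the element.
            value = node.get("when") or "".join(node.itertext()).strip()
            if value:
                values.append(value)
        result[name] = values
    return result


print(extract_fields(SAMPLE))
```

Because the mapping is plain data, it can be archived next to the sources and the ODD, and a future system only needs to reinterpret the paths rather than guess where the editors put each piece of metadata.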

Such a set of specifications preserves all the information necessary to rebuild the edition from scratch, focusing on the intentions and decisions of the editor while filtering out the ephemeral or secondary presentation aspects. It is good to put in the vault and send into space, but equally useful when the time comes to migrate to a new infrastructure.

Original Dodis document view

How does it work in practice? You might want to have a closer look at one of TEI Publisher’s demo apps, When the Wall came Down, which we recreated on the basis of the TEI sources and accompanying ODD released by Dodis on the 30th anniversary of the fall of the Berlin Wall. We got a draft version working with only 165 lines of custom code, in just one day of a pre-conference workshop. Our task would have been simpler still if we had also had the web page template and index configuration available.

Recreated document view

Given that there’s barely any extra effort involved in assembling the survival kit, preparing it is a clear win. After all, we already have the sources and the ODD! Enriching the ODD with a processing model is not particularly difficult, especially if we use it to generate our transformations. Similarly, in most database systems we will need to prepare the index configurations anyway. At this point we probably don’t need to mention that TEI Publisher has implemented this approach for quite a few versions (ODD with the processing model from inception, web components for the user interface since version 4, and fields and facets since version 5).

Just think about it: if you pack your edition nicely, it becomes a present which archives and libraries would very much like to keep safe in their vaults and running on their servers forever…

Annotation editor released with new TEI Publisher 7.1.0

Answering the secret dream of many TEI users, the new TEI Publisher version 7.1.0 incorporates a beautifully simple to use, yet powerful, way to enrich existing TEI documents. Just select a text passage, click a button and within seconds (and without a pointy bracket in sight!) mark it as one of many supported annotation types. A place or person? Sure, and with built-in connectors for external authority files, too. Critical apparatus entries? We got you! Dates, corrections, regularizations and even quick fixes for typos in your transcription are covered as well.

As usual, everything is customizable and extendable, so if you want a particular kind of annotation we do not support out of the box, it’s not difficult to add your own or tinker with existing ones. Read more in the documentation.

The good news doesn’t end there: you can now use the TEI formula element with TeX notation for math. See the component’s demo page, which presents some elaborate formulae, or visit Publisher’s Demo collection, which now sports shiny new examples: Euler’s Algebra for a wee help with your quadratic equations, or The Italienische Madrigal by Alfred (not Albert!) Einstein, with musical scores encoded with MEI. It is nicely rendered with the Verovio library through a dedicated pb-mei component, and you can even listen to the piece to cheer up. You can now even set Publisher’s interface to simplified or traditional Chinese.

TEI Publisher 7.1.0 is available as an application package on top of the eXist XML Database. Install it into a recent eXist (5.0.0 or newer) by going to the dashboard and selecting TEI Publisher from the package manager.

For more information refer to the documentation or visit the homepage to play around with it.

It’s not the first time that our special thanks go to the Office of the Historian of the United States Department of State, this time for funding the major portion of the annotation editor. The math support has been kindly funded by the Bernoulli-Euler Zentrum in Basel.

Cross search

With a growing number of editions realized with TEI Publisher, it is a logical next step to wish for a search service which can run queries across multiple corpora at the same time.

Usually, the problem to solve is the great diversity of encoding across projects, even if they all use TEI as their vocabulary of choice. Even commonly represented information, like the language of the source document, can be stored in various locations in a TEI document. Lucene-based fields and facets, introduced in eXist-db 5.0, provide a mechanism to smoothly abstract away these encoding differences: we can just define, say, a language facet, and it is the role of the collection’s index configuration to specify where exactly to grab the data from.
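
To make the abstraction concrete, here is a minimal Python sketch rather than eXist-db's actual mechanism. It assumes two hypothetical collections, project-a and project-b, which record the language in different places; each collection's configuration names its own extractor, and the search layer only ever sees a uniform language value.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def lang_from_profile(root):
    """Language recorded in the header as langUsage/language/@ident."""
    node = root.find(".//tei:profileDesc/tei:langUsage/tei:language", NS)
    return node.get("ident") if node is not None else None


def lang_from_text(root):
    """Language recorded as @xml:lang on the text element."""
    node = root.find("tei:text", NS)
    return node.get(XML_LANG) if node is not None else None


# Hypothetical per-collection configuration: where does "language" live?
COLLECTION_CONFIG = {
    "project-a": lang_from_profile,
    "project-b": lang_from_text,
}

DOC_A = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><profileDesc>'
    '<langUsage><language ident="de">German</language></langUsage>'
    "</profileDesc></teiHeader><text><body><p/></body></text></TEI>"
)
DOC_B = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/>'
    '<text xml:lang="fr"><body><p/></body></text></TEI>'
)


def language_facet(corpus):
    """Count documents per language; corpus is a list of (collection, xml) pairs."""
    counts = Counter()
    for collection, xml_text in corpus:
        extractor = COLLECTION_CONFIG[collection]
        language = extractor(ET.fromstring(xml_text))
        if language:
            counts[language] += 1
    return counts


print(language_facet([("project-a", DOC_A), ("project-b", DOC_B)]))
```

The query side never needs to know which extractor produced a value, which is what makes a shared facet usable across differently encoded corpora.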

The next potential issue is actually running the queries across corpora, particularly with a decentralized infrastructure where editions are hosted on diverse servers. The answer here is to define an API which individual editions need to expose, so that the aggregate search engine can just poll all its registered ‘members’, regardless of their location or how they implement the search internally.

Cross-search results page

The cross-search prototype is exactly such a search engine. With a simple configuration one can register all ‘member’ editions. The only requirement for the editions themselves is that they expose the api/search/document API endpoint, which is a matter of simple customization for all TEI Publisher 7 applications, as these support OpenAPI specifications out of the box. The api/search/document endpoint must accept a number of parameters defined in the specification. For this prototype, the title, author and lang(uage) fields as well as the genre, language and corpus facets were assumed.
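
The polling step can be sketched roughly as follows. This is not the prototype's code: only the api/search/document path comes from the description above, while the use of title, author and lang as query parameter names, the JSON array response and the member registry format are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "api/search/document"  # endpoint path named in the text


def build_url(base_url, params):
    """Build the search URL for one member, skipping empty parameters."""
    query = urlencode({key: value for key, value in params.items() if value})
    return f"{base_url.rstrip('/')}/{ENDPOINT}?{query}"


def fetch_json(url):
    # Assumption: each member answers with a JSON array of hit objects.
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def cross_search(members, params, fetch=fetch_json):
    """Poll every registered member and merge the hits into one list.

    members maps a (hypothetical) member name to the base URL of its edition.
    """
    hits = []
    for name, base_url in members.items():
        try:
            results = fetch(build_url(base_url, params))
        except OSError:
            # An unreachable member should not break the whole search.
            continue
        for hit in results:
            # Remember which edition each hit came from.
            hits.append({**hit, "member": name})
    return hits
```

Passing the fetch function as a parameter keeps the aggregator independent of how members are reached, so it can be exercised with canned responses before any edition is online.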

We are very happy to report that our prototype works really well as a proof of concept with the eclectic collection of documents from the TEI Publisher demo apps, all originating from vastly different projects with diverse encoding styles. Next, we intend to extend this idea into a general portal for archives and libraries, and we would welcome collaboration from such institutions.

Our sincere thanks go to the Bibliothek für Bildungsgeschichtliche Forschung des DIPF / Research Library for the History of Education at DIPF for supporting this project.