Names sell: Named entity recognition in TEI Publisher

2022-06-10, Wolfgang Meier

TEI Publisher 8 will include experimental support for detecting and tagging named entities in texts. The idea is to further simplify the work of editors when annotating documents via TEI Publisher’s web-based annotation editor by automatically identifying potential candidates for people, places etc.

If you have the development branch (or future TEI Publisher 8) installed and the named entity recognition (NER) service running (more on this below), an additional button will be shown in the top left toolbar. Clicking on it gives you a choice of NER models to use. By default those are the standard models provided by the NER engine we’re using. Below we see NER in action, detecting entities in a modern-language text copied from Wikipedia.

Entities identified by NER get a marker in striped color, which allows users to distinguish them from annotations that were tagged manually. The user can now review the identified entities, assign them an authority entry etc. As each annotation is reviewed, its stripes are removed.

While NER works well in this case on a modern-language text, you’ll soon encounter the limits of the standard model when trying it out on different types of literature. However, we can gradually improve the quality of the entity recognition by feeding completely annotated documents back into the process, i.e. by training our own recognition model. The ideal workflow could be imagined as follows:

  • editors manually tag a portion of documents via web-based annotations
  • once a certain number of entities has been tagged, we can train a custom model
  • continuing the annotation process, the custom model can be used to identify potential candidates for semantic annotations, thus improving the workflow
  • the model is retrained on the growing set of fully annotated documents, resulting in better prediction rates

The ultimate goal is to make this process as smooth as possible, i.e. it should not hinder your editing work, but support it!

The integration in TEI Publisher is completely functional, but we need more testing, experimenting and kicking the tires with some real-world use cases. NLP is not a simple subject and I’m in no way an expert, so I’d like to invite the community to help. I have just prepared the groundwork.

Technical Background

There are plenty of NLP and NER libraries and tools. However, such libraries work on plain text, not structured texts like TEI. They will get confused by angle brackets (just like many humans). The trick thus is to transform the XML into plain text without losing context, which means we somehow need to keep track of element boundaries, offsets of inline elements etc.

Likewise, the result of running NER is again a plain text document, accompanied by a list of detected entities and character offsets. Those need to be mapped back onto the XML structure and eventually merged into the TEI. This back-and-forth conversion is the main job handled by TEI Publisher and its API.
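The first half of that round trip can be illustrated with a few lines of Python. This is only a minimal sketch under simplifying assumptions, not TEI Publisher's actual implementation: the function name `flatten` is made up for this example, namespaces are ignored, and every element simply gets a character span in the extracted text.

```python
import xml.etree.ElementTree as ET

def flatten(xml_string):
    """Return (plain_text, spans) for an XML fragment.

    spans holds one (tag, start, end) tuple per element, with character
    offsets into plain_text, so that results computed on the plain text
    can later be mapped back onto the XML structure.
    """
    root = ET.fromstring(xml_string)
    parts = []   # text chunks in document order
    spans = []   # (tag, start, end)
    pos = 0      # length of the plain text collected so far

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            if child.tail:           # text following the child element
                parts.append(child.tail)
                pos += len(child.tail)
        spans.append((elem.tag, start, pos))

    walk(root)
    return "".join(parts), spans

text, spans = flatten("<p>Lieber <hi>Herr</hi> Barth!</p>")
print(text)    # Lieber Herr Barth!
print(spans)   # [('hi', 7, 11), ('p', 0, 18)]
```

An entity reported by the NER engine at certain character offsets can then be compared against the recorded spans to decide which elements it falls into.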

My first idea was to integrate another, existing command-line tool for enriching TEI with named entities. But after a few first experiments, it occurred to me that TEI Publisher already had some important bits and pieces in place within the annotation framework:

  • it defines a standoff JSON format to keep annotations separate from the TEI as long as the user is making changes. The web-based editor reads this format to display the nice fruit salad – i.e. marked entities – you see on screen.
  • it implements algorithms to merge the standoff annotations into inline TEI elements. This works losslessly: anything which is not an annotation is left untouched. The merge algorithm is fast and reliable.
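To give a rough idea of what merging standoff annotations means, here is a simplified Python sketch. It is an illustration only: the real merge operates on the TEI tree and handles existing markup, whereas this hypothetical `merge_annotations` helper works on a plain string and assumes that annotations do not overlap.

```python
from xml.sax.saxutils import escape

def merge_annotations(text, annotations):
    """Wrap each (start, end, tag) span of text in an inline element.

    Text outside the annotated spans is copied unchanged (apart from
    XML escaping). Overlapping spans are rejected.
    """
    out = []
    cursor = 0
    for start, end, tag in sorted(annotations):
        if start < cursor:
            raise ValueError("overlapping annotations are not supported")
        out.append(escape(text[cursor:start]))
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    out.append(escape(text[cursor:]))
    return "".join(out)

letter = "Bultmann Marburg, 25.V.1922 Lieber Herr Barth!"
standoff = [(40, 45, "persName"), (9, 16, "placeName")]
merged = merge_annotations(letter, standoff)
print(merged)
```

Sorting the annotations first means they can arrive in any order, which is typically the case when a user adds them interactively.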

We could thus reuse those building blocks and just add a communication layer, which mediates between TEI Publisher and the external NLP library. This communication layer has been realized through a set of Open API endpoints on both sides, allowing them to have a conversation, sending data back and forth. Below you see the NLP API endpoints exposed by TEI Publisher:

The NLP part is a Python service using spaCy as the underlying NLP library. Compared to many NLP libraries I have seen before, spaCy has a rather simple, clean API. Getting started proved to be smooth and painless as most of the functionality comes pre-configured and ready to be used. A Python notebook demonstrating how to do a simple NER with spaCy is shown below:

SpaCy can do more than just NER, e.g. part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis. It is also quite extensible, allowing other libraries to be plugged into its pipelines.

Training a model

Whenever you use a general-purpose NER model, you’ll quickly notice that it works great for some types of texts, i.e. texts similar to the ones it was trained on. However, results quickly degrade if you apply it to other genres or texts written in a slightly different period of time. For example, the German standard model in spaCy produces good results when run on a text from Wikipedia, but it already starts missing things when you apply it to letters written in the first half of the 20th century: the language used back then was just different, some would say: more sophisticated, than the language used today.

For example, if we run the standard German model against a letter by theologian Karl Barth written in 1921, many words are wrongly identified as places:

For most real-world scenarios, there’s thus no way around training your own custom model. Fortunately this fits rather well into the annotation workflow implemented by TEI Publisher: usually, training a model involves going through a large amount of text to tell the NLP engine which words are considered part of an entity and which are not. Sometimes this is done in tabular form – which is quite tedious, but there are also tools to support the task, e.g. Prodigy, a commercial application created by the makers of spaCy.

All those approaches have one disadvantage: you do all the hard work just for the purpose of training a model. Compared to this, enriching the TEI with entities is a useful task in itself. Even if you figure out later that you can’t really use NER, the work invested is not lost. TEI Publisher tries to make this as seamless as possible, being able to transform any semantically rich TEI into training data. Preparing training data is thus kind of a natural side effect of annotating documents and does not require additional manual steps or separate tagging.

TEI Publisher exposes an API endpoint through which you can download training data in JSON format for either a single document or a whole collection.

It chunks the text into blocks (paragraphs, headings etc.) and extracts a plain text representation of each. Here it is important that sentences are preserved semantically. Inline notes, apps or choices would appear out of context in the middle of a sentence, so they have to be removed. Notes will be moved into separate blocks at the end.
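The chunking step could be sketched roughly as follows. This is a simplified illustration, not the code TEI Publisher runs: the function name, the choice of `head` and `p` as block elements and the decision to drop `app` and `choice` entirely are assumptions made for this example, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET

def extract_blocks(xml_string, block_tags=("head", "p"), drop_tags=("app", "choice")):
    """Return one plain-text string per block, followed by the note texts.

    Inline notes are taken out of the running sentence and appended as
    separate blocks at the end; elements listed in drop_tags are skipped.
    """
    root = ET.fromstring(xml_string)
    notes = []

    def plain_text(elem):
        parts = [elem.text or ""]
        for child in elem:
            if child.tag == "note":
                notes.append("".join(child.itertext()))
            elif child.tag not in drop_tags:
                parts.append(plain_text(child))
            parts.append(child.tail or "")  # text after the child stays in the sentence
        return "".join(parts)

    blocks = [plain_text(e) for e in root.iter() if e.tag in block_tags]
    return blocks + notes

sample = "<div><head>Brief</head><p>Lieber Herr<note>Anrede</note> Barth!</p></div>"
print(extract_blocks(sample))
# ['Brief', 'Lieber Herr Barth!', 'Anrede']
```

Note that the sentence in the paragraph block reads naturally again once the note has been moved out of it.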

All existing entities in a block, i.e. persName, placeName etc., are listed along with the text, recording their type, start and end positions:

  {
    "source": "train/1006.xml",
    "text": " Bultmann Marburg, 25.V.1922 Lieber Herr Barth! ",
    "entities": [
      [
        10,
        17,
        "LOC"
      ],
      [
        41,
        46,
        "PER"
      ]
    ]
  }
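Judging from this sample, start offsets are inclusive and end offsets exclusive, just like Python slices. A small sanity check along the following lines (the helper name `resolve_entities` is made up for this example) shows how the offsets resolve to the tagged words:

```python
import json

record = json.loads("""
{
  "source": "train/1006.xml",
  "text": " Bultmann Marburg, 25.V.1922 Lieber Herr Barth! ",
  "entities": [[10, 17, "LOC"], [41, 46, "PER"]]
}
""")

def resolve_entities(record):
    """Return (label, surface_text) pairs, checking that offsets are in range."""
    text = record["text"]
    resolved = []
    for start, end, label in record["entities"]:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"offsets {start}-{end} are out of range")
        resolved.append((label, text[start:end]))
    return resolved

print(resolve_entities(record))
# [('LOC', 'Marburg'), ('PER', 'Barth')]
```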

The exposed API endpoint is used by the Python service running spaCy to retrieve the training data. The service will normalize the data, e.g. collapse whitespace, and initiate the training. By default, 30% of the sample records are reserved for validating the model and the remaining 70% are fed into the training. Once completed, the resulting model will become available in spaCy as a new model to be used for entity extraction.
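Both preparation steps can be sketched in plain Python. The snippet below is an illustration under assumptions, not the code of the actual service: the function names are invented, and entities are assumed not to start or end on whitespace. The important detail is that collapsing whitespace changes character positions, so the entity offsets have to be shifted along with the text.

```python
import random

def collapse_whitespace(text, entities):
    """Collapse whitespace runs into single spaces, trim both ends,
    and shift the [start, end, label] entity offsets to match."""
    out = []
    mapping = []        # mapping[i] = offset in the new text for old offset i
    prev_space = True   # pretend the text starts after a space: drops leading whitespace
    for ch in text:
        mapping.append(len(out))
        if ch.isspace():
            if not prev_space:
                out.append(" ")
            prev_space = True
        else:
            out.append(ch)
            prev_space = False
    mapping.append(len(out))  # offset just past the last character
    new_text = "".join(out).rstrip()
    new_entities = [[mapping[s], mapping[e], label] for s, e, label in entities]
    return new_text, new_entities

def split_records(records, validation_share=0.3, seed=0):
    """Shuffle the records and reserve a share of them for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_share)
    return shuffled[n_validation:], shuffled[:n_validation]

clean_text, clean_entities = collapse_whitespace(
    "  Lieber   Herr Barth!  ", [[16, 21, "PER"]]
)
print(repr(clean_text), clean_entities)
# 'Lieber Herr Barth!' [[12, 17, 'PER']]

train, validation = split_records([{"id": i} for i in range(10)])
print(len(train), len(validation))
# 7 3
```

Using a fixed seed for the shuffle keeps the split reproducible between training runs, which makes the reported figures easier to compare.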

The actual training can be triggered in two ways:

  • via a web page in TEI Publisher, which calls the Python service, passing it the information you entered in a form
  • via a separate Python/spaCy project

We strongly recommend the second approach as it gives you more control over the process and its configuration. For simple tests, the web-based form is sufficient though.

You can train a model from any collection residing below TEI Publisher’s data collection or the corresponding data collection of any custom application generated by TEI Publisher 8. For example, for the Karl Barth edition we already have several volumes of letters, which were manually enriched with entities via the web-based annotation editor. For a quick test, we can thus upload one volume of annotated letters to the train/ subcollection below TEI Publisher’s data root (using eXide) and start training a custom NER model on this collection via the web interface:

The training will take a short while as it does several runs trying to improve accuracy. During the process, key figures are shown for each run and you should see how they gradually improve until a certain threshold is reached and the training stops.

Afterwards the newly trained model will become available within the annotation editor and we can try NER on the same letter as before using the custom model:

Looking at the same passage of Barth’s letter, we see that now only Göttingen is identified as a place. The wrongly detected places are gone. This is clearly an improvement over the standard model. The one name in the fragment, Nelly, is still not identified as a person, but the size of the training set was still rather small. The more documents we annotate, the more training data we have, and the quality of the model should gradually improve over time.

Running the NER service

For named entity recognition to be available in the web-based annotation editor, a separate Python service is needed: clone the tei-publisher-ner repository and follow the instructions in the README.

The tei-publisher-ner package uses spaCy’s own project configuration library, which in itself is quite a useful tool. The main configuration is contained in project.yml, which exposes a number of commands. You can also use those to train a custom model as an alternative to the web form we saw above. This gives you more control and more options.

To train a custom model you can either change the variables in project.yml or pass them as command line parameters, e.g.:

python3 -m spacy project run all . --vars.name=hsg_demo --vars.app_name=hsg-annotate --vars.training_collection=frus1981-88v05 --vars.lang=en

This will contact a TEI Publisher generated app called hsg-annotate and retrieve training data from the frus1981-88v05 collection below the app’s data root, using English as the training language. The output of the command will look pretty much the same as the output you saw on the web before.

Conclusion

Even in its current, rudimentary form, the NER integration in TEI Publisher can already help to speed up the editing workflow. We do need to gain more experience with training custom models though and the community is warmly invited to help with this.

There’s also a lot of room for improving the annotation workflow, e.g. by

  • automatically linking detected entities to authority entries (where unambiguous)
  • implementing a wizard-like dialog, which walks users through the entities identified by NER one by one, allowing them to quickly confirm or reject an annotation and associate it with the correct authority entry
  • employing rule-based detection models in addition to the statistical, trained models: for example, if you already have a list of names from a back-of-book index, a rule-based algorithm may produce better results than a trained model
  • supporting batch operation across multiple documents
  • integrating other spaCy features like part-of-speech tagging etc.
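To illustrate the rule-based idea from the list above: with a list of known names at hand, a plain lookup already yields entity candidates in the same offset format as the training data. The following is a minimal standard-library sketch with invented names (`find_known_names`, `index_of_places`); spaCy itself ships a rule-based entity ruler component that would be the more natural fit inside the service.

```python
import re

def find_known_names(text, names, label):
    """Return [start, end, label] for every whole-word occurrence of a known name."""
    if not names:
        return []
    # Longest names first, so "Karl Barth" wins over "Barth" at the same position.
    alternatives = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in alternatives) + r")\b"
    )
    return [[m.start(), m.end(), label] for m in pattern.finditer(text)]

index_of_places = ["Göttingen", "Marburg", "Basel"]
hits = find_known_names("Bultmann Marburg, 25.V.1922", index_of_places, "LOC")
print(hits)
# [[9, 16, 'LOC']]
```

Such a lookup cannot resolve ambiguous names, so its results would still need the same review step as entities found by a trained model.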

Newsletter 2022/1

2022-04-05, Wolfgang Meier

e-editiones as a Society

A highlight of the past year was e-editiones being awarded the 2021 TEI Community Prize. The jury offered us these kind words:

«The awards panel was especially impressed by the way e-editiones has managed to gather a non-profit community of those creating scholarly digital editions and made the process of doing so easier through the coordination of ongoing development of the TEI Publisher software. The awards panel also noted the provision of training opportunities and open availability of the workshop materials for those wishing to (re)learn the software in their own time.»

2021 has fortunately seen an increase in membership. Today there are 12 institutional and 30 individual members. However, it remains our goal to attract new members who will continuously support the association.

One of e-editiones’ strategic goals is to ensure the long-term availability of digital editions. To this end e-editiones supports hosting offers that also include the continuous maintenance of digital editions (software updates). This year Archives Online has set up such an offer with «Sources Online». Some of the editions associated with e-editiones already use the Sources Online servers.

In February 2021, e-editiones received a grant from the Ernst Göhner Stiftung. The joint proposal was backed by a number of members and their institutions. Together with the additional generous contributions made by participating projects, this grant supported a large part of last year’s development work.

With the Escher correspondence, a prominent Swiss edition has been migrated to TEI Publisher. This step was necessary because maintenance and hosting with the previous setup had become too expensive in the long term. With the TEI Publisher application hosted by Sources Online, the annual costs are reduced to a third.


Events

Community Meetings

e-editiones was able to hold 7 community meetings and thus contribute to the exchange of expertise. Many of the meetings were extremely well received. However, the number of participants varied greatly.

  • 2021-10-05: An introduction to the Distributed Text Services (DTS)
  • 2021-09-07: TEI Publisher 7.1 – configuring web annotations
  • 2021-07-06: Contributing to TEI Publisher – a gentle introduction
  • 2021-06-15: For(e) humanists – Metadata, Forms and more
  • 2021-06-01: FairCopy
  • 2021-05-04: Open access scholarly digital editions at the Finnish Literature Society: experiences with TEI Publisher
  • 2021-04-06: Workflow from Word files or Transkribus to TEI Publisher

Workshops

Among the workshops, special mention should be made of the 5-part introductory workshop by Anne Diekjobst and Claudia Sutter, which received very good feedback.

  • 2021-08-2021: Manuscript Mondays – Einführung in das digitale Edieren handschriftlicher Quellen (5-teilig)
  • 2021-03-30, 17.00 CEST: Versioning and Archiving Data: TEI2Zenodo
  • 2021-03-08: Beginners Workshop (Git, Editing Workflows)

A big thank you to all who actively participated in the meetings and workshops.


Communication

On Slack, we had 218 active members at the end of 2021; on average, there were 16 active members daily, with 3 to 4 members posting a total of about 8 messages daily. On Twitter, we post 4 to 5 tweets per month. 2021 brought us 172 new followers.

Our mailing list, on the other hand, is not very actively used. We are happy to receive suggestions and ideas on how to proceed with it.

TEI Publisher Developments

In February 2021 e-editiones successfully applied for a grant from the Ernst Göhner Stiftung, from which we received 30,000 CHF for further TEI Publisher development. The joint proposal was backed by a number of members and their institutions and included features like:

  • support for web annotations
  • versioning and long term availability
  • persistent URLs
  • accessibility
  • showing and navigating events in a timeline
  • display of mathematical formulas

Obviously the budget did not suffice to cover everything planned, but thanks to the additional contributions provided by the member institutions, we could address the main topics and even go beyond in some areas.

Web Annotations

The most visible feature is the editor for web-based annotations, released in version 7.1.0. The development team had been contemplating this feature for a couple of years already, and thanks to additional generous funding by the Office of the Historian at the US Department of State, it finally became reality. Being able to annotate TEI documents via a graphical, web-based interface eases the burden of enhancing a transcription with semantic, analytic or text-critical markup. Users work in a user-friendly environment in which XML code is neatly hidden from sight. The integration of external authority databases saves a lot of time, improves consistency and opens possibilities for data exchange and interoperability. At the same time, the annotation editor is fully configurable, allowing complex, nested markup where necessary. Several member institutions are actively using the editor in their daily work and continue to contribute to its development.

The annotation editor marks the first milestone in our endeavour to extend TEI Publisher to support the entire editing workflow rather than just publishing the end result. Further steps are already in planning, e.g. form-based editing of the TEI header, which will offer the same level of customizability and extensibility.

Web Components

Several new web components have been released during the past year: most notably, a new timeline component allows users to visualise dates and events in an interactive, graphical display. The component can be used to directly select a date or date range, or it can be connected to a faceted search to drill down into a collection of items. Development was supported by the Staatsarchiv Zürich. This component is prominently featured in the remake of the Alfred Escher correspondence edition (to be announced soon).

The Bernoulli edition in Basel financed a component to display mathematical formulas embedded in the TEI text, using either MathML or TeX notation. Other new components include a grid element for viewing tabular data, and a "split list", which comes in handy when browsing lists of places, people or abbreviations. The latter was again financed by the Staatsarchiv Zürich and first appeared in the Escher correspondence.

As usual, many other components have seen major improvements needed by concrete projects. For example, the map component is now able to display a large number of places at once by clustering the markers.

TEI Publisher 8

The next major release of TEI Publisher is currently being finalized. It will include a few breaking changes, mainly in the libraries used, but as usual we’re aiming to remain as backwards compatible as possible. One reason for the breaking changes is to better support persistent, bookmarkable URLs as well as browsing the navigation history. The goal is to make the URL structure as seen by the end-user completely independent of the underlying organization of documents and collections. This involves both a server- and client-side part.

The main remaining task on the way to the release is to extensively document the possibilities and approaches enabled by those changes.

Other Software Packages

In line with our modular development policy, a number of software packages which should be considered part of Publisher 8 are published in separate repositories:

  • tuttle – a Git Integration for eXist-db (stable): allows synchronizing a data collection directly from a GitHub or GitLab repository to the database. It can deal with multiple repositories as well as incremental updates. Thanks to the Karl Barth Gesamtausgabe for contributing funding.
  • TEI Publisher Named Entity Recognition API (beta): adds named entity recognition to the web-based annotation editor. Use it to identify places, people and other entities while annotating a document. The package also provides scripts to train your own model based on already annotated documents via machine learning. Training data is automatically extracted from the annotated document by TEI Publisher, so there’s no need to go through the tedious task of compiling suitable data by hand.
  • Static Site Generator for TEI Publisher (beta): transforms a website based on TEI Publisher into a static version, which requires neither TEI Publisher nor eXist-db. The feature is intended as a low-cost option for small editions whose main purpose is to allow users to browse through a collection of texts without demanding sophisticated search or navigation facilities. Generated files can easily be hosted on free services like GitHub Pages. To see it in action, visit our viewer for the TEI Guidelines, which we used as a testbed.
  • Docker Compose Configuration (beta): a configuration to help users install a TEI Publisher-based application on a Docker-enabled host. It handles the more difficult tasks of installing a reverse proxy in front of Publisher as well as registering an SSL certificate. Hosting via Docker Compose can be a viable option for smaller projects with a limited budget and for users who lack the server administration skills necessary to set up a dedicated hosting service.
  • Fore: an XForms-inspired library for building complex forms with web components. While this is not an integral part of TEI Publisher yet, it will become the foundation for many of the future, workflow-related features we have in mind (see below).

The Future

The e-editiones board will prepare a new joint funding proposal soon. To be successful we’ll again need projects or institutions to express interest and signal readiness to make a contribution (in whatever way). If you are working with TEI Publisher and wish for a certain feature, please do not hesitate to contact us, so we can add it to the list of topics.

A main area we currently have in mind is to further enhance TEI Publisher with respect to supporting the editorial workflow:

  • integrate general metadata – i.e. TEI header – editing facilities based on customizable forms
  • add an interface to manage local authorities within the annotation framework
  • support other annotation types, e.g. empty elements and stand-off annotations which are not inlined
  • named entity recognition: automatically connect detected entities with matching authority entries
  • allow batch processing of entire documents via named entity recognition
  • support multi-user workflows with an annotate/review/merge process

Other areas might be:

  • a form-based interface to customize basic settings and CSS variables
  • direct integration with Transkribus and other HTR/OCR software
  • support for other input formats (Excel, CSV)
  • remove the dependency on Google Material Design to make all visual aspects configurable
  • port the core TEI processing model implementation to be usable without eXist (e.g. directly within oXygen)
  • drop the somewhat arbitrary distinction between browsing and searching currently imposed by TEI Publisher and refactor the search API to be more easily customizable
  • enhance and speed up search result display (KWIC)

If any of this rings a bell, please consider whether you could support it by taking part in the joint funding proposal. Obviously we’ll also be more than happy to pick up suggestions not yet on the list.

Call for Applications


Small grants program for scholars affected by war in Ukraine

e-editiones and the TEI Consortium in collaboration with Archives Online and JinnTec announce a small grant program aiming to help scholars of Ukrainian cultural heritage to continue their work that has been disrupted by the Russian invasion of Ukraine. We urge other institutions to support this call so we can offer funding for more of our colleagues affected by the war.

Who is eligible?

Any scholar who had to leave Ukraine or relocate within Ukrainian territory because of the war and is working on sources broadly conceived as textual cultural heritage and plans to make data and results openly available.

What is offered?

  • Initial funding ranging between 500 and 2000 EUR to support the continuation of research work on materials related to cultural heritage available in TEI
  • Help with encoding and other conceptual aspects of a digital edition
  • Technical support in converting data sources into the TEI Standard and/or publishing them as a digital edition online
  • Hosting of the edition for at least 3 years on Sources Online servers

How to apply?

Please send a brief description of the project and its current state to info@e-editiones.org. Applications will be reviewed on an ongoing basis. The number of grants awarded will depend on the funding contributions we are able to secure; the confirmed initial pool is 10,000 EUR. Please include a description of your personal situation and why you should be eligible.

How can I contribute?

If you or your organization would like to contribute funds or services to this initiative, please email info@e-editiones.org with details. Individuals are encouraged to donate via PayPal. e-editiones is registered as a non-profit society in Switzerland.

Supporters

  • SAGW (Swiss Academy of Humanities and Social Sciences)



Archives Online

Small grants program for scholars affected by the war in Ukraine

e-editiones and the TEI Consortium, in collaboration with Archives Online and JinnTec, announce a small grants program aimed at helping researchers of Ukrainian cultural heritage to continue their work, which has been interrupted by Russia’s invasion of Ukraine. We call on other institutions to support this appeal so that we can offer funding to more of our colleagues affected by the war.

Who is eligible?

Any scholar who was forced to leave Ukraine or relocate within Ukrainian territory because of the war, who works on sources broadly understood as textual cultural heritage, and who plans to make data and results openly accessible.

What is offered?

  • Initial funding of 500 to 2000 EUR to support the continuation of research work on cultural heritage materials available in TEI
  • Help with encoding and other conceptual aspects of a digital edition
  • Technical support in converting data sources to the TEI standard and/or publishing them as a digital edition online
  • Hosting of the edition for at least 3 years on Sources Online servers

How to apply?

Please send a short description of the project and its current state to info@e-editiones.org. Applications will be reviewed on an ongoing basis. The number of grants awarded will depend on the financial contributions we are able to secure, with a confirmed initial budget of 10,000 EUR. Please add a description of your personal situation and explain why your application should be considered favourably.

How can I contribute?

If you or your organization would like to contribute funds or services to this initiative, please send the details to info@e-editiones.org. Individuals are encouraged to donate via PayPal. e-editiones is registered as a non-profit organization in Switzerland.



