The Digital Library in the TextGrid Repository

The Digital Library in the TextGrid Repository is an extensive collection of German texts in digital form, ranging from the beginnings of printing to the first decades of the 20th century. The collection is of particular interest to German Literature Studies, as it contains virtually all the important texts in the canon and numerous other texts relevant to literary history whose copyright has expired. The same applies to Philosophy and Cultural Studies as a whole. For the most part, the texts are taken from textbooks and can therefore be cited; the same holds for the remaining texts, which predominantly stem from the digitisation of first editions.

Fields

The texts of the Digital Library in TextGrid, which originate from the online library zeno.org, can be arranged in the following categories:

  • Literature (texts by 693 authors)
  • Fairytales (58 texts)
  • History (14 texts)
  • Cultural history (113 texts)
  • Art (12 texts)
  • Music (81 texts)
  • Natural sciences (20 texts)
  • Philosophy (texts by 248 authors)
  • Sociology (1 text)
  • Reference works (27 texts)

TextGrid makes these texts available not only for reading, but particularly for further processing, e.g. in editions and text corpora. For this purpose, the XML files are converted into a valid TEI format during the course of the project, which will enable precise searches of the texts.

Collection "Literature"

Since the end of 2009, the data stock of the “literature folder” (fiction) has been available for download in XML/TEI format for scholarly use and advanced research. Philosophical texts and lexica will follow.

The Corpus

Preparation of the corpus
During transformation, the original zeno XML mark-up was converted into TEI. In addition, further mark-up was added on the basis of existing mark-up or other text structure (e.g. lg grouping, speaker). Because such large amounts of data are hard to handle, the heuristics applied are kept schematic and unspecific, so they may in some cases produce erroneous interpretations. To remove these sources of error step by step, the “literature folder” will be revised continuously; hence, different versions will be available.

Metadata
Extracting metadata from the bibliographic details of the authors’ works turned out to be challenging. In addition to the variety of source types, bibliographic statements are recorded inconsistently: the title, volume, publisher, etc. of a work are given in differing order and are separated either by commas or by full stops. Dates are given only in part and in differing formats. The following metadata could be extracted:

  • Author: sourceDesc/biblFull/titleStmt/author, PND in the attribute “key”.
  • Work title: fileDesc/titleStmt/title.
  • Date specification: If a distinct date of origin was identified, it was tagged as sourceDesc/biblFull/publicationStmt/date. If no distinct date of origin could be established, we tried to approximate the period of origin with <date notBefore="…" notAfter="…">. For this purpose, the authors' dates of birth and, where available, information about the first publication were used. This information is evaluated in order to narrow the period as far as possible with the attributes “notBefore” and “notAfter”.
  • All of the available bibliographic information represented by a string: sourceDesc/biblFull/titleStmt/title.

The extraction of metadata is currently being continued in order to encode the data in an even more precisely structured form.
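
As an illustration of the paths listed above, the following minimal sketch reads these fields from a TEI header with Python's standard library. The sample header, the author name, the "key" value and the function names are invented placeholders; the sketch also assumes that the files use the usual TEI P5 namespace and that the paths sit below teiHeader, which should be checked against the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Assumption: the corpus files use the standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample header, invented for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Beispielgedicht</title></titleStmt>
      <sourceDesc>
        <biblFull>
          <titleStmt>
            <title>Beispielautor: Werke. Band 1, Musterstadt 1900.</title>
            <author key="pnd:000000000">Beispielautor</author>
          </titleStmt>
          <publicationStmt>
            <date notBefore="1850" notAfter="1870"/>
          </publicationStmt>
        </biblFull>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def text_of(element):
    """Return the stripped text of an element, or None if absent or empty."""
    if element is None or not element.text:
        return None
    return element.text.strip()


def read_metadata(tei_root):
    """Collect the metadata fields described in the list above."""
    bibl = ".//tei:sourceDesc/tei:biblFull"
    author = tei_root.find(bibl + "/tei:titleStmt/tei:author", TEI_NS)
    date = tei_root.find(bibl + "/tei:publicationStmt/tei:date", TEI_NS)
    title = tei_root.find(".//tei:fileDesc/tei:titleStmt/tei:title", TEI_NS)
    source = tei_root.find(bibl + "/tei:titleStmt/tei:title", TEI_NS)
    return {
        "author": text_of(author),
        "author_key": author.get("key") if author is not None else None,
        "title": text_of(title),
        "source": text_of(source),
        # A distinct date is element text; an approximated one uses attributes.
        "date": text_of(date),
        "not_before": date.get("notBefore") if date is not None else None,
        "not_after": date.get("notAfter") if date is not None else None,
    }


metadata = read_metadata(ET.fromstring(SAMPLE))
for field, value in metadata.items():
    print(field, "=", value)
```

A work with a distinct date of origin would carry the date as element text, so "date" would be filled and the two range fields would stay empty.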

Download

Please use the following links to download the entire data stock of the “literature folder” as well as a schema for the data (in German):
Download of published files: Text and images (version I) (1.9 GB)
Download text corpus version I (391 MB)

Download text corpus version II (390 MB)
Download schemas (Git Repository)

On the versions

The original data encodes works, collections, chapters and other text items, e.g. headings, with one single recursive tag (the article element). The nesting of these article elements is ambiguous, so it is difficult to determine whether an article element contains the whole text, one chapter, the dedication, a device or only the title or other metadata. As a result, the first version of the corpus to some extent encoded particular titles within the TEI element. This bug was eliminated in corpus version II, where this information is encoded in the front element.

Portal

The data in the TextGrid portal (repository) can only be updated at longer intervals. In the long run, however, the portal will be made interactive, so that users can exchange information on encoding inaccuracies directly. For information on identified challenges with the corpus, please contact katrin.betz@uni-wuerzburg.de.
The texts of the literature collection are available from the TextGrid Repository; the collections of philosophical texts and lexica will follow.

Future functions of the portal

  • Download of self-assembled sub-corpora.
  • Search: search within time periods; suitable matches can be downloaded within the context (approx. 4 lines); improvement of the possibilities to restrict the search to certain sub-corpora.
  • Communication between users: links to research projects and publications which are based on the Digital Library; mailing list.

Differences between the corpus of the portal and the total corpus

Corpus of the portal
To ingest the corpus data into the TextGrid Repository, the complete file was split at the teiCorpus level and the TEI level. The metadata from the teiHeader were collected in multiple metadata sets. When the data were ingested into the repository, persistent identifiers (PIDs) were assigned.
A split version of corpus II is currently available from the portal.

Entire corpus
The complete corpus includes only one data file per author. Works, collections, and text items like chapters are covered by TEI elements or teiCorpus-elements.
The entire corpus does not include PIDs. Instead, the identifier for the respective version is given by the name of the author, the position of the TEI or teiCorpus element and the file-generation date (idno type="FileCreationTime").
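
The positional part of such an identifier can be pictured with a short sketch that walks the nested teiCorpus and TEI elements of one author file. The sample file, the function name and the position format (a tuple of 1-based indexes) are assumptions made for illustration; the exact identifier format used in the corpus is not specified here.

```python
import xml.etree.ElementTree as ET

# Assumption: the corpus files use the standard TEI P5 namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical author file: one work plus a collection of two works.
SAMPLE_AUTHOR_FILE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader/>
  <TEI><teiHeader/><text/></TEI>
  <teiCorpus>
    <teiHeader/>
    <TEI><teiHeader/><text/></TEI>
    <TEI><teiHeader/><text/></TEI>
  </teiCorpus>
</teiCorpus>"""


def walk(element, position=()):
    """Yield (position, kind) for every nested teiCorpus and TEI element.

    The position is a tuple of 1-based indexes that counts only teiCorpus
    and TEI children, so headers do not shift the numbering.
    """
    index = 0
    for child in element:
        if child.tag in (TEI + "teiCorpus", TEI + "TEI"):
            index += 1
            child_position = position + (index,)
            yield child_position, child.tag[len(TEI):]
            if child.tag == TEI + "teiCorpus":
                # Collections may contain further collections or works.
                yield from walk(child, child_position)


entries = list(walk(ET.fromstring(SAMPLE_AUTHOR_FILE)))
for position, kind in entries:
    print(".".join(str(i) for i in position), kind)
```

Combined with the author name and the file-generation date, such a position would distinguish each work within a given version of the file.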

Known Bugs
For technical reasons, the PID is not yet delivered in the texts themselves but is available through the metadata; it will be added as soon as possible.
Semantically incorrect display of speaker, lg and closer.

Should you find any errors or defects in the mark-up, please contact thorsten.vitt@uni-wuerzburg.de. Please supply the URL and exact context of the error.

Licensing

Since a publishing company (Editura, operator of zeno.org) digitised texts in the public domain and provided the XML mark-up, the company owns the ancillary copyright to the digitised, compiled and marked-up texts. TextGrid acquired the licence to use this digitised and XML-marked collection of texts on the condition that Editura is mentioned (Creative Commons licence “by” version 3.0).
In order to relay the annotated data stock including the metadata with as few restrictions as possible, TextGrid will also make this data stock available under the Creative Commons licence “by” version 3.0.

The texts as such, i.e. the texts without annotations and without added metadata, are available in the public domain. Texts already in the public domain are not affected by licensing.
TextGrid created a new database by processing and structuring the texts as well as editing the metadata; this database is automatically subject to its own ancillary copyrights in accordance with general copyright regulations. These copyrights are also regulated by the Creative Commons licence “by” version 3.0.
Hence, the data stock of the Digital Library can be:

  • reproduced, distributed and made available to the general public
  • used to adapt and edit the content
  • used commercially

Refer to: http://creativecommons.org/licenses/by/3.0/
In each case, TextGrid must be mentioned in the form: TextGrid.

Should you pass on data from the data stock that are protected, please add the following information: The work [title by name] is a modification of the data stock of TextGrid’s Digital Library, www.editura.de, and is published under the Creative Commons licence.

Procedure

Steps completed so far

  • Structural analysis of the text data: The data is structured in folders according to encyclopaedias/subject areas (history, cultural history, art, literature, fairytales, music, natural sciences, philosophy, and sociology); each folder contains subfolders (generally one subfolder per author which contains all the author’s works in one file).
  • Enriching of the original data (ID, information on the work, structural disambiguation)
  • Extraction of metadata: Metadata on the individual works are located in various files; the information on the digitisation source is stored in an external catalogue file; the information on the time and place of publication is located at the beginning of the author file as unstructured free-form text. All metadata belonging to a certain work are assigned to the respective work via a specific transformation routine.
  • Manual marking of the work level: As the mark-up does not allow an automatic division of the data into individual works, information on the works was added manually (initially for the literature folder, over 120,000 individual works). For this purpose, a user interface for displaying and further processing the data was created.
  • Filtering of the files by text type: For the “literature folder”, individual works had to be sorted according to their text type so that conversion routines specific to each text type could be developed. The existing user interface could be enhanced accordingly.
  • Specifications for the mapping of the text types poetry, prose and drama
  • Development of transformation routines for the mapping of the individual text types in the literature folder to TEI P5
  • Structural transformation from <div> to <teiCorpus>
  • Encoding of metadata that can be extracted automatically in <teiHeader>
  • First adjustment of the data structure to the TextGrid architecture
  • Integration of Adelung’s dictionary and “Meyers Konversationslexikon” (conversation dictionary) into the “Trierer Wörterbuchnetz” (Trier Dictionary Network)
  • Creation of routines for the mapping of Adelung’s dictionary to TEI P5

Further steps

  • Refining of the metadata, development of a user interface for manual correction of metadata
  • Error analysis of the TEI marking and corrections
  • Improvement of the data structure with regard to the TextGrid architecture
  • Additional structural analysis of the texts and more in-depth TEI marking
  • Allocation of persistent identifiers for each work level
  • Application and, if necessary, modification of the transformation routines for the remaining folders and dictionaries in the Digital Library

On funding

This collection of texts was acquired as part of the research project TextGrid (www.textgrid.de, funding code: 01UG1203A) with funds provided by the BMBF (“Bundesministerium für Bildung und Forschung” – German Federal Ministry of Education and Research). We therefore kindly ask you to add this note on funding when providing the data stock for further usage.