1 The TEI Infrastructure

This chapter describes the infrastructure for the encoding scheme defined by these Guidelines. It introduces the conceptual framework within which the following chapters are to be understood, and the means by which that conceptual framework is implemented. It assumes some familiarity with XML and XML schemas (see chapter v. A Gentle Introduction to XML) but is intended to be accessible to any user of these Guidelines. Other chapters supply further technical details, in particular chapter 22 Documentation Elements which describes the XML schema used to express the Guidelines themselves, and chapter 23 Using the TEI which combines a discussion of modification and conformance issues with a description of the intended behaviour of an ODD processor; these chapters should be read by anyone intending to implement a new TEI-based system.

The TEI encoding scheme consists of a number of modules, each of which declares particular XML elements and their attributes. Part of an element's declaration includes its assignment to one or more element classes. Another part defines its possible content and attributes with reference to these classes. This indirection gives the TEI system much of its strength and its flexibility. Elements may be combined more or less freely to form a schema appropriate to a particular set of requirements. It is as easy to add to a schema new elements which reference existing classes or elements as it is to exclude some of the elements provided by any module included in that schema.

In principle, a TEI schema may be constructed using any combination of modules. However, certain TEI modules are of particular importance, and should be included in all but exceptional circumstances: the module tei described in the present chapter is of this kind because it defines classes, macros, and datatypes which are used by all other modules. The core module, defined in chapter 3 Elements Available in All TEI Documents, contains declarations for elements and attributes which are likely to be needed in almost any kind of document, and is therefore recommended for global use. The header module, defined in chapter 2 The TEI Header, provides declarations for the metadata elements and attributes constituting the TEI header, a component which is required for TEI conformance, while the textstructure module, defined in chapter 4 Default Text Structure, declares basic structural elements needed for the encoding of most book-like objects. Most schemas will therefore need to include these four modules.

The specification for a TEI schema is itself a TEI document, using elements from the module described in chapter 22 Documentation Elements: we refer to such a document informally as an ODD document, from the design goal originally formulated for the system: ‘One Document Does it all’. Stylesheets for maintaining and processing ODD documents are maintained by the TEI, and these Guidelines are also maintained as such a document. As further discussed in 23.5 Implementation of an ODD System, an ODD document can be processed to generate a schema expressed using any of the three schema languages currently in wide use: the XML DTD language, the ISO RELAX NG language, or the W3C Schema language, as well as to generate documentation such as the Guidelines and their associated web site.

The bulk of this chapter describes the TEI infrastructure module itself. Although it may be skipped at a first reading, an understanding of the topics addressed here is essential for anyone planning to take full advantage of the TEI customization techniques described in chapter 23.3 Personalization and Customization.

The chapter begins by briefly characterizing each of the modules available in the TEI scheme. Section 1.2 Defining a TEI Schema describes in general terms the method of constructing a TEI schema in a specific schema language such as XML DTD language, RELAX NG, or W3C Schema.

The next and largest part of the chapter introduces the attribute and element classes used to define groups of elements and their characteristics (section 1.3 The TEI Class System).

Finally, section 1.4 Macros introduces the concept of macros, which are used to express some commonly used content models, and lists the datatypes used to constrain the range of legal values for TEI attributes (section 1.4.2 Datatype Macros).

1.1 TEI Modules

These Guidelines define several hundred elements and attributes for marking up documents of any kind. Each definition has the following components:

  • a prose description
  • a formal declaration, expressed using a special-purpose XML vocabulary defined by these Guidelines in combination with elements taken from the ISO schema language RELAX NG
  • usage examples

Each chapter of the Guidelines presents a group of related elements, and also defines a corresponding set of declarations, which we call a module. All the definitions are collected together in the reference sections provided as an appendix. Formal declarations for a given chapter are collected together within the corresponding module. For convenience, each element is assigned to a single module, typically for use in some specific application area, or to support a particular kind of usage. A module is thus simply a convenient way of grouping together a number of associated element declarations. In the simple case, a TEI schema is made by combining together a small number of modules, as further described in section 1.2 Defining a TEI Schema below.

The following table lists the modules defined by the current release of the Guidelines:

Module name     Formal public identifier              Where defined
analysis        Analysis and Interpretation           17 Simple Analytic Mechanisms
certainty       Certainty and Uncertainty             21 Certainty, Precision, and Responsibility
core            Common Core                           3 Elements Available in All TEI Documents
corpus          Metadata for Language Corpora         15 Language Corpora
dictionaries    Print Dictionaries                    9 Dictionaries
drama           Performance Texts                     7 Performance Texts
figures         Tables, Formulae, Figures             14 Tables, Formulæ, Graphics and Notated Music
gaiji           Character and Glyph Documentation     5 Characters, Glyphs, and Writing Modes
header          Common Metadata                       2 The TEI Header
iso-fs          Feature Structures                    18 Feature Structures
linking         Linking, Segmentation, and Alignment  16 Linking, Segmentation, and Alignment
msdescription   Manuscript Description                10 Manuscript Description
namesdates      Names, Dates, People, and Places      13 Names, Dates, People, and Places
nets            Graphs, Networks, and Trees           19 Graphs, Networks, and Trees
spoken          Transcribed Speech                    8 Transcriptions of Speech
tagdocs         Documentation Elements                22 Documentation Elements
tei             TEI Infrastructure                    1 The TEI Infrastructure
textcrit        Text Criticism                        12 Critical Apparatus
textstructure   Default Text Structure                4 Default Text Structure
transcr         Transcription of Primary Sources      11 Representation of Primary Sources
verse           Verse                                 6 Verse

For each module listed above, the corresponding chapter gives a full description of the classes, elements, and macros which it makes available when it is included in a schema. Other chapters of these Guidelines explore other aspects of using the TEI scheme.

1.2 Defining a TEI Schema

To determine that an XML document is valid (as opposed to merely well-formed), its structure must be checked against a schema, as discussed in chapter v. A Gentle Introduction to XML. For a valid TEI document, this schema must be a conformant TEI schema, as further defined in chapter 23.4 Conformance. Local systems may allow their schema to be implicit, but for interchange purposes the schema associated with a document must be made explicit. The method of doing this recommended by these Guidelines is to provide explicitly or by reference a TEI schema specification against which the document may be validated.

A TEI-conformant schema is a specific combination of TEI modules, possibly also including additional declarations that modify the element and attribute declarations contained by each module, for example to suppress or rename some elements. The TEI provides an application-independent way of specifying a TEI schema by means of the schemaSpec element defined in chapter 22 Documentation Elements. The same system may also be used to specify a schema which extends the TEI by adding new elements explicitly, or by reference to other XML vocabularies. In either case, the specification may be processed to generate a formal schema, expressed in a variety of specific schema languages, such as XML DTD language, RELAX NG, or W3C Schema. These output schemas can then be used by an XML processor such as a validator or editor to validate or otherwise process documents. Further information about the processing of a TEI formal specification is given in chapter 23 Using the TEI.

1.2.1 A Simple Customization

The simplest customization of the TEI scheme combines just the four recommended modules mentioned above. In ODD format, this schema specification takes this form:
<schemaSpec ident="TEI-minimal" start="TEI">
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
</schemaSpec>

This schema specification contains references to each of four modules, identified by the key attribute on the moduleRef element. The schema specification itself is also given an identifier (TEI-minimal). An ODD processor will generate an appropriate schema from this set of declarations, expressed using the XML DTD language, the ISO RELAX NG language, the W3C Schema language, or in principle any other adequately powerful schema language. The resulting schema may then be associated with the document instance by one of a number of different mechanisms, as further described in chapter v. A Gentle Introduction to XML. The start point (or root element) of document instances to be validated against the schema is specified by means of the start attribute. Further information about the processing of an ODD specification is given in 23.5 Implementation of an ODD System.
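As a rough illustration of how a tool might read such a specification, the following Python sketch parses the schemaSpec shown above and reports its identifier, its start element, and the modules it references. It is not an ODD processor: the function name summarize and the RECOMMENDED set are assumptions introduced for this example, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# The schema specification from the text (namespace omitted to keep the sketch short).
ODD = """
<schemaSpec ident="TEI-minimal" start="TEI">
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
</schemaSpec>
"""

# The four modules that the text recommends for most schemas.
RECOMMENDED = {"tei", "header", "core", "textstructure"}


def summarize(odd_text):
    """Return the identifier, start element, and module keys of a schemaSpec."""
    spec = ET.fromstring(odd_text)
    modules = [ref.get("key") for ref in spec.iter("moduleRef")]
    return {
        "ident": spec.get("ident"),
        "start": spec.get("start"),
        "modules": modules,
        # Recommended modules that the specification does not reference.
        "missing_recommended": sorted(RECOMMENDED - set(modules)),
    }


summary = summarize(ODD)
print(summary)
```

A check of this kind only inspects the specification; generating an actual schema from it remains the job of an ODD processor.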

1.2.2 A Larger Customization

These Guidelines introduce each of the modules making up the TEI scheme one by one, and therefore, for clarity of exposition, each chapter focusses on elements drawn from a single module. In reality, of course, the markup of a text will draw on elements taken from many different modules, partly because texts are heterogeneous objects, and partly because encoders have different goals. Some examples of this heterogeneity include:

  • a text may be a collection of other texts of different types: for example, an anthology of prose, verse, and drama;
  • a text may contain other smaller, embedded texts: for example, a poem or song included in a prose narrative;
  • some sections of a text may be written in one form, and others in a different form: for example, a novel where some chapters are in prose, others take the form of dictionary entries, and still others the form of scenes in a play;
  • an encoded text may include detailed analytic annotation, for example of rhetorical or linguistic features;
  • an encoded text may combine a literal transcription with a diplomatic edition of the same or different sources;
  • the description of a text may require additional specialized metadata elements, for example when describing manuscript material in detail.

The TEI provides mechanisms to support all of these and many other use cases. The architecture permits elements and attributes from any combination of modules to co-exist within a single schema. Within particular modules, elements and attributes are provided to support differing views of the ‘granularity’ of a text, for example:

  • a definition of a corpus or collection as a series of TEI documents, sharing a common TEI header (see chapter 15 Language Corpora)
  • a definition of composite texts which combine optional front- and back-matter with a group of collected texts, themselves possibly composite (see section 4.3.1 Grouped Texts)
  • an element for the representation of embedded texts, where one narrative appears to ‘float’ within another (see section 4.3.2 Floating Texts)

Subsequent chapters of these Guidelines describe in detail markup constructs appropriate for these and many other possible features of interest. The markup constructs can be combined as needed for any given set of applications or project.

For example, a project aiming to produce an ambitious digital edition of a collection of manuscript materials, to include detailed metadata about each source, digital images of the content, along with a detailed transcription of each source, and a supporting biographical and geographical database might need a schema combining several modules, as follows:
<schemaSpec ident="TEI-PROJECT" start="TEI">
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
 <moduleRef key="msdescription"/>
<!-- manuscript description -->
 <moduleRef key="transcr"/>
<!-- transcription of primary sources -->
 <moduleRef key="figures"/>
<!-- figures and tables -->
 <moduleRef key="namesdates"/>
<!-- names, dates, people, and places -->
</schemaSpec>
Alternatively, a simpler schema might be used for a part of such a project: those preparing the transcriptions, for example, might need only elements from the core, textstructure, and transcr modules, and might therefore prefer to use a simpler schema such as that generated by the following:
<schemaSpec ident="TEI-TRANSCR" start="TEI">
 <moduleRef key="tei"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
 <moduleRef key="transcr"/>
</schemaSpec>

The TEI architecture also supports more detailed customization beyond the simple selection of modules. A schema may suppress elements from a module, suppress some of their attributes, change their names, or even add new elements and attributes. Detailed discussion of the kind of modification possible in this way is provided in 23.3 Personalization and Customization and conformance rules relating to their application are discussed in 23.4 Conformance. These facilities are available for any schema language (though some features may not be available in all languages). The ODD language also makes it possible to combine TEI and non-TEI modules into a single schema, provided that the non-TEI module is expressed using the RELAX NG schema language (see further 22.6 Combining TEI and Non-TEI Modules).

1.3 The TEI Class System

The TEI scheme distinguishes about five hundred different elements. To aid comprehension, modularity, and modification, the majority of these elements are formally classified in some way. Classes are used to express two distinct kinds of commonality among elements. The elements of a class may share some set of attributes, or they may appear in the same locations in a content model. A class is known as an attribute class if its members share attributes, and as a model class if its members appear in the same locations. In either case, an element is said to inherit properties from any classes of which it is a member.

Classes (and therefore elements which are members of those classes) may also inherit properties from other classes. For example, supposing that class A is a member (or a subclass) of class B, any element which is a member of class A will inherit not only the properties defined by class A, but also those defined by class B. In such a situation, we also say that class B is a superclass of class A. The properties of a superclass are inherited by all members of its subclasses.

A basic understanding of the classes into which the TEI scheme is organized is strongly recommended and is essential for any successful customization of the system.

1.3.1 Attribute Classes

An attribute class groups together elements which share some set of common attributes. Attribute classes are given names composed of the prefix att., often followed by an adjective. For example, the members of the class att.canonical have in common a key and a ref attribute, both of which are inherited from their membership in the class rather than individually defined for each element. These attributes are said to be defined by (or inherited from) the att.canonical class. If another element were to be added to the TEI scheme for which these attributes were considered useful, the simplest way to provide them would be to make the new element a member of the att.canonical class. Note also that this method ensures that the attributes in question are always defined in the same way, taking the same default values etc., no matter which element they are attached to.

Some attribute classes are defined within the tei infrastructural module and are thus globally available. Other attribute classes are specific to particular modules and thus defined in other chapters. Attributes defined by such classes will not be available unless the module concerned is included in a schema.

The attributes provided by an attribute class are those specified by the class itself, either directly, or by inheritance from another class. For example, the attribute class att.pointing.group provides attributes domains and targFunc to all of its members. This class is however a subclass of the att.pointing class, from which its members also inherit the attributes target, targetLang and evaluate. Members of the class att.pointing will thus have these three attributes, while members of the class att.pointing.group will have all five.
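The inheritance just described can be modelled in a few lines of Python. This is a minimal sketch, not the way any TEI tool stores its classes: the CLASSES table and the function inherited_attributes are assumptions made for the example, and only the two classes named above are included.

```python
# Each class lists the attributes it defines itself and its direct superclasses.
CLASSES = {
    "att.pointing": {
        "attrs": ["target", "targetLang", "evaluate"],
        "supers": [],
    },
    "att.pointing.group": {
        "attrs": ["domains", "targFunc"],
        "supers": ["att.pointing"],
    },
}


def inherited_attributes(class_name, classes=CLASSES):
    """Return all attributes a class provides, superclass attributes first."""
    result = []
    # Collect attributes from every superclass, recursively.
    for parent in classes[class_name]["supers"]:
        for attr in inherited_attributes(parent, classes):
            if attr not in result:
                result.append(attr)
    # Then add the attributes the class defines directly.
    for attr in classes[class_name]["attrs"]:
        if attr not in result:
            result.append(attr)
    return result


print(inherited_attributes("att.pointing"))
print(inherited_attributes("att.pointing.group"))
```

The first call yields the three attributes of att.pointing; the second yields all five, because att.pointing.group adds its own two to those it inherits.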

Note that some modules define superclasses of an existing infrastructural class. For example, the global attribute class att.divLike makes attributes org and sample available, while the att.metrical class, which is specific to the verse module, provides attributes met, real, and rhyme. Because att.metrical is defined as a superclass of att.divLike, all five of these attributes are available to elements; the declaration for att.metrical adds its three attributes to the two already defined by att.divLike when the verse module is included in a schema. If, however, this module is not included in a schema, then the att.divLike class supplies only the two attributes first mentioned.

Attributes specific to particular modules are documented along with the relevant module rather than in the present chapter. One particular attribute class, known as att.global, is common to all modules, and is therefore described in some detail in the next section. A full list of all attribute classes is given in Appendix B Attribute Classes below.

1.3.1.1 Global Attributes

The following attributes are defined for every TEI element.

  • att.global provides attributes common to all elements in the TEI encoding scheme.
    xml:id (identifier) provides a unique identifier for the element bearing the attribute.
    n (number) gives a number (or other label) for an element, which is not necessarily unique within the document.
    xml:lang (language) indicates the language of the element content using a ‘tag’ generated according to BCP 47.
    rend [att.global.rendition] (rendition) indicates how the element in question was rendered or presented in the source text.
    style [att.global.rendition] contains an expression in some formal style definition language which defines the rendering or presentation used for this element in the source text.
    rendition [att.global.rendition] points to a description of the rendering or presentation used for this element in the source text.
    xml:base provides a base URI reference with which applications can resolve relative URI references into absolute URI references.
    xml:space signals an intention about how white space should be managed by applications.

These attributes are optionally available for any TEI element; none of them is required. Their usage is discussed in the following subsections.

1.3.1.1.1 Element Identifiers and Labels

The value supplied for the xml:id attribute must be a legal name, as defined in the World Wide Web Consortium's XML Recommendation. This means that it must begin with a letter, or the underscore character (‘_’), and contain no characters other than letters, digits, hyphens, underscores, full stops, and certain combining and extension characters.1

In XML names (and thus the values of xml:id in an XML TEI document) uppercase and lowercase letters are distinguished, and thus partTime and parttime are two distinctly different names, and could (though perhaps unwisely) be used to denote two different element occurrences.
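The naming rules above can be approximated with a simple pattern. The Python sketch below is deliberately restricted to ASCII: it rejects identifiers containing the non-ASCII letters and the combining and extension characters that the XML Recommendation also permits, so it is a conservative check rather than a full implementation. The names ASCII_NAME and looks_like_valid_id are assumptions made for this example.

```python
import re

# ASCII-only approximation: a letter or underscore, followed by letters,
# digits, full stops, underscores, or hyphens. Real XML names allow more.
ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def looks_like_valid_id(value):
    """Return True if value matches the ASCII approximation of an XML name."""
    return ASCII_NAME.fullmatch(value) is not None


for candidate in ["partTime", "parttime", "_x-1.a", "1stPage", "my id"]:
    print(candidate, looks_like_valid_id(candidate))

# Names are case-sensitive, so these two identifiers are distinct.
print("partTime" == "parttime")
```

The last line prints False, reflecting the point that partTime and parttime are different names.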

If two elements are given the same identifier, a validating XML parser will signal a validity error. The following example, therefore, is not valid:
<p xml:id="PAGE1">
 <q>What's it going to be then, eh?</q>
</p>
<p xml:id="PAGE1">There was me, that is Alex, and my three droogs, that is Pete,
 Georgie, and Dim, ... </p>
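A non-validating parser will not report repeated identifiers, so a project may wish to check for them itself. The following Python sketch, which uses the standard library's ElementTree (a non-validating parser), lists any xml:id value that occurs more than once. The function name duplicate_ids and the wrapping body element in the sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_ids(xml_text):
    """Return a sorted list of xml:id values used on more than one element."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)


SAMPLE = """<body>
 <p xml:id="PAGE1"><q>What's it going to be then, eh?</q></p>
 <p xml:id="PAGE1">There was me, that is Alex, ...</p>
 <p xml:id="PAGE2">...</p>
</body>"""

print(duplicate_ids(SAMPLE))
```

For the sample above the list contains only PAGE1, the identifier that is used twice.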

For a discussion of methods of providing unique identifiers for elements, see section 3.10.2 Creating New Reference Systems.

The n attribute also provides an identifying name or number for an element, but in this case the information need not be a legal xml:id value. Its value may be any string of characters; typically it is a number or other similar enumerator or label. For example, the numbers given to the items of a numbered list may be recorded with the n attribute; this would make it possible to record errors in the numeration of the original, as in this list of chapters, transcribed from a faulty original in which the number 10 is used twice, and 11 is omitted:
<list rend="numbered">
 <item n="1">About These Guidelines</item>
 <item n="2">A Gentle Introduction to XML</item>
 <item n="9">Verse</item>
 <item n="10">Drama</item>
 <item n="10">Spoken Materials </item>
 <item n="12">Dictionaries</item>
</list>
The n attribute may also be used to record non-unique names associated with elements in a text, possibly together with a unique identifier as in the following example:
<div type="book" n="one" xml:id="TXT0101">
<!-- ... -->
 <div type="stanza" n="xlii">
<!-- ... -->
 </div>
</div>
As noted above there is no requirement to record a value for either the xml:id or the n attribute. Any XML processor can identify the sequential position of one element within another in an XML document without any additional tagging. An encoding in which each line of a long poem is explicitly labelled with its numerical sequence such as the following
<l n="1">
<!-- ... -->
</l>
<l n="2">
<!-- ... -->
</l>
<l n="3">
<!-- ... -->
</l>
<!-- ... -->
<l n="100">
<!-- ... -->
</l>
is therefore probably redundant.

1.3.1.1.2 Language Indicators

The xml:lang attribute indicates the natural language and writing system applicable to the content of a given element. If it is not specified, the value is inherited from that of the immediately enclosing element. As a rule, therefore, it is simplest to specify the base language of the text on the TEI element, and allow most elements to take the default value for xml:lang; the language of an element then need be explicitly specified only for elements in languages other than the base language. For this reason, it is recommended practice to supply a default value for the xml:lang attribute, either on the TEI root element, or on both the teiHeader and the text element. The latter is appropriate in the not uncommon case where the text element in a TEI document uses a different default language from that of the TEI header attached to it. Other language shifts in the source should be explicitly identified by use of the xml:lang attribute on an element at an appropriate level wherever possible.

In the following schematic example, an English language TEI header is attached to an English language text:
<TEI xml:lang="en" xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
<!-- ... -->
 </teiHeader>
 <text>
<!-- ... -->
 </text>
</TEI>
The same effect would be obtained by specifying the default language for both header and text:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader xml:lang="en">
<!-- ... -->
 </teiHeader>
 <text xml:lang="en">
<!-- ... -->
 </text>
</TEI>
The latter approach is necessary in the case where the two differ: for example, where an English language header is applied to a French text:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader xml:lang="en">
<!-- ... -->
 </teiHeader>
 <text xml:lang="fr">
<!-- ... -->
 </text>
</TEI>
The same principle applies at any hierarchic level. In the following example, the default language of the text is French, but one section of it is in German:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader xml:lang="en">
<!-- ... -->
 </teiHeader>
 <text xml:lang="fr">
  <body>
   <div>
<!-- chapter one is in French -->
   </div>
   <div xml:lang="de">
<!-- chapter two is in German -->
   </div>
   <div>
<!-- chapter three is French -->
   </div>
<!-- ... -->
  </body>
 </text>
</TEI>
Similarly, in the following example the xml:lang attribute on the term element allows us to record the fact that the technical terms used are Latin rather than English; no xml:lang attribute is needed on the q element, by contrast, because it is in the same language as its parent.
<p xml:lang="en">The
constitution declares <q>that no bill of attainder or <term xml:lang="la">ex post
     facto</term> law shall be passed.</q> ...</p>
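The inheritance rule for xml:lang can be made concrete with a short Python sketch that computes the language in force for every element by passing the nearest specified value down the tree. The function name effective_languages is an assumption made for this example, and the sample repeats the paragraph quoted above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(root):
    """Map each element to the xml:lang value in force for it (or None)."""
    langs = {}

    def walk(element, inherited):
        # An element's own xml:lang overrides the value inherited from its parent.
        lang = element.get(XML_LANG, inherited)
        langs[element] = lang
        for child in element:
            walk(child, lang)

    walk(root, None)
    return langs


SAMPLE = """<p xml:lang="en">The constitution declares <q>that no bill of
attainder or <term xml:lang="la">ex post facto</term> law shall be
passed.</q> ...</p>"""

root = ET.fromstring(SAMPLE)
langs = effective_languages(root)
for element in root.iter():
    print(element.tag, langs[element])
```

Here q receives en from its parent p, while term keeps its own value la.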
Note that in cases where it is advisable or necessary to identify the language of the text that is pointed at, the (non-global) attribute targetLang should be used, for example in
<ptr target="x12" targetLang="fr"/>
the pointer references text written in French.

The values used for the xml:lang and targetLang attributes must be constructed in a particular way, using values from standard lists. See further vi.1. Language Identification.

Additional information about a particular language may be supplied in the language element within the header (see section 2.4.2 Language Usage).

1.3.1.1.3 Rendition Indicators

The rend, rendition, and style attributes are all used to give information about the physical presentation of the text in the source. In the following example, rend is used to indicate that both the emphasized word and the proper name are printed in italics:
<p> ... Their motives <emph rend="italics">might</emph> be pure
and pious; but he was equally alarmed by his knowledge of the ambitious <name rend="italics">Bohemond</name>, and his ignorance of the Transalpine chiefs:
...</p>
The same effect might be achieved using the style attribute, as follows:
<p> ... Their motives <emph style="font-style: italic">might</emph> be pure and pious; but he was equally alarmed by his knowledge of
the ambitious <name style="font-style: italic">Bohemond</name>, and his ignorance of
the Transalpine chiefs: ...</p>
If all or most emph and name elements are rendered in the text by italics, it will be more convenient to register that fact in the TEI header once and for all (using the rendition element discussed below) and specify a rend or style value only for any elements which deviate from the stated rendition.

The main difference between the rend attribute and the style attribute is that the value used for the former may contain one or more tokens from any vocabulary devised by the encoder, separated by space characters, whereas the value used for the latter must be a single string taken from a formally-defined style definition language such as CSS. The values of the rend attribute form a sequence-indeterminate set of whitespace-separated tokens, whereas style values allow whitespace and sequence relationships as part of the formally-defined style definition language.

The rendition element defined in 2.3.4 The Tagging Declaration may be used to hold repeatedly-used format descriptions. A rendition element can then be associated with any element, either by default, or by means of the global rendition attribute. For example:
<tagsDecl>
<!-- define italic style using CSS -->
 <rendition xml:id="IT" scheme="css">font-style: italic</rendition>
<!-- define a serif font family -->
 <rendition xml:id="FontRoman" scheme="css">font-family: serif</rendition>
<!-- set italic style as default for the emph and hi elements -->
 <namespace name="http://www.tei-c.org/ns/1.0">
  <tagUsage gi="emph" render="#IT"/>
  <tagUsage gi="hi" render="#IT"/>
<!-- set the default font-family for the text element -->
  <tagUsage gi="text" render="#FontRoman"/>
 </namespace>
</tagsDecl>
<!-- ... -->
<text>
 <body>
  <div>
   <p rendition="#IT">
<!-- this paragraph uses the seriffed font, but is in italic-->
   </p>
   <p>
<!-- this paragraph uses the seriffed font, but is not in italic -->
   </p>
  </div>
 </body>
</text>

The rendition attribute always points to one or more rendition elements, each of which defines some aspect of the rendering or appearance of the text in its original form. These details may most conveniently be described using a formal style definition language, such as CSS (Lie and Bos (eds.) (1999)) or XSL-FO (Berglund (ed.) (2006)); in some other formal language developed for a specific project; or even informally in running prose. Although languages such as CSS and XSL-FO are generally used to describe document output to screen or print, they nonetheless provide formal and precise mechanisms for describing the appearance of source documents, especially print documents, but also many aspects of manuscript documents. For example, both CSS and XSL-FO provide mechanisms for describing typefaces, weight, and styles; character and line spacing; and so on.

As noted above, the style attribute is provided for encoders wishing to describe the appearance of individual source elements using a language such as CSS directly rather than by reference to a rendition element. Its value may be any expression in the chosen formal style definition language.

Formal definition languages such as CSS typically identify a series of properties (such as font-style or margin-left) for which values are specified. A sequence of such property-value pairs makes up a stylesheet. The TEI uses such languages simply to describe the appearance of a source document, rather than to control how it should be formatted.

In the TEI scheme, it is possible to supply information about the appearance of elements within a source document in the following distinct ways:

  1. One or more properties may be specified as the default for all elements of a given type, using the render attribute to point to rendition elements;
  2. One or more properties may be specified for individual element occurrences, using the rend attribute with any convenient set of one or more sequence-indeterminate tokens;
  3. One or more properties may be specified for individual element occurrences, using the rendition attribute to point to rendition elements;
  4. One or more properties may be supplied explicitly for individual element occurrences, using the style attribute.

If the same property is specified in more than one of the above ways, the one with the highest number in the list above is understood to be applicable. The resulting properties from each way are then combined to provide the full set of property-value pairs applicable to the given element, and (by default) to all of its children.
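The precedence rule can be sketched in Python as four successive updates to a dictionary of properties, applied in the order of the numbered list so that later sources override earlier ones. This is an illustration under stated assumptions, not a prescribed algorithm: the function names are invented for the example, all values are assumed to be simple CSS declarations, and because rend tokens have no formal meaning the caller must supply a project-specific mapping (rend_vocabulary) from tokens to declarations.

```python
def parse_declarations(css_text):
    """Split 'name: value; name: value' into a dictionary of properties."""
    props = {}
    for declaration in css_text.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip()] = value.strip()
    return props


def effective_properties(default_css="", rend_tokens="", rendition_css="",
                         style="", rend_vocabulary=None):
    """Combine the four sources; a higher-numbered source wins on conflict."""
    rend_vocabulary = rend_vocabulary or {}
    props = {}
    # 1. defaults for the element type (render pointing to rendition elements)
    props.update(parse_declarations(default_css))
    # 2. rend tokens, translated through a hypothetical project vocabulary
    for token in rend_tokens.split():
        props.update(parse_declarations(rend_vocabulary.get(token, "")))
    # 3. rendition elements pointed to by the rendition attribute
    props.update(parse_declarations(rendition_css))
    # 4. the style attribute on the element occurrence itself
    props.update(parse_declarations(style))
    return props


result = effective_properties(
    default_css="font-family: serif",
    rend_tokens="italics",
    rendition_css="font-style: italic; font-weight: bold",
    style="font-style: normal",
    rend_vocabulary={"italics": "font-style: italic"},
)
print(result)
```

In this run font-style ends up as normal, because the style attribute has the highest number in the list, while font-family and font-weight survive from the lower-numbered sources.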

For simplicity of processing, the same formal style definition should be used throughout; however, the architecture does permit this to be varied, by using the scheme attribute to indicate a different language for one or more rendition elements. Care should be taken to ensure that such values can be meaningfully combined. Similar considerations apply to the use of the rend attribute, if this is used in combination with either rendition or style.

Note that these TEI attributes always describe the rendition or appearance of the source document, not intended output renditions, although often the two may be closely related.

1.3.1.1.4 Evaluation of Links

Several TEI elements carry attributes whose values are defined as anyURI, meaning that such attributes supply a link or pointer, typically expressed as a URL. Like other XML applications, the TEI allows use of a special attribute to set the context within which relative URLs are to be evaluated. The global attribute xml:base is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. We do not describe it in detail here: reference information about xml:base is provided by Marsh (ed.) (2001).

In essence xml:base is used to set a context for all relative URLs within the scope of the element on which it is specified. For example:
<body>
 <div xml:base="http://www.example.org/somewhere.xml">
  <p>
<!--... -->
   <ptr target="#p1"/>
<!--... -->
  </p>
 </div>
 <div>
  <p>
<!--... -->
   <ptr target="#p1"/>
<!--... -->
  </p>
 </div>
</body>
The first ptr element here is within the scope of a div which supplies a value for xml:base; its target is therefore to be found at http://www.example.org/somewhere.xml#p1. The second ptr, however, is within the scope of a div which does not change the default context, and its target is therefore some element within the current document with the value p1 for its xml:id attribute. Further discussion of this attribute and its effect on TEI linking methods is provided in chapter 16 Linking, Segmentation, and Alignment.
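
The scoping behaviour just described can be sketched with the Python standard library. This is a simplified illustration, not a complete xml:base implementation: the function name resolve_targets and the document URI http://example.com/current.xml are hypothetical, and the sketch assumes that each target attribute holds a single pointer.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the XML namespace URI.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def resolve_targets(elem, base):
    """Return absolute URIs for every target attribute at or below elem."""
    # An xml:base on this element sets the context for its whole subtree.
    base = urljoin(base, elem.get(XML_BASE, ""))
    found = []
    if "target" in elem.attrib:
        found.append(urljoin(base, elem.get("target")))
    for child in elem:
        found.extend(resolve_targets(child, base))
    return found


SAMPLE = """<body>
 <div xml:base="http://www.example.org/somewhere.xml"><p><ptr target="#p1"/></p></div>
 <div><p><ptr target="#p1"/></p></div>
</body>"""

# The second argument stands for the (hypothetical) URI of the current document.
uris = resolve_targets(ET.fromstring(SAMPLE), "http://example.com/current.xml")
```

The first pointer resolves against the xml:base of its div, while the second falls back to the URI of the current document.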
1.3.1.1.5 XML Whitespace

The global attribute xml:space provides a mechanism for indicating to systems processing an XML file how they should treat whitespace, that is, any sequences of consecutive tab (#x09), space (#x20), carriage return (#x0D) or linefeed (#x0A) characters. Like xml:id this attribute is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. Complete information about this attribute is provided by section 2.10 of the XML Specification; here we provide a summary of how its use affects users of the TEI scheme.

The xml:space attribute has only two permitted values: preserve and default. The first indicates that whitespace in a text node—every carriage return, every tab, etc.—should be maintained as is when the document is processed. The second (which is implied when the attribute is not supplied) indicates that whitespace should be handled ‘as appropriate’. Exactly what is deemed appropriate is left unspecified by the XML Recommendation.

These Guidelines assume one of two different ways of processing whitespace will apply in a given case, depending on an element's content model. For an element that can contain only other elements with no intervening non-whitespace characters, whitespace is considered to have no semantic significance, and should therefore be discarded by a processor. For example, in a choice element, such as
<choice>
 <sic>1724</sic>
 <corr>1728</corr>
</choice>
since non-whitespace text is not permitted between the choice start-tag and the sic tags or between the sic and corr tags, any whitespace found there has no significance and can be ignored completely by a processor.

Similarly, the address element has a content model containing only elements: any punctuation or whitespace required between the lines of an address must therefore be supplied by the processor, as any whitespace present in the input document will be ignored.

Elements with content models of this type are comparatively unusual in the TEI: a list of them is provided in the TEI release file stripspace.xsl.model, formatted there for use as an <xsl:strip-space> command for XSL stylesheets.
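
For processors not written in XSL, the same effect can be approximated as follows. This is a minimal sketch: the set ELEMENT_ONLY names only the two elements mentioned above and is not the actual content of stripspace.xsl.model, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; a real processor would load the full list.
ELEMENT_ONLY = {"choice", "address"}
XML_WHITESPACE = " \t\r\n"


def strip_insignificant_whitespace(elem):
    """Discard whitespace-only text directly inside element-only elements."""
    if elem.tag in ELEMENT_ONLY:
        if elem.text is not None and not elem.text.strip(XML_WHITESPACE):
            elem.text = None
        for child in elem:
            if child.tail is not None and not child.tail.strip(XML_WHITESPACE):
                child.tail = None
    for child in elem:
        strip_insignificant_whitespace(child)


choice = ET.fromstring("<choice>\n <sic>1724</sic>\n <corr>1728</corr>\n</choice>")
strip_insignificant_whitespace(choice)
serialized = ET.tostring(choice, encoding="unicode")
```

After processing, the serialized choice element contains only its sic and corr children, with no whitespace between the tags.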

Most TEI elements permit what is known as mixed content: that is, they can contain both text and other elements. Here the assumption of these Guidelines is that whitespace will be normalized. This means that all space, carriage return, linefeed, and tab characters are converted into spaces, each run of consecutive spaces is then replaced by a single space, and then any space immediately after a start-tag or immediately before an end-tag is deleted. The result is that this encoding,
<persName>
 <forename>Edward</forename>
 <forename>George</forename>
 <surname type="linked">Bulwer-Lytton</surname>,
<roleName>Baron Lytton of
 <placeName>Knebworth</placeName>
 </roleName>
</persName>
would be rendered as ‘Edward George Bulwer-Lytton, Baron Lytton of Knebworth’. The space before his name has been removed, a space is included between his forenames, the comma is preserved, and the newlines within his name have all been removed.
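
The normalization steps described above can be sketched in Python using the standard library. This is an illustration under stated assumptions, not a definitive TEI processor: the function name normalize_mixed_content is hypothetical, and the sketch treats every element in the fragment as having mixed content.

```python
import re
import xml.etree.ElementTree as ET

_WS_RUN = re.compile(r"[ \t\r\n]+")


def normalize_mixed_content(elem):
    """Normalize whitespace in place, assuming mixed content throughout."""
    children = list(elem)
    if elem.text:
        text = _WS_RUN.sub(" ", elem.text)  # convert and collapse whitespace
        text = text.lstrip(" ")             # space right after the start-tag
        if not children:
            text = text.rstrip(" ")         # space right before the end-tag
        elem.text = text
    for index, child in enumerate(children):
        normalize_mixed_content(child)
        if child.tail:
            tail = _WS_RUN.sub(" ", child.tail)
            if index == len(children) - 1:
                tail = tail.rstrip(" ")     # space before the parent's end-tag
            child.tail = tail


SAMPLE = """<persName>
 <forename>Edward</forename>
 <forename>George</forename>
 <surname type="linked">Bulwer-Lytton</surname>,
<roleName>Baron Lytton of
 <placeName>Knebworth</placeName>
 </roleName>
</persName>"""

name = ET.fromstring(SAMPLE)
normalize_mixed_content(name)
rendered = "".join(name.itertext())
```

Applied to the persName example, rendered holds the single-line form quoted above.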

If the default treatment described above is not appropriate for a mixed content element, the processing required may be described in the encodingDesc element of the TEI header, but generic XML processing tools may not take note of this.

Alternatively, the xml:space attribute may be supplied with a value of preserve in order to indicate that every space, tab, carriage return and linefeed character found within that element in the document being processed is significant. Typically, the result of that processing will be to retain the whitespace characters in the output. Thus if the above example began <persName xml:space="preserve">, the resulting text would most likely be rendered over five lines, indented, and with a blank line following.

The xml:space="preserve" attribute is rarely used in TEI documents because such layout features are generally captured with less risk and more precision by using native TEI elements such as lb or space, or by using the renditional attributes described in section 1.3.1.1.3 Rendition Indicators.

1.3.2 Model Classes

As noted above, the members of a given TEI model class share the property that they can all appear in the same location within a document. Wherever possible, the content model of a TEI element is expressed not directly in terms of specific elements, but indirectly in terms of particular model classes. This makes content models simpler and more consistent; it also makes them much easier to understand and to modify.

Like attribute classes, model classes may have subclasses or superclasses. Just as elements inherit from a class the ability to appear in certain locations of a document (wherever the class can appear), so all members of a subclass inherit the ability to appear wherever any superclass can appear. To some extent, the class system thus provides a way of reducing the whole TEI galaxy of elements into a tidy hierarchy. This is however not entirely the case.

In fact, the nature of a given class of elements can be considered along two dimensions: as noted, it defines a set of places where the class members are permitted within the document hierarchy; it also implies a semantic grouping of some kind. For example, the very large class of elements which can appear within a paragraph comprises a number of other classes, all of which have the same structural property, but which differ in their field of application. Some are related to highlighting, while others relate to names or places, and so on. In some cases, the ‘set of places where class members are permitted’ is very constrained: it may just be within one specific element, or one class of element, for example. In other cases, elements may be permitted to appear in very many places, or in more than one such set of places.

These factors are reflected in the way that model classes are named. If a model class has a name containing Part, such as model.divPart or model.biblPart, then it is primarily defined in terms of its structural location. For example, those elements (or classes of element) which appear as content of a div constitute the model.divPart class; those which appear as content of a bibl constitute the model.biblPart class. If, however, a model class has a name containing Like, such as model.biblLike or model.nameLike, the implication is that its members all have some additional semantic property in common, for example containing a bibliographic description, or containing some form of name, respectively. These semantically-motivated classes often provide a useful way of dividing up large structurally-motivated classes: for example, the very general structural class model.pPart.data (‘data elements that form part of a paragraph’) has four semantically-motivated member classes (model.addressLike, model.dateLike, model.measureLike, and model.nameLike), the last of these being itself a superclass with several members.

Although most classes are defined by the tei infrastructure module, a class cannot be populated unless some other specific module is included in a schema, since element declarations are contained by modules. Classes are not declared ‘top down’, but instead gain their members as a consequence of individual elements' declaration of their membership. The same class may therefore contain different members, depending on which modules are active. Consequently, the content model of a given element (being expressed in terms of model classes) may differ depending on which modules are active.

Some classes contain only a single member, even when all modules are loaded. One reason for declaring such a class is to make it easier for a customization to add new member elements in a specific place, particularly in areas where the TEI does not make fully elaborated proposals. For example, the TEI class model.rdgLike, initially empty, is expanded by the textcrit module to include just the TEI rdg element. A project wishing to add an alternative way of structuring text-critical information could do so by defining its own elements and adding them to this class.

Another reason for declaring single-member classes is where the class members are not needed in all documents, but appear in the same place as elements which are very frequently required. For example, the specialized element g used to represent a non-Unicode character or glyph is provided as the only member of the model.gLike class when the gaiji module is added to a schema. References to this class are included in almost every content model, since if it is used at all the g must be available wherever text is available; however these references have no effect unless the gaiji module is loaded.
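
The way classes gain their members from whichever modules are active can be sketched as follows. This is a toy model rather than the ODD mechanism itself: the dictionary MODULE_DECLARATIONS and the function populate_classes are hypothetical, and only the two memberships mentioned above (rdg in model.rdgLike, g in model.gLike) are represented.

```python
# Each module lists its elements together with the classes they declare
# membership of (illustrative subset only).
MODULE_DECLARATIONS = {
    "textcrit": {"rdg": ["model.rdgLike"]},
    "gaiji": {"g": ["model.gLike"]},
}


def populate_classes(active_modules):
    """Return class membership for a schema built from the given modules."""
    # The classes exist from the start, but with no members.
    classes = {"model.rdgLike": set(), "model.gLike": set()}
    for module in active_modules:
        for element, memberships in MODULE_DECLARATIONS[module].items():
            for class_name in memberships:
                classes.setdefault(class_name, set()).add(element)
    return classes


without_gaiji = populate_classes(["textcrit"])
with_gaiji = populate_classes(["textcrit", "gaiji"])
```

In the first schema model.gLike stays empty, so references to it in content models have no effect; in the second it contains g.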

At the other end of the scale, a few of the classes predefined by the tei module are subsequently populated with very many members. For example, the class model.pPart.edit groups all the classes of element for simple editorial correction and transcription which can appear within a p or paragraph element. The core module alone adds more than fifty elements to this class; the namesdates module adds another twenty, as does the tagdocs module. Since the p element is one of the basic building blocks of a TEI document it is not surprising that each module will need to add elements to it. The class system here provides a very convenient way of controlling the resulting complexity. Typically, elements are not added directly to these very general classes, but via some intermediate semantically-motivated class.

Just as there are a few classes which have a single member, so there are some classes which are used only once in the TEI architecture. These classes, which have no superclass and therefore do not fit into the class hierarchy defined here, are a convenient way of maintaining elements which are highly structured internally, but which appear from the outside to be uniform objects like others at the same level.[2] Members of such classes can only ever appear within one element, or one class of elements. For example, the class model.addrPart is used only to express the content model for the element address; it references some other classes of elements, which can appear elsewhere, and also some elements which can only appear inside an address.

1.3.2.1 Informal Element Classifications

Most TEI elements may also be informally classified as belonging to one of the following groupings:

divisions
high level, possibly self-nesting, major divisions of texts. These elements populate such classes as model.divLike or model.div1Like, and typically form the largest component units of a text.
chunks
elements such as paragraphs and other paragraph-level elements, which can appear directly within texts or within divisions of them, but not (usually) within other chunks. These elements populate the class model.divPart, either directly or by means of other classes such as model.pLike (paragraph-like elements), model.entryLike, etc.
phrase-level elements
elements such as highlighted phrases, book titles, or editorial corrections which can occur only within chunks, but not between them (and thus cannot appear directly within a division). These elements populate the class model.phrase.[3]

The TEI also identifies two further groupings derived from these three:

inter-level elements
elements such as lists, notes, quotations, etc. which can appear either between chunks (as children of a div) or within them; these elements populate the class model.inter. Note that this class is not a superset of the model.phrase and model.divPart classes but rather a distinct grouping of elements which are both chunk-like and phrase-like. However, the classes model.phrase, model.pLike, and model.inter are all disjoint.
components
elements which can appear directly within texts or text divisions; this is a combination of the inter- and chunk-level elements defined above. These elements populate the class model.common, which is defined as a superset of the classes model.divPart, model.inter, and (when the dictionary module is included in a schema) model.entryLike.

Broadly speaking, the front, body, and back of a text each comprises a series of components, optionally grouped into divisions.

As noted above, some elements do not belong to any model class, and some model classes are not readily associated with any of the above informal groupings. However, over two-thirds of the 551 elements defined in the present edition of these Guidelines are classified in this way, and future editions of these recommendations will extend and develop this classification scheme.

A complete alphabetical list of all model classes is provided in Appendix A Model Classes.

1.4 Macros

The infrastructure module defined by this chapter also declares a number of macros, or shortcut names for frequently occurring parts of other declarations. Macros are used in two ways in the TEI scheme: to stand for frequently-encountered content models, or parts of content models (1.4.1 Standard Content Models); and to stand for attribute datatypes (1.4.2 Datatype Macros).

1.4.1 Standard Content Models

As far as possible, the TEI schemas use the following set of frequently-encountered content models to help achieve consistency among different elements.

  • macro.paraContent (paragraph content) defines the content of paragraphs and similar elements.
  • macro.limitedContent (paragraph content) defines the content of prose elements that are not used for transcription of extant materials.
  • macro.phraseSeq (phrase sequence) defines a sequence of character data and phrase-level elements.
  • macro.phraseSeq.limited (limited phrase sequence) defines a sequence of character data and those phrase-level elements that are not typically used for transcribing extant documents.
  • macro.schemaPattern provides a pattern to match elements from the chosen schema language.
  • macro.specialPara ('special' paragraph content) defines the content model of elements such as notes or list items, which either contain a series of component-level elements or else have the same structure as a paragraph, containing a series of phrase-level and inter-level elements.
  • macro.xtext (extended text) defines a sequence of character data and gaiji elements.

The present version of the TEI Guidelines includes some 551 different elements. Table 3 shows, in descending order of frequency, the seven most commonly used content models.

Content model | Number of elements using this | Description
macro.phraseSeq | 82 | defines a sequence of character data and phrase-level elements.
macro.paraContent | 53 | defines the content of paragraphs and similar elements.
macro.specialPara | 32 | defines the content model of elements such as notes or list items, which either contain a series of component-level elements or else have the same structure as a paragraph, containing a series of phrase-level and inter-level elements.
macro.phraseSeq.limited | 22 | defines a sequence of character data and those phrase-level elements that are not typically used for transcribing extant documents.
macro.xtext | 15 | defines a sequence of character data and gaiji elements.
macro.limitedContent | 8 | defines the content of prose elements that are not used for transcription of extant materials.
macro.anyXML | 3 | defines a content model within which any XML elements are permitted.

1.4.2 Datatype Macros

The values which attributes may take in a TEI schema are defined, for the most part, by reference to a TEI datatype. Each such datatype is defined in terms of other primitive datatypes, derived mostly from W3C Schema Datatypes, literal values, or other datatypes. This indirection makes it possible for a TEI application to set constraints either globally or in individual cases, by redefining the datatype definition or the reference to it respectively. In some cases, the TEI datatype includes additional usage constraints which cannot be enforced by existing schema languages, although a TEI-compliant processor should attempt to validate them (see further discussion in chapter 23.4 Conformance).

Where literal values or name tokens are used in a datatype definition, an associated value list supplies definitions for the significance of suggested or (in the case of closed lists) all possible values.

TEI-defined datatypes may be grouped into those which define normalized values for numeric quantities, probabilities, or temporal expressions, those which define various kinds of shorthand codes or keys, and those which define pointers or links.

The following datatypes are used for attributes which are intended to hold normalized values of various kinds. First, expressions of quantity or probability:

  • data.certainty defines the range of attribute values expressing a degree of certainty.
  • data.probability defines the range of attribute values expressing a probability.
  • data.numeric defines the range of attribute values used for numeric values.
  • data.interval defines attribute values used to express an interval value.
  • data.percentage defines attribute values used to express a percentage value.
  • data.count defines the range of attribute values used for a non-negative integer value used as a count.

Examples of attributes using the data.probability datatype include degree on damage or certainty; examples of data.numeric include quantity on members of the att.measurement class or value on numeric; examples of data.count include cols on cell and table.

Next, the datatypes used for attributes which are intended to hold normalized dates or times, durations, or truth values:

  • data.duration.w3c defines the range of attribute values available for representation of a duration in time using W3C datatypes.
  • data.duration.iso defines the range of attribute values available for representation of a duration in time using ISO 8601 standard formats.
  • data.temporal.w3c defines the range of attribute values expressing a temporal expression such as a date, a time, or a combination of them, that conform to the W3C XML Schema Part 2: Datatypes Second Edition specification.
  • data.temporal.iso defines the range of attribute values expressing a temporal expression such as a date, a time, or a combination of them, that conform to the international standard Data elements and interchange formats – Information interchange – Representation of dates and times.
  • data.truthValue defines the range of attribute values used to express a truth value.
  • data.xTruthValue (extended truth value) defines the range of attribute values used to express a truth value which may be unknown.
  • data.language defines the range of attribute values used to identify a particular combination of human language and writing system.

Note that in each of these cases the values used are those recommended by existing international standards: ISO 8601 as profiled by XML Schema Part 2: Datatypes Second Edition in the case of durations, times, and dates; W3C Schema datatypes in the case of truth values; BCP 47 in the case of language; and ISO 5218 in the case of sex.

The following datatypes have more specialized uses:

  • data.namespace defines the range of attribute values used to indicate XML namespaces as defined by the W3C Namespaces in XML Technical Recommendation.
  • data.outputMeasurement defines a range of values for use in specifying the size of an object that is intended for display.
  • data.pattern (regular expression pattern) defines attribute values which are expressed as a regular expression.
  • data.point defines the data type used to express a point in cartesian space.
  • data.pointer defines the range of attribute values used to provide a single URI, absolute or relative, pointing to some other resource, either within the current document or elsewhere.
  • data.version defines the range of attribute values which may be used to specify a TEI or Unicode version number.
  • data.versionNumber defines the range of attribute values used for version numbers.
  • data.replacement defines attribute values which contain a replacement template.
  • data.xpath defines attribute values which contain an XPath expression.

By far the largest number of TEI attributes take values which are coded values or names of some kind. These values may be constrained or defined in a number of different ways, each of which is given a different name, as follows:

  • data.word defines the range of attribute values expressed as a single word or token.
  • data.text defines the range of attribute values used to express some kind of identifying string as a single sequence of Unicode characters possibly including whitespace.
  • data.name defines the range of attribute values expressed as an XML Name.
  • data.enumerated defines the range of attribute values expressed as a single XML name taken from a list of documented possibilities.
  • data.sex defines the range of attribute values used to identify human or animal sex.
  • data.xmlName defines attribute values which contain an XML name.

Attributes of type data.word, such as age on person, are used to supply an identifier expressed as any kind of single token or word. The TEI places a few constraints on the characters which may be used for this purpose: only Unicode characters classified as letters, digits, punctuation characters, or symbols can appear in an attribute value of this kind. Note in particular that such values cannot include whitespace characters. Legal values include cholmondeley, été, 1234, e_content, or xml:id, but not grand wazoo. Attributes of this kind are sometimes used to associate (by co-reference) elements of different types.

Where identifiers are defined externally, for example as part of a database or file system, the inability to include whitespace or other special characters in a value may be problematic. In other cases, it may also be simply more convenient to supply a short sequence of natural language words including spaces as a single value. For these reasons, we also provide a datatype data.text which does permit whitespace and indeed any other Unicode character. Legal values include cholmondeley, été, 1234, e-content, xml:id, and grand wazoo. This datatype should be used with care since XML will not normalize whitespace characters within it: for example the values n="a  b" (two spaces) and n="a   b" (three spaces) would be considered distinct. This case should be distinguished from that of an attribute permitting multiple values, each of which may be separated by whitespace which will be normalized (see further 22.4.5.1 Datatypes).

Attributes of type data.name are similar to those of type data.word, but with the additional constraint that they must be legal XML identifiers, as defined by the XML 1.0 specification, or successors. Hence, they may not begin with digits or punctuation characters. Legal identifiers include cholmondeley, été, e_content, or xml:id, but not grand wazoo or 1234. Attributes of this kind are typically used to represent XML element or attribute names.
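
The difference between data.word and data.name can be sketched with two small checks. Both function names are hypothetical. The first follows the character classes listed above (letters, digits, punctuation, symbols); the second is only a rough approximation of the XML Name production, which is actually defined by explicit character ranges in the XML specification, so it should not be relied on for validation.

```python
import unicodedata


def is_data_word(value):
    """True if every character is a Unicode letter, number, punctuation or symbol."""
    return bool(value) and all(
        unicodedata.category(ch)[0] in "LNPS" for ch in value
    )


def is_data_name(value):
    """Rough approximation of an XML Name (an assumption, not the full rule)."""
    if not value:
        return False
    first, rest = value[0], value[1:]
    if not (first.isalpha() or first in "_:"):
        return False  # may not begin with a digit or other punctuation
    return all(ch.isalnum() or ch in "_:.-" for ch in rest)


samples = ["cholmondeley", "été", "1234", "e_content", "xml:id", "grand wazoo"]
words = [s for s in samples if is_data_word(s)]
names = [s for s in samples if is_data_name(s)]
```

With the sample values used in the text, every value except grand wazoo passes the data.word check, and 1234 is additionally excluded by the data.name check.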

Attributes of type data.enumerated, such as new on shift or evidence supplied by att.editLike, have the same definition as data.word above, with the added constraint that the word supplied is taken from a specific list of possibilities. In each case, the element or class specification which includes the definition for the attribute will also contain a list of possible values, together with a prose description of their intended significance. This list may be open (in which case the list is advisory), or closed (in which case it determines the range of legal values). In this latter case, the datatype will not be data.enumerated, but an explicit list of the possible values.

An attribute may, of course, take more than one value of a given type, for example a list of pointer values, or a list of words. In the TEI scheme, this information is regarded as a property of the datatype element used to document the attribute in question rather than as a distinct ‘datatype’. See further 22.4.5.1 Datatypes.

1.5 The TEI Infrastructure Module

The tei module defined by this chapter is a required component of any TEI schema. It provides declarations for all datatypes, and initial declarations for the attribute classes, model classes, and macros used by other modules in the TEI scheme. Its components are listed below in alphabetical order:

Module tei: Declarations for classes, datatypes, and macros available to all TEI modules

The order in which declarations are made within the infrastructure module is critical, since several class declarations refer to others, which must therefore precede them. Other constraints on the order of declarations derive from the way in which the modularity of the TEI scheme is implemented in different schema languages. The XML DTD fragment implementing this TEI module makes extensive use of parameter entities and marked sections to effect a kind of conditional construction; the RELAX NG schema fragment similarly predeclares a number of patterns with null (‘notAllowed’) values. These issues are further discussed in chapter 23.5 Implementation of an ODD System.
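
The ordering constraint on class declarations amounts to a topological sort. The sketch below uses Python's graphlib (available from Python 3.9) on a small subset of classes named in this chapter; it assumes, for illustration only, that each of the four member classes of model.pPart.data refers to that class in its declaration, and it does not reflect how any actual ODD processor is implemented.

```python
from graphlib import TopologicalSorter

# Map each class to the classes its declaration refers to, which must
# therefore be declared first (illustrative subset, assumed direction).
REFERS_TO = {
    "model.pPart.data": set(),
    "model.addressLike": {"model.pPart.data"},
    "model.dateLike": {"model.pPart.data"},
    "model.measureLike": {"model.pPart.data"},
    "model.nameLike": {"model.pPart.data"},
}

# static_order() yields every class after all the classes it refers to.
declaration_order = list(TopologicalSorter(REFERS_TO).static_order())
```

Any order produced this way places model.pPart.data before the classes that refer to it; a cycle among declarations would raise graphlib.CycleError.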

Notes
1
The colon is also by default a valid name character; however, it has a specific purpose in XML (to indicate namespace prefixes), and may not therefore be used in any other way within a name.
2
In former editions of these Guidelines, such elements were known metaphorically as ‘crystals’.
3
Note that in this context, phrase means any string of characters, and can apply to individual words, parts of words, and groups of words indifferently; it does not refer only to linguistically-motivated phrasal units. This may cause confusion for readers accustomed to applying the word in a more restrictive sense.


TEI Guidelines Version 2.8.1a. Last updated on 25th August 2015, revision 66d11fa. This page generated on 2015-08-25T17:23:40Z.