Text Encoding Initiative
Tenth Anniversary User Conference

``What Not to Tag''

John Lavagnino

Introduction

My aim in this paper is to talk about our choices in encoding texts, and, in particular, to focus on decisions about things not to do. One well-known reaction to the sight of the imposing bulk of the TEI Guidelines is the cry of despair at the thought that every word must be mastered and applied wherever and whenever the appropriate text features crop up. Of course, this is not the intention at all; but the decision about what to use is still a real problem, particularly for projects---the most common sort, I think---that do not have a specific use for their texts in mind, but instead aim to provide a generally useful digital collection.

I believe the Guidelines already offer a lot of help on this, in that each chapter does explain what the tags it discusses are for, and often makes clear the range of disciplines that work with the kind of information in question. And indeed the guidance the Guidelines offer by virtue of what they do and do not talk about doing is considerable: in view of the common argument that you can represent pretty much any structured information using SGML, the focus provided by the Guidelines on a subset of this universe of information is pretty good. Perhaps a summary guide that went chapter by chapter and talked about these questions a bit more, about why you would or would not need to study each chapter in more depth, would be valuable. But, for better or for worse, what I've chosen to do is instead to work on developing more general thoughts on what not to tag.

If I had to sum up what I have to say in one sentence, it would be: Don't tag what you don't understand. But a somewhat more Wagnerian statement of my point would be: Don't tag things that aren't fully worked out or elaborated, and don't tag the random, the occasional, the unique, or---to use an Aristotelean term that I'm going to be adopting---the accidental.

The unelaborated

There are lots of things that spring to mind as not good to tag, for reasons that don't give us deep insights into tagging; but let me address them briefly. To begin with, there are kinds of information that don't lend themselves well to SGML encoding, such as images. It is not a technical problem to encode images using SGML, but it does seem that using special image-file formats is generally a better approach. It's not just that the programs that exist for displaying and manipulating images universally work that way; after all, from the point of view of some people the existing programs for displaying and manipulating texts universally use Microsoft Word's many file formats and not SGML. It's that the size of images is something that still matters a lot in practical terms, for storage and particularly for transmission, so that using formats with specialized compression is of real value; and the manipulations we tend to subject images to are mostly very different from those we subject texts to, so there is little advantage to using a common format. We don't find ourselves wanting to feed novels into Photoshop or images into TACT.

Here's another thing not to tag: it is in many cases not a good idea to tag random underlining or foolish margin annotations in a book, for example, and this not because of any problem about encoding such things but because we shouldn't generally waste our time on the random or foolish when there are other ways we could be spending our time. I own a copy of Lucky Jim with a previous owner's marginal annotations, and one of the most common is the single word ``Humor''. The question of value comes into thinking about what to tag in deeper ways, but here it's just a case of a pretty general truth.

But there is a more important version of this same point which Willard McCarty has made quite well already, and which is worth repeating. He proposed that there's a kind of tagging, which he refers to as ``magisterial tagging'', which is ``the mistaken practice of inserting tags wherever one thinks some phenomenon occurs without following a consistent editorial policy and providing a full explanation of it'' (McCarty 1997); for example, putting in a tag whenever it occurs to you that there's a metaphor. Willard's right that this isn't terribly valuable; I imagine that there may be some utility to it in the way that an inaccurate edition or electronic text has some utility---a work of this kind that's incomplete and inaccurate can still sometimes help you find something you're looking for (see Shipps on the use of similarly inaccurate or incomplete reference works such as concordances). If you want to find the place where Whitman says ``I am a habitan of Vienna'', an error-filled electronic text of Whitman may be of help because it might not be erroneous at this particular point. But such a text requires great care in use: you always need to check your results against a more reliable source, and you can never make any reliable statements about how often something happens. Electronic resources, I believe, are of most value when they are most optimized for searching and other computer-aided analyses; there are practical advantages to electronic texts that are just digital paper, like the ease of publication and updating, but I don't feel these would justify the effort of scholars or a conference like this one. The goal of facilitating the use of computers in collaboration, and not just as a publishing mechanism, is behind opposition to magisterial tagging.

But as a personal working practice, and not as a generally useful resource, the text with magisterial tagging is likely to be quite important: it can simply be a way of marking groups of things that you plan to discuss together or that you want to remember and return to for any reason (this is something I'll be coming back to). My working practices may be such that writing ``Humor'' in the margins here and there, on a paper or electronic text, actually helps me with my work. But such notations will be of fairly limited value to the outside world.

It is with the kind of information that in the world of TEI discussions has generally been called ``interpretive'' that magisterial tagging most obviously comes up. But one of the impulses for the practice when it occurs in the creation of texts for general use---a feeling that you've noticed something and ought to record it because it might perhaps be helpful to someone---and one of the consequences---information that can't be very well characterized or authenticated, and that therefore can wind up wasting a researcher's time rather than saving it---are characteristic of something that looks a lot more free of interpretation. I'm thinking of an impulse many seem to feel to provide information of a kind that they don't understand well but which they gather is useful to some people somewhere and which they believe to be objective. The tagging of metrical information and verse form in poetry can be an example of this; it's treacherous because it's easy to determine the meter and verse form for quite a lot of English poems, but the odd cases pose substantial problems. It is especially treacherous if you approach the task with the assumption that every poem will fit snugly into one or another existing category, rather than keeping in mind the importance of watching for poems that just don't fit any of the categories you have devised. And defining those categories will call for a fair amount of work. Manuals on versification do not have the kind of taxonomic drive that you need if you are going to be tagging a heap of verse in this way: they weren't written with databases in mind, even though they may look very classification-minded. The practice of poets, also, is deplorably creative and flexible, and they like to vary verse forms in any number of ways (especially since the beginning of the nineteenth century), thereby raising the question of what exactly the defining characteristics of each form are. 
Suppose I write a double sestina, is that still in the sestina class? Is a sonnet that ends with an alexandrine couplet still a sonnet? Is it worth trying to distinguish all the different sorts of sonnet, or is there a point at which it ceases to be worth it? Even rhyme, which is something we assume we all know about, is not so straightforward a phenomenon to identify, given the growth in the use of various sorts of slant rhyme or assonance rhyme. In Robert Pinsky's translation of Dante's Inferno, for example, ``contain'', ``run'', and ``soon'' rhyme (canto 20, lines 65--69). Identifying the lines that rhyme in a long poem in terza rima is not very hard, but doing that in a group of shorter poems that could use all sorts of verse forms could be tricky for the inexperienced.
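
The point about watching for poems that fit none of your categories can be made concrete with a small sketch. What follows is only an illustration under stated assumptions: the two category names, the rhyme-scheme notation, and the function names are hypothetical, and are not a taxonomy proposed here or in the Guidelines. The rhyme scheme itself is assumed to have been determined already by a reader, which, as the Pinsky example shows, is where much of the real judgment lies.

```python
# Hypothetical sketch: the categories and names below are illustrative
# assumptions, not a classification scheme from the TEI Guidelines.

KNOWN_SCHEMES = {
    "abbaabbacdecde": "Petrarchan sonnet",
    "ababcdcdefefgg": "Shakespearean sonnet",
}


def classify_rhyme_scheme(scheme):
    """Return a label for a rhyme scheme, or 'unclassified' if none fits.

    The scheme is a string such as 'abab cdcd efef gg'; spaces and case
    are ignored. Nothing is ever forced into the nearest category.
    """
    normalized = scheme.replace(" ", "").lower()
    return KNOWN_SCHEMES.get(normalized, "unclassified")


def unclassified_poems(poems):
    """List the titles of poems whose scheme matched no known category.

    `poems` maps a title to its rhyme scheme. The result is the list an
    encoder would need to review by hand before tagging anything.
    """
    return [
        title
        for title, scheme in poems.items()
        if classify_rhyme_scheme(scheme) == "unclassified"
    ]
```

The design choice that matters is the explicit fallback: a poem that matches nothing is reported for review rather than silently assigned to the closest form, so the encoder learns how often the categories fail.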

I have occasionally heard of plans to tag or otherwise make use of notes assembled in written form by prominent scholars: an approach that would seem ideal for getting around the problem of insufficient understanding of the subject. This sort of project is certainly worth doing, but its value will lie more in making accessible the collection of insights of the authors than in providing a ready-made collection of appropriately systematized information. Some scholars have worked towards creating catalogues and bibliographies of the sort that would benefit greatly from publication in electronic form, but most of the unpublished notes that exist are not likely to be ideal for the electronic medium. It is more than the fact that they don't cover large bodies of material exhaustively; it is that, even when they do, scholars are more often engaged in collecting information and assembling notes on particular topics, directed towards the needs of specific research projects, than in collecting information generally. We have all noticed that the information available in the world is incomplete: the feeling that more needs to be said is fundamental to the desire to do scholarly work. But trying to say it all impedes ever getting anything done, so that productive scholars also tend to be scholars whose researches have a very specific focus most of the time.

For digital libraries, the most useful kind of information is usually that which is fully elaborated, carefully crafted for consistency and completeness of well-defined kinds; and this is not so different from the nature of the most useful reference books generally. What it means for electronic-text creation, though, is that it's a significant challenge to make information that meets that standard, and moreover we aren't going to just find it lying around.

The accidental

An early name for what we now generally call ``descriptive markup'' (as characterized most fully by Coombs, Renear, and DeRose, 1987) was ``generic markup'' (Goldfarb 567--568), and that name persists in SGML in the term ``generic identifier''. The obvious problem here, of course, is that it isn't always easy to figure out what the set of genera is that will cover everything in a particular body of texts.

For literary texts, I think the genus-problem that seems biggest comes up with things that are hard to fit into a genus because there aren't enough of them. Among the thousands of perfectly unproblematic paragraphs or verse lines in a text, we often encounter just one or two blocks of type that seem difficult or impossible to handle, because they're just not like anything else. Sometimes these are objects that occur frequently in other texts, just not in this one, like the mathematical formulae that suddenly start to appear in Gilbert Sorrentino's novel Mulligan Stew after page 300; that sort of thing is not really a problem.

An example of what I'm thinking about occurs in Vladimir Nabokov's novel Transparent Things: on page 14, there is an arrangement of type intended to represent the sign on a picture-taking booth: it consists of the characters 3P in large type, followed by two pieces of words in smaller type: ``hotos'' and ``oses''. Although there are a number of discussions of advertisements elsewhere in Nabokov's work, this is the only instance of a type-facsimile of a sign.

It is not very hard to think of perfectly workable ways to encode this. The problem is that they all seem to have a strong element of the arbitrary. Would we want something that applied just to other signs in which two words share one letter, or should our tag work for signs that didn't have that characteristic? Do we need to think about various possible arrangements of letters and words and offer ways to do them all, or only this particular arrangement? We can come up with a very general scheme, but then we still have no way of knowing whether it would actually help us out much in other cases if they ever come up; or we could come up with a very restricted scheme that only works when you have a few large letters and exactly two continuations that share the last of the large letters---raising the question of why we should have a whole separate tag for something so narrowly defined that we probably will never see it again.

Aristotle, in the Metaphysics, talked about the attempt to determine the essence of things, their real being; the ``accidental'', the contingent aspects of something, are not part of its essence. And indeed Aristotle argues that no science or systematic form of inquiry deals with accidents (1026b--1028a). We don't have to go along with Aristotle's search for the essence of all things to find this a suggestive account of our enterprise of generic markup, where we have decided from the start to search for the essence of texts. In a case like the photo sign, we aren't able to determine the essence of the thing, or to distinguish that from the accidentals---something that we have learned to do quite readily for such things as paragraphs. We find the photo sign hard to handle because we don't really understand it; we can read it, of course, but from the point of view of generic markup an inability to describe the genus is a failure to understand it in the way that the enterprise demands.

We can find acceptable practical solutions, of course. I would suggest that, even though there is nothing but letters and numbers in this object, it is still best done as an image, with some transcription of the text but without an attempt to provide tags intended to facilitate a rendition of the text resembling the original. This strikes me as being simply more efficient than devising tags which we then spend the labor to interpret to render something that only happens once anyway. But our difficulty with this sort of thing is more consequential than its frequency---in this case, one problematic construct in Nabokov's entire output of eighteen novels---may suggest.
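
The image-plus-transcription approach suggested above might be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended or conformant encoding: the element names `figure`, `graphic`, and `figDesc` are loosely modelled on TEI's figure-related elements but should be read here as placeholders, the file name is hypothetical, and the transcription simply repeats the characters as described above without trying to reproduce their arrangement.

```python
import xml.etree.ElementTree as ET


def encode_sign(image_file, transcription):
    """Encode a one-off typographic object as an image reference plus text.

    The element and attribute names are assumptions chosen for
    illustration. No markup is spent on the layout of the letters.
    """
    figure = ET.Element("figure")
    # Point at an external image file instead of describing the layout.
    ET.SubElement(figure, "graphic", {"url": image_file})
    # Keep a plain transcription so the words remain searchable.
    description = ET.SubElement(figure, "figDesc")
    description.text = transcription
    return ET.tostring(figure, encoding="unicode")


# Hypothetical file name; the transcription follows the description above.
encoded = encode_sign("transparent-things-p14.png", "3P hotos oses")
print(encoded)
```

The encoding carries no claim about what kind of object the sign is: the image preserves the appearance, the transcription preserves the characters, and no new tag has to be invented for an arrangement that occurs once.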

There is a generalizing and universalizing element to the humanities: but there is also an element that is very attached to the unusual, the particular, and the individual; or, to adapt Aristotle's term, the accidental. It is the specifically aberrant instance that often attracts the most interest from us. But when these aberrant instances have features, such as unusual rendition, that call on us to do something as encoders, then we face a problem, since we cannot effectively identify the generic in such features---thus showing our enterprise at its weakest---and yet these features attract interest out of proportion to their frequency, since it's their lack of frequency that is so important. What we encounter here is significant because it is a situation where the discrepancy between the attitude of the encoder and the attitude of many scholars who might use our texts is particularly acute. Not tagging the unique seems to me the right thing to do, since our whole approach is founded on identifying the generic; but we need to talk with other scholars about why it is we think so differently from them.

Works Cited

Kingsley Amis, Lucky Jim (New York: Viking, 1958).
Aristotle, Metaphysics, in Jonathan Barnes, editor, The Complete Works of Aristotle (Princeton: Princeton University Press, 1984).
James H. Coombs, Allen Renear, and Steven J. DeRose, ``Markup Systems and the Future of Scholarly Text Processing'', Communications of the ACM 30:11, November 1987, 933--947.
Dante, The Inferno, translated by Robert Pinsky (New York: Farrar, 1994).
Charles Goldfarb, The SGML Handbook (Oxford University Press, 1990).
Willard McCarty, ``Theft of fire: meaning in the markup of names'', ACH/ALLC 97, Kingston, Ontario, Canada, 7 June 1997. Published on the Web.
Vladimir Nabokov, Transparent Things (New York: McGraw-Hill, 1972).
Anthony W. Shipps, The Quote Sleuth: A Manual for the Tracer of Lost Quotations (Urbana: University of Illinois Press, 1990).
Gilbert Sorrentino, Mulligan Stew (New York: Grove Press, 1979).
Walt Whitman, ``Salut au Monde!'', Leaves of Grass, edited by Sculley Bradley and Harold W. Blodgett (New York: Norton, 1973), 137--148.
