Responsibility: Juanma Garrido, Silvia Quazza
The following presents a review of some existing coding schemes for prosody. Section 6.2 gives a brief overview of prosodic phenomena, while section 6.3 discusses the purposes and problems of prosodic transcription. Finally, section 6.4 summarizes and compares a number of current schemes for prosodic annotation, which are described in more detail in the Annexes.
The term prosody covers a wide variety of facts, concepts and phenomena, defined by researchers working with different theories and frameworks. One of the first problems that arise when attempting the study of prosody (and of course, its representation) is the definition of the concept itself and its scope. The description of prosody in any language can be approached from two opposite (and complementary) starting points:
1) From a linguistic point of view, the description of prosody can be viewed as the description of a series of suprasegmental units (syllables, stress groups, intonational units) and phenomena (stress, intonation, rhythm).
2) From a phonetic point of view, the description of prosody is mainly approached as the description of the different phonetic correlates (length, loudness, F0 variations) of these linguistically relevant prosodic events.
Considering this distinction, prosodic phenomena can be classified into two main groups: a first group of linguistic prosodic events, and a second group of phonetic prosodic events. They are closely related to each other but can be described separately. These two subsets are reviewed in the following subsections.
In the linguistic descriptions of prosody (mainly from a phonological point of view), usually two types of prosodic items are handled: a set of prosodic units (phonological units with a scope wider than a segment), and a set of prosodic phenomena which are superimposed on these units.
Several types of prosodic units (differing mainly in their scope) have been proposed in the prosodic studies:
3) Intonation groups
4) Intermediate groups
5) Stress groups
It is not the aim of this review to present a detailed description of each unit. Although some of these units have been proposed after experimental research (as in the case of the paragraphs), that is, using phonetic evidence, most of them are used in phonological analysis. Apart from sharing the feature that their scope is in all cases wider than a single segment, all of them have in common the fact that they have been proposed as the natural domain of specific suprasegmental or segmental processes (see, for example, [Nespor & Vogel 86]).
We consider here as prosodic phenomena the suprasegmental features of intonation, stress, rhythm and speech rate. They are not units, but usually take place within a specific domain. They also carry some kind of linguistic (or paralinguistic) meaning.
As stated in [Roach 83, p. 112], no definition [of prosody] is completely satisfactory, but any attempt at a definition must recognize that the pitch of the voice plays the most important part. No precise and universal definition of intonation has been given yet, but there is general agreement about some facts: first, that intonation is clearly related to F0, although it determines changes in other phonetic parameters (for example, the length of prepausal syllables); second, that intonation refers to phenomena which occur at sentence level, reserving the term 'tone' for those F0 phenomena which are relevant at word level ([Lehiste 70]). From a phonological point of view, intonation phenomena are usually described in the following terms ([Pierrehumbert 80], for example):
a) Pitch accents
b) Boundary tones
c) Phrase accents
In other cases, however, the phonological components of intonation are described using different concepts. This is the case, for example, of the British school, which uses the terms head, body and tail ([Palmer 22], [Crystal,69]).
In the case of stress, there is a wider agreement about its nature and phonetic correlates: it is usually associated with the presence of a special degree of prominence on specific syllables of the discourse. Several types of stress have been defined in the literature, some of them language-specific:
a) lexical (primary)
c) stød (accents I and II in Swedish and other Scandinavian languages).
d) emphatic (focus, contrast)
Rhythm can be defined as the perceptual effect produced by the periodic repetition of some phonetic phenomenon along the discourse. The nature of the rhythm may differ depending on the language: it can be based on the isochrony of syllables (syllable-timing), or on the placement of stressed syllables at regular intervals (stress-timing). It is thus related to other prosodic phenomena (stress) and units (syllables), and produces variations in several phonetic parameters (duration of sounds or syllables, F0, intensity).
4) Tempo, speech rate
Tempo and speech rate depend on the speed at which speakers produce their utterances. Speech rate is often measured as the number of sounds uttered per second. It mainly produces changes in the length of the sounds, although differences in the shape of pitch movements due to changes of speech rate have also been reported.
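The rate measure mentioned above can be sketched as follows. This is only an illustration under stated assumptions: a phonetic segmentation is taken to be available as (label, start, end) tuples in seconds, and the function name `speech_rate` and the pause label "_" are hypothetical conventions, not part of any scheme reviewed here.

```python
def speech_rate(segments, pause_label="_"):
    """Return the number of sounds per second over the whole stretch.

    segments: list of (label, start_s, end_s) tuples in temporal order.
    Pauses (assumed to be labelled with pause_label) are not counted as
    sounds, but their duration is included in the elapsed time.
    """
    if not segments:
        return 0.0
    sounds = [seg for seg in segments if seg[0] != pause_label]
    elapsed = segments[-1][2] - segments[0][1]
    return len(sounds) / elapsed if elapsed > 0 else 0.0


# Hypothetical segmentation: four sounds and one pause in one second.
segments = [("h", 0.00, 0.08), ("e", 0.08, 0.20), ("_", 0.20, 0.50),
            ("l", 0.50, 0.58), ("o", 0.58, 1.00)]
print(speech_rate(segments))
```

Excluding the pause time from the denominator instead would give what is often called an articulation rate; which of the two is appropriate depends on the purpose of the analysis.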
6.2.2 Phonetic Correlates of Prosody
Prosodic units and phenomena are physically realized in the speech chain by modifying a set of phonetic parameters. These phonetic cues (F0, length changes, pauses, loudness) are called here phonetic correlates of prosody.
6.2.2.1 F0 Events
F0 changes are typically related to intonation phenomena, but stress and rhythm, as well as many other non-linguistic factors, also play a role in the definition of the final F0 contour of an utterance. These F0 changes (or events) seem to occur at different levels of description. At the first level (called here local), some of them seem to affect syllables or groups of syllables. However, other F0 events seem to affect wider units, such as intonation phrases or even sentences or paragraphs. These types of events are called here global.
1) Local F0 events
From a phonetic point of view, local F0 events can be described either as series of F0 levels or as F0 contours (movements). These are two different approaches to the description of the same phenomenon, the evolution of F0 along utterances.
2) Global F0 events
Several F0 variations seem to be related to more global phenomena, having a scope larger than the syllable or the stress group. They are concepts used mainly in phonetic descriptions of intonation:
a) global falling (declination) / rising
b) F0 reset
c) pitch range
However, these concepts still need to be integrated in phonological theories of intonation, which have been focused mainly on the description of local phenomena.
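The three global concepts listed above can be given a simple numerical reading. The sketch below is an assumption-laden illustration, not a method taken from the reviewed literature: F0 is assumed to be sampled at voiced frames only, values are converted to semitones, declination is approximated by a least-squares line, and all function names are hypothetical.

```python
import math


def semitones(f_hz, ref_hz):
    """Interval between two frequencies in semitones."""
    return 12.0 * math.log2(f_hz / ref_hz)


def declination_slope(times, f0_hz):
    """Least-squares slope of F0 over time, in semitones per second.

    Needs at least two samples taken at different times. A negative value
    indicates a globally falling contour, a positive one a rising contour.
    """
    ys = [semitones(f, f0_hz[0]) for f in f0_hz]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    return sxy / sxx


def pitch_range(f0_hz):
    """Distance between the lowest and highest F0 value, in semitones."""
    return semitones(max(f0_hz), min(f0_hz))


def f0_reset(prev_end_hz, next_start_hz):
    """Upward (positive) or downward jump between two phrases, in semitones."""
    return semitones(next_start_hz, prev_end_hz)


# Hypothetical F0 samples (Hz) for one phrase and the onset of the next.
times = [0.0, 0.5, 1.0, 1.5]
phrase = [210.0, 200.0, 190.0, 180.0]
print(declination_slope(times, phrase))
print(pitch_range(phrase))
print(f0_reset(phrase[-1], 205.0))
```

Using semitones rather than Hz is a design choice that makes values from speakers with different registers more comparable; a linear fit is only one of several possible approximations of declination.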
The length of a sound is the result of the interaction of several linguistic (stress, intonation, rhythm, speech rate) and non-linguistic factors (position in the utterance, phonetic context). Each sound of a given language seems to have also some kind of intrinsic duration, which is varied in the discourse by this set of factors. The length of a sound is then only partially related to prosody, because it depends also on segmental factors (the nature of each sound, the context where it appears).
6.2.2.3 Intensity - Loudness
As in the case of length, the intensity of a sound depends on several factors, with stress and intonation possibly being those which most affect the final intensity of a sound. Each sound of a language also seems to have its intrinsic intensity, which can be estimated by removing from the amplitude of a sound the influence of these factors.
The insertion of pauses in the discourse is one of the ways of marking prosodic phrasing; it is thus closely related to intonational phenomena. Speech rate may also determine the location of pauses. There are also other non-linguistic factors that can determine the insertion of a pause: physiological, such as the need to breathe; or psycholinguistic, such as hesitations.
6.2.2.5 Voice Quality
Voice quality is a phonetic cue that is usually related to the idiosyncratic characteristics of the speaker's vocal tract. However, some changes in voice quality may have a linguistic function, or may be determined by linguistic phenomena. This is the case, for example, of the changes in the spectrum of a sound affected by stress.
6.3 Prosodic Transcription
It is clear from the above review of prosodic concepts that prosody is a complex phenomenon that can be approached at different levels of description and can be studied for different purposes. From a linguistic point of view, it can be an object of analysis in itself, to be modelled seeking its patterns and functions, or it can be analyzed as a correlate of discourse structure. In speech technology research, prosody has been studied mainly to achieve natural sounding synthetic speech, trying to associate the proper prosodic events with the input text and realize them with the proper manipulations of acoustic parameters. Speech recognition research is also showing some interest in acoustic correlates of prosody as cues to text structure.
Each experimental study has adopted some kind of prosodic representation suited to its purposes, from abstract labels to acoustic measures. But due to the different perspectives of prosodic research a unique coding scheme for prosody is hard to conceive. Recently, the need for a standard coding scheme has been felt, in order to allow for easy data exchange in the era of large speech corpora.
But although several formal systems of representation of prosody have been used in the description of the prosodic events of different languages, at this moment there does not seem to exist a single, complete system to represent all the prosodic phenomena listed in the previous section.
Some attempts to propose a standard coding scheme have been made, perhaps the most successful in terms of diffusion being ToBI. But the discussion on advantages and drawbacks of different schemes should take into account not only the complexity of the object - the different aspects of prosody - but also the various possible objectives of prosodic research.
If the purpose is an analysis of discourse, some diacritics marking prosodic boundaries or accents could be enough. For a study of the relations between prosody and discourse structure in a language for which accurate prosodic modelling is already available, symbolic labels concisely representing the prosodic patterns of that language are the proper choice. On the other hand, if one wants to gather experimental data to investigate prosodic patterns and build up a prosodic model, a more detailed phonetic transcription is necessary. For linguistic studies such transcription could be based on auditory analysis, but for speech technology implementations it should be assigned precise acoustic meaning.
Given that, it is rather difficult to review a number of different coding schemes and compare them on the basis of quantitative categories such as the number of transcribers and transcriptions or the results of some evaluation test. In the Annexes, we do not attempt a complete review; rather, we give examples of transcription systems that are very different in nature and objectives. Some of them are general approaches to the study of prosody, which have been followed more or less thoroughly by many researchers in their experiments and studies. Others have been defined in the scope of some specific Project as a convention for prosodic labelling of corpora. In such cases, the purpose of the Project and the intended use of the corpus determine the kind of prosodic representation: corpora acquired for dialogue research, for example, often are not focused on prosody and need only abstract labels to mark some macroscopic prosodic features related to discourse events. Finally, some of the reviewed coding schemes have been defined with the explicit aim of providing a standard.
A final remark about the phenomena annotated in the different coding schemes: while it is admitted that prosody is a complex matter where intonation, rhythm and loudness are intertwined, the discussion on prosodic notation generally focuses on intonation, at least when coming to phonetic descriptions. Some phonological representations make explicit reference to speech rate, lengthening or more sophisticated rhythmical categories, and most coding schemes mark phrase boundaries and accents, which globally refer also to duration/intensity phenomena. But in phonetic-level prosodic transcriptions the main point - perhaps because it is the most problematic aspect - is intonation. Generally, for annotated speech corpora a phonetic segmentation is available, so that duration is implicitly marked and intensity can be computed from the signal. The peculiarities of a coding scheme often concern its representation of fundamental frequency, so that a relevant feature of a notation system is its underlying theory of intonation or its reference methodology for intonation analysis.
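The remark above that duration is implicit in a segmentation and that intensity can be computed from the signal can be sketched as follows. This is a minimal illustration under assumptions: the signal is taken to be a list of samples scaled to the range [-1, 1], intensity is approximated by the RMS level in dB relative to full scale, and the function names are hypothetical.

```python
import math


def rms_db(samples):
    """RMS level in dB relative to a full-scale amplitude of 1.0."""
    if not samples:
        return float("-inf")
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square <= 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def segment_measures(signal, sample_rate, start_s, end_s):
    """Duration and intensity of one labelled segment.

    Duration comes directly from the segment boundaries; intensity is
    computed from the samples that fall between them.
    """
    first = int(round(start_s * sample_rate))
    last = int(round(end_s * sample_rate))
    return {"duration_s": end_s - start_s,
            "intensity_db": rms_db(signal[first:last])}


# Hypothetical one-second signal with constant amplitude 0.5 at 16 kHz.
signal = [0.5] * 16000
print(segment_measures(signal, 16000, 0.25, 0.5))
```

RMS level is a physical measure of intensity; perceived loudness also depends on frequency content and duration, which this sketch does not model.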
Fully acoustic approaches such as the classical one by Fujisaki [Fujisaki 71], where the intonation profile is seen as a superposition of mathematically defined curves, cannot be said to have developed into notational systems, although they provide descriptions of data. On the opposite side, linguistic approaches such as the traditional British School ([Crystal,69], [O'Connor 73]), based on auditory analysis and strong theoretical assumptions, have been widely used in phonological research and have also been recently adopted in corpus labelling (see TSM). In this view, (English) intonation is subdivided into tone units, where the main intonation event, the nuclear tone, occurring on the last accented syllable, is described in its height and shape, for example as a high fall or a low fall-rise. Another family of phonological approaches, whose reference is [Pierrehumbert 80] (and whose first object is again English), describes intonation in terms of levels rather than shapes: what seems relevant is the tone level reached by the different points in the pitch contour, which is described in terms of the contrast between high and low (H, L) and of the association with accents (*) and boundaries (%). The use of this notation (more than the underlying principles) is widespread, at least in scientific communication, and this theory has inspired the proposed standard ToBI. Experimental phonetic research and speech technology generally are more inclined to follow data-oriented bottom-up methodologies. For these approaches, an intonation model for a given language should keep a precise - implementable - phonetic/acoustic content. The starting point is the f0 curve, which is first stylized and then phonetically described by means of generalizations from the acoustic/perceptual data.
The curve may be seen as a sequence of pitch movements or contours - as in the IPO view (see Annexes) - or as a series of interpolated target points or pitch levels connected by a continuous curve - as in the INTSINT approach.
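The two views of a stylized curve described above, target points versus movements, can be related by a simple conversion. The sketch below is not the IPO or INTSINT procedure; it only illustrates the idea under assumptions: target points are given as (time, F0) pairs, consecutive points are compared in semitones, and the 1-semitone threshold and the function name are arbitrary illustrative choices.

```python
import math


def movements_from_targets(targets, threshold_st=1.0):
    """Turn a series of target points into a sequence of pitch movements.

    targets: list of (time_s, f0_hz) pairs in temporal order.
    Returns (start_s, end_s, direction, size_st) tuples, where direction is
    'rise', 'fall' or 'level' depending on the interval in semitones.
    """
    moves = []
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        size = 12.0 * math.log2(f1 / f0)
        if size > threshold_st:
            direction = "rise"
        elif size < -threshold_st:
            direction = "fall"
        else:
            direction = "level"
        moves.append((t0, t1, direction, round(size, 2)))
    return moves


# Hypothetical target points of a stylized contour.
targets = [(0.10, 120.0), (0.30, 180.0), (0.60, 110.0), (0.90, 108.0)]
for move in movements_from_targets(targets):
    print(move)
```

The same data thus supports both descriptions: the list of targets is a 'levels' view, and the derived tuples are a 'movements' view.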
Examples of coding schemes more or less explicitly inspired by such different intonation theories are included in the review presented in the Annexes. The review, by no means exhaustive, gives brief descriptions of the following schemes:
1. PROSPA
14. PROZODIAG (Lund)
For detailed surveys of prosodic transcription and encoding systems the reader is referred to [Llisterri 94, 96b], to [Léon & Martin 70] (which contains a chapter devoted to classical approaches to prosodic transcription) and to [Gibbon 90], which reviews most of the work in this area carried out within the SAM (Speech Assessment Methodologies) project. A discussion of this topic is also found in the text representation chapter of the EAGLES Handbook on Spoken Language Systems [Gibbon et al. 1997].
As discussed above, prosodic research is too complex in contents and points of view to be codified in a standard coding scheme. The description of the different transcription systems reviewed in the Annexes should give an idea of the variety of theories and purposes underlying the attempts to give a representation of prosodic phenomena.
Comparing the reviewed schemes is not a trivial task. An attempt has been made to describe them according to a general pattern, but this has not always been possible, due to the different nature of the schemes: some of them are well defined, used in a single project to label a single corpus, while others can be considered methodologies or theories. Even a quasi-standard like ToBI has many variants, imitations or adaptations, some of which may loosen its basic assumptions (e.g. by admitting 'movements' beside 'pitch levels', [Mayo et al. 97]). There is no agreement about the prosodic phenomena which have to be represented. Some systems are intended only for f0 representation (e.g. INTSINT, IPO, TILT, TSM, ToBI, PROSPA), while others provide labels to mark rhythm, loudness, voice quality (e.g. TEI). Most systems delimit prosodic units (some implicitly, as breaks in the f0 curve), but unit types range from the single 'tone unit' (e.g. PROSPA) to complex hierarchies (e.g. SAMPA). Approaches to the transcription of intonation can be acoustic-phonetic or phonological, or can allow for different abstraction levels (e.g. IPO, INTSINT, TILT, PROZODIAG), and conceive the pitch profile in terms of 'levels' or 'movements'. Some systems are developed in the framework of specific prosodic theories or methodologies (e.g. IPO, INTSINT, ToBI, TSM, PROZODIAG). In some cases labels are strictly linked to language-dependent models (e.g. ToBI, PROZODIAG), while in other cases they are more general or 'phonetic', although abstract (e.g. PROSPA, IPA, SAMPA). Manual labelling is for many schemes based both on auditory analysis and on visual inspection of the f0 curve and waveform, but for some schemes labels are not aligned with the signal but merely associated with linguistic units (e.g. TEI, IPA, Göteborg). Only a few systems have a 'real' coding book; in most cases the scheme is described in the literature.
Formal evaluations of the performance have been carried out in very few cases and only one coding scheme (TEI) has been developed within a standard markup language (SGML). Some systems insert labels directly in the orthographic or phonetic transcription, while others have different tiers for prosodic annotation. Few systems have specific annotation tools, while many are compatible with standard signal analysis environments such as ESPS/Waves+.
To help comparison, the following Table provides a synopsis of the different schemes, summarizing their relevant features and underlying principles. For each scheme, the Table specifies its underlying intonation theory, the labelled prosodic units and phenomena (see section 6.2) and how labels are aligned with speech. Some abbreviations are used: 'p.' stands for 'phrase', '>' stands for 'labels are symbolically associated with' and '|' means 'labels are time-aligned with'. Schemes are roughly ordered according to their level of abstraction. The first schemes listed in the Table are those conceived in bottom-up approaches, where the analysis of intonation starts from the f0 curve, 'stylized' and represented with labels or parameters keeping a precise acoustic content, and reaches, as a second step, a more abstract phonological representation (e.g. contour labels for manual labelling in TILT, 'pitch configurations' in IPO, tonal labels for accent, focus, juncture in PROZODIAG). Such systems align their prosodic labels with the speech signal, in some cases at the phonetic boundary of relevant units (stressed vowel, syllable) or at turning points in the f0 curve (peaks, valleys). The link with the f0 curve is less strict for the schemes listed at the bottom of the Table. Labels are often inserted in the phonetic or orthographic representation or refer to linguistically defined units. Phonological assumptions may be more or less strong, but labels tend to have a qualitative interpretation (e.g. ToBI labelling rules are strict and rely on a predefined language-dependent phonological model, but labels may be aligned with the f0 curve; systems like TEI or Göteborg are model-independent but are more qualitative and their links with the signal are looser).
Scheme Prosodic Units Prosodic Phenomena Alignment Intonation Theory
intonational events described with starting f0, duration, amplitude, shape (numerical values) and classified as accents and boundaries, rises, falls, connections
> accents, boundaries
| signal, vowel onset
Taylor: sequence of intonational events (movements)
intonation: pitch movements described with direction, timing, rate of change, size (categorical values)
> accents, boundaries
IPO: f0 stylization with straight pitch movements; search for recurring f0 patterns (language dependent models)
intonation: transcription of the f0 curve by means of target points (classified according to pitch level)
Hirst: pitch levels, absolute tones and relative tones
local events+global trend
Hirst: pitch levels, global trends
global: register and range (numerical values) local: tonal labels for accent, focus, juncture
Bruce: pitch levels
Phrase Parenthesis Interruption
speech rate lexical stress sentence accent intonation:
downstep, intonation cont. (peaks, valleys)
> phrase, word, stressed syll
| 3 positions in accented vowel
Kohler: pitch movements
'tones', described with starting level and shape of the contour
> syllables, accents, tone units
| accented syllables
British School: nuclear tones: pitch movements on accented syllables
Clitic Word Intermediate p. Intonational p.
pitch and phrase accents, boundary tones, downstep
Pierrehumbert: pitch levels
Word Intermediate p. Full p.
Interruption (Syntactic-prosodic units)
phrase acc., secondary acc., emphasis (intonation: see ToBI)
global tones, local tones, nuclear tones duration: phoneme lengthening pauses
both levels and movements
Syllable Minor p. Major p.
primary, secondary duration: 3 classes intonation: local f0 variations, global (downstep, etc.)
> phonetic transcription
symbols both for levels and movements
Syllable Morpheme Word Tone group Intonational p. Rhythm group Phonological p.
(primary, second., .scandinav.) duration: phoneme lengthening intonation: contours pauses
> phonetic transcription
global contour, local accents
> phrases, accents
movements; for each phrase: global slope, local pitch accents, slope of the 'tail' (after last accent)
global, syllable lengthening, speech rate pauses loudness intonation: contours, pitch range, trend voice quality
movements (global trend, global range, local contours)
stress duration: lengthening pauses speech properties
Choosing one of the existing schemes as a possible standard for prosodic annotation requires a clear picture of which phenomena we want to represent and which use we intend to make of the annotated corpora. The simple search for a de facto standard may not be the best strategy. A widespread system like ToBI, which is intended only for intonation transcription, is indeed open to criticism and cannot be said to be an unquestionable standard. Its extension to languages other than English is not trivial and often requires defining their intonation model in advance, rather than deriving it from the annotated corpora. The separation between phonetic and phonological representations is not clear in ToBI, which may be considered as an "uneasy compromise" between the two [Nolan et al. 97].
In conclusion, the choice (or definition) of a standard for prosodic transcription of discourse should take into account the following points.
a) It can be inferred from the great variety of phenomena underlying the term 'prosody' that a first step towards the choice of a standard should be the selection of the prosodic phenomena to be covered by the scheme. Also, it has to be decided if it is more adequate to define a 'general purpose' notation scheme, which could be used to annotate the prosody of any kind of text, or to restrict the scope of the scheme to those phenomena which play a relevant role in discourse.
b) It can be concluded from the present review that prosodic analysis can be approached from several points of view, using different theoretical models and for very different purposes. One way of handling this variety without losing standardization is to allow different levels of transcription, which should include at least a phonetic representation of prosody (not limited to intonation, but including also other information, such as length), a phonological representation (in terms of pitch accents or boundary tones, for example, but including also other information, such as stress or the location of prosodic boundaries) and a functional representation (indicating the uses of prosodic phenomena to express different linguistic or pragmatic functions, and even cross-level references).
c) In order to ensure the usability of the annotated corpora both in language and speech applications, it seems also important to choose a scheme which allows the alignment of the notation symbols both with the speech signal and the orthographic (or phonetic) transcription of the annotated utterance.
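Points b) and c) can be pictured as a small data structure. The sketch below is one possible layout under stated assumptions, not a proposed standard: every class and field name is hypothetical, labels are either time-aligned (a time in seconds) or symbolically associated with a word of the orthographic transcription (a word index), and separate tiers hold the different levels of transcription.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Label:
    text: str
    time: Optional[float] = None  # seconds, for time-aligned labels
    unit: Optional[int] = None    # word index, for symbolic association


@dataclass
class Tier:
    name: str  # e.g. 'phonetic', 'phonological', 'functional'
    labels: List[Label] = field(default_factory=list)


@dataclass
class Annotation:
    words: List[str]
    word_spans: List[Tuple[float, float]]  # (start_s, end_s) of each word
    tiers: List[Tier] = field(default_factory=list)

    def word_at(self, time):
        """Index of the word whose span contains the given time, if any."""
        for index, (start, end) in enumerate(self.word_spans):
            if start <= time < end:
                return index
        return None

    def link_to_words(self):
        """Give every time-aligned label the index of the word it falls in."""
        for tier in self.tiers:
            for label in tier.labels:
                if label.unit is None and label.time is not None:
                    label.unit = self.word_at(label.time)


# Hypothetical two-word utterance with two tiers.
ann = Annotation(
    words=["good", "morning"],
    word_spans=[(0.0, 0.35), (0.35, 0.90)],
    tiers=[
        Tier("phonological", [Label("H*", time=0.48), Label("L%", time=0.89)]),
        Tier("functional", [Label("greeting", unit=0)]),
    ],
)
ann.link_to_words()
for tier in ann.tiers:
    print(tier.name, [(lab.text, lab.unit) for lab in tier.labels])
```

Keeping both the time and the word index on a label lets the same annotation serve signal-oriented and text-oriented uses, which is the requirement expressed in point c).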