The universal character set in the Uralic Phonetic Alphabet
Choosing the right character
Combining letters and diacritics
Normalization
Encoding and typography
Typographical characteristics of fonts
Typographical characteristics of programs
Small capitals, superscripts and subscripts
Italics
The universal character set includes all the characters necessary in both the diachronic and the synchronic research into Finnish and its cognate languages. The character set also covers the characters of the current and historical languages of the surrounding language groups.
In linguistics, the introduction of the universal characters enables the application of modern electronic research methods even to data recorded using the phonetic alphabet. We benefit most from this by using the universal characters in a uniform manner in our field of research. That way, recording methods do not differ from each other, and cooperation and electronic contacts are as smooth as possible.
The publications standardizing the universal character set (Unicode and standard ISO/IEC 10646) sometimes present glyphs that are misleading; in some cases different characters have been assigned the same glyphs. For this reason it is essential that the characters should be identified by their names and code points, instead of merely identifying them by the glyph.
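As a small illustration of identifying characters by name and code point rather than by glyph, the following Python sketch uses the standard unicodedata module. The helper name describe is an assumption made for this example only; it is not part of the standards or of the UPA keyboard.

```python
import unicodedata

def describe(ch):
    """Return the code point and official name of a single character."""
    # "<unnamed>" is a fallback for characters that have no name in the
    # Unicode database bundled with the running Python version.
    return f"{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# Two characters that share a glyph in most fonts but are distinct characters.
print(describe("a"))
print(describe("\u0430"))
```

The two lines printed differ in both code point and name, even though the glyphs are usually indistinguishable on screen.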
A special keyboard has been designed for typing the characters in the Uralic Phonetic Alphabet. This “UPA keyboard” (from Uralic Phonetic Alphabet, cf. IPA) is based on the Finnish keyboard and is freely downloadable on this site.
The Uralic Phonetic Alphabet is based on the Latin alphabet, supplemented with characters borrowed from the Greek and Cyrillic alphabets. Further, it contains numerous diacritics attached to the basic characters. The universal character set contains an IPA section (0250–02AF) and a phonetic extension (1D00–1DFF) with characters that belong primarily to the Uralic Phonetic Alphabet.
Choosing the right character
The development of the universal character set began as early as 1986. Gradually, more and more characters were accepted, and sometimes a new character was almost or even completely identical to an existing one. It has nevertheless been necessary to accept such overlapping characters for various reasons, since they have different backgrounds. One example is the turned e:
01DD ǝ LATIN SMALL LETTER TURNED E
04D9 ә CYRILLIC SMALL LETTER SCHWA
These two characters belong to two different alphabets, which is why they should be kept separate. The position of the characters is also different in the Latin alphabet than in the Cyrillic alphabet: the latter character has a capital equivalent (Ә), whereas the former character does not have one. However, there is yet a third character which looks exactly the same:
0259 ə LATIN SMALL LETTER SCHWA
This character has been adopted and encoded separately since it is one of the IPA characters: thus, its position and usage differ from those of the above-mentioned characters. There are other such apparent duplicates in the universal character set. For the sake of uniform and compatible recording, we need to make sure the characters are not used in a confusing manner. In such cases it is the character’s name that counts.
Thus, in cases where a given character exists already (encoded in the universal character set) both as part of the Latin alphabet and another alphabet, the first option is recommended. If there remain alternatives even within the Latin alphabet, it is recommended that a character belonging to the IPA should be used. Therefore, the Uralic Phonetic Alphabet uses the following character to mark a reduced vowel:
0259 ə LATIN SMALL LETTER SCHWA
The UPA keyboard guides its users to choose the right characters. Of the above letters, it allows its users to type just the character 0259 ə, which belongs to the IPA.
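The recommendation above can be sketched as a simple check on transcribed text. This is a minimal Python example under stated assumptions: the names RECOMMENDED_SCHWA, LOOKALIKES, find_lookalikes and use_recommended are hypothetical and chosen for illustration, and the table covers only the three characters discussed in this section.

```python
import unicodedata

RECOMMENDED_SCHWA = "\u0259"  # LATIN SMALL LETTER SCHWA (the IPA character)

# Lookalikes discussed in this section, mapped to the recommended character.
LOOKALIKES = {
    "\u01dd": RECOMMENDED_SCHWA,  # LATIN SMALL LETTER TURNED E
    "\u04d9": RECOMMENDED_SCHWA,  # CYRILLIC SMALL LETTER SCHWA
}

def find_lookalikes(text):
    """List (index, code point, name) for every non-recommended lookalike."""
    return [
        (i, f"{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in LOOKALIKES
    ]

def use_recommended(text):
    """Replace the listed lookalikes with the recommended character."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)

sample = "k\u04d9t"  # hypothetical transcription typed with the Cyrillic schwa
print(find_lookalikes(sample))
print(use_recommended(sample) == "k\u0259t")
```

Such a replacement is only meaningful for transcription data; in genuinely Cyrillic text the Cyrillic schwa is the correct character and should be left alone.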
Combining letters and diacritics
From the point of view of the phonetic alphabet, there are two types of characters in the universal character set: letters and diacritics. The phonetic alphabet makes use of letters, as well as diacritics that may be added to the letters. However, for historical reasons the universal character set already contains some ready-made combinations of letters and diacritics. In principle, these combinations can thus be typed in two ways: either by entering the combination directly or by entering its parts one by one.
The existing combinations of letters and diacritics in the universal character set can be explained primarily by a desire to ensure compatibility with earlier, less extensive character sets. These include Latin-1, which used to be very widely used in Western Europe. Just as these smaller character sets were established mainly for characters belonging to standard languages, the existing combinations in the universal character set consist of characters from standard languages. In fact, the most appropriate way to describe the restricted set of characters in a given standard language is to list all the letters; for example, the characters of the Finnish standard language are: a…š…z, ž, å, ä, ö.
The existing combinations of letters and diacritics can and, in fact, should be used for standard languages. In contrast, they should not be used for transcription, since transcription differs from literary languages in this respect. Transcription is based on combining letters and diacritics to describe phonemes at the degree of precision the context requires; it has no fixed set of ready-made combinations of the kind that exists for standard languages. This is why these instructions recommend the typing method characteristic of the phonetic alphabet, in which the letters and any diacritics are typed separately. However, there is one exception to this rule, which concerns certain (front) vowels. The difference between back and front vowels is so significant in the Uralic languages that the letters ä, ö, ü are treated as basic letters without diacritics (this was also E. N. Setälä’s proposal for the Uralic phonetic alphabet in 1902, FUF 1: 36). The Swedish letter å is likewise considered a base character without diacritics here.
To ensure compatibility and the desired result, the following rules must be followed strictly:
- The letter, the base character, is typed first, followed by the related diacritics.
- The diacritics below the letter are typed first, followed by the diacritics above the letter.
- The diacritics closest to the letter are typed first, followed by the diacritics further away from the letter.
A strictly regulated order is the prerequisite for the achievement of letter-diacritic combinations that are typographically satisfactory. The encoding is not a scientific interpretation of the quality of the phoneme which is described, but rather a means of enabling the electronic processing of written data.
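The first two rules can be checked mechanically. The Python sketch below is an illustration under stated assumptions: the function name follows_typing_order is hypothetical, and the check relies on the canonical combining classes of the universal character set (220 for marks below, 230 for marks above), so marks with other classes are ignored. The third rule, closest diacritic first, cannot be derived from these classes and is not checked.

```python
import unicodedata

BELOW, ABOVE = 220, 230  # canonical combining classes for marks below / above

def follows_typing_order(text):
    """Check: base character first, and marks below before marks above."""
    seen_base = False
    seen_above = False
    for ch in text:
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            # A new base character starts a new letter-diacritic combination.
            seen_base = True
            seen_above = False
            continue
        if not seen_base:
            return False  # a diacritic was typed before any base character
        if ccc == ABOVE:
            seen_above = True
        elif ccc == BELOW and seen_above:
            return False  # a mark below was typed after a mark above
    return True

print(follows_typing_order("a\u0323\u0301"))  # a + dot below + acute accent
print(follows_typing_order("a\u0301\u0323"))  # a + acute accent + dot below
```

The second rule agrees with the canonical ordering used by normalization: decomposing a + acute accent + dot below with unicodedata.normalize("NFD", ...) moves the dot below in front of the acute accent.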
The universal character set contains all the necessary characters. You can combine the base characters and diacritics freely. However, one restriction should be borne in mind: diacritics can only be combined with a base character by stacking them vertically; they cannot be placed next to each other horizontally.
Normalization
The previous section explained why the principles applied to the writing of a standard language are different from those applied to transcription. With a given standard language, existing combinations are used, e.g. the letters á or ù. In transcription, these letters must be written as two characters, e.g. a and the acute accent ´, or u and the grave accent `.
To solve this problem, a technical solution called normalization has been developed for the universal character set. Programs and data systems are gradually adopting this solution. In an environment supporting normalization, it makes no difference whether a given character is typed as one single character á or as two characters a + ´. Normalization identifies these inputs as one and the same character.
Normalization is both good news and bad news for the Uralic Phonetic Alphabet. The good news is that authors do not have to wonder how they should write a letter with diacritics. The bad news is that authors generally cannot influence the form in which their text is stored, i.e. whether á is stored as one character á or as two characters a + ´. Usually this does not matter, but in the compilation of language corpora and in the processing of electronic language data, normalization can prove problematic.
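The effect can be demonstrated with the normalization forms NFC (composed) and NFD (decomposed) available in Python's standard unicodedata module. This is a minimal sketch; the helper name same_text is an assumption made for the example.

```python
import unicodedata

composed = "\u00e1"     # á typed as one single character
decomposed = "a\u0301"  # a followed by COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after bringing both to the decomposed form NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(composed == decomposed)           # raw comparison of the stored forms
print(same_text(composed, decomposed))  # comparison after normalization
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

This also shows why normalization can be problematic for corpora: a search for the two-character form does not match text that has been stored in the composed form unless both the query and the data are normalized to the same form first.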
Encoding and typography
When using the universal character set, it is very important to remember that a character which is defined unambiguously by its code point can look very different in different fonts. This is precisely why choosing the right character must not be based merely on the glyphs displayed in the tables of the universal character set or on the characters of a given (special) font. This must be remembered especially when diacritics are combined with base characters.
It benefits the encoding that the same things are always expressed in the same way. Thus, in these instructions we have decided to mark, e.g., palatalization with the same symbol in connection with all characters (0301 COMBINING ACUTE ACCENT), although with the tall base characters (mainly b, d, f, h, k, l, t, β, δ) this symbol is then not placed in the usual way, next to the upper part of the letter (e.g. ĺ instead of ĺ). The placing of the palatalization symbol and its correct angle are, in fact, matters that should be solved separately for each font and font style. An ordinary writer is not in a position to make such typographical decisions.
The uniform use of the universal character set in transcription opens up completely new research prospects. At the same time, it enables a carefully planned typographical design of publications (by people who are well acquainted with the subject matter).
Typographical characteristics of fonts
In transcription, it is essential that the diacritics added to the base characters remain identifiable by their appearance and do not start to resemble other symbols. However, the typographical characteristics of certain fonts include solutions that are not suited to the Uralic Phonetic Alphabet. In this respect, the letters d, l and t are especially problematic when a caron is added to them.
The encoded combinations with a caron differ in appearance from what one would expect on the basis of the standard, i.e. the universal character set. In these cases, the caron resembles a dot or even a palatalization symbol in most fonts. This typographical solution is in line with the typographical practices of Czech and Slovak. Only in very few fonts does the caron remain a caron when it is combined with the above letters. This special typographical solution is a strong argument for not using the ready-made encoded letter-diacritic combinations in transcription. At the same time it shows that we should not trust the glyphs in the standards; instead, the characters should always be checked by looking at their names.
For example, in the fonts supporting the Czech and Slovak practices, the characters
010F ď LATIN SMALL LETTER D WITH CARON
013E ľ LATIN SMALL LETTER L WITH CARON
0165 ť LATIN SMALL LETTER T WITH CARON
generate letter-diacritic combinations that, in spite of their names, resemble characters indicating palatalized consonants in Finno-Ugric terms. The fact that the typographical result closely resembles Uralic palatalization (ľ ľ ť ť) in certain fonts does not change the fact that these letters are encoded with a caron (which does not indicate palatalization but signifies a strong friction noise). Nor is the system changed by the fact that the correctly encoded palatalized consonants (e.g. l + 0301 COMBINING ACUTE ACCENT: ĺ) may not look good on the page.
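Whatever a font displays, the encoded diacritic can be read from the character data. The following Python sketch decomposes a character and lists the names of its combining marks; the function name combining_marks is an assumption made for this example.

```python
import unicodedata

def combining_marks(text):
    """Return the names of the combining marks found after decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]

# The three precomposed caron letters, then l + COMBINING ACUTE ACCENT.
for item in ("\u010f", "\u013e", "\u0165", "l\u0301"):
    print(ascii(item), combining_marks(item))
```

The first three items report a caron regardless of how the font draws them; only the last one is encoded with the acute accent used here for palatalization.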
The correct placement of the palatalization symbol, like the appearance of diacritics in general, is a typographical issue, and we should not let it affect our choice of characters (i.e. how the data are encoded). It is of primary importance that we
- type (= encode) letters and diacritics in a logical and consistent way and
- use a font whose typographical characteristics fulfill the requirements of the Uralic Phonetic Alphabet.
Typographical characteristics of programs
The typographical solutions of different programs, or the lack of them, also affect how correctly typed (= encoded) diacritics are rendered on screen or on paper. Unfortunately, it is not enough that the system and the application software are able to process the universal character set (= “that they support Unicode”). Most application software (e.g. Word) cannot stack diacritics correctly.
Page layout software and certain other programs (e.g. Mellel on Mac OS X), by contrast, manage to process diacritics in a typographically correct way.
Small capitals, superscripts and subscripts
Traditionally, small capitals and small letters have been produced by reducing the size of the font. Correspondingly, superscripts and subscripts, i.e. characters above and below the baseline, have been produced by raising or lowering them relative to the baseline. These are typographical means, but from the perspective of transcription, they carry meaning. Small capitals represent different phonemes than lowercase letters. This is why small capitals, superscripts and subscripts have their own code points in the universal character set. Using them, ʙ (0299) and b (0062) can be distinguished from each other by their encoding alone.
At the turn of the millennium, the universal character set was supplemented with a large number of Uralic diacritics that had been lacking from it. These characters also included many ascending small capitals. From the point of view of the Uralic Phonetic Alphabet, these ascending letters are not “uppercase” but small capitals. They should not be used and, in fact, their selection is insufficient for the needs of the Uralic Phonetic Alphabet. You cannot type these ascending small capitals using the UPA keyboard.
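The distinction between ʙ (0299) and b (0062) can be made visible in Python with the standard unicodedata module. This is only a sketch: the helper is_small_capital is a hypothetical name, and testing for the words SMALL CAPITAL in the official character name is a heuristic assumed for this example, not a property defined by the standard.

```python
import unicodedata

def is_small_capital(ch):
    """Heuristic: the official character name contains 'SMALL CAPITAL'."""
    return "SMALL CAPITAL" in unicodedata.name(ch, "")

for ch in ("\u0299", "b", "B"):
    print(f"{ord(ch):04X}", unicodedata.name(ch), is_small_capital(ch))
```

Because the small capital is a separate character, a search or sort that works on code points treats it as distinct from both b and B, with no font information needed.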
Italics
Data transcribed according to the Uralic Phonetic Alphabet have traditionally been italicized in the same way as other linguistic data. Italicization is a typographical means and does not affect the encoding of the characters. In electronic publishing and in, e.g., electronic databases, it may be justified to omit italics in order to enhance readability. In contrast, the traditional italicization is worth preserving in printed texts.





