Describing Handwriting, Part VII: Chinese (Han) Script
[Please note that this post contains Chinese characters which may not display correctly on some browsers.]
Chinese, Tibetan and Devanagari are all scripts that I have wanted for some time to test on the DigiPal model for describing handwriting, and the recent meeting of Le Groupe de recherches transversales en paléographie at the École pratique des hautes études has inspired me to return to this topic. Indeed, brief discussion at the meeting suggests that the model might work much better than I had expected for Tibetan, but this is something I hope to return to later. In the meantime, I have thought for some time now that it would apply very easily to writing in Chinese, not least because the very early history of the DigiPal project is directly connected to this writing-system. The DigiPal annotation tool was in large part inspired by the 'Chopper' software developed for the International Dunhuang Project for the analysis of documents in Chinese and Tibetan. Furthermore, my underlying model for handwriting was partly inspired by the Character Description Language (字形描述语言 (字描语)) which was developed by the Wenlin Institute for the description of written Chinese and which is now incorporated into Appendix F of Unicode Standard 6.1 (see also discussion of this in Stokes 2012). Their system includes components and sub-components which can ultimately be broken down into strokes, and they report that they have used this to describe nearly 100,000 characters to date in an XML database. However, unlike the DigiPal model, CDL is designed to describe glyphs, i.e. characters independent of the writer. The question therefore arises whether the DigiPal model can be used to describe scribal practices in Chinese. It has been a long time since I studied Chinese, and I have never engaged with the calligraphy or palaeography of the language, so I await the advice of experts in the matter to develop this fully. However, I have considered it at some length and discussed it in principle with one expert in the field, and it does seem that the model transfers very well indeed.
The Chinese writing-system is based on a set of elements which are typically combined to form characters. In general, characters can be divided into the radical, namely the element which usually helps to establish the meaning, and the rest of the character. For example, māmā (mother) is written 妈妈, where the first part (女) is the radical and gives the meaning of ‘female’ or ‘woman’ and the second part (马) gives the general pronunciation of ‘ma’. The first part is also a character in its own right (nǚ). In contrast, the question particle ma (吗) has the same second part, here indicating the same general pronunciation, though with a different tone. The radical this time is 口 (kǒu) which represents the mouth and here again hints at the meaning, namely relating to speech.
Both the radical and the ‘rest of the character’ can themselves usually be broken down into sets of smaller components which tend to recur, and so on until finally arriving at a small set of basic stroke-types (zhá: 札) which – with minor variations – are traditionally recognised as the basis of the writing system: horizontal, vertical, curving, dot, or turning. These basic types are further refined and combined to produce a list of thirty-nine fundamental strokes, and it is this list that now forms Appendix F of the Unicode Standard and from which the Wenlin Institute have produced their descriptions of 100,000 characters (Bishop and Cook 2004). Some examples are:
- 冫 (lower stroke of)
- 人 (right stroke of): falling to right (nà)
- Horizontal + hook (héng-gōu)
How, then, does this map to the DigiPal model? In principle, very simply. DigiPal characters map straightforwardly to Chinese characters (where ‘character’ here can be defined by the researchers but the Unicode point would be a good starting-place). As the CDL specification notes, characters do show some variation in strokes, but this can be represented by DigiPal allographs. DigiPal Components map directly to (and were inspired by) CDL components, and here sub-components are clearly necessary. So far so good.
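To make the mapping concrete, here is a minimal sketch in Python. It is only an illustration of the concepts and not the actual DigiPal or CDL data model: the class names `Component` and `Allograph`, the allograph label, and the stroke breakdowns used as sample data are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Component:
    """A component that may contain sub-components (hypothetical class)."""
    name: str
    parts: list[Component] = field(default_factory=list)

    def strokes(self) -> list[str]:
        """Flatten the hierarchy into its basic strokes, in order."""
        if not self.parts:  # a leaf is treated as a single stroke
            return [self.name]
        return [s for part in self.parts for s in part.strokes()]


@dataclass
class Allograph:
    """One recognised way of writing a character (hypothetical class)."""
    character: str  # the Unicode character is used as the identifier
    label: str
    components: list[Component]

    def strokes(self) -> list[str]:
        return [s for c in self.components for s in c.strokes()]


# Sample data for illustration only, with strokes named in pinyin.
kou = Component("口", [Component("shù"), Component("héng-zhé"), Component("héng")])
ma = Component("马", [Component("héng-zhé"), Component("shù-zhé-zhé-gōu"), Component("héng")])
ma_particle = Allograph("吗", "simplified form", [kou, ma])

print([c.name for c in ma_particle.components])
print(ma_particle.strokes())
```

The recursion in `strokes()` is what makes sub-components possible: a component can be queried as a unit (all instances of 口) or expanded down to its strokes.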
How to proceed from here depends somewhat on the research questions and what one wants to characterise with the tool. An obvious approach would be to characterise just the five or six main strokes as components, and to classify the remaining strokes by using features, and this is indeed possible. However, the DigiPal model is intended not just to describe allographs but also to characterise individual scribal variation. It therefore seems more useful to specify each of the thirty-nine stroke types as distinct components, and then to use features to indicate individual scribal practice within this. This may seem a slight variation from the method used in DigiPal, since stroke types in Chinese include factors such as ‘tapering’ or ‘hooked’, and these would be Features in DigiPal, at least as applied to medieval Latin script. However, unlike Latin, in Chinese script the tapering or hooked nature of the stroke is often an inherent and necessary part of the character: it is incorrect to write a héng-gōu stroke instead of héng, whereas the presence or absence of a foot on i is more of a stylistic choice. If we retain the general principle that components should be necessary and features are individual, then it follows that each sub-type of stroke should be defined as a distinct component. Features could then include factors such as the thickness of the stroke (including the change thereof), decorative elements, degree of straightness or curvature, and indeed anything else that palaeographers may wish to consider.
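The distinction between necessary stroke types and individual features can be sketched as follows. This is a hypothetical illustration rather than DigiPal code: the class names, the hands and the feature labels (such as "thickening") are invented for the example, and only four of the thirty-nine stroke types are listed.

```python
from dataclasses import dataclass, field

# A small subset of the stroke types, for illustration only.
STROKE_TYPES = {"héng", "héng-gōu", "piě", "nà"}


@dataclass
class Stroke:
    stroke_type: str                            # component: required by the character
    features: set = field(default_factory=set)  # features: individual scribal practice

    def __post_init__(self):
        # The stroke type is fixed by the character, so unknown types are rejected.
        if self.stroke_type not in STROKE_TYPES:
            raise ValueError(f"unknown stroke type: {self.stroke_type}")


@dataclass
class Graph:
    """One written instance of a character by a given hand (hypothetical class)."""
    character: str
    hand: str
    strokes: list


def hands_using(graphs, stroke_type, feature):
    """Return the hands that write the given stroke type with the given feature."""
    return sorted({
        g.hand
        for g in graphs
        for s in g.strokes
        if s.stroke_type == stroke_type and feature in s.features
    })


graphs = [
    Graph("人", "Hand A", [Stroke("piě", {"curved"}), Stroke("nà", {"thickening"})]),
    Graph("人", "Hand B", [Stroke("piě", {"straight"}), Stroke("nà", {"thickening", "flicked end"})]),
    Graph("人", "Hand C", [Stroke("piě", {"curved"}), Stroke("nà", {"even width"})]),
]

print(hands_using(graphs, "nà", "thickening"))  # ['Hand A', 'Hand B']
```

Here the stroke type is validated against a closed list, because a wrong stroke type means a wrong character, while the features are left as free labels for whatever the palaeographer wishes to record.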
This model seems conceptually correct and allows for a wide variety of queries in much the same way that we have for Latin script. What are its weaknesses, then? Certainly it is only applicable to ‘set’ script and would not work at all for calligraphic Chinese, but a very large amount of historical content is ‘set’ in this sense. A larger concern could be its practicality. As the CDL description mentions, the range of characters in Chinese is essentially open-ended, particularly when a wide historical and geographical range is considered. To use this model in the DigiPal system, then, one must define all of the characters and components occurring in the corpus – a list which presumably cannot be known in advance – and then select from this list of potentially tens of thousands of characters when annotating. This could be helped significantly by the CDL specification, since the Wenlin Institute’s list of 100,000 characters could be imported very straightforwardly into an instance of the DigiPal framework. One would then need a mechanism for finding the character in question, and certainly the current system of a drop-down list for every character is not feasible. This challenge is not specific to DigiPal, however, but applies in general to the entry of Chinese text into a computer, and many possible solutions have been proposed (including by the CDL itself). This approach of annotating entire characters was followed by researchers of the Dunhuang project and also seems more useful here than the alternative of annotating specific components or even strokes separately, for instance by drawing boxes around vertical strokes or around occurrences of the 口 (kǒu) component. Such an alternative is perfectly possible and may be preferable for some research questions but is significantly less powerful.
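One simple possibility for finding a character without a drop-down list is to look it up by the components it contains. The following Python sketch illustrates the idea under stated assumptions: it is not the mechanism proposed by CDL or used in DigiPal, the function names are hypothetical, and the three-entry decomposition table is a toy stand-in for data that might be derived from the CDL descriptions.

```python
from collections import defaultdict

# Toy decomposition table (assumption): character -> top-level components.
DECOMPOSITIONS = {
    "妈": ["女", "马"],
    "吗": ["口", "马"],
    "人": ["人"],
}


def build_index(decompositions):
    """Build an inverted index from each component to the characters using it."""
    index = defaultdict(set)
    for character, components in decompositions.items():
        for component in components:
            index[component].add(character)
    return index


def find_characters(index, components):
    """Return the characters that contain every one of the given components."""
    if not components:
        return []
    candidate_sets = [index.get(c, set()) for c in components]
    return sorted(set.intersection(*candidate_sets))


index = build_index(DECOMPOSITIONS)
print(find_characters(index, ["马"]))        # both characters containing 马
print(find_characters(index, ["口", "马"]))  # narrowed to ['吗']
```

Each additional component narrows the candidate list, so an annotator would only ever need to choose from a handful of characters rather than from tens of thousands.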
Indeed it seems clear that components here are strongly affected by their position: the 撇 (piě: ‘falling to left, not very curved’ stroke) in 人 (rén, ‘person’) is different from the one in 你 (nǐ, ‘you’) for instance, even though the former character is indeed the radical of the latter.
Again, I await the thoughts of Chinese palaeographers on this proposal. Although undoubtedly needing revision and refinement, it does seem to me that it would apply very easily to a lot of historical material, such as much of that from Dunhuang. I would be very happy indeed to discuss this further with anyone working with Chinese script, regarding the usefulness and applicability of the model and also what features you might want to record. In the meantime I had better start reading up on Tibetan!
- Bishop, Tom, and Richard Cook (2003). A Specification for CDL: Character Description Language. Wenlin Institute. Available at http://www.wenlin.com/cdl/cdl_spec_2003_10_31.pdf
- Bishop, Tom, and Richard Cook (2004). Character Description Language (CDL): The Set of Basic CJK Unified Stroke Types. Wenlin Institute. Available at http://www.wenlin.com/cdl/cdl_strokes_2004_05_23.pdf
- 'Types of Script in Chinese Manuscripts from Central Asia', International Dunhuang Project: The Silk Road Online. Available at http://idp.bl.uk/education/paleography/chinese/script_types.html
- Stokes, P.A. (2012). 'Palaeography and the "Virtual Library"'. In B. Nelson and M. Terras (eds), Digitizing Medieval and Early Modern Material Culture. Tempe, AZ: Arizona Center for Medieval and Renaissance Studies. 137–69.
- Unicode (2012). 'Appendix F: Documentation of CJK Strokes'. In Julie D. Allen et al. (eds), The Unicode Standard, Version 6.1: Core Specification. Mountain View, CA: The Unicode Consortium. Available at http://www.unicode.org/versions/Unicode6.1.0/appF.pdf