By Ma Jiahua (马家骅), written before dawn on April 30, 2019, in State College
Classical Chinese poetry is an active research field for scholars of Chinese language and literature. The poetry of the Tang dynasty (7th to 10th centuries) is widely regarded as the high point and golden age of Chinese poetry, and scholars in every later era have studied and admired it as the form's most prosperous period. Tonic patterns and imagery have been researched extensively, but to our knowledge no one has used machine learning to test the resulting hypotheses. We therefore applied machine learning to predict the period a poet belonged to from the meter of his or her poems.
Introduction to Complete Tang Poems
Quan Tangshi (Complete Tang Poems, CTP) is the largest collection of Tang poetry. It contains around 49,000 lyric poems by more than 2,200 poets of the Tang dynasty. It was commissioned in 1705, during the Qing dynasty, at the direction of the Kangxi Emperor and published under his name. Although CTP is the major reservoir of surviving Tang poems, it was compiled in haste. The editors did not collate the variant versions of poems, so better versions of some poems survive in individually edited volumes. Most of the poems in CTP are from the Tang dynasty, with some from one or two dynasties earlier or later (420 AD to 1127 AD). However, some poems were included by mistake. The best-known instance is 唐温如 (Tang Wenru), who lived in the Yuan dynasty (1271 AD to 1368 AD) but was treated as a Tang poet. The poems in CTP are arranged in sections. Most sections are arranged by author, each with a brief biography. Others are devoted to emperors and consorts, Yuefu (Music Bureau-style poems), women, monks, priests, spirits, ghosts, dreams, prophecies, proverbs, mysteries, rumors, and drinking.
Background & Hypothesis
We want to study how Tang poetry relates to the development of the Tang dynasty and how machine learning can be combined with the creation of poetry. Although many scholars, from ancient times to the present, have analyzed the vocabulary, imagery, and meter of Quan Tangshi, their analysis has mostly relied on accumulated knowledge and experience. A recent article analyzed Quan Tangshi with text mining. Although it has not been published in an academic journal, it is a useful reference for working out how to improve the model and carry out the analysis more rigorously. For this reason we invited Ruilong Gong, a Peking University graduate with a double major in Chinese Language and Literature and in Chemistry, to provide constructive comments on the project.
Scholars have debated how meter developed in Tang poetry and to which period poets without surviving life records belonged, but none, as far as we know, have used machine learning to test their hypotheses. Most have relied on style and vocabulary to guess a poet's period. Tonic pattern analysis could offer another way to date poets whose birth and death records are unknown. We hypothesized that the tonic patterns of Tang poetry differ measurably between eras, and we planned to use machine learning to test this hypothesis.
Background Knowledge of Classical Chinese Poetry
Some background is needed before we introduce the data. Classical Chinese poetry has two main types: gushi (ancient-style poetry) and jintishi (regulated verse, or modern-form poetry). The name modern-form is relative: the form was modern to people from the Tang dynasty onward (around 600 AD). Gushi does not have to follow any tonic rules or fixed syllable format. It developed from the Han dynasty (200 BC to 200 AD) and comes in three types: four-syllable, five-syllable, and seven-syllable poems. Jintishi is regulated by tonic patterns and syllable restrictions. It has two types, jueju (four-line poem) and lvshi (eight-line poem), and each has a five-syllable and a seven-syllable form. Counting characters, this gives four types of jintishi: five-syllable jueju (twenty characters), five-syllable lvshi (forty characters), seven-syllable jueju (twenty-eight characters), and seven-syllable lvshi (fifty-six characters).
To understand the tonic pattern, we first need the four tones and the basic rules of classical Chinese poetry. The four tones of classical Chinese are 平 (ping, level tone), 上 (shang, rising tone), 去 (qu, departing tone), and 入 (ru, entering tone). They fall into two main categories: 平 (ping), the level tone, and 仄 (ze), meaning deflected, that is, any tone that is not ping. Alternating level and deflected tones gives a poem its sense of rhythm and beauty. Parallelism is another important rule in jintishi. Two parallel lines must match word for word: each word is paired, by comparison or contrast, with the word in the same position in the other line. The odd verse is called the bottom verse, and the even verse is called the top verse. The most common format of a five-syllable jueju, for example, is 平平平仄仄，仄仄仄平平，仄仄平平仄，平平仄仄平. In tonal terms, the rule requires that every even verse take the opposite tones from the previous verse, and that every odd verse after the first share the tones of the previous verse at the second, fourth, and sixth characters. Rhyme is mandatory and must fall on a level tone. It occurs at the last character of every even verse.
There are four fundamental verse types (eighteen in total once the special conditions called aojiu, or rescuing a mistake, are counted): A: (平平)仄仄平平仄, B: (仄仄)平平仄仄平, C: (仄仄)平平平仄仄, D: (平平)仄仄仄平平. The parenthesized syllables are the two added for the seven-syllable form: 平平 or 仄仄, whichever is opposite to the first syllable of the five-syllable verse. Combining the verse types under the parallelism rule yields four fundamental sequences: ABCDABCD, BDABCDAB, CDABCDAB, and DBCDABCD. Overall, given the first verse of a jintishi, one can derive the tonality of the rest of the poem from the rules.
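The derivation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's code: the names `next_verse`, `derive_sequence`, and `to_seven_syllables` are hypothetical, and the sketch covers only the four fundamental five-syllable verse types, ignoring the aojiu special conditions. It chooses each next verse with two checks taken from the rules in the text: an even verse must rhyme (end in 平) and differ from the previous verse at the second character, while an odd verse must not rhyme and must agree with the previous verse at the second character.

```python
# The four fundamental five-syllable verse types from the text.
VERSES = {
    "A": "仄仄平平仄",
    "B": "平平仄仄平",
    "C": "平平平仄仄",
    "D": "仄仄仄平平",
}


def next_verse(previous, is_even_verse):
    """Pick the verse type allowed to follow `previous` (hypothetical helper)."""
    prev_tones = VERSES[previous]
    for name, tones in VERSES.items():
        rhymes = tones[-1] == "平"                 # rhyme must be a level tone
        same_second = tones[1] == prev_tones[1]   # compare the second character
        if is_even_verse and rhymes and not same_second:
            return name
        if not is_even_verse and not rhymes and same_second:
            return name
    raise ValueError(f"no verse type can follow {previous}")


def derive_sequence(first, n_verses=8):
    """Return the verse-type sequence of a poem that opens with `first`."""
    sequence = [first]
    for position in range(2, n_verses + 1):
        sequence.append(next_verse(sequence[-1], position % 2 == 0))
    return "".join(sequence)


def to_seven_syllables(five):
    """Prefix the pair of tones opposite to the verse's first syllable."""
    prefix = "仄仄" if five[0] == "平" else "平平"
    return prefix + five


if __name__ == "__main__":
    for first in "ABCD":
        print(first, derive_sequence(first))
    # The five-syllable jueju used as the example in the text (opens with C).
    print("，".join(VERSES[name] for name in derive_sequence("C", n_verses=4)))
```

Starting from each of the four verse types, the sketch reproduces the four sequences listed above, and a four-verse poem opening with C gives the jueju example from the text.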
At first we tried to build a Tang poem generator from the CTP data we had. However, the poems it produced did not follow the classical Tang rules and lacked the imagery of the poems in CTP. We then moved on to a poem rating system, but realized that imagery is hard to detect and that some words can mean two different things, and we had no database of Tang vocabulary to resolve them. Returning to the data we had, we examined the strains (the tonic pattern recorded for each poem) to see what we could do with them. Some scholars have claimed that the strains differ from period to period. We therefore agreed to explore the strains across the eras of the Tang dynasty.
Data & Data Processing
At the beginning of the semester we found data from three sources: guoxue, souyun, and GitHub. Guoxue and souyun are websites with large collections of classical Chinese poetry and literature. The souyun source was a photocopied edition, so we decided not to use it. The guoxue source was an electronic edition, but it was split across a list of URLs. Searching GitHub, we found three versions of Quan Tangshi in three formats: JSON, SQL, and TXT. We decided to use the data from a GitHub repository created by jackeygao (JG), who collected it by web scraping. He stored the data in JSON files of one thousand poems each. There were 57 JSON files, for a total of around 57,000 poems. The data had four features: authors, paragraph, strains, and title. There were around 2,000 poets in the data, and the data did not state which period of the Tang dynasty each belonged to.
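As a rough sketch of how such a set of JSON files could be read into one list, the snippet below loads every file in a directory and counts the distinct poets. It makes several assumptions that are not confirmed by the text: each file is taken to hold a JSON array of poem records, the `*.json` file pattern is a placeholder, and the field name `authors` simply follows the feature list above and may differ in the actual files. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_poems(directory, pattern="*.json"):
    """Read every matching file and return one flat list of poem records.

    Assumes each file contains a JSON array of objects (one object per poem).
    """
    poems = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            poems.extend(json.load(handle))
    return poems


def count_poets(poems, author_key="authors"):
    """Count distinct author names; `author_key` is an assumed field name."""
    names = {poem.get(author_key) for poem in poems}
    names.discard(None)  # ignore records without an author field
    return len(names)
```

With 57 files of one thousand poems each, `len(load_poems(...))` would be expected to come out near 57,000, which is a quick sanity check before any further processing.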
We divided all poems into three categories: five-syllable poems, seven-syllable poems, and others. We manually labelled the roughly 2,000 poets by consulting the internet and books. The Tang dynasty has four major periods: Early Tang (before 712 AD), Flourishing Period (712-755 AD), Mid Tang (756-824 AD), and Late Tang (after 825 AD). However, some poets lived through more than one period, so their poems could not be assigned to a single era. We therefore added three transitional periods: Early to Flourishing, Flourishing to Mid, and Mid to Late. Poets about whom we could find no information were labelled unknown. In the end we had eight labels and applied one to each poet. The Tang dynasty held imperial examinations, which gave people the chance to work in the state bureaucracy, and writing poems was a major part of the exam. Most Tang poets studied hard for the exams, and after passing they also wrote poems about their work. For these poets we used the period in which they served in the state bureaucracy as the label.
Next, we labelled each poem by the number of syllables per verse and the number of verses. We first labelled five-syllable poems as wuyan and seven-syllable poems as qiyan, then labelled four-verse poems as jueju and eight-verse poems as lvshi. We dropped the remaining, unlabelled poems because they do not follow the rules of jintishi (regulated verse).
The strains column was important to us because we wanted to use it to predict and analyze each period of the Tang dynasty. It contains the tonic patterns, written with four symbols: 平, 仄, a question mark, and a circle mark. 平 is the level tone; 仄 covers the other three tones; a question mark means the content is missing or the character is no longer in use; a circle mark means the character can be either 平 or 仄. We dropped poems whose strains contain a question mark or circle mark, then changed 平 to p and 仄 to z. We split each strain into its verses and stored them in four or eight new columns, one per verse (jueju and lvshi respectively). Odd-numbered verses are called bottom verses, and even-numbered verses top verses.
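The labelling and encoding steps above can be illustrated with a small sketch. It is not the project's code, and it rests on assumptions: the strains are taken to be a string (or a list of strings) in which verses are separated by the full-width comma and full stop, and the function name `encode_poem` is hypothetical. A poem is rejected if any verse contains a symbol other than 平 or 仄 (which covers the question mark and circle mark), if its verses differ in length, or if it is not a four- or eight-verse poem of five or seven syllables.

```python
import re

METERS = {5: "wuyan", 7: "qiyan"}   # syllables per verse
FORMS = {4: "jueju", 8: "lvshi"}    # verses per poem


def encode_poem(strains):
    """Return (meter, form, verses as p/z strings), or None if the poem is dropped."""
    if not isinstance(strains, str):
        strains = "".join(strains)  # assumed: a list holds consecutive pieces
    verses = [v for v in re.split(r"[，。]", strains) if v]

    # Drop poems with missing (?) or ambiguous (circle) tone marks.
    if any(set(verse) - {"平", "仄"} for verse in verses):
        return None

    lengths = {len(verse) for verse in verses}
    if len(lengths) != 1:
        return None  # empty poem, or verses of unequal length

    meter = METERS.get(lengths.pop())
    form = FORMS.get(len(verses))
    if meter is None or form is None:
        return None  # not one of the four jintishi shapes

    encoded = [verse.replace("平", "p").replace("仄", "z") for verse in verses]
    return meter, form, encoded


if __name__ == "__main__":
    print(encode_poem("平平平仄仄，仄仄仄平平。仄仄平平仄，平平仄仄平。"))
```

Each encoded verse can then be stored in its own column, giving four columns for a jueju and eight for a lvshi.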
Five-syllable poems are the most fundamental, so we used the five-syllable lvshi data to train and test our model. A lvshi has eight verses of strains. After pre-processing we had 9,780 wuyan lvshi poems, of which 114 had an unknown period and the rest were labelled with one of the seven Tang periods. We used each verse's strain as a feature for predicting the period. We dropped rows with no strain or no period, including the 114 unknown-period poems. On inspection the data was unbalanced: a large share of the poems were from Late Tang. We kept the unbalanced data set for training and also built a balanced one by undersampling. The smallest class, the Early to Flourishing period, had 284 poems, so we randomly drew 284 poems from each period to form the new data set. We built models on both data sets.
We trained our models with XGBoost and AdaBoost, since both are boosting methods (XGBoost is a gradient boosting implementation). Under the rules there are only eighteen unique combinations of p and z, so each verse's strain is categorical data. We transformed the categorical data with OneHotEncoder and converted the period column to integers with LabelEncoder. For XGBoost we used objective = 'multi:softmax' with n_estimators = 200, and for AdaBoost we used n_estimators = 200.
XGBoost and AdaBoost reached accuracies of 39% and 38% on the unbalanced data set and 21% and 20% on the balanced data set. For comparison, uniform guessing among seven equally sized classes would score about 14%. We also tested the models with 10-fold cross-validation, which gave a similar accuracy, around 0.2% lower than the regular prediction score. The models trained on the unbalanced data set always predicted Late Tang, because that period had far more poems than the rest, so we disregarded them.
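The balancing, encoding, and evaluation steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's pipeline: it uses only scikit-learn's AdaBoostClassifier so that it runs without XGBoost installed, the helper names `undersample` and `evaluate` are hypothetical, and each poem is assumed to be a list of eight p/z verse strings. An `xgboost.XGBClassifier` could probably be placed in the same pipeline slot, but that is not shown here.

```python
import random
from collections import defaultdict

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


def undersample(rows, labels, seed=0):
    """Keep an equal number of rows per label: the size of the rarest label."""
    by_label = defaultdict(list)
    for row, label in zip(rows, labels):
        by_label[label].append(row)
    keep = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    out_rows, out_labels = [], []
    for label in sorted(by_label):
        for row in rng.sample(by_label[label], keep):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def evaluate(rows, labels, n_estimators=200, folds=10):
    """Mean cross-validated accuracy of AdaBoost on one-hot encoded verses."""
    X = np.array(rows)                       # shape: (poems, verses per poem)
    y = LabelEncoder().fit_transform(labels)  # periods as integers
    model = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),  # one column per verse pattern
        AdaBoostClassifier(n_estimators=n_estimators, random_state=0),
    )
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
```

Putting the encoder inside the pipeline means it is refitted on each training fold, so patterns that appear only in a held-out fold are ignored rather than leaking into training.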
Since the two models' accuracies on the balanced data set were very close, we wanted to see how each decided the period of a poem. We looked at each model's predicted probabilities and found that AdaBoost's were essentially uniform across the periods. XGBoost's were slightly more informative, since they were not uniformly distributed.
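One way to make the comparison of predicted probabilities concrete is to measure how close a probability vector is to uniform and to list its highest-scoring periods. The sketch below is an assumption about how this could be done, not the analysis the project ran: the function names are hypothetical, the input is assumed to be one row of per-class probabilities (such as a classifier's `predict_proba` output for a single poem), and the numbers in the demo are invented for illustration.

```python
import math

PERIODS = [
    "early", "early-flourishing", "flourishing", "flourishing-mid",
    "mid", "mid-late", "late",
]


def normalized_entropy(probabilities):
    """Return 1.0 for a uniform distribution and 0.0 for a one-class certainty.

    Requires at least two classes.
    """
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0)
    return entropy / math.log(len(probabilities))


def top_periods(probabilities, periods=PERIODS, k=3):
    """Return the k period names with the highest predicted probability."""
    ranked = sorted(zip(probabilities, periods), reverse=True)
    return [period for _, period in ranked[:k]]


if __name__ == "__main__":
    uniform = [1 / 7] * 7                                  # invented example
    peaked = [0.10, 0.30, 0.20, 0.05, 0.25, 0.05, 0.05]    # invented example
    print(normalized_entropy(uniform), normalized_entropy(peaked))
    print(top_periods(peaked))
```

A value near 1.0 means the model is not separating the periods at all, while a lower value indicates that some periods are favored over others.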
We then looked at the XGBoost tree plot to see how the classifier made its predictions. The model used f1 at the root and then split on f67 and f42. In other words, a feature from the first line of strain made the first split between periods. Because the predicted values had also been transformed by LabelEncoder, it was hard to read off which period each leaf corresponded to, but we could see how the trees were divided and how XGBoost reached its decisions.
Our models were not very accurate, so we concluded that strains alone are not enough to predict a poem's period. Our classes also overlap: the Early to Flourishing period, for example, includes parts of both the Early and the Flourishing periods, which makes the periods hard for a classifier to separate. Even though the accuracy looked low, the XGBoost probabilities showed that the model placed a poem slightly closer to a certain range of time, with three related periods scoring higher than the rest. So although we cannot predict a poem's period well, we can narrow the whole Tang dynasty down to a smaller time range.
Figures 3-9 show the strain frequencies for Early Tang, Early to Flourishing Tang, Flourishing Tang, Flourishing to Mid Tang, Mid Tang, Mid to Late Tang, and Late Tang. In each figure the graph on the left shows the bottom verses and the graph on the right shows the top verses. There are seven normal verse types. As introduced in the background section, the four fundamental types are A: 仄仄平平仄, B: 平平仄仄平, C: 平平平仄仄, D: 仄仄仄平平. Since the first character of A, C, and D can be either 平 or 仄, there are three variations: 平仄平平仄, 仄平平仄仄, and 平仄仄平平. These seven strains strictly follow the rules; all other strains are special conditions. Although the seven regular strains are used in the same way in every era of the Tang dynasty, analyzing how the use of irregular strains changed helps reveal how the tonic pattern developed. Figures 3-9 show that tonality was most varied in the Early Tang period. At the beginning of the Tang dynasty the rules were starting to form, but poets did not yet fully obey them. As the Tang dynasty and its imperial examination system developed, the tonic patterns of poetry became more and more regular.
The frequency of strain usage also reflects the development of jintishi, which flourished over the course of the Tang dynasty and reached its peak in the Late Tang period.
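The comparison of regular and irregular strains across periods can be expressed as a short calculation. The sketch below is illustrative only: the function names are hypothetical, verses are assumed to be already encoded as p/z strings, and the three variations are generated by flipping the first character of types A, C, and D, following the rule stated above. The sample poems in the demo are invented and are not taken from CTP.

```python
from collections import Counter

# Five-syllable fundamental types in p/z form (p = 平, z = 仄).
BASE = {"A": "zzppz", "B": "ppzzp", "C": "pppzz", "D": "zzzpp"}


def regular_patterns():
    """Return the seven regular strains: A, B, C, D plus variants of A, C, D."""
    flip = {"p": "z", "z": "p"}
    patterns = set(BASE.values())
    for name in ("A", "C", "D"):
        verse = BASE[name]
        patterns.add(flip[verse[0]] + verse[1:])  # first character may vary
    return patterns


def irregular_share(poems):
    """Map each period to the fraction of its verses that are not regular.

    `poems` is an iterable of (period, list of p/z verse strings).
    """
    regular = regular_patterns()
    totals, irregular = Counter(), Counter()
    for period, verses in poems:
        for verse in verses:
            totals[period] += 1
            if verse not in regular:
                irregular[period] += 1
    return {period: irregular[period] / totals[period] for period in totals}


if __name__ == "__main__":
    sample = [  # invented data, for illustration only
        ("early", ["pppzz", "zzzpp", "pzpzz", "ppzzp"]),
        ("late", ["zzppz", "ppzzp", "pppzz", "zzzpp"]),
    ]
    print(irregular_share(sample))
```

A share that falls from the early periods to the late ones would be consistent with the pattern described above, in which tonality becomes more regular over time.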
The results from our model were unsatisfactory. CTP shows a relationship between meter and the development of the Tang dynasty, but the relationship was not distinct enough for our machine learning methods to capture. Although the result was not what we expected, we tried a method that, as far as we know, had not been used before. Combining the analysis of tonic pattern, vocabulary, and imagery might give more accurate predictions. Jintishi (regulated verse) developed from the Early Tang, flourished over the course of the dynasty, and reached its peak in the Late Tang period. The database of poets' periods, which took almost twenty percent of our time, is work that to our knowledge no one had done before. This multidisciplinary study taught us a great deal about both data science and Chinese poetry. We look forward to finding better ways to analyze classical Chinese poetry in the future.
Work Done & Time Log
We worked as a team throughout the project, and we all contributed to every part of it. We did an equal amount of work on the poem generator, which we tried to build together, and we did the data processing for the final version of the project together. Hai did most of the machine learning, while Jiahua analyzed the parameter tuning of the models. Since Jiahua knew more about the background of Tang poetry, he guided Hai on how to label the data and how to find the poems that follow the classical rules. We could not have finished the project without each other's contributions. We also thank Ruilong Gong for his constructive feedback and suggestions. Whenever we were lost or had to change the aim of our research, Ruilong came up with professional ideas and guided us.
The code of this project can be found at: https://github.com/NaxxHua/qtstonicanalysis
[张撝之99] 刘德重, 张撝之, 沈起炜. 中国历代人名大辞典 (Dictionary of Chinese Personal Names through the Ages). 上海古籍出版社, 1999.
[Lia05] 廖继莉 (Jili Liao). 唐诗声律研究 (Tang poetry classic tones research). 2005. url: http://www.doc88.com/p-1148032515955.html.
[Kro11] Paul Kroll. "Poetry of the T'ang dynasty", in Mair, Victor (ed.), The Columbia History of Chinese Literature. New York: Columbia University Press, 2011.
[jac] jackeyGao (JG). chinese-poetry. url: https://github.com/chinese-poetry/chinese-poetry/tree/master/json. (accessed: 03.19.2019).
[国学网] 国学网. 全唐诗 (Quan Tangshi). url: http://www.guoxue.com/qts/qts_sml.htm. (accessed: 02.18.2019).
[搜韵] 搜韵. 全唐诗 (Quan Tangshi). url: https://sou-yun.com/eBookIndex.aspx?id=2484. (accessed: 02.18.2019).
[苏格兰] 苏格兰折耳喵. 【数据挖掘实操】用文本挖掘剖析近5万首《全唐诗》 ([Data Mining Practical Operation] Using Text Mining to Analyze Nearly 50,000 Poems of the Full Tang Poetry). url: https://mp.weixin.qq.com/s/cJ20QSSKhST69CANsF_Ltw. (accessed: 03.15.2019).