Analyzing Classical Japanese Waka through Embedding Vectors
Yasuhiro Kondo (Aoyama Gakuin University)
yhkondo@cl.aoyama.ac.jp
This note is the English translation of my note found in this link.
Analysis of the Style of Japanese Waka Poetry Collections
Each classical Japanese waka poetry collection possesses its own character. For example, the "Manyoshu" celebrates nature and contains 'simple' poems, while the "Kokinshu" reflects the 'elegant' traditions of the imperial court. Though there are various ways to describe these characteristics, it's undeniable that each collection has its distinct style of poetry. This entry discusses the analysis of these styles using computers, specifically AI. It's an explanatory article based on my paper "Describing the Linguistic Variations in Waka Collections - An Analysis Using Large Language Models," published in volume 19, issue 3 of "Studies in the Japanese Language" (December 2023). The full paper will be available on J-STAGE in June 2024.
What are Embedding Vectors?
In this entry, we convert each waka poem into numerical values called embedding vectors using large language models (LLMs) like ChatGPT. By comparing these numerical values, we examine the overall characteristics of the poems.
Let's start with a simple explanation of 'embedding vectors.' Consider a basic example of vectorizing co-occurrence distribution.
高い 長い 登る 流れる 頂上 橋
山 [ 1 0 1 0 1 0 ]
川 [ 0 1 0 1 0 1 ]
Suppose we record, as 1 or 0, whether the words 'mountain' (山) and 'river' (川) co-occur with certain other words. In this case, 'mountain' can be represented as the six-dimensional vector [1 0 1 0 1 0], and 'river' as [0 1 0 1 0 1]. Weighting by frequency of occurrence can further increase accuracy. These vectors merely record the distribution of surrounding words, yet, as you can see, words with similar vectors clearly have similar meanings. This becomes evident if you construct the vector for the word 'hill' (丘) in the same way. (This idea originates in Z. Harris's 'distributional hypothesis'; Harris was Chomsky's mentor.)
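The toy table above can be turned into working code. Here is a minimal Python sketch of distributional vectors and their comparison by cosine similarity (the row for 'hill' (丘) is my hypothetical illustration, not given in the source):

```python
# Toy co-occurrence vectors, columns = [高い, 長い, 登る, 流れる, 頂上, 橋].
# 1 means the word in the row can co-occur with the column word, 0 means not.
vectors = {
    "山": [1, 0, 1, 0, 1, 0],  # mountain (from the example above)
    "川": [0, 1, 0, 1, 0, 1],  # river (from the example above)
    "丘": [1, 0, 1, 0, 0, 0],  # hill (hypothetical row, for illustration)
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

# 'hill' is distributionally much closer to 'mountain' than to 'river'.
print(cosine(vectors["丘"], vectors["山"]))  # high
print(cosine(vectors["丘"], vectors["川"]))  # 0.0 (no shared contexts)
```

The same comparison carries over unchanged to the 1536-dimensional LLM vectors used later; only the way the vectors are produced differs.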
Vector Creation with LLMs
Such vectors are created using deep-learning tools like Word2Vec, a groundbreaking invention by Mikolov, Sutskever, and colleagues at Google in 2013. The approach has since evolved, and vectors can now be created with more recent deep-learning architectures such as the Transformer and BERT. OpenAI currently offers an embedding-specific model called text-embedding-ada-002, based on GPT-3 (the foundation of ChatGPT) and publicly available on OpenAI's cloud. By accessing it via API, one can easily obtain 1536-dimensional vectors for words or sentences. This model has been trained on a vast multilingual corpus owned by OpenAI, and when you feed it words or sentences, it can discern similarities across languages, an incredibly powerful feature. It also performs quite well with classical Chinese and Japanese, so we will use it for our analysis (similar analyses can be done with local embedding LLMs, but that's a topic for another time).
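For illustration, here is roughly what a call to that embeddings endpoint looks like using only Python's standard library. This is a hedged sketch: the endpoint URL and model name follow OpenAI's public API documentation, but the helper names `build_request` and `embed` are mine, and an actual call requires a valid OPENAI_API_KEY and network access (the example below does not perform the call):

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/embeddings"

def build_request(text, model="text-embedding-ada-002"):
    """JSON payload for OpenAI's /v1/embeddings endpoint."""
    return {"model": model, "input": text}

def embed(text):
    """Request an embedding vector for `text` (requires OPENAI_API_KEY)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(text)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["data"][0]["embedding"]  # a list of 1536 floats for ada-002

# Example usage (not executed here):
#   vec = embed("ひさかたの光のどけき春の日に")
```

In practice one would batch requests and cache the returned vectors, since each poem only needs to be embedded once.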
Vector Searching Waka with Chinese Poetry
Using this capability, you can, for instance, search for waka directly with a Chinese poem as the query. This method, known as vector search, is commonly used in business to search manuals and FAQs, but it's also useful in literature and linguistics. Here's an example of searching for waka in the "Kokinshu" that closely resemble the meaning of a Chinese poem by Fujiwara no Atsumoto from the "Wakan Roeishu." (Internally, each text is vectorized, and the vectors are compared using cosine similarity.) This demonstrates how well the system understands the meanings of the classics, including Chinese literature. Next, we will look at the "Kokinshu" as a whole.
Principal Component Analysis of Vectors from "Kokinshu"
To understand the semantic structure of the entire "Kokinshu," we first vectorize every poem. However, 1536 dimensions are far too many for human comprehension. Such continuous, multidimensional data can be reduced, and its essential characteristics extracted, through principal component analysis. Here, we compressed the vector of each poem in the "Kokinshu" into two dimensions and mapped them onto the X-axis (first component) and Y-axis (second component). Each dot represents one poem's vector, with the poems at the maximum and minimum of each axis displayed. The first component is the most significant, followed by the second.
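The dimensionality reduction just described can be sketched in pure Python. This is a stdlib-only illustration of PCA via power iteration on the covariance matrix; an actual study would typically use a standard PCA implementation (e.g. scikit-learn), and the toy data here is mine:

```python
import random

def pca_2d(data):
    """Project `data` (a list of equal-length vectors) onto its first two
    principal components. Minimal power-iteration sketch, not production PCA."""
    n, d = len(data), len(data[0])
    # Center the data on the mean of each dimension.
    means = [sum(row[j] for row in data) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in data]

    def cov_times(v):
        # Compute (X^T X / n) @ v without building the full d x d matrix.
        proj = [sum(x * w for x, w in zip(row, v)) for row in X]
        return [sum(proj[i] * X[i][j] for i in range(n)) / n for j in range(d)]

    def top_component(deflate=None):
        rng = random.Random(0)
        v = [rng.random() for _ in range(d)]
        for _ in range(200):  # power iteration converges to the top eigenvector
            w = cov_times(v)
            if deflate is not None:
                # Project out the first component to recover the second.
                dot = sum(a * b for a, b in zip(w, deflate))
                w = [a - dot * b for a, b in zip(w, deflate)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        return v

    pc1 = top_component()
    pc2 = top_component(deflate=pc1)
    # Each input vector becomes an (x, y) point: its coordinates on pc1, pc2.
    return [[sum(a * b for a, b in zip(row, pc)) for pc in (pc1, pc2)]
            for row in X]
```

Applied to the 1536-dimensional poem vectors, the resulting (x, y) points are what the scatter plot below displays. Note that the sign of each component is arbitrary, which is why an axis may appear flipped between analyses.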
Image: Two-dimensional scatter plot of "Kokinshu"
Carefully examining each dot (especially the poems at the extremes of each axis) reveals the characteristics of each dimension.
For instance, in the first dimension (X-axis), the top-ranking poems depict human emotions, the so-called 'human affairs' (人事), while the bottom-ranking poems depict nature, 'scenic objects' (景物) (autumn is common here, but there are others).
In contrast, in the second dimension (Y-axis), the top-ranking poems involve the sounds of birds or rivers, while the bottom-ranking ones are about flowers. In essence, this creates a 'birds' vs. 'flowers' opposition axis.
Summarizing this in a simple XY-axis diagram, we get:
               bird
                |
                |
scenic ---------+--------- human
                |
                |
              flower
Surprisingly, LLM embedding vectors demonstrate that 'human affairs' vs. 'scenic objects,' and 'birds' vs. 'flowers,' form the primary axes of the semantic structure of the "Kokinshu." Evidently, the LLM can 'read' the "Kokinshu" correctly. Seeing these results for the first time was startlingly impressive. Encouraged by this, we also analyzed the "Manyoshu."
Principal Component Analysis of Vectors from "Manyoshu"
Like "Kokinshu," we vectorize "Manyoshu" and then examine the first and second principal components. We'll skip the scatter plot but show only the top and bottom-ranking songs.
In the first dimension, as in the "Kokinshu," there's a division between 'human affairs' and 'scenic objects' (though poems about 'birds' are prominent, there are others).
However, the second dimension is different. In the "Manyoshu," the top-ranking poems are predominantly about 'mountains,' while the bottom-ranking ones are about 'seas.'
Summarizing this on the XY-axis, we get:
             mountain
                |
                |
scenic ---------+--------- human
                |
                |
               sea
That is, in "Manyoshu," the second dimension represents a 'mountain-sea' opposition. This contrasts with the 'bird-flower' opposition in "Kokinshu." This aligns well with one conventional view of waka history, but the clarity of this result underscores the remarkable interpretive power of LLMs.
What caused the shift from "Manyoshu" to "Kokinshu"? - Viewing through the Vector of Chinese Poetry
Having clearly demonstrated the stylistic shift from the "Manyoshu" to the "Kokinshu" through vector analysis, what could be the cause? Let's consider this through vector analysis as well. The period between the "Manyoshu" and the "Kokinshu" was the so-called 'dark age of the national style,' when Chinese poetry and prose were highly revered. Hence, it makes sense to examine Chinese poetry. For this purpose, we'll use the Chinese poems in the "Wakan Roeishu," a collection favored by people of the Heian period (and one that draws richly on the anthology of Bai Juyi), and conduct the same vector analysis. Here, we present only the result of vectorizing the Chinese-poetry section of the "Wakan Roeishu" and running principal component analysis.
              visual
                |
                |
scenic ---------+--------- human
                |
                |
             auditory
Without a grounding in classical Chinese, it might be hard to grasp that it aligns this way. The first dimension, as in the Japanese collections, opposes 'human affairs' and 'scenic objects.' The second dimension contrasts things visible to the eye, like 'grass,' 'trees,' and 'humans,' with things audible to the ear, like 'oriole songs,' 'monkey cries,' and 'the songs of Chu heard on all sides' (四面楚歌), culminating in the 'visual' vs. 'auditory' opposition shown above.
This is very fitting as an intermediary between "Manyoshu" and "Kokinshu." That is, if we look at the change in the Y-axis (second dimension),
(Manyoshu)      (Chinese Poetry)      (Kokinshu)

 mountain    ⇒      visual       ⇒     flower
                       |
                       |
        scenic --------+-------- human
                       |
                       |
   sea       ⇒     auditory      ⇒      bird
It's conceivable that Japanese poetry transitioned from the world of the "Manyoshu," where poems were sung amid natural elements like 'mountains' and 'seas,' through the Chinese-poetry world's new aesthetic system of the 'visual' and the 'auditory,' to the "Kokinshu," where 'flowers' and 'birds' emerged as new, representative materials of this evolved aesthetic sense (though the Y-axis is reversed, this poses no statistical problem, since the sign of a principal component is arbitrary). Notably, in the Heian period, besides 'flowers' and 'birds' (花鳥), the aesthetic pair of 'wind' and 'moon' (風月) also existed, with 'wind' being auditory and 'moon' visual, aligning them with this sequence. Perhaps the semantic roots of the phrase 'flowers, birds, wind, moon' (花鳥風月) lie in this context?
Studying Embedding Vectors Contributes to Understanding AI Thought Processes
Thus, using LLM embedding vectors, we can analyze classical waka, but this approach isn't limited to waka. Such analyses are also possible with dictionaries and novels. (I presented on dictionaries at the Vocabulary and Dictionary Research Society; analyses of novels are scheduled for publication in "Quantitative Linguistics," the journal of the Quantitative Linguistics Society. Various related books are also planned, as is content expanding on this entry, so stay tuned. I'll post updates on X (Twitter) and other platforms.)
At the same time, as we inevitably engage with AI, it's crucial for us to understand how AI perceives the world. There's a question of what the humanities should do in the era of AI, and as shown here, studying AI's thought processes and concept understanding, then applying the findings to humanities questions, could be one approach. Of course, the pace of AI development is astonishing, and it's uncertain how far human-centered research can keep up, but these too are things I contemplate as I continue my research.