
#D-TW06 Morphological Analysis and Dependency Parsing (Tokenizing & Parsing Japanese)

In the last article I wrote the second half of the data cleaning part, which was about how I delete unnecessary expressions from each tweet. At the time I felt I was writing about the unsexy part, but then I realized that natural language processing is mostly not very sexy… Yes, this article is about tokenizing, and it also involves lots of back-and-forth manual work… OK, please raise your hand if you are someone who loves this kind of diligent job. I need your help. Lol

  1. What is tokenizing

  2. How do you tokenize Japanese

  3. After tokenizing

What is tokenizing
Tokenizing basically means splitting a sentence into smaller units and adding attribute information to each word (to put it very simply). A single word in Japanese is not as obvious as in English (see the chart above), because Japanese is written without spaces between words, so the accuracy of "splitting words" is critical (and difficult too). Then you have to add "part of speech" information and also the original form of the word. Simply put, you may want to count "consider" and "considered" as the same word. To do so, you have to know what the original form of the word is. You also need the part-of-speech information to, let's say, eliminate words you don't need, such as punctuation.
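
As a rough sketch of the idea (not the actual code from this project), the snippet below counts words by their original form and drops punctuation. The token list is typed by hand for illustration, and the names `TOKENS`, `DROP_POS`, and `count_lemmas` are made up for this example; in practice the tuples would come from a tokenizer.

```python
from collections import Counter

# Hand-written toy tokens: (surface form, original form, part of speech).
# In a real pipeline these would come from a tokenizer, not be typed by hand.
TOKENS = [
    ("I", "I", "PRON"),
    ("considered", "consider", "VERB"),
    ("it", "it", "PRON"),
    (",", ",", "PUNCT"),
    ("and", "and", "CCONJ"),
    ("they", "they", "PRON"),
    ("consider", "consider", "VERB"),
    ("it", "it", "PRON"),
    ("too", "too", "ADV"),
    (".", ".", "PUNCT"),
]

# Parts of speech to drop before counting (an illustrative choice).
DROP_POS = {"PUNCT"}


def count_lemmas(tokens, drop_pos=DROP_POS):
    """Count words by original form, skipping unwanted parts of speech."""
    return Counter(lemma for _surface, lemma, pos in tokens if pos not in drop_pos)


counts = count_lemmas(TOKENS)
print(counts.most_common(3))
```

Because the count is keyed on the original form, "considered" and "consider" land in the same bucket, and the punctuation tokens never reach the counter.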

How do you tokenize Japanese
There are a few famous tools for tokenizing Japanese, and I chose GINZA for my analysis. The most famous one is, I think, MeCab. Google and Yahoo, for example, also have their own cloud API services. Please choose whatever fits your purpose. In terms of accuracy, especially for recent terms, Google has better quality than the others. But the issue is that I wanted to use it to analyze a huge Twitter database, and that would easily go over the free quota of the Google API. Since this is my side project, I also wanted to keep the cost as low as possible! So I gave up on using the Google API. I also wanted to analyze other languages after Japanese (not yet started…). GINZA is based on spaCy, which can be used for English too. So I chose GINZA. But I regretted it a bit, since there are very few references for it…
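
For readers who want to try it, here is a minimal sketch of reading tokens out of a spaCy pipeline. It assumes the `ginza` and `ja_ginza` packages are installed and that the model loads under the name `"ja_ginza"`; the helper names `load_pipeline` and `token_rows` are made up for this example and are not from my project code.

```python
def load_pipeline(model_name="ja_ginza"):
    """Load a spaCy pipeline; the "ja_ginza" model is assumed to be installed."""
    import spacy  # imported lazily so token_rows() can be used without spaCy

    return spacy.load(model_name)


def token_rows(doc):
    """Return (surface, original form, coarse POS, fine-grained tag) per token."""
    return [(t.text, t.lemma_, t.pos_, t.tag_) for t in doc]
```

With the packages installed, `token_rows(load_pipeline()("今日は働きたくない。"))` (roughly "I don't want to work today.") gives one row per token. Since the attributes used here are plain spaCy token attributes, the same helper should work with an English spaCy model as well.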

After tokenizing
While tokenizing helps you get single words, you may want to connect them again. For example, my first purpose is to analyze "don't want to work" tweets. So I wanted the ranking to differentiate "good weather" from "bad weather", rather than having "good", "bad", and "weather" separately. So I reconnect them to the phrase level where needed. This "needed" was the most troublesome part. I had to write each case one by one…
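
To give a feel for what one such case could look like, here is a simplified sketch that glues adjective modifiers back onto the word they depend on. The rules I actually wrote are not shown here; the toy parses, the choice of the `amod` label, and the name `merge_modifiers` are all assumptions made for illustration.

```python
from collections import Counter

# Toy parses typed by hand (not parser output).
# Each token: (original form, part of speech, dependency label, head index).
TWEET_A = [
    ("good", "ADJ", "amod", 1),
    ("weather", "NOUN", "nsubj", 2),
    ("help", "VERB", "ROOT", 2),
]
TWEET_B = [
    ("bad", "ADJ", "amod", 1),
    ("weather", "NOUN", "ROOT", 1),
]


def merge_modifiers(tokens, merge_deps=frozenset({"amod"})):
    """Join each head word with the modifiers attached to it via merge_deps.

    Modifiers are placed before their head; nested modifiers are not handled.
    """
    modifiers = {}  # head index -> modifier words to glue onto that head
    for i, (lemma, _pos, dep, head) in enumerate(tokens):
        if dep in merge_deps and head != i:
            modifiers.setdefault(head, []).append(lemma)

    phrases = []
    for i, (lemma, _pos, dep, head) in enumerate(tokens):
        if dep in merge_deps and head != i:
            continue  # already folded into its head word
        phrases.append(" ".join(modifiers.get(i, []) + [lemma]))
    return phrases


phrase_counts = Counter()
for tweet in (TWEET_A, TWEET_B):
    phrase_counts.update(merge_modifiers(tweet))
print(phrase_counts)
```

With this rule, "good weather" and "bad weather" are counted as separate phrases instead of sharing a single "weather" entry. Every additional case means another label or condition, which is where the one-by-one work comes from.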

Finally I reached the starting point of the analysis! By the way, in the next article, I will talk a bit about the "forecasting" part, which I used for the "Don't want to work Index" (Here
