A Homebrew Reuse Analyzer

June 2, 2021 06:47

Larry Koller, CommScope
August 1, 2020
Computers are good at brute-force tasks. For example, they can compare thousands of paragraphs with each other, looking for matches or near-matches without getting tired or bored.
A content developer can use the results to:
• find and correct inconsistencies
• create collection files for reusable text.
I built a simple reuse analyzer from existing open-source tools and code, using a script to loop through any number of topics (after stripping markup).

Behind the scenes
Reuse analysis uses a technique called fuzzy matching. In a traditional comparison, the result is always a Boolean — true or false. Fuzzy matching gives a floating-point result between zero and one, where 1 is a perfect match, 0 is no match at all, and 0.95 might be “close enough.”
For example, the following two strings are not identical, although in a technical document they should be:
Click OK to close the dialog.
Click OK to close the window.
Comparing these strings returns a score of 0.93 — in other words, 93% identical.
Fuzzy matching, at least in this implementation, is based on a metric called the Levenshtein distance: the minimum number of single-character edits (insertions, substitutions, or deletions) required to change one string into another. The algorithm that computes it looks complex but can be expressed in fewer than 30 lines of code. A WikiBooks page provides implementations in many different programming languages.
Calculating the score is equally simple: if l1 and l2 are the lengths of the two strings, and d is their Levenshtein distance, the score is (l1 + l2 - d) / (l1 + l2). For the two strings above, l1 = l2 = 29 and d = 4, so the score is 54/58, or about 0.93.
There are other fuzzy matching techniques, but I used this one as a starting point.
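The distance and score described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's script: the function names `levenshtein`, `score`, and `near_matches`, and the 0.9 threshold, are assumptions made for the example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def score(a: str, b: str) -> float:
    """Similarity computed as (l1 + l2 - d) / (l1 + l2)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (total - levenshtein(a, b)) / total


def near_matches(paragraphs, threshold=0.9):
    """Brute-force comparison of every pair; the threshold is illustrative."""
    matches = []
    for a, b in combinations(paragraphs, 2):
        s = score(a, b)
        if s >= threshold:
            matches.append((a, b, s))
    return matches


if __name__ == "__main__":
    s1 = "Click OK to close the dialog."
    s2 = "Click OK to close the window."
    print(levenshtein(s1, s2), round(score(s1, s2), 2))  # prints: 4 0.93
```

Under this formula, two equal-length strings with no characters in common still score 0.5, so useful thresholds sit near the top of the range.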

Preparing the content for analysis
Ideally, the content needs to be stripped of all markup. The text of one block element should all be on one line. My original thought was to write (or find) a DITA-OT plugin that would publish a bookmap to CSV, where each record would contain the file name and one block (or paragraph, if you prefer) of text.
Believe it or not, this took more effort than the analysis script. After a brief experiment with a “plain text” plugin, I decided to try exporting to Markdown, a transform built into DITA-OT 3.1 and newer. From there, a utility called pandoc stripped the remaining markup and eliminated line wrapping. The commands can be placed in a shell script:
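dita --format=markdown_github --input=book.ditamap --args.rellinks=none
cd out
# Convert each Markdown topic to plain text without line wrapping
for i in *.md; do
  f=$(basename "$i" .md)
  pandoc --wrap=none -t plain -o "$f.txt" "$i"
done
# Remove the text file generated from the index page
rm index.txt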
A medium-sized book contains perhaps 2000 to 3000 block elements. Creating a book-of-books would make it possible to look for reuse opportunities across multiple books.
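The file-name-plus-block records mentioned earlier could be gathered from the generated text files along these lines. This is only a sketch under stated assumptions: it treats every non-empty line of pandoc's unwrapped output as one block, and the function names `collect_blocks` and `write_csv` and the output name `blocks.csv` are hypothetical.

```python
import csv
from pathlib import Path


def collect_blocks(text_dir):
    """Yield (file name, block text) for every non-empty line in each .txt file."""
    for path in sorted(Path(text_dir).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            block = line.strip()
            if block:  # assumption: blank lines only separate blocks
                yield path.name, block


def write_csv(text_dir, csv_path):
    """Write one CSV record per block and return the number of records."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "block"])
        for name, block in collect_blocks(text_dir):
            writer.writerow([name, block])
            count += 1
    return count


if __name__ == "__main__":
    # "out" matches the directory used by the shell script above;
    # "blocks.csv" is a made-up output name.
    print(write_csv("out", "blocks.csv"), "blocks written")
```

Keeping the file name next to each block means any near-match found later can be traced back to the topic it came from.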
