[Test] How many characters does it take for a paste into Claude to become a file attachment?

November 6, 2024, 08:12

tl;dr

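In Claude, pasting a reasonably long text turns it into a file attachment
It is passed into the context in the form `<source>paste.txt</source>`
The boundary for becoming a file attachment is 4,000 characters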

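None of this will be new to readers who already know it, but in the Claude web app, pasting a long text turns it into a file attachment. When I give ChatGPT a long text, I always wrap it in three backticks, so as someone who often summarizes articles and papers, I am very happy that Claude saves me keystrokes.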

It is well known that Claude uses XML notation to structure prompts, and file attachments are likewise placed inside a `<source>` tag.

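I had been curious about how many characters or tokens it takes for a paste to be treated as a file attachment, but I guessed it was around 2k and left it at that. Then I came across a text of just the right length (shown later in this post), so I ran a test.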

In short, the boundary was 4,000 characters. Text shorter than 4,000 characters is not treated as a file attachment and is entered as is.
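The rule above can be written as a one-line check. This is only a minimal sketch: it assumes that characters are counted the way Python's `len` counts them (Unicode code points) and that a paste of exactly 4,000 characters already becomes an attachment. Neither detail is confirmed by this test, and the names `ATTACHMENT_THRESHOLD` and `becomes_attachment` are made up for illustration.

```python
# Hypothetical helper. The threshold comes from the experiment in this post,
# not from any official Anthropic documentation.
ATTACHMENT_THRESHOLD = 4000  # characters


def becomes_attachment(text: str, threshold: int = ATTACHMENT_THRESHOLD) -> bool:
    """Return True if a paste of this length is expected to become a file attachment.

    Assumption: characters are counted as Unicode code points (Python's len).
    """
    return len(text) >= threshold


if __name__ == "__main__":
    # Same lengths as the two pastes measured in this post.
    print(becomes_attachment("a" * 3998))  # stayed inline in the experiment
    print(becomes_attachment("a" * 4562))  # became an attachment in the experiment
```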

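At first I guessed that the split was based on token count, but I found a case where the text with more tokens stayed inline while the text with fewer tokens became a file attachment, so I counted characters as well.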

| Tokens | Characters | Pasted inline? |
| ------ | ---------- | -------------- |
| 1,237  | 3,998      | ok             |
| 1,217  | 4,562      | ng (became a file attachment) |
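The reasoning behind the table can be checked mechanically. The sketch below asks, for each metric, whether any cutoff T exists such that "attach when value >= T" explains both measurements. The function name `threshold_range` and the data layout are my own illustration, and the "attach when value >= T" rule is an assumed model of the behavior.

```python
from typing import List, Optional, Tuple

# (tokens, characters, became_attachment) for the two pastes in the table above.
OBSERVATIONS: List[Tuple[int, int, bool]] = [
    (1237, 3998, False),
    (1217, 4562, True),
]


def threshold_range(samples: List[Tuple[int, bool]]) -> Optional[Tuple[int, int]]:
    """Return (low, high) such that any cutoff T with low < T <= high explains
    the samples under the rule "attach when value >= T", or None if no cutoff can.

    Assumes at least one inline sample and one attached sample.
    """
    inline_values = [value for value, flag in samples if not flag]
    attached_values = [value for value, flag in samples if flag]
    low = max(inline_values)     # largest value that still stayed inline
    high = min(attached_values)  # smallest value that became an attachment
    return (low, high) if low < high else None


token_range = threshold_range([(t, flag) for t, _, flag in OBSERVATIONS])
char_range = threshold_range([(c, flag) for _, c, flag in OBSERVATIONS])

print("token cutoff:", token_range)     # None
print("character cutoff:", char_range)  # (3998, 4562)
```

No token cutoff fits, because the 1,217-token paste was attached while the 1,237-token paste was not. A character cutoff does fit, and the 4,000-character boundary reported in this post falls inside the resulting range.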

As an aside, I previously built a tool for counting tokens for Claude / the Anthropic API, so please give it a try if you are interested!

https://t.co/tQohHSW3kx

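I have released Anthropic API Token Counter!

It supports prompts, Tool Use, image input, and PDF input

I built it for my own use, so it works without entering an API key, but if you plan to use it heavily, it would help if you issued and entered your own API key 🙏 (no logs are kept)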
— ぬこぬこ (@schroneko) November 2, 2024

Finally, for reference, the text I used for the test is reproduced below. Extracted with Safari Reader, it came to 3,998 characters. I happened to stumble on a good example.

phyworld/phyworld
In-Distribution and Out-of-Distribution Data

DIrectory id_ood_data contains code for generating training and evaluation data to test scaling abilities for in-distribution (in-dist) and out-of-distribution (ood) scenarios. It supports generating videos for scenarios including uniform motion, collision, and parabolic motion.

To generate collision videos at different data size levels:

# Collision videos with increasing data sizes
python3 two_balls_collision.py --data_name in_dist_v2 --data_size_level 1 --num_workers 64
python3 two_balls_collision.py --data_name in_dist_v2 --data_size_level 2 --num_workers 64
python3 two_balls_collision.py --data_name in_dist_v2 --data_size_level 3 --num_workers 64
To generate videos of uniform motion at different data size levels (e.g., 30k, 300k, 3M videos):

python3 one_ball_uniform_motion.py --data_name in_dist_v2 --data_size_level 0 --num_workers 64
python3 one_ball_uniform_motion.py --data_name in_dist_v2 --data_size_level 1 --num_workers 64
python3 one_ball_uniform_motion.py --data_name in_dist_v2 --data_size_level 2 --num_workers 64
To generate parabolic motion videos:

python3 one_ball_parabola.py --data_name in_dist_v2 --data_size_level 0 --num_workers 64
python3 one_ball_parabola.py --data_name in_dist_v2 --data_size_level 1 --num_workers 64
python3 one_ball_parabola.py --data_name in_dist_v2 --data_size_level 2 --num_workers 64
Note: The num_workers parameter specifies the number of parallel threads used for data generation. Adjust this based on your available CPU resources.

Evaluation Data (In-Distribution and Out-of-Distribution)

To generate evaluation data for visualization across different scenarios:

# Collision videos for evaluation
python3 two_balls_collision.py --data_for_vis

# Uniform motion videos for evaluation
python3 one_ball_uniform_motion.py --data_for_vis

# Parabolic motion videos for evaluation
python3 one_ball_parabola.py --data_for_vis
We build combinatorial data generation on the Phyre codebase. Follow the installation instructions in the Phyre repository to set up the combinatorial_data directory.

Training Data Generation from 60 Templates

Run the following command to generate training data from 60 templates:

# Replace $ID with values 0, 1, 2, 3, 4 and 5, with each ID generating 10 templates, totally 60 templates
python3 data_generator_v2.py --num_workers 64 --run_id $ID --data_dir ./train
Template Subsets for Training

For scaling analysis, you can use a subset of the training data:

6 templates: 10003, 10005, 10016, 10023, 10024, 10053
30 templates: Use the regular expression 100[0-5][02468] to select templates.
Evaluation Data from Reserved Templates

To generate evaluation data from 10 reserved templates:

python3 data_generator_v2.py --num_workers 64 --run_id 6 --data_dir ./eval
Data generation code for in-depth analysis
Develop evaluation code to parse velocity and calculate error metrics from video data
Here's a formatted table for the README under the "Data" section:

Data Type	Train Data (30K/300K/3M)	Eval Data	Description
Uniform Motion	30K, 300K, 3M	Eval	Eval data includes both in-distribution and out-of-distribution data
Parabola	30K, 300K, 3M	Eval	Eval data includes both in-distribution and out-of-distribution data
Collision	To upload	-	-
Combinatorial Data	[In-template 6M](to upload)	Out-of-template	In-template-6M includes train data and in-template eval data. Out-template refers to eval data from reserved 10 templates.
The code has been reorganized, which may lead to errors or deviations from the original research results. If you encounter any issues, please report them by opening an issue. We will address any bugs promptly.

@article{kang2024how,
  title={How Far is Video Generation from World Model? -- A Physical Law Perspective},
  author={Kang, Bingyi and Yue, Yang and Lu, Rui and Lin, Zhijie and Zhao, Yang, and Wang, Kaixin and Gao, Huang and Feng Jiashi},
  journal={arXiv preprint arXiv:2406.16860},
  year={2024}
}

That's all.

Addendum: what are the files named when you paste more than once?

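I had been quietly curious about this, so I'm glad to have the info. You can paste multiple texts, but I wonder whether the name changes for the second one and so on? https://t.co/0cKmlSx3fb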
— いがなき (@iganaki1018) November 6, 2024

I received a really interesting quote reply, so I tried it out.

The answer was paste.txt, paste-2.txt, ...! Thank you for the comment, いがなき!
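The naming pattern can be sketched as a small function. Only the first two names (paste.txt and paste-2.txt) were observed here. The continuation for a third paste and beyond is my assumption from the pattern, and `paste_filename` is a hypothetical name, not anything Claude exposes.

```python
def paste_filename(n: int) -> str:
    """Return the expected name of the n-th pasted attachment (1-based).

    Observed in this post: n=1 -> paste.txt, n=2 -> paste-2.txt.
    Names for n >= 3 are an extrapolation, not something verified here.
    """
    if n < 1:
        raise ValueError("n must be 1 or greater")
    return "paste.txt" if n == 1 else f"paste-{n}.txt"


# The two names that were actually observed.
print([paste_filename(i) for i in range(1, 3)])
# ['paste.txt', 'paste-2.txt']
```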
