A summary of the llama-cli command help in llama.cpp

September 30, 2024, 14:36

This post translates the help for the llama-cli command that ships with llama.cpp. The original output comes first, followed by the translation. llama-cli was installed via Homebrew:

brew install llama.cpp

> llama-cli --help
----- common params -----

-h,    --help, --usage                  print usage and exit
--version                               show version and build info
--verbose-prompt                        print a verbose prompt before generation (default: false)
-t,    --threads N                      number of threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
-Cr,   --cpu-range lo-hi                range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1>                      use strict CPU placement (default: 0)
--prio N                                set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
                                        (default: 0)
--poll <0...100>                        use polling level to wait for work (0 - no polling, default: 50)
-Cb,   --cpu-mask-batch M               CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
                                        (default: same as --cpu-mask)
-Crb,  --cpu-range-batch lo-hi          ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1>                use strict CPU placement (default: same as --cpu-strict)
--prio-batch N                          set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
                                        (default: 0)
--poll-batch <0|1>                      use polling to wait for work (default: same as --poll)
-c,    --ctx-size N                     size of the prompt context (default: 0, 0 = loaded from model)
                                        (env: LLAMA_ARG_CTX_SIZE)
-n,    --predict, --n-predict N         number of tokens to predict (default: -1, -1 = infinity, -2 = until
                                        context filled)
                                        (env: LLAMA_ARG_N_PREDICT)
-b,    --batch-size N                   logical maximum batch size (default: 2048)
                                        (env: LLAMA_ARG_BATCH)
-ub,   --ubatch-size N                  physical maximum batch size (default: 512)
                                        (env: LLAMA_ARG_UBATCH)
--keep N                                number of tokens to keep from the initial prompt (default: 0, -1 =
                                        all)
-fa,   --flash-attn                     enable Flash Attention (default: disabled)
                                        (env: LLAMA_ARG_FLASH_ATTN)
-p,    --prompt PROMPT                  prompt to start generation with
                                        if -cnv is set, this will be used as system prompt
--no-perf                               disable internal libllama performance timings (default: false)
                                        (env: LLAMA_ARG_NO_PERF)
-f,    --file FNAME                     a file containing the prompt (default: none)
-bf,   --binary-file FNAME              binary file containing the prompt (default: none)
-e,    --escape                         process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
--no-escape                             do not process escape sequences
--rope-scaling {none,linear,yarn}       RoPE frequency scaling method, defaults to linear unless specified by
                                        the model
                                        (env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N                          RoPE context scaling factor, expands context by a factor of N
                                        (env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N                      RoPE base frequency, used by NTK-aware scaling (default: loaded from
                                        model)
                                        (env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N                     RoPE frequency scaling factor, expands context by a factor of 1/N
                                        (env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N                       YaRN: original context size of model (default: 0 = model training
                                        context size)
                                        (env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N                     YaRN: extrapolation mix factor (default: -1.0, 0.0 = full
                                        interpolation)
                                        (env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N                    YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
                                        (env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N                      YaRN: high correction dim or alpha (default: 1.0)
                                        (env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N                      YaRN: low correction dim or beta (default: 32.0)
                                        (env: LLAMA_ARG_YARN_BETA_FAST)
-gan,  --grp-attn-n N                   group-attention factor (default: 1)
                                        (env: LLAMA_ARG_GRP_ATTN_N)
-gaw,  --grp-attn-w N                   group-attention width (default: 512.0)
                                        (env: LLAMA_ARG_GRP_ATTN_W)
-dkvc, --dump-kv-cache                  verbose print of the KV cache
-nkvo, --no-kv-offload                  disable KV offload
                                        (env: LLAMA_ARG_NO_KV_OFFLOAD)
-ctk,  --cache-type-k TYPE              KV cache data type for K (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)
-ctv,  --cache-type-v TYPE              KV cache data type for V (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V)
-dt,   --defrag-thold N                 KV cache defragmentation threshold (default: -1.0, < 0 - disabled)
                                        (env: LLAMA_ARG_DEFRAG_THOLD)
-np,   --parallel N                     number of parallel sequences to decode (default: 1)
                                        (env: LLAMA_ARG_N_PARALLEL)
--mlock                                 force system to keep model in RAM rather than swapping or compressing
                                        (env: LLAMA_ARG_MLOCK)
--no-mmap                               do not memory-map model (slower load but may reduce pageouts if not
                                        using mlock)
                                        (env: LLAMA_ARG_NO_MMAP)
--numa TYPE                             attempt optimizations that help on some NUMA systems
                                        - distribute: spread execution evenly over all nodes
                                        - isolate: only spawn threads on CPUs on the node that execution
                                        started on
                                        - numactl: use the CPU map provided by numactl
                                        if run without this previously, it is recommended to drop the system
                                        page cache before using this
                                        see https://github.com/ggerganov/llama.cpp/issues/1437
                                        (env: LLAMA_ARG_NUMA)
-ngl,  --gpu-layers, --n-gpu-layers N   number of layers to store in VRAM
                                        (env: LLAMA_ARG_N_GPU_LAYERS)
-sm,   --split-mode {none,layer,row}    how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs
                                        - row: split rows across GPUs
                                        (env: LLAMA_ARG_SPLIT_MODE)
-ts,   --tensor-split N0,N1,N2,...      fraction of the model to offload to each GPU, comma-separated list of
                                        proportions, e.g. 3,1
                                        (env: LLAMA_ARG_TENSOR_SPLIT)
-mg,   --main-gpu INDEX                 the GPU to use for the model (with split-mode = none), or for
                                        intermediate results and KV (with split-mode = row) (default: 0)
                                        (env: LLAMA_ARG_MAIN_GPU)
--check-tensors                         check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE            advanced option to override model metadata by key. may be specified
                                        multiple times.
                                        types: int, float, bool, str. example: --override-kv
                                        tokenizer.ggml.add_bos_token=bool:false
--lora FNAME                            path to LoRA adapter (can be repeated to use multiple adapters)
--lora-scaled FNAME SCALE               path to LoRA adapter with user defined scaling (can be repeated to use
                                        multiple adapters)
--control-vector FNAME                  add a control vector
                                        note: this argument can be repeated to add multiple control vectors
--control-vector-scaled FNAME SCALE     add a control vector with user defined scaling SCALE
                                        note: this argument can be repeated to add multiple scaled control
                                        vectors
--control-vector-layer-range START END
                                        layer range to apply the control vector(s) to, start and end inclusive
-m,    --model FNAME                    model path (default: `models/$filename` with filename from `--hf-file`
                                        or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
                                        (env: LLAMA_ARG_MODEL)
-mu,   --model-url MODEL_URL            model download url (default: unused)
                                        (env: LLAMA_ARG_MODEL_URL)
-hfr,  --hf-repo REPO                   Hugging Face model repository (default: unused)
                                        (env: LLAMA_ARG_HF_REPO)
-hff,  --hf-file FILE                   Hugging Face model file (default: unused)
                                        (env: LLAMA_ARG_HF_FILE)
-hft,  --hf-token TOKEN                 Hugging Face access token (default: value from HF_TOKEN environment
                                        variable)
                                        (env: HF_TOKEN)
-ld,   --logdir LOGDIR                  path under which to save YAML logs (no logging if unset)
--log-disable                           Log disable
--log-file FNAME                        Log to file
--log-colors                            Enable colored logging
                                        (env: LLAMA_LOG_COLORS)
-v,    --verbose, --log-verbose         Set verbosity level to infinity (i.e. log all messages, useful for
                                        debugging)
-lv,   --verbosity, --log-verbosity N   Set the verbosity threshold. Messages with a higher verbosity will be
                                        ignored.
                                        (env: LLAMA_LOG_VERBOSITY)
--log-prefix                            Enable prefx in log messages
                                        (env: LLAMA_LOG_PREFIX)
--log-timestamps                        Enable timestamps in log messages
                                        (env: LLAMA_LOG_TIMESTAMPS)


----- sampling params -----

--samplers SAMPLERS                     samplers that will be used for generation in the order, separated by
                                        ';'
                                        (default: top_k;tfs_z;typ_p;top_p;min_p;temperature)
-s,    --seed SEED                      RNG seed (default: 4294967295, use random seed for 4294967295)
--sampling-seq SEQUENCE                 simplified sequence for samplers that will be used (default: kfypmt)
--ignore-eos                            ignore end of stream token and continue generating (implies
                                        --logit-bias EOS-inf)
--penalize-nl                           penalize newline tokens (default: false)
--temp N                                temperature (default: 0.8)
--top-k N                               top-k sampling (default: 40, 0 = disabled)
--top-p N                               top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N                               min-p sampling (default: 0.1, 0.0 = disabled)
--tfs N                                 tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
--typical N                             locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
--repeat-last-n N                       last n tokens to consider for penalize (default: 64, 0 = disabled, -1
                                        = ctx_size)
--repeat-penalty N                      penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
--presence-penalty N                    repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
--frequency-penalty N                   repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
--dynatemp-range N                      dynamic temperature range (default: 0.0, 0.0 = disabled)
--dynatemp-exp N                        dynamic temperature exponent (default: 1.0)
--mirostat N                            use Mirostat sampling.
                                        Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if
                                        used.
                                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N                         Mirostat learning rate, parameter eta (default: 0.1)
--mirostat-ent N                        Mirostat target entropy, parameter tau (default: 5.0)
-l,    --logit-bias TOKEN_ID(+/-)BIAS   modifies the likelihood of token appearing in the completion,
                                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR                       BNF-like grammar to constrain generations (see samples in grammars/
                                        dir) (default: '')
--grammar-file FNAME                    file to read grammar from
-j,    --json-schema SCHEMA             JSON schema to constrain generations (https://json-schema.org/), e.g.
                                        `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead


----- example-specific params -----

--no-display-prompt                     don't print prompt at generation (default: false)
-co,   --color                          colorise output to distinguish prompt and user input from generations
                                        (default: false)
--no-context-shift                      disables context shift on inifinite text generation (default:
                                        disabled)
                                        (env: LLAMA_ARG_NO_CONTEXT_SHIFT)
-ptc,  --print-token-count N            print token count every N tokens (default: -1)
--prompt-cache FNAME                    file to cache prompt state for faster startup (default: none)
--prompt-cache-all                      if specified, saves user input and generations to cache as well
--prompt-cache-ro                       if specified, uses the prompt cache but does not update it
-r,    --reverse-prompt PROMPT          halt generation at PROMPT, return control in interactive mode
-sp,   --special                        special tokens output enabled (default: false)
-cnv,  --conversation                   run in conversation mode:
                                        - does not print special tokens and suffix/prefix
                                        - interactive mode is also enabled
                                        (default: false)
-i,    --interactive                    run in interactive mode (default: false)
-if,   --interactive-first              run in interactive mode and wait for input right away (default: false)
-mli,  --multiline-input                allows you to write or paste multiple lines without ending each in '\'
--in-prefix-bos                         prefix BOS to user inputs, preceding the `--in-prefix` string
--in-prefix STRING                      string to prefix user inputs with (default: empty)
--in-suffix STRING                      string to suffix after user inputs with (default: empty)
--no-warmup                             skip warming up the model with an empty run
--chat-template JINJA_TEMPLATE          set custom jinja chat template (default: template taken from model's
                                        metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted:
                                        https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
                                        (env: LLAMA_ARG_CHAT_TEMPLATE)
--simple-io                             use basic IO for better compatibility in subprocesses and limited
                                        consoles

example usage:

  text generation:     llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128

  chat (conversation): llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
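
One pair of options in the common params above is easy to mix up: the help says `--rope-scale N` expands the context by a factor of N, while `--rope-freq-scale N` expands it by a factor of 1/N. Read literally, the two express the same stretch as reciprocals. A minimal sketch of the arithmetic, assuming a hypothetical model trained with a 4096-token context that should be stretched to 16384 tokens (both numbers are made up for illustration):

```shell
#!/bin/sh
# Hypothetical sizes, not taken from any real model.
train_ctx=4096
target_ctx=16384

# --rope-scale takes the expansion factor N directly.
rope_scale=$(awk -v t="$target_ctx" -v o="$train_ctx" 'BEGIN { printf "%g", t / o }')

# --rope-freq-scale expands context by 1/N, so it takes the reciprocal.
rope_freq_scale=$(awk -v t="$target_ctx" -v o="$train_ctx" 'BEGIN { printf "%g", o / t }')

echo "--rope-scale $rope_scale"
echo "--rope-freq-scale $rope_freq_scale"
```

This only works out the numbers. Whether a given model behaves well at the stretched context is a separate question that the help text does not answer.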

----- common parameters (translation) -----

-h, --help, --usage: show usage and exit
--version: show version and build info
--verbose-prompt: print a verbose prompt before generation (default: false)
-t, --threads N: number of threads to use during generation (default: -1)
(env: LLAMA_ARG_THREADS)
-tb, --threads-batch N: number of threads to use during batch and prompt processing (default: same as --threads)
-C, --cpu-mask M: CPU affinity mask, an arbitrarily long hex value; complements cpu-range (default: "")
-Cr, --cpu-range lo-hi: range of CPUs for affinity; complements --cpu-mask
--cpu-strict <0|1>: use strict CPU placement (default: 0)
--prio N: set process/thread priority: 0 = normal, 1 = medium, 2 = high, 3 = realtime (default: 0)
--poll <0...100>: polling level used while waiting for work (0 = no polling, default: 50)
-Cb, --cpu-mask-batch M: CPU affinity mask, an arbitrarily long hex value; complements cpu-range-batch (default: same as --cpu-mask)
-Crb, --cpu-range-batch lo-hi: range of CPUs for affinity; complements --cpu-mask-batch
--cpu-strict-batch <0|1>: use strict CPU placement (default: same as --cpu-strict)
--prio-batch N: set process/thread priority: 0 = normal, 1 = medium, 2 = high, 3 = realtime (default: 0)
--poll-batch <0|1>: use polling to wait for work (default: same as --poll)
-c, --ctx-size N: size of the prompt context (default: 0, 0 = loaded from model)
(env: LLAMA_ARG_CTX_SIZE)
-n, --predict, --n-predict N: number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
(env: LLAMA_ARG_N_PREDICT)
-b, --batch-size N: logical maximum batch size (default: 2048)
(env: LLAMA_ARG_BATCH)
-ub, --ubatch-size N: physical maximum batch size (default: 512)
(env: LLAMA_ARG_UBATCH)
--keep N: number of tokens to keep from the initial prompt (default: 0, -1 = all)
-fa, --flash-attn: enable Flash Attention (default: disabled)
(env: LLAMA_ARG_FLASH_ATTN)
-p, --prompt PROMPT: prompt to start generation with; if -cnv is set, it is used as the system prompt
--no-perf: disable internal libllama performance timings (default: false)
(env: LLAMA_ARG_NO_PERF)
-f, --file FNAME: a file containing the prompt (default: none)
-bf, --binary-file FNAME: binary file containing the prompt (default: none)
-e, --escape: process escape sequences (\n, \r, \t, \', \", \\) (default: true)
--no-escape: do not process escape sequences
--rope-scaling {none,linear,yarn}: RoPE frequency scaling method; defaults to linear unless specified by the model
(env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N: RoPE context scaling factor; expands context by a factor of N
(env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N: RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
(env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N: RoPE frequency scaling factor; expands context by a factor of 1/N
(env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N: YaRN: original context size of the model (default: 0 = model training context size)
(env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N: YaRN: extrapolation mix factor (default: -1.0, 0.0 = full interpolation)
(env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N: YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
(env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N: YaRN: high correction dim or alpha (default: 1.0)
(env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N: YaRN: low correction dim or beta (default: 32.0)
(env: LLAMA_ARG_YARN_BETA_FAST)
-gan, --grp-attn-n N: group-attention factor (default: 1)
(env: LLAMA_ARG_GRP_ATTN_N)
-gaw, --grp-attn-w N: group-attention width (default: 512.0)
(env: LLAMA_ARG_GRP_ATTN_W)
-dkvc, --dump-kv-cache: verbose print of the KV cache
-nkvo, --no-kv-offload: disable KV offload
(env: LLAMA_ARG_NO_KV_OFFLOAD)
-ctk, --cache-type-k TYPE: KV cache data type for K (default: f16)
(env: LLAMA_ARG_CACHE_TYPE_K)
-ctv, --cache-type-v TYPE: KV cache data type for V (default: f16)
(env: LLAMA_ARG_CACHE_TYPE_V)
-dt, --defrag-thold N: KV cache defragmentation threshold (default: -1.0, < 0 = disabled)
(env: LLAMA_ARG_DEFRAG_THOLD)
-np, --parallel N: number of sequences to decode in parallel (default: 1)
(env: LLAMA_ARG_N_PARALLEL)
--mlock: force the system to keep the model in RAM rather than swapping or compressing it
(env: LLAMA_ARG_MLOCK)
--no-mmap: do not memory-map the model (slower load, but may reduce pageouts if mlock is not used)
(env: LLAMA_ARG_NO_MMAP)
--numa TYPE: attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs of the node that execution started on
- numactl: use the CPU map provided by numactl
if you previously ran without this option, dropping the system page cache before using it is recommended
see https://github.com/ggerganov/llama.cpp/issues/1437
(env: LLAMA_ARG_NUMA)
-ngl, --gpu-layers, --n-gpu-layers N: number of layers to store in VRAM
(env: LLAMA_ARG_N_GPU_LAYERS)
-sm, --split-mode {none,layer,row}: how to split the model across multiple GPUs:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
(env: LLAMA_ARG_SPLIT_MODE)
-ts, --tensor-split N0,N1,N2,...: fraction of the model to offload to each GPU, as a comma-separated list of proportions, e.g. 3,1
(env: LLAMA_ARG_TENSOR_SPLIT)
-mg, --main-gpu INDEX: the GPU to use for the model (with split-mode = none), or for intermediate results and KV (with split-mode = row) (default: 0)
(env: LLAMA_ARG_MAIN_GPU)
--check-tensors: check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE: advanced option to override model metadata by key; may be specified multiple times.
types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
--lora FNAME: path to a LoRA adapter (can be repeated to use multiple adapters)
--lora-scaled FNAME SCALE: path to a LoRA adapter with user-defined scaling (can be repeated to use multiple adapters)
--control-vector FNAME: add a control vector
note: this argument can be repeated to add multiple control vectors
--control-vector-scaled FNAME SCALE: add a control vector with user-defined scaling SCALE
note: this argument can be repeated to add multiple scaled control vectors
--control-vector-layer-range START END: layer range to apply the control vector(s) to, start and end inclusive
-m, --model FNAME: model path (default: `models/$filename`, with $filename taken from `--hf-file` or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
(env: LLAMA_ARG_MODEL)
-mu, --model-url MODEL_URL: model download URL (default: unused)
(env: LLAMA_ARG_MODEL_URL)
-hfr, --hf-repo REPO: Hugging Face model repository (default: unused)
(env: LLAMA_ARG_HF_REPO)
-hff, --hf-file FILE: Hugging Face model file (default: unused)
(env: LLAMA_ARG_HF_FILE)
-hft, --hf-token TOKEN: Hugging Face access token (default: value of the HF_TOKEN environment variable)
(env: HF_TOKEN)
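-ld, --logdir LOGDIR: path under which to save YAML logs (no logging if unset)
--log-disable: disable logging
--log-file FNAME: log to a file
--log-colors: enable colored logging
(env: LLAMA_LOG_COLORS)
-v, --verbose, --log-verbose: set the verbosity level to infinity (log all messages; useful for debugging)
-lv, --verbosity, --log-verbosity N: set the verbosity threshold; messages with a higher verbosity are ignored.
(env: LLAMA_LOG_VERBOSITY)
--log-prefix: enable a prefix in log messages
(env: LLAMA_LOG_PREFIX)
--log-timestamps: enable timestamps in log messages
(env: LLAMA_LOG_TIMESTAMPS)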

Translated by Claude
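
In the sampling section of the original help, the default for `--samplers` (top_k;tfs_z;typ_p;top_p;min_p;temperature) and the default for `--sampling-seq` (kfypmt) list six entries in the same order, which suggests each letter is shorthand for one sampler. The sketch below expands a short sequence into the long form under that assumption. The letter table is inferred from the two defaults lining up, not taken from any documentation, and `expand_sampling_seq` is a hypothetical helper name.

```shell
#!/bin/sh
# Expand a --sampling-seq string into the ';'-separated form used by --samplers.
# The letter-to-name table is an inference from the defaults in the help text.
expand_sampling_seq() {
    seq=$1
    out=""
    while [ -n "$seq" ]; do
        rest=${seq#?}            # everything after the first character
        letter=${seq%"$rest"}    # the first character itself
        case $letter in
            k) name=top_k ;;
            f) name=tfs_z ;;
            y) name=typ_p ;;
            p) name=top_p ;;
            m) name=min_p ;;
            t) name=temperature ;;
            *) echo "unknown sampler letter: $letter" >&2; return 1 ;;
        esac
        out="${out:+$out;}$name"  # join names with ';'
        seq=$rest
    done
    printf '%s\n' "$out"
}

expand_sampling_seq kfypmt
```

If the inference holds, reordering the letters (for example `tk`) would be a shorter way to write a reordered `--samplers` chain.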

It looks like GGUF conversion can't be done with the llama-cli command. It is strictly for inference.
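
Since llama-cli only runs inference, the model has to be a GGUF file already. The help lists `--hf-repo` and `--hf-file` for pointing at a model file on Hugging Face, so one option is probably to use a repository that already publishes GGUF files. The sketch below only prints the command so the flags can be checked before running anything. The repository and file names are hypothetical placeholders, and it assumes the Homebrew build is able to download models, which is not confirmed here.

```shell
#!/bin/sh
# Hypothetical repository and file names; replace them with a real GGUF repo.
hf_repo="example-user/example-model-GGUF"
hf_file="example-model.Q4_K_M.gguf"

# Build the command as text and print it instead of running it.
# Every flag used here appears in the help output above.
cmd="llama-cli --hf-repo $hf_repo --hf-file $hf_file -p Hello -n 128"
echo "$cmd"
```

To actually run it, paste the printed line into a terminal after swapping in real names.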
