マルチモーダルLLMでユーザビリティテストしてみた。

2024年12月21日 23:30

こんにちは、メディア研究開発センターの植木です。本記事ではマルチモーダルLLMであるGPT-4oを用いてユーザビリティテストがどの程度できるかを検証しました。

初見ユーザーのありがたみ

（突然ですが、）プロダクト開発の現場では、初見ユーザーのフィードバックは極めて重要です。では、その初見ユーザーはどこにいるのでしょうか？

例えば、通勤時にたまたますれ違った人が（昨日完成した新機能の）初見ユーザーであることはほぼ間違いないでしょう。声をかけてみれば案外協力してもらえるかもしれません。

しかしながら、もう一つ厄介な問題があります。それは、その初見ユーザーがその機能を初見である瞬間は、人生でたった一度きりだということです。

（あ、しまった...！）

とユーザビリティテストの最中に気づいても「初見ユーザー」はもうそこにはいません。

マルチモーダルLLMによる「初検査」

そこで登場するのがGPT-4oなどのマルチモーダルLLMというわけです。早速、初見ユーザーとしてユーザビリティテストを（何度も）お願いしましょう。今回は、ほとんどのWebサービス事業者が直面する最初の「鬼門」、新規アカウント登録にフォーカスして実験を進めたいと思います。サイトを初めて訪問したユーザーが、できる限り少ない操作回数で、アカウント登録できるか、をテストしていきましょう。

ユーザビリティ指標と実施タスク

今回、ユーザービリティ指標としては下記二つを採用しました。

タスク完了率　　　　　　　※大きいほどユーザービリティが高い
アクション数（操作回数）　※小さいほどユーザービリティが高い

実施するタスクは「アカウント登録」です。アカウントIDは「小文字かつ10文字以上」というシステム仕様があることとしました。下記にアカウント登録フローをお示しします。テストユーザーは最初、Top Pageにランディングします。

上記のアカウント登録フローでテストユーザーに期待する（理想的な）操作は、

画面右上のSign upボタンをクリック（→Accout ID入力画面に遷移）
Account IDの入力フィールドをクリック（→入力フォーカス状態に遷移）
Account IDを入力
　　「小文字かつ10文字以上」の場合、
　　　（→Sign upボタンアクティブ状態に遷移）
　　そうでない場合、
　　　（→エラーメッセージありの入力フォーカス状態に遷移→再入力）
画面下のSign upボタンをクリック（→アカウント登録完了画面に遷移）

であり最短４回のアクションでアカウント登録が完了します。

今回の実験では、マルチモーダルLLMで（有用な）ユーザビリティテストができるかどうかををシンプルに検証するため、Account IDに入力された文字が「小文字かつ10文字以上」でない場合に表示されるエラーメッセージとして、

A. 明確な入力条件を表示　※ユーザビリティが比較的良い
B. エラー内容があいまい　※ユーザビリティが悪い

という2種類を用意し、それぞれでユーザビリティテストを実施します。エラー内容があいまいなB.の方では、テストユーザーがエラー状態からなかなか復帰できず、あれこれ操作を繰り返す（はず）ですので、そのような一種のイライラ行動が指標にうまく表れてくれば実験成功です。

プロンプト

GPT-4oは、現在表示されている画面のスクリーンショットを受け取って、適切だと思う次のアクション（画面内クリック or 文字入力）を生成します（過去に行ったすべてのアクションもプロンプト内に逐次挿入されます）。

Instruction:
You are a test user of a web service. Using the provided screenshot of a smart device with a resolution of 400x800 pixels, please perform one of the following actions

1. "Click x, y": Simulate a mouse click at the specified coordinates `(x, y)`, where `x` and `y` are integers representing the horizontal and vertical positions, respectively. The top-left corner of the screen is `(0, 0)`, and the bottom-right corner is `(400, 800)`.
2. "Type <somecharacters>": Input text, where `<somecharacters>` represents the text you wish to type using a virtual keyboard.

After taking an action, a new screenshot showing the result of that action will be provided.

Important Notes:
1. Goal: Your task is to successfully "sign up". Plan each action carefully to achieve this goal. Your name is "matsuo"
2. Observe the Interface: Study the provided screenshot to identify visual cues. Consider the functionality, state, and coordinates of interface elements:
   - Button states (e.g., active or inactive).
   - Identify the current state of input fields. You MUST CHECK if an input field is focused or active by looking for a BLINKING cursor. If a field is not focused, click on it to activate it before typing. Masked input is the default behavior.
   - Any visible errors or feedback messages.
3. Response Format: Your response must include four parts:
   - Visual Element Coordinates: List up the visual elements and their precise coordinates (x, y) with their functionality and states.
   - Reasoning: Clearly explain your observations about the interface, including the states of visual elements, their functionality, and estimated coordinates. Include action history if relevant to justify the next step.
   - Action History: Examine the action history to determine its impact on the page or element states. This includes checking whether previous operations caused page transitions, altered element states, or had no effect. Use this review to assess the validity of the next step.
   - Next Action: Specify the next action in one of the following formats in order to easily extract Click x, y or Type <somecharacters> using python script with regular expression. DON'T use markdown bold or italic literal.:
     - Click x, y
     - Type <somecharacters>

Task:
Complete the sign-up process.

Action History:
Latest
  {action_history}
Oldest

アカウント登録シミュレーター

各画面において可能なアクションを予め設定したシミュレーターです。GPT-4oが生成したアクションを精査し、問題なければ画面遷移してそのスクリーンショットをGPT-4oに渡します。例えば、画面内のクリック可能なボタン付近をクリックすれば画面遷移し、入力フィールドフォーカス状態（画面）であれば文字入力操作することで文字入力後の状態（画面）に遷移します。その状態（画面）において不可能な操作をした場合には画面遷移せず、そのままの状態（画面）が再びGPT-4oに渡されます。

class SignUpSimulator:
    def __init__(self):
        self.state = "1"
        self.error_state_name = "4-e"
        self.action_history = []
        self.action_count = 0
        self.max_actions = 15

    def run(self, agent_function):
        while self.action_count < self.max_actions:
            screenshot_path = f"/home/ueki/mllm/screenshot/state{self.state}.png"
            action = agent_function(self.action_history, screenshot_path)
            self.action_count += 1

            # Add action to history
            self.action_history.append(action)

            # Check if the action is valid
            if self.is_valid_action(action):
                self.transition_state(action)
                if self.state == "5":
                    return {"result": "success", "actions_taken": self.action_count, "action_history": self.action_history}
            else:
                # Invalid action, stay in the current state
                continue

        return {"result": "fail", "actions_taken": self.action_count, "action_history": self.action_history}

    def is_valid_action(self, action):
        # Define valid actions for each state
        valid_actions = {
            "1": lambda a: self.is_click_in_range(a, 100, 0, 400, 300),
            "2": lambda a: self.is_click_in_range(a, 0, 100, 400, 600),
            "3": lambda a: a.startswith("Type "),
            "4": lambda a: self.is_click_in_range(a, 0, 450, 400, 800),
            self.error_state_name : lambda a: a.startswith("Type "),
        }
        return valid_actions.get(self.state, lambda x: False)(action)

    def is_click_in_range(self, action, x1, y1, x2, y2):
        if not action.startswith("Click "):
            return False
        click_match = re.match(r'Click\s+(\d+),\s*(\d+)', action)
        if click_match:
            x, y = click_match.groups()
            return x1 <= int(x) <= x2 and y1 <= int(y) <= y2
        return False

    def transition_state(self, action):
        # State transition logic
        if self.state == "1":
            self.state = "2"
        elif self.state == "2":
            self.state = "3"
        elif self.state == "3" and action.startswith("Type "):
            account_id = action[5:]
            if account_id.islower() and len(account_id) >= 10:
                self.state = "4"
            else:
                self.state = self.error_state_name
        elif self.state == self.error_state_name and action.startswith("Type "):
            account_id = action[5:]
            if account_id.islower() and len(account_id) >= 10:
                self.state = "4"
        elif self.state == "4":
            self.state = "5"

さて、結果です。

matsuoさん、yosaさん、kobayashiさんにアカウント登録をお願いしてみました（GPT-4oがその人になりきっていい感じのAccount IDを入力することを期待し、プロンプト内にYour Name is "matsuo"という形式で名前を記述）。アクション数は最大15まで（その時点で打ち切り）とし、それぞれ10回ずつユーザビリティテストしました。

概ね期待していた通りの結果で、ユーザビリティが良くないB.のタスク完了率が下がり、アクション数が多くなる傾向が見られます。

また、アクション数に着目することで、

「アカウント登録、なんか大変だった…」

という初見ユーザーが抱くであろう負の感情を、（CI/CDに組み込んで）自動検証できる可能性も本実験で示せたかと思います。

今回はあくまでかなり単純化したシミュレーションではありましたが、マルチモーダルLLMを用いれば簡易的なユーザビリティテストはすでに「できる」という実感を得ることができました（同時に、細かい部分でのチューニング（Click座標があまり安定しないなど）がまだまだ必要であることもわかりました…）。

これからもマルチモーダルLLMの様々な可能性を探っていきたいと思います。今後ともメディア研究開発センターをよろしくお願いします。

（メディア研究開発センター・植木快）