My First Crawl with Python

June 27, 2024, 11:30

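I want to collect the website URLs for the companies on a list.
The search key is the company name,
which is stored in Japanese in column E of an Excel file.
The URL, once found, is meant to go into column I.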

Let's get going.

Prepare the data

This time I created a folder called Crawl_companies on the Desktop.
It holds company_selected.xlsx,
with the company names in column E.

Setting up a virtual environment and Python in VS Code

python3 -m venv venv

source venv/bin/activate

Using a virtual environment makes dependencies easier to manage and prevents library conflicts between projects. It also keeps the development environment clean and reproducible, so development and maintenance go more smoothly.
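One quick way to confirm that the virtual environment is really active is to compare sys.prefix with sys.base_prefix from inside Python. This is only a minimal sketch; the helper name in_virtualenv is my own and is not part of the steps above.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If this prints False after running source venv/bin/activate, the terminal is probably still using the system Python.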

Install openpyxl, requests, and beautifulsoup4

pip install openpyxl requests beautifulsoup4

openpyxl is a library for reading and writing Excel files (.xlsx format) in Python.
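To make the column E and column I idea concrete, here is a small sketch of reading names from one column and writing values into another on the same rows. The function names read_company_names and write_urls, the example.com URLs, and the throwaway demo workbook are all assumptions for illustration, not code from this post.

```python
import os
import tempfile

import openpyxl


def read_company_names(path, column="E"):
    """Return (row_number, name) pairs for the non-empty cells in a column."""
    sheet = openpyxl.load_workbook(path).active
    return [
        (cell.row, str(cell.value).strip())
        for cell in sheet[column]
        if cell.value is not None and str(cell.value).strip()
    ]


def write_urls(path, urls_by_row, column="I"):
    """Write each URL into the given column on the same row as its company."""
    workbook = openpyxl.load_workbook(path)
    sheet = workbook.active
    for row, url in urls_by_row.items():
        sheet[f"{column}{row}"] = url
    workbook.save(path)


if __name__ == "__main__":
    # Build a tiny throwaway workbook so the sketch never touches real data.
    path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
    workbook = openpyxl.Workbook()
    workbook.active["E1"] = "A株式会社"
    workbook.active["E2"] = "B株式会社"
    workbook.save(path)

    rows = read_company_names(path)
    write_urls(path, {row: f"https://example.com/{row}" for row, _ in rows})
    print(openpyxl.load_workbook(path).active["I2"].value)
```

Keeping the row number next to each name is what lets the URL land on the same row as the company it belongs to.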

BeautifulSoup is a Python library for parsing HTML and XML documents. It is widely used for web scraping (crawling) and is handy for extracting data from HTML documents.
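As a small illustration of what "extracting data from HTML" looks like, the sketch below pulls the absolute https:// links out of a hard-coded HTML string. SAMPLE_HTML and extract_https_links are made up for this example; no network access is involved.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/">Example</a>
  <a href="/relative/path">Relative link</a>
  <a>No href at all</a>
  <a href="https://example.org/about">About</a>
</body></html>
"""


def extract_https_links(html):
    """Return every absolute https:// link found in <a> tags, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.select("a"):
        href = anchor.get("href")  # None when the tag has no href attribute
        if href and href.startswith("https://"):
            links.append(href)
    return links


if __name__ == "__main__":
    for link in extract_https_links(SAMPLE_HTML):
        print(link)
```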

Python code

Test code

Fetch the URLs for three companies

test_crawl_urls.py

import requests
from bs4 import BeautifulSoup
import time
from collections import Counter

def get_google_search_top_url(searchwords, exclusion_list=()):
    # Join the words with spaces and let requests URL-encode the query
    query = " ".join(searchwords)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    print(f"Searching Google for: {query}")  # debug output
    response = requests.get(
        "https://www.google.com/search",
        params={"q": query},
        headers=headers,
        timeout=10,  # do not hang forever on a stalled connection
    )
    print(f"Response status code: {response.status_code}")  # debug output

    if response.status_code != 200:
        return ""

    html_text = response.text
    soup = BeautifulSoup(html_text, 'html.parser')
    
    href_counter = Counter()
    
    for selectdiv in soup.select("a"):
        href = selectdiv.get("href")
        if href is None:
            continue
        if href.startswith("https://"):
            isexclusion = any(el in href for el in exclusion_list)
            if isexclusion:
                continue
            href_counter[href] += 1
    
    if not href_counter:
        return ""
    
    # Return the most frequent URL
    top_url = href_counter.most_common(1)[0][0]
    return top_url

if __name__ == "__main__":
    # Company names (placeholders; put the real names here)
    company_names = ["A株式会社", "B株式会社", "株式会社C"]

    # Substrings that mark a URL to exclude
    exclusion_list = ["google", "wikipedia"]

    company_urls = {}
    for company in company_names:
        searchwords = [company, "公式サイト"]  # "公式サイト" means "official site"
        url = get_google_search_top_url(searchwords, exclusion_list)
        company_urls[company] = url
        print(f"{company}: {url}")
        # Wait a little so the crawl stays gentle
        time.sleep(2)

    # Print the results
    for company, url in company_urls.items():
        print(f"{company}: {url}")

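Point 1
Write debug output and check the results in the terminal by eye.
For each company (say, company A), I treat the most frequent https:// URL on the results page as the official one.
After a few tries, that turned out to be the most accurate.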
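The "most frequent URL wins" rule can be tried on its own, without hitting a search engine. The sketch below applies the same Counter logic to a hand-written list of links; the sample URLs and the function name most_frequent_url are invented for illustration.

```python
from collections import Counter


def most_frequent_url(hrefs, exclusion_list=()):
    """Return the most common https:// URL that contains no excluded substring."""
    counter = Counter(
        href
        for href in hrefs
        if href.startswith("https://")
        and not any(word in href for word in exclusion_list)
    )
    if not counter:
        return ""
    # most_common(1) returns a list holding one (url, count) pair.
    return counter.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up links standing in for the hrefs on a results page.
    sample = [
        "https://www.google.com/preferences",
        "https://a-corp.example/",
        "https://a-corp.example/",
        "https://ja.wikipedia.org/wiki/A",
        "https://news.example/a-corp",
    ]
    print(most_frequent_url(sample, ["google", "wikipedia"]))
```

When two URLs are tied, Counter.most_common returns the one seen first, so the rule quietly falls back to page order.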

Production Python code

Create crawl_company_urls.py

Point 1
Use the sleep function from the time module to pause the program for 2 seconds; in other words, wait 2 seconds before running the next search.

time.sleep(2)
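A possible variation, which is not what this post's script does, is to add a small random extra to the fixed 2 seconds so the requests are not perfectly evenly spaced. The helper polite_pause and its parameters are assumptions made for this sketch.

```python
import random
import time


def polite_pause(base_seconds=2.0, jitter_seconds=1.0, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds."""
    delay = base_seconds + random.uniform(0, jitter_seconds)
    # The sleep function is a parameter so tests can swap in a fake.
    sleep(delay)
    return delay


if __name__ == "__main__":
    print("searching for the first company...")
    waited = polite_pause()
    print(f"waited {waited:.1f} seconds before the next search")
```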

Point 2
Use the tqdm library to add a progress bar with little effort,
for when you want to watch the crawl's progress in the terminal.



pip install tqdm

Progress becomes visible like this:


(venv) U@H ~/Desktop/crawl_companies $ python crawl_company_urls.py
Processing companies:   2%|█▉

Point 3

Save the data as the crawl progresses

save_partial_data(excel_file, company_urls)
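The body of save_partial_data is not shown in this post, and the full script here saves only once, after the loop finishes. As a rough guess at what such a helper might look like, assuming company_urls is a dict mapping company name to URL (as in the test script), one option is to rewrite the whole results file on every call:

```python
import os
import tempfile

import openpyxl


def save_partial_data(excel_file, company_urls):
    """Write every (company, url) pair collected so far to excel_file."""
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.append(["Company Name", "URL"])
    for company, url in company_urls.items():
        sheet.append([company, url])
    # Overwrites the previous snapshot, so the file always holds the latest state.
    workbook.save(excel_file)


if __name__ == "__main__":
    # Throwaway output path and fake URLs, for illustration only.
    excel_file = os.path.join(tempfile.mkdtemp(), "partial_results.xlsx")
    company_urls = {}
    for company in ["A株式会社", "B株式会社", "株式会社C"]:
        company_urls[company] = "https://example.com/"
        save_partial_data(excel_file, company_urls)  # snapshot after each company
    print("saved", len(company_urls), "rows to", excel_file)
```

Rewriting the file each time is wasteful for long lists, but it means a crash partway through loses at most the company being processed.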

import openpyxl
import requests
from bs4 import BeautifulSoup
import time
from collections import Counter
from tqdm import tqdm

def get_google_search_top_url(searchwords, exclusion_list=()):
    # Join the words with spaces and let requests URL-encode the query
    query = " ".join(searchwords)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    print(f"Searching Google for: {query}")  # debug output
    response = requests.get(
        "https://www.google.com/search",
        params={"q": query},
        headers=headers,
        timeout=10,  # do not hang forever on a stalled connection
    )
    print(f"Response status code: {response.status_code}")  # debug output

    if response.status_code != 200:
        return ""

    html_text = response.text
    soup = BeautifulSoup(html_text, 'html.parser')
    
    href_counter = Counter()
    
    for selectdiv in soup.select("a"):
        href = selectdiv.get("href")
        if href is None:
            continue
        if href.startswith("https://"):
            isexclusion = any(el in href for el in exclusion_list)
            if isexclusion:
                continue
            href_counter[href] += 1
    
    if not href_counter:
        return ""
    
    # Return the most frequent URL
    top_url = href_counter.most_common(1)[0][0]
    return top_url

def get_company_urls(input_excel_file, output_excel_file, exclusion_list=()):
    # Load the input Excel file
    workbook = openpyxl.load_workbook(input_excel_file)
    sheet = workbook.active

    # Collect the company names from column E, skipping empty cells
    # (str() guards against cells that hold numbers instead of text)
    company_names = [str(cell.value).strip() for cell in sheet['E'] if cell.value]

    # Prepare the output Excel file
    output_workbook = openpyxl.Workbook()
    output_sheet = output_workbook.active
    output_sheet.append(["Company Name", "URL"])

    for company in tqdm(company_names, desc="Processing companies"):
        searchwords = [company, "公式サイト"]  # "公式サイト" means "official site"
        url = get_google_search_top_url(searchwords, exclusion_list)
        output_sheet.append([company, url])
        time.sleep(2)  # wait a little so the crawl stays gentle

    output_workbook.save(output_excel_file)

if __name__ == "__main__":
    input_excel_file = "/Users/kawamotonaoki/Desktop/crawl_companies/company_selected.xlsx"
    output_excel_file = "/Users/kawamotonaoki/Desktop/crawl_companies/partial_results.xlsx"

    exclusion_list = ["google", "wikipedia"]

    get_company_urls(input_excel_file, output_excel_file, exclusion_list)

python crawl_company_urls.py

Result

It worked!

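Until now I had always asked an engineer to do this kind of thing.

ChatGPT could not solve it on its own; it only worked once I told it to adopt the most frequent URL, so the trial and error up to that point took a while.
This time the total also includes writing this blog post, but next time I should be able to do it in about the time it takes to ask an engineer.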

The other lesson: try a small piece of scraping code first,
and only then run the real thing.

Well, obviously.

Python is the best.

Some sites prohibit web scraping, so proceed with care!
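One way to check part of this in code is to read a site's robots.txt with the standard library's urllib.robotparser before fetching a page. The sketch below parses a made-up robots.txt from a string, so it needs no network access; SAMPLE_ROBOTS_TXT and is_allowed are assumptions for illustration, and robots.txt is only one signal alongside a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt used only for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ["https://example.com/search?q=test", "https://example.com/about"]:
        print(url, "->", "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, url) else "disallowed")
```

For a live site, RobotFileParser also offers set_url and read to download the real robots.txt instead of parsing a string.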

See you next time.
