Scraping, Part 4: How I Got Scraping Working on Amazon

June 6, 2023, 00:05

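Good evening. It's the start of the week and I still have stamina to spare. This is st.
Today, while listening to the theme music from 火曜サスペンス (the Tuesday suspense drama), I'd like to continue the series and write about how I got scraping working on Amazon.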

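1. It finally clicked

This is actually my second attempt at scraping Amazon.
The first time, I managed to get the images but couldn't get any of the other information, so I gave up for a while.
But after working through eBay, Mercari, and Amazon on consecutive days, it turned out to be surprisingly quick.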

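2. I can now scrape eBay, Mercari, and Amazon

Honestly, the approach is the same for all of them. It's pretty easy.
While I was at it, I also modified the script so that it can fix up the images.

3. Now, the code itself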

import urllib
import time
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import csv
import os
import openpyxl
import glob
from PIL import Image, ImageFilter
from urllib.parse import urljoin
from natsort import natsorted

url = 'paste the URL you want to scrape here'

# Download the page and parse the HTML
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

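base = 'base URL for resolving links'  # placeholder
# Find the search-result container, then collect every div inside it
soup1 = soup.find(class_ = 's-main-slot s-result-list s-search-results sg-row')
soup2 = soup1.find_all('div')
gazo = [["price","link","src"]]  # header row
for i in soup2:
(As always, I'm leaving this part out. Let me know if you'd like to know.)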
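
Since the loop body is left out above, the following is only a rough sketch of how one row of `gazo` could be put together once a price, a link, and an image URL have been pulled out of a result. It is not the code used in this post: the helper name `make_row`, the example base URL, and the sample values are all assumptions for illustration. The main point is what `urljoin` does with a relative link.

```python
from urllib.parse import urljoin

# Hypothetical base URL; the post does not show its own `base` value.
BASE = "https://www.example.com/"


def make_row(price, href, src, base=BASE):
    """Return one [price, link, src] row with the link made absolute."""
    link = urljoin(base, href)  # relative hrefs are resolved against base
    return [price.strip(), link, src]


# Same header row as in the post, followed by made-up sample data.
gazo = [["price", "link", "src"]]
samples = [
    (" 1,980 ", "/dp/EXAMPLE1", "https://img.example.com/1.jpg"),
    ("2,480", "https://www.example.com/dp/EXAMPLE2", "https://img.example.com/2.jpg"),
]
for price, href, src in samples:
    gazo.append(make_row(price, href, src))

for row in gazo:
    print(row)
```

A link that is already absolute passes through `urljoin` unchanged, so the same helper can handle both kinds of `href`.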

path = r'C:\Users\xuesh\amazon.csv'
path2 = r'C:\Users\xuesh\amazon.xlsx'

4. Open the CSV file, creating it if it doesn't exist

# Mode "w" creates the file if it is missing; the with block closes and saves it
with open(path, "w", encoding='utf-8') as f:
    writecsv = csv.writer(f, lineterminator='\n')
    writecsv.writerows(gazo)  # write the contents of the list

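df = pd.read_csv(path)  # read the CSV back in
df = df.drop_duplicates()  # remove duplicate rows
# to_excel is called without encoding= because pandas 2.0 removed that argument
df.to_excel(path2, index=False)
print('Converted to Excel')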
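
To make the write-then-deduplicate step easier to try out, here is a small sketch of the same idea using only the standard library instead of pandas. It is an alternative illustration, not what this post runs: `dedupe_rows` is a made-up helper, the rows are sample data, and the file goes to a temporary folder rather than the path used above.

```python
import csv
import os
import tempfile


def dedupe_rows(rows):
    """Drop repeated rows while keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ["price", "link", "src"],
    ["1,980", "https://www.example.com/a", "https://img.example.com/a.jpg"],
    ["1,980", "https://www.example.com/a", "https://img.example.com/a.jpg"],
    ["2,480", "https://www.example.com/b", "https://img.example.com/b.jpg"],
]

# Write to a temporary folder here; the post writes to a fixed path instead.
csv_path = os.path.join(tempfile.mkdtemp(), "amazon.csv")
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f, lineterminator="\n").writerows(dedupe_rows(rows))

# Read the file back to confirm what was saved.
with open(csv_path, encoding="utf-8", newline="") as f:
    saved = list(csv.reader(f))
print(len(saved))
```

Prices that contain a comma, such as "1,980", are quoted automatically by `csv.writer`, so they come back as a single field when read.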

5. Collecting the images

df34 = pd.read_excel(path2)
df35 = df34['src']

path = r"C:\Users\xuesh\amazonimage"
os.makedirs(path, exist_ok=True)
num = 1

# Download each image, waiting 2 seconds between requests
for src in df35:
    time.sleep(2)
    url = src
    file_name = "index no" + '0' + str(num) + ".jpg"
    response = requests.get(url)
    image = response.content
    num += 1
    with open(os.path.join(path, file_name), "wb") as f:
        f.write(image)
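
One thing worth noting about the file names above: a single '0' is put in front of the number, so image 10 is saved as "index no010.jpg", which a plain alphabetical sort would place before "index no02.jpg". That is presumably why `natsorted` is used later. As a possible alternative, here is a sketch with fixed-width numbers; `image_file_name` and the width of 3 are my own assumptions, not part of the script.

```python
import os


def image_file_name(num, width=3):
    """Build a name like 'index no007.jpg' with a fixed number of digits."""
    return f"index no{num:0{width}d}.jpg"


folder = "amazonimage"  # relative folder name used only for this example
names = [image_file_name(n) for n in (1, 2, 10, 100)]
paths = [os.path.join(folder, name) for name in names]

# With fixed-width numbers, plain alphabetical order equals numeric order.
print(sorted(names) == names)
for p in paths:
    print(p)
```

With names like these, the built-in `sorted` would be enough, as long as the number of images stays within the chosen width.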

name = path2
workbook = openpyxl.load_workbook(name)
sheet = workbook.active
imagelist = glob.glob(r"C:\Users\xuesh\amazonimage\index no**.jpg")
imagelist = natsorted(imagelist)
num = 2
# Convert the collected images to JPEG (a webp file causes an error in Excel)
for pic in imagelist:
    jpg = Image.open(str(pic)).convert('RGB')
    jpg.save(str(pic), 'jpeg')

# Resize every image to 256 x 256
for pic in imagelist:
    img = Image.open(pic)
    img_resize = img.resize((256, 256))
    img_resize.save(str(pic), 'jpeg')

# Put each image into column C, starting at row 2
for pic in imagelist:
    img_to_excel = openpyxl.drawing.image.Image(pic)  # select the image
    sheet.add_image(img_to_excel, 'C' + str(num))  # attach it at the given cell
    num += 1

# Save
workbook.save(name)
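
The image list above is ordered with `natsorted` before each picture is assigned to a cell, so that picture N ends up next to row N. To show why the ordering matters, here is a small sketch of numeric-aware sorting using only the standard library. It is not how natsort is implemented, only the general idea, and `natural_key` is a name made up for this example.

```python
import re


def natural_key(text):
    """Split text into string and integer chunks so numbers compare by value."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", text)]


files = ["index no010.jpg", "index no02.jpg", "index no01.jpg"]

plain = sorted(files)                      # character-by-character order
natural = sorted(files, key=natural_key)   # numeric-aware order

# Pair each image with the worksheet cell it would be anchored to (C2, C3, ...).
anchors = [(name, f"C{row}") for row, name in enumerate(natural, start=2)]
print(plain)
print(natural)
print(anchors)
```

In the plain ordering, image 10 is placed ahead of image 2, which would attach pictures to the wrong rows of the sheet.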

And here are the images I collected this time.

Next, I plan to look into whether I can scrape all of these sites in one go.

