Programming in Python! - Scraping!

November 15, 2018, 06:25

We'll try everything out in Google Colab.

First, let's fetch a web page.

import requests
r = requests.get('https://news.yahoo.co.jp')

The response object r now gives us access to the page's contents.
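A request can fail or hang, so it may help to wrap the fetch in a small function. The following is a minimal sketch, not part of the original walkthrough: fetch_html is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of url as text, raising on network or HTTP errors."""
    # timeout (in seconds) keeps the call from waiting forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text


# Example usage (needs network access):
# html = fetch_html("https://news.yahoo.co.jp")
# print(html[:200])
```

All errors raised by requests derive from requests.exceptions.RequestException, so a single except clause around the call is enough to catch them.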

print(r.headers)
print(r.encoding)

r.headers returns the HTTP response headers as a dictionary-like object:

{'Date': 'Wed, 14 Nov 2018 12:34:51 GMT', 'P3P': 'policyref="http://privacy.yahoo.co.jp/w3c/p3p_jp.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV"', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'X-Frame-Options': 'SAMEORIGIN', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '9342', 'Content-Type': 'text/html; charset=UTF-8', 'Age': '0', 'Server': 'ATS', 'Connection': 'keep-alive', 'Via': 'http/1.1 edge2502.img.umd.yahoo.co.jp (ApacheTrafficServer [c sSf ])', 'Set-Cookie': 'TLS=v=1.2&r=1; path=/; domain=.yahoo.co.jp; Secure'}

r.encoding returns

UTF-8

which requests takes from the charset in the Content-Type header.
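To see where that value comes from, here is a minimal standard-library sketch that pulls the charset out of the Content-Type value shown in the headers above. It illustrates the idea only and is not how requests is implemented; charset_from_content_type is a hypothetical helper name.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset coerced to lower case.
    return msg.get_content_charset()


charset = charset_from_content_type("text/html; charset=UTF-8")
print(charset)
```

Note that this helper lowercases the name, whereas r.encoding above printed it as UTF-8.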

Next, let's try scraping, that is, extracting part of a web page.

from bs4 import BeautifulSoup
import requests

r = requests.get("https://news.yahoo.co.jp/")
soup = BeautifulSoup(r.content, "html.parser")
print(soup.select("li"))

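This uses BeautifulSoup. After importing it with

from bs4 import BeautifulSoup

we pass r.content, the body of the response fetched by r = requests.get("https://news.yahoo.co.jp/"), to BeautifulSoup, which parses it with "html.parser". The parsed result is stored in the variable soup.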

After that, all we need to do is specify a CSS selector.

print(soup.select("li"))

In this case, the selector "li" extracts every li (list item) element:

[<li class="myPage">
<a href="https://news.yahoo.co.jp/profile/login" onmousedown="this.href='https://news.yahoo.co.jp/profile/login'"><span class="myName">ユーザーページ</span></a>
</li>, <li class="purchase">
<a href="https://headlines.yahoo.co.jp/purchase/" onmousedown="this.href='https://headlines.yahoo.co.jp/purchase/'">購読一覧</a>　.....

The output continues in this fashion.
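Selectors can be narrower than a bare tag name. The sketch below runs offline on a small hypothetical HTML snippet modeled on the output above (the real page is much larger, and the English link texts are placeholders), and shows a tag selector, a class selector, and a descendant selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the output above, not the real page.
html = """
<ul>
  <li class="myPage"><a href="https://news.yahoo.co.jp/profile/login">User page</a></li>
  <li class="purchase"><a href="https://headlines.yahoo.co.jp/purchase/">Subscriptions</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_items = soup.select("li")            # every li element
purchase = soup.select("li.purchase")    # only li elements with class "purchase"
# href attribute of each a element nested inside an li
links = [a["href"] for a in soup.select("li a")]

print(len(all_items))
print(purchase[0].get_text())
print(links)
```

select() always returns a list of matches, so an empty list means the selector matched nothing.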


import requests
from bs4 import BeautifulSoup

r = requests.get("https://news.yahoo.co.jp/")
soup = BeautifulSoup(r.content, "html.parser")

for i in soup.select("p.ttl"):
    print(i.getText())

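The loop

for i in soup.select("p.ttl"):
    print(i.getText())

extracts just the text: it iterates over every element matching "p.ttl" and calls getText() on each one, which returns the text with the tags removed.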

拉致から41年講演1400回超に写真
自衛隊アフリカの拠点強化へ写真new
スルガ銀遠い再建と信頼回復写真
信長の屋敷跡か礎石見つかる写真
巨人丸に5年30億円超を用意写真new
稀勢の里4連敗引退危機再燃写真new
当選なのに入場不可東映謝罪写真
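Words such as 写真 and new appear glued to the end of each headline, most likely because getText() concatenates the text of every nested element. The sketch below reproduces that effect offline. The markup is a hypothetical guess at the structure, not the real page, and it uses get_text(), the name for which getText() is an alias.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a headline followed by small label elements.
html = (
    '<p class="ttl"><a href="#">Headline one<span>photo</span></a></p>'
    '<p class="ttl"><a href="#">Headline two<span>photo</span><span>new</span></a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Default behavior: all nested strings are joined with nothing in between.
joined = [p.get_text() for p in soup.select("p.ttl")]

# A separator plus strip=True keeps the pieces apart.
spaced = [p.get_text(" ", strip=True) for p in soup.select("p.ttl")]

print(joined)
print(spaced)
```

If only the headline itself is wanted, another option is to select a more specific element, provided the page's markup offers one.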
