April 22, 2018: Python Scraping -gkz

April 22, 2018 01:23

A summary of what I have learned about web scraping with Python

http://vaaaaaanquish.hatenablog.com/entry/2017/06/25/202924

・Contents

requests
BeautifulSoup
Mechanize
PyQuery
Robobrowser
Selenium.webdriver and PhantomJS
Selenium
PhantomJS
Implementation using Selenium and PhantomJS
urllib.parse
Encoding detection with chardet and cchardet
Timeouts with timeout_decorator
Retries with the retrying decorator
Parallel processing with joblib, multiprocess, asyncio, and threading
multiprocessing
joblib
aiohttp, grequests, requests-futures
aiohttp
grequests
requests-futures
scrapy
unittest using SimpleHTTPServer and Thread

(The rest is omitted.)

Installing Python's lxml module

・Contents

For details on how to use lxml, this article is very informative.

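Concretely, what I want to do here is pull the necessary information out of the <a> tags inside the <div> block with id="web" in the search-results HTML. For this kind of task, it looks like the xpath method can be used.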
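As a rough illustration of that idea, here is a minimal sketch using lxml's xpath method. The sample HTML, the example.com URLs, and the extract_links helper are all made up for this note and are not taken from the linked article; a real search-results page will have a different structure, so the XPath expression would likely need adjusting.

```python
from lxml import html

# Hypothetical HTML standing in for a search-results page (an assumption).
SAMPLE_HTML = """
<html><body>
  <div id="header"><a href="https://example.com/login">Login</a></div>
  <div id="web">
    <ol>
      <li><a href="https://example.com/a">Result A</a></li>
      <li><a href="https://example.com/b">Result <b>B</b></a></li>
    </ol>
  </div>
</body></html>
"""


def extract_links(page_source):
    """Return (text, href) pairs for every <a> inside <div id="web">."""
    tree = html.fromstring(page_source)
    # "//" matches at any depth, so this finds all <a> descendants of the
    # div whose id attribute equals "web", and skips links elsewhere.
    anchors = tree.xpath('//div[@id="web"]//a')
    # text_content() also gathers text from nested tags such as <b>.
    return [(a.text_content().strip(), a.get("href")) for a in anchors]


links = extract_links(SAMPLE_HTML)
for title, url in links:
    print(title, url)
```

The link in the header div is not returned because the expression only looks under the div with id="web". In practice the page source would come from something like requests rather than a string literal.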

How to write XPath expressions
https://webbibouroku.com/Blog/Article/xpath

Crawler
https://sutaba-mac.site/scrapy-s2-settings-and-items/

PyCharm + WSL

https://intellij-support.jetbrains.com/hc/en-us/community/posts/360000192424-Cant-connect-via-shh-to-choose-intepreter-WSL-?input_string=using-wsl-as-a-remote-interpreter

Django roster app
https://qiita.com/okoppe8/items/54eb105c9c94c0960f14
