
[ChatGPT] Reading an English Explainer in Japanese [April 25, 2023 | @Beyond Fireship]


The internet is packed with useful data, but unfortunately that data is often buried deep within a mountain of complex HTML.


The term data mining is the perfect metaphor because you literally have to dig through a bunch of useless dirty markup to extract the precious raw data you're looking for.


One of the most common ways to make money on the internet is with e-commerce and dropshipping, but it's highly competitive and you need to know what to sell and when to sell it.


Don't worry, I'm not about to scam you with my own dropshipping masterclass.


Instead, I'm going to teach you about web scraping with a headless browser called Puppeteer.


It allows you to extract precious data from virtually any public-facing website, even websites like Amazon that don't offer an API.


What we'll do is find trending products on websites like Amazon and eBay, build up a data set, and then bring in AI tools like GPT-4 to analyze the data.


We'll write reviews, write advertisements, and automate virtually any other task you might need.


In addition, I'll teach you some tricks with ChatGPT to write your web scraping code way faster, which is historically very annoying code to write.


But first, there's a big problem.


Big e-commerce sites like Amazon don't love bot traffic and will block your IP address or make you solve captchas if they suspect you're not a human.


But, that's kind of racist to non-biological life.


Luckily, Brightdata, the sponsor of today's video, provides a special tool called the Scraping Browser.


It runs on a proxy network and provides a variety of built-in features like captcha solving, fingerprints, retries, and so on that allow you to scrape the web at an industrial scale.


That being said, if you're serious about extracting data from the web, you'll very likely need a tool that does automated IP address rotation and you can try Brightdata for free using this code.


After you sign up for an account, you'll notice a product called the Web Scraper IDE.


We're not going to use it in this video. However, if you're serious about web scraping, it provides a bunch of templates and additional tools that you'll likely want to take advantage of.


As a developer myself, I want full control over my workflow.


So, for that, I'm going to use an open-source tool from Google called Puppeteer.


It is a headless browser that allows you to view a website like an end user, and interact with it programmatically by executing JavaScript, clicking on buttons, and doing everything else a user can do.


That's pretty cool, but if you use it a lot on the same website, they'll eventually flag your IP and ban you from using it.


Then, your mom will be pissed that she can no longer order her groceries from Walmart.com.


That's where the scraping browser comes in.


It's a remote browser that uses the proxy network to avoid these problems.


To get started, I'm creating a brand new Node.js project with npm, and then installing Puppeteer.


Well, actually Puppeteer Core, which is the automation library without the browser itself, because again, we're connecting to a remote browser.


Now go ahead and create an index.js file and import Puppeteer.


From there, we'll create an async function called run that declares a variable for the browser itself.


Inside this try catch block, we'll try to connect to the browser.


If it throws an error, we'll make sure to console log that error.


And then finally, when all of our scraping is done, we'll want to automatically close the browser.


You don't want to leave the browser open unintentionally.


Now inside of try, we're going to await a Puppeteer connection that uses a browser WebSocket endpoint.
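Here's a rough sketch of what that skeleton might look like. To keep it easy to try out, this version takes the Puppeteer module and the WebSocket endpoint as parameters, which is an assumption of this sketch and not necessarily how the code in the video is laid out.

```javascript
// Sketch only. `puppeteer` is expected to be the puppeteer-core module and
// `browserWSEndpoint` the WebSocket URL of the remote browser.
async function run(puppeteer, browserWSEndpoint) {
  let browser; // declared outside try so the finally block can see it
  try {
    browser = await puppeteer.connect({ browserWSEndpoint });
    // ... scraping code goes here ...
  } catch (err) {
    console.error('Scrape failed:', err);
  } finally {
    // Close the browser whether or not the scrape succeeded.
    if (browser) await browser.close();
  }
}

// Possible usage (assumes puppeteer-core is installed and `endpoint`
// holds your WebSocket URL):
// run(require('puppeteer-core'), endpoint);
```

Checking that browser is defined before closing it matters here, because the connection itself can be the thing that fails.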


At this point, we can go to the proxy section on the Brightdata dashboard and create a new Scraping Browser instance.


Once created, go to the access parameters and you'll notice a host, username, and password.


Back in the code, we can use these values to create a WebSocket URL.


You'll have your username and password separated by a colon, followed by the host URL.
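As a small sketch, the URL could be assembled like this. The wss:// scheme and the @ between the credentials and the host are assumptions based on standard URL syntax, and buildEndpoint is a made-up helper name, so check the access parameters on your dashboard for the exact format.

```javascript
// Hypothetical helper: joins the three access parameters into one URL.
// Assumes the usual scheme://user:password@host layout.
function buildEndpoint({ username, password, host }) {
  return `wss://${username}:${password}@${host}`;
}

// Placeholder values, not real credentials.
const endpoint = buildEndpoint({
  username: 'my-username',
  password: 'my-password',
  host: 'example.com:9222',
});
console.log(endpoint);
```

In a real project you'd probably read the username and password from environment variables so they never end up in source control.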


Now that we're connected to this browser, we can use Puppeteer to do virtually anything a human can do programmatically.


Let's create a new page and then set the default navigation timeout to two minutes.


From there, we can go to any URL on the Internet.
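Those three steps might look something like the sketch below. The openPage name is just something I'm making up for illustration, and it assumes you already have a connected browser object.

```javascript
// Sketch: open a tab, raise the navigation timeout, and load a URL.
// `browser` is assumed to be a connected Puppeteer Browser instance.
async function openPage(browser, url) {
  const page = await browser.newPage();
  // Puppeteer timeouts are in milliseconds, so two minutes is 120000.
  page.setDefaultNavigationTimeout(2 * 60 * 1000);
  await page.goto(url);
  return page;
}
```

The longer timeout is a judgment call: a remote browser going through a proxy can take noticeably longer to load a page than a local one.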


Then, Puppeteer has a variety of API methods that can help you parse a web page, like the dollar sign method (page.$), which feels like jQuery and corresponds to document.querySelector in the browser.


It allows you to grab any element in the DOM, then extract text content from it.
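A minimal sketch of that idea is below. The getText helper is hypothetical, and the selector you pass in depends entirely on the page you're scraping.

```javascript
// Sketch: grab the first element matching a selector and read its text.
// `page` is assumed to be a Puppeteer Page.
async function getText(page, selector) {
  const handle = await page.$(selector); // resolves to null if nothing matches
  if (!handle) return null;
  // The callback runs inside the browser, with the matched element as `el`.
  return handle.evaluate((el) => el.textContent);
}

// Possible usage:
// const heading = await getText(page, 'h1');
```

Returning null for a missing element is a choice made for this sketch, so that a changed page layout doesn't crash the whole scrape.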


Or, as an alternative, you can use page.evaluate, which takes a callback function that gives you access to the browser APIs directly.


Like here, we can grab the document element and get its outer HTML, just like you might do in the browser console.


Let's go ahead and console log the document's outer HTML.
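Something along these lines should do it. The getPageHtml wrapper is just a name I'm using here, and it assumes page is a Puppeteer Page that has already navigated somewhere.

```javascript
// Sketch: run a callback inside the browser and return the full page HTML.
async function getPageHtml(page) {
  // `document` here refers to the page's document, not anything in Node.
  return page.evaluate(() => document.documentElement.outerHTML);
}

// Possible usage:
// const html = await getPageHtml(page);
// console.log(html);
```

Keep in mind the callback is executed in the browser, so it can't see variables from your Node script unless you pass them in as extra arguments to page.evaluate.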


And now we're ready to test our scraper out to make sure everything is working as expected.


Open up the terminal and run the node command on your file and you should get the HTML for that page back as a result.


Congratulations, you're now ready to do industrial scale web scraping.


Now I'm going to go ahead and update the code to go to the Amazon bestsellers page.


And my first goal is to get a manageable chunk of HTML.


What I'm doing is opening up the browser DevTools in Chrome to inspect the HTML directly until we highlight the list of products that we want to scrape.


Ideally, we'd like to get all these products and their prices as a JSON object.


You'll notice all the products are wrapped in a div that has a class of a-carousel.


We can use that selector as our starting point.


Chrome DevTools also has a copy selector feature, which is pretty cool, but usually it's a bit of overkill.


Back in the code, we can make sure that the page will wait for that selector to appear.


Then we can use the dollar sign query selector to grab it from the DOM and finally evaluate it to get its inner HTML.
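Put together, those three steps could be sketched like this. The getInnerHtml name is made up, and the selector in the usage comment is an assumption based on the class mentioned above, which Amazon could change at any time.

```javascript
// Sketch: wait for an element, grab it, and return its inner HTML.
// `page` is assumed to be a Puppeteer Page that has already navigated.
async function getInnerHtml(page, selector) {
  // Wait until the element shows up in the DOM.
  await page.waitForSelector(selector);
  const el = await page.$(selector);
  // Runs in the browser with the matched element as `node`.
  return el.evaluate((node) => node.innerHTML);
}

// Possible usage (selector is an assumption):
// const html = await getInnerHtml(page, '.a-carousel');
// console.log(html);
```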


Now let's go ahead and console log that and run the script once again.


At this point, we have a more manageable chunk of HTML and I could analyze it myself.


But the faster way to get this job done is to use a tool like ChatGPT.


We can simply copy and paste this HTML into the chat and ask it to write puppeteer code that will grab the product title and price and return it as a JSON object.


Literally, on the first try, it writes some perfect evaluation code that grabs the elements with the proper query selectors and then formats the data we requested as a JSON object.
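The exact code ChatGPT produced isn't reproduced here, but evaluation code of that kind tends to look roughly like the sketch below. Every selector in it (.product, .product-title, .product-price) is a placeholder I made up, not Amazon's real markup, so you'd swap in whatever your own HTML chunk uses.

```javascript
// Sketch: turn a container element into an array of { title, price } objects.
// The selectors live inside the function because Puppeteer serializes the
// function and runs it in the browser, where outer variables don't exist.
function extractProducts(root) {
  const SELECTORS = {
    item: '.product',        // hypothetical: one product card
    title: '.product-title', // hypothetical: the product name
    price: '.product-price', // hypothetical: the price text
  };
  return Array.from(root.querySelectorAll(SELECTORS.item)).map((item) => {
    const title = item.querySelector(SELECTORS.title);
    const price = item.querySelector(SELECTORS.price);
    return {
      title: title ? title.textContent.trim() : null,
      price: price ? price.textContent.trim() : null,
    };
  });
}

// Possible usage, where `el` is the element handle for the product list:
// const products = await el.evaluate(extractProducts);
// console.log(JSON.stringify(products, null, 2));
```

Falling back to null when a title or price is missing is a choice made for this sketch, since some listings may not show a price at all.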


Let's copy and paste that code into the project and then run the node script once again.


Now we're in business.


We just built our own custom API for trending products on Amazon and we could apply the same technique to any other e-commerce store like eBay, Walmart, etc.


That's pretty cool.


And if we wanted to extract even more data, we could also grab the link for each one of these products, then use the scraping browser to navigate there and extract even more data.


We loop over each product and use the goto method to navigate to that URL just like we did before.


However, when doing this, I would recommend also implementing a delay of at least two seconds or so between pages just so you're not sending an overwhelming amount of server requests.
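One way to sketch that loop with a pause is shown below. The delay and visitEach helpers are hypothetical names, and onPage stands in for whatever extraction you want to run on each product page.

```javascript
// Sketch: a promise-based sleep built on setTimeout.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit each URL in turn, pausing between pages, and collect the results.
// `page` is assumed to be a Puppeteer Page; `onPage` is your own async
// callback that scrapes whatever you need from the loaded page.
async function visitEach(page, urls, onPage, pauseMs = 2000) {
  const results = [];
  for (const [i, url] of urls.entries()) {
    if (i > 0) await delay(pauseMs); // pause before every page except the first
    await page.goto(url);
    results.push(await onPage(page));
  }
  return results;
}

// Possible usage:
// const details = await visitEach(page, productUrls, (p) => p.title());
```

Visiting the pages one at a time instead of in parallel is deliberate here, since the whole point is to keep the request rate low.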


Now that we have all this wonderful data, the possibilities are endless.


Like, for example, we could use GPT-4 to write advertisements that target different demographics for each one of these products.


Or, we might want to store millions of products in a vector database where they could be used to build a custom AI agent of some sort, like an AutoGPT tool that can take this data and then build you an Amazon dropshipping business plan.


The bottom line is that if you want to do cool stuff with AI, you're going to need data.


But in many cases, the only way to get the data you need is through web scraping, and now you know how to do it in a safe and effective way.


Thanks for watching, and I will see you in the next one.

