ðžïž Crawl4AIã§Webã¯ããŒãªã³ã°ãçéèªååãããïŒð
ããã«ã¡ã¯ãçããïŒâšä»åã¯ãWebã¯ããŒãªã³ã°ã楜ã ããªãããã®ææ°ããŒã«ãCrawl4AIããã玹ä»ããŸããWebã®äžçã§ã®ããŒã¿åéãåæããã£ãšç°¡åã«ããããå¹çããã§ããããããªã£ãŠæã£ãããšããããŸãããïŒãããªããªãã«ãŽã£ããã®ããŒã«ãããã®Crawl4AIãªãã§ãïŒã§ã¯ãæ©éãã®é åãèŠãŠãããŸãããïŒ
ð¢ Crawl4AIã£ãŠäœïŒ
Crawl4AIã¯ãWebäžããæ å ±ãåéããããŒã¿ãå¹ççã«æœåºããããã®ãªãŒãã³ãœãŒã¹ããŒã«ã§ãããããŸã§æéãããã£ãŠããäœæ¥ããèªååããŠãããã®ã§ãããªãã®æéãå€§å¹ ã«ç¯çŽã§ãã¡ãããŸãâ³ãããã«ãã€ã³ããªãžã§ã³ããªAIãšãŒãžã§ã³ããæ§ç¯ããŠãæ å ±ãåéã»åæããã®ãã°ããšç°¡åã«ïŒéçºè ã«ãšã£ãŠã¯ããŸãã«å¿ æºã®ã¢ã€ãã ãªãã§ããð§ã
ð Crawl4AIã®äž»ãªç¹åŸŽ
Crawl4AIã«ã¯ã以äžã®ãããªäŸ¿å©ãªæ©èœããã£ã·ãè©°ãŸã£ãŠããŸãã
1. ãªãŒãã³ãœãŒã¹ã§ç¡æð
ãŸãã¯ããïŒCrawl4AIã¯ç¡æã§å©çšã§ãããã§ããã財åžã«ãåªããã§ãããð°ãã ããã誰ã§ãæ°è»œã«å§ãããããã§ãã
2. AIãã¯ãŒãð€
AIã®åãåããŠãWebããŒãžäžã®èŠçŽ ãèªåã§èªèã»è§£æããŠãããŸããããã§ãæéãšåŽåãç¯çŽã§ããããšééããªãïŒæéã足ããªããŠå°ã£ãŠãã人ã«ã¯ç¹ã«ããããã§ãâ°ã
3. æ§é åãããããŒã¿åºåð
ããŒã¿ããã éããã ããããªããŠãJSONãMarkdown圢åŒã«ãã¡ããšæŽçããŠãããããããã®åŸã®åæããã¡ããã¡ãç°¡åã«ãªããŸããããŒã¿ã®èŠãç®ãã¹ãããªããŠãäžç®ã§ãããã®ãå¬ãããã€ã³ãã§ãðã
4. å€æ©èœå¯Ÿå¿ðª
ã¹ã¯ããŒã«æ©èœãè€æ°ã®URLã¯ããŒã«ãã¡ãã£ã¢ã¿ã°ã®æœåºãã¡ã¿ããŒã¿ã®æœåºãã¹ã¯ãªãŒã³ã·ã§ããã®ãã£ããã£ãªã©ãCrawl4AIã«ã¯æ¬åœã«å€åœ©ãªæ©èœãè©°ãŸã£ãŠããŸããäžã€ã®ããŒã«ã§ããã ãã®ããšãã§ããã®ã¯ãããããã§ãããðã
ð Crawl4AIã®å§ãæ¹
䜿ãæ¹ããšã£ãŠãç°¡åïŒä»¥äžã®ã¹ããããèžãã°ãããã«ã§ãWebã¯ããŒãªã³ã°ãå§ããããŸããã
1. ã€ã³ã¹ããŒã«ãšã»ããã¢ããð§
ãŸãã¯Crawl4AIãã€ã³ã¹ããŒã«ããŸããããPythonç°å¢ãæŽã£ãŠããã°ã以äžã®ã³ãã³ããå®è¡ããã ãã§OKã§ãã
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk
ããã§ãCrawl4AIã®åºæ¬çãªã»ããã¢ããã¯å®äºã§ãïŒæ¬¡ã«ãPythonã¹ã¯ãªãããäœæããŠã¯ããŒã©ãŒãèµ·åããŸãããã
2. ããŒã¿æœåºð
Crawl4AIã䜿ã£ãŠããŒã¿ãæœåºããããã®åºæ¬çãªã¹ã¯ãªãããèŠãŠã¿ãŸããããäŸãã°ãOpenAIã®APIäŸ¡æ Œæ å ±ãååŸããå Žåã以äžã®ã³ãŒãã䜿ããŸãã
from crawl4ai import WebCrawler
# ã¯ããŒã©ãŒã®ã€ã³ã¹ã¿ã³ã¹ãäœæ
crawler = WebCrawler()
# ã¯ããŒã©ãŒã®ãŠã©ãŒã ã¢ããïŒå¿
èŠãªã¢ãã«ãããŒãïŒ
crawler.warmup()
# ã¯ããŒã©ãŒãURLã§å®è¡
result = crawler.run(url="https://openai.com/api/pricing/")
# æœåºãããå
容ãMarkdown圢åŒã§è¡šç€º
print(result.markdown)
ãã®ã¹ã¯ãªãããå®è¡ãããšãæå®ããURLããå¿ èŠãªããŒã¿ãååŸã§ããŸããçµæã¯Markdown圢åŒã§åºåãããã®ã§ãæŽçãããç¶æ ã§ããŒã¿ã確èªããããšãã§ããŸãðã
3. AIãšãŒãžã§ã³ããšã®é£æºð€
Crawl4AIãä»ã®AIããŒã«ãšé£æºãããããšã§ãããã«é«åºŠãªããŒã¿åŠçãå¯èœã«ãªããŸããäŸãã°ãPraison CrewAIãšãŒãžã§ã³ããšçµã¿åãããŠäœ¿ãå Žåã以äžã®ãããªã³ãŒãã§ç°¡åã«é£æºã§ããŸãã
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
# ã¢ãã«ã®æéæ
å ±ãæœåºããããã®ã¯ã©ã¹
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the model.")
output_fee: str = Field(..., description="Fee for output token for the model.")
# ã¯ããŒã©ãŒã®èšå®
url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()
# ããŒã¿æœåºã®å®è¡
result = crawler.run(
url=url,
word_count_threshold=1,
extraction_strategy= LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""Crawlãããã³ã³ãã³ããããå
šãŠã®ã¢ãã«ã®æéæ
å ±ãæœåºããŠãã ããã"""
),
bypass_cache=True,
)
# æœåºãããã³ã³ãã³ãã衚瀺
print(result.extracted_content)
C
ãã®ã³ãŒãã§ã¯ãã¯ããŒã©ãŒãæå®ããURLããOpenAIã®ã¢ãã«æéæ å ±ãæœåºããæ§é åãããããŒã¿åœ¢åŒã§åºåããŸãããã®ããã«ãCrawl4AIã䜿ãã°ãè€éãªããŒã¿æœåºãç°¡åã«èªååã§ããŸãã
ð€ AIãšãŒãžã§ã³ãã§ã®æŽ»çšäŸ
å®éã«Crawl4AIã䜿ã£ãŠã©ããªããšãã§ããã®ããæ°ã«ãªããŸãããïŒäŸãã°ãPraison-AIãšãŒãžã§ã³ãã䜿ã£ãŠWebã¹ã¯ã¬ã€ãã³ã°ãããŒã¿ã¯ãªãŒãã³ã°ãããŒã¿åæãçµã¿åããããšããããªããšãã§ããŸãã
è€æ°ã®Webãµã€ãããäŸ¡æ Œæ å ±ãèªåã§æœåºããŠãããããŸãšããã¬ããŒããäœæããããšãã§ããŸããããããããããã®ã¹ãããããã¹ãŠèªååãããŠããã®ã§ãæéã¯æå°éïŒCrawl4AIã掻çšããã°ãé¢åãªäœæ¥ãã解æŸãããŠããã£ãšã¯ãªãšã€ãã£ããªéšåã«éäžã§ããããã«ãªããŸããð§ ã
ð¡ çµè«ïŒCrawl4AIã䜿ã£ãŠã¿ããïŒ
Crawl4AIã¯ãWebã¯ããŒãªã³ã°ãšããŒã¿æœåºãå¹ççã«è¡ãããã®åŒ·åãªããŒã«ã§ãããªãŒãã³ãœãŒã¹ã§ç¡ææäŸãããŠããã®ã§ãããã«ã§ãè©ŠããŠã¿ã䟡å€ããããŸããðããã®é«ãæè»æ§ãšå€æ©èœæ§ã掻ãããŠãããªãã®ãããžã§ã¯ãã次ã®ã¬ãã«ã«åŒãäžããŠã¿ãŸãããïŒ
ããCrawl4AIã«èå³ã湧ãããªãããã²äœ¿ã£ãŠã¿ãŠãã ããïŒãã®ããŒã«ãããªãã®WebããŒã¿åéã®åŒ·åãªçžæ£ã«ãªãããšãééããªãã§ãðªãæ°ããçºèŠãšå¯èœæ§ãåºããããšãé¡ã£ãŠããŸãïŒ
ãã®èšäºãæ°ã«å ¥ã£ãããµããŒããããŠã¿ãŸãããïŒ