What is this?
A technical memo on the basic workflow for general web scraping with Selenium and BeautifulSoup.
Environment
- Apple Silicon M1 MacBook Air
- 16 GB RAM
- macOS Sequoia 15.0
- chromedriver version : ChromeDriver 129.0.6668.58
- Google Chrome version : 129.0.6668.59 (Official Build) (arm64)
- Run as of 2024/09/18
Procedure
Create the directory and the virtual environment
mkdir ScrapingTest
cd ScrapingTest
python -m venv nenv
source nenv/bin/activate
Install the libraries
- Selenium: A library for automating a web browser. It is especially useful for scraping pages that are generated dynamically with JavaScript.
- BeautifulSoup: A library for parsing the HTML of a web page and extracting the data you specify.
pip install selenium beautifulsoup4 pandas webdriver-manager
For installing the Chrome driver, refer to the following article.
First, set up ChromeDriver so that Selenium can control the Chrome browser. With webdriver_manager, ChromeDriver is installed automatically and kept in step with the latest version.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Configure the ChromeDriver service
service = Service(ChromeDriverManager().install())
# Initialize ChromeDriver
driver = webdriver.Chrome(service=service)
Next, specify the page you want to access. I used my own site here, but feel free to try any site you like.
# Access the target web page
driver.get('https://coiai.net')
Wait up to 20 seconds for the page to load. Waiting until a specified element is present also handles dynamically generated content.
# Wait for the page to load (up to 20 seconds), using the presence of <body> as the signal
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.TAG_NAME, 'body'))
)
Use BeautifulSoup to parse the HTML source and get data from the page. This example extracts the data inside h2 tags.
# Parse the page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Get only the h2 tags
h2_tags = soup.find_all('h2')
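On a real site, `find_all('h2')` may also pick up headings from menus or sidebars. As a rough sketch of narrowing the match, the example below parses a small hand-written HTML string (a hypothetical stand-in for `driver.page_source`) and compares `find_all` with a CSS selector. The class name `post-title` is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for driver.page_source.
html = """
<html><body>
  <header><h2>Site menu</h2></header>
  <main>
    <h2 class="post-title"> First post </h2>
    <h2 class="post-title">Second post</h2>
  </main>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every h2 in the document, including the one in <header>.
all_h2 = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# A CSS selector narrows the match to h2.post-title elements inside <main>.
post_titles = [h2.get_text(strip=True) for h2 in soup.select("main h2.post-title")]

print(all_h2)
print(post_titles)
```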
Store the data in a list.
# List that holds the extracted data
test_list = []
# Append the text of each h2 tag to the list
for h2 in h2_tags:
    test_list.append({
        'Test data': h2.text.strip()
    })
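`strip()` only trims the ends of the text, so line breaks inside a heading and repeated headings can still reach the list. A small sketch of an optional cleanup step follows; the helper name `clean_headings` is hypothetical and it uses only the standard library.

```python
def clean_headings(raw_texts):
    """Collapse whitespace, drop empty strings, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for text in raw_texts:
        # split() with no arguments splits on any run of whitespace.
        normalized = " ".join(text.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


print(clean_headings(["  News \n 2024 ", "", "About", "News 2024"]))
# prints ['News 2024', 'About']
```

It could be applied as `clean_headings(h2.text for h2 in h2_tags)` before building the list of dictionaries.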
The data is then written out to CSV.
# Convert to a DataFrame
df = pd.DataFrame(test_list)
# Save the result to CSV and print it
df.to_csv('test_data.csv', index=False)
print(df)
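If the headings contain Japanese text and the CSV will be opened in a spreadsheet application, the encoding may matter. The sketch below assumes that is a concern and writes the file as `utf-8-sig`, which adds a byte order mark. The rows and the temporary file path are made-up examples.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows in the same shape as test_list.
rows = [{"Test data": "見出し1"}, {"Test data": "Heading 2"}]
df = pd.DataFrame(rows)

path = os.path.join(tempfile.mkdtemp(), "test_data.csv")
# utf-8-sig prepends a byte order mark so spreadsheet apps can detect UTF-8.
df.to_csv(path, index=False, encoding="utf-8-sig")

# Reading with the same encoding strips the mark again.
restored = pd.read_csv(path, encoding="utf-8-sig")
print(restored)
```

`index=False` keeps the DataFrame's row index out of the file, so the CSV holds only the heading column.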
Finally, close the browser that was opened.
# Close the browser
driver.quit()
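If an exception is raised before `driver.quit()` is reached, the browser process can be left running. One way to guard against that is a small context manager, sketched below. The names `managed_driver` and `FakeDriver` are hypothetical, and the fake class exists only so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver and always call quit() on it afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body raises.
        driver.quit()


class FakeDriver:
    """Stand-in for a real WebDriver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(fake.closed)
# prints True
```

With a real browser, the factory would be something like `lambda: webdriver.Chrome(service=service)`.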
Summary
Putting everything together gives the following code.
Run it as needed and it should work.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Configure the ChromeDriver service
service = Service(ChromeDriverManager().install())
# Initialize ChromeDriver
driver = webdriver.Chrome(service=service)
# Access the target web page
driver.get('https://coiai.net')
# Wait for the page to load (up to 20 seconds), using the presence of <body> as the signal
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.TAG_NAME, 'body'))
)
# Parse the page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Get only the h2 tags
h2_tags = soup.find_all('h2')
# List that holds the extracted data
test_list = []
# Append the text of each h2 tag to the list
for h2 in h2_tags:
    test_list.append({
        'Test data': h2.text.strip()
    })
# Convert to a DataFrame
df = pd.DataFrame(test_list)
# Save the result to CSV and print it
df.to_csv('test_data.csv', index=False)
print(df)
# Close the browser
driver.quit()
This example showed how to use Selenium and BeautifulSoup to extract the text inside h2 tags from a web page and save it to CSV. The same approach can be adapted to collect the data you need from many different websites automatically.
Because Selenium can handle dynamically generated content, it also works for scraping more complex pages. Give it a try.