aldern00b

Scraping Sites with Python

Before we get started: this is an updated version of Crystal's blog writeup, Web scraping with Python: A quick guide (educative.io). It was out of date since a few things have changed, but you can read the original for comparison.


Okay, let's start with the code if you're a TL;DR kinda guy like me:

#import all the tools we'll need
#pip3 install beautifulsoup4, selenium and webdriver-manager first
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

#setup the chrome driver object to get websites
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

#fetch net-a-porter's website data
driver.get("http://www.net-a-porter.com/en-us/shop/clothing/jeans")

#create a BeautifulSoup object with HTML source as driver.page_source
#and pythons built-in parser html.parser as args, this will start The
#scraper to search for specific tags and attribs
soup = BeautifulSoup(driver.page_source, 'html.parser')

#find the itemprop property for both the brand and price
response = soup.find_all(itemprop=["brand", "price"])

#save the price data into a list, then print it
data = []
for item in response:
    data.append(item.text.strip("\n$"))

print(data)

OK, so first things first: we have to do some installs. If you're using Windows, pop open a PowerShell window and use pip3 to install. You have to use pip3 to get the latest versions of some of these tools.

pip3 install beautifulsoup4
pip3 install selenium
pip3 install webdriver-manager

Next, you're going to need to make sure Chrome is installed, and grab ChromeDriver from here: ChromeDriver - WebDriver for Chrome - Downloads (chromium.org). (If you use webdriver-manager as in the code above, it will download a matching driver for you automatically.)


Since Crystal's walkthrough, Selenium has deprecated the executable path argument and you now have to pass a Service object instead. We create ours here:

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Once the browser's driver is set up, we can tell it to get a website. For this one, we're going to use the same net-a-porter page Crystal did - everyone needs a pair of blue pants. This will pop open Chrome and show the site.

driver.get("http://www.net-a-porter.com/en-us/shop/clothing/jeans")

Next, we use BeautifulSoup to parse the page source:

soup = BeautifulSoup(driver.page_source, 'html.parser')

Using soup, we're going to parse through the page and look for item properties of the pants that are there. In my example, I'm searching for both the brand and the price of the pants. A very helpful page for understanding the find_all function can be found here: Beautiful Soup | find_all method with Examples (skytowner.com)

response = soup.find_all(itemprop=["brand", "price"])
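To see what that filter is doing, here's a minimal sketch of find_all with the itemprop filter, run against a hand-written HTML snippet (the brands, prices, and markup here are made up - the real page's HTML will differ):

```python
from bs4 import BeautifulSoup

# hypothetical markup standing in for the real product listing
html = """
<div>
  <span itemprop="brand">AGOLDE</span>
  <span itemprop="price">$248</span>
  <span itemprop="brand">FRAME</span>
  <span itemprop="price">$230</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# passing a list matches any tag whose itemprop is "brand" OR "price"
response = soup.find_all(itemprop=["brand", "price"])
print([item.text for item in response])
# → ['AGOLDE', '$248', 'FRAME', '$230']
```

Note the results come back in document order, so brands and prices alternate - which is why the real page gives us a flat list we have to clean up next.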

Next, we're going to take all the data it finds, add it to a list called data, and then print it. The strip("\n$") call trims newlines and dollar signs off the ends of each string.

data = []
for item in response:
    data.append(item.text.strip("\n$"))

print(data)
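Since the list alternates brand, price, brand, price, one easy cleanup step is pairing them up and converting the prices to numbers. Here's a sketch using hypothetical values in place of real scraped data - this is also where the statistics module from the standard library comes in handy:

```python
import statistics

# stand-in for the scraped list: alternating brand/price strings
data = ['AGOLDE', '248.00', 'FRAME', '230.00']

brands = data[0::2]                       # every other item, starting at 0
prices = [float(p) for p in data[1::2]]   # every other item, starting at 1

print(dict(zip(brands, prices)))  # {'AGOLDE': 248.0, 'FRAME': 230.0}
print(statistics.mean(prices))    # 239.0
```

From here you've got real numbers to work with instead of strings.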

From here we have some action items to do with the data, but maybe we'll work on that another day. The first would be to compare the data between two sites and see whose is cheapest... or we could schedule this to pull data once a day and watch for sales... hmmm, shopper-saver app?
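The "watch for sales" idea could be as simple as this toy sketch: compare today's scraped prices against yesterday's and report anything that got cheaper (all the numbers here are made-up placeholders):

```python
# hypothetical snapshots from two daily runs of the scraper
yesterday = {'AGOLDE': 248.0, 'FRAME': 230.0}
today = {'AGOLDE': 198.0, 'FRAME': 230.0}

# keep only brands whose price dropped since yesterday
on_sale = {brand: (yesterday[brand], price)
           for brand, price in today.items()
           if brand in yesterday and price < yesterday[brand]}

print(on_sale)  # {'AGOLDE': (248.0, 198.0)}
```

Wire that up to a daily scheduled task and you've got the bones of the shopper-saver app.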

