Web scraping using Python

This article is intended for readers with a basic knowledge of Python. Today I am sharing a very basic web scraping technique using Python. But first, let me tell you a few things to keep in mind before web scraping.

  • Always check the website’s terms and conditions regarding scraping and content downloading.
  • Websites change all the time, so be ready to rewrite your code.
  • Computers send web requests much faster than a human user, so be nice and avoid sending bulk requests to the website’s server.
  • Indentation is very important in Python.
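
To make the third point concrete, here is a minimal sketch (Python 3) of pausing between requests. The names fetch_politely and fake_fetch are hypothetical helpers made up for this illustration, and the pause length is an arbitrary choice, not a value recommended by any website. The demonstration uses a stand-in fetcher, so no real request is sent.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results


def fake_fetch(url):
    """Stand-in for a real download, used only for this demonstration."""
    return "<html>%s</html>" % url


pages = fetch_politely(
    ["http://example.com/a", "http://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

In a real scraper you would pass a function that actually downloads the page in place of fake_fetch and pick a longer delay.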

Today we’ll find the current prices of a few stocks by scraping Yahoo Finance. The basic logic followed in every scraping exercise is more or less the same.

  • Step 1 - Go to the website, manually enter the query and look at the URL structure.
  • Step 2 - View the source code of the result page and try to find your desired information.
  • Step 3 - Repeat Step 1 and Step 2 for a few other queries and look for consistency.
  • Step 4 - Write the code and voila!

Now, let’s code it.

Finding URL pattern

When you go to the Yahoo Finance website and search for a few stocks (FB, GOOGL, MSFT and YHOO) in the Quote Lookup search box, you’ll observe a pattern in the URL.

http://finance.yahoo.com/q?s=YourStockSymbol&ql=1

 
Ignore the &ql=1 part because it is not relevant for our purpose. So, our final URL will be

http://finance.yahoo.com/q?s=SYMBOL
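
As a small illustration of this pattern, the sketch below (Python 3) builds the URL for each symbol. The helper name build_quote_url is made up for this example. It uses urlencode from the standard library instead of plain string concatenation, so that a symbol containing a space or other special character is escaped properly.

```python
from urllib.parse import urlencode

BASE_URL = "http://finance.yahoo.com/q"


def build_quote_url(symbol):
    """Return the quote page URL for one stock symbol."""
    return BASE_URL + "?" + urlencode({"s": symbol})


for symbol in ["FB", "GOOGL", "MSFT", "YHOO"]:
    print(build_quote_url(symbol))
```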

 
Finding consistency in the code

When you view the source code and look carefully, you’ll find that the current stock price is always wrapped in

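<span id="yfs_l84_symbol"></span>

Here symbol is the stock symbol in lower case. For Yahoo it is yfs_l84_yhoo, for Google it is yfs_l84_googl and so on. Note that the character after yfs_ is a lowercase letter l, not the digit 1.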
 
Final coding

For this exercise we need to import two libraries: urllib to read the source code of the webpage and re (regular expressions) to find our pattern in that source code.

I am defining a list of our symbols and creating a loop which fetches stock prices until it reaches the last element of the symbols list. len(symbols) gives the length of the list.

Inside this loop I am using the urlopen function to read the source code of our URL and the regular expression (re) library to find our desired pattern in that source code. symbols[i].lower() converts each symbol to lower case. The regular expression to find something surrounded by a span tag with id xyz is

<span id="xyz">(.*?)</span>
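
To see this expression at work without touching the network, here is a minimal sketch (Python 3) that runs it against a made-up HTML fragment. The fragment, the price in it and the helper name extract_span_text are all invented for illustration. The (.*?) group is non-greedy, so it stops at the first closing tag instead of swallowing the rest of the page.

```python
import re

# Made-up page fragment; only the id follows the pattern described above.
sample_html = (
    '<div><span id="yfs_l84_fb">75.19</span> '
    '<span id="other">x</span></div>'
)


def extract_span_text(html, span_id):
    """Return the text inside every span whose id is exactly span_id."""
    pattern = re.compile('<span id="' + re.escape(span_id) + '">(.*?)</span>')
    return pattern.findall(html)


print(extract_span_text(sample_html, "yfs_l84_fb"))
```

findall returns a list of matches, which is empty when the id is not present, so it is worth checking the list before indexing into it.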

 
Our final code looks like this.

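#!/usr/bin/env python3
import re
import urllib.request

symbols = ["FB", "GOOGL", "MSFT", "YHOO"]

i = 0
while i < len(symbols):
    url = 'http://finance.yahoo.com/q?s=' + symbols[i]
    # Download the page source and decode the bytes into text.
    htmltext = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    # The price sits in a span whose id ends with the lower-case symbol.
    regex = '<span id="yfs_l84_' + symbols[i].lower() + '">(.*?)</'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    # findall returns an empty list if the page layout has changed.
    if price:
        print('Stock Price for ' + symbols[i] + ' is ' + price[0])
    else:
        print('No price found for ' + symbols[i])
    i += 1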

You can extend this code to any number of symbols. For a very large number of symbols, you can create a text file listing all the symbols, one per line. Then you can read that text file and build the list of symbols in the code by using

open('symbols.txt').read().split('\n')
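
One thing to watch with this one-liner: if the file ends with a newline or contains blank lines, the resulting list will include empty strings. Below is a slightly more defensive sketch (Python 3). The helper name load_symbols is hypothetical, and the demonstration writes a temporary file in place of a real symbols.txt.

```python
import os
import tempfile


def load_symbols(path):
    """Read one symbol per line, skipping blank lines and stray whitespace."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration with a temporary file standing in for symbols.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("FB\nGOOGL\n\nMSFT\nYHOO\n")

symbols = load_symbols(tmp.name)
os.remove(tmp.name)
print(symbols)
```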

I hope you learned something. Try this exercise, write your own web scraping code and, if you have any questions, leave a comment below. Happy coding!
