Presentation: An Introduction to web scraping using Python

Manoj Pandey, Arsh

Audience level: Novice
Category: Web Frameworks

Description

Want to learn how to scrape the web (and/or organized data sets and APIs) for content? This talk will give you the building blocks (and code) to begin your own scraping adventures. We will review basic data scraping, API usage, and form submission, as well as how to scrape pesky bits like pages that use JavaScript for DOM manipulation.
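As a rough illustration of the form-submission part, here is a minimal sketch that builds the kind of POST request a browser sends when a form is submitted. It uses only the standard library; the URL and the field names `q` and `page` are hypothetical placeholders, not taken from the talk. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form endpoint and field names, for illustration only.
FORM_URL = "https://example.com/search"
FIELDS = {"q": "python", "page": "1"}


def build_form_request(url, form_fields):
    """Build (but do not send) a POST request that mimics an HTML form submission."""
    # Forms are normally sent as URL-encoded key=value pairs in the request body.
    body = urlencode(form_fields).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_form_request(FORM_URL, FIELDS)
print(request.get_method(), request.full_url)
print(request.data)
```

Passing `request` to `urllib.request.urlopen` would actually submit it; that step is left out so the sketch runs without network access.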

Abstract

Web scraping is a technique for gathering data from web pages. You could revisit your favorite website every time it updates to check for new information, or you could write a web scraper to do it for you!

Want to learn how to scrape the web (and/or organized data sets and APIs) for content? This talk will give you the building blocks (and code) to begin your own scraping adventures. We will review basic data scraping, API usage, and form submission, as well as how to scrape pesky bits like pages that use JavaScript for DOM manipulation.

Besides looking at how websites are put together, we will also discuss the ethics and legalities of scraping. What is legal? How can you be a friendly scraper, so that the administrator of the website you are scraping won’t try to shut you down?

I'll also compare libraries such as bs4, lxml, and re, and offer guidance on choosing a library for a particular task. Finally, towards the end, I'll cover the Scrapy framework, its features, and how to write a simple scraper in Scrapy.

1. [BeautifulSoup][1]
2. [lxml][2]
3. [re][3]
4. [scrapy][4]

[1]: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
[2]: http://lxml.de/index.html#documentation
[3]: https://docs.python.org/2/library/re.html
[4]: http://doc.scrapy.org/en/1.0/
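To give a feel for the regex-versus-parser trade-off mentioned in the abstract, here is a small sketch that extracts links from the same HTML in two ways. So that it runs without third-party installs, the standard library's `html.parser` stands in for a real parsing library such as bs4 or lxml; that substitution, the sample HTML, and the helper names are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Made-up HTML snippet used only for this comparison.
HTML = """
<ul>
  <li><a href="/docs">Docs</a></li>
  <li><a class="ext" href='https://example.com/blog'>Blog</a></li>
  <!-- <a href="/hidden">commented out</a> -->
</ul>
"""


def links_with_re(html):
    """Naive regex: quick to write, but blind to HTML structure such as comments."""
    return re.findall(r'<a\s[^>]*href=["\']([^"\']+)["\']', html)


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags as the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_with_parser(html):
    """Parser-based extraction: markup inside comments is not treated as tags."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(links_with_re(HTML))      # also picks up the commented-out /hidden link
print(links_with_parser(HTML))  # only the two links that are really in the page
```

The regex version is fine for a quick one-off on markup you control, while a parser is usually the safer choice once pages contain comments, unusual quoting, or nested structure.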