Talk Submission

If you are interested in attending this talk at PyCon JP 2016, please use the social media share buttons below. We will consider the popularity of the proposals when making our selection.

talk

An Introduction to web scraping using Python(en)

Speakers

Manoj Pandey, Arsh

Audience level:

Novice

Category:

Web Frameworks

Description

Want to learn how to scrape the web (and/or organized data sets and APIs) for content? This talk will give you the building blocks (and code) to begin your own scraping adventures. We will review basic data scraping, API usage, and form submission, as well as how to scrape pesky bits like pages that use JavaScript for DOM manipulation.

Objectives

- What/Why Web Scraping
- Scraping vs APIs
- Useful libraries available
- Which library to use for which job
- What is Scrapy Framework
- When and when not to use Scrapy or which particular framework
- Legalities and ethics
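To give a rough feel for the "which library to use for which job" objective, here is a minimal sketch that pulls links out of a page in two ways: with `re`, and with a real HTML parser. It assumes a made-up sample page, and it uses the standard library's `html.parser` as a stand-in for parser libraries such as bs4 or lxml so that it runs without extra installs. The helper names (`links_with_re`, `LinkCollector`, `links_with_parser`) are hypothetical and not taken from the talk.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, invented for this sketch.
SAMPLE_HTML = """
<html><body>
  <a href="/talks/1">First talk</a>
  <a class="ext" href='https://example.com/'>Example</a>
  <!-- <a href="/hidden">commented out</a> -->
</body></html>
"""


def links_with_re(html):
    """Regex approach: short, but unaware of HTML structure such as comments."""
    return re.findall(r'<a\s[^>]*href=["\']([^"\']+)["\']', html)


class LinkCollector(HTMLParser):
    """Parser approach: collects href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_with_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    print("re:    ", links_with_re(SAMPLE_HTML))
    print("parser:", links_with_parser(SAMPLE_HTML))
```

On this sample the regex also picks up the link inside the HTML comment, while the parser skips it. That is one reason a regex tends to suit quick, well-understood patterns and a parser tends to suit anything that depends on document structure.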

Abstract

Web scraping is a technique for gathering data or information from web pages. You could revisit your favorite website every time it updates to check for new information. Or you could write a web scraper to do it for you!

Want to learn how to scrape the web (and/or organized data sets and APIs) for content? This talk will give you the building blocks (and code) to begin your own scraping adventures. We will review basic data scraping, API usage, and form submission, as well as how to scrape pesky bits like pages that use JavaScript for DOM manipulation.

Besides looking at how websites are put together, we will also discuss the ethics and legalities of scraping. What is legal? How can you be a friendly scraper, so that the administrator of the website you are scraping won’t try to shut you down?

We will also compare libraries such as bs4, lxml, and re, and I'll offer pointers for deciding which library to use for a particular task. Finally, towards the end, I'll speak about the Scrapy framework, its features, and how to write a simple scraper in Scrapy.

1. [BeautifulSoup][1]
2. [lxml][2]
3. [re][3]
4. [scrapy][4]

[1]: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
[2]: http://lxml.de/index.html#documentation
[3]: https://docs.python.org/2/library/re.html
[4]: http://doc.scrapy.org/en/1.0/
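One common courtesy for a "friendly scraper" is to check a site's robots.txt before fetching pages. The abstract does not name this practice, so treat the following as an illustrative sketch and not as the speakers' method. It uses the standard library's `urllib.robotparser` on an inline, made-up robots.txt so that it runs without network access. The rules, the `example-scraper` user agent, and the `example.com` URLs are all assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Hypothetical user agent string for our scraper.
USER_AGENT = "example-scraper"


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network call."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# Ask whether our user agent may fetch each URL under the parsed rules.
allowed = rules.can_fetch(USER_AGENT, "https://example.com/talks/")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests, if it says so.
delay = rules.crawl_delay(USER_AGENT)

if __name__ == "__main__":
    print(allowed, blocked, delay)
```

In a real scraper the rules would be loaded from the target site, and the crawl delay would be honored with a pause between requests. Following robots.txt is a courtesy to site administrators and does not by itself settle what is legal.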