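Proposal

This is a submitted proposal. Share the proposals you would like to hear using the social media buttons at the bottom of each page. Shared posts are counted as votes for the proposal and will be taken into account during selection.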

talk

An Introduction to web scraping using Python(en)

Speakers

Manoj Pandey, Arsh

Audience level:

Beginner

Category:

Web Frameworks

Description

Want to learn how to scrape the web (and/or organized data sets and APIs) for content? This talk will give you the building blocks (and code) to begin your own scraping adventures. We will review basic data scraping, API usage, and form submission, as well as how to scrape pesky bits like pages that use JavaScript for DOM manipulation.

Objectives

- What/Why Web Scraping
- Scraping vs APIs
- Useful libraries available
- Which library to use for which job
- What is the Scrapy framework
- When and when not to use Scrapy, and how to choose a particular framework
- Legalities and ethics
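As a rough illustration of the "which library for which job" question, here is a minimal sketch that extracts links from the same HTML in two ways: with `re`, and with a real parser. It assumes only the standard library, so `html.parser` stands in for parser-based tools such as bs4 or lxml; the sample HTML and the names `links_with_re`, `LinkCollector`, and `links_with_parser` are made up for this example and are not taken from the talk.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; note the link hidden inside an HTML comment.
SAMPLE_HTML = """
<html><body>
  <a href="/first">First</a>
  <!-- <a href="/commented-out">Hidden</a> -->
  <a class="nav" href='/second'>Second</a>
</body></html>
"""


def links_with_re(html):
    """Pattern matching on raw text: quick, but unaware of HTML structure."""
    return re.findall(r'<a\s[^>]*href=["\']([^"\']+)["\']', html)


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags as the parser meets them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_with_parser(html):
    """Structure-aware extraction: comments are not treated as markup."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("re:    ", links_with_re(SAMPLE_HTML))
    print("parser:", links_with_parser(SAMPLE_HTML))
```

The regular expression also picks up the link inside the HTML comment, while the parser skips it. That is one reason `re` tends to suit small, well-known snippets, and a parser tends to suit whole pages.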

Abstract

Web scraping is a technique for gathering data or information from web pages. You could revisit your favorite website every time it updates to check for new information. Or you could write a web scraper to do it for you!

Want to learn how to scrape the web (and/or organized data sets and APIs) for content? This talk will give you the building blocks (and code) to begin your own scraping adventures. We will review basic data scraping, API usage, and form submission, as well as how to scrape pesky bits like pages that use JavaScript for DOM manipulation.

Besides looking at how websites are put together, we will also discuss the ethics and legalities of scraping. What is legal? How can you be a friendly scraper, so that the administrator of the website you are scraping won’t try to shut you down?

The talk will also compare different libraries, such as bs4 vs lxml vs re. I'll also give pointers to help people decide which library to use for a particular task.

Finally, towards the end, I'll speak about the Scrapy framework, its features, and how we can write a simple scraper in Scrapy.

1. [BeautifulSoup][1]
2. [lxml][2]
3. [re][3]
4. [scrapy][4]

[1]: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
[2]: http://lxml.de/index.html#documentation
[3]: https://docs.python.org/2/library/re.html
[4]: http://doc.scrapy.org/en/1.0/
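One common way to be a "friendly scraper", as discussed in the abstract, is to check a site's robots.txt before fetching anything. The sketch below uses the standard library's `urllib.robotparser` on an inline robots.txt so that it runs without network access. The robots.txt content, the `example.com` URLs, and the `example-friendly-bot` user agent are assumptions for illustration, not material from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real scraper would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent string identifying the scraper.
USER_AGENT = "example-friendly-bot"


def build_policy(robots_txt):
    """Parse robots.txt text into a policy object that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)

# Ask whether this user agent may fetch each URL.
allowed = policy.can_fetch(USER_AGENT, "https://example.com/articles/1")
blocked = policy.can_fetch(USER_AGENT, "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests (None if unset).
delay = policy.crawl_delay(USER_AGENT)

if __name__ == "__main__":
    print("articles allowed:", allowed)
    print("private allowed: ", blocked)
    print("crawl delay:     ", delay)
```

Honoring robots.txt is a courtesy and does not settle the legal questions on its own. A site's terms of service and local law still apply.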