Talk Submission
If you are interested in attending this talk at PyCon JP 2016, please use the social media share buttons below. We will consider the popularity of the proposals when making our selection.
talk
Building a Simple Japanese Content-Based Recommender System in Python(en)
Speakers
Charles Vallantin Dulac
Audience level:
Intermediate
Category:
Industry Uses
Description
Online stores such as Amazon, as well as news sites and blogs, suffer from information overload. Customers can easily get lost among their millions of products or articles. Recommendation engines help users narrow down this variety by suggesting relevant items. In this talk, I will show how to create a simple content-based recommendation system for Japanese blog posts in Python.
Objectives
Discover what a recommendation system is. Learn how it works. See what challenges most East Asian languages pose. Learn how to build a simple Japanese content-based recommendation engine (TF-IDF/Word2Vec).
Abstract
Wikipedia states recommendation systems have become extremely common in recent years and are utilized in a variety of areas: some popular applications include movies, music, news, books, research articles, search queries, social tags, and products in general.
Recommendation engines typically produce a list of recommendations in one of two ways: through collaborative filtering or through content-based filtering. Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches use a series of discrete characteristics of an item to recommend additional items with similar properties.
This talk will focus on building a simple Japanese content-based recommendation engine.
To abstract the features of the items in the system, an item representation algorithm is applied. A widely used one is the TF-IDF representation (also called the vector space representation). It is usually applied under the assumption that the words in a sentence are separated by spaces, an assumption that does not hold in most East Asian languages. MeCab, a fast Japanese morphological analyzer, helps us solve this problem by splitting the document into its constituent tokens, which allows us to apply TF-IDF on top of them.
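The following is a minimal sketch of this step, not the implementation used in the talk. It assumes the documents have already been segmented: the token lists are hand-written for illustration, and the names `docs`, `tf_idf`, and `top_terms` are hypothetical. With the mecab-python3 package, a call along the lines of `MeCab.Tagger("-Owakati").parse(text).split()` would be one way to obtain similar lists. The weighting is the plain textbook form (term frequency times the log of inverse document frequency), without the smoothing that some libraries add.

```python
import math
from collections import Counter

# Whitespace splitting leaves a Japanese sentence as one single token.
sentence = "私は猫が好きです"
whitespace_tokens = sentence.split()

# Hypothetical, hand-written tokenizer output (assumption: a morphological
# analyzer such as MeCab would produce lists of this shape).
docs = {
    "post_cat": ["私", "は", "猫", "が", "好き", "です"],
    "post_dog": ["私", "は", "犬", "が", "好き", "です"],
    "post_python": ["Python", "で", "推薦", "システム", "を", "作る"],
}


def tf_idf(tokenized_docs):
    """Return {doc_name: {term: weight}} using tf * log(N / df)."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized_docs.values():
        df.update(set(tokens))

    weights = {}
    for name, tokens in tokenized_docs.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


weights = tf_idf(docs)

# The highest-weighted term of each document acts as its keyword.
top_terms = {name: max(w, key=w.get) for name, w in weights.items()}
```

In this toy corpus the particles and the copula shared by two posts receive a lower weight than the word that distinguishes each post, which is why such words surface as keywords.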
To create suggestions based on these keywords, we use Word2Vec, an unsupervised algorithm for learning the meaning behind words. Word2Vec takes a large corpus of text as input and produces a high-dimensional vector space (typically of several hundred dimensions), with each unique word in the corpus assigned a corresponding vector in that space.
By computing the vectors of our keywords and the distances between them, we can find similar words and suggest articles that contain them.
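As a rough illustration of this last step, the sketch below ranks words by cosine similarity and suggests articles that contain the closest ones. Everything in it is an assumption made for the example: the three-dimensional vectors are invented by hand rather than learned by Word2Vec, cosine similarity is one common choice of measure among several, and the names `vectors`, `articles`, `most_similar`, and `suggest_articles` are hypothetical.

```python
import math

# Hypothetical word vectors, made up for illustration. A trained Word2Vec
# model would supply vectors with several hundred dimensions instead.
vectors = {
    "猫": [0.9, 0.1, 0.0],
    "犬": [0.8, 0.2, 0.1],
    "寿司": [0.0, 0.9, 0.3],
    "ラーメン": [0.1, 0.8, 0.4],
}

# Hypothetical articles, each reduced to its set of keywords.
articles = {
    "pets": {"犬", "散歩"},
    "food": {"寿司", "ラーメン"},
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, topn=1):
    """Return the topn (word, similarity) pairs closest to the given word."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


def suggest_articles(keyword, topn=1):
    """Suggest articles containing any of the words closest to the keyword."""
    similar = {w for w, _ in most_similar(keyword, topn)}
    return [name for name, words in articles.items() if words & similar]


cat_suggestions = suggest_articles("猫")
sushi_suggestions = suggest_articles("寿司")
```

Note that the "pets" article never mentions 猫 itself; it is suggested only because 犬 lies close to 猫 in the toy vector space, which is the effect the keyword vectors are meant to provide.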