Data Community DC and District Data Labs are hosting a Natural Language Processing with R workshop on Saturday, November 21st, from 9am to 5pm. Register before November 7th for an early bird discount!
R is a powerful language for statistical computing. A prolific user community backs R with an extensive collection of packages, so for most common tasks somebody has already written a package. R also has a superb IDE, RStudio, which facilitates reproducible research.
This course is for people with some R programming experience. Students may or may not have any NLP experience. We will introduce base R’s text manipulation capabilities, use some frameworks for NLP in R, and call some commonly-used NLP algorithms. The course culminates in a full NLP project performed in R.
What You Will Learn
This course introduces R’s capabilities for text manipulation and natural language processing. NLP is an emerging field, and we will focus on some core NLP tasks and R libraries.
- Text manipulation with base R
- Document clustering
- Part of speech tagging
- Sentence parsing
- Named-entity recognition
- Topic modeling
Linguistic data can be large, so we will also learn how to track the use of system resources and cover some strategies for optimizing memory use and computation time.
- Reproducible research: Setting up an RStudio project and file structure.
- Review of R, RStudio, and R Markdown.
- CRAN Task View: Natural Language Processing.
- Importing text documents using the scan function and enc2utf8.
- Basic search and replace functionality with grep, grepl, gsub and more.
- Monitoring system resources.
- Basic counting: word counts, term frequency (tf), tf-idf, and word clouds.
- Document clustering.
- Sentence parsing, POS tagging, and entity extraction with Apache OpenNLP.
- Building a quick-and-dirty document summarizer.
- Introduction to topic modeling.
- Document classification.
- Final project: construct a reproducible data analysis using R Markdown and the NLP techniques covered.
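To give a flavor of the base R topics in the outline, here is a minimal sketch of search and replace with grepl and gsub, followed by word counts and tf-idf. The three toy documents and all variable names are made up for illustration, and the tf-idf weighting shown (raw count times log(N / document frequency)) is just one common variant, not necessarily the one used in the course.

```r
# Three toy documents (hypothetical, for illustration only).
docs <- c(
  doc1 = "The cat sat on the mat.",
  doc2 = "The dog chased the cat!",
  doc3 = "Dogs and cats are pets."
)

# grepl() returns one TRUE/FALSE per document for a pattern match.
has_dog <- grepl("dog", docs, ignore.case = TRUE)

# gsub() strips punctuation; tolower() normalizes case.
clean <- tolower(gsub("[[:punct:]]", "", docs))

# Split on whitespace to get one token vector per document.
tokens <- strsplit(clean, "\\s+")

# Build a document-term matrix of raw word counts.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
rownames(dtm) <- names(docs)
colnames(dtm) <- vocab

# tf-idf: weight each count by log(N / number of documents containing the term).
idf <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(dtm, 2, idf, `*`)
```

Words that appear in only one document, such as "mat", receive the largest idf weight, while words shared across documents, such as "the", are weighted down.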
After this course you will have used several methods and libraries for NLP. You will have completed a final project using several of these NLP techniques. You will have performed your work using reproducible research methods. This will allow you to revisit your work (and publish it on the web if you’d like).
Instructor: Tommy Jones
Tommy is a statistician, mathematician, or data scientist, depending on the problem or audience. He holds an MS in mathematics and statistics from Georgetown University and a BA in economics from the College of William and Mary. He is the Director of Data Science at Impact Research.
Tommy has previously performed economic and statistical modeling and analysis at the Science and Technology Policy Institute, the Federal Reserve Board, and the Institute for the Theory and Practice of International Relations. He has expertise in regression analyses, time series modeling and forecasting, natural language processing, data mining, and other quantitative techniques.