Blog Archives

RStudio addin – extend RStudio in your way

August 10, 2016

RStudio addins - first attempt: Recently I found that RStudio had begun to provide an addin mechanism. The examples looked simple, and the addin API looked easy to use. I immediately started writing one myself. It would be a good practice project for writing an R package, and I could implement some features I wanted that are not high on RStudio’s priority list.


Data Cleaning Part 2 – Geocoding Addresses, Double The Performance By Cleaning

February 3, 2016

Summary: This is my second post on the topic of Data Cleaning. Cleaning the address format turned out to have a substantial positive impact on Geocoding performance. A deep understanding of address format standards is needed to deal with all kinds of special cases.


Data Cleaning Part 1 – NYC Taxi Trip Data, Looking For Stories Behind Errors

January 31, 2016

Summary: Data cleaning is a cumbersome but important task in real-world Data Science projects. This is a discussion of my data cleaning practice for the NYC Taxi Trip data. A lot of domain knowledge, common sense, and business thinking is involved.


Script And Workflow For Batch Geocoding Millions Of Address With PostGIS Tiger Geocoder

November 19, 2015

Summary: I discuss the problems I met, the approaches I tried, and the improvements I achieved in the Geocoding task. There are many subtle details, some open questions, and areas that can be improved. The final working script and complete workflow are hosted on GitHub.


Geocoding 18 million addresses with PostGIS Tiger Geocoder

November 17, 2015

Summary: This post discusses the background, the approaches, and the Windows and Linux environment setup for my Geocoding task. See the next post for more details about the script and workflow.


Red Cross Smoke Alarm Project

November 11, 2015

Summary: This is a write-up of my volunteer Data Science project for the American Red Cross. The project used public data to help the Red Cross direct limited resources to homes that are more vulnerable to fire risk and loss. My work in the project: Discovered the hidden information in the NFIRS dataset, obtained and analyzed 10G of NFIRS data. Major contribution to model design, implemented NFIRS related...


Exploring House Price Estimation And Imagining A Better Real Estate Website

October 23, 2015

Summary: This personal project was inspired by my own experience in the housing market. I wondered what I could achieve with public data and Data Science methods. Built upon extensive research into domain knowledge, my model is very simple yet powerful. It uses the data with the most information and follows the trends in space and time. The prediction accuracy is comparable to Zillow...


Simple Python performance timing by checkpoints

October 20, 2015

Summary: This is a simple Python script that measures a Python program's running time at a fine-grained level. It is simpler than a full profiler and easier to use than other similar scripts currently available.
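To give a rough idea of what timing by checkpoints can look like, here is a minimal sketch. It is not the script from the post: the class name `CheckpointTimer` and its methods are hypothetical, chosen only for illustration, and the approach assumed here is simply to record a clock reading at each named checkpoint and report the time between consecutive ones.

```python
import time


class CheckpointTimer:
    """Hypothetical helper: record named checkpoints and report the gaps."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self._marks = [("start", clock())]

    def checkpoint(self, label):
        """Store the current clock reading under the given label."""
        self._marks.append((label, self._clock()))

    def intervals(self):
        """Return (label, seconds since the previous checkpoint) pairs."""
        return [
            (label, t - prev_t)
            for (_, prev_t), (label, t) in zip(self._marks, self._marks[1:])
        ]

    def report(self):
        """Format one line per checkpoint with its elapsed time."""
        return "\n".join(
            f"{label}: {seconds:.6f} s" for label, seconds in self.intervals()
        )


# Usage: drop a checkpoint after each stage you want to time.
timer = CheckpointTimer()
total = sum(range(100_000))
timer.checkpoint("sum")
squares = [i * i for i in range(100_000)]
timer.checkpoint("squares")
print(timer.report())
```

Each checkpoint costs one clock call and one list append, so it can sit between stages of a program without the setup a full profiler needs.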

