Articles by statcompute

Query Pandas DataFrame with SQL

November 1, 2014 | statcompute

Similar to the SQLDF package, which provides a seamless interface between SQL statements and R data.frames, PANDASQL allows Python users to query Pandas DataFrames with SQL. Below are some examples showing how to use PANDASQL to do SELECT / AGGREGATE / JOIN operations. More information is also available on the GitHub (https://github.... [Read more...]
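
As a rough illustration of the three operation types named above, the sketch below runs SELECT, AGGREGATE, and JOIN statements with Python's standard-library sqlite3 module rather than with PANDASQL itself. The table names, column names, and rows are hypothetical and are not taken from the original post.

```python
import sqlite3

# Hypothetical toy tables; none of these names come from the original post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.execute("CREATE TABLE customers (customer TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ann", 10.0), (2, "bob", 25.0), (3, "ann", 5.0)],
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("ann", "east"), ("bob", "west")],
)

# SELECT: filter rows
selected = conn.execute(
    "SELECT id, amount FROM orders WHERE amount > 8 ORDER BY id"
).fetchall()

# AGGREGATE: total amount per customer
aggregated = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# JOIN: attach the region to each order
joined = conn.execute(
    "SELECT o.id, c.region FROM orders o "
    "JOIN customers c ON o.customer = c.customer ORDER BY o.id"
).fetchall()
```

The same SQL statements could be pointed at DataFrames through a package such as PANDASQL; only the data source would change.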

Flexible Beta Modeling

October 27, 2014 | statcompute

[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers].

library(betareg)
library(sas7bdat)

df1 <- read.sas7bdat('lgd.sas7bdat')
df2 <- df1[df1$y < 1, ]

fml <- as.formula('y ~ x2 + x3 + x4 + x5 + x6 | x3 + x4 | x1 + x2')

### LATENT-CLASS BETA REGRESSION: AIC = -565 ###
mdl1 <- betamix(fml, data = df2, k = 2, FLXcontrol = list(iter.max = 500, minprior = 0.1))
print(mdl1)
#betamix(formula = fml, data = df2, k = 2, FLXcontrol = list(iter.max = 500,
#    minprior = 0.1))
#
#Cluster sizes:
#  1   2
#157 959

summary(mdl1, which = 'concomitant')
#            Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.35153    0.41988 -3.2188 0.001287 **
#x1           2.92537    1.13046  2.5878 0.009660 **
#x2           2.82809    1.42139  1.9897 0.046628 *

summary(mdl1)
#$Comp.1$mean
#              Estimate Std. Error z value  Pr(>|z|)
#(Intercept) -0.8963228  1.0385545 -0.8630 0.3881108
#x2           3.1769062  0.6582108  4.8266 1.389e-06 ***
#x3          -0.0520060  0.0743714 -0.6993 0.4843805
#x4           4.9642998  1.4204071  3.4950 0.0004741 ***
#x5           0.0021647  0.0022659  0.9554 0.3393987
#x6           0.0248573 [...]
[Read more...]

Model Segmentation with Recursive Partitioning

October 26, 2014 | statcompute

library(party)

df1 <- read.csv("credit_count.csv")
df2 <- df1[df1$CARDHLDR == 1, ]

mdl <- mob(DEFAULT ~ MAJORDRG + MINORDRG + INCOME + OWNRENT | AGE + SELFEMPL,
           data = df2, family = binomial(), control = mob_control(minsplit = 1000),
           model = glinearModel)
print(mdl)
#1) AGE <= 22.91667; criterion = 1, statistic = 48.255
#  2)* weights = 1116
#Terminal node model
#Binomial GLM with coefficients:
#(Intercept)     MAJORDRG     MINORDRG       INCOME      OWNRENT
# -0.6651905    0.0633978    0.5182472   -0.0006038    0.3071785
#
#1) AGE > 22.91667
#  3)* weights = 9383
#Terminal node model
#Binomial GLM with coefficients:
#(Intercept)     MAJORDRG     MINORDRG       INCOME      OWNRENT
# -1.4117010    0.2262091    0.2067880   -0.0003822   -0.2127193

### TEST FOR STRUCTURAL CHANGE ###
sctest(mdl, node = 1)
#                   AGE    SELFEMPL
#statistic 4.825458e+01 20.88612025
#p.value   1.527484e-07  0.04273836

summary(mdl, node = 2)
#Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
#(Intercept) -0.6651905  0.2817480  -2.361 0.018229 *
#MAJORDRG     0.0633978  0.3487305   0.182 0.855743
#MINORDRG     0.5182472  0.2347656   2.208 0.027278 *
#INCOME      -0.0006038 [...]
[Read more...]

Estimating a Beta Regression with The Variable Dispersion in R

October 19, 2014 | statcompute

pkgs <- c('sas7bdat', 'betareg', 'lmtest')
lapply(pkgs, require, character.only = T)

df1 <- read.sas7bdat("lgd.sas7bdat")
df2 <- df1[which(df1$y < 1), ]

xvar <- paste("x", 1:7, sep = '', collapse = " + ")
fml1 <- as.formula(paste("y ~ ", xvar))
fml2 <- as.formula(paste("y ~ ", xvar, "|", xvar))

# FIT A BETA MODEL WITH THE FIXED PHI
beta1 <- betareg(fml1, data = df2)
summary(beta1)
# Coefficients (mean model with logit link):
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.500242   0.329670  -4.551 5.35e-06 ***
# x1           0.007516   0.026020   0.289 0.772680
# x2           0.429756   0.135899   3.162 0.001565 **
# x3           0.099202   0.022285   4.452 8.53e-06 ***
# x4           2.465055   0.415657   5.931 3.02e-09 ***
# x5          -0.003687   0.001070  -3.446 0.000568 ***
# x6           0.007181   0.001821   3.943 8.06e-05 ***
# x7           0.128796   0.186715   0.690 0.490319
#
# Phi coefficients (precision model with identity link):
#       Estimate Std. Error z value Pr(>|z|)
# (phi)   3.6868     0.1421   25.95   <2e-16 [...]
[Read more...]

By-Group Aggregation in Parallel

October 4, 2014 | statcompute

Similar to the row search, by-group aggregation is another perfect use case to demonstrate the power of split-and-conquer with parallelism. The example below shows that a homebrew by-group aggregation with the foreach package, albeit inefficiently coded, is still a lot faster than the summarize() function in the Hmisc package. [Read more...]
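
The split-and-conquer idea described above can be sketched in Python as follows. This is not the foreach code from the post: the function names and sample rows are hypothetical, and a thread pool is used only to keep the example self-contained, so it shows the structure of the approach rather than the speed-up the post reports.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_by_group(rows):
    """Split (key, value) rows into one list of values per group key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def group_mean(item):
    """Aggregate a single group: return its key and the mean of its values."""
    key, values = item
    return key, sum(values) / len(values)


def parallel_group_mean(rows, workers=4):
    """Split rows by group, aggregate each group in a worker, then combine."""
    groups = split_by_group(rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(group_mean, groups.items()))


# Hypothetical sample data
rows = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0), ("c", 5.0)]
means = parallel_group_mean(rows)
```

For CPU-bound aggregation in Python, a process pool would usually be the closer analogue of a parallel foreach backend.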

Vector Search vs. Binary Search

October 1, 2014 | statcompute

# REFERENCE:
# user2014.stat.ucla.edu/files/tutorial_Matt.pdf

pkgs <- c('data.table', 'rbenchmark')
lapply(pkgs, require, character.only = T)

load('2008.Rdata')
dt <- data.table(data)

benchmark(replications = 10, order = "elapsed",
  vector_search = {
    test1 <- dt[ArrTime == 1500 & Origin == 'ABE', ]
  },
  binary_search = {
    setkey(dt, ArrTime, Origin)
    test2 <- dt[.(1500, 'ABE'), ]
  }
)
#            test replications elapsed relative user.self sys.self user.child
# 2 binary_search           10   0.335    1.000     0.311    0.023          0
# 1 vector_search           10   7.245   21.627     7.102    0.131          0
[Read more...]
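
To make the difference between the two lookups concrete, here is a small Python sketch of a full scan versus a binary search on a sorted key. It is an analogy only and does not reproduce how data.table works internally; the sample rows and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (ArrTime, Origin, flight id)
rows = [
    (1500, "ABE", "f1"),
    (900, "JFK", "f2"),
    (1500, "ABE", "f3"),
    (1500, "ORD", "f4"),
]


def vector_search(rows, key):
    """Full scan: compare every row against the key, O(n) per lookup."""
    return [row for row in rows if row[:2] == key]


# One-off sort on the key columns, loosely analogous to setkey()
sorted_rows = sorted(rows)
keys = [row[:2] for row in sorted_rows]


def binary_search(key):
    """Locate the block of matching rows in the sorted keys, O(log n) per lookup."""
    lo = bisect_left(keys, key)
    hi = bisect_right(keys, key)
    return sorted_rows[lo:hi]


scan_hits = vector_search(rows, (1500, "ABE"))
bsearch_hits = binary_search((1500, "ABE"))
```

The sort is paid once, so the binary search pays off when the same key columns are queried repeatedly.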

Row Search in Parallel

September 28, 2014 | statcompute

I’ve always wondered whether the efficiency of row search can be improved if the whole data.frame is split into chunks and the row search is then conducted within each chunk in parallel. In the R code below, a comparison is done between the standard row search and ... [Read more...]

Chain Operations: An Interesting Feature in dplyr Package

July 28, 2014 | statcompute

library(data.table)
library(dplyr)

data1 <- fread('/home/liuwensui/Downloads/2008.csv', header = T, sep = ',')
dim(data1)
# [1] 7009728      29

# chain the operations: filter rows, pick columns, group, summarize, sort
data2 <- data1 %>%
  filter(Year == 2008, Month %in% c(1, 2, 3, 4, 5, 6)) %>%
  select(Year, Month, AirTime) %>%
  group_by(Year, Month) %>%
  summarize(avg_time = mean(AirTime, na.rm = TRUE)) %>%
  arrange(desc(avg_time))
print(data2)
#   Year Month avg_time
# 1 2008     3 106.1939
# 2 2008     2 105.3185
# 3 2008     6 104.7604
# 4 2008     1 104.6181
# 5 2008     5 104.3720
# 6 2008     4 104.2694
[Read more...]

Efficiency of Importing Large CSV Files in R

February 10, 2014 | statcompute

### size of csv file: 689.4MB (7,009,728 rows * 29 columns) ###

system.time(read.csv('../data/2008.csv', header = T))
#   user  system elapsed
# 88.301   2.416  90.716

library(data.table)
system.time(fread('../data/2008.csv', header = T, sep = ','))
#   user  system elapsed
#  4.740   0.048   4.785

library(bigmemory)
system.time(read.big.matrix('../data/2008.csv', header = T))
#   user  system elapsed
# 59.544   0.764  60.308

library(ff)
system.time(read.csv.ffdf(file = '../data/2008.csv', header = T))
#   user  system elapsed
# 60.028   1.280  61.335

library(sqldf)
system.time(read.csv.sql('../data/2008.csv'))
#   user  system elapsed
# 87.461   3.880  91.447
[Read more...]

Julia and SQLite

February 8, 2014 | statcompute

Similar to R and Pandas in Python, Julia provides a simple yet efficient interface to SQLite databases. In addition, the sqldf() function in the SQLite package, which is almost identical to the sqldf package in R, is extremely handy for data munging. [Read more...]

rPython – R Interface to Python

October 13, 2013 | statcompute

> library(rPython)
Loading required package: RJSONIO
> ### load r data.frame ###
> data(iris)
> r_df1 <- iris
> class(r_df1)
[1] "data.frame"
> head(r_df1, n = 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> ### pass r data.frame to python dict ###
> python.assign('py_dict1', r_df1)
> python.exec('print type(py_dict1)')
<type 'dict'>
> ### convert python dict to pandas DataFrame ###
> python.exec('import pandas as pd')
> python.exec('py_df = pd.DataFrame(py_dict1)')
> python.method.call('py_df', 'info')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 5 columns):
Petal.Length    150  non-null values
Petal.Width     150  non-null values
Sepal.Length    150  non-null values
Sepal.Width     150  non-null values
Species         150  non-null values
dtypes: float64(4), object(1)NULL
> python.exec('print py_df.head(3)')
   Petal.Length  Petal.Width  Sepal.Length  Sepal.Width Species
0           1.4          0.2           5.1          3.5  setosa
1           1.4          0.2           4.9          3.0  setosa
2           1.3          0.2           4.7          3.2  setosa
> [...]
[Read more...]

Prototyping Multinomial Logit with R

August 21, 2013 | statcompute

Recently, I have been working on a new modeling proposal based on competing risks and need to prototype multinomial logit models with R. There are R packages implementing multinomial logit models that I’ve tested, namely nnet and vgam. Model outputs with iris data are shown below. However, in my ... [Read more...]

GRNN and PNN

June 23, 2013 | statcompute

From the technical perspective, people would usually choose GRNN (general regression neural network) for function approximation with a continuous response variable and use PNN (probabilistic neural network) for pattern recognition / classification problems with categorical outcomes. However, from the practical standpoint, it is often not necessary to draw a ... [Read more...]

General Regression Neural Network with R

June 16, 2013 | statcompute

Similar to the back propagation neural network, the general regression neural network (GRNN) is also a good tool for function approximation in the modeling toolbox. Proposed by Specht in 1991, GRNN has the advantages of instant training and easy tuning. A GRNN would be formed instantly with just a 1-pass training ... [Read more...]
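
As a minimal sketch of why training is instant, the Python function below implements the usual textbook form of a GRNN prediction: a Gaussian-kernel weighted average of the stored training targets, with the smoothing parameter sigma as the only tuning knob. It is not the R code from the post, and the function name, argument names, and data are hypothetical.

```python
import math


def grnn_predict(train_x, train_y, x, sigma):
    """Predict y at x as a Gaussian-kernel weighted average of train_y.

    "Training" is just storing train_x and train_y; sigma controls smoothing.
    """
    weights = []
    for xi in train_x:
        # squared Euclidean distance between the query point and a stored pattern
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; try a larger sigma")
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Hypothetical one-dimensional training set lying on y = x
train_x = [(0.0,), (1.0,), (2.0,)]
train_y = [0.0, 1.0, 2.0]

smooth = grnn_predict(train_x, train_y, (1.0,), sigma=0.5)
sharp = grnn_predict(train_x, train_y, (0.0,), sigma=0.01)
```

A small sigma makes the prediction follow the nearest stored pattern, while a large sigma pulls it toward the overall mean of the targets, so sigma is typically chosen by validation.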

Improve The Efficiency in Joining Data with Index

June 9, 2013 | statcompute

When managing big data with R, many people like to use the sqldf package due to its friendly interface or choose the data.table package for its lightning speed. However, very few would pay special attention to small details that might significantly boost the efficiency of these packages by adding index to ... [Read more...]
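
The general idea of indexing the join key can be sketched with Python's standard-library sqlite3 module. This is an illustration only, not the sqldf or data.table code from the post; the table names, index name, and data are hypothetical, and whether an index helps in practice depends on the data size and the query planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE left_tbl (id INTEGER, x REAL)")
conn.execute("CREATE TABLE right_tbl (id INTEGER, y REAL)")
conn.executemany(
    "INSERT INTO left_tbl VALUES (?, ?)",
    [(i, i * 1.0) for i in range(1000)],
)
conn.executemany(
    "INSERT INTO right_tbl VALUES (?, ?)",
    [(i, i * 2.0) for i in range(0, 1000, 2)],
)

query = (
    "SELECT l.id, l.x, r.y FROM left_tbl l "
    "JOIN right_tbl r ON l.id = r.id ORDER BY l.id"
)

# Join without an explicit index on the key
before = conn.execute(query).fetchall()

# Add an index on the join key, then run the identical query again
conn.execute("CREATE INDEX idx_right_id ON right_tbl (id)")
after = conn.execute(query).fetchall()

# The planner's description of how it will execute the join (format varies by version)
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
```

The index changes how the rows are located, not which rows are returned, so the two result sets are identical.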

Estimating Finite Mixture Models with Flexmix Package

June 9, 2013 | statcompute

In my post on 06/05/2013 (http://statcompute.wordpress.com/2013/06/05/estimating-composite-models-for-count-outcomes-with-fmm-procedure), I’ve shown how to estimate finite mixture models, e.g. zero-inflated Poisson and 2-class finite mixture Poisson models, with the FMM and NLMIXED procedures in SAS. Today, I am going to demonstrate how to achieve the same results with the flexmix package ... [Read more...]
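
For readers unfamiliar with the two model families named above, the Python sketch below writes out their probability mass functions in the standard textbook form. It only evaluates the densities and does not estimate anything, so it is not a substitute for flexmix or the SAS procedures; the function and parameter names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: a structural zero with probability pi,
    otherwise a Poisson(lam) count."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def mixture_poisson_pmf(k, lam1, lam2, w):
    """2-class finite mixture of Poissons with weight w on the first class."""
    return w * poisson_pmf(k, lam1) + (1.0 - w) * poisson_pmf(k, lam2)


# Hypothetical parameter values
p_zero = zip_pmf(0, lam=2.0, pi=0.3)
zip_total = sum(zip_pmf(k, lam=2.0, pi=0.3) for k in range(60))
mix_total = sum(mixture_poisson_pmf(k, 1.0, 5.0, 0.4) for k in range(60))
```

A zero-inflated Poisson is the special case of the 2-class mixture in which one class is fixed at zero, which is why both can be handled by the same mixture-modeling tools.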