It has been several months since my last post on classification tree models, because two things have been consuming all of my spare time. The first is that I taught a night class for the University of Connecticut’s Graduate School of Business, introducing R to students with little or ... [Read more...]

My last post focused on the use of the ctree procedure in the R package party to build classification tree models. These models map each record in a dataset into one of M mutually exclusive groups, which are characterized by their average response. For responses coded as 0 or 1, this average ... [Read more...]
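As a rough base-R illustration of this idea (a single hand-built split, not the ctree procedure itself), each "leaf" group can be characterized by its average 0/1 response, which estimates the probability of a positive response in that group:

```r
# Hand-built illustration of the tree idea: split records into groups and
# characterize each group by its average 0/1 response.
set.seed(7)
x <- runif(200)
y <- rbinom(200, 1, ifelse(x > 0.5, 0.8, 0.2))   # response probability depends on x

group <- ifelse(x > 0.5, "x > 0.5", "x <= 0.5")   # one split, i.e., a two-leaf "tree"
tapply(y, group, mean)                            # per-group positive-response rates
```

A real tree-fitting procedure like ctree chooses such splits automatically, and recursively, but the group summaries it reports are averages of exactly this kind.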

On March 26, I attended the Connecticut R Meetup in New Haven, which featured a talk by Illya Mowerman on decision trees in R. I have gone to these Meetups before, and I have always found them to be interesting and informative. Attendees range from those who are just starting to ... [Read more...]

One of the topics emphasized in Exploring Data in Engineering, the Sciences and Medicine is the damage outliers can do to traditional data characterizations. Consequently, one of the procedures to be included in the ExploringData package is FindOutliers, described in this post. Given a vector of numeric values, this procedure ... [Read more...]
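Pending the ExploringData release, a minimal base-R sketch of the basic idea, assuming a simple three-sigma edit rule (the actual FindOutliers procedure may support other detection rules and options), might look like this:

```r
# Hypothetical sketch of a three-sigma outlier detector, in the spirit of
# the FindOutliers procedure described above (not the actual implementation).
FindOutliersSketch <- function(x, threshold = 3) {
  mu <- mean(x)
  sigma <- sd(x)
  # Flag points more than `threshold` standard deviations from the mean
  which(abs(x - mu) > threshold * sigma)
}

x <- c(rep(0, 20), 100)   # 20 well-behaved points plus one gross outlier
FindOutliersSketch(x)     # returns 21, the index of the outlier
```

Note that this simple rule is itself built from outlier-sensitive quantities (the mean and standard deviation), one reason more robust alternatives are worth considering.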

The October 2012 issue of Harvard Business Review prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles: “Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61–68; “Data Scientist: The Sexiest Job of the 21st Century,” by Thomas ... [Read more...]

In my last post, I promised a further examination of the spacing measures described there, and I still intend to do that, but I am changing the order of topics slightly. So, instead of spacing measures, today’s post is about the DataframeSummary procedure to be included in the ... [Read more...]

Numerically coded data sequences can exhibit a very wide range of distributional characteristics, including near-Gaussian (historically, the most popular working assumption), strongly asymmetric, light- or heavy-tailed, multi-modal, or discrete (e.g., count data). In addition, numerically coded values can be effectively categorical, either ordered or unordered. A specific example that illustrates ... [Read more...]

In my last post, I described and demonstrated the CountSummary procedure to be included in the ExploringData package that I am in the process of developing. This procedure generates a collection of graphical data summaries for a count data sequence, based on the distplot, Ord_plot, and Ord_estimate functions ... [Read more...]
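The quantity behind an Ord plot can be sketched in base R without the vcd package (this illustrates the diagnostic ratio only, not the Ord_plot implementation): for observed frequencies f_k, the ratio u_k = k f_k / f_(k-1) is roughly linear in k, and its slope and intercept help identify the count distribution.

```r
# Base-R sketch of the Ord-plot diagnostic ratio for count data.
# For Poisson data, u_k = k * f_k / f_(k-1) is roughly constant,
# with level approximately equal to lambda.
set.seed(5)
counts <- rpois(5000, lambda = 2)
f <- table(factor(counts, levels = 0:max(counts)))  # frequencies f_0, f_1, ...
k <- 1:(length(f) - 1)
u <- k * as.numeric(f[-1]) / as.numeric(f[-length(f)])
round(u, 2)   # roughly constant near lambda = 2 for small k, noisier in the tail
```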

In a comment in response to my latest post, Robert Young took issue with my characterization of grid as an R graphics package. Perhaps grid is better described as a “graphics support package,” but my primary point – and the main point of this post – is that to generate the display ... [Read more...]

About this time last month, I attended the 2012 UseR! Meeting. Now an annual event, this conference series started in Europe in 2004 as an every-other-year gathering, and it now seems to alternate between the U.S. and Europe. This year’s meeting was held on the Vanderbilt University campus in Nashville, TN, ... [Read more...]

In my last post, I considered the shifts in two interestingness measures as possible tools for selecting variables in classification problems. Specifically, I considered the Gini and Shannon interestingness measures applied to the 22 categorical mushroom characteristics from the UCI mushroom dataset. The proposed variable selection strategy was to compare these ... [Read more...]

In three previous posts (April 3, 2011, April 12, 2011, and May 21, 2011), I have discussed interestingness measures, which characterize the distributional heterogeneity of categorical variables. Four specific measures are discussed in Chapter 3 of Exploring Data in Engineering, the Sciences and Medicine: the Bray measure, the Gini measure, the Shannon measure, and the Simpson measure. ... [Read more...]
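Two of these measures are easy to compute in base R; the sketch below uses common conventions, which may differ in normalization details from the book's definitions:

```r
# Sketch of two heterogeneity ("interestingness") measures for a categorical
# variable; normalizations here follow common conventions and may differ in
# detail from the book's definitions.
GiniMeasure <- function(x) {
  p <- table(x) / length(x)
  1 - sum(p^2)       # 0 for a single level; 1 - 1/M for M equally likely levels
}
ShannonMeasure <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log(p))   # 0 for a single level; log(M) for M equally likely levels
}

x <- c(rep("a", 5), rep("b", 5))   # perfectly balanced two-level variable
GiniMeasure(x)                     # 0.5
ShannonMeasure(x)                  # log(2), about 0.693
```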

As I have discussed in a number of previous posts, the median represents a well-known and widely-used estimate of the “center” of a data sequence. Relative to the better-known mean, the primary advantage of the median is its much reduced outlier sensitivity. This post briefly describes a simple confidence interval ... [Read more...]
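One standard distribution-free construction, based on binomial order statistics, can be sketched in base R (the details of the interval discussed in the post may differ):

```r
# Distribution-free confidence interval for the median, based on binomial
# order statistics (one common construction).
MedianCI <- function(x, alpha = 0.05) {
  n <- length(x)
  xs <- sort(x)
  l <- qbinom(alpha / 2, n, 0.5)          # lower order-statistic rank
  c(lower = xs[l + 1], upper = xs[n - l]) # symmetric pair of order statistics
}

MedianCI(1:100)   # lower = 41, upper = 60
```

Because the endpoints are order statistics rather than moment-based quantities, this interval inherits the median's insensitivity to outliers.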

The problem of outliers – data points that are substantially inconsistent with the majority of the other points in a dataset – arises frequently in the analysis of numerical data. The practical importance of outliers lies in the fact that even a few of these points can badly distort the results of ... [Read more...]
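A quick base-R illustration of this distortion, contrasting the outlier-sensitive mean with the much more robust median:

```r
x <- c(10, 11, 9, 10, 12, 10, 11)   # well-behaved data
mean(x)                             # 10.43
median(x)                           # 10

y <- c(x, 1000)                     # one gross outlier appended
mean(y)                             # 134.125, badly distorted by a single point
median(y)                           # 10.5, barely moved
```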

It is often useful to know how strongly or weakly two variables are associated: do they vary together or are they essentially unrelated? In the case of numerical variables, the best-known measure of association is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century. ... [Read more...]
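In base R, Pearson's coefficient is available as cor(); implementing the definition directly makes the formula explicit:

```r
# Product-moment (Pearson) correlation, implemented from its definition
# and checked against base R's cor().
pearson <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}

x <- 1:20
y <- 2 * x + 3                 # exact linear relationship
pearson(x, y)                  # 1, perfect positive association
all.equal(pearson(x, y), cor(x, y))   # TRUE
```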

In my last post, I discussed the Hampel filter, a useful moving window nonlinear data cleaning filter that is available in the R package pracma. In this post, I briefly discuss this moving window filter in a little more detail, focusing on two important practical points: the choice of the ... [Read more...]
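For illustration, here is a base-R sketch of such a moving-window filter, conceptually similar to (but not the same code as) pracma's hampel, with window half-width k and threshold t0 in scaled-MAD units:

```r
# Base-R sketch of a moving-window Hampel filter (illustration only, not
# the pracma implementation). k = window half-width; t0 = threshold.
HampelSketch <- function(x, k = 3, t0 = 3) {
  y <- x
  n <- length(x)
  for (i in (k + 1):(n - k)) {          # edges are left untouched here
    window <- x[(i - k):(i + k)]
    m <- median(window)
    s <- 1.4826 * median(abs(window - m))  # MAD, scaled for Gaussian consistency
    if (s > 0 && abs(x[i] - m) > t0 * s) y[i] <- m  # replace flagged point
  }
  y
}

x <- sin(seq(0, 2 * pi, length.out = 50))
x[25] <- 10                 # inject a spike
y <- HampelSketch(x)
y[25]                       # spike replaced by the local median
```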

The need to analyze time-series or other forms of streaming data arises frequently in many different application areas. Examples include economic time-series like stock prices, exchange rates, or unemployment figures, biomedical data sequences like electrocardiograms or electroencephalograms, or industrial process operating data sequences like temperatures, pressures or concentrations. As a ... [Read more...]

In my last few posts, I have considered “long-tailed” distributions whose probability density decays much more slowly than standard distributions like the Gaussian. For these slowly-decaying distributions, the harmonic mean often turns out to be a much better (i.e., less variable) characterization than the arithmetic mean, which is generally ... [Read more...]
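A quick base-R comparison on a simulated Pareto sample (generated via the inverse-CDF method; the parameter values here are arbitrary choices for illustration):

```r
# Harmonic vs. arithmetic mean for a heavy-tailed (Pareto type I) sample.
# The harmonic mean is the reciprocal of the mean reciprocal value.
harmonic_mean <- function(x) 1 / mean(1 / x)

set.seed(3)
alpha <- 1.5
x <- 1 / runif(10000)^(1 / alpha)   # Pareto type I sample, xm = 1, via inverse CDF

mean(x)            # dominated by the largest few values; highly variable
harmonic_mean(x)   # close to its population value (alpha + 1) / alpha, much more stable
```

For this alpha the variance of the distribution is infinite, so the arithmetic mean fluctuates wildly across repeated samples, while the harmonic mean concentrates near 5/3.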

In my last few posts, I have been discussing some of the consequences of the slow decay rate of the tail of the Pareto type I distribution, along with some other, closely related notions, all in the context of continuously distributed data. Today’s post considers the Zipf distribution for ... [Read more...]
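For reference, the finite-support Zipf probabilities are proportional to 1/k^s over ranks k = 1, ..., N; a minimal base-R sketch (function and parameter names here are generic, not from any particular package):

```r
# Finite-support Zipf probabilities: P(k) proportional to 1/k^s, k = 1..N.
dzipf <- function(k, N, s) {
  (1 / k^s) / sum(1 / (1:N)^s)
}

probs <- dzipf(1:10, N = 10, s = 1)
round(probs, 3)   # rank 1 gets the largest share; slow power-law decay
sum(probs)        # 1
```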

In response to my last post, “The Long Tail of the Pareto Distribution,” Neil Gunther had the following comment: “Unfortunately, you've fallen into the trap of using the ‘long tail’ misnomer. If you think about it, it can't possibly be the length of the tail that sets distributions like Pareto ... [Read more...]
