Blog Archives

A question of model uncertainty

It has been several months since my last post on classification tree models, because two things have been consuming all of my spare time.  The first is that I taught a night class for the University of Connecticut’s Graduate School of Business, introducing R to students with little or no prior exposure to either R or programming.  My hope...

Assessing the precision of classification tree model predictions

My last post focused on the use of the ctree procedure in the R package party to build classification tree models.  These models map each record in a dataset into one of M mutually exclusive groups, which are characterized by their average response.  For responses coded as 0 or 1, this average may be regarded as an estimate of...

Classification Tree Models

On March 26, I attended the Connecticut R Meetup in New Haven, which featured a talk by Illya Mowerman on decision trees in R.  I have gone to these Meetups before, and I have always found them to be interesting and informative.  Attendees range from those who are just starting to explore R to those who have multiple CRAN...

Finding outliers in numerical data

One of the topics emphasized in Exploring Data in Engineering, the Sciences and Medicine is the damage outliers can do to traditional data characterizations.  Consequently, one of the procedures to be included in the ExploringData package is FindOutliers, described in this post.  Given a vector of numeric values, this procedure supports four different methods for identifying possible outliers.  Before...

Data Science, Data Analysis, R and Python

The October 2012 issue of Harvard Business Review prominently features the words “Getting Control of Big Data” on the cover, and the magazine includes these three related articles: “Big Data: The Management Revolution,” by Andrew McAfee and Erik Brynjolfsson, pages 61–68; “Data Scientist: The Sexiest Job of the 21st Century,” by Thomas H. Davenport and D.J. Patil, pages...

Characterizing a new dataset

In my last post, I promised a further examination of the spacing measures I described there, and I still promise to do that, but I am changing the order of topics slightly.  So, instead of spacing measures, today’s post is about the DataframeSummary procedure to be included in the ExploringData package, which I also mentioned in my last post...

Spacing measures: heterogeneity in numerical distributions

Numerically coded data sequences can exhibit a very wide range of distributional characteristics, including near-Gaussian (historically, the most popular working assumption), strongly asymmetric, light- or heavy-tailed, multi-modal, or discrete (e.g., count data).  In addition, numerically coded values can be effectively categorical, either ordered or unordered.  A specific example that illustrates the range of distributional behavior often seen in a collection...

Implementing the CountSummary Procedure

In my last post, I described and demonstrated the CountSummary procedure to be included in the ExploringData package that I am in the process of developing.  This procedure generates a collection of graphical data summaries for a count data sequence, based on the distplot, Ord_plot, and Ord_estimate functions from the vcd package.  The distplot function generates both the Poissonness...

Base versus grid graphics

In a comment in response to my latest post, Robert Young took issue with my characterization of grid as an R graphics package. Perhaps grid is better described as a “graphics support package,” but my primary point – and the main point of this post – is that to generate the display you want, it is sometimes necessary to use commands...

Graphical insights from the 2012 UseR! Meeting

About this time last month, I attended the 2012 UseR! Meeting.  Now an annual event, this series of conferences started in Europe in 2004 as an every-other-year gathering that now seems to alternate between the U.S. and Europe.  This year’s meeting was held on the Vanderbilt University campus in Nashville, TN, and it was attended by about 500 R aficionados,...
