Plotting PDQ Output with R
With just a single queue (K = 1) the system saturates very quickly. The throughput curve shoots up the y-axis until it hits the ceiling at X = 2.0 requests per unit time. Consequently, the linearly rising early part of the throughput curve is almost indistinguishable from the optimal load-line at N* = 1.016 clients. This rapid saturation is less pronounced in a system with more queues because there are more service stages, so each request takes longer to complete. But it takes a considerable number of additional queueing centers to make a noticeable difference, e.g., K = 20 or 50. Observe also that the optimal load-line moves to the right and sits on the x-axis at a value very close to K. I'll let you ponder why that must be true.
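For anyone who wants to reproduce the shape of these curves without installing PDQ, here is a minimal sketch in base R. It is not the PDQ-R script behind the plot; it swaps in a hand-rolled exact Mean Value Analysis (MVA) loop for a closed network of K identical queues. The function names `mva_throughput` and `n_opt` are made up for this illustration, and the service time S = 0.5 and think time Z = 0.008 are assumptions chosen only because they are consistent with the ceiling at X = 2.0 and the load-line at N* = 1.016 for K = 1.

```r
# Exact Mean Value Analysis (MVA) for a closed network of K identical
# queueing centers. S and Z are ASSUMED values, not taken from the original
# PDQ model: S = 0.5 gives the ceiling X = 1/S = 2.0, and Z = 0.008
# reproduces N* = 1.016 when K = 1.
mva_throughput <- function(N, K, S = 0.5, Z = 0.008) {
  Q <- rep(0, K)                  # mean queue length per center, empty system
  X <- numeric(N)                 # throughput for n = 1..N clients
  for (n in seq_len(N)) {
    R_k  <- S * (1 + Q)           # residence time per center (arrival theorem)
    X[n] <- n / (sum(R_k) + Z)    # response-time law for a closed system
    Q    <- X[n] * R_k            # Little's law applied to each center
  }
  X
}

# Optimal load point: N* = (R_min + Z) / S_max, with R_min = K * S here
n_opt <- function(K, S = 0.5, Z = 0.008) (K * S + Z) / S

N <- 100
curves <- sapply(c(1, 20, 50), function(K) mva_throughput(N, K))
colnames(curves) <- c("K=1", "K=20", "K=50")

print(round(curves[c(1, 2, 5, 50, 100), ], 3))  # throughput at selected loads
print(sapply(c(1, 20, 50), n_opt))              # load-line position for each K
```

Passing the `curves` matrix to `matplot()` with `type = "l"` should draw the three throughput curves on one set of axes, and `abline(v = ...)` at the `n_opt` values would add the load-lines.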
The plot also explains the rationale for the approach I took in Chap. 10 of the Perl PDQ book where I modeled the scalability measurements of a multi-tier web application. In addition to the measured tiers, I ended up introducing 12 "dummy" queues in order to produce the correct round-trip latency, whilst retaining Z = 0 think time in accord with the original web application test scripts. The stunningly powerful conclusion was that there must've been additional latencies that were not included in the original measurements on the test rig. Otherwise, the data that were measured could not be reconciled with each other. Although I couldn't determine what the sources of those hidden latencies were, I could state quite categorically that they were real. You cannot possibly reach this kind of penetrating conclusion without a performance model. Data comes from the Devil, models come from God.
I didn't include the corresponding plots showing the effect of the dummy queues (similar to the above) in my Perl PDQ book because it was so tedious to write the data out to a file and then import it into Excel (which is what I was using back then). With PDQ-R, it's a snap to do it in about 50 lines.
R in The Windy City
In honor of my move to Chicago, the powers who abide have decided to hold the first annual “R/Finance conference for applied finance using R” in Chicago this year. The dates are April 24-25, 2009.
R/Finance 2009: Applied Finance with R
To those who made the decision on location, I’m pleased but slightly embarrassed that you let my relocation decision have such a profound impact on your venue choice.
And to the three readers who FeedBurner tells me regularly read this blog (hi Mom), if you are attending this conference please let me know and I’ll buy you a beer or three.
-JD
Data Analysis Workflow… Part 1 of Infinity
One of the many things that I sit around pondering when I should be doing productive things is the idea of analytical workflow. I have only worked with one analytical guru who I felt really gave thought and structure to workflow and its impact on analyst productivity. When I talk about workflow I mean the whole process from the time the analytical guy thinks, “Hey, I need to understand the velocity of new purchases between different types of sales campaigns,” until he writes down his findings in a presentation or even just a notebook. In the middle I assume this guy extracts some data from a warehouse or live system, does some work on said data, tests some theories, does more stuff, goes and gets coffee, comes back and plays some flash games, goes home and does it again the next day.
Today I was reading over at Data Evolution about a presentation on how Google and Facebook use R. The following was a summary of what Bo Cowgill of Google said about his workflow:
The typical workflow that Bo thus described for using R was: (i) pulling data with some external tool, (ii) loading it into R, (iii) performing analysis and modeling within R, (iv) implementing a resulting model in Python or C++ for a production environment.
I found this interesting as I have been masticating on the idea of learning Python for some time. I have run into situations where R was slow, but generally I have solved those through rethinking my algorithm. I’m not really a good programmer in R (or any other language for that matter), but I do want/need/like the statistical functions and ease of plotting in R. If I do learn Python I’ll certainly use it to call R… but maybe I should just stick to R.
This has nothing to do with workflow, but the most thought provoking insights in the article above came from Itamar Rosenn at Facebook:
Itamar’s team used recursive partitioning (via the rpart package) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.
… [they also] found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.
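For readers who haven't met recursive partitioning before, here is a minimal sketch of what a call to rpart looks like. Everything in it is invented for illustration: the data are simulated, and the column names (`multi_session`, `has_profile`, `noise`, `retained`) and the retention probabilities are my assumptions, not anything from Facebook's analysis.

```r
library(rpart)  # recommended package that ships with standard R installs

set.seed(42)
n <- 2000

# SIMULATED users: two binary behaviours plus one irrelevant variable.
users <- data.frame(
  multi_session = rbinom(n, 1, 0.5),  # came back for more than one session?
  has_profile   = rbinom(n, 1, 0.5),  # entered basic profile information?
  noise         = rnorm(n)            # carries no information about retention
)

# ASSUMED retention probabilities: high only when both behaviours are present.
p_stay <- with(users, ifelse(multi_session == 1 & has_profile == 1, 0.9,
                      ifelse(multi_session == 1 | has_profile == 1, 0.2, 0.1)))
users$retained <- factor(ifelse(runif(n) < p_stay, "stay", "leave"))

# Fit a classification tree; rpart picks the split variables on its own.
fit <- rpart(retained ~ multi_session + has_profile + noise,
             data = users, method = "class")

# Variables that actually appear in the tree's splits.
used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
print(fit)
print(used)
```

The point of the sketch is that the tree, not the analyst, decides which variables earn a split. With a signal this strong the irrelevant column should be ignored, which is the same flavor of result as "just two data points are significantly predictive."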
So Facebook really wants new users to put more info into FB, use it more, and play with third party apps. I guess that logic is why LinkedIn is always telling me I am only 90% complete on my profile and I would be 95% if I would just, yada yada yada… The more info I put into their walled garden, the more I will play there. And the more ads I will see. Makes sense to me. I guess I follow the same model when I try to get my clients to use my services more and more… I want to be sticky too. But not in a bad way.
Review of ‘Applied Econometrics in R’ in JSS
Absolutely great resource
Learning to Sweave in APA Style
R/Finance conference in Chicago in April: Registration now open
See you in Chicago in April!
Sorry, you said you want a stats revolution?
ALL ABOUT REVOLUTION COMPUTING’S R DISTRIBUTION

Decision Science News was intrigued by a company called REvolution Computing that got some attention of late for spinning their own mix of the R language for statistical computing and giving it away for free. So DSN asked to interview them to see what it’s all about.
Decision Science News: So who are you guys and what is your scientific background?
REvolution: Well, at this point our team has grown and we have about 30 employees with diverse backgrounds, from bioinformatics to finance to core statistics and software engineering. However, when we got started with REvolution, we were a group that had tremendous experience with high performance computing and building production software. Our first application with R was something called ParallelR, which enables users of R to seamlessly benefit from optimized performance by automatically running on multiple cores, servers, and clusters (we even have a cloud-based deployment). Our team today is a combination of employees and an extended community, from R community participants, to package developers, to researchers and related consultants.
Decision Science News: How did you get started working with R professionally?
REvolution: Traditionally our customers came to us for parallel computing solutions based in languages like C, Fortran, or Java. More and more we started to see pull from our customers toward scripting languages, and R in particular. Some of our pharma partners in particular were compelled by the proposition of optimizing the performance of R, and many of our first references are related to those applications (gene expression, classification, etc. [case study]).
Decision Science News: What’s so great about your R compared to the regular download?
REvolution: Well, we’re not competing with the “regular” download – we actively collaborate with the core team, and utilize the codebase. What we have done spans several fronts. First, we have added capability and functionality related to optimization and high performance. Second, we are adding specific support for the 64-bit Windows platforms (and other more obscure OS distributions). Third, we are actively working on an IDE, large data handling, and other interesting capabilities (stay tuned!).
In addition to these aspects, we have packaged REvolution R into a commercially supported distribution around which we also provide training and consulting services. It’s a fully supported product in the same spirit as, say, RedHat Linux.
Decision Science News: Is there any risk of getting ‘locked in’ to your distribution of R? What if your distribution goes away, will our code still run on vanilla R?
REvolution: We prefer to say “mandatory customer loyalty” rather than lock-in. (KIDDING!) Of course, “open source” is a big part of “commercial open source,” and users of REvolution R can run their codebase on vanilla R.

Project Euler Problem #15
PDQ-R Lives!

This is an important step for PDQ development and is due entirely to the efforts of Phil Feller. Naturally, this capability will be included in the next PDQ release from SourceForge, which we are currently working towards. Stay tuned!
R graphics: margins are way too large
Now compare this plot to the version I prefer much more:
library(package="MASS")  # provides mvrnorm()
# Simulate 200 correlated bivariate-normal points
Sigma <- matrix(c(10, 10, 10, 20), nrow=2)
mu <- c(100, 100)
tmp <- mvrnorm(n=200, mu=mu, Sigma=Sigma, empirical=TRUE)
# First plot: default margins
plot(tmp, xlab="X variable (unit)", ylab="Y variable (unit)")
# Second plot: L-shaped box, tighter margins, axis labels closer to the axes
par(bty="l", pty="m", mar=c(3, 3, 1, 1), mgp=c(1.75, 0.75, 0))
plot(tmp, xlab="X variable (unit)", ylab="Y variable (unit)")


