PDQ 5.0 Test Suite or … How I Spent My Weekend

June 29, 2009 · Posted in R bloggers · Comments Off 
I was planning to blog about the amazing time I had at Velocity 2009 last week, when this landed in my mailbox (edited for space and privacy):

Subject: Seeking help with PDQ-R ...
Date: Thu, 25 Jun 2009 15:51:21 -0500

My name is James and I've been trying to learn to properly use PDQ after reading two of your books, "Guerrilla Capacity Planning" and "Analyzing Computer System Performance with Perl::PDQ." I'm still getting a grip on PDQ-R. ... I decided to set about of re-creating the queue circuit in the study with PDQ-R as an exercise. ...
The output of my code yields:
[1] "Manual response time for class 1 is 0.864179 seconds"
[1] "PDQ-R response time for class 1 is 0.313637 seconds"
[1] "Manual response time for class 2 is 6.105397 seconds"
[1] "PDQ-R response time for class 2 is 3.552873 seconds"
[1] "Manual response time for class 3 is 4.535833 seconds"
[1] "PDQ-R response time for class 3 is 4.535833 seconds"
If you could give my code a look over and give me some hints I would really appreciate it.

It turns out that James N. had discovered a bug (gasp!) in PDQ, which is why we have users. (jk) The above output refers to a simple model of a database system comprising 3 resources (call them: cpu, disk1 and disk2) and 3 transaction streams (work1, work2, work3) and no limit on the queue lengths, i.e., an open queueing network or circuit. Here's what my rendition looks like:

# PDQ-R model

library(pdq)

# Request rates of the 3 transaction streams into the DBMS
Xsys<-c(50/150, 80/150, 70/150)

# Service demands at each resource
Dcpu<-c(0.096, 0.615, 0.193)
Ddk1<-c(0.088, 0.683, 0.763)
Ddk2<-c(0.119, 0.795, 0.400)

# Start PDQ code with Init call
Init("James' DB Model");

# Define the 3 transaction workloads
workname<-1:3
for (w in 1:3) {
workname[w] <- sprintf("work%d", w)
CreateOpen(workname[w], Xsys[w])
}

# Define the 3 resources
CreateNode("cpu", CEN, FCFS)
CreateNode("dk1", CEN, FCFS)
CreateNode("dk2", CEN, FCFS)

for (w in 1:3) {
SetDemand("cpu", workname[w], Dcpu[w])
SetDemand("dk1", workname[w], Ddk1[w])
SetDemand("dk2", workname[w], Ddk2[w])
}

Solve(CANON)
Report()

To hunt down the problem, I rewrote the PDQ-R model in C, just in case there were any translation problems with SWIG-ing PDQ/lib into PDQ-R, Perl PDQ, PyDQ, etc.

/*
multiclass-open.c

Created by NJG on Thursday, June 25, 2009
Updated by NJG on Sunday, June 28, 2009
*/

#include < stdio.h >
#include < stdlib.h >
#include < string.h >
#include < math.h >
#include "PDQ_Lib.h"


int main(void) {

extern void exit();
extern char s1[];

char *p; // dummy pointer for names
char *devname[3];
char *workname[3];
int i, j, k, n, s, w;
double actualtR[4][3];

// Expected RT values
double expectR[4][3] = {
{0.174, 1.118, 0.351},
{0.351, 2.734, 3.054},
{0.340, 2.270, 1.142},
{0.865, 6.122, 4.546}
};

// Request rates of the 3 transaction streams into the DBMS
double Xsys[] = {50.0/150, 80.0/150, 70.0/150};

// Service demands
double Dcpu[] = {0.096, 0.615, 0.193};
double Ddk1[] = {0.088, 0.683, 0.763};
double Ddk2[] = {0.119, 0.795, 0.400};

// Name the workloads
for(w = 0; w < 3; w++) {
resets(s1);
sprintf(s1, "work%d", w+1);
if ( (p = (char *) malloc(strlen(s1) * sizeof(char)) ) != NULL) {
strcpy(p, s1); // copy into assigned storage
workname[w] = p;
}
else {
printf("malloc failed!\n");
exit(-1);
}
}
free(p);

// Name the resources
for(k = 0; k < 3; k++) {
resets(s1);
if (k == 0) sprintf(s1, "%s", "cpu");
if (k == 1) sprintf(s1, "%s", "dk1");
if (k == 2) sprintf(s1, "%s", "dk2");
if ( (p = (char *) malloc(strlen(s1) * sizeof(char)) ) != NULL) {
strcpy(p, s1); // copy into assigned storage
devname[k] = p;
}
else {
printf("malloc failed!\n");
exit(-1);
}
}
free(p);

/************************** Start PDQ code **********************/
PDQ_Init("Multiclass Test Model");

// Create workloads
for(w = 0; w < 3; w++) {
s = PDQ_CreateOpen(workname[w], Xsys[w]);
}

// Create resources
n = PDQ_CreateNode("cpu", CEN, FCFS);
n = PDQ_CreateNode("dk1", CEN, FCFS);
n = PDQ_CreateNode("dk2", CEN, FCFS);

// Assign demands
for(w = 0; w < 3; w++) {
PDQ_SetDemand("cpu", workname[w], Dcpu[w]);
PDQ_SetDemand("dk1", workname[w], Ddk1[w]);
PDQ_SetDemand("dk2", workname[w], Ddk2[w]);
}

PDQ_Solve(CANON);

printf("Expected Response Times\n");
for(i = 0; i < 4; i++) {
for(j = 0; j < 3; j++) {
printf("%4.3f\t", expectR[i][j]);
}
printf("\n");
}
printf("--------------------------\n");

printf("Actual Response Times\n");
for(i = 0; i < 4; i++) {
// System response times for QNM
if (i == 3) {
for(w = 0; w < 3; w++) {
printf("%4.3f\t",
actualtR[i][w] = PDQ_GetResponse(TRANS, workname[w]));
}
}

// Residence times per resource
if (i < 3) {
for(w = 0; w < 3; w++) {
printf("%4.3f\t",
actualtR[i][w] = PDQ_GetResidenceTime(devname[i], workname[w], TRANS));
}
}
printf("\n");
}

} // main

This code also compares actual (meaning, computed by PDQ) with expected values (embedded as 2-d array) of residence times due to each workload at each resource. The "expected" values can come from any one of a number of sources such as: measurements, other models, other tools, etc. This forms the basis of the test code approach.

The problem seen by James turns out to arise from a conflict between the way resource utilizations are computed for the new multi-server queues (released in PDQ 5.0.1) and multi-class workloads. When the PDQ lib is corrected, the agreement can be observed in this output:

[njg]~/PDQ/Test Suite/C-PDQ% ./mulclass-open
Expected Response Times
0.174 1.118 0.351
0.351 2.734 3.054
0.340 2.270 1.142
0.865 6.122 4.546 <--
--------------------------
Actual Response Times
0.175 1.118 0.351
0.352 2.728 3.048
0.340 2.274 1.144
0.866 6.120 4.543 <--

The last line in each of the above tables corresponds to the "manual" values that James was reporting in his email.

The PDQ test cases had not kept up with new code developments; one of the hazards of only having severely punctuated time to work on PDQ. The new release PDQ 5.0.2 should be available for download later this week. I'll send out an email notice at that time.

August Guerrilla Class: Using R for Performance Analysis

June 29, 2009 · Posted in R bloggers · Comments Off 
Registrations are still open for the Guerrilla Data Analysis Techniques (GDAT) class being held August 10-14, 2009. The focus will be on using R and the new release of PDQ-R for performance analysis and capacity planning.

All Guerrilla classes are held at our Larkspur Landing location in Pleasanton, California; a 45-minute BART ride to downtown San Francisco.


(Click on the image for details)

For those of you coming from international locations, here is a table of currency EXCHANGE rates. We look forward to seeing all of you in August!

Time series data

June 28, 2009 · Posted in R bloggers · Comments Off 
gdp <- read.table("c:/data/GDPC96.txt", header = TRUE)

attach(gdp)

as.Date(date)

plot(gdp~date, data=gdp,pch=16,xlab="",ylab="GDP (2000 dollars)")



RSI(2) Evaluation

June 28, 2009 · Posted in R bloggers · Comments Off 
Despite my best efforts, it's been a month since the last post of this series. The first post replicated this simple RSI(2) strategy from the MarketSci Blog using R. The second post showed how to replicate the strategy that scales in/out of RSI(2).

This post will use the PerformanceAnalytics package to evaluate the rules that scale in/out of positions. I've also provided a simple function that provides some summary statistics. There is a lot of code, so I put it at the end of the post.

Table 1 contains output from my simple trade summary function (the wins and losses are in percentages, i.e. 0.69 is 69 basis points). The short side of the rule traded more often and had a lower win rate. The short side overcame its lower win rate via much higher mean and median win/loss ratios.

Table 1: RSI(2) Trade Statistics - RSI steps = 5, Size steps = 0.25
Signal# Trades% WinMean WinMean LossMedian WinMedian LossMean W/LMedian W/L
-1.00133580.69-0.440.53-0.251.552.12
-0.75173490.62-0.390.37-0.251.601.48
-0.50143540.43-0.360.28-0.191.191.51
-0.25158560.21-0.190.14-0.131.151.08
0.0012620NaNNaNNANANaNNA
0.25117530.26-0.310.18-0.210.830.86
0.50137580.51-0.580.31-0.350.870.89
0.75143620.88-0.890.50-0.710.990.70
1.00119631.34-1.410.80-1.110.950.71

Table 2 shows the output from the PerformanceAnalytics table.Drawdowns() function. The largest percentage drawdown occurred in late 2008, but only lasted a few weeks.

The table also shows the system is prone to drawdowns that trough quickly and take months to recover from. A week of bad trades can take months to recover from.

Table 2: RSI(2) Drawdowns - RSI steps = 5, Size steps = 0.25
FromTroughToDepthLengthTo TroughRecovery
2008-10-062008-10-102008-10-28-0.15717512
2001-08-302001-09-212002-01-23-0.091961284
2002-07-192002-07-232002-08-20-0.08823320
2000-03-222000-04-142000-07-05-0.076731855
2009-02-172009-02-232009-04-27-0.07049544
2003-03-142003-03-212003-05-09-0.05540634
2000-10-092000-10-122000-12-06-0.05242438
2002-08-292002-09-242002-10-10-0.051301812
2008-01-022008-01-222008-03-11-0.045481434
2001-04-182001-06-182001-08-10-0.045814338

Table 3 shows the output from the PerformanceAnalytics table.DownsideRisk() function. The ratio of gain/loss deviation is encouraging. I have to defer to the PerformanceAnalytics documentation and vingettes to describe the rest of the table.

Table 3: RSI(2) Downside Risk - RSI steps = 5, Size steps = 0.25
StatisticReturn
Semi Deviation0.0050
Gain Deviation0.0094
Loss Deviation0.0076
Downside Deviation (MAR=10%)0.0099
Downside Deviation (rf=0%)0.0092
Downside Deviation (0%)0.0092
Maximum Drawdown-0.1572
VaR (99%)0.0160
Beyond VaR0.0160
Modified VaR (99%)0.0705

The chart below shows the output from the PerformanceAnalytics charts.PerformanceSummary() function. It shows the equity curves and drawdown from peak for the long and short sides of the strategy. The middle graph shows the *daily* returns for the combined strategy.


The code below has everything that created the results above. It also contains the same results for a modified RSI(2) strategy. The modified strategy uses RSI steps of 10 and sizing steps of 0.3 (i.e. RSI<10 -> size=1, 10<20 -> size=0.7, etc.).

# Attach packages. You can install packages via:
# install.packages(c("quantmod","TTR","PerformanceAnalytics"))
library(quantmod)
library(TTR)
library(PerformanceAnalytics)

# Pull S&P500 index data from Yahoo! Finance
getSymbols("^GSPC", from="2000-01-01")

# Calculate the RSI indicator
rsi <- RSI(Cl(GSPC), 2)

# Calculate Close-to-Close returns
ret <- ROC(Cl(GSPC))
ret[1] <- 0

# This function gives us some standard summary
# statistics for our trades.
tradeStats <- function(signals, returns) {
# Inputs:
# signals : trading signals
# returns : returns corresponding to signals

# Combine data and convert to data.frame
sysRet <- signals * returns * 100
posRet <- sysRet > 0 # Positive rule returns
negRet <- sysRet < 0 # Negative rule returns
dat <- cbind(signals,posRet*100,sysRet[posRet],sysRet[negRet],1)
dat <- as.data.frame(dat)

# Aggreate data for summary statistics
means <- aggregate(dat[,2:4], by=list(dat[,1]), mean, na.rm=TRUE)
medians <- aggregate(dat[,3:4], by=list(dat[,1]), median, na.rm=TRUE)
sums <- aggregate(dat[,5], by=list(dat[,1]), sum)

colnames(means) <- c("Signal","% Win","Mean Win","Mean Loss")
colnames(medians) <- c("Signal","Median Win","Median Loss")
colnames(sums) <- c("Signal","# Trades")

all <- merge(sums,means)
all <- merge(all,medians)

wl <- cbind( abs(all[,"Mean Win"]/all[,"Mean Loss"]),
abs(all[,"Median Win"]/all[,"Median Loss"]) )
colnames(wl) <- c("Mean W/L","Median W/L")

all <- cbind(all,wl)
return(all)
}

# This function determines position size and
# enables us to test several ideas with much
# greater speed and flexibility.
rsi2pos <- function(ind, indIncr=5, posIncr=0.25) {
# Inputs:
# ind : indicator vector
# indIncr : indicator value increments/breakpoints
# posIncr : position value increments/breakpoints

# Initialize result vector
size <- rep(0,NROW(ind))

# Long
size <- ifelse(ind < 4*indIncr, (1-posIncr*3), size)
size <- ifelse(ind < 3*indIncr, (1-posIncr*2), size)
size <- ifelse(ind < 2*indIncr, (1-posIncr*1), size)
size <- ifelse(ind < 1*indIncr, (1-posIncr*0), size)

# Short
size <- ifelse(ind > 100-4*indIncr, 3*posIncr-1, size)
size <- ifelse(ind > 100-3*indIncr, 2*posIncr-1, size)
size <- ifelse(ind > 100-2*indIncr, 1*posIncr-1, size)
size <- ifelse(ind > 100-1*indIncr, 0*posIncr-1, size)

# Today's position ('size') is based on today's
# indicator, but we need to apply today's position
# to the Close-to-Close return at tomorrow's close.
size <- lag(size)

# Replace missing signals with no position
# (generally just at beginning of series)
size[is.na(size)] <- 0

# Return results
return(size)
}

# Calculate signals with the 'rsi2pos()' function,
# using 5 as the RSI step: 5, 10, 15, 20, 80, 85, 90, 95
# and 0.25 as the size step: 0.25, 0.50, 0.75, 1.00
sig <- rsi2pos(rsi, 5, 0.25)

# Break out the long (up) and short (dn) signals
sigup <- ifelse(sig > 0, sig, 0)
sigdn <- ifelse(sig < 0, sig, 0)

# Calculate rule returns
ret_up <- ret * sigup
colnames(ret_up) <- 'Long System Return'
ret_dn <- ret * sigdn
colnames(ret_dn) <- 'Short System Return'
ret_all <- ret * sig
colnames(ret_all) <- 'Total System Return'

# Create performance graphs
png(filename="20090606_rsi2_performance.png", 720, 720)
charts.PerformanceSummary(cbind(ret_up,ret_dn),methods='none',
main='RSI(2) Performance - RSI steps = 5, Size steps = 0.25')
dev.off()

# Print trade statistics table
cat('\nRSI(2) Trade Statistics - RSI steps = 5, Size steps = 0.25\n')
print(tradeStats(sig,ret))

# Print drawdown table
cat('\nRSI(2) Drawdowns - RSI steps = 5, Size steps = 0.25\n')
print(table.Drawdowns(ret_all, top=10))

# Print downside risk table
cat('\nRSI(2) Downside Risk - RSI steps = 5, Size steps = 0.25\n')
print(table.DownsideRisk(ret_all))

# Calculate signals with the 'rsi2pos()' function
# using new RSI and size step values
sig <- rsi2pos(rsi, 10, 0.3)

# Break out the long (up) and short (dn) signals
sigup <- ifelse(sig > 0, sig, 0)
sigdn <- ifelse(sig < 0, sig, 0)

# Calculate rule returns
ret_up <- ret * sigup
colnames(ret_up) <- 'Long System Return'
ret_dn <- ret * sigdn
colnames(ret_dn) <- 'Short System Return'
ret_all <- ret * sig
colnames(ret_all) <- 'Total System Return'

# Calculate performance statistics
png(filename="20090606_rsi2_performance_updated.png", 720, 720)
charts.PerformanceSummary(cbind(ret_up,ret_dn),methods='none',
main='RSI(2) Performance - RSI steps = 10, Size steps = 0.30')
dev.off()

# Print trade statistics table
cat('\nRSI(2) Trade Statistics - RSI steps = 10, Size steps = 0.30\n')
print(tradeStats(sig,ret))

# Print drawdown table
cat('\nRSI(2) Drawdowns - RSI steps = 10, Size steps = 0.30\n')
print(table.Drawdowns(ret_all, top=10))

# Print downside risk table
cat('\nRSI(2) Downside Risk - RSI steps = 10, Size steps = 0.30\n')
print(table.DownsideRisk(ret_all))


Conservatism of Congressional delegation and %Bush vote

June 27, 2009 · Posted in R bloggers · Comments Off 
Busy day today, so I'll just post this:

plot(bush04 ~ cons_hr, type = "n",
xlab="Mean ACU rating",
ylab="2004 Bush vote",
xlim=c(0,100),
ylim=c(0,100),
cex.lab=1.25,cex.axis=0.75,
col.axis = "#777777",
col.lab = "#777777")
text(y=bush04,x=cons_hr, labels=stateid,cex=0.75)
abline(lm2, lty=2, col="red")
axis(side = 2, at = seq(0,100,by=5), cex=0.75)


R 2.9.1, CRANberries outage, and missing Java support

June 27, 2009 · Posted in R bloggers · Comments Off 
Just a short note that version 2.9.1 of R was released yesterday. And a corresponding Debian release went out as usual on the same day. One sour note: as the Java toolchain is currently broken, I had to disable compile-time support for Java. Just run R CMD javareconf once installed if you need it.

Speaking of broken, I had neither noticed that this R version now returns an additional field (for the repository) in the per-package metadata via available.packages(), nor that this change had broken my oh-so-useful and increasingly popular CRANberrries html and rss summaries of CRAN changes. So with the usual beta and rc releases or R 2.9.1 in Debian starting a week prior, CRANberries had been silent for six days from Friday the 21st to last Thursday. I rectified it once I noticed, and changed the code to no longer fall on its nose at that spot. Sorry for the few days without service.

Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box

June 26, 2009 · Posted in R bloggers · Comments Off 
Our article (by Yu-Sung, Jennifer, Masanao, and myself, and based also on work with Kobi, Grazia, and Peter Messeri) will be appearing in the Journal of Statistical Software, in a special issue on missing-data imputation. Here's the abstract: Our mi...

Filtering cases

June 26, 2009 · Posted in R bloggers · Comments Off 
Something that's very important to be able to do in data analysis and visualization is to filter out cases. Let's say you want to do identical analyses of two different groups, or of one group and then a subset of it. R can do this a little differently; instead of merely filtering out cases you can create an object that is a subset, and then call it when necessary.

Let's look at some data on the U.S. Congress. Keith Poole has developed a two-dimensional procedure that places members of Congress at specific points based on roll call votes. What we'll do now is compare Democrats and Republicans in the 110th Congress.

First, we load the data into R.

voteview <- read.csv ("C:/Data/HouseSmall.csv", header = TRUE) attach (voteview)

The voteview data frame contains data on all Congresses beginning with the 101st. That's more than we want to deal with, and also, we need a way to look at Democrats and Republicans separately. We'll create an object just for Democrats in the 110th Congress. and then one for Republicans.

dems110 <- subset(voteview, party == 100 & cong == 110)

reps110 <- subset(voteview, party == 200 & cong == 110)


Now let's create a graph to compare them.

plot (c (-1.5, 1.5), c(-1.5, 1.5), type = 'n',
xlab = "1st dimension",

ylab = "2nd dimension",

col.axis = "#777777",

col.lab = "#777777",

cex.axis = 0.75,

cex.lab = 1.25,

main = "DW-nominate scores, 110th Congress",

col.main = "#444444")

abline (v = 0, col = "#cccccc")

points (dwnom2 ~ dwnom1, data = dems110, pch = "D", col = "blue", cex = 0.75)

points (dwnom2 ~ dwnom1, data = reps110, pch = "R", col = "red", cex = 0.75)


That's all a bit complicated, so next time I'll talk about what all those things do. But for now I'll just show what it looks like.

Figure 1. The polarization of Congress.

Set the significant digits for each column in a xtable for fancy Sweave output

June 26, 2009 · Posted in R bloggers · Comments Off 
This tip may be useful in the situations when you need to set the number of digits to print for the different columns in a matrix/data.frame to be outputted as a LaTeX table.
 For example:

#install.packages("xtable")
#library(xtable)
tmp <- matrix(rnorm(9), 3, 3)
xtmp <- xtable(tmp)
digits(xtmp) <- c(0,0,3,4)
print(xtmp, include.rownames = FALSE) # row names suppressed


See here for a nice gallery depicting a large variety of outputs you can produce using the xtable package.

A bit about linear models

June 26, 2009 · Posted in R bloggers · Comments Off 
Before we delve into slightly more advanced plotting commands I want to talk a little about linear models, specifically, linear regression. In R this is very, very simple. For instance, in our 'states' data frame, we might want to look at median household income as a predictor of state education expenditures. The command lm calculates this for us. We'll call our first model, 'model1':

model1 <- lm (publicedexp~hincome)

OK, great, but where are our results? One of the things about R is that you can assign names to all sorts of things, even models. That way, you can continually refer to them when doing other things (as we'll see a bit later.) The way to look at our results is with this:

summary (model1)

Call:
lm(formula = publicedexp ~ hincome)

Residuals:
Min 1Q Median 3Q Max
-397.50 -127.43 -8.69 120.96 431.85

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.516e+02 1.735e+02 1.450 0.153
hincome 2.346e-02 3.869e-03 6.063 1.87e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 198.8 on 49 degrees of freedom
Multiple R-squared: 0.4287, Adjusted R-squared: 0.417
F-statistic: 36.76 on 1 and 49 DF, p-value: 1.87e-07


Which gives us a lot more information than if we'd just run the lm command without assigning a name to the model. Later we'll look at how we can integrate our linear model with our plots.

Next Page »