Social Media Analytics Research Toolkit (SMART@znmeb) Is Moving Into Private Beta
How ideological is Google?
Adam Bonica, a grad student in political science at NYU, recently published a ranking of the political slant of various professions, based on the amount and recipient (Republican or Democratic) of political donations by lawyers, lobbyists, physicians and many other occupations. This paper (PDF) gives the complete analysis, but the chart below (created using the ggplot2 graphics package in R) sums up the results nicely:
(I liked this quote from Paul Kedrosky about this chart: "How come gas station attendants are so damn partisan?") Now, Adam has taken the analysis to the next level, by looking at employees of individual companies, instead of professions as a whole.
(Unfortunately, it's not entirely clear if the data point lies to the left or at the middle of the company label; this table helps sort out the exact rankings.) It may not be a surprise, for example, that Google's employees tend to give to Democratic-leaning candidates, but does that influence Google's policies as a whole? Adam's article at the link below delves into this question in more detail.
Ideological Cartography: The University of Google: Was the decision to exit China ideological or business as usual?
Why isn’t my 2X Ultra ETF keeping pace with the market and what is path asymmetry (R ex)?

Fig 1. Example of ultra 2X ETFs and path asymmetry
Many people seem to find it incomprehensible (if not reprehensible) that an underlying series may move in a certain direction, yet the ultra short and ultra long series both finish below the underlying over the long run. What exactly is path asymmetry? Some traders might be familiar with the notion that if you lose some percentage of your account, say 50%, you need to gain more than 50% to make up for the loss. That is an example of path asymmetry (I should note someone also mentioned it's an example of Siegel's paradox).
Let's look at a very simple example of how this might affect a stock and its 2x counterparts. Suppose a stock moves from 100 dollars to 80 and back to 100 again: break-even. The move from 100 to 80, on a percentage basis, was a 20% loss. However, to recoup that amount, we need to solve for 80*(1+x)=100; the answer is 25%, not 20%. This means that even though the dollar amount is identical for both moves (20 dollars down and up), the percentage amount is not. That is an example of path asymmetry. How does this affect the 2X ultra leveraged ETFs? Well, since each ETF is designed to track twice the daily move of the underlying, the +2x ETF will move 40% down to 60, then 50% back up, for a net ending value of 90 dollars. The -2x ETF will move up 2x, or 40%, to 140, and then retrace 50%, leaving it at only 70 dollars. Notice that in both cases, the ultra ETF ends up below the underlying price. It is the simple mechanics of path dependency and asymmetry that account for this, even with perfect 2x leveraging. It is important to take path dependencies into account when dealing with any leveraged product, including hedging.
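The arithmetic above can be checked with a few lines of R. This is only a minimal sketch of an idealised product (no fees, perfect tracking); the helper name leveraged_path is made up for illustration and is not from any package.

```r
# Underlying price path from the example: 100 -> 80 -> 100
prices <- c(100, 80, 100)

# Per-period simple returns of the underlying: -20%, then +25%
period_ret <- diff(prices) / head(prices, -1)

# Hypothetical helper: value of a product that delivers `leverage` times
# each period's return, compounded from a starting value
leveraged_path <- function(returns, leverage, start = 100) {
  start * cumprod(1 + leverage * returns)
}

long2x  <- leveraged_path(period_ret,  2)   # +2x product
short2x <- leveraged_path(period_ret, -2)   # -2x product

print(rbind(underlying = prices[-1], long2x = long2x, short2x = short2x))
```

The underlying returns to 100, while the +2x path goes 60 then 90 and the -2x path goes 140 then 70, matching the numbers in the text.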
Now keep in mind that there is additional drag on these products due to fund expenses, which does add merit to the original question. More on this is explained succinctly in this article by Alpha's Tristan Yates and Lye Kok.
Predicting April month return
Bespoke blogged about average monthly returns of the DJI and emphasized April. Before jumping on that information, let’s check some weak points.
In that post, only average returns are presented. We need at least the extreme points (min and max) and confidence ranges. The second problem is that the market normally has an upward trend, and we need to get rid of that. To do so, we either have to subtract the rolling mean (the tough way) or use logarithmic prices (the easy way!).
Instead of the DJI, I used the S&P 500 from 1950 until now.
Sure, the average return in April is above 0, but based on historical data, a negative return is possible. I conducted a t-test where the null hypothesis was that the average return is equal to zero. I got a p-value of 0.0042, so the null hypothesis can be rejected (the return is above 0).
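For readers who want to see the shape of such a test without downloading anything, here is a minimal sketch in base R. The returns are synthetic (drawn from a normal distribution with made-up mean and spread), so the numbers it prints say nothing about the real S&P 500; only the mechanics carry over.

```r
set.seed(1)

# Synthetic stand-in for about 60 years of April returns.
# NOT real market data: mean and sd are arbitrary assumptions.
april_ret <- rnorm(60, mean = 0.01, sd = 0.04)

# Extreme points, as asked for in the text
print(range(april_ret))

# Two-sided test of H0: mean return equals zero, with a 95% confidence interval
tt <- t.test(april_ret, mu = 0)
print(tt$p.value)
print(tt$conf.int)

# One-sided variant, H1: mean return is greater than zero
tt_greater <- t.test(april_ret, mu = 0, alternative = "greater")
print(tt_greater$p.value)
```

The one-sided version is arguably the closer match to the claim "return is above 0"; the two-sided test only says the mean differs from zero.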
The graph below shows the cumulative return of an investment held only in April. Keep in mind that this is on a log scale, and the real return would be higher.

Conclusion: based on this data, expect a positive return in April.
R code
require(xts)
require(quantmod)
require(ggplot2)

# Download S&P 500 data from 1950 onward
getSymbols('^GSPC', from = '1950-01-01')

# Monthly relative change of the log closing price
ret <- Delt(Cl(to.monthly(log(GSPC))))
ret[1] <- 0

# Month number for each observation
temp <- as.double(format(index(ret), '%m'))
temp <- data.frame(as.double(ret), as.numeric(temp))

# Box plot of returns by calendar month
qplot(factor(as.numeric(temp[,2])), as.double(ret), data = temp, geom = "boxplot", ylab = 'Returns', xlab = 'Months')

# t-test of April returns against zero, then cumulative April-only return
april <- ret[which(format(index(ret), '%m') == '04')]
t.test(april)
plot(cumprod(april + 1), main = 'Cumulative return, 1950-present')
Lotka-Volterra model ~ intro
Many know about the Lotka-Volterra model (i.e. the predator-prey model) in ecology. This model portrays two species, the predator (y) and the prey (x), interacting with each other in a limited space.
The prey grows at a linear rate ($\alpha$) and gets eaten by the predator at a rate ($\beta$). The predator gains a certain amount of vitality by eating the prey at a rate ($\delta$), while dying off at another rate ($\gamma$).
Given this base, we can ask questions like, what parameterizations can we expect to find a coexistence between the fox and the hare (for example)?
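Written out with the parameter names used in the R code below (alpha, beta, gamma, delta), the model and its coexistence point look as follows. This is a sketch read off from that code, for the classic model without any extensions.

```latex
\begin{aligned}
\frac{dx}{dt} &= x\,(\alpha - \beta y), \\
\frac{dy}{dt} &= -y\,(\gamma - \delta x).
\end{aligned}
\qquad
\text{Setting both derivatives to zero with } x, y > 0:
\quad
x^{*} = \frac{\gamma}{\delta}, \qquad y^{*} = \frac{\alpha}{\beta}.
```

For any positive parameters this interior equilibrium exists, and trajectories starting from positive populations cycle around it rather than settling on it.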
Let's choose some values for the model: $\alpha = 2$, $\beta = 0.5$, $\gamma = 0.2$, $\delta = 0.6$. These values assume a weaker growth of the rabbits relative to the strength of the death of foxes. Below, I simulated these values in R.
And we get coexistence, they live happily forever after. With this simple model, we can play around by generalizing (logistic growth of prey, etc.). I will put up some posts doing so.
The way to do this in R is as follows (just use the deSolve package, which will supersede the odesolve package):
library(deSolve)

# Right-hand side of the Lotka-Volterra system: x = prey, y = predator
LotVmod <- function (Time, State, Pars) {
    with(as.list(c(State, Pars)), {
        dx = x*(alpha - beta*y)
        dy = -y*(gamma - delta*x)
        return(list(c(dx, dy)))
    })
}

Pars <- c(alpha = 2, beta = .5, gamma = .2, delta = .6)
State <- c(x = 10, y = 10)
Time <- seq(0, 100, by = 1)

# Integrate the system and plot both populations over time
out <- as.data.frame(ode(func = LotVmod, y = State, parms = Pars, times = Time))
matplot(out[,-1], type = "l", xlab = "time", ylab = "population")
legend("topright", c("Cute bunnies", "Rabid foxes"), lty = c(1,2), col = c(1,2), box.lwd = 0)

Some Code for Dumping Data from Twitter Gardenhose
Gardenhose is a Streaming API feed that continuously sends a sample (roughly 15% according to Ryan Sarver at the 140tc in September 2009) of all tweets to feed recipients. This is some code for dumping the tweets to files named by date and hour. It is in PHP, which is not my favorite language, but it works nonetheless. I received a few requests to post it, so here it is.
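<?php
//gardenhosedump.php
$username = '';
$password = '';
$newTime = null; // hour stamp of the dump file currently open
$file2 = null;   // handle of the current dump file
while(true) {
    $file = @fopen("http://" . $username . ":" . $password . "@stream.twitter.com/1/statuses/sample.json","r");
    if ($file === false) {
        sleep(5); // connection failed: wait briefly, then reconnect
        continue;
    }
    while($data = fgets($file))
    {
        $time = @date("YmdH");
        if ($newTime != $time)
        {
            // hour changed: switch to a new dump file
            if (is_resource($file2)) {
                fclose($file2);
            }
            $file2 = fopen("{$time}.txt","a");
        }
        fputs($file2,$data);
        $newTime = $time;
    }
    // stream ended: close the files, but only if they are open
    if (is_resource($file)) {
        fclose($file);
    }
    if (is_resource($file2)) {
        fclose($file2);
    }
    $newTime = null; // force the dump file to be reopened after reconnecting
}
?>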
TTR_0.20-2 on CRAN
TTR version 0.20-2
Changes from version 0.20-1
NEW FEATURES:
- Added VWAP and VWMA (thanks to Brian Peterson)
- Added v-factor generalization to DEMA (thanks to John Gavin)
CHANGES:
- Updated volatility() to handle univariate case of calc='close' (thanks to Cedrick Johnson)
- Moved EMA, SAR, and wilderSum from .Fortran to .Call and used xts:::naCheck in lieu of TTR's NA check mechanism
- RSI up/down momentum now faster with xts (thanks to Jeff Ryan)
- If 'ratio' is specified in EMA but 'n' is missing, the traditional value of 'n' is approximated and returned as the first non-NA value (thanks to Jeff Ryan)
BUG FIXES:
- Fix to stoch() when maType is a list and 'n' is not set in the list's 3rd element (thanks to Wind Me)
- Fixed fastK in stoch() when smooth != 1
- Fixed segfault caused by EMA when n < NROW(x) (thanks to Douglas Hobbs)
- test.EMA.wilder failed under R-devel (thanks to Prof Brian Ripley)
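To illustrate the 'ratio' versus 'n' item in the list above, here is a small base-R sketch of an exponential moving average under the usual convention ratio = 2/(n+1). The function ema_sketch is hypothetical and is not TTR's implementation; in particular, the rounding used to recover 'n' from 'ratio' and the seeding with a simple mean are assumptions.

```r
# Hypothetical helper, NOT the TTR::EMA source
ema_sketch <- function(x, n = NULL, ratio = NULL) {
  if (is.null(n) && is.null(ratio)) stop("supply 'n' or 'ratio'")
  if (is.null(ratio)) ratio <- 2 / (n + 1)        # traditional smoothing ratio
  if (is.null(n)) n <- round(2 / ratio - 1)       # assumed approximation of n
  stopifnot(n >= 1, n <= length(x))

  out <- rep(NA_real_, length(x))
  out[n] <- mean(x[1:n])                          # first non-NA value: simple mean
  if (length(x) > n) {
    for (i in (n + 1):length(x)) {
      out[i] <- ratio * x[i] + (1 - ratio) * out[i - 1]
    }
  }
  out
}

e <- ema_sketch(1:10, ratio = 0.5)   # n is approximated as 3
print(e)
```

With ratio = 0.5 the approximated window is 3, so the first two values are NA and the average starts at the third observation.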
Scientists misusing Statistics
In ScienceNews this month, there's a controversial article exposing the fact that results claimed to be "statistically significant" in scientific articles aren't always what they're cracked up to be. The article -- titled "Odds Are, It's Wrong" -- is interesting, but I take a bit of an issue with the sub-headline, "Science fails to face the shortcomings of Statistics". As it happens, the examples in the article are mostly cases of scientists behaving badly and abusing statistical techniques and results:
- Authors abusing P-values to conflate statistical significance with practical significance. For example, a drug may uncritically be described as "significantly" reducing the risk of some outcome, but the actual scale of the statistically significant difference is so small that it has no real clinical implication.
- Not accounting for multiple-comparison bias. By definition, a test run at the 5% significance level (often described as "significant at the 95% level") has a 5% chance of flagging a difference by random chance alone when no real effect exists. Do enough tests, and you'll find some that indeed indicate significant differences -- but there will be some fluke events in that batch. There are so many studies, experiments and tests being done today (oftentimes, all in the same paper) that the "false discovery rate" may be higher than we think -- especially given that most nonsignificant results go unreported.
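The multiple-comparisons point can be made concrete in a few lines of base R. This is a sketch under the simplifying assumption that the tests are independent and every null hypothesis is true; the five p-values are made up for illustration.

```r
# Probability of at least one false positive among m independent tests
# at significance level alpha, when no real effect exists anywhere
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m
print(round(fwer(c(1, 5, 20, 100)), 3))

# Hypothetical p-values from five tests reported in one paper
p <- c(0.001, 0.01, 0.03, 0.04, 0.2)

# Two standard corrections from the stats package
p_bonf <- p.adjust(p, method = "bonferroni")  # controls the family-wise error rate
p_bh   <- p.adjust(p, method = "BH")          # controls the false discovery rate

print(data.frame(raw = p, bonferroni = p_bonf, BH = p_bh))
```

With 20 such tests the chance of at least one fluke is already about 64%, which is why an unadjusted "significant" result in a paper full of comparisons deserves some suspicion.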
Statisticians, in general, are aware of these problems and have offered solutions: there's a vast field of literature on multiple comparisons tests, reporting bias, and alternatives (such as Bayesian methods) to P-value tests. But more often than not, these "arcane" issues (which are actually part of any statistical training) go ignored in scientific journals. You don't need to be a cynic to understand the motives of the authors for doing so -- hey, a publication is a publication, right? -- but the cooperation of the peer reviewers and editorial boards is disturbing.
ScienceNews: Odds Are, It's Wrong
Example 7.30: Simulate censored survival data
Smoothing time series with R
Smoothing is a statistical technique that helps you to spot trends in noisy data, and especially to compare trends between two or more fluctuating time series. It's a useful visualization tool that I'm pleased to see cropping up more and more in statistical graphics on the Web -- it's now a staple in econometric charts and is heavily used in polling analysis. For example, here's smoothing used to combine data from various polls over time on Obama's job approval (from pollster.com).
The S language was, to the best of my knowledge, the first software that made statistical smoothing a core part of the graphics system: first with the lowess function and later with other more powerful alternatives. These days in R (S's successor), loess (local polynomial regression fitting) is the usual go-to alternative for smoothing. With just a couple of lines of code, you can take a noisy time series in R and overlay a smooth trend line to guide the eye. Nathan Yau at FlowingData shows us how to take data like this:
and, with just a few lines of R code and some touching-up in Illustrator, create a chart like this:
FlowingData: How to: make a scatterplot with a smooth fitted line
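As a rough idea of what those few lines can look like, here is a minimal base-R sketch. It uses a synthetic noisy series rather than the data from the FlowingData tutorial, and the span values are arbitrary choices.

```r
set.seed(42)

# Synthetic noisy series (illustrative only)
n <- 200
day <- seq_len(n)
y <- sin(day / 30) + rnorm(n, sd = 0.4)

# loess: local polynomial regression; `span` controls the smoothness
fit <- loess(y ~ day, span = 0.3)
smooth_loess <- predict(fit)

# lowess: the older smoother; returns a list with components x and y
smooth_lowess <- lowess(day, y, f = 0.3)

# Raw points in grey, with both trend lines overlaid
plot(day, y, col = "grey60", pch = 16, xlab = "time", ylab = "value")
lines(day, smooth_loess, lwd = 2)
lines(smooth_lowess, lwd = 2, lty = 2)
legend("topright", c("loess", "lowess"), lty = c(1, 2), lwd = 2)
```

A smaller span follows the data more closely, and a larger one gives a flatter trend, so it is worth trying a few values before settling on one.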


