Why [Not] Simulate?

November 27, 2012

(This article was first published on Daniel Marcelino » R, and kindly contributed to R-bloggers)

Since we are in the “big data” era, in which a massive amount of data is made available daily by governments, institutions, and ordinary people like you and me, the importance of simulating data for “frequentists” like myself seems to fade. After all, once we have access to “real data”, why bother with non-real data, right? Simulating data, however, is not only about generating numbers to cover the absence of real information. Quite the opposite: simulation is a powerful tool that helps us understand different, but probabilistic, scenarios in the world out there.

Indeed, simulations have long been part of the practice of researchers in all kinds of science, from mathematics, engineering, and economics to the social sciences. When we look at the use of simulations in the social sciences, the typical application is to support arguments developed prior to the simulation and advanced by the authors. In that sense, those who use simulations in social science are mostly oriented toward complementing a theory rather than proposing a new one based exclusively on the simulations. The literature is replete with examples of methods and theories that use simulations to complement the authors’ theoretical results (see, for example, Zaller (1992) The Nature and Origins of Mass Opinion, Bartels (1996), Althaus (2003) Collective Preferences in Democratic Politics, and Blais and Indridason (2007)).[1]

Despite this “trend” in using simulations in political science, there are many other reasons to try simulations even before collecting real data in the field, or before running an experiment in the lab. What is more, simulations can be used as a mechanism for uncovering relationships that are too difficult or complex to disentangle with the data available at the time.

Although there are virtually no limits to simulations, as a social scientist I believe we should keep in mind that the products of simulations, agents and behavior, must mimic their counterparts in the real world. I know; many colleagues may disagree with me on this point, because once you control the environment of the simulation, nearly any result is possible. If we cannot see a particular behaviour in real life, it may be due to a specific constraint that we are free to relax in simulation exercises.

I conclude, therefore, that there are compelling reasons for doing simulations in political science, as there are in other fields. In doing so, we are not abandoning “real-world problems”; rather, by simulating a plethora of situations, we can better understand the reality around us. As a small example of how powerful simulation is, I will simulate the effect of sample size on opinion polls in the last local elections in Brazil.

The first chart presents candidate support ratings for the biggest city in Brazil, São Paulo, measured by three important pollsters in the country: DataFolha, Ibope, and Vox Populi. I drew a few dashed lines corresponding to: (1) the time when free TV advertising started (black vertical dashed line), (2) the true population mean support for the candidate José Serra (horizontal blue dashed line), (3) the true population mean support for Fernando Haddad (horizontal red dashed line), and (4) the true population mean support for the third candidate, Celso Russomano (horizontal green dashed line). The bold lines, as expected, show the ratings of the first three candidates as reported by the pollsters. Other candidates, undecided, blank, and null vote categories were simply left off the plot, though their proportions are counted in the analysis.

At first sight, the following graph presents a complex situation for pollsters trying to measure the “true” support proportions among São Paulo electors. After all, the differences in vote support among candidates are not large and are quite dynamic. Additionally, any difference almost disappeared in the very last period of the campaign. Finally, pollsters were dealing with an electoral campaign with 12 candidates, not just two or three.
It is interesting to ask why pollsters were so loose in the first round of the election, but quite precise in the second round with the two leading candidates. The following simulations will help us understand how the sample-size effect, plus the fractionalization of vote support, may have misled the measurement of the vote distribution in São Paulo city. To keep things easy, I drew dashed lines once again, now corresponding to the true population mean of voting intention for each candidate. Essentially, this is the electoral outcome. To keep this a simple problem, I will test three different sample sizes. The sample sizes I am going to simulate actually mirror the sample sizes used by pollsters in the country.

Here is the code I am using to simulate the samples. It is a quick implementation of a sampling example from Wild and Seber’s Chance Encounters.
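The original code chunk did not survive the repost; the following is a minimal sketch of the kind of simulation described, drawing repeated multinomial samples of voters at each sample size. The candidate names and support shares are illustrative assumptions, not the actual first-round figures.

```r
# Sketch of the sampling simulation: repeatedly "poll" n voters from a
# population with fixed support shares and record the estimated share.
# The shares below are illustrative stand-ins, not the real outcome.
set.seed(123)

true_share <- c(Haddad = 0.29, Serra = 0.31, Russomano = 0.22, Others = 0.18)
sample_sizes <- c(1000, 1200, 3000)
n_polls <- 5000  # number of simulated polls per sample size

# For each sample size, draw n_polls multinomial samples and keep the
# estimated share for the first category (Haddad) in each simulated poll.
sims <- lapply(sample_sizes, function(n) {
  draws <- rmultinom(n_polls, size = n, prob = true_share)
  draws[1, ] / n
})
names(sims) <- paste0("n", sample_sizes)

# Larger samples concentrate the estimates around the true share;
# histograms of each element of `sims` reproduce the bar charts discussed.
sapply(sims, sd)
```

A histogram of each vector in `sims` (for example, `hist(sims$n1200)`) gives the probability bars discussed below.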

What emerges from the simulations? First, the bars represent probabilities, so the higher a bar, the higher the likelihood of picking that value. Let’s look, then, for a pattern in the distribution of the bars for the candidate who ended up winning the election on the second ballot: Fernando Haddad. It is fairly clear that pollsters sampling on average 1,200 voters had somewhat less chance of pinpointing the proportion of support for this candidate. A pollster could, however, get closer to the true value by sampling around 3,000 voters. Sampling 1,000 voters does not look so bad either: even though the interval is wider, the probabilities for each value seem approximately normally distributed.
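To put rough numbers on this, the standard normal-approximation margin of error for a single proportion shows how much precision the larger sample buys. The 30% support level used here is an illustrative assumption, close to the leading candidates’ shares.

```r
# Approximate 95% margin of error for a proportion near 30%
# at the three sample sizes discussed above.
p <- 0.30
n <- c(1000, 1200, 3000)
moe <- 1.96 * sqrt(p * (1 - p) / n)
round(100 * moe, 1)  # margins in percentage points: 2.8 2.6 1.6
```

With candidates separated by only a couple of percentage points, a margin of error near 2.8 points at n = 1,000 easily blurs the ordering, while n = 3,000 roughly cuts that uncertainty in half.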

Cited References

[1] Althaus, S. L. 2003. Collective Preferences in Democratic Politics: Opinion Surveys and the Will of the People. Cambridge University Press.

Bartels, L. M. 1996. “Uninformed Votes: Information Effects in Presidential Elections.” American Journal of Political Science 40(1): 194–230.

Blais, A., and I. H. Indridason. 2007. “Making Candidates Count: The Logic of Electoral Alliances in Two-Round Legislative Elections.” Journal of Politics 69(1): 193–205.

Zaller, John. 1992. The Nature and Origins of Mass Opinion. Cambridge University Press.

Wild, C. J., and George Seber. 2000. Chance Encounters: A First Course in Data Analysis and Inference. Wiley.
