[This article was first published on **Back Side Smack » R Stuff**, and kindly contributed to R-bloggers].

The intersection of mapping APIs, fast database operations and user engagement offers a lot of very cool crowdsourcing applications, ranging from the benign and powerful (Google’s Person Finder) to the minor and questionable (a DUI checkpoints app). Most intriguing in this mix of applications and websites are the unexpected ones. Clay Shirky spends a bit of time talking about the power of many-to-many communication in fomenting and fostering rapid social action, but even a characterization as broad as that one understates the true range of potential collaboration.

As economists we like to talk about communication in information-theoretic terms (if we talk about it at all). The primacy of local information is at the heart of the neoclassical worldview, but in the internet age we have been more focused on information and communication as functions of search costs. Diamond, not Hayek, as it were. Part of this willingness to ignore crowdsourced data stems from a deep epistemological distrust of user-generated data. Economists like to work with revealed preferences through pricing mechanisms; we aren’t too interested in how much someone claims to value a given good versus what they are willing to pay for that good. But in poorly performing markets, or for goods without real markets, we may only have survey data to draw from.

A canonical poorly performing market would be one characterized by repugnant transactions. And we can imagine no larger repugnant market than that for illicit drugs. Enter priceofweed.com, a site where you can report the price paid for various quantities of marijuana across the world (really mostly the US and Canada). I can’t improve much on the Google Maps applet they use to show average prices per state, but can we draw any conclusions about the market for marijuana from the data on the site? Fortunately for us, their URL and XML structure are very consistent. Each state has a page in a subdirectory of the country, and each page includes two tables: the average price paid for various self-reported qualities of weed, and the individual reports. They report trimmed means (I don’t know their exact methodology, check the blog for more information) for high quality, medium quality and low quality pot, along with the sample sizes for each level of quality. In order to extract this information we can use the **XML** package for R to pull in data from each state and concatenate them into a data frame.
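A minimal sketch of that kind of scraper follows. It is not a documented interface to the site: the URL pattern, the assumption that the summary table comes first on the page, the column positions, and the helper names (`clean_price`, `state_url`, `parse_summary`, `fetch_state`, `scrape_states`) are all illustrative assumptions. `readHTMLTable()` comes from the **XML** package.

```r
# Sketch only: URL pattern, table position and column layout are assumptions.

# Turn a price string such as "$350.25" into a number; non-numeric text becomes NA.
clean_price <- function(x) {
  suppressWarnings(as.numeric(gsub("[$,]", "", trimws(x))))
}

# Build the (assumed) URL of one state's page.
state_url <- function(state,
                      base = "http://www.priceofweed.com/prices/United-States/") {
  paste0(base, gsub(" ", "-", state), ".html")
}

# Reduce a summary table to c(three prices, three sample sizes).
# Assumes rows ordered high, medium, low; column 2 = price, column 3 = sample size.
# A missing row (for example no low quality reports) is padded with NA.
parse_summary <- function(tab) {
  c(clean_price(tab[[2]])[1:3],
    suppressWarnings(as.numeric(tab[[3]]))[1:3])
}

# Fetch and parse one state's page (needs the XML package and a network connection).
fetch_state <- function(state) {
  tab <- XML::readHTMLTable(state_url(state), which = 1,
                            stringsAsFactors = FALSE)
  parse_summary(tab)
}

# Loop over states, filling a pre-allocated matrix and pausing between requests.
scrape_states <- function(states = state.name, pause = 3) {
  out <- matrix(NA_real_, nrow = length(states), ncol = 6,
                dimnames = list(states, c("High", "Medium", "Low",
                                          "HighN", "MediumN", "LowN")))
  for (s in states) {
    out[s, ] <- fetch_state(s)
    Sys.sleep(pause)  # wait a few seconds so the server is not hammered
  }
  as.data.frame(out)
}
```

Under these assumptions, `weed.prices.df <- scrape_states()` would give one row per state with prices in columns 1 to 3 and sample sizes in columns 4 to 6.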

The above script is pretty rough. I don’t perform any consistency checking for missing values; Alaska does in fact report no low quality weed, what a paradise! I also perform a minimal amount of pre-processing. Each table is read into R as a list of character vectors. I strip the dollar sign from the price, convert the price and sample size to numeric vectors and enter them into a pre-allocated matrix. The `Sys.sleep()` function is a clumsy attempt to be a good net-citizen. I don’t want to hammer the server as quickly as my requests are answered, so I can set the value to a few seconds between requests. What do we get as a result?

Alarm bells should be ringing for the social scientists in the room. It sure looks like the bulk of the data comes from people reporting to have purchased “high quality” weed. In fact, if we poke through the reported sample sizes (with something like `apply(weed.prices.df[,4:6],2,sum)`) we find that 10 times as many people reported buying high quality weed as low quality weed, and twice as many reported buying high quality as medium quality. Perhaps the fundamental skepticism from economists is warranted. But this is a solvable problem. If we assume that users aren’t actually truthfully reporting high and medium quality as distinct from each other and are drastically under-reporting low quality purchases (after all, who wants to tell even a computer that they bought crappy weed?) then we can get a decent aggregated sample of the prices paid in each state. So we just compute an average of the medium and high quality prices, weighted by the number of reports for each.
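As a sketch of that step, assuming prices sit in columns named `High`, `Medium`, `Low` and sample sizes in `HighN`, `MediumN`, `LowN` (hypothetical names), and using two made-up rows purely so the example runs:

```r
# Toy stand-in for the scraped data; the numbers are invented for illustration.
weed.prices.df <- data.frame(
  High = c(400, 300), Medium = c(300, 200), Low = c(150, NA),
  HighN = c(30, 10), MediumN = c(10, 10), LowN = c(2, 0),
  row.names = c("StateA", "StateB")
)

# Total number of reports per quality level (same idea as apply(..., 2, sum)).
report.totals <- colSums(weed.prices.df[, 4:6])

# Sample-size-weighted average of the high and medium quality prices per state.
weighted <- data.frame(
  Average = with(weed.prices.df,
                 (High * HighN + Medium * MediumN) / (HighN + MediumN)),
  Reports = with(weed.prices.df, HighN + MediumN),
  row.names = rownames(weed.prices.df)
)
```

Low quality reports are left out entirely, so a state with no low quality price (like the `NA` above) causes no trouble. Something like `plot(density(weighted$Average))` on the real data would then show the distribution of state averages.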

A slightly bimodal distribution, but probably a much clearer picture of actual prices out there. The bulk of our distribution is between 300 and 400 dollars an ounce, with peaks at ~$300 and ~$380. If you look at the priceofweed.com map you can see that a great deal of this variation is regional. Prices for marijuana in the Northeast and the rest of the Atlantic seaboard are high, while prices in the Pacific Northwest are very low. Like any drug market, the price differences are probably a supply story. Laws on cultivation and possession of marijuana are less severe and less uniformly enforced on the West Coast than in New England or the South. In a market characterized by many sellers and many buyers with (relatively) homogeneous goods, stricter enforcement raises the cost of supply and drives the price up pretty neatly, especially if you have federal trafficking laws serving as a constraint on arbitrage.

Knowing this, we can look at the distribution of prices within regions. Here I am just using the R constants for regions (`state.region`).
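`state.region` is a factor that ships with base R and lines up with `state.name`, so attaching a region to each state is a lookup by name. A minimal sketch, assuming the state names are the row names of the data frame (`add_region` is a hypothetical helper):

```r
# Attach the built-in census region to a data frame whose row names are state names.
# Rows that are not in state.name (for example "District of Columbia") get NA.
add_region <- function(df) {
  df$Region <- state.region[match(rownames(df), state.name)]
  df
}

# Two placeholder rows, just to show the lookup; the prices are invented.
example <- add_region(data.frame(Average = c(250, 380),
                                 row.names = c("Oregon", "Vermont")))

# Number of states in each region.
region.counts <- table(state.region)
```

With the region attached, one option for the per-region densities is `lattice::densityplot(~ Average | Region, data = weighted)`.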

The picture here is slightly different than we might have imagined. The West clusters around 300 with a slight bump at 400, but the South has a large cluster of weighted prices around 300 as well. I suspect this is because the South sees much more reporting of medium quality pot than the West, and the commensurate prices are lower (you wouldn’t see this in the priceofweed Google maplet because it shows the high quality average by default). But the story for the Northeast and the North Central (what we would probably call the Midwest) is pretty consistent.

Seeing as this is nominally an economics blog, we should look at what demographic characteristics of states may contribute to the price of marijuana. I took median income and population density as two possible explanatory variables of interest. Income because I would imagine that marijuana is something of a normal good; as people are on average wealthier they may tend to buy more pot. We can argue about the validity of this assumption (turns out median income doesn’t have much effect), but it seemed to be a decent starting point. Population density was chosen as a concession to the possibility that networks of friends and dealers may play into pricing. More dense states might have more people in contact with each other, allowing better price discovery for buyers. Adding them all together, we can estimate a linear regression on this cross section of states with `lm(Average ~ Income + log(Density) + Region , data=weighted)`.

A note on data. Flat csv files for population density and income are hard to come by. Most of the data out there is formatted for print or included in a larger dataset for more complex analysis. So I have taken the liberty of grabbing median income and population density from the BLS and the Census and throwing them up on Google Docs. Population density and income are available to anyone who wants them.

    Call:
    lm(formula = Average ~ Income + log(Density) + Region, data = weighted)

    Residuals:
        Min      1Q  Median      3Q     Max
    -72.304 -25.766   0.198  20.145  78.331

    Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
    (Intercept)          3.108e+02  4.837e+01   6.426 7.97e-08 ***
    Income               1.898e-03  8.849e-04   2.145  0.03751 *
    log(Density)        -9.551e+00  5.348e+00  -1.786  0.08102 .
    RegionSouth         -2.978e+01  1.784e+01  -1.669  0.10220
    RegionNorth Central -1.884e+01  1.832e+01  -1.028  0.30950
    RegionWest          -6.533e+01  2.017e+01  -3.239  0.00229 **

The intercept here captures both the model constant and (because we have a factor in there) the Northeast region, which serves as the baseline for the other region coefficients. We see that income has a slightly positive effect on average price, and log of density (I chose the log of density in order to combat the wide dispersion in population density among states) has a slightly negative effect on price. The most statistically significant effect on price comes from the Western states. Not surprising, given how much less people in the Pacific Northwest pay for weed compared to the rest of the country.
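The baseline can be checked directly: with R's default treatment contrasts the first level of a factor is folded into the intercept, and `state.region` lists the Northeast first. A short sketch (the variable names are only for illustration):

```r
# Region levels in the order R stores them; the first one is the baseline.
Region <- state.region
region.levels <- levels(Region)
# "Northeast" "South" "North Central" "West"

# Columns of the design matrix: there is no Northeast dummy, it lives in the intercept.
design.cols <- colnames(model.matrix(~ Region))
# "(Intercept)" "RegionSouth" "RegionNorth Central" "RegionWest"

# To compare against a different baseline, relevel the factor before fitting.
Region.west <- relevel(Region, ref = "West")
```

Refitting with a releveled factor changes which region the coefficients are measured against, not the fitted values.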

But we have more information about the state level data. For instance we know the sample sizes of each from the number of reports on medium and high quality weed. We can, if we want, weight the regression by the sample size as a *very* rough proxy for sample variance. Like I say in the comments to the code, don’t try this at home. Sample size **is not** the strict inverse of variance. We can imagine that a state like California which reported over 2000 prices (a large chunk of our overall sample) might have a high within-state variance for the reported prices. But as a guesstimate we can try it out. What do we get with a weighted least squares estimate?

    Call:
    lm(formula = Average ~ Income + log(Density) + Region, data = weighted,
        weights = sqrt(Reports))

    Residuals:
        Min      1Q  Median      3Q     Max
    -327.18  -76.50   -5.21   59.29  284.07

    Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
    (Intercept)         308.647208  47.366794   6.516 5.87e-08 ***
    Income                0.002037   0.000872   2.337   0.0241 *
    log(Density)         -9.188842   5.631828  -1.632   0.1099
    RegionSouth         -35.813824  15.861839  -2.258   0.0290 *
    RegionNorth Central -31.321173  16.470353  -1.902   0.0638 .
    RegionWest          -80.162453  18.414679  -4.353 7.88e-05 ***

Some immediate changes jump right out. Our region estimates become more significant and our density estimate is no longer significant even at the 10% level. This is largely an artifact of our very poor choice of weights (the regression was `lm(Average ~ Income + log(Density) + Region , data=weighted, weights=sqrt(Reports))`). California was assigned an enormous weight in the regression due to its large sample size, so any small effect will be magnified. However, we should be happy that the results do not change dramatically between WLS and OLS. A better method to weight the data would be to compute the actual variance of reported prices, as priceofweed.com has anonymized data for each state. You could crawl their site (or, more appropriately, email the owners and ask) for individual reports and compute the estimated variance directly. But I wanted a quick and dirty estimate.

Code to run the estimates and graph the results is below, as always:
