analyze the survey of business owners (sbo) with r

[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers].

each year ending in a two or a seven, census bureau employees stow away their eponymous decennial population count to focus on the economic census.  and when charged with gauging the health of the american economy, who better to survey than business owners?  you read me right.  every five years, the united states government sends this questionnaire to almost every tax-filing sole proprietorship, partnership, and corporation in the nation.  after data collection, they merge on firm-level payroll, employment counts, and revenue statistics from the irs.  and then, and then, and then, they publish the de-identified microdata.  want to know how many minority-owned firms started doing business in the past eight years?  or how many husband and wife teams operate a bed-and-breakfast nationwide?  or which state has the highest rate of exporting firms?  you’re in the right place.

before you dig into the public use microdata sample (pums), see if you can find the numbers you’re looking for on american factfinder and be done with it.  here’s why:  the census bureau has strict rules about respondent confidentiality, so they cannot disclose anything that makes it easy to isolate specific individuals or corporations.  larger firms would be a cinch to pick out of microdata; by just limiting your data set to `state` equals “maryland” and `industry` equals “health care and social assistance” then sorting by the number of employees, you’d find the johns hopkins medical institutions in a complete dataset pretty fast.  so the pums tosses out a bunch of larger companies (mostly publicly-owned) that they call “not classifiable.”  to understand the damage, compare the first two rows of this table: the pums contains records representing 97% of all firms, but since larger businesses have been disproportionately tossed, firms included in the pums employ only 48% of all americans in the labor force and represent only 40% of all payroll and 36% of all commercial revenue nationwide.  one more caveat: instead of one-record-per-firm, the pums has one-record-per-firm-per-state-per-industry, so think of it as something between establishment- and firm-level.  and no, you won’t be able to aggregate those establishments (storefronts) up to the firm-level.  lucky for you, the weights do sum up to all classifiable, non-publicly-held firms.  maybe think of this file as a survey of smallish business owners.  no more bad news.
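
to make that disclosure worry concrete, here’s a toy sketch in base r.  every firm, column name, and number below is invented for illustration; none of it comes from the sbo or from any real company.

```r
# toy, entirely made-up "complete" dataset: these are not real sbo records
firms <-
	data.frame(
		firm_id = 1:6 ,
		state = c( "maryland" , "maryland" , "maryland" , "virginia" , "maryland" , "ohio" ) ,
		industry = c(
			"health care and social assistance" ,
			"health care and social assistance" ,
			"retail trade" ,
			"health care and social assistance" ,
			"health care and social assistance" ,
			"retail trade"
		) ,
		employees = c( 12 , 40000 , 7 , 300 , 85 , 4 ) ,
		stringsAsFactors = FALSE
	)

# limit the data set to one state and one industry
md_health <-
	subset(
		firms ,
		state == "maryland" & industry == "health care and social assistance"
	)

# sort by the number of employees, largest first
md_health <- md_health[ order( -md_health$employees ) , ]

# the top row is the one giant employer, trivially easy to pick out
md_health[ 1 , ]
```

two filters and a sort are all it takes to isolate the outlier, which is the sort of record the pums leaves out.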

there’s just one technical document for the sbo pums; read pdf pages one through five.  please.  after that, they start describing variance calculations that i’ve gone out of my way to automate for you.  they recommend this never-before-seen hybrid complex sample design that just uses basic weights to calculate means, medians and totals, then a weirdly-detached multiple imputation procedure for the standard errors.  hakuna matata, i’ve written custom functions so you can focus on your research instead of translating their ancient greek.  this new github repository contains three scripts:

download and import.R

2007 single-year – analysis examples.R
  • connect to that sqlite database you’ve previously initiated
  • copy the main table `y` over to a table `x` and a bunch of mini-tables `x1` through `x10` that it’s cool if you screw around with, since you can always delete and re-create them from your pristine `y` table later
  • set up the hybrid complex sample survey design and class it as something special
  • generate the usual rigamarole of examples using code familiar to other multiply-imputed survey data analysis

recode and replicate.R
  • connect to that sqlite database you’ve previously initiated
  • copy over that main table `y` to `x` so you can make your mistakes on `x`
  • implement the same variable constructions seen in this census bureau-provided sas code to determine which firms are at least fifty percent minority-owned, with sql.  sql, sql, everywhere
  • splinter your recoded table `x` into ten miniature `x1` through `x10` tables, then construct those same two complex sample design objects
  • run just one svyby statement inside a multiple imputation combine function, and immediately generate every little statistic and standard error found in this census bureau-provided tabulation
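
the copy, recode, and splinter steps above might look something like this sketch.  it assumes the DBI and RSQLite packages, uses an in-memory database as a stand-in for the real one, and invents every column name (`grp` for the subsample, `pct1`/`pct2` for percent owned, `minority1`/`minority2` for 0/1 flags).  the actual scripts and the census bureau’s sas code use their own variables and rules, so treat this as an illustration only.

```r
library(DBI)

# in-memory stand-in for the sqlite database built by the download script
db <- dbConnect( RSQLite::SQLite() , ":memory:" )

# fake pristine table `y`.  every column name here is hypothetical
dbWriteTable(
	db ,
	"y" ,
	data.frame(
		id = 1:20 ,
		grp = rep( 1:10 , 2 ) ,
		pct1 = rep( c( 100 , 50 , 60 , 30 ) , 5 ) ,
		minority1 = rep( c( 1 , 0 , 0 , 1 ) , 5 ) ,
		pct2 = rep( c( 0 , 50 , 40 , 70 ) , 5 ) ,
		minority2 = rep( c( 0 , 1 , 1 , 0 ) , 5 )
	)
)

# copy `y` to a scratch table `x`, leaving `y` untouched
dbExecute( db , "CREATE TABLE x AS SELECT * FROM y" )

# recode on `x` with plain sql: flag firms whose
# minority owners hold at least fifty percent
dbExecute( db , "ALTER TABLE x ADD COLUMN minority_owned INTEGER" )
dbExecute(
	db ,
	"UPDATE x SET minority_owned =
		CASE WHEN ( pct1 * minority1 + pct2 * minority2 ) >= 50 THEN 1 ELSE 0 END"
)

# splinter the recoded `x` into ten mini-tables, `x1` through `x10`
for ( i in 1:10 ){
	dbExecute( db , paste0( "CREATE TABLE x" , i , " AS SELECT * FROM x WHERE grp = " , i ) )
}

# peek at the results before disconnecting
table_names <- dbListTables( db )
recoded <- dbGetQuery( db , "SELECT id , minority_owned FROM x ORDER BY id" )
rows_in_x1 <- dbGetQuery( db , "SELECT COUNT(*) AS n FROM x1" )$n

dbDisconnect( db )
```

if a recode goes sideways, drop `x` and re-create it from `y`; the pristine table never gets touched.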

click here to view these three scripts
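
back to that hybrid design for a second.  here’s a rough sketch of the two-step idea in base r: the point estimate comes from the full file and its basic weights, while the standard error comes from how much the same statistic wobbles across ten subsamples.  the simulated data, the column names `wgt`, `receipts`, and `grp`, and the textbook random-groups variance are all assumptions made for illustration.  the technical document specifies its own calculation, and the custom functions in the repository are what actually implement it.

```r
set.seed( 2007 )

# simulated records: none of this is sbo data
n <- 1000
x <-
	data.frame(
		wgt = runif( n , 1 , 20 ) ,                    # hypothetical sampling weight
		receipts = rlnorm( n , meanlog = 11 ) ,        # hypothetical analysis variable
		grp = sample( rep( 1:10 , length.out = n ) )   # hypothetical subsample id
	)

# step one: point estimate from the full file with the basic weights
point_estimate <- weighted.mean( x$receipts , x$wgt )

# step two: re-compute the same statistic within each of the ten subsamples
group_estimates <-
	sapply(
		split( x , x$grp ) ,
		function( g ) weighted.mean( g$receipts , g$wgt )
	)

# textbook random-groups variance of the full-sample estimate:
# the variance of the k group estimates, divided by k
k <- length( group_estimates )
se <- sqrt( sum( ( group_estimates - mean( group_estimates ) )^2 ) / ( k * ( k - 1 ) ) )

c( estimate = point_estimate , se = se )
```

the point is only that the estimate and its standard error come from two different passes over the data, which is why the design needs both the full table and the ten mini-tables.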

for more detail about the survey of business owners, visit:


in addition to the pums (2007-only) and the 2002 and 2007 american factfinder data, the census bureau provides a battery of tables and reports back as far as 1992 that might have the statistic you seek.  if you need something not shown, you could always open up your wallet and buy a custom data table.

confidential to sas, spss, stata, and sudaan users: your statistical language is merely an illusion, albeit a very persistent one.  time to transition to r. 😀
