HRSA Area Resource File Format 2009

February 23, 2011

(This article was first published on BioStatMatt » R, and kindly contributed to R-bloggers)

From the HRSA website:

[The ARF 2009] is a database containing more than 6,000 variables for each of the nation’s counties. ARF contains information on health facilities, health professions, measures of resource scarcity, health status, economic activity, health training programs, and socioeconomic and environmental characteristics.

The data file itself is formatted as follows (from the ARF FAQ):

Q: What are the file specifications?
A: The ARF is an ASCII file with a fixed record format and a record length of 31959 (for the 2009-2010 release). The current release of the ARF has 3225 records (one for each county and independent city in the U.S. as well as one for each county equivalent in the following U.S. territories: Guam, Puerto Rico and the Virgin Islands). There are approximately 6000 variables for each county. The file size is approximately 100MB. Programming software, such as SAS, SPSS or COBOL is needed to extract data from the file, unless using the MS Access version.
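The record count, record length, and file size quoted above are mutually consistent, which gives a quick integrity check for a downloaded file. A minimal sketch, assuming each record is followed by a fixed-length line terminator (the FAQ does not say how records are terminated); `expected_file_size` is a hypothetical helper name:

```c
#include <assert.h>

/* Expected size in bytes of a fixed-record file in which every record
   is followed by a line terminator of term_len bytes (0 if none). */
unsigned long expected_file_size(unsigned long records,
                                 unsigned long record_len,
                                 unsigned long term_len) {
    assert(record_len > 0);
    return records * (record_len + term_len);
}
```

For 3225 records of 31959 bytes, this gives 103,067,775 bytes without terminators and 103,074,225 bytes with two-byte terminators, both in line with the "approximately 100MB" quoted in the FAQ.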

The file is difficult to work with directly because of its size and format. Because the data are stored as text, numbers are stored inefficiently (after conversion and compression, the equivalent R data file is 10% of the original size). In cases like this, the saving grace is usually human readability. But although the file is ASCII, or rather extended ASCII (I found an accent in San Sebastián, PR), it’s not human-readable, because the 6256 fields aren’t delimited and differ in width. Hence, it’s nearly impossible to visually track where fields begin and end. The data are distributed with a SAS macro that reads them into a SAS dataset.
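Because the fields are not delimited, the position of each field has to be derived from the widths of all the fields before it. A minimal sketch in C of that bookkeeping; the helper names `field_offsets` and `copy_field` are hypothetical and are not part of the ARF distribution:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill off[i] with the zero-based start of field i, given the widths.
   Returns the total record length implied by the widths. */
size_t field_offsets(const unsigned short *len, size_t ncol, size_t *off) {
    size_t pos = 0, i;
    assert(len != NULL && off != NULL);
    for (i = 0; i < ncol; i++) {
        off[i] = pos;
        pos += len[i];
    }
    return pos;
}

/* Copy field i of a fixed-width record into dst as a NUL-terminated
   string. dst must have room for len[i] + 1 bytes. */
void copy_field(const char *rec, const size_t *off,
                const unsigned short *len, size_t i, char *dst) {
    assert(rec != NULL && dst != NULL);
    memcpy(dst, rec + off[i], len[i]);
    dst[len[i]] = '\0';
}
```

If the widths cover the whole record, the value returned by `field_offsets` for the real file should equal the record length of 31959; a mismatch would suggest that the width table is incomplete.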

In order to read the data file in R, we must extract the field width information from the meta data file, which is distributed as a Microsoft Excel spreadsheet. Unfortunately, the meta data read more like a book than a spreadsheet, making export difficult. In the end, I built a sed chain to extract the important bits, including the field offsets, lengths, labels, and types. The extracted meta data are available here: arf2009meta.csv. These meta data may be used to read the ARF data into R using the utils::read.fwf function, which is unbearably slow on my computer.

Below is an R function that uses the extracted meta data to convert the data to CSV format. Below that is a small stand-alone C program that converts the ARF 2009 data file to CSV much faster. The C code below doesn’t include the definitions for len and lab because they are too long to put into a post. Get the complete program here: asc2csv.c

If you would like consulting support with an HRSA Area Resource File, contact me.

# metafile - the meta data file at http://biostatmatt.com/uploads/arf2009meta.csv
# ascfile  - the original ARF 2009 ASCII data file
# csvfile  - the file where CSV formatted data is written
 
arf2009.convert <- function(metafile, ascfile, csvfile) {
    cat("reading ARF 2009 meta information...")
    meta <- read.csv(metafile, header=TRUE,
        colClasses=c("numeric","numeric","character","character"))
    # first and last character position of each field (1-based, inclusive)
    meta$start  <- as.integer(meta$offset)
    meta$stop   <- as.integer(meta$offset + meta$length - 1)

    cat("done\nloading ARF 2009 data...")
    lines <- readLines(ascfile)
    # transliterate the few non-ASCII characters to plain ASCII
    lines <- iconv(lines, to="ASCII//TRANSLIT", sub="byte")

    cat("done\nconverting ARF 2009 data...")
    csvcon  <- file(csvfile, "w")
    # header row of quoted field names; sep="" avoids a trailing space
    cat(paste("\"", meta$field, "\"", sep="", collapse=","), "\n",
        sep="", file=csvcon)
    asc2csv <- function(dat, meta, con) {
        # substring() is vectorized over the start and stop positions
        splits <- substring(dat, meta$start, meta$stop)
        cat(paste("\"", splits, "\"", sep="", collapse=","), "\n",
            sep="", file=con)
        cat(".")
    }
    invisible(lapply(lines, asc2csv, meta=meta, con=csvcon))
    close(csvcon)
    cat("done\n")
}
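
/* compile: gcc -o asc2csv asc2csv.c */
/* run:     ./asc2csv arf2009.asc > arf2009.csv */

#include <stdio.h>
#include <stdlib.h>

#define NCOL 6256   /* fields per record */
#define RLEN 31959  /* record length in bytes */

/* Field widths and labels. The initializers are omitted here because
   they are too long for a post; see the complete asc2csv.c. */
static unsigned short len[NCOL];
static char * lab[NCOL];

int main(int argc, char** argv) {
    FILE * fd;
    char buf[RLEN+2], *ptr;   /* +2 for the bytes that end each line */
    unsigned int col;
    size_t num;

    /* diagnostics go to stderr because stdout carries the CSV */
    if(argc < 2) {
        fprintf(stderr, "no file specified\n");
        exit(1);
    }

    fd = fopen(argv[1], "rb");
    if(!fd) {
        fprintf(stderr, "file open failed\n");
        exit(1);
    }

    /* header row of quoted field labels */
    for(col = 0; col < NCOL; col++) {
        if(col < NCOL - 1) {
            printf("\"%s\",", lab[col]);
        } else {
            printf("\"%s\"\n", lab[col]);
        }
    }

    /* Read one record at a time. Looping on feof() would attempt one
       extra read after the last record and report a spurious failure. */
    while((num = fread(buf, 1, RLEN+2, fd)) == RLEN+2) {
        for(ptr = buf, col = 0; col < NCOL; col++) {
            printf("\"");
            fwrite(ptr, 1, len[col], stdout);
            ptr += len[col];
            if(col < NCOL - 1) {
                printf("\",");
            } else {
                printf("\"\n");
            }
        }
    }

    /* a short final read or an I/O error means a damaged file */
    if(num != 0 || ferror(fd)) {
        fprintf(stderr, "file read failed\n");
        exit(1);
    }
    fclose(fd);
    return 0;
}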
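
One caveat: the C program above writes the raw bytes of each field between double quotes. If any field were to contain a double quote (not verified for the ARF data), the resulting CSV would be malformed. A minimal sketch of the usual remedy, which doubles embedded quotes; `csv_quote` is a hypothetical helper and is not part of asc2csv.c:

```c
#include <assert.h>
#include <stddef.h>

/* Write src[0..n) into dst as a quoted CSV field, doubling any embedded
   double quotes. dst must have room for 2*n + 3 bytes. Returns the
   number of bytes written, excluding the terminating NUL. */
size_t csv_quote(const char *src, size_t n, char *dst) {
    size_t i, k = 0;
    assert(src != NULL && dst != NULL);
    dst[k++] = '"';
    for (i = 0; i < n; i++) {
        if (src[i] == '"')
            dst[k++] = '"';   /* escape a quote by doubling it */
        dst[k++] = src[i];
    }
    dst[k++] = '"';
    dst[k] = '\0';
    return k;
}
```

In the conversion loop, the quoted bytes would be written with fwrite in place of the raw field, at the cost of one extra copy per field.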
