fwf without the faff

[This article was first published on HighlandR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Fixed width files? What are those?

Well, imagine your entire dataframe/ tibble/ data.table is in a single column.

That’s it, over 200 columns, in one. Each row is a very long string.

Sounds fun right?

Ideally, there would be delimiters ( you know, a few friendly commas between columns) to help you figure out where each new column starts.

Or you could be really unlucky, and you might not have any of those.

Maybe you’ll have column headers somewhere, so you can try and eyeball the data and figure out what’s what?

Yep, this could be a really bad day where everything is squished into one column, in one long string, and no column headers. What can you do now?

Well hopefully, you have a dictionary somewhere, that tells you the column names, and either the start and end positions of each, or the length of each substring.

If you have that, you have hope.

readr to the rescue

What you need is {readr}, specifically, “read_fwf()“`.

For ease, I read in an Excel file as my lookup, and used the widths column from that as my widths vector, passed to “fwf_widths

read_fwf(file_to_read, fwf_widths = lookup$widths)

You could also try passing start and end positions to the “fwf_positions“ argument.

The rest of my task invoved {rio} for importing from SPSS, {readxl}, and data.table for joining on dates within ranges. What I wanted to do was ensure each row had a column where its eventual destination was stored. Then, my plan was to loop through the 30 or so unique destinations, filter out the relevant data, and save it to disk.

However, I found a way of using data.table to ‘loop’ for me –

DT[,fwrite(.SD, destination_col), by = destination_col, .SDcols = cols_to_keep]

Brilliantly succinct – these weren’t the actual column names, but you get the idea – I didn’t need to save the destination_col in the final output, so I’d already defined the columns I needed to keep as a seperate vector (“cols_to_keep“).

I’m not pro or anti any set of tools in R. I use what I need to get the job done.

I’d originally started off with some dplyr , but realised that data.table was going to make the final iterating and file output so much easier, so switched to that when I got to the joining stage.

Ultimately, the people I’m doing the work for don’t care. They just see a fairly large amount of text files with the relevant data for each date range, and that’s all they need to know.

Getting things done is the main priority as that’s what keeps people coming back to you 🙂

To leave a comment for the author, please follow the link and comment on their blog: HighlandR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)