Dealing with a Byte Order Mark (BOM)

March 11, 2015
(This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers)

I have just been trying to import some data into R. The data were exported from a SQL Server client in tab-separated value (TSV) format. However, reading the data into R the “usual” way produced unexpected results:

> data <- read.delim("sample-query.tsv", header = FALSE, stringsAsFactors = FALSE)
> head(data)
                                   V1    V2
1 ï»¿7E51B3EC4263438B22811BE78391A823  2129
2    8617E5E557903C7FAF011FBE2DFCED1D  3518
3    1E8B37DFB143BEEEE052516D2F3B58F5  6018
4    60B8AA536CFD26C5B5CF5BA6D7B7893C  7811
5    5A3BA8589DCD62B31948DC2715CA3ED9 12850
6    3552BF8AF58A58C794A43D4AA21F4FBA 13284

Those weird characters in the first record… where did they come from? They don’t show up in a text editor, so they’re not easy to edit out.

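Googling ensued and revealed that those weird characters were in fact the byte order mark (BOM), a special character (U+FEFF) at the start of a file which indicates the endianness of its encoding. In a UTF-8 file it is stored as the three bytes EF BB BF and serves only as an encoding signature, since UTF-8 has no byte-order ambiguity. This was quickly confirmed using Cygwin. (Yes, shamefully, I am working under Windows at the moment!)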

[Screenshot: Cygwin terminal session confirming the BOM at the start of the file]
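
A similar check can be sketched from within R by reading the first three bytes of the file and comparing them with EF BB BF, the UTF-8 encoding of the BOM. This is a minimal sketch and not the command used in Cygwin: `has_utf8_bom` is a hypothetical helper name, and the temporary file is a stand-in for the real export.

```r
# Hypothetical helper (not part of base R): returns TRUE if the file starts
# with the UTF-8 byte order mark, i.e. the three bytes EF BB BF.
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Stand-in for the real export: a temporary TSV file written with a BOM.
with_bom <- tempfile(fileext = ".tsv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("7E51B3EC\t2129\n")), with_bom)

# The same content written without a BOM, for comparison.
without_bom <- tempfile(fileext = ".tsv")
writeBin(charToRaw("7E51B3EC\t2129\n"), without_bom)

has_utf8_bom(with_bom)     # TRUE
has_utf8_bom(without_bom)  # FALSE
```

Calling `has_utf8_bom("sample-query.tsv")` on the actual export would be the equivalent test for the file discussed here.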

The solution is remarkably simple: just specify the correct character encoding.

> data <- read.delim("sample-query.tsv", header = FALSE, stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
> head(data)
                                V1    V2
1 7E51B3EC4263438B22811BE78391A823  2129
2 8617E5E557903C7FAF011FBE2DFCED1D  3518
3 1E8B37DFB143BEEEE052516D2F3B58F5  6018
4 60B8AA536CFD26C5B5CF5BA6D7B7893C  7811
5 5A3BA8589DCD62B31948DC2715CA3ED9 12850
6 3552BF8AF58A58C794A43D4AA21F4FBA 13284

Problem solved.
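
For anyone who wants to try this without a SQL Server export to hand, the fix can be reproduced on a small temporary file. This is only a sketch under stated assumptions: the file is built by hand from the first two records shown above, and the variable names (`tmp`, `bom`, `fixed`) are illustrative.

```r
# Build a small TSV file that starts with a UTF-8 BOM (bytes EF BB BF).
# The two records are copied from the output above; the file itself is a
# stand-in for the real export.
tmp <- tempfile(fileext = ".tsv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
records <- paste0(
  "7E51B3EC4263438B22811BE78391A823\t2129\n",
  "8617E5E557903C7FAF011FBE2DFCED1D\t3518\n"
)
writeBin(c(bom, charToRaw(records)), tmp)

# Reading with fileEncoding = "UTF-8-BOM" drops the BOM before parsing.
fixed <- read.delim(tmp, header = FALSE, stringsAsFactors = FALSE,
                    fileEncoding = "UTF-8-BOM")

# The first key is exactly the 32 characters written above, with nothing
# prepended.
nchar(fixed$V1[1])  # 32
```

Reading the same file without the `fileEncoding` argument should show the stray leading characters again, although how they are displayed depends on the locale.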
