Pitfall of the XML package: finding the cause
This is the sequel to the previous report, "issues specific to cp932 locale, Japanese Shift-JIS, on Windows". In this report, I will dig deeper into the issues to find out exactly what is happening.
1. Where it occurs
I knew the issues depend on the node texts, because the very same script ran fine on Windows when parsing another table in the same html source.
# Windows
src <- 'http://www.taiki.pref.ibaraki.jp/data.asp'
t2 <- iconv(as.character(
  readHTMLTable(src, which=4, trim=T, header=F, skip.rows=2:48, encoding='shift-jis')[1,1]
), from='utf-8', to='shift-jis')
> t2   # bad
[1] NA
s2 <- iconv(as.character(
  readHTMLTable(src, which=6, trim=T, header=F, skip.rows=1, encoding='shift-jis')[2,2]
), from='utf-8', to='shift-jis')
> s2   # good
[1] "北茨城中郷"
Knowing the difference between the two html parts means knowing where the issue occurs. Let's look at the html source with primitive functions, instead of using the XML package.
con <- url(src, encoding='shift-jis')
x <- readLines(con)
close(con)
I know two useful keywords to locate the positions of t2 and s2 above: "2016" and "101", respectively.
# for t2
> grep('2016', x)
[1] 120 133 141 148 160 161
> x[119:121]
[1] "\t\t\t<td class=\"title\">"
[2] "\t\t\t\t最新の観測情報&nbsp;&nbsp;（2016年1月17日&nbsp;&nbsp;8時）"
[3] "\t\t\t</td>"
# for s2
> grep('101', x)
[1] 181
> x[181:182]
[1] "\t\t\t\t\t\t<td>101</td>"
[2] "\t\t\t\t\t\t<td>北茨城中郷</td>"
Note that only x[182] is for s2; x[181] was just used to find the position. Apparently, the differences between t2 and s2 are:
- t2 includes html entities (&nbsp;).
- t2 spreads over multiple lines.
Because I want to know the exact content of a single node (html element), the three lines of t2 must be joined together.
paste(x[119:121], collapse='\r\n')
Pasting with the newline code \r\n may be the answer, but a more exact procedure is better.
Binary functions are elegant tools that can handle an html source as binary data, exactly as it is served by the web server, regardless of client platform and locale.
con <- url(src, open='rb')
skip <- readBin(con, what='raw', n=5009)
xt2  <- readBin(con, what='raw', n=92)
skip <- readBin(con, what='raw', n=3013)
xs2  <- readBin(con, what='raw', n=31)
close(con)
This time I cheated: the byte positions of interest were computed from the prior result x.
# t2 begins after
> sum(nchar(x[1:118], type='bytes') + 2) + nchar(sub('<.*$', '', x[119]), type='bytes')
[1] 5009
# t2 length
> sum(nchar(x[119:121], type='bytes') + 2) - 2 - nchar(sub('<.*$', '', x[119]), type='bytes')
[1] 92
# s2 begins after
> sum(nchar(x[1:181], type='bytes') + 2) + nchar(sub('<.*$', '', x[182]), type='bytes')
[1] 8114
# s2 length
> nchar(x[182], type='bytes') - nchar(sub('<.*$', '', x[182]), type='bytes')
[1] 31
# s2 from the end of t2
> 8114 - 5009 - 92
[1] 3013
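The "+2" terms above deserve a word: readLines recognises the "\r\n" pair as a line ending and strips it, so every line it returns is 2 bytes shorter than on the wire. A small base-R illustration with toy data (the two-line sample is mine, not from the real page):

```r
# readLines strips the trailing "\r\n" of each line...
con <- rawConnection(charToRaw('ab\r\ncde\r\n'))
readLines(con)   # "ab"  "cde"
close(con)
# ...so a byte offset is reconstructed by adding
# nchar(line, type='bytes') + 2 for every preceding line.
sum(nchar(c('ab', 'cde'), type = 'bytes') + 2)   # 9, the original byte count
```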
Variables xt2 and xs2 now hold what I want as binary (raw) vectors.
# Windows
> rawToChar(xt2)
[1] "<td class=\"title\">\r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;（2016年1月17日&nbsp;&nbsp;8時）\r\n\t\t\t</td>"
> rawToChar(xs2)
[1] "<td>北茨城中郷</td>"
Compare the inside texts of these nodes.

t2: \r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;（2016年1月17日&nbsp;&nbsp;8時）\r\n\t\t\t
s2: 北茨城中郷
Maybe a text including control codes (\r, \n, \t) and/or html entities (&nbsp;) is unsafe, and a text made up of printable Kanji characters only is safe. So the issues must occur at these special characters.
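This guess can be checked in base R alone, with no html parsing involved (the sample strings below are mine, chosen to mirror the two nodes):

```r
# Plain Kanji, like the s2 text, converts to shift-jis without trouble...
iconv('北茨城中郷', from = 'utf-8', to = 'shift-jis')
# ...while the no-break space that an &nbsp; entity turns into does not.
iconv('\u00a0', from = 'utf-8', to = 'shift-jis')   # NA
sprintf('U+%04X', utf8ToInt('\u00a0'))              # "U+00A0"
```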
Before digging more, I want to introduce two nice binary functions that can be used without cheating about byte positions. Function getBinaryURL in package RCurl can get the whole binary data from a web page. Function grepRaw can locate the positions of a specific string in a binary vector.
library(RCurl)
x <- getBinaryURL(src)
> str(x)
 raw [1:35470] 0d 0a 3c 68 ...
> grepRaw('2016', x, all=FALSE)
[1] 5062
> x[5000:5100]
  [1] 09 3c 74 72 3e 0d 0a 09 09 09 3c 74 64 20 63 6c 61 73 73 3d 22 74 69 74
 [25] 6c 65 22 3e 0d 0a 09 09 09 09 8d c5 90 56 82 cc 8a cf 91 aa 8f ee 95 f1
 [49] 26 6e 62 73 70 3b 26 6e 62 73 70 3b 81 69 32 30 31 36 94 4e 31 8c 8e 31
 [73] 37 93 fa 26 6e 62 73 70 3b 26 6e 62 73 70 3b 38 8e 9e 81 6a 0d 0a 09 09
 [97] 09 3c 2f 74 64
> rawToChar(x[5000:5100])
[1] "\t<tr>\r\n\t\t\t<td class=\"title\">\r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;（2016年1月17日&nbsp;&nbsp;8時）\r\n\t\t\t</td"

2. What is the cause
Let's check out what happens when an html has an html entity (&nbsp;) or space characters (\r, \n, \t). I'm going to use a minimal html to compare the responses of package XML on Mac and Windows.

library(XML)
> xmlValue(xmlRoot(htmlParse('<html>ABC</html>', asText=T)))
[1] "ABC"

2-1. No-Break Space (U+00A0, &nbsp;)
# Mac
> xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;</html>', asText=T)))
[1] " "  # good
> iconv(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;</html>', asText=T))), from='utf-8', to='shift-jis')
[1] NA  # bad
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;</html>', asText=T))))
[1] c2 a0  # good
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;あ</html>', asText=T, encoding='utf-8'))))
[1] c2 a0 e3 81 82  # good
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;\xe3\x81\x82</html>', asText=T, encoding='utf-8'))))
[1] c2 a0 e3 81 82  # good
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;\x82\xa0</html>', asText=T, encoding='shift-jis'))))
[1] c2 a0 e3 81 82  # good

# Windows
> xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;</html>', asText=T)))
[1] "ツ\xa0"  # nonsense; putting utf-8 characters on a shift-jis terminal
> iconv(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;</html>', asText=T))), from='utf-8', to='shift-jis')
[1] NA  # bad
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;</html>', asText=T))))
[1] c2 a0  # good
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;あ</html>', asText=T, encoding='shift-jis'))))
[1] c2 a0 e3 81 82  # good
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;\xe3\x81\x82</html>', asText=T, encoding='utf-8'))))
[1] c2 a0 e3 81 82  # good
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>&nbsp;\x82\xa0</html>', asText=T, encoding='shift-jis'))))
[1] c2 a0 e3 81 82  # good

As shown above, function xmlValue always returns a utf-8 string, and the result is exactly the same on both Mac and Windows, regardless of the difference of locales. An &nbsp; is converted to a u+00a0 (\xc2\xa0 in utf-8). An error occurs when iconv converts utf-8 characters into shift-jis, on both Mac and Windows. So, this is not an issue of xmlValue.

The issue can be simplified into an issue of iconv.
# Mac and Windows
> iconv('\u00a0', from='utf-8', to='shift-jis', sub='byte')
[1] "<c2><a0>"  # bad

As shown above, function iconv fails to convert u+00a0 into shift-jis. Because Mac people usually do not convert characters into shift-jis, the issue is specific to Windows.
Perhaps I found the background of the cause. According to the list of JIS X 0213 non-Kanji at Wikipedia, the No-Break Space was not defined in JIS X 0208 and was added in JIS X 0213 in the year 2004. This means u+00a0 is included in the latest extended shift-jis (shift_jis-2004), but not in the conventional shift-jis. Because Windows code page 932 (cp932) was defined after the conventional shift-jis (JIS X 0208), cp932 is not compatible with JIS X 0213. In contrast, Mac uses shift_jis-2004 (JIS X 0213).

# Mac
> charToRaw(iconv('\u00a0', from='utf-8', to='shift_jisx0213', sub=' '))
[1] 85 41  # good

When the explicit version of shift-jis is specified, iconv successfully converts u+00a0 into shift_jis-2004. But Windows fails with the message: unsupported conversion from 'utf-8' to 'shift_jisx0213' in codepage 932. Actually, the issue is not one of iconv, but of the differences between versions of the JIS code.

2-2. trim
In the following tests, a Japanese Hiragana character "あ", whose binary code is "e3 81 82" in utf-8 and "82 a0" in shift-jis, was used.

# Mac
> xmlValue(xmlRoot(htmlParse(
   '<html>a</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a"  # good. ascii
> xmlValue(xmlRoot(htmlParse(
   '<html>\ta</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a"  # good. ascii, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
   '<html>あ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82  # good. shift-jis
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
   '<html>\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=F))
[1] 09 e3 81 82  # good. shift-jis, trim=FALSE
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
   '<html>\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82  # good. shift-jis, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>\t\x82\xa0</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82  # good. shift-jis, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
   '<html>a\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
[1] 61 09 e3 81 82  # good. shift-jis, trim=TRUE but trimming is not required
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>\tあ</html>', asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82  # good. utf-8, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>\t\xe3\x81\x82</html>', asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82  # good. utf-8, trim

# Windows
> xmlValue(xmlRoot(htmlParse(
   '<html>a</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a"  # good. ascii
> xmlValue(xmlRoot(htmlParse(
   '<html>\ta</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a"  # good. ascii, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>あ</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82  # good. shift-jis
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>\tあ</html>', asText=T, encoding='shift-jis')), trim=F))
[1] 09 e3 81 82  # good. shift-jis, trim=FALSE
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>\tあ</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e7 b8 ba e3 a0 bc e3 b8 b2  # bad. shift-jis, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>\t\x82\xa0</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e7 b8 ba e3 a0 bc e3 b8 b2  # bad. shift-jis, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>a\tあ</html>', asText=T, encoding='shift-jis')), trim=T))
[1] 61 09 e3 81 82  # good. shift-jis, trim=TRUE but trimming is not required
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv(
   '<html>\tあ</html>', from='shift-jis', to='utf-8'), asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82  # good. utf-8, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(
   '<html>\t\xe3\x81\x82</html>', asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82  # good. utf-8, trim
Mac passed all the tests. In contrast, a bad case was found on Windows: when the text consisted of Japanese characters (あ) and space characters (\t), when the option trim=TRUE was specified, and when removal of space characters was truly required, the result was something unreadable. Ascii and utf-8 encodings were safe.

A point here may be the difference of the regular expression behaviour by locale: utf-8 for Mac and shift-jis for Windows.

# Mac
> charToRaw(gsub('\\s', '', '\tあ'))
[1] e3 81 82  # good
# Windows
> charToRaw(gsub('\\s', '', iconv('\tあ', from='shift-jis', to='utf-8')))
[1] e7 b8 ba e3 a0 bc e3 b8 b2  # bad

This result matches exactly the tests of xmlValue. So, what trim=TRUE in the package XML may be doing is applying R's regular expression (which depends on the locale) to the internal string (which is always utf-8). Because the regular expression working on Japanese Windows is safe for the national locale (cp932), it is unsafe for the international locale (utf-8).

Additionally, the result of the utf-8 trim test was good on Windows. This indicates there are some procedures to handle locale differences in package XML, and the bad case slips through by mistake.

3. Another way
Thanks to Jeroen for the comment. CRAN package xml2 is free from the 2nd issue.

# Windows
> charToRaw(xml_text(read_xml(
   '<html>\tあ</html>', encoding='shift-jis'), trim=T))
[1] e3 81 82  # good. shift-jis, trim=TRUE

Its trim=TRUE works well on Windows with shift-jis encoded html, and the result is independent of the read_xml option as_html=TRUE or FALSE. So we can use the package xml2 as an alternative to the package XML.
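By the way, if one must stay with the package XML on Windows, a possible workaround consistent with the analysis of 2-2 is to take the node text with trim=FALSE and strip the spaces manually after declaring the string encoding. The helper trim_utf8 below is only my sketch of that idea, not a tested fix for every locale:

```r
# xmlValue always returns utf-8 bytes, so declare them as utf-8 first;
# a regex applied to a string marked UTF-8 is interpreted as utf-8
# regardless of the session locale.
trim_utf8 <- function(s) {
  Encoding(s) <- 'UTF-8'                        # declare, don't convert
  gsub('^[[:space:]]+|[[:space:]]+$', '', s)    # trim both ends
}
charToRaw(trim_utf8('\tあ'))   # e3 81 82 in a utf-8 session
```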
To leave a comment for the author, please follow the link and comment on their blog: R – ЯтомизоnoR.