Scraping a table from any web page with R or CloudStat

January 15, 2012
(This article was first published on PR, and kindly contributed to R-bloggers)

If you need data from a table on the internet, you don't have to retype it: you can extract (scrape) it directly, as long as you know the page's URL.

This is possible thanks to the XML package for R, which provides the readHTMLTable() function for reading HTML tables into data frames.
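As a minimal sketch of what the function does, the snippet below parses a small HTML string instead of a live page, so it runs without a network connection. The HTML content and the variable names (html, doc, demo.table) are made up for illustration; only htmlParse() and readHTMLTable() come from the XML package.

```r
library(XML)

# A tiny, made-up HTML page containing a single table (illustration only)
html <- "<html><body>
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alpha</td><td>78</td></tr>
  <tr><td>Beta</td><td>64</td></tr>
</table>
</body></html>"

# Parse the string as HTML, then read the first table into a data frame
doc <- htmlParse(html, asText = TRUE)
demo.table <- readHTMLTable(doc, header = TRUE, which = 1, stringsAsFactors = FALSE)

demo.table
```

The same call works on a URL in place of the parsed document, which is how it is used in the rest of this post.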

As a case study, I want to scrape two tables:

  1. US Airline Customer Score.
  2. World Top Chess Players (Men).

A. Scraping the US Airline Customer Score table from
http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines

Code:

library(XML)
airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
# which=1 selects the first table on the page
airline.table = readHTMLTable(airline, header=T, which=1, stringsAsFactors=F)

Result:

> library(XML)

Warning message:
package "XML" was built under R version 2.14.1
> airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
> airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)
> airline.table
Base-line 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10
1 Southwest 78 76 76 76 74 72 70 70 74 75 73 74 74 76 79 81 79
2 All Others NM 70 74 70 62 67 63 64 72 74 73 74 74 75 75 77 75
3 Airlines 72 69 69 67 65 63 63 61 66 67 66 66 65 63 62 64 66
4 Continental 67 64 66 64 66 64 62 67 68 68 67 70 67 69 62 68 71
5 American 70 71 71 62 67 64 63 62 63 67 66 64 62 60 62 60 63
6 United 71 67 70 68 65 62 62 59 64 63 64 61 63 56 56 56 60
7 US Airways 72 67 66 68 65 61 62 60 63 64 62 57 62 61 54 59 62
8 Delta 77 72 67 69 65 68 66 61 66 67 67 65 64 59 60 64 62
9 Northwest Airlines 69 71 67 64 63 53 62 56 65 64 64 64 61 61 57 57 61
11 PreviousYear%Change FirstYear%Change
1 81 2.5 3.8
3 65 -1.5 -9.7
4 64 -9.9 -4.5
5 63 0.0 -10.0
7 61 -1.6 -15.3
8 56 -9.7 -27.3
9 # N/A N/A
>

B. Scraping the World Top Chess Players (Men) table from http://ratings.fide.com/top.phtml?list=men

Code:

chess = "http://ratings.fide.com/top.phtml?list=men"
# which=5 selects the fifth table on the page
chess.table = readHTMLTable(chess, header=T, which=5, stringsAsFactors=F)

Result:

> chess = "http://ratings.fide.com/top.phtml?list=men"

> chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)
> chess.table
Rank Name Title Country Rating Games B-Year
1  1  Carlsen, Magnus  g  NOR  2835  17  1990
2  2  Aronian, Levon  g  ARM  2805  25  1982
3  3  Kramnik, Vladimir  g  RUS  2801  17  1975
4  4  Anand, Viswanathan  g  IND  2799  17  1969
5  5  Radjabov, Teimour  g  AZE  2773  9  1987
6  6  Topalov, Veselin  g  BUL  2770  9  1975
7  7  Karjakin, Sergey  g  RUS  2769  16  1990
8  8  Ivanchuk, Vassily  g  UKR  2766  16  1969
9  9  Morozevich, Alexander  g  RUS  2763  6  1977
10  10  Gashimov, Vugar  g  AZE  2761  9  1986
11  11  Grischuk, Alexander  g  RUS  2761  8  1983
12  12  Nakamura, Hikaru  g  USA  2759  17  1987
13  13  Svidler, Peter  g  RUS  2749  17  1976
14  14  Mamedyarov, Shakhriyar  g  AZE  2747  9  1985
15  15  Tomashevsky, Evgeny  g  RUS  2740  0  1987
16  16  Gelfand, Boris  g  ISR  2739  9  1968
17  17  Caruana, Fabiano  g  ITA  2736  19  1992
18  18  Nepomniachtchi, Ian  g  RUS  2735  16  1990
19  19  Wang, Hao  g  CHN  2733  6  1989
20  20  Kamsky, Gata  g  USA  2732  0  1974
21  21  Dominguez Perez, Leinier  g  CUB  2730  6  1983
22  22  Jakovenko, Dmitry  g  RUS  2729  0  1983
23  23  Ponomariov, Ruslan  g  UKR  2727  13  1983
24  24  Vitiugov, Nikita  g  RUS  2726  1  1987
25  25  Adams, Michael  g  ENG  2724  17  1971
26  26  Leko, Peter  g  HUN  2720  9  1979
27  27  Almasi, Zoltan  g  HUN  2717  8  1976
28  28  Giri, Anish  g  NED  2714  15  1994
29  29  Le, Quang Liem  g  VIE  2714  0  1991
30  30  Navara, David  g  CZE  2712  8  1985
31  31  Shirov, Alexei  g  LAT  2710  13  1972
32  32  Polgar, Judit  g  HUN  2710  0  1976
33  33  Riazantsev, Alexander  g  RUS  2710  0  1985
34  34  Wojtaszek, Radoslaw  g  POL  2706  8  1987
35  35  Moiseenko, Alexander  g  UKR  2706  7  1980
36  36  Vallejo Pons, Francisco  g  ESP  2705  15  1982
37  37  Malakhov, Vladimir  g  RUS  2705  0  1980
38  38  Jobava, Baadur  g  GEO  2704  23  1983
39  39  Bacrot, Etienne  g  FRA  2704  14  1983
40  40  Laznicka, Viktor  g  CZE  2704  8  1988
41  41  Sutovsky, Emil  g  ISR  2703  8  1977
42  42  Naiditsch, Arkadij  g  GER  2702  14  1985
43  43  Movsesian, Sergei  g  ARM  2700  9  1978
44  44  Sasikiran, Krishnan  g  IND  2700  9  1981
45  45  Vachier-Lagrave, Maxime  g  FRA  2699  13  1990
46  46  Dreev, Aleksey  g  RUS  2698  6  1969
47  47  Efimenko, Zahar  g  UKR  2695  8  1985
48  48  Volokitin, Andrei  g  UKR  2695  0  1986
49  49  Wang, Yue  g  CHN  2694  6  1987
50  50  Fressinet, Laurent  g  FRA  2693  17  1981
51  51  Li, Chao b  g  CHN  2693  6  1989
52  52  Grachev, Boris  g  RUS  2693  0  1986
53  53  Nielsen, Peter Heine  g  DEN  2693  0  1973
54  54  Van Wely, Loek  g  NED  2692  13  1972
55  55  Bruzon Batista, Lazaro  g  CUB  2691  19  1982
56  56  McShane, Luke J  g  ENG  2691  8  1984
57  57  Eljanov, Pavel  g  UKR  2690  10  1983
58  58  Kasimdzhanov, Rustam  g  UZB  2689  14  1979
59  59  Inarkiev, Ernesto  g  RUS  2689  6  1985
60  60  Zvjaginsev, Vadim  g  RUS  2688  8  1976
61  61  Andreikin, Dmitry  g  RUS  2688  0  1990
62  62  Areshchenko, Alexander  g  UKR  2688  0  1986
63  63  Rublevsky, Sergei  g  RUS  2686  0  1974
64  64  Akopian, Vladimir  g  ARM  2685  8  1971
65  65  Potkin, Vladimir  g  RUS  2684  0  1982
66  66  Sargissian, Gabriel  g  ARM  2683  15  1983
67  67  Berkes, Ferenc  g  HUN  2682  16  1985
68  68  Bologan, Viktor  g  MDA  2680  15  1971
69  69  Bauer, Christian  g  FRA  2679  24  1977
70  70  Tiviakov, Sergei  g  NED  2677  22  1973
71  71  Short, Nigel D  g  ENG  2677  15  1965
72  72  Motylev, Alexander  g  RUS  2677  6  1979
73  73  Gharamian, Tigran  g  FRA  2676  0  1984
74  74  Kobalia, Mikhail  g  RUS  2673  0  1978
75  75  Meier, Georg  g  GER  2671  9  1987
76  76  Onischuk, Alexander  g  USA  2670  13  1975
77  77  Bu, Xiangzhi  g  CHN  2670  6  1985
78  78  Alekseev, Evgeny  g  RUS  2670  0  1985
79  79  Azarov, Sergei  g  BLR  2667  0  1983
80  80  Kryvoruchko, Yuriy  g  UKR  2666  0  1986
81  81  Balogh, Csaba  g  HUN  2665  8  1987
82  82  Harikrishna, P.  g  IND  2665  6  1986
83  83  Khismatullin, Denis  g  RUS  2664  8  1984
84  84  Nguyen, Ngoc Truong Son  g  VIE  2662  6  1990
85  85  Fridman, Daniel  g  GER  2660  11  1976
86  86  Smirin, Ilia  g  ISR  2660  7  1968
87  87  Ding, Liren  g  CHN  2660  6  1992
88  88  Sadler, Matthew D  g  ENG  2660  3  1974
89  89  Korobov, Anton  g  UKR  2660  0  1985
90  90  Cheparinov, Ivan  g  BUL  2659  18  1986
91  91  Timofeev, Artyom  g  RUS  2659  0  1985
92  92  Georgiev, Kiril  g  BUL  2658  17  1965
93  93  Bartel, Mateusz  g  POL  2658  9  1985
94  94  Zhigalko, Sergei  g  BLR  2658  8  1989
95  95  Feller, Sebastien  g  FRA  2658  0  1991
96  96  Ragger, Markus  g  AUT  2655  17  1988
97  97  Jones, Gawain C B  g  ENG  2653  27  1987
98  98  So, Wesley  g  PHI  2653  5  1993
99  99  Milov, Vadim  g  SUI  2653  0  1972
100  100  Gupta, Abhijeet  g  IND  2652  9  1989
101  101  Postny, Evgeny  g  ISR  2652  8  1981
102  102  Roiz, Michael  g  ISR  2652  6  1983
103  103  Gyimesi, Zoltan  g  HUN  2652  4  1977
104  104  Nikolic, Predrag  g  BIH  2652  2  1960
>
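The which=5 argument above selects the fifth table on the FIDE page, but the post does not say how that index was found. One plausible approach, sketched here on a made-up HTML string so it runs offline, is to read every table at once and compare their sizes. The page content and the names all.tables, n.rows and biggest are assumptions for illustration only.

```r
library(XML)

# Made-up page: a small navigation table followed by the data table
html <- "<html><body>
<table>
  <tr><th>Link</th><th>Target</th></tr>
  <tr><td>Home</td><td>index</td></tr>
  <tr><td>Top</td><td>top</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Carlsen, Magnus</td></tr>
  <tr><td>2</td><td>Aronian, Levon</td></tr>
  <tr><td>3</td><td>Kramnik, Vladimir</td></tr>
</table>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# Without a `which` argument, readHTMLTable() returns a list of all tables
all.tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Count the rows of each table; the data table is often the largest one
n.rows <- sapply(all.tables, function(x) if (is.null(x)) 0L else nrow(x))
biggest <- unname(which.max(n.rows))

biggest
all.tables[[biggest]]
```

On a real page the largest table is not always the one you want, so it is worth printing the first few rows of each candidate before settling on an index.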

Done. You have successfully scraped data from a web page with R or CloudStat.

Now you can analyze the data as usual, with no retyping needed. Enjoy!
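One step that is often needed before analysis: scraped columns may come back as text rather than numbers, depending on the arguments and the package version, so this is an assumption about your result rather than a guarantee. The sketch below types in the first three rows of the chess table as character strings to mimic that situation, then converts the numeric-looking columns. The names chess.sample and num.cols are hypothetical.

```r
# First three rows of the chess table above, entered as character strings
# to mimic a scraped result (illustration only)
chess.sample <- data.frame(
  Rank   = c("1", "2", "3"),
  Name   = c("Carlsen, Magnus", "Aronian, Levon", "Kramnik, Vladimir"),
  Rating = c("2835", "2805", "2801"),
  stringsAsFactors = FALSE
)

# Convert the numeric-looking columns before doing arithmetic on them
num.cols <- c("Rank", "Rating")
chess.sample[num.cols] <- lapply(chess.sample[num.cols], as.numeric)

# Numeric summaries now work as expected
mean(chess.sample$Rating)
```

Check the result of str() on your own scraped table first; if the columns are already numeric, this step is unnecessary.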

Tags: scrape, scraping, data collection
