by Joseph Rickert
I recently had the opportunity to look at the data used for the 2009 KDD Cup competition. There are actually two sets of files that are still available from this competition. The "large" file is a series of five .csv files that when concatenated form a data set with 50,000 rows and 15,000 columns. The "small" file also contains 50,000 rows but only 230 columns. "Target" files are also provided for both the large and small data sets. The target files contain three sets of labels for "appentency", "churn" and "upselling" so that the data can be used to train models for three different classification problems.
The really nice feature of both the large and small data sets is that they are extravagantly ugly, containing large numbers of missing variables, factor variables with thousands of levels, factor variables with only one level, numeric variables with constant values, and correlated independent variables. To top it off, the targets are severely unbalanced containing very low proportions of positive examples for all three of the classification problems. These are perfect files for practice.
Often times, the most difficult part about working with data like this is just knowing where to begin. Since getting a good look is usually a good place to start, let's look at a couple of R tools that I found to be helpful for taking that first dive into messy data.
The mi package takes a sophisticated approach to multiple imputation and provides some very advance capabilities. However, it also contains simple and powerful tools for looking at data. The function missing.pattern.plot() lets you see the pattern of missing values. The following line of code provides a gestalt for the small data set.
main = "KDD 2009 Small Data Set")
Observations (rows) go from left to right and variables from bottom to top. Red indicates missing values. Looking at just the first 25 variables makes it easier to see what the plot is showing.
The function mi.info(), also in the mi package, provides a tremendous amount of information about a data set. Here is the output for the first 10 variables. The first thing the function does is list the variables with no data and the variables that are highly correlated with each other. Thereafter, the function lists a row for each variable that includes the number of missing values and the variable type. This is remarkably useful information that would otherwise take a little bit of work to discover.
> mi.info(DF) variable(s) Var8, Var15, Var20, Var31, Var32, Var39, Var42, Var48, Var52, Var55, Var79, Var141, Var167, Var169, Var175, Var185 has(have) no observed value, and will be omitted. following variables are collinear []  "Var156" "Var66" "Var9" []  "Var104" "Var105" []  "Var111" "Var157" "Var202" "Var33" "Var61" "Var71" "Var91" names include order number.mis all.mis type collinear 1 Var1 Yes 1 49298 No nonnegative No 2 Var2 Yes 2 48759 No binary No 3 Var3 Yes 3 48760 No nonnegative No 4 Var4 Yes 4 48421 No ordered-categorical No 5 Var5 Yes 5 48513 No nonnegative No 6 Var6 Yes 6 5529 No nonnegative No 7 Var7 Yes 7 5539 No nonnegative No 8 Var8 No NA 50000 Yes proportion No 9 Var9 No NA 49298 No nonnegative Var156, Var66 10 Var10 Yes 8 48513 No nonnegative No
For Revolution R Enterprise users the function rxGetINfo() is a real workhorse. It applies to data frames as well as data stored in .xdf files. For data in these files there is essentially no limit to how many observations can be analysed. rxGetInfo() is an example of an external memory algorithm that only reads a chunk of data at a time from the file. Hence, there is no need to try and stuff all of the data into memory.
The following is a portion of the output from running the function with the getVarinfo flag set to TRUE.
Data frame: DF Number of observations: 50000 Number of variables: 230 Variable information: Var 1: Var1, Type: numeric, Low/High: (0.0000, 680.0000) Var 2: Var2, Type: numeric, Low/High: (0.0000, 5.0000) Var 3: Var3, Type: numeric, Low/High: (0.0000, 130668.0000)
. Var 187: Var187, Type: numeric, Low/High: (0.0000, 910.0000) Var 188: Var188, Type: numeric, Low/High: (-6.4200, 628.6200) Var 189: Var189, Type: numeric, Low/High: (6.0000, 642.0000) Var 190: Var190, Type: numeric, Low/High: (0.0000, 230427.0000) Var 191: Var191 2 factor levels: r__I Var 192: Var192 362 factor levels: _hrvyxM6OP _v2gUHXZeb _v2rjIKQ76 _v2TmBftjz ... zKnrjIPxRp ZlOBLJED1x ZSNq9atbb6 ZSNq9aX0Db ZSNrjIX0Db Var 193: Var193 51 factor levels: _7J0OGNN8s6gFzbM 2Knk1KF 2wnefc9ISdLjfQoAYBI 5QKIjwyXr4MCZTEp7uAkS8PtBLcn 8kO9LslBGNXoLvWEuN6tPuN59TdYxfL9Sm6oU ... X1rJx42ksaRn3qcM X2uI6IsGev yaM_UXtlxCFW5NHTcftwou7BmXcP9VITdHAto z3s4Ji522ZB1FauqOOqbkl zPhCMhkz9XiOF7LgT9VfJZ3yI Var 194: Var194 4 factor levels: CTUH lvza SEuy Var 195: Var195 23 factor levels: ArtjQZ8ftr3NB ArtjQZmIvr94p ArtjQZQO1r9fC b_3Q BNjsq81k1tWAYigY ... taul TnJpfvsJgF V10_0kx3ZF2we XMIgoIlPqx ZZBPiZh Var 196: Var196 4 factor levels: 1K8T JA1C mKeq z3mO Var 197: Var197 226 factor levels: _8YK _Clr _vzJ 0aHy ... ZEGa ZF5Q ZHNR ZNsX ZSv9 Var 198: Var198 4291 factor levels: _0Ong1z _0OwruN _0OX0q9 _3J0EW7 _3J6Cnn ... ZY74iqB ZY7dCxx ZY7YHP2 ZyTABeL zZbYk2K Var 199: Var199 5074 factor levels: _03fc1AIgInD8 _03fc1AIgL6pC _03jtWMIkkSXy _03wXMo6nInD8 ... zyR5BuUrkb8I9Lth ZZ5
rxGetInfo() doesn't provide all of the information that mi.info() does, but is does do a particularly nice job on factor data, giving the number of levels and showing the first few. The two functions are complementary.
For a full listing of the output shown above down load the file: Download Mi_info_output.