Today, I finally got inspired to deal with the tons of datasets from the Tribunal Superior Eleitoral on the Brazilian elections. The reason I had put off getting my hands on them was simply to avoid the trouble of messy, large text files. The data I collected amount to more than 40GB of plain text files, which report electoral results, candidates’ profiles, campaign revenues and expenditures, etc. If anything, this may be a good example of using R for data management, and it might be useful for students dealing with messy datasets from everywhere.
The task can be stated as follows. Suppose you have a set of data files (data1.txt, data2.txt, […], data27.txt), each representing a slice of the data, split by state or electoral district. What you want to do is simply stack every data file into a single, beautiful file for more aggregated analyses, or just to free the computer from storing too many sliced files. In sum, the task is to obtain one table containing all the subsets; more complex cases will be addressed in later posts. This can be done by pointing R to the directory where the files are, then looping through them, importing each one and stacking it onto the others. Finally, the aggregated file can be written back to disk.
The piece of code below does just that. The first line sets the path where R must look for the files. The second line creates an empty data object to store the imported files, if any. The third line lists the files in that path, and then a loop reads each existing file of type “.txt” as a table. The last line in the loop builds the final table by appending each subset that was imported into memory. Finally, the last part of the program, which sits outside the loop for efficiency, simply writes the final table to disk as a text file, delimiting the columns with a semicolon ‘;’.
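If you want to try these steps on a toy example before pointing them at the real files, here is a minimal sketch. It assumes three small semicolon-delimited files with identical columns, which it creates itself in a temporary directory. The file names, column names and values (`data_dir`, `state`, `candidate`, `votes`, `stacked.csv`) are made up for illustration and are not the actual TSE layout.

```r
# Toy setup: write three small semicolon-delimited files to a temp directory.
# File names, columns and values are hypothetical, for illustration only.
data_dir <- file.path(tempdir(), "stack_demo")
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:3) {
  toy <- data.frame(state     = paste0("S", i),
                    candidate = c("A", "B"),
                    votes     = c(10 * i, 20 * i))
  write.table(toy, file.path(data_dir, paste0("data", i, ".txt")),
              sep = ";", row.names = FALSE, quote = FALSE)
}

# List every .txt file in the directory (the pattern is a regular expression).
files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)

# Empty object that will hold the stacked table.
dataset <- NULL

# Read each file as a table and append its rows to what was collected so far.
for (f in files) {
  piece   <- read.table(f, header = TRUE, sep = ";", stringsAsFactors = FALSE)
  dataset <- rbind(dataset, piece)
}

# Write the stacked table once, outside the loop, with ';' as the delimiter.
# The output goes outside data_dir so a rerun does not pick it up as input.
out_file <- file.path(tempdir(), "stacked.csv")
write.table(dataset, out_file, sep = ";", row.names = FALSE, quote = FALSE)
```

One caveat on the design: growing `dataset` with `rbind()` inside the loop copies the accumulated table on every pass, which is fine for a handful of files but gets slow with many large ones. A common alternative is to read the files into a list with `lapply()` and bind them once with `do.call(rbind, ...)`.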