R in Open Data: Complaints in The Field of Freedom of Information data set from data.gov.rs

February 12, 2017
(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

The notebooks (R, Rmd, and HTML files are provided in my GitHub repository) present an exploratory analysis of the open data set on complaints in the field of freedom of information, published on the Open Data Portal of the Republic of Serbia, which is currently under development. The data set was kindly provided to the Open Data Portal by the Commissioner for Information of Public Importance and Personal Data Protection of the Republic of Serbia. Many more open data sets will be indexed and uploaded to the Open Data Portal of the Republic of Serbia in the forthcoming weeks and months.

You should view this primarily as an exercise in data wrangling and visualization with {ggplot2} and {igraph}. As for the data set: (a) no metadata and no documentation were provided; (b) the translation of legal terms from Serbian to English is mine, meaning that a lot of Google Translate suggestions were used (I’m a psychologist, not a lawyer or a legal expert); (c) a mixture of the Latin and Cyrillic alphabets was detected in the data; (d) thorough cleaning takes place in Part A, while exploratory analysis and data visualizations are presented in Part B. The R programming language provides a fantastic infrastructure for data wrangling (cleaning, preparation, re-structuring; in a nutshell, all the data management processes that need to be taken care of before any attempt at EDA or statistical modeling). In Part A I used {dplyr} in combination with {tidyr} and {base} functions to inspect and clean up the data set (as much as I could); in Part B, {ggplot2} and {igraph} functionality was added to visualize some of the interesting patterns in the data set.
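To illustrate the kind of check that the mixture of Latin and Cyrillic script calls for, here is a minimal base R sketch. It is not the code from the notebooks: the vector `applicant` and its values are made up for illustration, and the approach (Unicode script classes in a Perl-style regular expression) is just one plausible way to flag the problem.

```r
# Hypothetical example values; the real column names and contents differ.
# Cyrillic characters are written as \u escapes so the snippet does not
# depend on the encoding of the source file.
applicant <- c(
  "Novinar",                                   # Latin only
  "\u041d\u043e\u0432\u0438\u043d\u0430\u0440", # the same word in Cyrillic
  "Gra\u0452anin",                             # Latin with one Cyrillic letter
  "NVO"                                        # Latin only
)

# Unicode script classes are available with perl = TRUE.
has_cyrillic <- grepl("\\p{Cyrillic}", applicant, perl = TRUE)
has_latin    <- grepl("\\p{Latin}", applicant, perl = TRUE)

# Label each value by the script(s) it contains.
script <- ifelse(has_cyrillic & has_latin, "mixed",
          ifelse(has_cyrillic, "cyrillic",
          ifelse(has_latin, "latin", "other")))

# Tabulate how many values fall into each category.
print(table(script))
```

Values flagged as "mixed" are the ones most likely to need manual inspection, since the same label typed in two alphabets would otherwise be counted as two different categories.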

This is also a good reality check for all those who are contemplating a Data Science career. As I had to do here, you will often be faced with data sets that come with no documentation and no metadata, and then you will need to explore them before doing any real EDA, just to figure out the semantics of the data; many times, you will be forced to combine structured, abstract approaches to data cleaning with manual procedures; you will be driven mad by inconsistencies and missing values, but it will still be up to you to squeeze something useful out of the data that you have at your disposal. It’s no joke: data wrangling and related procedures will steal a huge amount of your time. Statistical modeling comes almost as a reward after what you’ve been through since you were first introduced to the data set…

Here are some examples with {ggplot2} and {igraph} from this case study.

Figure 1. Number of complaints filed per applicant group, 2005 – 2016. {ggplot2} with facet_wrap().
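A faceted plot of this kind might be produced along the following lines. This is only a sketch under stated assumptions: the data frame `counts`, its column names, and its values are invented for illustration and do not come from the data set.

```r
library(ggplot2)

# Hypothetical yearly counts for two made-up applicant groups.
counts <- data.frame(
  year         = rep(2005:2016, times = 2),
  applicant    = rep(c("Citizens", "Journalists"), each = 12),
  n_complaints = c(seq(10, 120, by = 10), seq(5, 60, by = 5))
)

# One panel per applicant group, sharing the same axes.
p <- ggplot(counts, aes(x = year, y = n_complaints)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ applicant) +
  labs(x = "Year", y = "Number of complaints")

print(p)
```

With many applicant groups, `facet_wrap()` keeps each series readable in its own panel instead of overplotting them all in one.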

Figure 2. Each applicant group (blue circles) in this directed graph points towards the domains (gold circles) about which it has sent complaints to the Commissioner for Information of Public Importance and Personal Data Protection. {igraph}.
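A directed graph like this can be built from a plain edge list. The sketch below assumes a hypothetical data frame `edges` with made-up applicant groups and domains; the colours only approximate those in the figure.

```r
library(igraph)

# Hypothetical edge list: applicant group -> complaint domain.
edges <- data.frame(
  from = c("Citizens", "Citizens", "Journalists", "NGOs"),
  to   = c("Budget", "Environment", "Budget", "Environment"),
  stringsAsFactors = FALSE
)

# Build a directed graph; vertices are taken from the edge list.
g <- graph_from_data_frame(edges, directed = TRUE)

# Colour vertices by role: applicant groups blue, domains gold.
V(g)$color <- ifelse(V(g)$name %in% edges$from, "steelblue", "gold")

plot(g, vertex.label.color = "black", edge.arrow.size = 0.5)
```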

Figure 3. Each applicant group in this directed graph points towards the three authority groups against which it has sent the largest numbers of complaints to the Commissioner for Information of Public Importance and Personal Data Protection (applicant groups are represented by blue circles and authority groups by red circles). {igraph}.
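The "top three per applicant group" selection behind such a graph could be sketched with {dplyr} as follows. The data frame `complaints`, its columns, and its values are hypothetical, and ties are broken arbitrarily here (`with_ties = FALSE`), which may not match the choice made in the notebooks.

```r
library(dplyr)

# Hypothetical complaint records, one row per complaint.
complaints <- data.frame(
  applicant = c(rep("Citizens", 10), rep("Journalists", 4)),
  authority = c(rep("Ministries", 4), rep("Courts", 3),
                rep("Municipalities", 2), "Public enterprises",
                rep("Ministries", 3), "Courts"),
  stringsAsFactors = FALSE
)

# Count complaints per applicant/authority pair, then keep at most
# the three most frequent authorities within each applicant group.
top3 <- complaints %>%
  count(applicant, authority, name = "n_complaints") %>%
  group_by(applicant) %>%
  slice_max(n_complaints, n = 3, with_ties = FALSE) %>%
  ungroup()

print(top3)
```

The resulting two-column edge list (applicant, authority) is in the shape that `igraph::graph_from_data_frame()` accepts, with the count available as an edge attribute.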
