On the biases in data

December 5, 2017

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Whether we're building statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of a data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use have a profound impact on the systems we develop. A recent article on Microsoft's AI Blog discusses the inherent biases found in many data sets:

“The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.”
Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.

“When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’”
Hanna Wallach, Senior Researcher at Microsoft Research NYC. 
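
One simple way to start answering Wallach's question is to compare how often each group appears in a data set with how often it appears in the population being modeled. The sketch below is only an illustration and is not taken from the article: the function name `representation_gaps`, the group labels, and the population shares are all made up for the example, and a real check would need trustworthy reference figures for the population.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Return (sample share - population share) for each reference group.

    sample_groups: one group label per record in the data set.
    population_shares: dict mapping group label to its share (0..1)
        in the population we want to model (assumed known).
    """
    total = len(sample_groups)
    if total == 0:
        raise ValueError("sample_groups is empty")
    counts = Counter(sample_groups)
    gaps = {}
    for group, pop_share in population_shares.items():
        sample_share = counts.get(group, 0) / total
        gaps[group] = sample_share - pop_share
    return gaps


# Hypothetical data: group "a" makes up 80% of the sample
# but only 50% of the population.
sample = ["a"] * 80 + ["b"] * 20
population = {"a": 0.5, "b": 0.5}

gaps = representation_gaps(sample, population)
for group, gap in sorted(gaps.items()):
    # Positive gap: over-represented; negative gap: under-represented.
    print(f"{group}: {gap:+.2f}")
```

A gap near zero for every group does not show that a data set is unbiased, since bias can also enter through how the data were labeled or measured. A large gap is still a useful early warning.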

The article discusses the consequences of using data sets that aren't representative of the populations they are meant to describe, as well as the consequences of the lack of diversity in AI research and implementation. Read the complete article at the link below.

Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world

