Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use has profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:
“The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.”
— Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.
“When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’”
— Hanna Wallach, Senior Researcher at Microsoft Research NYC.
The article discusses the consequences of the data sets that aren't representative of the populations they are set to analyze, and also the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.
Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world