
The visualization of complex data sets is of essential importance in communicating your data products. Beyond pie charts, histograms, line graphs, and other common forms of visual communication begins the reign of data sets that encompass too much information to be easily captured by these simple displays. A typical context that abounds with complexity is found in the areas of text mining, natural language processing, and cognitive computing in general; such complex data presentation problems are pervasive in any attempt to build a visual interface for products like semantic search engines or recommendation engines. For example, statistical models like LDA (Latent Dirichlet Allocation) enable thorough insight into the similarity structure across textual documents, or across the vocabulary terms used to describe them. But as the number of pairwise similarities between terms or documents to be presented to the end users increases, the problem of effective data visualization becomes harder. In this and the following blog posts, I wish to present and discuss some well-known approaches to visualizing difficult data sets from the viewpoint of pure ergonomics, as well as to suggest some improvements and new ideas here and there.

**A simple similarity map for iris**

In this first blog on the topic, the idea is to focus on the basics of ergonomics in the visualization of similarity information. We will compare two approaches: the use of similarity maps following a dimensionality reduction of the dataset – which can be taken as a sort of baseline in the visualization of similarity data – and the use of directed graphs built from the higher-dimensional space directly, which is less common. I would like to put forward an argument in support of the latter approach, which I started using some time ago.

But first, let us have some data. I will use a well-known, rather simple data set, merely for the purposes of illustration: namely, the famous iris dataset:

# – the iris dataset
data(iris)
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

The iris dataset, with its four numerical and one categorical variable, is, as you will see, quite complicated enough to illustrate the main problem that I wish to discuss. Now, let's illustrate that problem. The classical approach to assessing the similarity between the observations (rows) in this dataset would be to derive some metric from the numerical variables, perform dimensionality reduction, and then plot the data points in a lower-dimensional space. I will work with Euclidean distance and a non-linear dimensionality reduction approach, performing ordinal multidimensional scaling (MDS) with {smacof} in R:

# – distances in iris:
dIris <- dist(iris[, 1:4], method = "euclidean")

Following the computation of Euclidean distances, each observation in the dataset can be treated as if it were represented by a set of features, each feature standing for the distance from the observation under discussion to some other observation in the dataset (including itself, where the distance is zero). The source data were four-dimensional (four continuous variables were used to represent each observation); now they are N-dimensional, with N representing the number of observations. But, being distances, the data now convey information about the similarities between the observations.
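A quick check of the new representation (a minimal aside; dist() stores only the lower triangle, so we expand it to a full matrix first):

# – each observation is now a vector of 150 distances:
dim(as.matrix(dIris))
# [1] 150 150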

Now, the dimensionality reduction step:

# – multidimensional scaling w. {smacof}
library(smacof)
mdsIris <- smacofSym(dIris, type = "ordinal")
mdsIris$stress
confIris <- as.data.frame(mdsIris$conf)
# – add category to confIris:
confIris$Species <- iris$Species
head(confIris)

          D1          D2 Species
1 -0.8962210  0.09385943  setosa
2 -0.9060827 -0.05265667  setosa
3 -0.9608955 -0.03096467  setosa
4 -0.9167038 -0.08550457  setosa
5 -0.9101914  0.10073609  setosa
6 -0.7702262  0.22385668  setosa

In ordinal multidimensional scaling, only the ordering of the initial distances is preserved; the obtained mapping is thus invariant under rotation and translation – like any MDS solution – and under any monotonic scale transformation that preserves the rank order of similarity.
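To illustrate the invariance under rotation (a minimal sketch, reusing mdsIris from above; the rotation angle is arbitrary):

# – rotating the 2D configuration leaves all inter-point distances,
# – and hence the stress and the rank order of similarities, intact:
theta <- pi / 6
R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
rotConf <- as.matrix(mdsIris$conf) %*% R
all.equal(c(dist(rotConf)), c(dist(mdsIris$conf)))
# TRUE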

The stress of the final MDS configuration is .025, not bad. Now we have a 2D representation of the data. Let's plot it with {ggplot2}:

# – visualize w. {ggplot2}
library(ggplot2)
library(ggrepel)
library(RColorBrewer)
library(dplyr) # needed for recode()
confIris$Color <- as.character(recode(confIris$Species,
                                      setosa = "aquamarine3",
                                      versicolor = "coral3",
                                      virginica = "deepskyblue3"))
confIris <- confIris[order(confIris$Species), ]
confIris$Label[which(confIris$Species == "setosa")] <-
  paste("s", 1:sum(confIris$Species == "setosa"), sep = "")
confIris$Label[which(confIris$Species == "versicolor")] <-
  paste("ve", 1:sum(confIris$Species == "versicolor"), sep = "")
confIris$Label[which(confIris$Species == "virginica")] <-
  paste("vi", 1:sum(confIris$Species == "virginica"), sep = "")
ggplot(confIris, aes(x = D1, y = D2, label = Label)) +
  geom_text_repel(size = 2, segment.size = .1) +
  geom_point(size = 1.5, color = confIris$Color) +
  geom_point(size = 1, color = "white") +
  ggtitle("Iris: semantic map.") +
  theme_bw() +
  theme(plot.title = element_text(size = 10))

With 150 objects to represent, the similarity map is already difficult to read. It can readily be seen that the different types of observations ("s" for "setosa", "vi" for "virginica", and "ve" for "versicolor", the famous iris$Species, each additionally indexed by the observation number in the respective class) tend to cluster naturally. However, we know one very important thing about the iris dataset: it is not linearly separable. In the simplest possible explanation of what non-linear separability means: look for the border between the blue ("virginica") and red ("versicolor") points, and try to imagine a straight line through the plane that perfectly separates the two classes (and then generalize if you wish: no plane in 3D, no hyperplane in a higher-dimensional space, can separate the classes perfectly). Ask yourself: which observations are most responsible for this feature of the iris dataset? For example, is "vi7" at the bottom of the map a problematic case in categorization? What about the following: "vi50", "vi47", "ve34", "ve23"? The observation "vi7" is found far away from the other members of its category, yet it sits somewhat lonesome in this 2D similarity map. Observations "vi50", "vi47", "ve34", and "ve23", found in the lower right segment of the map but closer to its center, are, on the other hand, found in a very crowded part of the similarity space: could they be part of the explanation for the presence of non-linear separability in iris?
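As a quick, informal check of the non-separability claim (a sketch with {MASS}; LDA is only one possible linear rule, so this illustrates rather than proves the point):

# – fit a linear rule to all four numerical variables:
library(MASS)
ldaFit <- lda(Species ~ ., data = iris)
# – even on its own training data, the linear rule confuses a few
# – versicolor and virginica observations:
table(iris$Species, predict(ldaFit)$class)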

The problem of ergonomics is obviously related to the overcrowding of the points in the map. The conceptual problem with dimensionality reduction to simple, visually comprehensible similarity spaces is that there is no dimensionality reduction without a loss of information from the initial, higher-dimensional space (except in some really rare, theoretically possible cases). As we will see, one consequence of this is that seemingly unimportant observations can manage to hide a part of their nature in a lower-dimensional representation. But we need to see that first.

**Similarity in iris as a directed graph**

Before performing dimensionality reduction with ordinal MDS, we were in a 150-dimensional similarity – distance, actually – space. In spite of being a powerful technique for non-linear dimensionality reduction, MDS has "eaten" some of the information from the initial hyperspace. Now, let's see if we can go back into the hyperspace and visualize it somehow, to enable us to discuss the importance of particular observations in this dataset. Each data point is described by a vector of 150 elements, each element representing the distance between the data point under observation and some other data point. Let's search through these vectors and find the nearest neighbor – the most similar observation – for each data point:

# – iris as a directed graph:
# – note: we use the original high-dimensional distances in dIris here,
# – not the distances in the 2D MDS configuration:
distIris <- as.matrix(dIris)
nnIris <- apply(distIris, 1, function(x) {
  # the smallest distance is the observation's zero distance to itself,
  # so we pick the second-smallest:
  minDist <- sort(x, decreasing = T)
  minDist <- minDist[length(minDist) - 1]
  which(x == minDist)[1]
})
# – graphFrame:
graphFrame <- data.frame(from = 1:150,
                         to = nnIris,
                         Species = iris$Species)

We have first discovered the nearest neighbors, and then introduced a new variable, Degree, which describes for how many other observations (excluding itself) a particular data point presents the nearest neighbor. In the following similarity map, the size of the marker represents the degree of each observation:
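The degree computation itself is not shown in the post; a minimal sketch of how Degree could be derived from nnIris (the column name is mine):

# – Degree: for how many other observations is a given point the nearest neighbor?
graphFrame$Degree <- sapply(1:150, function(i) sum(nnIris == i))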

Observations "vi50" and "ve34" stand close to each other while belonging to different classes; moreover, they are marked as having a higher degree in the nearest neighbor network (we haven't visualized the network yet) than many other observations; they are candidates to be each other's nearest neighbors. Still, "vi7" at the bottom of the map is certainly not a hub. In the next step, we change the visualization strategy. Again, get ready for a loss of information.

While MDS attempts a representation that preserves as much information as possible about all pairwise similarities in the initial dataset, thus necessarily compromising with some error in the lower-dimensional representation, we now choose to sacrifice the context and represent information about similarity neighborhoods only. We will represent the similarity data from iris as a directed graph, with each node representing an observation, and each node having exactly one outgoing link, pointing towards the nearest neighbor of the respective observation. We use {igraph} to construct and plot the graph:

# – represent high-dimensional iris neighbourhoods as a directed graph:
graphFrame$from <- confIris$Label[graphFrame$from]
graphFrame$to <- confIris$Label[graphFrame$to]
library(igraph)
irisNet <- graph_from_data_frame(graphFrame, directed = T)
# – scale node size by degree (each node has out-degree 1, so this is
# – the in-degree, i.e. Degree as defined above, plus one):
V(irisNet)$size <- degree(irisNet) * 1.25
V(irisNet)$color <- as.character(confIris$Color)
V(irisNet)$label <- confIris$Label
# – plot w. {igraph}
par(mai = c(rep(0, 4)))
plot(irisNet,
     vertex.shape = "circle",
     vertex.frame.color = NA,
     edge.width = .75,
     edge.color = "grey",
     edge.arrow.size = 0.15,
     vertex.label.font = 1,
     vertex.label.color = "black",
     vertex.label.family = "sans",
     vertex.label.cex = .65,
     vertex.label.dist = .35,
     edge.curved = 0.5,
     margin = c(rep(0, 4)))

As expected, "vi50" and "ve34" have mutual links, standing as each other's nearest neighbor while being members of different categories; but so do "vi39" and "ve21", a fact that was less obvious in the similarity map. Also, we now discover that the nearest neighbor of "vi7" is "ve13", from another class. The clustering in the nearest neighbor graph occurs naturally, and the graph thus conveys some information about the similarity context of groups of points, in spite of the fact that it was not intentionally built to represent such information.
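The mutual links are also easy to verify directly (a sketch, reusing nnIris from above):

# – mutual nearest neighbors are pairs of observations that point at each other:
mutualNN <- which(nnIris[nnIris] == seq_along(nnIris))
confIris$Label[mutualNN]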

**Maps vs. Graphs**

The nearest neighbor directed graph approach is not very common among the attempts to visualize similarity data. However, following years of work in the field of distributional semantics, I have almost completely abandoned similarity maps as a means of data visualization in favor of graphs. While I haven't performed any methodologically rigorous comparison between the two approaches, I have used both in building interfaces for semantic search engines, and the users have invariably turned out to be more satisfied with directed nearest neighbor graphs than with similarity maps. Intuitively, I would say the main reason is the absence of overcrowded data points. Also, the aesthetics of graphs seem more pleasing. Graphs are maybe a bit more difficult to search through visually: the links and the occurrence of clusters tend to guide the eye towards the immediate neighborhood of a particular data point only. Then again, I don't find the ergonomics of similarity maps much more efficient in that respect. The drawback of directed graphs: while some important clusters of nearest neighbors emerge naturally in them, the graph layout will not necessarily enable the user to recognize the way the observations tend to group in general. I have experimented with directed graphs in which the two nearest neighbors are selected for each observation (see the sketch below), and that works too, but the graph quickly becomes unreadable for anything significantly beyond ten to fifteen observations.
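A minimal sketch of that two-nearest-neighbor variant (the object names are mine; it reuses dIris and {igraph} from above):

# – for each observation, the indices of its two nearest neighbors;
# – order() puts the zero self-distance first, so we skip position 1:
nn2 <- t(apply(as.matrix(dIris), 1, function(x) order(x)[2:3]))
graphFrame2 <- data.frame(from = rep(1:150, each = 2),
                          to = as.vector(t(nn2)))
irisNet2 <- graph_from_data_frame(graphFrame2, directed = T)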

Comparing the loss of information in similarity maps with that in nearest neighbor graphs, it seems to me that the latter have a clear advantage in the presentation of complex datasets, and I would clearly prefer the graphs in any situation – or at least suggest always using them as an additional method of interpretation alongside the respective maps. To illustrate, I have inspected ordinal MDS solutions for iris distances in Euclidean space, ranging the solution dimensionality from two to five, and computed, for each solution, the accuracy of correctly determining the N-th neighbor of each observation. The following line chart is quite telling.
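The code behind the chart is not shown in the post; a minimal sketch of how such accuracies could be computed (the helper kNeighbors() and the resulting data frame layout are mine):

# – the k nearest neighbors of each observation, self excluded:
kNeighbors <- function(distMat, k) {
  t(apply(distMat, 1, function(x) order(x)[2:(k + 1)]))
}
origK <- kNeighbors(as.matrix(dIris), 10)
# – for each solution dimensionality, the % of observations whose N-th
# – neighbor in the MDS configuration matches the original N-th neighbor:
accFrame <- do.call(rbind, lapply(2:5, function(nd) {
  mdsSol <- smacofSym(dIris, ndim = nd, type = "ordinal")
  solK <- kNeighbors(as.matrix(mdsSol$confdist), 10)
  data.frame(ndim = nd,
             order = 1:10,
             accuracy = sapply(1:10, function(j) {
               100 * mean(solK[, j] == origK[, j])
             }))
}))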

On the x-axis we find the order of the neighbor (1st, 2nd, etc.), while the accuracy of correctly determining the neighbor of the respective order across the observations is represented on the y-axis (% scale). One can now gain quite a direct insight into the behavior of information loss in ordinal MDS. The 2D solution seems especially problematic, with far less than 50% recognition of the 1st neighbor across the observations. The local information from higher-dimensional spaces represented by directed graphs perfectly complements the compromises made in the lower-dimensional similarity maps.

Goran S. Milovanović, PhD

Data Science Consultant, SmartCat
