Exact Complexity of Mergesort, and an R Regression Oddity


It’s nice to be back after a pretty crazy two weeks or so.

Let me start off by stating that this blog post is simply me pondering and may not be correct. Feel free to comment on inaccuracies or improvements!

In preparation for an exam, and owing to my naturally masochistic tendencies, I am forcing myself to work out the exact complexities of some sorting algorithms, and I decided to start with a favorite – mergesort. Mergesort first divides an array or linked list into two halves (or as close to halves as possible), and then recursively divides the successive sublists into halves until it reaches lists containing a single element each – the base case. Pairs of these single-element lists are then compared and merged so that their elements are in order, forming sorted two-element lists.

At successive levels we merge pairs of sorted sublists by repeatedly comparing the smallest remaining element of each sublist and moving the smaller of the two into a new, larger sorted list. This process continues up the recursion tree until the entire original list is sorted. For a more comprehensive and precise description, see this article on mergesort.
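To make the splitting and merging concrete, here is a minimal recursive mergesort in R – a sketch for illustration only (the names mergesort and merge_sorted are mine, and this is not the implementation analyzed below):

# Merge two already-sorted vectors by repeatedly taking the smaller front element
merge_sorted <- function(left, right) {
  result <- numeric(length(left) + length(right))
  i <- j <- 1
  for (k in seq_along(result)) {
    if (j > length(right) || (i <= length(left) && left[i] <= right[j])) {
      result[k] <- left[i]; i <- i + 1
    } else {
      result[k] <- right[j]; j <- j + 1
    }
  }
  result
}

# Recursively split into halves, sort each half, then merge
mergesort <- function(x) {
  if (length(x) <= 1) return(x)   # base case: one element is already sorted
  mid <- floor(length(x) / 2)
  merge_sorted(mergesort(x[1:mid]), mergesort(x[(mid + 1):length(x)]))
}

mergesort(c(5, 3, 8, 1, 9, 2))    # 1 2 3 5 8 9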

Easy: The Worst Case

The worst case is easy, as any CS student will tell you. Looking at the recursion trees below, we see that in the worst case we must perform O(\log{n}) “layers” of “parallel” merges, corresponding to the height of the recursion tree, with each layer performing O(n) comparisons in total. Then, the worst case complexity of mergesort is O(n \log{n}).
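Equivalently, the same bound falls out of the standard divide-and-conquer recurrence (a textbook formulation, not something taken from the figures below):

T(n) = 2\,T\!\left( \frac{n}{2} \right) + cn, \qquad T(1) = c_0

which unrolls over the \log_2{n} levels of the recursion tree to T(n) = cn \log_2{n} + c_0 n = O(n \log{n}).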

Three Cases

Trivially, if the number of elements is a power of 2 (figure a), all of the sublists at each level of the recursion tree have the same size, forming a recursion tree that is full and balanced. In this case we get the worst-case behavior: each element is involved in exactly \log_2{n} merge operations, and each level of merging takes O(n) time in total. In situation b, the number of elements is 1 smaller than a power of 2 (e.g. 3, 7, 15). In situation c, the number of elements is 1 greater than a power of 2 (e.g. 5, 9, 17).

[Recursion trees for (a) n a power of 2, (b) n one less than a power of 2, (c) n one greater than a power of 2]

In situation a, the number of merge operations required for each of the n elements is \log_2{n}, which yields O(n \log{n}) operations.

In situations b and c, some elements require more merge operations than others; however, the number of merges differs by at most 1. The number of merges for each element is approximately \log_2{n}, and is exactly either \lfloor \log_2{n} \rfloor or \lceil \log_2{n} \rceil. If a elements require \lceil \log_2{n} \rceil merges and b elements require \lfloor \log_2{n} \rfloor merges, then the total amount of work performed is


T(n) = a \lceil \log_2{n} \rceil + b \lfloor \log_2{n} \rfloor

In situation a, we have

T(n) = a \lceil \log_2{n} \rceil + b \lfloor \log_2{n} \rfloor = a \log_2{n} + b \log_2{n} = \left( a + b \right) \log_2{n} = n \log_2{n}
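As a quick sanity check in case a: for n = 8, all a + b = 8 elements are merged exactly \log_2{8} = 3 times, so

T(8) = 8 \cdot 3 = 24 = 8 \log_2{8}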

Then what? We need to find expressions for the coefficients a and b; since they depend on n, I will call them a_n and b_n. I drew out several recursion trees and got the following table:


[Table: values of a_n and b_n obtained from the recursion trees]

From the table, I extracted the following relationships. I will leave the proof by induction to the reader.


 b_n = \left\{ \begin{array}{ll} n - 2 & \mbox{if } n = 2^i + 1, i = 1, \ldots \\ \frac{n}{2} & \mbox{if } n = 2^i, i = 1, \ldots \\ b_{n-1} - 1 & \mbox{otherwise}  \\ \end{array} \right.

 a_n = \left\{ \begin{array}{ll} 2 & \mbox{if } n = 2^i + 1, i = 1, \ldots \\ \frac{n}{2} & \mbox{if } n = 2^i, i = 1, \ldots \\ a_{n-1} + 2 & \mbox{otherwise}  \\ \end{array} \right.
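These recurrences can also be sanity-checked by brute force. The sketch below (my own helper, merge_depths, not part of the original derivation) counts how many merges each element participates in by recursing on the sublist sizes, then tallies how many elements need \lceil \log_2{n} \rceil versus \lfloor \log_2{n} \rfloor merges. For powers of 2 the floor and ceiling coincide, so the split a_n = b_n = n/2 there is just a bookkeeping convention and the tally does not apply.

# For a list of size m, every element in either half participates in one more
# merge than it does within its own half.
merge_depths <- function(m) {
  if (m <= 1) return(0)   # a single element is never merged on its own
  left <- floor(m / 2)
  c(merge_depths(left), merge_depths(m - left)) + 1
}

# Count elements needing ceiling(log2(n)) and floor(log2(n)) merges
check <- function(n) {
  d <- merge_depths(n)
  c(a = sum(d == ceiling(log2(n))), b = sum(d == floor(log2(n))))
}

check(5)   # a = 2, b = 3  (n = 2^2 + 1)
check(6)   # a = 4, b = 2  (a_6 = a_5 + 2, b_6 = b_5 - 1)
check(7)   # a = 6, b = 1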

This is not exactly n \log{n}, as I have heard some people claim, but it is pretty darn close – it is O(n \log{n}). Using the code below, I computed a_n and b_n for a range of n and plotted the corresponding T(n).

n <- seq(1, 2000)

# a_n = b_n = n/2 when n is a power of 2; use 0 as a placeholder otherwise
an <- bn <- ifelse(ceiling(log2(n)) == floor(log2(n)), n/2, 0)

# When n = 2^i + 1: b_n = n - 2 and a_n = 2
i <- seq(1, floor(log2(max(n))))
bn[2^i + 1] <- n[2^i + 1] - 2
an[2^i + 1] <- 2
an[1] <- 0
bn[1] <- 0

# Otherwise: a_n = a_{n-1} + 2 and b_n = b_{n-1} - 1
for (j in 2:length(an)) {
	an[j] <- ifelse(an[j] == 0, 2 + an[j-1], an[j])
	bn[j] <- ifelse(bn[j] == 0, bn[j-1] - 1, bn[j])
}

# Exact number of merge operations
T <- an*ceiling(log2(n)) + bn*floor(log2(n))

plot(T ~ n, type = "l", main = "Exact Runtime of Mergesort",
	xlab = expression(italic(n)), ylab = expression(italic(T(n))))
curve(x*log2(x), add = TRUE, col = "red")
legend("topleft", c("Exact Solution", expression(n*log(n))),
	lwd = c(1, 1), col = c("black", "red"))

# Regress T on n*log2(n) and compute R^2 = MSS/TSS by hand
Nlog2N <- n*log2(n)
my.model <- lm(T ~ Nlog2N)
TSS <- sum((T - mean(T))^2)
RSS <- sum(my.model$residuals^2)
MSS <- TSS - RSS
R2 <- MSS/TSS


[Plot: exact T(n) compared with n \log_2{n}]

Note that there is not a perfect overlap here.

The Precise Regression Issue

Naturally, if the exact running time of mergesort were n \log_2{n}, then the regression of T(n) on n \log_2{n} should be a perfect fit to the data…

Call:
lm(formula = T ~ Nlog2N)

Residuals:
      Min        1Q    Median        3Q       Max 
-90.15504 -13.84243   0.02602  19.70344  44.44878 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 1.016e+01  1.191e+00     8.526   <2e-16 ***
Nlog2N      1.005e+00  9.825e-05 10224.908   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 28.46 on 1998 degrees of freedom
Multiple R-squared:     1,	Adjusted R-squared:     1 
F-statistic: 1.045e+08 on 1 and 1998 DF,  p-value: < 2.2e-16 

Hmm. The reported R^2 value is exactly 1. This confused me for a while, until I realized that this is a precision (rounding) issue in how R prints the summary, and not a regression issue. Note that if this were a perfect fit, the Residuals section would be all zeros. Also, the base of the logarithm (e or 2) does not matter; the results are exactly the same, since the slope simply absorbs the constant factor. So why is R^2 reported as 1?


R^2 = \frac{MSS}{TSS} = \frac{84659798098}{84661416007} = 0.9999809 \approx 1

So why is this an issue? Most people would simply round 0.9999809 up to 1, but 1 is profound to statisticians: it means the linear model fits the data perfectly. OK, so maybe it is a precision issue or a printing option within R. But maybe not… look at the p-value! The question to take home is: why is the p-value reported with such precision, while the R^2 value is not? Personally, if we had to round, I would round the p-value to 0 and keep the precision of R^2. Thoughts?
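One way to see that the printed R^2 is a rounding artifact rather than a truly perfect fit is to pull it out of the summary object and force more digits (this reuses my.model from the code above; the exact digits will depend on the computed T(n) values):

# summary.lm() prints R-squared with only a few significant digits,
# so 0.99998... is displayed as 1; the stored value is not exactly 1.
fit <- summary(my.model)
print(fit$r.squared, digits = 12)
print(1 - fit$r.squared, digits = 6)   # how far the fit is from perfect

As far as I can tell, the coefficient table's p-values are formatted by format.pval(), which switches to scientific notation for very small values, which is why they appear to carry far more precision than the rounded R^2.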

Conclusion to the Original Purpose of this Post

So, to make a really long story just long: the total running time is not exactly n \log_2{n}; rather, it is O(n \log{n}).
