Proteomics Data Analysis (2/3): Data Filtering and Missing Value Imputation

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers.]

Welcome to Part Two of the three-part tutorial series on proteomics data analysis. The ultimate goal of this exercise is to identify proteins whose abundance differs between the drug-resistant cells and the control. In other words, we are looking for a list of differentially regulated proteins that may shed light on how cells escape the cancer-killing action of a drug. In Part One, I demonstrated how to acquire a proteomics data set and perform data pre-processing. We will pick up from the cleaned data set and confront the missing value problem in proteomics.

Again, the outline for this tutorial series is as follows:

  • Data acquisition and cleaning
  • Data filtering and missing value imputation
  • Statistical testing and data interpretation

Missing Value Problem

Although mass spectrometry-based proteomics has the advantage of detecting thousands of proteins from a single experiment, it faces certain challenges. One problem is the presence of missing values in proteomics data. To illustrate this, let's examine the first few rows of the log~2~-transformed and raw protein abundance values.

head(select(df, Gene, starts_with("LOG2")))
##    Gene LOG2.Parental_bR1 LOG2.Parental_bR2 LOG2.Parental_bR3
## 1 RBM47              -Inf              -Inf          21.87748
## 2 ESYT2           25.6019          25.56180          25.68763
## 3 ILVBL              -Inf          20.76474              -Inf
## 4 KLRG2              -Inf              -Inf          22.31786
## 5 CNOT1              -Inf              -Inf              -Inf
## 6   PGP              -Inf              -Inf              -Inf
##   LOG2.Resistant_bR1 LOG2.Resistant_bR2 LOG2.Resistant_bR3
## 1               -Inf               -Inf               -Inf
## 2               -Inf               -Inf               -Inf
## 3               -Inf               -Inf               -Inf
## 4               -Inf               -Inf               -Inf
## 5               -Inf               -Inf           29.14207
## 6               -Inf               -Inf           22.46269
head(select(df, Gene, starts_with("LFQ")))
##    Gene LFQ.intensity.Parental_bR1 LFQ.intensity.Parental_bR2
## 1 RBM47                          0                          0
## 2 ESYT2                   50926000                   49530000
## 3 ILVBL                          0                    1781600
## 4 KLRG2                          0                          0
## 5 CNOT1                          0                          0
## 6   PGP                          0                          0
##   LFQ.intensity.Parental_bR3 LFQ.intensity.Resistant_bR1
## 1                    3852800                           0
## 2                   54044000                           0
## 3                          0                           0
## 4                    5228100                           0
## 5                          0                           0
## 6                          0                           0
##   LFQ.intensity.Resistant_bR2 LFQ.intensity.Resistant_bR3
## 1                           0                           0
## 2                           0                           0
## 3                           0                           0
## 4                           0                           0
## 5                           0                   592430000
## 6                           0                     5780200
It is hard to miss the -Inf values, which represent protein intensity measurements of 0 in the raw data set. We consider these data points missing values, that is, a lack of quantification in the indicated samples. This is a common issue in proteomics experiments, and it arises from sample complexity and randomness (or stochasticity) in sampling.

For example, imagine pouring out a bowl of Lucky Charms cereal containing a thousand different marshmallows. Let's say there is only one coveted rainbow marshmallow for every one thousand pieces. The likelihood of your bowl containing the rare shape is disappointingly low. In our situation, there are approximately 20,000 proteins expressed in a given cell, many in low quantities. Hence, the probability of consistently capturing proteins with low expression across all experiments is small.
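To put a number on this intuition, consider a back-of-the-envelope calculation. Assuming, purely hypothetically, that a low-abundance protein has a 50% chance of being sampled in any single run, the chance of seeing it in all six samples is small:

```r
# Hypothetical per-run detection probability for a low-abundance protein
p_detect <- 0.5

# Probability of detecting the protein in all six independent runs
p_all_six <- p_detect^6
p_all_six
## 0.015625, i.e. under 2%
```

Under this assumption, the protein would be quantified across the full experiment less than 2% of the time, which is why sparsely quantified proteins are so common.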
Data Filtering
To mitigate the missing value problem, we first remove proteins that are sparsely quantified. The rationale is that a protein quantified in only one out of six samples offers insufficient grounds for comparison; in addition, such a protein could have been mis-assigned.

One of many filtering schemes is to keep proteins that are quantified in at least two out of three replicates in at least one condition. To jog your memory, we have two conditions, a drug-resistant cell line and a control, with three replicates each. The significance of replicates will be discussed in Part 3 of the tutorial. For now, we will briefly clean the data frame and apply the filter.
## Data cleaning: Extract columns of interest
df = select(df, Protein, Gene, Protein.names, starts_with("LFQ"), starts_with("LOG2"))
glimpse(df)
## Observations: 1,747
## Variables: 15
## $ Protein                     <chr> "A0AV96", "A0FGR8", "A1L0T0", "A4D...
## $ Gene                        <chr> "RBM47", "ESYT2", "ILVBL", "KLRG2"...
## $ Protein.names               <chr> "RNA-binding protein 47", "Extende...
## $ LFQ.intensity.Parental_bR1  <dbl> 0, 50926000, 0, 0, 0, 0, 0, 0, 0, ...
## $ LFQ.intensity.Parental_bR2  <dbl> 0, 49530000, 1781600, 0, 0, 0, 0, ...
## $ LFQ.intensity.Parental_bR3  <dbl> 3852800, 54044000, 0, 5228100, 0, ...
## $ LFQ.intensity.Resistant_bR1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 8213400...
## $ LFQ.intensity.Resistant_bR2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 6903700...
## $ LFQ.intensity.Resistant_bR3 <dbl> 0, 0, 0, 0, 592430000, 5780200, 0,...
## $ LOG2.Parental_bR1           <dbl> -Inf, 25.60190, -Inf, -Inf, -Inf, ...
## $ LOG2.Parental_bR2           <dbl> -Inf, 25.56180, 20.76474, -Inf, -I...
## $ LOG2.Parental_bR3           <dbl> 21.87748, 25.68763, -Inf, 22.31786...
## $ LOG2.Resistant_bR1          <dbl> -Inf, -Inf, -Inf, -Inf, -Inf, -Inf...
## $ LOG2.Resistant_bR2          <dbl> -Inf, -Inf, -Inf, -Inf, -Inf, -Inf...
## $ LOG2.Resistant_bR3          <dbl> -Inf, -Inf, -Inf, -Inf, 29.14207, ...
## Data filtering function
filter_valids = function(df, conditions, min_count, at_least_one = FALSE) {
  # df = data frame containing LOG2 columns for filtering, organized by data type
  # conditions = a character vector dictating the grouping
  # min_count = a numeric vector of the same length as "conditions" indicating the minimum
  #     number of valid values in each condition for retention
  # at_least_one = TRUE means keep the row if min_count is met in at least one condition;
  #     FALSE means min_count must be met in all conditions for retention

  log2.names = grep("^LOG2", names(df), value = TRUE)   # Extract LOG2 column names
  cond.names = lapply(conditions,   # Group column names by condition
                      function(x) grep(x, log2.names, value = TRUE, perl = TRUE))

  cond.filter = sapply(1:length(cond.names), function(i) {
    df2 = as.matrix(df[cond.names[[i]]])   # Extract columns of interest as a matrix
    sums = rowSums(is.finite(df2))   # Count the number of valid values in each row
    sums >= min_count[i]   # Test whether the min_count requirement is met
  })

  if (at_least_one) {
    df$KEEP = apply(cond.filter, 1, any)
  } else {
    df$KEEP = apply(cond.filter, 1, all)
  }

  return(df)   # No rows are omitted; the filter result is stored in the KEEP column
}

## Apply filtering
df.F = filter_valids(df,
                     conditions = c("Parental", "Resistant"),
                     min_count = c(2, 2),
                     at_least_one = TRUE)

The output data frame df.F is a copy of df with an additional KEEP column indicating the rows to retain. We will complete the filtering using the following operation and then check out the first couple of rows.

df.F = filter(df.F, KEEP)
head(select(df.F, Gene, starts_with("LOG2")))
##    Gene LOG2.Parental_bR1 LOG2.Parental_bR2 LOG2.Parental_bR3
## 1 ESYT2          25.60190          25.56180          25.68763
## 2 EIF3C          26.93022          27.11644          26.83231
## 3 NACAM          27.71299          27.66756          27.53527
## 4 DX39A          25.90933          25.69806          25.93283
## 5  BACH          25.07153          25.39110          25.06027
## 6 MYO1C          27.16471          27.48416          27.43841
##   LOG2.Resistant_bR1 LOG2.Resistant_bR2 LOG2.Resistant_bR3
## 1               -Inf               -Inf               -Inf
## 2           26.29148           26.04087           26.46083
## 3           29.03570           28.68295           28.89753
## 4           26.39331           26.54022               -Inf
## 5               -Inf               -Inf               -Inf
## 6           27.61400           27.02263           27.44530
Notice that the protein in the first row is quantified in the Parental line but not the Resistant one. Proteins like this are of great interest to us, as they are likely implicated in the mechanism of drug resistance. In addition, note that the final number of proteins after filtering (1031) is roughly 60% of the original number (1747). Filtering reduces our list to proteins quantified in a reasonably consistent manner.
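The filtering rule itself can be illustrated on a tiny, self-contained example (a sketch with made-up values; the real analysis uses filter_valids above):

```r
# Toy data: three proteins, two conditions, three replicates each
toy <- data.frame(
  LOG2.Parental_bR1  = c(25.1, -Inf, -Inf),
  LOG2.Parental_bR2  = c(25.3, -Inf, 20.2),
  LOG2.Parental_bR3  = c(-Inf, 21.9, -Inf),
  LOG2.Resistant_bR1 = c(-Inf, -Inf, -Inf),
  LOG2.Resistant_bR2 = c(-Inf, 26.0, -Inf),
  LOG2.Resistant_bR3 = c(-Inf, 26.2, -Inf)
)

# Count valid (finite) values per condition
valid_parental  <- rowSums(is.finite(as.matrix(toy[, 1:3])))
valid_resistant <- rowSums(is.finite(as.matrix(toy[, 4:6])))

# Keep a protein if at least 2 of 3 replicates are valid in at least one condition
keep <- valid_parental >= 2 | valid_resistant >= 2
keep
## TRUE TRUE FALSE
```

The third protein, quantified only once across six samples, is discarded, which is exactly the case the filter is designed to catch.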
Data Normalization
Before we proceed to imputation, we need to account for technical variability in the amount of sample analyzed by the mass spectrometer from one run to another. This issue is parallel to the variation in sequencing depth in RNA-seq experiments. To normalize out these technical differences, we perform a global median normalization: for each sample, the median of the log~2~-transformed distribution is subtracted from all of its values.
## Data normalization function
median_centering = function(df) {
  # df = data frame containing LOG2 columns for normalization
  LOG2.names = grep("^LOG2", names(df), value = TRUE)

  df[, LOG2.names] = lapply(LOG2.names,
                            function(x) {
                              LOG2 = df[[x]]
                              LOG2[!is.finite(LOG2)] = NA   # Exclude missing values from the median calculation
                              gMedian = median(LOG2, na.rm = TRUE)
                              LOG2 - gMedian   # Subtract the sample median
                            })

  return(df)
}

## Normalize data
df.FN = median_centering(df.F)

The result is that each sample is centered at a log~2~(intensity) of 0.
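The effect is easy to verify on a toy vector (made-up values standing in for one sample's log~2~ intensities):

```r
x <- c(20.5, 21.0, 22.4, NA, 25.1)         # one sample, with a missing value
x.centered <- x - median(x, na.rm = TRUE)  # subtract the sample median
median(x.centered, na.rm = TRUE)
## 0
```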

Data Imputation

After filtering and normalization, some missing values remain. How do we deal with them from here? The statistical approach designed to answer such a question is called imputation. For a thorough discussion of imputation on proteomics data sets, I highly recommend this article in the Journal of Proteome Research.

Since missing values are associated with proteins with low levels of expression, we can substitute the missing values with numbers that are considered “small” in each sample. We can define this statistically by drawing from a normal distribution with a mean that is down-shifted from the sample mean and a standard deviation that is a fraction of the standard deviation of the sample distribution. Here's a function that implements this approach:
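Before the full function, here is the core idea in isolation (a sketch with simulated data; the width and downshift values match the function defaults, but the intensities themselves are made up):

```r
set.seed(1)
observed <- rnorm(1000, mean = 25, sd = 1)   # stand-in log2 intensities for one sample

width <- 0.3
downshift <- 1.8
imp.mean <- mean(observed) - downshift * sd(observed)   # shift the mean down
imp.sd   <- width * sd(observed)                        # narrow the spread

imputed <- rnorm(200, mean = imp.mean, sd = imp.sd)
mean(imputed) < mean(observed)   # imputed values sit at the low end
## TRUE
```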

## Data imputation function
impute_data = function(df, width = 0.3, downshift = 1.8) {
  # df = data frame containing filtered and normalized LOG2 columns
  # Assumes missing data (in df) follows a narrowed and downshifted normal distribution

  LOG2.names = grep("^LOG2", names(df), value = TRUE)
  impute.names = sub("^LOG2", "impute", LOG2.names)

  # Create new columns indicating whether the values are imputed
  df[impute.names] = lapply(LOG2.names, function(x) !is.finite(df[[x]]))

  # Imputation
  df[LOG2.names] = lapply(LOG2.names,
                          function(x) {
                            temp = df[[x]]
                            temp[!is.finite(temp)] = NA

                            temp.sd = width * sd(temp[df$KEEP], na.rm = TRUE)   # shrink sd width
                            temp.mean = mean(temp[df$KEEP], na.rm = TRUE) -
                              downshift * sd(temp[df$KEEP], na.rm = TRUE)   # shift mean of imputed values

                            n.missing = sum(is.na(temp))
                            temp[is.na(temp)] = rnorm(n.missing, mean = temp.mean, sd = temp.sd)
                            temp
                          })

  return(df)
}

## Apply imputation
df.FNI = impute_data(df.FN)

Let's graphically evaluate the results by overlaying the distribution of the imputed values over the original distribution. In doing so, we observe that the number of missing values is greater in the resistant condition compared to the control. Furthermore, the missing values take on a narrow spread at the lower end of the sample distribution, which reflects our notion that low levels of protein expression produce missing data.
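The original figure is not reproduced here, but an overlay of this kind can be sketched with base R histograms (toy values; in practice you would pull the LOG2 columns and the corresponding impute.* indicator columns from df.FNI):

```r
# Toy sample: log2 intensities plus a flag for which values were imputed
log2.vals  <- c(25.1, 24.8, 26.0, 21.0, 21.3, 25.5, 24.2, 20.8)
is.imputed <- c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# Full distribution in grey, imputed values overlaid in blue
hist(log2.vals, breaks = 10, col = "grey80",
     main = "Observed vs. imputed", xlab = "log2 intensity")
hist(log2.vals[is.imputed], breaks = 10, col = "steelblue", add = TRUE)
```

The imputed values should cluster in a narrow band at the lower tail of each sample's distribution.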


This is the second of three tutorials on proteomics data analysis. I have described the approach to handling the missing value problem in proteomics.

In the final tutorial, we are ready to compare protein expression between the drug-resistant and the control lines. This involves performing a two-sample Welch's t-test on our data to extract proteins that are differentially expressed. Moreover, we will discuss ways to interpret the final output of a high-throughput proteomics experiment. Stay tuned for the revelation of proteins that may play a role in driving the resistance of tumor cells.
