# Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark

@article{Galluzzo2021BurrowsWT, title={Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark}, author={Ylenia Galluzzo and Raffaele Giancarlo and M{\'a}rio Luciano Randazzo and Simona E. Rombo}, journal={ArXiv}, year={2021}, volume={abs/2107.03341} }

With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of “omics” data are daily collected and need to be processed. Indexing and compressing large sequences datasets are some of the most important tasks in this context. Here we propose algorithms for the computation of Burrows Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first ones that distribute the index computation and not only the input dataset… Expand

#### References

SHOWING 1-10 OF 17 REFERENCES

BigBWA: approaching the Burrows-Wheeler aligner to Big Data technologies

- Computer Science, Medicine
- Bioinform.
- 2015

BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows-Wheeler aligner (BWA). Expand

A New Burrows Wheeler Transform Markov Distance

- Computer Science, Mathematics
- AAAI
- 2020

Unlike other compression-based distance metrics known to us, the new Burrows Wheeler Markov Distance works by embedding sequences into a fixed-length feature vector, which allows it to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods. Expand

Rapid parallel genome indexing with MapReduce

- Computer Science
- MapReduce '11
- 2011

A novel parallel algorithm for constructing the suffix array and the BWT of a sequence leveraging the unique features of the MapReduce parallel programming model is presented and the end-to-end runtime is reduced from hours to mere minutes. Expand

Parallel distributed memory construction of suffix and longest common prefix arrays

- Computer Science
- SC15: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2015

This paper presents parallel algorithms for distributed memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Expand

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

- Computer Science
- NSDI
- 2012

Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks. Expand

Replacing suffix trees with enhanced suffix arrays

- Computer Science
- J. Discrete Algorithms
- 2004

This article shows how every algorithm that uses a suffix tree as data structure can systematically be replaced with an algorithm that use an enhanced suffix array and solves the same problem in the same time complexity. Expand

Indexing Compressed Text

- Computer Science
- 2003

This thesis implements an compressed full-text index, so that compressed texts can be indexed to support fast queries without decompressing the whole texts and shows that the index is compact and supports fast search. Expand

Next-generation sequencing: big data meets high performance computing.

- Computer Science, Medicine
- Drug discovery today
- 2017

To make effective use of the produced data, the design of big data algorithms and their efficient implementation on modern high performance computing systems is required. Expand

Linear work suffix array construction

- Mathematics, Computer Science
- JACM
- 2006

A generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space--time tradeoff and is asymptotically faster than all previous suffix tree or array construction algorithms. Expand

MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping

- Medicine, Biology
- PloS one
- 2014

A neural-network based training scheme is provided to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. Expand