String Indexing 2020
Research Project
We investigate the problem of building a suffix array substring index for inputs significantly larger than main memory. This problem is especially important in the context of biological sequence analysis, where biological polymers can be thought of as very long contiguous strings. The objective is to index every substring of these long strings to facilitate efficient queries. We propose a simple, scalable, and inherently parallelizable algorithm for building a suffix array for out-of-core strings. Our new algorithm, Suffix Rank, scales to arbitrarily large inputs, using disk as a memory extension, and solves the problem in just O(log n) scans over the disk-resident data. We evaluate the practical performance of our new algorithm and show that, for inputs significantly larger than the available amount of RAM, it scales better than state-of-the-art solutions such as eSAIS, SAscan, and eGSA.
People
- Marina Barsky [email protected]
- Jonathan Gabor [email protected]
- Mariano Consens [email protected]
- Alex Thomo [email protected]
New algorithm: Suffix Rank
Suffix Rank: a new scalable algorithm for indexing large string collections: VLDB paper
Implementation: https://github.com/mgbarsky/SUFFIX_RANK
Sample dataset: Small input dataset used for exhaustive correctness test: small_hg_reads.zip
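The O(log n)-scan bound comes from refining suffix ranks by prefix doubling: after round j, every suffix is ranked by its first 2^j characters. The actual Suffix Rank implementation (linked above) works out-of-core; the following is only a minimal in-memory Python sketch of the prefix-doubling idea, not the paper's algorithm:

```python
def suffix_array_prefix_doubling(s):
    """Build the suffix array of s by prefix doubling (in-memory sketch).

    After round j, suffixes are ranked by their first 2**j characters;
    the loop therefore runs O(log n) times.
    """
    n = len(s)
    rank = [ord(c) for c in s]  # initial ranks: by first character
    sa = list(range(n))
    k = 1
    while True:
        # Sort suffixes by (rank of first k chars, rank of next k chars).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for idx in range(1, n):
            new_rank[sa[idx]] = new_rank[sa[idx - 1]] + (key(sa[idx]) != key(sa[idx - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks unique: order is final
            return sa
        k *= 2

print(suffix_array_prefix_doubling("banana"))  # [5, 3, 1, 0, 4, 2]
```

In the out-of-core setting, each such doubling round is realized with sequential scans over disk-resident data instead of an in-memory sort, which is where the O(log n) scan count in the abstract comes from.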
Competitive algorithms
We compared performance and scalability of Suffix Rank to the following algorithms:
- eSAIS. Code. Instructions.
This algorithm uses the STXXL library, which handles all I/O tasks using multiple I/O threads. Our algorithm is implemented in plain C and does not use multi-core processing. Suffix Rank runs faster than eSAIS when the input is at least 30 times larger than the available RAM.
- SAscan. Code. Instructions.
This algorithm is lightweight and extremely fast. Suffix Rank runs faster than SAscan when the input is at least 15 times larger than the available RAM.
- eGSA. Code. Instructions.
This algorithm builds a generalized suffix array for a large collection of very short strings: each line of the input is treated as a separate string, and the algorithm sorts only the suffixes of each such string rather than the suffixes of the entire input. Suffix Rank can also build a generalized suffix array, and it runs faster than eGSA on all inputs tested in our experiments.
For a concrete comparison, we ran Suffix Rank and eSAIS on 5GB of the Human Genomes dataset using 1GB of RAM: our algorithm is 3.5 times faster, despite the fact that eSAIS performs multi-threaded I/O.
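To make the notion of a generalized suffix array concrete: instead of one suffix array over a single string, each suffix is identified by a (string id, offset) pair, and suffixes of all strings in the collection are sorted together. A naive in-memory Python sketch (illustration only; both eGSA and Suffix Rank do this out-of-core):

```python
def generalized_suffix_array(strings):
    """Naive generalized suffix array: sort all suffixes of every string
    in the collection and return (string_id, offset) pairs in
    lexicographic order of the suffixes. Quadratic in memory; for
    illustration of the output format only."""
    suffixes = [(doc[off:], sid, off)
                for sid, doc in enumerate(strings)
                for off in range(len(doc))]
    suffixes.sort()  # sorts by suffix text first, then string id
    return [(sid, off) for _, sid, off in suffixes]

print(generalized_suffix_array(["ACGT", "ACG"]))
# [(1, 0), (0, 0), (1, 1), (0, 1), (1, 2), (0, 2), (0, 3)]
```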
Input sequences
WIKI
- Dataset: link.
- Source: the entire XML file enwiki-20200220-pages-meta-current.xml, downloaded from the Wikimedia dump.
- Description: A single file containing text in XML format, indexed as ASCII text. The encoding was converted using this command.
- Motivation: Evaluate performance for sorting suffixes of one very long string.
- Alphabet Σ: 96.
- Total size N: 76GB.
- Number of separate strings: 1.
PROTEINS
- Dataset: link.
- Source: the entire TrEMBL database of proteins in text format downloaded from uniprot.org.
- Description: The database of total size 680GB includes a large amount of information about each protein sequence. We extracted all pure protein sequences using this code. Each protein sequence is on a separate line.
- Motivation: Explore how the algorithm handles a very large collection of short repetitive strings.
- Alphabet Σ: 25.
- Total size N: 59GB.
- Number of separate strings (lines): 185,000,000.
- Max string length: 21KB.
- Ave string length: 2.5KB.
GUTENBERG PLUS
- Dataset: link.
- Source: the entire Gutenberg collection of English books in text format was downloaded from http://www.gutenberg.org/robot/harvest using this script (ref). The total amount of data was about 15GB; it was supplemented up to 20GB with text files downloaded from this site.
- Description: Natural language text. An anomalous, highly repetitive dataset: it contains several versions of the same book, and each book in Gutenberg is prefixed with the same long copyright notice.
- Motivation: Evaluate performance on a large repetitive collection of small and medium-size files.
- Alphabet Σ: 105.
- Total size N: 20GB.
- Number of separate strings (files): 55,000.
- Max string length: 150MB.
- Ave string length: 360KB.
HG READS
- Dataset: link.
- Source: the collection of raw DNA reads from 8 different sequencing experiments was downloaded from the Phase 3 1000 Genomes Project data: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/. The content was converted into pure DNA strings using this command, which removed the statistics and description rows.
- Description: This dataset contains a very large number of very short strings with multiple identical substrings.
- Motivation: Explore how the algorithm handles a large collection of multiple short strings.
- Alphabet Σ: 5.
- Total size N: 20GB.
- Number of separate strings (lines): 241,000,000.
- Max string length: 100.
- Ave string length: 82.
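The conversion command linked above is the authoritative one; purely as an illustration of the idea, FASTQ stores each read as a four-line record (header, sequence, '+' separator, quality scores), so keeping only the second line of each record yields the pure DNA strings. A hypothetical Python sketch:

```python
def fastq_to_reads(lines):
    """Keep only the DNA sequence line (the 2nd line of every 4-line
    FASTQ record), dropping the '@' header, '+' separator, and quality
    rows. Hypothetical sketch, not the command used for the dataset."""
    return [line.strip() for i, line in enumerate(lines) if i % 4 == 1]

records = ["@read1", "ACGTACGT", "+", "IIIIIIII",
           "@read2", "TTGGCCAA", "+", "IIIIIIII"]
print(fastq_to_reads(records))  # ['ACGTACGT', 'TTGGCCAA']
```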
SKYLINE
- Dataset: 15GB sample.
- Source: artificially generated using this code.
- Description: The Skyline dataset is defined recursively so that identical substrings occur at every depth of the recursion. For example, if the initial string A is defined as '121', we next build a larger string B = A + '3' + A = '1213121', and the final string C = B + '4' + B = '121312141213121'. This final string of length 2^4 - 1 = 15 is over an alphabet of size 4, i.e., roughly log2 of its length. If you plot the symbols as heights, the result resembles a city skyline. Sorting the suffixes of C is not a trivial task: identical prefixes occur at each level of the recursion.
- Motivation: The Skyline string is used as an adversarial input for most suffix-sorting algorithms.
- Alphabet Σ: log2(N).
- Size N: 2GB - 30GB.
- Number of separate strings: 1.
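The recursive construction described above can be sketched in a few lines of Python (the generator linked above is authoritative; this sketch uses decimal digits as the alphabet, so it is only valid up to 9 levels):

```python
def skyline(depth):
    """Build a Skyline string: level d is s(d-1) + str(d) + s(d-1),
    so the result has length 2**depth - 1 over `depth` distinct symbols
    (an alphabet of size roughly log2 of the string length)."""
    s = "1"
    for level in range(2, depth + 1):
        s = s + str(level) + s
    return s

print(skyline(4))  # '121312141213121', matching string C in the example above
```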
CHROMOSOMES
- Dataset: link.
- Source: using this query, we downloaded several human genome assemblies from the NCBI website. We selected sequences of assembled human genomes, where each chromosome is represented as a separate string (file).
- Description: The data was converted from FASTA format to plain ASCII text removing header and identifier information and leaving just DNA sequences. In addition to the regular DNA alphabet {'A', 'C', 'G', 'T'} there are long stretches of 'N' which stands for an unknown nucleotide.
- Motivation: This is a rather adversarial dataset that contains multiple repeating substrings of significant length: putting several nearly identical human genomes into the same dataset ensures that very long suffixes are not assigned a unique rank until the very last iteration of our algorithm.
- Alphabet Σ: 5.
- Total size N: 20GB.
- Number of separate strings (files): 165.
- Max string length: 250MB.
- Ave string length: 121MB.
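The FASTA-to-plain-text conversion described above can be illustrated with a short hypothetical sketch (not the actual conversion used for the dataset): drop the '>' header lines and concatenate the remaining sequence rows, keeping the 'N' stretches.

```python
def fasta_to_plain(lines):
    """Strip FASTA '>' header lines and join the remaining sequence rows
    into one plain DNA string; runs of 'N' (unknown nucleotides) are
    kept, as in the dataset. Hypothetical illustration only."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))

chrom = [">chr21 some assembly header", "NNNNACGT", "ACGTNNNN"]
print(fasta_to_plain(chrom))  # 'NNNNACGTACGTNNNN'
```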