Figure 2 | Genome Medicine

Figure 2

From: VISPA: a computational pipeline for the identification and analysis of genomic vector integration sites

Simplified architecture of the distributed alignment and filtering tool, showing control flow in the distributed implementation. Each MapReduce worker calls BLAST on every query sequence in the input subset assigned to it by the Hadoop framework. Each stream of BLAST results is then filtered according to the specified rules: if no hits remain after filtering, the read is discarded (N set, no-hit); otherwise, the remaining hits are classified as either ambiguous (A set, repeats) or unambiguous (U set). Finally, a local output collector opens all MapReduce output files (one per worker) and merges them into three new files, one for each category. During the data analysis performed for our clinical trials, the alignment and filtering step was run on up to 240 CPU cores simultaneously.
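
The filter-and-classify step described in the caption can be illustrated with a minimal Python sketch. It is not VISPA's code: the caption does not state the filtering rules, so the `Hit` record, the `MIN_IDENTITY` and `MIN_SCORE_GAP` thresholds, and the functions `classify` and `merge_worker_outputs` are all hypothetical names and values chosen for illustration. BLAST and Hadoop are not invoked; hits are passed in as plain Python objects, and the merge works on in-memory lists rather than on output files.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One alignment hit for a read (hypothetical record layout)."""
    read_id: str
    target: str
    identity: float  # percent identity of the alignment
    score: float     # alignment score, assumed to be positive


# Assumed thresholds, for illustration only; the real rules are not in the caption.
MIN_IDENTITY = 95.0    # hits below this percent identity are filtered out
MIN_SCORE_GAP = 0.05   # relative gap between best and second-best score


def classify(hits: Iterable[Hit]) -> str:
    """Return 'N' (no-hit), 'A' (ambiguous) or 'U' (unambiguous) for one read."""
    kept = sorted(
        (h for h in hits if h.identity >= MIN_IDENTITY),
        key=lambda h: h.score,
        reverse=True,
    )
    if not kept:
        return "N"  # nothing survived filtering: the read is discarded
    if len(kept) == 1:
        return "U"  # a single surviving hit is unambiguous
    best, runner_up = kept[0], kept[1]
    gap = (best.score - runner_up.score) / best.score
    # A clear winner is unambiguous; near-ties are treated as repeats.
    return "U" if gap >= MIN_SCORE_GAP else "A"


def merge_worker_outputs(
    worker_outputs: Iterable[Iterable[Tuple[str, str]]]
) -> Dict[str, List[str]]:
    """Merge per-worker (category, read_id) records into one list per category."""
    merged: Dict[str, List[str]] = {"N": [], "A": [], "U": []}
    for records in worker_outputs:
        for category, read_id in records:
            merged[category].append(read_id)
    return merged
```

In this sketch each worker would call `classify` once per read and emit a `(category, read_id)` pair; `merge_worker_outputs` then plays the role of the local output collector, producing one collection per category in place of the three merged files.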
