MARSHAL: Managing Additional References in SHort-read ALignment
---------------------------------------------------------------
http://marshalbio.sourceforge.net/
If you have any questions or comments, please send them to Rebecca Sealfon at ras2198@columbia.edu.


CONTENTS
1.  Overview, Installation and Basic Tutorial
	1.1  Overview
	1.2  Installing and Uninstalling MARSHAL
	1.3  Tutorial
2.  File Format Requirements
	2.1  Primary Reference, Indels and Short Reads
	2.2  Indel Table
	2.3  Reformatting Sequence Files for MARSHAL with the "sed" Command 
3.  Obtaining the Insertions and Indel Table File from Alignment Output: The marshal_extract Program
4.  Translating Positional Coordinates Between Builds
5.  Creating the Initial Chimeric Reference With marshal_build and Running the Short-Read Aligner
6.  Processing Alignment Output to Calculate a Coverage Cutoff: marshal_simulate, marshal_coverage
7.  Calling Indels with marshal_analyze
8.  Creating a Final Chimera with marshal_integrate


1.  OVERVIEW, INSTALLATION AND BASIC TUTORIAL
	1.1  Overview
MARSHAL allows short-read nucleotide aligners to utilize multiple references with negligible overhead of time and space.  The use of multiple references in short-read assembly is important for detecting structural variations (SVs), such as long insertions and deletions (indels) that may only appear in some of the references.  MARSHAL is designed for analyzing structural variations among closely related individuals or species.  It is most effective when run with a set of references that are relatively similar to one another and to the reads, but are polymorphic with respect to specific long indels.  Users may choose which external program performs the short-read alignment, since MARSHAL works by preprocessing the alignment input and postprocessing the alignment output.  Several scripts that facilitate interaction with additional programs are also included.

Input into MARSHAL is provided as one or more short read files, a FASTA primary reference, a tab-delimited file with a formatted list of long indels, and FASTA sequence files containing the insertions.  An optional preprocessing step identifies the indels from pairwise alignment between the primary reference and other reference(s), outputting this information in appropriate format for the main pipeline.  Prior to short-read alignment, the indel information is used to create a FASTA-formatted initial chimeric reference.  This chimera is inclusive of the primary reference and all the insertions, and is designed to enable the short-read aligner to adequately map the reads to the indels.  After the aligner is run, indel coverage is analyzed from a SAM-formatted alignment map and each indel's presence vs. absence evaluated.  An additional script, which may be called before or after short-read alignment, allows the user to create a continuous chimeric reference by specifying which segments to add and remove.


	1.2  Installing and Uninstalling MARSHAL
Install:
After downloading MARSHAL, uncompress the downloaded .tar.gz file by typing, in the directory that contains that file
$ tar -zxvf marshal-XXX.tar.gz
where XXX is the version obtained.  You will see a MARSHAL folder.

Enter that folder and compile the program in the main download directory by typing
$ javac *.java

Uninstall:
Delete all bytecode files by typing, in the main download directory
$ rm *.class

Remove all MARSHAL files by typing, in the parent directory of the MARSHAL folder
$ rm -rf marshal-*


	1.3  Tutorial
In marshal-sampleData.tar.gz, downloadable from the main MARSHAL webpage, are the files required to perform the analysis of composite human/bacterial sequences described in the Supplementary Material of the MARSHAL paper.  These files include: primary reference "chrY.withAureusSegments.fa", secondary reference insertion files "aureus.falseInsertions.fa" and "chrY.insertions.fa", indel table file "allIndels.txt" and short read file "reads.fa".  The instructions in the tutorial can be copied and pasted into the terminal if the marshal-sampleData/ directory is moved to the main folder with the MARSHAL code.

Once MARSHAL is installed and compiled, the preliminary chimeric reference chrY.aureus.chimera.fa can be created by typing
$ java marshal_build marshal-sampleData/chrY.withAureusSegments.fa marshal-sampleData/allIndels.txt chrY.aureus.chimera.fa marshal-sampleData/chrY.insertions.fa marshal-sampleData/aureus.falseInsertions.fa

The output file "chrY.aureus.chimera.fa" can be used as a reference for the alignment of the short reads (reads.fa).  We downloaded the source code for Bowtie 0.12.7 (http://bowtie-bio.sourceforge.net/tutorial.shtml), a C++ program, into the parent directory of the main MARSHAL folder and typed the following commands:
$ tar -zxvf bowtie-src-0.12.7.tar.gz
$ cd bowtie-src-0.12.7
$ make
$ ./bowtie-build ../marshal-XXX/chrY.aureus.chimera.fa chimera.index
where XXX is the relevant version of MARSHAL.  This sequence of commands installs and compiles Bowtie and builds a Bowtie index of the chimeric reference.  The actual alignment was performed by typing:
$ ./bowtie -Sf chimera.index ../marshal-XXX/marshal-sampleData/reads.fa reads.chimeraMap.sam

Return to the main MARSHAL download directory to perform the coverage analysis.
$ cd ../marshal-XXX

Create an indel table file "randomIndels.txt" containing randomly placed indels, and analyze the coverage of these random indels:
$ java marshal_simulate chrY.aureus.chimera.fa marshal-sampleData/allIndels.txt randomIndels.txt -x -n 6 -i 5.0 0.1 -i 15.6 0.2
$ java marshal_coverage ../bowtie-src-0.12.7/reads.chimeraMap.sam randomIndels.txt

Alternatively, the first coverage analysis performed in the Supplement can be performed on Supplementary Table 2, once it is saved as a text file with the explanatory text removed.  marshal_coverage's output file "basecoverage.txt" can be opened in a spreadsheet.

Analyze the original indels as covered or uncovered:
$ java marshal_analyze ../bowtie-src-0.12.7/reads.chimeraMap.sam marshal-sampleData/allIndels.txt --verbose

Integrate all the indels into a new reference chromosome chrY.indelsIntegrated.fa (add the insertions to their homologous positions, remove the deletions from their homologous positions):
$ java marshal_integrate marshal-sampleData/chrY.withAureusSegments.fa marshal-sampleData/allIndels.txt chrY.indelsIntegrated.fa marshal-sampleData/chrY.insertions.fa


2.  FILE FORMAT REQUIREMENTS
	2.1  Primary Reference, Indels and Short Reads
MARSHAL accepts a single- or multi-sequence FASTA file as the primary reference, with the sequence(s) in this file recorded as chromosomes.  Segments of secondary references that are absent from the primary reference (FASTA insertion files), as well as their coordinates and the coordinates of primary reference segments that are absent in secondary references (deletions), are also input into MARSHAL.  The coordinates of indels are stored in a separate indel table file (see Section 2.2).

The primary reference must be entirely contained in a single FASTA file; there can be as many FASTA insertion files as the command line will allow.  However, the file names and FASTA header rows of the insertion sequences, as well as the FASTA header rows of the chromosomes in the primary reference, are used by the program for identification.  All FASTA headers must begin with the ">" character.  Their names are subject to additional restrictions, particularly that they must match Column B (accession) and Column C (chromosome number) of the indel table.  For the details of these requirements, see section 2.2 and particularly the descriptions of the accession and chromosome number columns.

In addition, MARSHAL requires the end-user to input a set of short reads.  Since the short reads are sent to the mapper unmodified, their format requirements are those of the mapper.  MARSHAL can be used with any mapper that will accept a multi-sequence FASTA file as input and produce a SAM (Sequence Alignment Map) file as output.


	2.2  Indel Table
Many steps in the pipeline input or output an indel table, which must be stored in its own file.  The indel table file is a plain-text, tab-delimited document that can be easily opened and manipulated by a spreadsheet.  It includes a header row, which is skipped over and otherwise ignored by the program.  Each subsequent row represents a different insertion or deletion.  Sorting the rows is not necessary, as this occurs within MARSHAL.  Not all columns hold information that is relevant to all types of indels; where information is irrelevant, "NA" is used.

The columns of the indel table are as follows:
Column A: Type.  "ins" if the region is an insertion, "del" if the region is a deletion.
Column B: Accession.  A unique string of characters that identifies each insertion; the column reads "NA" for deletions.  When handling the input FASTA files that contain the insertion sequences, the program checks first in the file name and then in the FASTA header for case-sensitive exact matches to the accession.  Thus, it determines which file or sequence corresponds to which row of the table.  If an input file contains multiple insertions, the name of the file must not contain the accession of any of the insertions.
Column C: Chrm.  The chromosome number (we suggest starting this with "chr" for ease of use with marshal_build).  When handling the primary reference, the program identifies chromosome number by FASTA header lines.  It searches for case-sensitive exact matches to the character sequences contained in this column of the table, and also requires that a space follows the chromosome name.  Thus, "chr11 " is recognized as different from "chr1 ".
Column D: Orientation.  Orientation of insertions along the chromosome--"fwd" for forward orientation, "rc" for reverse-complement.  The column reads "NA" for deletions.
Column E: Ins start.  Start position of the insertion along the input insertion sequence.  "NA" for deletions.
Column F: Ins end.  End position of the insertion along the input insertion sequence.  "NA" for deletions.
Column G: Chrom start.  Start position of the indel along the chromosome.
Column H: Chrom end.  End position of the indel along the chromosome.
Column I: Ploidy.  The ploidy of the chromosome.
Column J (optional): Detected presence or absence of the indel region.  0 if same as primary reference, 1 if different.  In other words, insertions to the primary reference are "0" if absent and "1" if present; deletions from the primary reference are "0" if the sequence is present and "1" if it is absent.  Created by the indel caller, and input into marshal_integrate to create the final chimera.
Subsequent columns (optional): Can be used to store additional information about the sequences and the regions.  For example, marshal_analyze in --verbose mode outputs an indel table file with three extra columns per indel region: number of covered bases, total number of bases in the indel region, and total number of short read bases that mapped to the region.  See Supplementary Table 3 for an additional example.

Notes on Columns E-H (positional coordinates):
The positions of the bases are always numbered along each FASTA header-identified sequence (chromosome, reference DNA segment or input insertion sequence) from 1 (the first base in the sequence) through N, where N is the Nth and last base in the sequence.  However, "start" and "end" are numbered slightly differently for insertion vs. chromosome positions.  For Columns E and F, "ins start" and "ins end" denote the positions of the first and last bases that are part of the insertions.  However, for Columns G and H, "chrom start" denotes the last chromosome position before the indel region begins (0 if the region begins before the start of the chromosome), whereas "chrom end" denotes the first chromosome position after the indel region ends (some number greater than N if the region ends after an N-base chromosome).

The format of the indel table and the notations used for sequence coordinates are based on Supplementary Table 10 of Kidd,J.M. et al. (2010) Characterization of missing human genome sequences and copy-number polymorphic insertions.  Nat. Methods, 7, 365-371.


	2.3  Reformatting Sequence Files for MARSHAL with the "sed" Command 
The user can enter the UNIX command "sed" (stream editor) to perform some of the relevant text replacements in input files, as follows:
$ sed 's/Abracadabra/Poof/g' input.txt > output.txt
This produces a file "output.txt" that is identical to "input.txt", except that each occurrence of "Abracadabra" is replaced with "Poof".  For example, replacing ";" (an alternative representation of the FASTA header) with ">" in the file "human.txt" for the new output file "human2.txt":
$ sed 's/;/>/g' human.txt > human2.txt
sed can also be used to change the accession numbers of insertions, e.g.: sed 's/$INS/$INS11/g' insertions.txt > insertions2.txt


3.  OBTAINING THE INSERTIONS AND INDEL TABLE FILE FROM ALIGNMENT OUTPUT: THE marshal_extract PROGRAM
A script called marshal_extract enables the user to extract the inserted sequences and indel table file from the output of pairwise alignment, in formats suitable for input into the main pipeline.  It requires two input files: (1) a FASTA file containing the query sequence; (2) a file containing a list of the segments that aligned.  For the latter, marshal_extract expects the user to have previously filtered out undesirable aligned segments (e.g. short paralogs) and sorted the alignment output by query start position.  It accepts a tab-delimited file with no header row and four columns:
Column A: Reference start position, for the relevant aligned segment in that row
Column B: Reference end position, for the relevant aligned segment in that row
Column C: Query start position, for the relevant aligned segment in that row
Column D: Query end position, for the relevant aligned segment in that row

Start and end coordinates are the positions of the first and last bases in the alignment.  For example, in two 10-base sequences, if bases 3-5 of the reference align with bases 7-9 of the query, and bases 1-2 of the reference align with bases 9-10 of the query, the alignment file should read as follows:
3	5	7	9
1	2	9	10

The script extracts the segments of the query sequence that did not align, and outputs two files: a secondary reference file (suffix ".fa") containing the insertion sequences, and an indel table file (suffix ".indels.txt") as described in Section 2.2.  marshal_extract is called as follows:
$ java marshal_extract query.fa alignedList.txt outputPrefix [flags]
where outputPrefix is the prefix of the output files, query.fa is the query sequence and alignedList.txt is the input list of the segments that aligned.

If heap size is too small, it can be changed by adding the -Xmx flag before "marshal_extract."

Program flags can be input in any order.  A complete list of the flags recognized by marshal_extract:
-a/--accession [accession code]: Accession string for the query sequence.  When this flag is used, the program searches each FASTA header in the query file for the accession code and only extracts the insertion sequences if this code is found in the header.  This flag enables utilization of a multi-sequence FASTA file in which only one sequence is the query.
-c/--chr [chromosome name]: Chromosome identifier.  Allows the user to specify the name of the chromosome that has been aligned, for recording in the indel table file.  Default is "chr1".
-m/--minsize [integer]: Minimum indel length.  Resets minimum recorded indel length to the next command line parameter.  Default minimum is 20.  Although any integer greater than zero will be accepted, a minimum length slightly lower than short read size is recommended.
-p [integer]: Ploidy.  Resets ploidy to the next command line parameter, for recording in the indel table file.  Default ploidy is 2.


4.  TRANSLATING POSITIONAL COORDINATES BETWEEN BUILDS
The MARSHAL software is bundled with scripts (translate_input, lift_sort, remap_sort, translate_new_build) for interfacing with UCSC's liftOver (http://genome.ucsc.edu/cgi-bin/hgLiftOver) and/or NCBI's Remap (http://www.ncbi.nlm.nih.gov/genome/tools/remap), tools for translating homologous positional coordinates between builds of the same genome or between genomes.  These scripts facilitate the translation of information in the primary reference positional columns (Columns G and H) of an indel table file.

If this section of the pipeline is utilized, the first step is to run translate_input:
$ java translate_input <indel table file> <output file name>
translate_input produces an output file containing the information in the chromosome and the primary reference positional columns, formatted as genome coordinate positions (chrN:AA-BB).  This file can be directly uploaded to either website.

After running liftOver, the end-user must save the conversion file and failure file for input into lift_sort, which identifies the success vs. failure of each mapping.  Run lift_sort as follows:
$ java lift_sort <input position file> <file with successful remaps> <file with unsuccessful remaps> <output file name>
The input position file is the same as the output file of translate_input.

After running Remap, the script remap_sort can be used to extract the success vs. failure information in a Remap mapping file.  remap_sort is run as follows:
$ java remap_sort <Remap mapping report> <output file name>

The output of either lift_sort or remap_sort is input to translate_new_build, which transforms the LiftOver or Remap output into tab-delimited format.  translate_new_build is run as follows:
$ java translate_new_build <input file> <output file name>
The output of this program can be copied and pasted directly into the indel table, as the chrom start and chrom end columns.


5.  CREATING THE INITIAL CHIMERIC REFERENCE WITH marshal_build AND RUNNING THE SHORT-READ ALIGNER
marshal_build creates the initial chimeric reference file for input into the short read aligner.  To increase the likelihood that short reads spanning the edges of the insertions will map, it adds the homologous flanking regions of the primary reference before and after each insertion, creating chimeric sequences that are included in the chimeric reference.  In doing so, it creates a number of temporary files that are automatically deleted at the end of the run.  The flanking regions of the input deletions are also appended to the primary reference as a separate sequence, to allow reads that span the edges of deleted segments to map to an appropriate homolog.  Flanks are of uniform length.

marshal_build is run as follows:
$ java marshal_build <primary reference> <indel table file> <output file name> <insertion sequence files> [-f <integer representing flank size>]
The -f flag allows the end-user to reset flank length from a default value of 100 bases.  Ideally, flanks are one base shorter than the longest short reads, so the number of aligning “edge” reads can be maximized and the number of reads that map exclusively to the flanks minimized.

MARSHAL allows the end-user flexibility in choosing the short-read aligner from the wide variety of externally available programs.  Since the short reads are input into the aligner unmodified, their format requirements are dependent on the alignment software's specification.  The specific aligner and settings can depend on the requirements of the individual dataset, so long as the alignment output is in SAM format.


6.  PROCESSING ALIGNMENT OUTPUT TO CALCULATE A COVERAGE CUTOFF: marshal_simulate, marshal_coverage
The evaluation of the indel regions as present or absent is based on the evaluation of each base as covered or uncovered.  The scripts marshal_simulate and marshal_coverage are intended to facilitate the estimation of appropriate coverage cutoffs.  marshal_simulate creates an indel table file containing a set of randomized indel regions along the primary reference, and marshal_coverage outputs coverage statistics for a given set of indel regions.

Assuming that the primary reference is similar to the short-read sequence and long indels suitable for MARSHAL analysis represent a minority of the genome, most of the bases in a randomly selected set of primary reference segments are expected to map to the short reads.  Thus, it can be assumed that the majority of the genome is covered.  The coverage statistics for a randomized set of indel regions, as produced by marshal_simulate, can provide a frame of reference for analyzing the input indels of known polymorphism.

marshal_simulate is run as follows:
$ java marshal_simulate <primary reference> <indel table file> <output file name> [flags]
Flags recognized by marshal_simulate:
-f [integer]: Resets flank size.  Only relevant when -x is used.  Default flank size 100.
-i [decimal] [decimal]: Resets log2 maximum region size, for each linear or logarithmic size increment.  Each use of -i can specify a separate increment and upper bound for that increment--the first decimal is the upper bound, the second decimal is the increment.  Default (5.0, 0.1), (17.6, 0.2).  I.e., logarithmic increment of 0.1 up to size 2^5.0, logarithmic increment of 0.2 up to size 2^17.6.
-lin: Sets the spacing of the sizes of the random regions to linear rather than logarithmic.  Default logarithmic spacing.
-min [decimal]: Resets log2 minimum region size.  Default 4.0.
-n [integer]: Resets the number of random regions created at each size.  Default 100.
-x: Excludes output regions that overlap with the flanks of the original indels.  Default off.

marshal_coverage is run as follows:
$ java marshal_coverage <SAM alignment file> <indel table file> [flags]
Optional flags:
-f/--flank_size [integer]: The size of each flanking region appended to the indels.  Default 100 bases.
-o/--out [file]: The name of the output file.  Default "basecoverage.txt".

basecoverage.txt, the output file of marshal_coverage, can be opened in a spreadsheet.  Each row represents a separate chromosome, with the names of the chromosomes in the first column.  Subsequent columns represent coverage values, from coverage of 0 onward.  Thus, for a given chromosome, the cell in the second column represents the number of bases in the indel regions of that chromosome which received no coverage; the cell in the third column represents the number of bases in the indel regions that received coverage of 1; the cell in the second column represents the number of bases in the indel regions that received coverage of 2; and so on.  These data points can be graphed.


7.  CALLING INDELS WITH marshal_analyze
After short-read alignment, the program output is analyzed.  MARSHAL requires the short-read map to be formatted in SAM, and utilizes the SAM file headers to obtain information about the sequences.  marshal_analyze converts the SAM output into an intermediate binary format and performs the main indel analysis.  The output file is an indel table with at least one additional column--whether each index's presence or absence matches the primary reference.  This column reads 0 if the indel call is the same as primary reference, 1 if it is different from the primary reference.  In other words, insertions to the primary reference are "0" if absent and "1" if present; deletions from the primary reference are "0" if the sequence is present and "1" if it is absent.

In -v (--verbose) mode, which is not substantially slower or more memory-intensive, three further columns are printed.  In order from left to right, they contain the information about: (1) the number of bases in each indel region called as "covered", (2) the total number of bases in the indel region, (3) the total number of short read bases that mapped to the region.

marshal_analyze is run as follows:
$ java marshal_analyze <SAM alignment file> <indel table file> [flags]
Optional flags:
-b/--min_base [integer]: The minimum coverage of a reference sequence base for it to be called "covered".  Default 1.
-f/--flank_size [integer]: The size of each flanking region appended to the indels.  Default 100 bases.
-o/--out [file]: The name of the output file.  Default "calledindels.txt".
-s/--min_seg [decimal]: The proportion of bases (0-1) in a segment that must be covered for the segment to be called present.  Default 0.7.
-v/--verbose: Mode that prints additional data for each segment (number of covered bases, total number of bases in the indel region, and total number of short read bases that mapped to the region).  Default off.


8.  CREATING A FINAL CHIMERA WITH marshal_integrate
marshal_integrate outputs another type of chimeric reference, a FASTA sequence in which the indels from an indel table file are integrated into their homologous positions with respect to a primary reference.  marshal_integrate may be used to create a modified chimeric reference for assembling additional sequences without MARSHAL-type indel analysis.

marshal_integrate is run as follows:
$ java marshal_integrate <primary reference> <indel table file> <output file name> <insertion sequence files>