Open preprint reviews by Aaron Darling

NxRepair: Error correction in de novo sequence assembly using Nextera mate pairs

Rebecca R Murphy, Jared M O', Connell, Anthony J Cox, Ole B Schulz-Trieglaff

The manuscript describes an algorithm and corresponding software implementation to detect misassemblies in de novo genome assemblies using mate pair sequence data. de novo genome assembly is a highly active area of research, driven by ongoing advances in sequencing technologies. Many of the current generation of assemblers are prone to misassembling regions of genomes that contain high identity repetitive elements, especially those that are at or above the read length or in some cases the size of k-mers used for de bruijn graph-based algorithms. It is exciting to see new efforts to solve these problems. The context in which a tool such as this would be used could be better introduced. One case is when the initial draft assembly algorithm is unable to incorporate mate-pair information, and a subsequent scaffolding step is to be carried out. In this situation, if the initial assembly contains errors, the scaffolder will be unable to accurately scaffold the assembly with mate-pair data unless the errors are detected and corrected prior to scaffolding. This approach is used for example in the A5 and A5-miseq pipelines. However, it is possible in principle to construct an assembler which leverages the mate pair information directly during the contigging process to avoid such errors in the first place. For such assemblers, a tool like this may not provide any added utility. This kind of information on the scope of applicability could be better introduced.

Experimental Design

Overall I think enough of the algorithm was described to understand how it works, however I have a number of questions about the rationale for the design of the method which I have detailed below in the General Comments section. Unfortunately I’m afraid that the design of the accuracy evaluation experiment leaves us with little idea of the method’s expected behavior on real datasets. This is because the method appears to have been tuned (T parameter, ROC curves) on the same data for which accuracy is reported. If this is not the case, then the manuscript text needs to be revised to clarify the issue. I am also concerned that no effort has been made to compare the method’s performance to previous work to solve the same problem, for example the A5qc module used by the A5 pipeline to detect and correct misassemblies. It is erroneously stated in the introduction that no other software is optimized to work with mate-pair data, yet A5qc does and in fact the use of mate-pairs to detect misassembled contigs was the main motivating use case in its development. The A5 manuscript (Tritt et al 2012 PLoS ONE) discusses its use with mate pair datasets explicitly. A5qc is by no means perfect and is likely to be a bit challenging to use independently of the rest of the pipeline, but that is no excuse for inaccurately representing its application scope and neglecting to benchmark it. Indeed, I suspect A5qc may be able to detect some misassemblies that the present method can not, because the present method is limited to identifying misassemblies where read pairs map entirely within a single contig, whereas A5qc can identify misassemblies involving read pairs that map to different contigs. I would also guess that the present method may be more sensitive than A5qc for the within-contig misassemblies. There could be other tools that are relevant for comparative benchmarking; I have not kept up with the literature in this area.

Validity of the findings

The main issue potentially impacting the validity of the findings is whether the test data were also used for selection of the T parameter. The test datasets are relatively limited, comprising less than 10 genomes, which leaves a non-negligible potential for parameter overfitting.

Note that I did not (yet) evaluate the software itself by running it on my own datasets, in light of the other issues that I think should be addressed first.

General comments for the author

The following are specific notes that I made while reading the manuscript:

Abstract: de bruijn assemblers are clearly the most prevalent for Illumina data, but is the scope of applicability really limited to de bruijn assemblers?

Introduction, paragraph 2: The Tritt et al 2012 citation is much more appropriate for A5 in this context, as it describes a technique to use mate pair reads to detect and correct assembly errors. The Coil et al 2014 paper does not include details of that technique, nor does it contain any revisions of that technique.

“Statistical analysis of mate pair insert sizes” paragraph 1: The use of a uniform to model the mate-pair noise makes sense, but it’s not clear how appropriate the normal would be for nextera, unless a tight fragment size range has been gel extracted. If data from gel extraction is expected it should be mentioned, and some further discussion is warranted in either case.

same section, paragraph 2: the definition of spanning is somewhat ambiguous. do the read pairs spanning site i include only read pairs with one read entirely before i and the other entirely after? or are pairs where one or both reads actually include i considered ‘spanning’? and for a region to be spanned, which of these definitions is correct?

same section, eqn 1: an extra set of P’s in the second terms of the numerator & denom would help improve clarity, also there’s no need for the first equation – why not just define the notation for the prior \pi_x up-front and give eqn 2?

same section, eqn 2: The same insert size on a contig of different lengths would give different values for this expression, because the uniform distribution has been defined only over the contig length. Is that intentional? Desirable? The rationale for this decision is not obvious and some explanation is in order.

same section, paragraph 3: It seems like the same process used to estimate the insert size distribution parameters would naturally also yield an estimate of \pi_0. Is that used? I have seen mate pair error levels as high as 20% in some libraries. It’s highly sensitive to the lab conditions, so would be ideal to estimate from the data rather than use a fixed value.

same section, eqn 4: what about circular molecules, e.g. plasmids? a large number of pairs near the contig ends will appear to be in the wrong orientation…

same section, eqn 5: why do we need to calculate these contig-specific distributions? is it to deal with local deviation in coverage? or is it because the distributions used in eqn. 1 have a contig-local domain? or something else? some explanation is needed.

same section, eqn 6: it’s a bit strange that this approach starts out with the Bayesian framework (e.g. the use of a prior probability) then goes into a frequentist framework for the hypothesis test. One way to keep it Bayesian would be to set up a Bayes factor of the competing hypotheses of no misassembly vs. one or more misassemblies in the target window. then the threshold T could be applied to the Bayes factors instead of the standard deviation.

section “Global Assembly Parameters”, eqn 7: the notation for MAD is not quite right, as it suggests taking the median of a single data point in Y.

same section, eqn 7: at high enough levels of noise in the mate pair library, this approach is likely to overestimate the true insert size to some extent. The 30kbp threshold will help mitigate the problem, and the later steps to identify extreme divergence from a contig’s background will also reduce the noise, but this likely comes with a cost in sensitivity for misassembly detection.

same section eqn 8: why use this approach instead of one of the other common approaches to calculate standard deviation?

section “Interval Sequence Tree construction”: How is the interval sequence tree used in this algorithm? it’s not clear at what step in the breakpoint detection the IST gets queried. This should either be explained or the whole section omitted.

section “ROC plots” eqn 10: The FPR here is sensitive to how finely the contigs are sliced into windows because it includes true negatives.

section “Workflow pipeline”: nice to see the precision here: exact version & command line

same section: what version of QUAST? they do change a little from one release to the next.

same section, bwa commands: again, versions would be good, even better would be a script that reproduces all the results!

same section commands: you might want to indicate which version of NxRepair produced the results described here in case you ever fix bugs or make improvements…

show less

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

Heng Li

The pdf of this review as well as associated scripts are available from here:

This is a review of Heng Li “Aligning sequence reads, clone sequences and assembly contigs with BWA­MEM”, v2, available from

The review was written by Aaron Darling and presented at the Bioinformatics Lunch Bytes meeting at the University of Technology Sydney.

Review of the manuscript:

The manuscript introduces the “mem” module of bwa, a read mapping and alignment program that is widely used in the high throughput sequence analysis field. The manuscript is written in the bioinformatics app note format, which has a 2 page limit. Even though the page limit is highly restrictive the manuscript manages to include some helpful introduction and discussion. I found the methods section to be lacking important detail though (see below).

Major comments:

1. Why does the reseeding approach work? The manuscript does not give any insight into why this helps improve accuracy. The issue seems to be that the longest matches (the SMEM) for a query sequence might not be part of the true alignment. By allowing shorter matches to be alignment seeds it can still find the true alignment in the occasional case where the longest MEM matches the wrong place. That makes sense, but why require
higher multiplicity and restrict only to matches covering the middle base in the SMEM? Also, why is the reseeding threshold 28nt? I can wager a guess that this seemed like a good speed/accuracy tradeoff on a particular dataset but it would be nice to have some idea of whether this is a good heuristic for a wide range of datasets or if it is really tuned for, e.g. human genomes. Sequence analysis has too many magic numbers already. Finally, how frequently is the SMEM not part of the true chain in real data? One way to evaluate this would be to test accuracy with the reseeding disabled.

2. How does the seed ranking work? The manuscript says “We rank a seed by length of chain it belongs to and then by the seed length” but this plain language description lacks the precision required to reproduce the method. Is chain length defined as the span of the chain in the query, the reference sequence, the sum of constituent seed lengths or something else? This should be explained.

3. The program evaluation is quite limited, in that only a single substitution & indel rate is used to compare aligners. Comparing a wide range of sequence divergences is a pretty substantial computational undertaking so it’s ok to keep the benchmarking simple, but the interpretation of results needs to be strongly qualified in the manuscript. I am specifically concerned that the SMEM based alignment seeding strategy will fall apart at higher
substitution or indel error rates because it depends on finding long exact matches. What happens to reads with dense polymorphisms or sequencing data generated on instruments with high error rates such as the Pacific Biosciences instrument?

4. There is one glaring omission in the comparison set of aligners and that is LAST ( LAST can do some clever things such as quality­ aware gapped alignment scoring and approximate Bayesian aligned read pairing, in addition to long sequence alignment. If there is some reason that a comparison to LAST isn’t possible or appropriate it should be stated so rather than leaving us to wonder about it.

Minor comments:

1. There are numerous minor grammatical and spelling errors which could be easily fixed and would improve readability

2. There seems to be a single mismatch penalty used for the gapped alignment, however, some nucleotide substitutions happen much more frequently than others. At a minimum the ability to treat transitions differently than transversions would be nice, though being able to use a 4x4 substitution matrix (and/or estimating it directly from the data) would be ideal.

3. No mention is made of whether or how the per­ base quality estimates produced by sequencing instruments are used during gapped alignment.

4. The value of the “SW rescuing” approach is unclear. The manuscript says the 2nd best SW score is recorded ­­ what happens when there are many equally good alignments? The alignment with the 2nd best score might be minimally different. Or is this referring to the best alignment to a different region of the reference sequence? If so, the explanation needs to be improved.

5. What is a “standing seed” in section 3 paragraph 2?

6. The discussion comment that “only bwa mem scales well to large genomes” is misleading since there is a well developed body of literature on methods and software for pairwise and multiple genome alignment covering large genomes and some of them scale very well to large genomes. bwa mem and nucmer are not the only two programs in this space.

Other comments:

1. The description of how pairing scores are calculated was pleasantly understandable (to me, anyway) and succinct. One question: if each read in a pair has many possible alignment locations then a large number of possible combinations arise. Are all of these scored or does bwa bound the search somehow?

2. “is more performant” ­> performs better

Review of the software:

I obtained bwa­0.7.5a from­bwa/files/ After downloading I used the following commands to build and run the program:

>tar xjf bwa­0.7.5a.tar.bz2
>cd bwa­0.7.5a/
>cp bwa $HOME/bin/

Running the software with no arguments produces a help message with enough detail to get started, and running the mem subcommand without arguments gives all the options for bwa mem alignment. I was pleased to see the fine grain control over the alignment parameters available for power users, though I’m sure 99% of users will stick with defaults here because it can be very hard to determine sensible settings for these parameters.

One parameter question: the manuscript describes a Z dropoff which is like BLAST’s X­drop but allows the gap in either the reference or query to be penalty­ free. The program help for bwa mem has:

-d INT off­-diagonal X­-dropoff [100]

Is this the same thing? If so it would help to improve the variable naming consistency.

Running bwa:

>bwa index C_botulinum_F_Langeland.fasta
>bwa mem C_botulinum_F_Langeland.fasta cbot_mem1.fq cbot_mem2.fq > cbot_mem.sam

Based on my comments above, I decided to run my own comparison of bwa to lastal with the approximate Bayesian paired read mapping module. To do this I simulated paired end reads from the Clostridium botulinum F Langeland genome with artsim:

>art_illumina --­­paired -­l 100 ­-f 1 ­-m 400 ­-s 50 ­-i C_botulinum_F_Langeland.fasta ­-o cbot_mem

The above command generates about 1x simulated coverage of the C. botulinum reference. I then ran bwa mem as above. last­193 was run as (paramters taken verbatim from their documentation at­pair­probs.html):

>time lastal ­-i1 ­-e120 ­-Q1 cbot_last cbot_mem1.fq > cbot_lastal1.maf
>time lastal ­-i1 ­-e120 ­-Q1 cbot_last cbot_mem2.fq > cbot_lastal2.maf
>time last­-pair­ cbot_lastal1.maf cbot_lastal2.maf > lastal_pairs.maf
>time maf­ sam lastal_pairs.maf > lastal_pairs.sam

accuracy was evaluated using a short perl script (see file bwa_mem_eval.tar.bz2 in figshare):

>./ cbot_mem.sam < cbot_mem.both.aln
precision: 0.992127563556135 recall: 0.998284561049445

>./ lastal_pairs.sam < cbot_mem.both.aln
precision: 1 recall: 0.988837162737148

So lastal has remarkably good precision but aligns slightly fewer of the reads with the recommended settings than bwa mem. If we turn these numbers into an F1 score ( we get lastal: 0.98883 bwa­mem: 0.99519 so if you like the balance of precision and recall provided by F1, bwa­ mem seems like a good choice. For things like identifying rare variants, I will probably stick with lastal when possible since the
occasional misalignments could potentially end up creating a lot of rare variants. It might also be possible to improve lastal’s recall with a lower e­value threshold, I did not explore this.

One difficulty I encountered in benchmarking these is an apparent bug in artsim where it reports the location of one read in each pair seemingly at random, while the other read’s location was given correctly. I introduced some slop into the accuracy script to work around this, in particular that an alignment would be called correct if the mapped location was within 550nt of the true location.

Review of the code:

The program is implemented in C, with a minimalist but functional Makefile provided. I only evaluated the bwamem part of the code. The header file bwamem.h is well documented, with understandable comments describing what the variables, structures, and (most) function interfaces are doing. Use of doxygen structured documentation is nice. The implementation in bwamem.c has much less code commenting. Superficially it seems relatively modular with sensible enough function and variable names that a seasoned and motivated C coder could understand and modify it. I didn’t notice any obvious coding faux pas such as large copied blocks of code, although a few smaller instances could be refactored into functions to remove duplication. In a thorough code review I would probably suggest denser commenting and unpacking or at least description of gems like this:

>if (which && ((p­>cigar[p­>n_cigar­1]&0xf) == 4 || (p­>cigar[p­>n_cigar­1]&0xf) == 3)) qb += p­>cigar[p­>n_cigar­1]>>4;

One little annoyance is the use of hard­coded hexadecimal constants in mem_aln2sam(). While their meaning may be obvious to the creator of the SAM format it is not very readable for the rest of us. On the whole though it looks pretty good. Better than the code I write (though I can not claim to be a model coder)

show less