Completed on 6 Apr 2018 by Matthew Stephens.
The main contribution of this paper
is to introduce an intuitive, fast, and relatively simple algorithm
for fitting the "Structure" (or "Admixture") model from population genetics.
I found the paper very interesting. I also congratulate the authors
on the R package, which is generally well documented and
I was able to get running easily.
However, I also found aspects of the presentation rather confusing
(starting with the title!)
and the comparisons need some work to make this publishable.
Major Comments:
1. Presentation:
The paper calls this a "nonparametric" estimator of population structure,
that "unifies" admixture models (AMs) and PCA. I think this misses the mark on
many fronts. Effectively the algorithms described here fit the AM,
using a least squares formulation rather than the usual binomial likelihood.
For example, note that the parameters P and Q being estimated here have the same interpretation,
and satisfy the same constraints, as the P and Q parameters in the AM.
There is nothing new on unification with PCA here - the unifying view
that both PCA and AM can be viewed as matrix factorization
problems has already been pointed out by previous authors (eg Engelhardt and Stephens,
cited here). The authors should remove the expository material related to the "unification"
(none of which is key to the paper), and just cite previous work as needed.
This will provide them the chance to really emphasize what I think
is the true (and important) contribution
here, which is a novel (and fast, and simple) algorithm for fitting the
AM by least squares.
(I understand that, computationally,
PCA plays a role in the actual algorithm presented. But its use here is
simply a computational device, as a step toward fitting the AM.
Claiming that this unifies the two is just confusing.)
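(For reference, the shared matrix-factorization view amounts to, in my
notation:

    \text{PCA:} \quad X \approx LF, \quad L, F \ \text{unconstrained};
    \qquad \text{AM:} \quad E(X) = 2PQ, \quad P \in [0,1]^{m \times d}, \quad
    Q \geq 0, \ \mathbf{1}^\top Q = \mathbf{1}^\top,

i.e. the two differ only in the constraints placed on the factors.)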
2. Comparisons
Given the above comments, it is natural to ask how the algorithm here
compares with previous algorithms. And indeed, the current Results section
focusses on exactly this question, in terms of both estimates of the parameters
and in terms of speed. But there is one part missing: an assessment of
how the algorithms compare in the value of the objective functions achieved.
This is complicated slightly by there being two different objective functions:
one based on the Frobenius norm ||X-2PQ||_F and another based on the
binomial likelihood. But this is easily addressed by examining
both objective functions.
It would be particularly interesting to see whether the algorithm here can actually
achieve a better binomial likelihood value than the approaches that are
directly based on optimizing the binomial likelihood. But in any case,
one would hope that the algorithm here will be better at least in Frobenius
norm.
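For concreteness, both objectives are cheap to evaluate from any fitted
P and Q. A minimal sketch in R (my own code, not from the package; it
assumes X is the m x n genotype matrix with entries in {0,1,2} and that
PQ has entries strictly inside (0,1)):

    # Frobenius-norm objective ||X - 2PQ||_F
    frob_obj <- function(X, P, Q) {
      norm(X - 2 * (P %*% Q), type = "F")
    }

    # Binomial log-likelihood, treating X_ij ~ Binomial(2, (PQ)_ij)
    binom_loglik <- function(X, P, Q) {
      PQ <- P %*% Q
      sum(dbinom(as.vector(X), size = 2, prob = as.vector(PQ), log = TRUE))
    }

Reporting both numbers for every method would make the comparisons much
more informative.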
3. The final algorithm is a two-stage algorithm that first does a dimension
reduction (yielding Fhat) and then uses a "projected-ALS"
algorithm to minimize ||Fhat - PQ||_F subject to
constraints on P and Q.
My comment here is that the paper devotes a lot of space and statistical
discussion to the first step, obtaining Fhat,
and yet the need for it is far from obvious.
Why is this better than directly applying the projected-ALS algorithm to
minimize ||X-2PQ||_F, which already provides a dimension reduction in itself?
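To be concrete about what I mean by the direct approach, here is a rough
sketch (my own code, not the authors' implementation; in particular the
projection of the columns of Q onto the simplex is done crudely by
truncation and renormalization):

    # Direct projected-ALS on X: alternate exact least-squares updates for
    # P and Q, projecting each back onto its constraint set after every update.
    projected_als_direct <- function(X, d, n_iter = 100) {
      m <- nrow(X); n <- ncol(X)
      P <- matrix(runif(m * d), m, d)              # allele frequencies in [0,1]
      Q <- matrix(runif(d * n), d, n)
      Q <- sweep(Q, 2, colSums(Q), "/")            # admixture columns sum to 1
      for (it in seq_len(n_iter)) {
        P <- (X %*% t(Q)) %*% solve(Q %*% t(Q)) / 2  # LS update for P
        P <- pmin(pmax(P, 0), 1)                     # project onto [0,1]
        Q <- solve(t(P) %*% P, t(P) %*% X) / 2       # LS update for Q
        Q <- pmax(Q, 0)                              # crude simplex projection:
        Q <- sweep(Q, 2, pmax(colSums(Q), 1e-12), "/")  # truncate, renormalize
      }
      list(P = P, Q = Q)
    }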
I did some very preliminary numerical simulations
using the authors' software, which you can see at
https://stephens999.github.io/misc/ALStruct.html
In these limited investigations I found the two
approaches gave similar values for ||X-2PQ||_F, but
the dimension reduction step seemed to improve speed quite a bit in one simulation.
It was not clear to me whether that is expected generally, and from the paper
it does not come across as the main motivation for the dimension
reduction.
I think the authors need to be clear about why the dimension
reduction is needed - if it is primarily computational
speed then a lot of the discussion given seems unnecessary,
and needs replacing with a demonstration that the speed is indeed improved.
If it is to improve estimation accuracy compared with directly
minimizing ||X-2PQ||_F then the improvement in accuracy should be
demonstrated numerically and some additional explanation for
why it improves accuracy given.
Other comments:
-p5: I suggest presenting the admixture model as being based on E(X) = 2PQ.
This leads naturally to a least-squares formulation,
which, I believe, is what your algorithm essentially targets.
-p6: the description here needs rewriting. You do not
assume Fhat = PQ, because you use a least squares formulation (15)
that does not impose this assumption.
And describing (5) as a constraint is unhelpful because you
do not know Q so it cannot be applied! If anything, (5) is an assumption
rather than a constraint. But I'm not sure you even need it.
My suggestion is to rephrase your approach. Motivate it by introducing
the following problem: minimize ||X-2PQ||_F subject to the
constraints (2)-(4). Then outline how this leads to your two-step algorithm,
which first replaces X with an estimate of its expectation 2PQ (via Fhat),
and then minimizes ||Fhat - PQ||_F.
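In symbols, the suggested motivation would be something like this (my
notation; the constraint numbers are the paper's):

    \min_{P,Q} \|X - 2PQ\|_F^2 \quad \text{s.t. (2)-(4)}
    \quad \longrightarrow \quad
    \min_{P,Q} \|\hat{F} - PQ\|_F^2 \quad \text{s.t. (2)-(4)},

where \hat{F} is the first-step estimate of PQ formed from X.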
-p10: what is variance(f_{ij})? Isn't f a constant in your treatment,
not a random variable? I'm confused.
-p13: you should make it clear that convergence to a stationary point
is far from a guarantee that the algorithm converges to a good solution -
the surface can be quite multi-modal and badly behaved.
- Fig 7: Focussing on just the first 2 rows of Q seems dangerous.
Why not just show the usual structure barplots here, which summarize
all of Q?
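For example, in R, with Q the d x n matrix of admixture proportions
(columns summing to one), the standard plot is a one-liner:

    # Stacked bars: one bar per individual, segments = admixture proportions
    barplot(Q, col = seq_len(nrow(Q)), border = NA, space = 0,
            xlab = "Individual", ylab = "Admixture proportion")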
-p24: The sentences "As such... constraints" and "It is a simple, but central..."
seem like they relate to a different paper! The replacement of
the equality constraint (13) with a least squares criterion (15) makes both
these statements false, surely?
Details:
- p4 "not supported..." there are probabilistic interpretations of PCA
(eg Tipping and Bishop) that seem to contradict this.
- p5: I find introducing F as the "individual-specific allele frequencies"
confusing. I suggest first introducing F as the binomial parameter,
and then - if you need to -
saying that F can be thought of as the "individual-specific allele frequencies"
(although I'm not sure it is that helpful?)
-p5: the footnote appears false; PSD includes more general
non-uniform priors on both P and Q (eg see their equation 7)
- p6: "frequentist likelihood-based" is confusing. If you mean maximum
likelihood, just say maximum likelihood (which is not an especially
frequentist procedure.) By the same token "in the frequentist setting"
can be removed.
-p12: "In the case of anchor SNPs" - here you should rephrase to make
clear that you are defining what an anchor SNP is.
-p14: "Unconstrained ALS" - this is not unconstrained! How about "Projected ALS"?
Software:
In the function alstructure you have parameters P_init and Q_init that are not used.
It says "Only available for cALS method." but cALS is not available as a method
in this function? (I think it would be useful to be able to pass in initial values like this.)