{"title": "Linear regression without correspondence", "book": "Advances in Neural Information Processing Systems", "page_first": 1531, "page_last": 1540, "abstract": "This article considers algorithmic and statistical aspects of linear regression when the correspondence between the covariates and the responses is unknown. First, a fully polynomial-time approximation scheme is given for the natural least squares optimization problem in any constant dimension. Next, in an average-case and noise-free setting where the responses exactly correspond to a linear function of i.i.d. draws from a standard multivariate normal distribution, an efficient algorithm based on lattice basis reduction is shown to exactly recover the unknown linear function in arbitrary dimension. Finally, lower bounds on the signal-to-noise ratio are established for approximate recovery of the unknown linear function by any estimator.", "full_text": "Linear regression without correspondence\n\nDaniel Hsu\n\nColumbia University\n\nNew York, NY\n\ndjhsu@cs.columbia.edu\n\nkshi@cs.columbia.edu\n\nKevin Shi\n\nColumbia University\n\nNew York, NY\n\nXiaorui Sun\n\nMicrosoft Research\n\nRedmond, WA\n\nxiaoruisun@cs.columbia.edu\n\nAbstract\n\nThis article considers algorithmic and statistical aspects of linear regression when\nthe correspondence between the covariates and the responses is unknown. First, a\nfully polynomial-time approximation scheme is given for the natural least squares\noptimization problem in any constant dimension. Next, in an average-case and\nnoise-free setting where the responses exactly correspond to a linear function of\ni.i.d. draws from a standard multivariate normal distribution, an ef\ufb01cient algorithm\nbased on lattice basis reduction is shown to exactly recover the unknown linear\nfunction in arbitrary dimension. 
Finally, lower bounds on the signal-to-noise ratio are established for approximate recovery of the unknown linear function by any estimator.

1 Introduction

Consider the problem of recovering an unknown vector $\bar w \in \mathbb{R}^d$ from noisy linear measurements when the correspondence between the measurement vectors and the measurements themselves is unknown. The measurement vectors (i.e., covariates) from $\mathbb{R}^d$ are denoted by $x_1, x_2, \dots, x_n$; for each $i \in [n] := \{1, 2, \dots, n\}$, the $i$-th measurement (i.e., response) $y_i$ is obtained using $x_{\bar\pi(i)}$:

  $y_i = \bar w^\top x_{\bar\pi(i)} + \varepsilon_i$, for $i \in [n]$.   (1)

Above, $\bar\pi$ is an unknown permutation on $[n]$, and $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ are unknown measurement errors. This problem, which has been called unlabeled sensing [22], linear regression with an unknown permutation [18], and linear regression with shuffled labels [1], arises in many settings; see the aforementioned references for more details. In short, sensing limitations may create ambiguity in, or even completely lose, the ordering of measurements. The problem is also interesting because the missing correspondence makes an otherwise well-understood problem into one with very different computational and statistical properties.

Prior works. Unnikrishnan et al. [22] study conditions on the measurement vectors that permit recovery of any target vector $\bar w$ under noiseless measurements. They show that when the entries of the $x_i$ are drawn i.i.d. from a continuous distribution and $n \geq 2d$, then almost surely, every vector $\bar w \in \mathbb{R}^d$ is uniquely determined by noiseless correspondence-free measurements as in (1). (Under noisy measurements, it is shown that $\bar w$ can be recovered when an appropriate signal-to-noise ratio tends to infinity.) It is also shown that $n \geq 2d$ is necessary for such a guarantee that holds for all vectors $\bar w \in \mathbb{R}^d$.

Pananjady et al.
[18] study statistical and computational limits on recovering the unknown permutation $\bar\pi$. On the statistical front, they consider necessary and sufficient conditions on the signal-to-noise ratio $\mathrm{SNR} := \|\bar w\|_2^2/\sigma^2$ when the measurement errors $(\varepsilon_i)_{i=1}^n$ are i.i.d. draws from the normal distribution $N(0, \sigma^2)$ and the measurement vectors $(x_i)_{i=1}^n$ are i.i.d. draws from the standard multivariate normal distribution $N(0, I_d)$. Roughly speaking, exact recovery of $\bar\pi$ is possible via maximum likelihood when $\mathrm{SNR} \geq n^c$ for some absolute constant $c > 0$, and approximate recovery is impossible for any method when $\mathrm{SNR} \leq n^{c'}$ for some other absolute constant $c' > 0$. On the computational front, they show that the least squares problem (which is equivalent to the maximum likelihood problem)

  $\min_{w,\pi} \sum_{i=1}^n (w^\top x_{\pi(i)} - y_i)^2$   (2)

given arbitrary $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and $y_1, y_2, \dots, y_n \in \mathbb{R}$ is NP-hard when $d = \Omega(n)$,[1] but admits a polynomial-time algorithm (in fact, an $O(n \log n)$-time algorithm based on sorting) when $d = 1$. Abid et al. [1] observe that the maximum likelihood estimator can be inconsistent for estimating $\bar w$ in certain settings (including the normal setting of Pananjady et al. [18], with SNR fixed but $n \to \infty$). One of the alternative estimators they suggest is consistent under additional assumptions in dimension $d = 1$. Elhami et al. [8] give an $O(dn^{d+1})$-time algorithm that, in dimension $d = 2$, is guaranteed to approximately recover $\bar w$ when the measurement vectors are chosen in a very particular way from the unit circle and the measurement errors are uniformly bounded.

[31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.]

Contributions. We make progress on both computational and statistical aspects of the problem.

1.
We give an approximation algorithm for the least squares problem from (2) that, for any given $(x_i)_{i=1}^n$, $(y_i)_{i=1}^n$, and $\epsilon \in (0,1)$, returns a solution with objective value at most $1+\epsilon$ times that of the minimum, in time $(n/\epsilon)^{O(d)}$. This is a fully polynomial-time approximation scheme for any constant dimension.

2. We give an algorithm that exactly recovers $\bar w$ in the measurement model from (1), under the assumption that there are no measurement errors and the covariates $(x_i)_{i=1}^n$ are i.i.d. draws from $N(0, I_d)$. The algorithm, which is based on a reduction to a lattice problem and employs the lattice basis reduction algorithm of Lenstra et al. [16], runs in $\mathrm{poly}(n, d)$ time when the covariate vectors $(x_i)_{i=1}^n$ and target vector $\bar w$ are appropriately quantized. This result may also be regarded as a “for each”-type guarantee for exactly recovering a fixed vector $\bar w$, which complements the “for all”-type results of Unnikrishnan et al. [22] concerning the number of measurement vectors needed for recovering all possible vectors.

3. We show that in the measurement model from (1), where the measurement errors are i.i.d. draws from $N(0, \sigma^2)$ and the covariate vectors are i.i.d. draws from $N(0, I_d)$, no algorithm can approximately recover $\bar w$ unless $\mathrm{SNR} \geq C \min\{1, d/\log\log(n)\}$ for some absolute constant $C > 0$. We also show that when the covariate vectors are i.i.d. draws from the uniform distribution on $[-1/2, 1/2]^d$, then approximate recovery is impossible unless $\mathrm{SNR} \geq C'$ for some other absolute constant $C' > 0$.

Our algorithms are not meant for practical deployment, but instead are intended to shed light on the computational difficulty of the least squares problem and the average-case recovery problem. Indeed, note that a naïve brute-force search over permutations requires time Ω(n!)
= $n^{\Omega(n)}$, and the only other previous algorithms (already discussed above) were restricted to $d = 1$ [18] or only had some form of approximation guarantee when $d = 2$ [8]. We are not aware of previous algorithms for the average-case problem in general dimension $d$.[2]

Our lower bounds on SNR stand in contrast to what is achievable in the classical linear regression model (where the covariate/response correspondence is known): in that model, the SNR requirement for approximately recovering $\bar w$ scales as $d/n$, and hence the problem becomes easier with $n$. The lack of correspondence thus drastically changes the difficulty of the problem.

2 Approximation algorithm for the least squares problem

In this section, we consider the least squares problem from Equation (2). The inputs are an arbitrary matrix $X = [x_1|x_2|\cdots|x_n]^\top \in \mathbb{R}^{n\times d}$ and an arbitrary vector $y = (y_1, y_2, \dots, y_n)^\top \in \mathbb{R}^n$, and the goal is to find a vector $w \in \mathbb{R}^d$ and a permutation matrix $\Pi \in \mathcal{P}_n$ (where $\mathcal{P}_n$ denotes the set of $n\times n$ permutation matrices[3]) to minimize $\|Xw - \Pi^\top y\|_2^2$. This problem is NP-hard in the case where $d = \Omega(n)$ [18] (see also Appendix A). We give an approximation scheme that, for any $\epsilon \in (0,1)$, returns a $(1+\epsilon)$-approximation in time $(n/\epsilon)^{O(k)} + \mathrm{poly}(n, d)$, where $k := \mathrm{rank}(X) \leq \min\{n, d\}$.

[1] Pananjady et al. [18] prove that PARTITION reduces to the problem of deciding if the optimal value of (2) is zero or non-zero. Note that PARTITION is weakly, but not strongly, NP-hard: it admits a pseudo-polynomial-time algorithm [10, Section 4.2]. In Appendix A, we prove that the least squares problem is strongly NP-hard by reduction from 3-PARTITION (which is strongly NP-complete [10, Section 4.2.2]).

[2] A recent algorithm of Pananjady et al. [19] exploits a similar average-case setting but only for a somewhat easier variant of the problem where more information about the unknown correspondence is provided.

Algorithm 1 Approximation algorithm for the least squares problem
input Covariate matrix $X = [x_1|x_2|\cdots|x_n]^\top \in \mathbb{R}^{n\times k}$; response vector $y = (y_1, y_2, \dots, y_n)^\top \in \mathbb{R}^n$; approximation parameter $\epsilon \in (0,1)$.
assume $X^\top X = I_k$.
output Weight vector $\hat w \in \mathbb{R}^k$ and permutation matrix $\hat\Pi \in \mathcal{P}_n$.
1: Run the “Row Sampling” algorithm with input matrix $X$ to obtain a matrix $S \in \mathbb{R}^{r\times n}$ with $r = 4k$.
2: Let $\mathcal{B}$ be the set of vectors $b = (b_1, b_2, \dots, b_n)^\top \in \mathbb{R}^n$ satisfying the following: for each $i \in [n]$,
  • if the $i$-th column of $S$ is all zeros, then $b_i = 0$;
  • otherwise, $b_i \in \{y_1, y_2, \dots, y_n\}$.
3: Let $c := 1 + 4(1 + \sqrt{n/(4k)})^2$.
4: for each $b \in \mathcal{B}$ do
5:  Compute $\tilde w_b \in \arg\min_{w\in\mathbb{R}^k} \|S(Xw - b)\|_2^2$, and let $r_b := \min_{\Pi\in\mathcal{P}_n} \|X\tilde w_b - \Pi^\top y\|_2^2$.
6:  Construct a $\sqrt{\epsilon r_b/c}$-net $N_b$ for the Euclidean ball of radius $\sqrt{c\,r_b}$ around $\tilde w_b$, so that for each $v \in \mathbb{R}^k$ with $\|v - \tilde w_b\|_2 \leq \sqrt{c\,r_b}$, there exists $v' \in N_b$ such that $\|v - v'\|_2 \leq \sqrt{\epsilon r_b/c}$.
7: end for
8: return $\hat w \in \arg\min_{w \in \bigcup_{b\in\mathcal{B}} N_b}\, \min_{\Pi\in\mathcal{P}_n} \|Xw - \Pi^\top y\|_2^2$ and $\hat\Pi \in \arg\min_{\Pi\in\mathcal{P}_n} \|X\hat w - \Pi^\top y\|_2^2$.

We assume without loss of generality that $X \in \mathbb{R}^{n\times k}$ and $X^\top X = I_k$. This is because we can always replace $X$ with its matrix of left singular vectors $U \in \mathbb{R}^{n\times k}$, obtained via the singular value decomposition $X = U\Sigma V^\top$, where $U^\top U = V^\top V = I_k$ and $\Sigma \succ 0$ is diagonal.
A solution $(w, \Pi)$ for $(U, y)$ has the same cost as the solution $(V\Sigma^{-1}w, \Pi)$ for $(X, y)$, and a solution $(w, \Pi)$ for $(X, y)$ has the same cost as the solution $(\Sigma V^\top w, \Pi)$ for $(U, y)$.

2.1 Algorithm

Our approximation algorithm, shown as Algorithm 1, uses a careful enumeration to beat the naïve brute-force running time of $\Omega(|\mathcal{P}_n|) = \Omega(n!)$. It uses as a subroutine a “Row Sampling” algorithm of Boutsidis et al. [5] (described in Appendix B), which has the following property.

Theorem 1 (Specialization of Theorem 12 in [5]). There is an algorithm (“Row Sampling”) that, given any matrix $A \in \mathbb{R}^{n\times k}$ with $n \geq k$, returns in $\mathrm{poly}(n, k)$ time a matrix $S \in \mathbb{R}^{r\times n}$ with $r = 4k$ such that the following hold.
1. Every row of $S$ has at most one non-zero entry.
2. For every $b \in \mathbb{R}^n$, every $w' \in \arg\min_{w\in\mathbb{R}^k} \|S(Aw - b)\|_2^2$ satisfies $\|Aw' - b\|_2^2 \leq c \cdot \min_{w\in\mathbb{R}^k} \|Aw - b\|_2^2$ for $c = 1 + 4(1 + \sqrt{n/(4k)})^2 = O(n/k)$.

The matrix $S$ returned by Row Sampling determines a (weighted) subset of $O(k)$ rows of $A$ such that solving an (ordinary) least squares problem (with any right-hand side $b$) on this subset of rows and corresponding right-hand side entries yields an $O(n/k)$-approximation to the least squares problem over all rows and right-hand side entries. Row Sampling does not directly apply to our problem because (1) it does not minimize over permutations of the right-hand side, and (2) the approximation factor is too large. However, we are able to use it to narrow the search space in our problem.

An alternative to Row Sampling is to simply enumerate all subsets of $k$ rows of $X$.
This is justified by a recent result of Dereziński and Warmuth [7], which shows that for any right-hand side $b \in \mathbb{R}^n$, using “volume sampling” [3] to choose a matrix $S \in \{0,1\}^{k\times k}$ (where each row has one non-zero entry) gives a similar guarantee as that of Row Sampling, except with the $O(n/k)$ factor replaced by $k + 1$ in expectation.

[3] Each permutation matrix $\Pi \in \mathcal{P}_n$ corresponds to a permutation $\pi$ on $[n]$; the $(i,j)$-th entry of $\Pi$ is one if $\pi(i) = j$ and is zero otherwise.

2.2 Analysis

The approximation guarantee of Algorithm 1 is given in the following theorem.

Theorem 2. Algorithm 1 returns $\hat w \in \mathbb{R}^k$ and $\hat\Pi \in \mathcal{P}_n$ satisfying

  $\|X\hat w - \hat\Pi^\top y\|_2^2 \leq (1+\epsilon) \min_{w\in\mathbb{R}^k,\,\Pi\in\mathcal{P}_n} \|Xw - \Pi^\top y\|_2^2$.

Proof. Let $\mathrm{opt} := \min_{w,\Pi} \|Xw - \Pi^\top y\|_2^2$ be the optimal cost, and let $(w_\star, \Pi_\star)$ denote a solution achieving this cost. Optimality implies that $w_\star$ satisfies the normal equations $X^\top X w_\star = X^\top \Pi_\star^\top y$. Observe that there exists a vector $b_\star \in \mathcal{B}$ satisfying $S b_\star = S \Pi_\star^\top y$. By Theorem 1 and the normal equations, the vector $\tilde w_{b_\star}$ and cost value $r_{b_\star}$ satisfy

  $\mathrm{opt} \leq r_{b_\star} \leq \|X\tilde w_{b_\star} - \Pi_\star^\top y\|_2^2 = \|X(\tilde w_{b_\star} - w_\star)\|_2^2 + \mathrm{opt} \leq c \cdot \mathrm{opt}$.

Moreover, since $X^\top X = I_k$, we have that $\|\tilde w_{b_\star} - w_\star\|_2 \leq \sqrt{(c-1)\,\mathrm{opt}} \leq \sqrt{c\,r_{b_\star}}$. By construction of $N_{b_\star}$, there exists $w \in N_{b_\star}$ satisfying $\|w - w_\star\|_2^2 = \|X(w - w_\star)\|_2^2 \leq \epsilon r_{b_\star}/c \leq \epsilon\,\mathrm{opt}$. For this $w$, the normal equations imply

  $\min_{\Pi\in\mathcal{P}_n} \|Xw - \Pi^\top y\|_2^2 \leq \|Xw - \Pi_\star^\top y\|_2^2 = \|X(w - w_\star)\|_2^2 + \mathrm{opt} \leq (1+\epsilon)\,\mathrm{opt}$.

Therefore, the solution returned by Algorithm 1 has cost no more than $(1+\epsilon)\,\mathrm{opt}$.

By the results of Pananjady et al. [18] for maximum likelihood estimation, our algorithm enjoys recovery guarantees for $\bar w$ and $\bar\pi$ when the data come from the Gaussian measurement model (1).
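Steps 5 and 8 of Algorithm 1 repeatedly need the inner minimization $\min_{\Pi\in\mathcal{P}_n}\|Xw - \Pi^\top y\|_2^2$ for a fixed $w$; this takes $O(nk + n\log n)$ time [18] because, by the rearrangement inequality, the optimal permutation simply matches the entries of $Xw$ and $y$ in sorted order. A minimal sketch of this step in Python (function names are ours, not code from the paper), checked against a brute-force search over all permutations:

```python
import itertools

def best_perm_cost_by_sorting(z, y):
    """Min over permutations pi of sum_i (z[i] - y[pi(i)])^2.
    Achieved by pairing z and y in sorted order (rearrangement inequality)."""
    return sum((a - b) ** 2 for a, b in zip(sorted(z), sorted(y)))

def best_perm_cost_brute(z, y):
    """Reference O(n!) check over all permutations of y."""
    return min(sum((a - b) ** 2 for a, b in zip(z, p))
               for p in itertools.permutations(y))

z = [0.3, -1.2, 2.5, 0.0]   # plays the role of Xw
y = [2.4, 0.1, -1.0, 0.2]   # responses in unknown order
assert abs(best_perm_cost_by_sorting(z, y) - best_perm_cost_brute(z, y)) < 1e-9
```

This is why only the enumeration over $\mathcal{B}$ and the nets $N_b$, and not the permutation step, dominates the running time.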
However, the approximation guarantee also holds for worst-case inputs without generative assumptions.

Running time. We now consider the running time of Algorithm 1. There is the initial cost for the singular value decomposition (as discussed at the beginning of the section) and for “Row Sampling”; both of these take $\mathrm{poly}(n, d)$ time. For the rest of the algorithm, we need to consider the size of $\mathcal{B}$ and the size of the net $N_b$ for each $b \in \mathcal{B}$. First, we have $|\mathcal{B}| \leq n^r = n^{O(k)}$, since $S$ has only $4k$ rows and each row has at most a single non-zero entry. Next, for each $b \in \mathcal{B}$, we construct the $\delta$-net $N_b$ (for $\delta := \sqrt{\epsilon r_b/c}$) by constructing a $\delta/\sqrt{k}$-net for the $\ell^\infty$-ball of radius $\sqrt{c\,r_b}$ centered at $\tilde w_b$ (using an appropriate axis-aligned grid). This has size $|N_b| \leq (4c^2 k/\epsilon)^{k/2} = (n/\epsilon)^{O(k)}$. Finally, each $\arg\min_{w\in\mathbb{R}^k}$ computation takes $O(nk^2)$ time, and each $(\arg)\min_{\Pi\in\mathcal{P}_n}$ takes $O(nk + n\log n)$ time [18] (also see Appendix B). So the overall running time is $(n/\epsilon)^{O(k)} + \mathrm{poly}(n, d)$.

3 Exact recovery algorithm in noiseless Gaussian setting

To counter the intractability of the least squares problem in (2) confronted in Section 2, it is natural to explore distributional assumptions that may lead to faster algorithms. In this section, we consider the noiseless measurement model where the $(x_i)_{i=1}^n$ are i.i.d. draws from $N(0, I_d)$ (as in [18]). We give an algorithm that exactly recovers $\bar w$ with high probability when $n \geq d + 1$. The algorithm runs in $\mathrm{poly}(n, d)$ time when $(x_i)_{i=1}^n$ and $\bar w$ are appropriately quantized.

It will be notationally simpler to consider $n + 1$ covariate vectors and responses

  $y_i = \bar w^\top x_{\bar\pi(i)}$, for $i = 0, 1, \dots, n$.   (3)

Here, $(x_i)_{i=0}^n$ are $n + 1$ i.i.d. draws from $N(0, I_d)$, the unknown permutation π̄ is over {0, 1, . . .
, n}, and the requirement of at least $d + 1$ measurements is expressed as $n \geq d$.

In fact, we shall consider a variant of the problem in which we are given one of the values of the unknown permutation $\bar\pi$. Without loss of generality, assume we are given that $\bar\pi(0) = 0$. Solving this variant of the problem suffices because there are only $n + 1$ possible values of $\bar\pi(0)$: we can try them all, incurring just a factor of $n + 1$ in the computation time. So henceforth, we just consider $\bar\pi$ as an unknown permutation on $[n]$.

Algorithm 2 Find permutation
input Covariate vectors $x_0, x_1, x_2, \dots, x_n$ in $\mathbb{R}^d$; response values $y_0, y_1, y_2, \dots, y_n$ in $\mathbb{R}$; confidence parameter $\delta \in (0,1)$; lattice parameter $\beta > 0$.
assume there exist $\bar w \in \mathbb{R}^d$ and a permutation $\bar\pi$ on $[n]$ such that $y_i = \bar w^\top x_{\bar\pi(i)}$ for each $i \in [n]$, and that $y_0 = \bar w^\top x_0$.
output Permutation $\hat\pi$ on $[n]$, or failure.
1: Let $X = [x_1|x_2|\cdots|x_n]^\top \in \mathbb{R}^{n\times d}$, and let its pseudoinverse be $X^\dagger = [\tilde x_1|\tilde x_2|\cdots|\tilde x_n]$.
2: Create a Subset Sum instance with $n^2$ source numbers $c_{i,j} := y_i\,\tilde x_j^\top x_0$ for $(i,j) \in [n]\times[n]$ and target sum $y_0$.
3: Run Algorithm 3 with this Subset Sum instance and lattice parameter $\beta$.
4: if Algorithm 3 returns a solution $S \subseteq [n]\times[n]$ then
5:  return any permutation $\hat\pi$ on $[n]$ such that $\hat\pi(i) = j$ implies $(i,j) \in S$.
6: else
7:  return failure.
8: end if

Algorithm 3 Lagarias and Odlyzko [12] subset sum algorithm
input Source numbers $\{c_i\}_{i\in\mathcal{I}} \subset \mathbb{R}$; target sum $t \in \mathbb{R}$; lattice parameter $\beta > 0$.
output Subset $\hat S \subseteq \mathcal{I}$, or failure.
1: Construct the lattice basis $B \in \mathbb{R}^{(|\mathcal{I}|+2)\times(|\mathcal{I}|+1)}$ where

  $B := \begin{bmatrix} I_{|\mathcal{I}|+1} \\ \beta\,(t, \; -c_i : i \in \mathcal{I}) \end{bmatrix} \in \mathbb{R}^{(|\mathcal{I}|+2)\times(|\mathcal{I}|+1)}$.

2: Run basis reduction [e.g., 16] to find a non-zero lattice vector $v$ of length at most $2^{|\mathcal{I}|/2}\cdot\lambda_1(B)$.
3: if $v = z(1, \sigma_{\hat S}^\top, 0)^\top$, with $z \in \mathbb{Z}$ and $\sigma_{\hat S} \in \{0,1\}^{\mathcal{I}}$ the characteristic
vector for some $\hat S \subseteq \mathcal{I}$ then
4:  return $\hat S$.
5: else
6:  return failure.
7: end if

3.1 Algorithm

Our algorithm, shown as Algorithm 2, is based on a reduction to the Subset Sum problem. An instance of Subset Sum is specified by an unordered collection of source numbers $\{c_i\}_{i\in\mathcal{I}} \subset \mathbb{R}$ and a target sum $t \in \mathbb{R}$. The goal is to find a subset $S \subseteq \mathcal{I}$ such that $\sum_{i\in S} c_i = t$. Although Subset Sum is NP-hard in the worst case, it is tractable for certain structured instances [12, 9]. We prove that Algorithm 2 constructs such an instance with high probability. A similar algorithm based on such a reduction was recently used by Andoni et al. [2] for a different but related problem.

Algorithm 2 proceeds by (i) solving a Subset Sum instance based on the covariate vectors and response values (using Algorithm 3), and (ii) constructing a permutation $\hat\pi$ on $[n]$ based on the solution to the Subset Sum instance. With the permutation $\hat\pi$ in hand, we (try to) find a solution $w \in \mathbb{R}^d$ to the system of linear equations $y_i = w^\top x_{\hat\pi(i)}$ for $i \in [n]$. If $\hat\pi = \bar\pi$, then there is a unique such solution almost surely.

3.2 Analysis

The following theorem is the main recovery guarantee for Algorithm 2.

Theorem 3. Pick any $\delta \in (0,1)$. Suppose $(x_i)_{i=0}^n$ are i.i.d. draws from $N(0, I_d)$, the responses $(y_i)_{i=0}^n$ follow the noiseless measurement model from (3) for some $\bar w \in \mathbb{R}^d$ and permutation $\bar\pi$ on $[n]$ (with $\bar\pi(0) = 0$), and $n \geq d$. Furthermore, suppose Algorithm 2 is run with inputs $(x_i)_{i=0}^n$, $(y_i)_{i=0}^n$, $\delta$, and $\beta$, and also that $\beta \geq 2^{n^2/2}/\varepsilon$, where $\varepsilon$ is defined in Equation (8). With probability at least $1 - \delta$, Algorithm 2 returns $\hat\pi = \bar\pi$.

Remark 1. The value of $\varepsilon$ from Equation (8) is directly proportional to $\|\bar w\|_2$, and Algorithm 2 requires a lower bound on $\varepsilon$ (in the setting of the lattice parameter $\beta$).
Hence, it suffices to determine a lower bound on $\|\bar w\|_2$. Such a bound can be obtained from the measurement values: a standard tail bound (Lemma 6 in Appendix C) shows that, with high probability, $\sqrt{\sum_{i=1}^n y_i^2/(2n)}$ is a lower bound on $\|\bar w\|_2$, and is within a constant factor of it as well.

Remark 2. Algorithm 2 strongly exploits the assumption of noiseless measurements, which is expected given the SNR lower bounds of Pananjady et al. [18] for recovering $\bar\pi$. The algorithm, however, is also very brittle and very likely fails in the presence of noise.

Remark 3. The recovery result does not contradict the results of Unnikrishnan et al. [22], which show that a collection of $2d$ measurement vectors is necessary for recovering all $\bar w$, even in the noiseless measurement model of (3). Indeed, our result shows that for a fixed $\bar w \in \mathbb{R}^d$, with high probability $d + 1$ measurements in the model of (3) suffice to permit exact recovery of $\bar w$, but this same set of measurement vectors (when $d + 1 < 2d$) will fail for some other $\bar w'$.

The proof of Theorem 3 is based on the following theorem, essentially due to Lagarias and Odlyzko [12] and Frieze [9], concerning certain structured instances of Subset Sum that can be solved using the lattice basis reduction algorithm of Lenstra et al. [16]. Given a basis $B = [b_1|b_2|\cdots|b_k] \in \mathbb{R}^{m\times k}$ for a lattice

  $\mathcal{L}(B) := \left\{\sum_{i=1}^k z_i b_i : z_1, z_2, \dots, z_k \in \mathbb{Z}\right\} \subset \mathbb{R}^m$,

this algorithm can be used to find a non-zero vector $v \in \mathcal{L}(B)\setminus\{0\}$ whose length is at most $2^{(k-1)/2}$ times that of the shortest non-zero vector in the lattice,

  $\lambda_1(B) := \min_{v\in\mathcal{L}(B)\setminus\{0\}} \|v\|_2$.

Theorem 4 ([12, 9]). Suppose the Subset Sum instance specified by source numbers $\{c_i\}_{i\in\mathcal{I}} \subset \mathbb{R}$ and target sum $t \in \mathbb{R}$ satisfies the following properties.
1. There is a subset $S^\star \subseteq \mathcal{I}$ such that $\sum_{i\in S^\star} c_i = t$.
2.
Define $R := 2^{|\mathcal{I}|/2}\sqrt{|S^\star|+1}$ and $\mathcal{Z}_R := \{(z_0, z) \in \mathbb{Z}\times\mathbb{Z}^{\mathcal{I}} : 0 < z_0^2 + \sum_{i\in\mathcal{I}} z_i^2 \leq R^2\}$. There exists $\varepsilon > 0$ such that $|z_0 \cdot t - \sum_{i\in\mathcal{I}} z_i \cdot c_i| \geq \varepsilon$ for each $(z_0, z) \in \mathcal{Z}_R$ that is not an integer multiple of $(1, \sigma_\star)$, where $\sigma_\star \in \{0,1\}^{\mathcal{I}}$ is the characteristic vector for $S^\star$.

Let $B$ be the lattice basis constructed by Algorithm 3, and assume $\beta \geq 2^{|\mathcal{I}|/2}/\varepsilon$. Then every non-zero vector in the lattice $\mathcal{L}(B)$ with length at most $2^{|\mathcal{I}|/2}$ times the length of the shortest non-zero vector in $\mathcal{L}(B)$ is an integer multiple of the vector $(1, \sigma_{S^\star}, 0)$, and the basis reduction algorithm of Lenstra et al. [16] returns such a non-zero vector.

The Subset Sum instance constructed in Algorithm 2 has $n^2$ source numbers $\{c_{i,j} : (i,j) \in [n]\times[n]\}$ and target sum $y_0$. We need to show that it satisfies the two conditions of Theorem 4.

Let $S_{\bar\pi} := \{(i,j) : \bar\pi(i) = j\} \subseteq [n]\times[n]$, and let $\bar\Pi = (\bar\Pi_{i,j})_{(i,j)\in[n]\times[n]} \in \mathcal{P}_n$ be the permutation matrix with $\bar\Pi_{i,j} := \mathbb{1}\{\bar\pi(i) = j\}$ for all $(i,j) \in [n]\times[n]$. Note that $\bar\Pi$ is the “characteristic vector” for $S_{\bar\pi}$. Define $R := 2^{n^2/2}\sqrt{n+1}$ and

  $\mathcal{Z}_R := \left\{(z_0, Z) \in \mathbb{Z}\times\mathbb{Z}^{n\times n} : 0 < z_0^2 + \sum_{1\leq i,j\leq n} Z_{i,j}^2 \leq R^2\right\}$.

A crude bound shows that $|\mathcal{Z}_R| \leq 2^{O(n^4)}$.

The following lemma establishes the first required property in Theorem 4.

Lemma 1. The random matrix $X$ has rank $d$ almost surely, and the subset $S_{\bar\pi}$ satisfies $y_0 = \sum_{(i,j)\in S_{\bar\pi}} c_{i,j}$.

Proof.
That $X$ has rank $d$ almost surely follows from the fact that the probability density of $X$ is supported on all of $\mathbb{R}^{n\times d}$. This implies that $X^\dagger X = \sum_{j=1}^n \tilde x_j x_j^\top = I_d$, and

  $y_0 = \sum_{j=1}^n x_0^\top \tilde x_j x_j^\top \bar w = \sum_{1\leq i,j\leq n} x_0^\top \tilde x_j \cdot y_i \cdot \mathbb{1}\{\bar\pi(i) = j\} = \sum_{1\leq i,j\leq n} c_{i,j} \cdot \mathbb{1}\{\bar\pi(i) = j\}$.

The next lemma establishes the second required property in Theorem 4. Here, we use the fact that the Frobenius norm $\|z_0\bar\Pi - Z\|_F$ is at least one whenever $(z_0, Z) \in \mathbb{Z}\times\mathbb{Z}^{n\times n}$ is not an integer multiple of $(1, \bar\Pi)$.

Lemma 2. Pick any $\eta, \eta' > 0$ such that $3|\mathcal{Z}_R|\eta + \eta' < 1$. With probability at least $1 - 3|\mathcal{Z}_R|\eta - \eta'$, every $(z_0, Z) \in \mathcal{Z}_R$ with $Z = (Z_{i,j})_{(i,j)\in[n]\times[n]}$ satisfies

  $\left|z_0 \cdot y_0 - \sum_{i,j} Z_{i,j}\cdot c_{i,j}\right| \;\geq\; \frac{(\pi/4)\cdot\sqrt{(d-1)/n}\cdot\eta^{2+\frac{1}{d-1}}}{\left(\sqrt{n}+\sqrt{d}+\sqrt{2\ln(1/\eta')}\right)^2}\cdot\|z_0\bar\Pi - Z\|_F\cdot\|\bar w\|_2$.

Proof. By Lemma 1, the matrix $\bar\Pi$ satisfies $y_0 = \sum_{i,j}\bar\Pi_{i,j}\cdot c_{i,j}$. Fix any $(z_0, Z) \in \mathcal{Z}_R$ with $Z = (Z_{i,j})_{(i,j)\in[n]\times[n]}$. Then

  $z_0 \cdot y_0 - \sum_{i,j} Z_{i,j}\cdot c_{i,j} = \sum_{i,j} (z_0\cdot\bar\Pi_{i,j} - Z_{i,j})\cdot x_0^\top\tilde x_j\cdot \bar w^\top x_{\bar\pi(i)}$.

Using matrix and vector notation, this can be written compactly as the inner product $x_0^\top\big(X^\dagger(z_0\bar\Pi - Z)^\top\bar\Pi X\bar w\big)$. Since $x_0 \sim N(0, I_d)$ and is independent of $X$, the distribution of the inner product is normal with mean zero and standard deviation equal to $\|X^\dagger(z_0\bar\Pi - Z)^\top\bar\Pi X\bar w\|_2$.
By Lemma 7 (in Appendix C), with probability at least $1 - \eta$,

  $\left|x_0^\top X^\dagger(z_0\bar\Pi - Z)^\top\bar\Pi X\bar w\right| \geq \|X^\dagger(z_0\bar\Pi - Z)^\top\bar\Pi X\bar w\|_2 \cdot \sqrt{\pi/2}\cdot\eta$.   (4)

Observe that $X^\dagger = (X^\top X)^{-1}X^\top$ since $X$ has rank $d$ by Lemma 1, so

  $\|X^\dagger(z_0\bar\Pi - Z)^\top\bar\Pi X\bar w\|_2 \geq \frac{\|X^\top(z_0\bar\Pi - Z)^\top\bar\Pi X\bar w\|_2}{\|X\|_2^2}$.   (5)

By Lemma 4 (in Appendix C), with probability at least $1 - \eta'$,

  $\|X\|_2^2 \leq \left(\sqrt{n}+\sqrt{d}+\sqrt{2\ln(1/\eta')}\right)^2$.   (6)

And by Lemma 9 (in Appendix C), with probability at least $1 - 2\eta$,

  $\|X^\top(z_0\bar\Pi - Z)^\top\bar\Pi X\bar w\|_2 \geq \|(z_0\bar\Pi - Z)^\top\bar\Pi\|_F\cdot\|\bar w\|_2\cdot\sqrt{\frac{(d-1)\pi}{8n}}\cdot\eta^{1+\frac{1}{d-1}}$.   (7)

Since $\bar\Pi$ is orthogonal, we have that $\|(z_0\bar\Pi - Z)^\top\bar\Pi\|_F = \|z_0\bar\Pi - Z\|_F$. Combining this with (4), (5), (6), and (7), and a union bound over all $(z_0, Z) \in \mathcal{Z}_R$, proves the claim.

Proof of Theorem 3. Lemma 1 and Lemma 2 (with $\eta' := \delta/2$ and $\eta := \delta/(6|\mathcal{Z}_R|)$) together imply that with probability at least $1 - \delta$, the source numbers $\{c_{i,j} : (i,j) \in [n]\times[n]\}$ and target sum $y_0$ satisfy the conditions of Theorem 4 with $S^\star := \{(i,j) \in [n]\times[n] : \bar\pi(i) = j\}$ and

  $\varepsilon := \frac{(\pi/4)\cdot\sqrt{(d-1)/n}\cdot\left(\delta/(6|\mathcal{Z}_R|)\right)^{2+\frac{1}{d-1}}}{\left(\sqrt{n}+\sqrt{d}+\sqrt{2\ln(2/\delta)}\right)^2}\cdot\|\bar w\|_2 \;\geq\; 2^{-\mathrm{poly}(n,\log(1/\delta))}\cdot\|\bar w\|_2$.   (8)

Thus, in this event, Algorithm 3 (with $\beta$ satisfying $\beta \geq 2^{n^2/2}/\varepsilon$) returns $\hat S = S^\star$, which uniquely determines the permutation $\hat\pi = \bar\pi$ returned by Algorithm 2.

Running time. The basis reduction algorithm of Lenstra et al. [16] is iterative, with each iteration primarily consisting of Gram-Schmidt orthogonalization and another efficient linear algebraic process called “size reduction”.
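To make the Subset Sum lattice concrete before counting iterations, here is a toy sketch of the Lagarias–Odlyzko embedding that Algorithm 3 relies on. Lattice vectors take the form $(z_0, z, \beta(z_0 t - \sum_i z_i c_i))$, so the planted subset yields the short vector $(1, \sigma_{S^\star}, 0)$, while any other combination pays a factor of $\beta$ in the last coordinate. This is a minimal sketch with our own names (a common textbook form of the embedding, not necessarily the paper's exact basis layout), and it does not run basis reduction itself:

```python
# Toy illustration of the Lagarias-Odlyzko subset-sum embedding.
# A lattice vector for integer coefficients (z0, z) is
#   (z0, z_1..z_m, beta * (z0*t - sum_i z_i * c_i)).

def lattice_vector(z0, z, c, t, beta):
    return [z0] + list(z) + [beta * (z0 * t - sum(zi * ci for zi, ci in zip(z, c)))]

c = [3, 5, 8, 12]          # source numbers
S = [0, 2]                 # planted subset: c[0] + c[2] = 11
t = sum(c[i] for i in S)   # target sum
beta = 10**6               # large lattice parameter penalizes wrong subsets
sigma = [1 if i in S else 0 for i in range(len(c))]

v = lattice_vector(1, sigma, c, t, beta)
assert v[-1] == 0                            # planted subset zeroes the beta-coordinate
assert sum(x * x for x in v) == 1 + len(S)   # squared length |S| + 1

w = lattice_vector(1, [1, 1, 0, 0], c, t, beta)
assert abs(w[-1]) >= beta                    # any wrong subset pays a factor beta
```

Basis reduction then finds a non-zero lattice vector within $2^{|\mathcal{I}|/2}$ of the shortest; the separation condition of Theorem 4 guarantees that every such short vector is an integer multiple of $(1, \sigma_{S^\star}, 0)$.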
The total number of iterations required is

  $O\!\left(\frac{k(k+1)}{2}\,\log\!\left(\sqrt{k}\cdot\frac{\max_{i\in[k]}\|b_i\|_2}{\lambda_1(B)}\right)\right)$.

In our case, $k = n^2$ and $\lambda_1(B) = \sqrt{n+1}$; and by Lemma 10 (in Appendix C), each of the basis vectors constructed has squared length at most $1 + \beta^2\cdot\mathrm{poly}(d, \log(n), 1/\delta)\cdot\|\bar w\|_2^2$. Using the tight setting of $\beta$ required in Theorem 3, this gives a $\mathrm{poly}(n, d, \log(1/\delta))$ bound on the total number of iterations as well as on the total running time.

However, the basis reduction algorithm requires both arithmetic and rounding operations, which are typically only available for finite precision rational inputs. Therefore, a formal running time analysis would require the idealized real-valued covariate vectors $(x_i)_{i=0}^n$ and unknown target vector $\bar w$ to be quantized to finite precision values. This is doable, and is similar to using a discretized Gaussian distribution for the distribution of the covariate vectors (and assuming $\bar w$ is a vector of finite precision values), but leads to a messier analysis incomparable to the setup of previous works. Nevertheless, it would be desirable to find a different algorithm that avoids lattice basis reduction and still works with just $d + 1$ measurements.

4 Lower bounds on signal-to-noise for approximate recovery

In this section, we consider the measurement model from (1), where $(x_i)_{i=1}^n$ are i.i.d. draws from either $N(0, I_d)$ or the uniform distribution on $[-1/2, 1/2]^d$, and $(\varepsilon_i)_{i=1}^n$ are i.i.d. draws from $N(0, \sigma^2)$. We establish lower bounds on the signal-to-noise ratio,

  $\mathrm{SNR} = \frac{\|\bar w\|_2^2}{\sigma^2}$,

required by any estimator $\hat w = \hat w\big((x_i)_{i=1}^n, (y_i)_{i=1}^n\big)$ to approximately recover $\bar w$ in expectation. The estimators may have a priori knowledge of the values of $\|\bar w\|_2$ and $\sigma^2$.

Theorem 5. Assume $(\varepsilon_i)_{i=1}^n$ are i.i.d. draws from $N(0, \sigma^2)$.

1. There is an absolute constant $C > 0$ such that the following holds. If $n \geq 3$, $d \geq 22$, $(x_i)_{i=1}^n$ are i.i.d. draws from $N(0, I_d)$, $(y_i)_{i=1}^n$ follow the measurement model from (1), and

  $\mathrm{SNR} \leq C\cdot\min\left\{\frac{d}{\log\log(n)}, 1\right\}$,

then for any estimator $\hat w$, there exists some $\bar w \in \mathbb{R}^d$ such that

  $\mathbb{E}\left[\|\hat w - \bar w\|_2\right] \geq \frac{1}{24}\|\bar w\|_2$.

2. If $(x_i)_{i=1}^n$ are i.i.d. draws from the uniform distribution on $[-1/2, 1/2]^d$, $(y_i)_{i=1}^n$ follow the measurement model from (1), and

  $\mathrm{SNR} \leq 2$,

then for any estimator $\hat w$, there exists some $\bar w \in \mathbb{R}^d$ such that

  $\mathbb{E}\left[\|\hat w - \bar w\|_2\right] \geq \frac{1}{2}\left(1 - \frac{1}{\sqrt{2}}\right)\|\bar w\|_2$.

Note that in the classical linear regression model where $y_i = \bar w^\top x_i + \varepsilon_i$ for $i \in [n]$, the maximum likelihood estimator $\hat w_{\mathrm{mle}}$ satisfies $\mathbb{E}\|\hat w_{\mathrm{mle}} - \bar w\|_2 \leq C\sigma\sqrt{d/n}$, where $C > 0$ is an absolute constant. Therefore, the SNR requirement to approximately recover $\bar w$ up to (say) Euclidean distance $\|\bar w\|_2/24$ is $\mathrm{SNR} \geq 24^2 C^2 d/n$. Compared to this setting, Theorem 5 implies that with the measurement model of (1), the SNR requirement (as a function of $n$) is substantially higher: $d/\log\log(n)$ in the normal covariate case, or a constant not even decreasing with $n$ in the uniform covariate case.

For the normal covariate case, Pananjady et al. [18] show that if $n > d$, $\epsilon < \sqrt{n}$, and

  $\mathrm{SNR} \geq n^{c\cdot\frac{n}{n-d}+\epsilon}$,

then the maximum likelihood estimator $(\hat w_{\mathrm{mle}}, \hat\pi_{\mathrm{mle}})$ (i.e., any minimizer of (2)) satisfies $\hat\pi_{\mathrm{mle}} = \bar\pi$ with probability at least $1 - c'n^{-2\epsilon}$. (Here, $c > 0$ and $c' > 0$ are absolute constants.) It is straightforward to see that, on the same event, we have $\|\hat w_{\mathrm{mle}} - \bar w\|_2 \leq C\sigma\sqrt{d/n}$ for some absolute constant $C > 0$. Therefore, the necessary and sufficient conditions on SNR for approximate recovery of $\bar w$ lie between $C'd/\log\log(n)$ and $n^{C''}$ (for absolute constants $C', C'' > 0$).
Narrowing this range remains an interesting open problem.

A sketch of the proof in the normal covariate case is as follows. Without loss of generality, we restrict attention to the case where $\bar w$ is a unit vector. We construct a $1/\sqrt{2}$-packing $U$ of the unit sphere in $\mathbb{R}^d$; the target $\bar w$ will be chosen from this set. Observe that for any distinct $u, u' \in U$, each of $(x_i^\top u)_{i=1}^n$ and $(x_i^\top u')_{i=1}^n$ is an i.i.d. sample from $N(0,1)$ of size $n$; we prove that they therefore determine empirical distributions that are close to each other in Wasserstein-2 distance with high probability. We then prove that, conditional on this event, the resulting distributions of $(y_i)_{i=1}^n$ under $\bar w = u$ and $\bar w = u'$ (for any pair $u, u' \in U$) are close in Kullback-Leibler divergence. Hence, by (a generalization of) Fano's inequality [see, e.g., 11], no estimator can determine the correct $u \in U$ with high probability.

The proof for the uniform case is similar, using $U = \{e_1, -e_1\}$ where $e_1 = (1, 0, \dots, 0)^\top$. The full proof of Theorem 5 is given in Appendix D.

Acknowledgments

We are grateful to Ashwin Pananjady, Michał Dereziński, and Manfred Warmuth for helpful discussions. DH was supported in part by NSF awards DMR-1534910 and IIS-1563785, a Bloomberg Data Science Research Grant, and a Sloan Research Fellowship. XS was supported in part by a grant from the Simons Foundation (#320173 to Xiaorui Sun). This work was done in part while DH and KS were research visitors and XS was a research fellow at the Simons Institute for the Theory of Computing.

References

[1] Abubakar Abid, Ada Poon, and James Zou. Linear regression with shuffled labels. arXiv preprint arXiv:1705.01342, 2017.

[2] Alexandr Andoni, Daniel Hsu, Kevin Shi, and Xiaorui Sun. Correspondence retrieval. In Conference on Learning Theory, 2017.

[3] Haim Avron and Christos Boutsidis. Faster subset selection for matrices and applications.
SIAM Journal on Matrix Analysis and Applications, 34(4):1464–1499, 2013.

[4] Sergey Bobkov and Michel Ledoux. One-dimensional empirical measures, order statistics and Kantorovich transport distances. Preprint, 2014.

[5] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal coresets for least-squares regression. IEEE Transactions on Information Theory, 59(10):6880–6892, 2013.

[6] Kenneth R. Davidson and Stanislaw J. Szarek. Local operator theory, random matrices and Banach spaces. In Handbook of the Geometry of Banach Spaces, volume 1, pages 317–366, 2001.

[7] Michał Dereziński and Manfred K. Warmuth. Unbiased estimates for linear regression via volume sampling. arXiv preprint arXiv:1705.06908, 2017.

[8] Golnooshsadat Elhami, Adam James Scholefield, Benjamin Bejar Haro, and Martin Vetterli. Unlabeled sensing: Reconstruction algorithm and theoretical guarantees. In Proceedings of the 42nd IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.

[9] Alan M. Frieze. On the Lagarias–Odlyzko algorithm for the subset sum problem. SIAM Journal on Computing, 15(2):536–539, 1986.

[10] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.

[11] Te Sun Han and Sergio Verdú. Generalizing the Fano inequality. IEEE Transactions on Information Theory, 40(4):1247–1251, 1994.

[12] Jeffrey C. Lagarias and Andrew M. Odlyzko. Solving low-density subset sum problems. Journal of the ACM, 32(1):229–246, 1985.

[13] Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.

[14] Lucien Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, pages 38–53, 1973.

[15] Michel Ledoux. The Concentration of Measure Phenomenon.
American Mathematical Society, 2000.

[16] Arjen Klaas Lenstra, Hendrik Willem Lenstra, and László Lovász. Factoring polynomials with rational coefficients. Mathematische Annalen, 261(4):515–534, 1982.

[17] Pascal Massart. Concentration Inequalities and Model Selection, volume 6. Springer, 2007.

[18] Ashwin Pananjady, Martin J. Wainwright, and Thomas A. Courtade. Linear regression with an unknown permutation: Statistical and computational limits. In 54th Annual Allerton Conference on Communication, Control, and Computing, pages 417–424, 2016.

[19] Ashwin Pananjady, Martin J. Wainwright, and Thomas A. Courtade. Denoising linear models with permuted data. arXiv preprint arXiv:1704.07461, 2017.

[20] Rolf-Dieter Reiss. Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics. Springer Science & Business Media, 2012.

[21] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. arXiv preprint arXiv:1003.2990, 2010.

[22] Jayakrishnan Unnikrishnan, Saeid Haghighatshoar, and Martin Vetterli. Unlabeled sensing with random linear measurements. arXiv preprint arXiv:1512.00115, 2015.

[23] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

[24] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997.