The broken sample problem revisited: Proof of a conjecture by Bai and Hsing and extensions
Detection, Estimation, and Reconstruction in Networks April 21, 2025 - April 25, 2025
Location: SLMath: Eisenbud Auditorium, Online/Virtual
We revisit the classical broken sample problem: Two samples of iid data points X={X_1,...,X_n} and Y={Y_1,...,Y_m} are observed without correspondence, with m \leq n. The goal is to distinguish the following two hypotheses: Under the null hypothesis, X and Y are independent. Under the alternative hypothesis, X and Y are correlated in the sense that (X_{\pi(i)}, Y_i)'s are drawn independently from a bivariate distribution for some latent injection \pi: [m] \to [n]. Originally introduced by DeGroot, Feder, and Goel [AoMS 1971] to model matching records in census data, this problem has recently gained renewed interest due to its applications in data de-anonymization, data integration, and target tracking. Despite extensive research over the past decades, determining the precise detection threshold has remained an open problem even for equal sample sizes (m=n). Assuming m and n grow proportionally, in this work, we show that the sharp threshold is given by a spectral and an L_2 condition of the likelihood ratio operator, resolving a conjecture of Bai and Hsing [PTRF 2005] in the positive. Extensions to high dimensions and linear models are also discussed. This is based on joint work with Shuyang Gong (Peking University), Simiao Jiao (Duke), and Jiaming Xu (Duke).