Abstract

Identifying the number and precise locations of multiple change points in long sequences is a critical issue in statistics and machine learning. However, accurate change point detection can be compromised by the presence of local trends in the sequence when using the conventional parametric piecewise-constant model. In this paper, we introduce an adaptive Neyman test to assess the presence of local trends. Subsequently, we develop a novel change point detection procedure based on a partially linear model that incorporates these local trends. Furthermore, we extend the proposed testing and estimation methods to multidimensional cases, facilitating the identification of common change points in array-based data. Our methods are straightforward to implement, and we evaluate their numerical performance through simulations and the analysis of SNP genotyping data.

Information

Preprint No.: SS-2024-0355
Manuscript ID: SS-2024-0355
Complete Authors: Shengji Jia, Chunming Zhang, Yiming Tang
Corresponding Authors: Yiming Tang
Emails: jstangyiming@163.com

Acknowledgments

The authors thank the Associate Editor and three reviewers for their careful review and helpful suggestions. Jia's research was partially supported by the National Natural Science Foundation of China, Grant 12501374, and the Shanghai Natural Science Foundation, Grant 25ZR1402404. The research of Zhang was supported by U.S. National Science Foundation grants DMS-2013486 and DMS-1712418, and by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with

Supplementary Materials

The online Supplementary Material includes the conditions and proofs of the theoretical results, additional simulations, and an additional real data analysis.


Supplementary materials are available for download.