<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn><issn pub-type="ppub">1680-743X</issn><issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS1057</article-id>
<article-id pub-id-type="doi">10.6339/22-JDS1057</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Statistical Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>Sampling-based Gaussian Mixture Regression for Big Data</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Lee</surname><given-names>JooChul</given-names></name><xref ref-type="aff" rid="j_jds1057_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Schifano</surname><given-names>Elizabeth D.</given-names></name><xref ref-type="aff" rid="j_jds1057_aff_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname><given-names>HaiYing</given-names></name><email xlink:href="mailto:haiying.wang@uconn.edu">haiying.wang@uconn.edu</email><xref ref-type="aff" rid="j_jds1057_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds1057_aff_001"><label>1</label>Department of Statistics, University of Connecticut, Storrs, CT 06269, USA</aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author. Email: <ext-link ext-link-type="uri" xlink:href="mailto:haiying.wang@uconn.edu">haiying.wang@uconn.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2023</year></pub-date><volume>21</volume><issue>1</issue><fpage>158</fpage><lpage>172</lpage>
<supplementary-material id="S1" content-type="document" xlink:href="jds1057supp.pdf" mimetype="application" mime-subtype="pdf"/>
<supplementary-material id="S2" xlink:href="https://github.com/pedigree07/OPTMixture">
<caption>
<title>Supplementary Material</title>
<p>Code and a complete description of the supplementary files are available at <uri content-type="external-supplement">https://github.com/pedigree07/OPTMixture</uri>.</p>
</caption>
</supplementary-material><history><date date-type="received"><day>29</day><month>5</month><year>2022</year></date><date date-type="accepted"><day>2</day><month>7</month><year>2022</year></date></history>
<permissions><copyright-statement>2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2023</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce the computational burden of large data sets. A general estimator based on a subsample is investigated, and its asymptotic normality is established. We derive optimal subsampling probabilities for the data points that minimize the asymptotic mean squared errors of the general estimator and of linearly transformed estimators. Since these probabilities depend on unknown parameters, we develop an implementable algorithm: we first approximate the optimal subsampling probabilities using a pilot sample, then select a subsample with the approximated probabilities and compute estimates from it. We evaluate the proposed method in a simulation study and present a real data example using appliance energy data.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>EM algorithm</kwd>
<kwd>massive data</kwd>
<kwd>optimal probabilities</kwd>
<kwd>subsampling</kwd>
</kwd-group>
<funding-group><award-group><funding-source xlink:href="https://doi.org/10.13039/100000001">US NSF</funding-source><award-id>CCF-2105571</award-id></award-group><funding-statement>HaiYing Wang’s research was partially supported by the US NSF grant CCF-2105571. </funding-statement></funding-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds1057_reflist_001">
<title>References</title>
<ref id="j_jds1057_ref_001">
<mixed-citation publication-type="journal"> <string-name><surname>Ai</surname> <given-names>M</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>F</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name> (<year>2021</year>a). <article-title>Optimal subsampling for large-scale quantile regression</article-title>. <source><italic>Journal of Complexity</italic></source>, <volume>62</volume>: <fpage>101512</fpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_002">
<mixed-citation publication-type="journal"> <string-name><surname>Ai</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2021</year>b). <article-title>Optimal subsampling algorithms for big data regressions</article-title>. <source><italic>Statistica Sinica</italic></source>, <volume>31</volume>: <fpage>749</fpage>–<lpage>772</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_003">
<mixed-citation publication-type="journal"> <string-name><surname>Candanedo</surname> <given-names>LM</given-names></string-name>, <string-name><surname>Feldheim</surname> <given-names>V</given-names></string-name>, <string-name><surname>Deramaix</surname> <given-names>D</given-names></string-name> (<year>2017</year>). <article-title>Data driven prediction models of energy use of appliances in a low-energy house</article-title>. <source><italic>Energy and Buildings</italic></source>, <volume>140</volume>: <fpage>81</fpage>–<lpage>97</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_004">
<mixed-citation publication-type="journal"> <string-name><surname>Dempster</surname> <given-names>AP</given-names></string-name>, <string-name><surname>Laird</surname> <given-names>NM</given-names></string-name>, <string-name><surname>Rubin</surname> <given-names>DB</given-names></string-name> (<year>1977</year>). <article-title>Maximum likelihood from incomplete data via the em algorithm</article-title>. <source><italic>Journal of the Royal Statistical Society: Series B (Methodological)</italic></source>, <volume>39</volume>(<issue>1</issue>): <fpage>1</fpage>–<lpage>22</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_005">
<mixed-citation publication-type="chapter"> <string-name><surname>Drineas</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>MW</given-names></string-name>, <string-name><surname>Muthukrishnan</surname> <given-names>S</given-names></string-name> (<year>2006</year>). <chapter-title>Sampling algorithms for l2 regression and applications</chapter-title>. In: <source><italic>Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm</italic></source>, <fpage>1127</fpage>–<lpage>1136</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_006">
<mixed-citation publication-type="journal"> <string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Schifano</surname> <given-names>ED</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2021</year>). <article-title>Fast optimal subsampling probability approximation for generalized linear models</article-title>. <source><italic>Econometrics and Statistics</italic></source>, doi: <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.ecosta.2021.02.007" xlink:type="simple">https://doi.org/10.1016/j.ecosta.2021.02.007</ext-link>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Lumley</surname> <given-names>T</given-names></string-name>, <string-name><surname>Scott</surname> <given-names>A</given-names></string-name> (<year>2015</year>). <article-title>Aic and bic for modeling with complex survey data</article-title>. <source><italic>Journal of Survey Statistics and Methodology</italic></source>, <volume>3</volume>(<issue>1</issue>): <fpage>1</fpage>–<lpage>18</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_008">
<mixed-citation publication-type="chapter"> <string-name><surname>Ma</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mahoney</surname> <given-names>M</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>B</given-names></string-name> (<year>2014</year>). <chapter-title>A statistical perspective on algorithmic leveraging</chapter-title>. In: <source><italic>International Conference on Machine Learning</italic></source>, <fpage>91</fpage>–<lpage>99</lpage>. <publisher-name>PMLR</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_009">
<mixed-citation publication-type="book"> <string-name><surname>McLachlan</surname> <given-names>G</given-names></string-name>, <string-name><surname>Peel</surname> <given-names>D</given-names></string-name> (<year>2004</year>). <source><italic>Finite Mixture Models</italic></source> <series><italic>Wiley Series in Probability and Statistics</italic></series>. <publisher-name>Wiley</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_010">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2019</year>). <article-title>More efficient estimation for logistic regression with optimal subsamples</article-title>. <source><italic>Journal of Machine Learning Research</italic></source>, <volume>20</volume>(<issue>132</issue>): <fpage>1</fpage>–<lpage>59</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_011">
<mixed-citation publication-type="other"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>JK</given-names></string-name> (2020). Maximum sampled conditional likelihood for informative subsampling. arXiv preprint: <uri>https://arxiv.org/abs/2011.05988</uri>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_012">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>Y</given-names></string-name> (<year>2021</year>). <article-title>Optimal subsampling for quantile regression in big data</article-title>. <source><italic>Biometrika</italic></source>, <volume>108</volume>(<issue>1</issue>): <fpage>99</fpage>–<lpage>112</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_013">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Stufken</surname> <given-names>J</given-names></string-name> (<year>2019</year>). <article-title>Information-based optimal subdata selection for big data linear regression</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>114</volume>(<issue>525</issue>): <fpage>393</fpage>–<lpage>405</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_014">
<mixed-citation publication-type="journal"> <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Zhu</surname> <given-names>R</given-names></string-name>, <string-name><surname>Ma</surname> <given-names>P</given-names></string-name> (<year>2018</year>). <article-title>Optimal subsampling for large sample logistic regression</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>113</volume>(<issue>522</issue>): <fpage>829</fpage>–<lpage>844</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_015">
<mixed-citation publication-type="journal"> <string-name><surname>Yao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name> (<year>2019</year>). <article-title>Optimal subsampling for softmax regression</article-title>. <source><italic>Statistical Papers</italic></source>, <volume>60</volume>(<issue>2</issue>): <fpage>585</fpage>–<lpage>599</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_016">
<mixed-citation publication-type="journal"> <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Ai</surname> <given-names>M</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name> (<year>2022</year>). <article-title>Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data</article-title>. <source><italic>Journal of the American Statistical Association</italic></source>, <volume>117</volume>(<issue>537</issue>): <fpage>265</fpage>–<lpage>276</lpage>.</mixed-citation>
</ref>
<ref id="j_jds1057_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Zuo</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>H</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>L</given-names></string-name> (<year>2021</year>). <article-title>Sampling-based estimation for massive survival data with additive hazards model</article-title>. <source><italic>Statistics in Medicine</italic></source>, <volume>40</volume>(<issue>2</issue>): <fpage>441</fpage>–<lpage>450</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>