Workshop Program

The workshop program will be updated soon.
XX:YYzm Faithful Group Shapley Value
Yuan Zhang (National University of Singapore)
Data Shapley is an important tool for data valuation, which quantifies the contribution of individual data points to machine learning models. In practice, group-level data valuation is desirable when data providers contribute data in batches. However, we identify that existing group-level extensions of Data Shapley are vulnerable to shell company attacks, where strategic group splitting can unfairly inflate valuations. We propose the Faithful Group Shapley Value (FGSV), which uniquely defends against such attacks. Building on original mathematical insights, we develop a provably fast and accurate approximation algorithm for computing FGSV. Empirical experiments demonstrate that our algorithm significantly outperforms state-of-the-art methods in computational efficiency and approximation accuracy, while ensuring faithful group-level valuation.
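As background, the standard individual-level Data Shapley value that this work extends assigns to data point $i$ the credit
\phi_i = \sum_{S \subseteq D \setminus \{i\}} \frac{|S|!\,(|D|-|S|-1)!}{|D|!} \big[ U(S \cup \{i\}) - U(S) \big],
where $U(S)$ is the utility (e.g., test performance) of a model trained on the subset $S$. FGSV is a group-level analogue designed so that splitting one provider's data into many smaller groups cannot inflate its total valuation.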
XX:YYzm On Unbiased Stochastic Approximation
Ajay Jasra (The Chinese University of Hong Kong, Shenzhen, China)
We consider the problem of estimating parameters of statistical models associated with differential equations. In particular, we assume that the differential equation can only be solved up to a numerical error; for instance, in the case of stochastic differential equations (SDEs), the Euler-Maruyama method is often used, which introduces time-discretization bias. We adopt an optimization-based paradigm where the objective function (the likelihood function) to be maximized is not available analytically. In this talk, we show how, for certain classes of models, a new randomized stochastic approximation scheme can be used to obtain parameter estimators that eliminate the aforementioned numerical error in mathematical expectation, under suitable assumptions. We detail several applications, including partially observed SDEs and Bayesian inverse problems. Mathematical results are presented alongside numerical simulations demonstrating the efficacy of our methodology.
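A standard device behind such unbiased schemes (a sketch, not necessarily the exact construction used in the talk): let $X_l$ denote the estimator at discretization level $l$, with the convention $X_0 = 0$ and $\mathbb{E}[X_l] \to X^*$ as $l \to \infty$. Draw a random level $L$ with $\mathbb{P}(L = l) = p_l > 0$ and return
Z = \frac{X_L - X_{L-1}}{p_L}, \qquad \mathbb{E}[Z] = \sum_{l \ge 1} \big( \mathbb{E}[X_l] - \mathbb{E}[X_{l-1}] \big) = X^*,
so the time-discretization bias is removed in expectation; the assumptions alluded to above ensure that $Z$ also has finite variance and finite expected cost.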
XX:YYzm Revisiting Scalarization in Multi-Task Learning
Han Zhao (University of Illinois Urbana-Champaign)
Linear scalarization, i.e., combining all loss functions by a weighted sum, has been the default choice in the literature of multi-task learning (MTL) since its inception. In recent years, there has been a surge of interest in developing Specialized Multi-Task Optimizers (SMTOs) that treat MTL as a multi-objective optimization problem. However, it remains open whether SMTOs hold a fundamental advantage over scalarization. In this talk, I will revisit scalarization from a theoretical perspective, focusing on linear MTL models and studying whether scalarization is capable of fully exploring the Pareto front. Our findings reveal that, in contrast to recent works that claimed empirical advantages of scalarization, when the model is under-parametrized, scalarization is inherently incapable of full exploration, especially for Pareto optimal solutions that strike balanced trade-offs between multiple tasks. I will conclude the talk by briefly discussing the extension of our results to general nonlinear neural networks and our recent work on using online Chebyshev scalarization to controllably steer the search for Pareto optimal solutions.
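Concretely, with task losses $L_1(\theta), \dots, L_K(\theta)$, linear scalarization solves
\min_{\theta} \sum_{k=1}^{K} w_k\, L_k(\theta), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1,
and the question studied in the talk is whether sweeping the weight vector $w$ over the simplex can reach every point on the Pareto front of the $K$ objectives; in the under-parametrized regime described above, it cannot.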
XX:YYzm LLM-Powered CPI Prediction Inference with Online Text Time Series
Jinchi Lv (University of Southern California)
Forecasting the Consumer Price Index (CPI) is an important yet challenging task in economics, where most existing approaches rely on low-frequency, survey-based data. With the recent advances of large language models (LLMs), there is growing potential to leverage high-frequency online text data for improved CPI prediction, an area still largely unexplored. This paper proposes LLM-CPI, an LLM-based approach for CPI prediction inference incorporating online text time series. We collect a large set of high-frequency online texts from a widely used Chinese social network site and employ LLMs such as ChatGPT and trained BERT models to construct continuous inflation labels for inflation-related posts. Online text embeddings are extracted via LDA and BERT. We develop a joint time series framework that combines monthly CPI data with LLM-generated daily CPI surrogates. The monthly model employs an ARX structure combining observed CPI data with text embeddings and macroeconomic variables, while the daily model uses a VARX structure built on LLM-generated CPI surrogates and text embeddings. We establish the asymptotic properties of the method and provide two forms of constructed prediction intervals. The finite-sample performance and practical advantages of LLM-CPI are demonstrated through both simulation and real data examples. This is joint work with Yingying Fan, Ao Sun and Yurou Wang.
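In generic form (the notation here is illustrative, not taken from the paper), the two components named above are
y_t = \sum_{i=1}^{p} \alpha_i\, y_{t-i} + \beta^{\top} x_t + \varepsilon_t \quad \text{(monthly ARX for observed CPI)}, \qquad
\mathbf{z}_t = \sum_{i=1}^{q} A_i\, \mathbf{z}_{t-i} + B\, \mathbf{x}_t + \boldsymbol{\eta}_t \quad \text{(daily VARX for the LLM-generated CPI surrogates)},
where the exogenous regressors $x_t$ ($\mathbf{x}_t$) collect the text embeddings and, in the monthly case, the macroeconomic variables described above.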
XX:YYzm Guiding Time-Varying Generative Models with Natural Gradients on Exponential Family Manifold
Song Liu (University of Bristol)
Optimising probabilistic models is a well-studied field in statistics. However, its connection with the training of generative models remains largely under-explored. In this paper, we show that the evolution of time-varying generative models can be projected onto an exponential family manifold, naturally creating a link between the parameters of a generative model and those of a probabilistic model. We then train the generative model by moving its projection on the manifold according to the natural gradient descent scheme. This approach also allows us to efficiently approximate the natural gradient of the KL divergence without relying on MCMC for intractable models. Furthermore, we propose particle versions of the algorithm, which feature closed-form update rules for any parametric model within the exponential family. Through toy and real-world experiments, we validate the effectiveness of the proposed algorithms. The code of the proposed algorithms can be found here.
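For reference, the natural gradient update on the exponential-family parameters $\theta$ (the projection coordinates mentioned above) takes the form
\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_{\theta} L(\theta_t),
where $F(\theta)$ is the Fisher information matrix of the exponential family and $L$ is the objective, e.g. the KL divergence discussed above; the contribution here is to approximate this update for the projection of a time-varying generative model without resorting to MCMC, even when the model is intractable.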
XX:YYzm GLM Inference with AI-Generated Synthetic Data Using Misspecified Linear Regression
Ali Shojaie (University of Washington)
Privacy concerns in data analysis have led to the growing interest in synthetic data, which strives to preserve the statistical properties of the original dataset while ensuring privacy by excluding real records. Recent advances in deep neural networks and generative artificial intelligence have facilitated the generation of synthetic data. However, although prediction with synthetic data has been the focus of recent research, statistical inference with synthetic data remains underdeveloped. In particular, in many settings, including generalized linear models (GLMs), the estimator obtained using synthetic data converges much more slowly than in standard settings. To address these limitations, we propose a method that leverages summary statistics from the original data. Using a misspecified linear regression estimator, we then develop inference that greatly improves the convergence rate and restores the standard root-n behavior for GLMs.
XX:YYzm Learning the Climate System: Workflows that Connect Physics, Data, and Machine Learning
Tian Zheng (Columbia University)
Machine learning is increasingly used in climate modeling to support system emulation, parameter inference, forecasting, and scientific discovery, addressing challenges such as physical consistency, multi-scale coupling, data sparsity, and integration with existing workflows. In this talk, I will present a series of applied case studies focused on workflow design in climate ML, including surrogate modeling, ML-based parameterization, equation discovery from high-fidelity simulations, probabilistic programming for parameter inference, simulation-based inference in remote sensing, subseasonal forecasting, and physics-informed transfer learning. These examples highlight how ML workflows can be grounded in physical knowledge, shaped by simulation data, and designed to incorporate real-world observations. By unpacking these workflows and their design choices, I will discuss open challenges in building transparent, adaptable, and reproducible ML systems for climate science.
XX:YYzm Do Foundation Models (Really) Need Statistical Foundations?
Weijie Su (Wharton School, University of Pennsylvania)
In this talk, we advocate for the development of rigorous statistical foundations for large language models (LLMs). We begin by elaborating two key features that motivate statistical perspectives on LLMs: (1) the probabilistic, autoregressive nature of next-token prediction, and (2) the complexity and black-box nature of Transformer architectures. To illustrate how statistical insights can directly benefit LLM development and applications, we present two concrete examples. First, we demonstrate statistical inconsistencies and biases arising from the current approach to aligning LLMs with human preferences. We propose a regularization term for aligning LLMs that is both necessary and sufficient to ensure consistent alignment. Second, we introduce a novel statistical framework to analyze the efficiency of watermarking schemes, with a focus on a watermarking scheme developed by OpenAI for which we derive optimal detection rules that outperform existing ones. Time permitting, we will explore how statistical principles can inform rigorous evaluation protocols for LLMs. Collectively, these findings showcase how statistical insights can address pressing challenges in LLMs while simultaneously illuminating new research avenues for the broader statistical community to advance responsible generative AI research. This talk is based on arXiv:2405.16455, 2404.01245, 2503.10990, and 2505.19145.
XX:YYzm Anomaly Detection Using Surprisals
Rob Hyndman (Monash University)
I will discuss a probabilistic approach to anomaly detection based on extreme "surprisal values", also known as log scores, equal to minus the log density at each observation. The surprisal approach can be used for any collection of data objects, provided a probability density can be defined on the sample space. It can distinguish anomalies from legitimate observations in a heavy tail, and will identify anomalies that go undetected by methods based on distance measures. I will demonstrate the idea in various real data examples including univariate, multivariate and regression contexts, as well as for more complicated data objects. I will also briefly outline the underlying theory when the density is known, and when it is estimated using a kernel density estimate. In the latter case, an innovative bandwidth selection method is used based on persistent homology.
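A minimal univariate sketch of the surprisal idea (using an off-the-shelf Gaussian KDE with its default bandwidth, not the persistent-homology-based bandwidth selection mentioned above):

# Surprisal-based anomaly scoring: surprisal = minus log density,
# estimated here with scipy's Gaussian KDE (default bandwidth).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = np.concatenate([rng.standard_normal(1000), [8.0, -9.5]])  # bulk sample plus two planted anomalies

kde = gaussian_kde(x)                     # kernel density estimate of the sample
surprisal = -np.log(kde(x))               # minus log density at each observation
threshold = np.quantile(surprisal, 0.99)  # flag the most surprising 1% of points
print(x[surprisal > threshold])           # the planted anomalies should be among those flagged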