Workshop Program

The workshop program below will be updated soon.
XX:YYzm Faithful Group Shapley Value
Yuan Zhang (Ohio State University)
Data Shapley is an important tool for data valuation, which quantifies the contribution of individual data points to machine learning models. In practice, group-level data valuation is desirable when data providers contribute data in batches. However, we identify that existing group-level extensions of Data Shapley are vulnerable to shell company attacks, where strategic group splitting can unfairly inflate valuations. We propose the Faithful Group Shapley Value (FGSV), which uniquely defends against such attacks. Building on original mathematical insights, we develop a provably fast and accurate approximation algorithm for computing FGSV. Empirical experiments demonstrate that our algorithm significantly outperforms state-of-the-art methods in computational efficiency and approximation accuracy, while ensuring faithful group-level valuation.
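As a concrete point of reference, the sketch below estimates generic group-level Shapley values by plain Monte Carlo over group permutations. It is not the FGSV estimator from the talk, and `utility` is a placeholder for any model-training-and-scoring routine.

```python
# Hypothetical sketch: plain Monte Carlo estimator of group-level Shapley values.
# This is NOT the FGSV algorithm from the talk; it only illustrates the quantity
# that group-level data valuation starts from. `utility(groups_subset)` is a
# placeholder for, e.g., test accuracy of a model trained on those groups' data.
import random

def mc_group_shapley(groups, utility, n_perm=1000, seed=0):
    """Estimate each group's Shapley value by averaging marginal contributions
    over random permutations of the groups."""
    rng = random.Random(seed)
    values = {g: 0.0 for g in groups}
    for _ in range(n_perm):
        order = list(groups)
        rng.shuffle(order)
        coalition, prev_u = [], utility([])
        for g in order:
            coalition.append(g)
            u = utility(coalition)
            values[g] += (u - prev_u) / n_perm
            prev_u = u
    return values

# Toy usage with a made-up additive utility (each group contributes its size):
data = {"A": [1, 2, 3], "B": [4, 5], "C": [6]}
print(mc_group_shapley(list(data), lambda S: sum(len(data[g]) for g in S)))
```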
XX:YYzm On Unbiased Stochastic Approximation
Ajay Jasra (The Chinese University of Hong Kong, Shenzhen, Shenzhen China)
We consider the problem of estimating parameters of statistical models associated with differential equations. In particular, we assume that the differential equation can only be solved up to a numerical error; for instance, in the case of stochastic differential equations (SDEs), the Euler-Maruyama method is often used, which introduces time-discretization bias. We adopt an optimization-based paradigm where the objective function (the likelihood function) to be maximized is not available analytically. In this talk, we show how, for certain classes of models, a new randomized stochastic approximation scheme can be used to obtain parameter estimators that eliminate the aforementioned numerical error in mathematical expectation, under suitable assumptions. We detail several applications, including partially observed SDEs and Bayesian inverse problems. Mathematical results are presented alongside numerical simulations demonstrating the efficacy of our methodology.
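For readers unfamiliar with the source of the numerical error mentioned above, the toy sketch below simulates an Ornstein-Uhlenbeck SDE with the Euler-Maruyama method and shows the time-discretization bias shrinking as the grid is refined. It illustrates the bias that the talk's randomized scheme removes in expectation, not the scheme itself.

```python
# Minimal sketch: Euler-Maruyama simulation of an Ornstein-Uhlenbeck SDE,
#   dX_t = -theta * X_t dt + sigma dW_t,
# illustrating the time-discretization bias referred to in the abstract.
# Parameter names are illustrative, not the speaker's notation.
import numpy as np

def euler_maruyama_mean(theta, sigma, x0, T, n_steps, n_paths, rng):
    dt = T / n_steps
    x = np.full(n_paths, x0)
    for _ in range(n_steps):
        x = x - theta * x * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    return x.mean()

rng = np.random.default_rng(0)
for n_steps in (4, 16, 64):
    print(n_steps, euler_maruyama_mean(1.0, 0.5, 1.0, T=1.0, n_steps=n_steps,
                                       n_paths=200000, rng=rng))
# Exact E[X_1] = exp(-1) ~ 0.368; for this linear SDE the Euler-Maruyama mean is
# (1 - theta*dt)^n_steps * x0, so the discretization bias shrinks as dt decreases.
```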
XX:YYzm Revisiting Scalarization in Multi-Task Learning
Han Zhao (University of Illinois Urbana-Champaign)
Linear scalarization, i.e., combining all loss functions by a weighted sum, has been the default choice in the multi-task learning (MTL) literature since its inception. In recent years, there has been a surge of interest in developing Specialized Multi-Task Optimizers (SMTOs) that treat MTL as a multi-objective optimization problem. However, it remains open whether SMTOs hold a fundamental advantage over scalarization. In this talk, I will revisit scalarization from a theoretical perspective, focusing on linear MTL models and studying whether scalarization is capable of fully exploring the Pareto front. In contrast to recent works that claimed empirical advantages of scalarization, our findings reveal that, when the model is under-parametrized, scalarization is inherently incapable of full exploration, especially for Pareto optimal solutions that strike balanced trade-offs among multiple tasks. I will conclude by briefly discussing the extension of our results to general nonlinear neural networks and our recent work on using online Chebyshev scalarization to controllably steer the search for Pareto optimal solutions.
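As a minimal illustration of the objects compared in the talk, the sketch below evaluates a linear (weighted-sum) scalarization and a Chebyshev scalarization of two toy regression losses. The losses, weights, and reference point are illustrative choices, not the talk's constructions.

```python
# Illustrative sketch only: linear vs. Chebyshev scalarization of two task losses
# for a shared parameter vector. Losses and weights are toy choices, not the
# models analyzed in the talk.
import numpy as np

def task_losses(w, X1, y1, X2, y2):
    return np.array([np.mean((X1 @ w - y1) ** 2), np.mean((X2 @ w - y2) ** 2)])

def linear_scalarization(losses, lam):
    return lam @ losses                      # weighted sum of task losses

def chebyshev_scalarization(losses, lam, ref=0.0):
    return np.max(lam * (losses - ref))      # worst weighted deviation from a reference

rng = np.random.default_rng(1)
X1, X2 = rng.standard_normal((50, 3)), rng.standard_normal((50, 3))
y1, y2 = X1 @ np.array([1.0, 0.0, 0.0]), X2 @ np.array([0.0, 1.0, 0.0])
w = rng.standard_normal(3)
L = task_losses(w, X1, y1, X2, y2)
print(linear_scalarization(L, np.array([0.5, 0.5])),
      chebyshev_scalarization(L, np.array([0.5, 0.5])))
```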
XX:YYzm LLM-Powered CPI Prediction Inference with Online Text Time Series
Jinchi Lv (USC)
Forecasting the Consumer Price Index (CPI) is an important yet challenging task in economics, where most existing approaches rely on low-frequency, survey-based data. With the recent advances of large language models (LLMs), there is growing potential to leverage high-frequency online text data for improved CPI prediction, an area still largely unexplored. This paper proposes LLM-CPI, an LLM-based approach for CPI prediction inference incorporating online text time series. We collect a large set of high-frequency online texts from a widely used Chinese social network site and employ LLMs such as ChatGPT and trained BERT models to construct continuous inflation labels for inflation-related posts. Online text embeddings are extracted via LDA and BERT. We develop a joint time series framework that combines monthly CPI data with LLM-generated daily CPI surrogates. The monthly model employs an ARX structure combining observed CPI data with text embeddings and macroeconomic variables, while the daily model uses a VARX structure built on LLM-generated CPI surrogates and text embeddings. We establish the asymptotic properties of the method and provide two forms of constructed prediction intervals. The finite-sample performance and practical advantages of LLM-CPI are demonstrated through both simulation and real data examples. This is joint work with Yingying Fan, Ao Sun and Yurou Wang.
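The monthly ARX structure described above can be pictured with a small ordinary-least-squares sketch on simulated placeholder series (lagged CPI plus contemporaneous text embeddings and macro variables). This is only an illustration of the model form, not the LLM-CPI pipeline.

```python
# Minimal sketch of an ARX-type regression (autoregressive term plus exogenous
# regressors) fit by OLS, mirroring the monthly-model structure described in the
# abstract. All series here are simulated placeholders, not the paper's data.
import numpy as np

rng = np.random.default_rng(2)
T = 120
text_embed = rng.standard_normal((T, 3))     # stand-in for monthly text embeddings
macro = rng.standard_normal((T, 2))          # stand-in for macroeconomic variables
cpi = np.zeros(T)
for t in range(1, T):
    cpi[t] = (0.6 * cpi[t - 1] + 0.3 * text_embed[t, 0]
              + 0.2 * macro[t, 1] + 0.1 * rng.standard_normal())

# Design matrix: lagged CPI, contemporaneous embeddings and macro variables, intercept.
X = np.column_stack([cpi[:-1], text_embed[1:], macro[1:], np.ones(T - 1)])
y = cpi[1:]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("AR(1) coefficient estimate:", beta[0])
```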
XX:YYzm Guiding Time-Varying Generative Models with Natural Gradients on Exponential Family Manifold
Song Liu (University of Bristol)
Optimising probabilistic models is a well-studied field in statistics. However, its connection with the training of generative models remains largely under-explored. In this paper, we show that the evolution of time-varying generative models can be projected onto an exponential family manifold, naturally creating a link between the parameters of a generative model and those of a probabilistic model. We then train the generative model by moving its projection on the manifold according to the natural gradient descent scheme. This approach also allows us to efficiently approximate the natural gradient of the KL divergence without relying on MCMC for intractable models. Furthermore, we propose particle versions of the algorithm, which feature closed-form update rules for any parametric model within the exponential family. Through toy and real-world experiments, we validate the effectiveness of the proposed algorithms. Code for the proposed algorithms is publicly available.
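For orientation, the generic natural-gradient update underlying this kind of scheme can be written as follows (generic notation, not the paper's):
\[
\theta_{t+1} \;=\; \theta_t \;-\; \eta\, F(\theta_t)^{-1}\,\nabla_\theta L(\theta_t),
\qquad
F(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\,\nabla_\theta \log p_\theta(x)^{\top}\right],
\]
where \(L\) is a divergence (for example, a KL divergence) between \(p_\theta\) and the target, and \(F\) is the Fisher information; for an exponential family in its natural parameterization, \(F(\theta) = \nabla^2 A(\theta)\), the Hessian of the log-partition function.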
XX:YYzm GLM Inference with AI-Generated Synthetic Data Using Misspecified Linear Regression
Ali Shojaie (University of Washington)
Privacy concerns in data analysis have led to the growing interest in synthetic data, which strives to preserve the statistical properties of the original dataset while ensuring privacy by excluding real records. Recent advances in deep neural networks and generative artificial intelligence have facilitated the generation of synthetic data. However, although prediction with synthetic data has been the focus of recent research, statistical inference with synthetic data remains underdeveloped. In particular, in many settings, including generalized linear models (GLMs), the estimator obtained using synthetic data converges much more slowly than in standard settings. To address these limitations, we propose a method that leverages summary statistics from the original data. Using a misspecified linear regression estimator, we then develop inference that greatly improves the convergence rate and restores the standard root-n behavior for GLMs.
XX:YYzm Learning the Climate System: Workflows that Connect Physics, Data, and Machine Learning
Tian Zheng (Columbia University)
Machine learning is increasingly used in climate modeling to support system emulation, parameter inference, forecasting, and scientific discovery, addressing challenges such as physical consistency, multi-scale coupling, data sparsity, and integration with existing workflows. In this talk, I will present a series of applied case studies focused on workflow design in climate ML, including surrogate modeling, ML-based parameterization, equation discovery from high-fidelity simulations, probabilistic programming for parameter inference, simulation-based inference in remote sensing, subseasonal forecasting, and physics-informed transfer learning. These examples highlight how ML workflows can be grounded in physical knowledge, shaped by simulation data, and designed to incorporate real-world observations. By unpacking these workflows and their design choices, I will discuss open challenges in building transparent, adaptable, and reproducible ML systems for climate science.
XX:YYzm State-of-the-art Confidence Intervals and Confidence Sequences through Testing-by-Betting Algorithms
Francesco Orabona (KAUST)
We consider the problem of constructing optimal confidence intervals and confidence sequences for the mean of a bounded random variable, a key problem in statistics and machine learning when sampling is costly. Traditional methods, such as Hoeffding's, Bernstein's, and law-of-the-iterated-logarithm (LIL) bounds, while asymptotically optimal, have disappointing performance in the small-sample regime. Here, we present the current state-of-the-art approaches for both problems, based on the testing-by-betting framework of Shafer and Vovk. The first is based on the betting strategy used by the Universal Portfolio Algorithm (Cover and Ordentlich, 1996) and obtains the first never-vacuous confidence sequences satisfying the LIL bound. The second is a new dynamic betting algorithm that explicitly takes the time horizon into account to achieve the theoretically and numerically tightest confidence intervals.
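To make the betting idea concrete, the sketch below builds a simple confidence sequence for the mean of a [0,1]-valued stream from a hedged capital process with a fixed bet size and Ville's inequality. It is far cruder than the universal-portfolio and horizon-aware algorithms presented in the talk.

```python
# A minimal "testing by betting" confidence sequence for the mean of a [0,1]-valued
# stream, in the spirit of the framework discussed in the talk. This toy version
# uses a fixed bet size (0.5) and a grid of candidate means; the talk's methods
# are far more refined.
import numpy as np

def betting_confidence_sequence(xs, alpha=0.05, grid=None, lam=0.5):
    """Return, after each observation, the range of candidate means not yet rejected."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 201)
    wealth_plus = np.ones_like(grid)   # bets that the true mean exceeds m
    wealth_minus = np.ones_like(grid)  # bets that the true mean is below m
    alive = np.ones_like(grid, dtype=bool)
    intervals = []
    for x in xs:
        wealth_plus *= 1.0 + lam * (x - grid)
        wealth_minus *= 1.0 - lam * (x - grid)
        hedged = 0.5 * wealth_plus + 0.5 * wealth_minus
        alive &= hedged < 1.0 / alpha        # Ville's inequality: reject once wealth >= 1/alpha
        m = grid[alive]
        intervals.append((m.min(), m.max()) if m.size else (np.nan, np.nan))
    return intervals

rng = np.random.default_rng(3)
xs = rng.beta(2, 5, size=500)                # true mean = 2/7 ~ 0.286
print(betting_confidence_sequence(xs)[-1])
```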
XX:YYzm Anomaly detection using surprisals
Rob Hyndman (Monash University)
I will discuss a probabilistic approach to anomaly detection based on extreme "surprisal values" (also known as log scores), equal to minus the log density at each observation. The surprisal approach can be used for any collection of data objects, provided a probability density can be defined on the sample space. It can distinguish anomalies from legitimate observations in a heavy tail, and it identifies anomalies that go undetected by methods based on distance measures. I will demonstrate the idea in various real data examples, including univariate, multivariate and regression contexts, and when exploring more complicated data objects. I will also briefly outline the underlying theory when the density is known, and when it is estimated using a kernel density estimate. In the latter case, an innovative bandwidth selection method is used based on persistent homology.
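A minimal version of the surprisal idea, using a kernel density estimate with scipy's default bandwidth rather than the persistent-homology-based selection mentioned above:

```python
# Minimal sketch of surprisal-based anomaly scoring with a kernel density estimate:
# the surprisal of an observation is minus its log estimated density, and unusually
# large surprisals are flagged. Bandwidth selection here is scipy's default, not the
# persistent-homology-based choice mentioned in the talk.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
x = np.concatenate([rng.standard_t(df=4, size=500),   # heavy-tailed but legitimate data
                    np.array([12.0, -15.0])])         # two injected anomalies

kde = gaussian_kde(x)
surprisal = -np.log(kde(x))                            # minus log estimated density
threshold = np.quantile(surprisal, 0.995)              # flag the most surprising observations
print(np.sort(x[surprisal > threshold]))
```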
XX:YYzm Conversion theorem and minimax optimality for continuum contextual bandits
Alexandre Tsybakov (CREST, ENSAE, IP Paris)
We study the continuum contextual bandit problem, where the learner sequentially receives a side information vector (a context) and has to choose an action in a convex set, minimizing a function depending on the context. The goal is to minimize the dynamic contextual regret, which provides a stronger guarantee than the standard static regret. We propose a meta-algorithm that converts any input non-contextual bandit algorithm into a contextual bandit algorithm, and we prove a conversion theorem, which allows one to derive a bound on the contextual regret from the static regret of the input algorithm. We apply this strategy to obtain upper bounds on the contextual regret in several major settings (losses that are Lipschitz, convex and Lipschitz, or strongly convex and smooth with respect to the action variable). Inspired by the interior point method and employing self-concordant barriers, we propose an algorithm achieving sub-linear contextual regret for strongly convex and smooth functions in the noisy setting. We show that it achieves, up to a logarithmic factor, the minimax optimal rate of the contextual regret as a function of the number of queries. Joint work with Arya Akhavan, Karim Lounici and Massi Pontil.
XX:YYzm Adaptive sample splitting for randomization tests
Yao Zhang (NUS)
Randomization tests are widely used to generate finite-sample valid p-values for causal inference on experimental data. However, when applied to subgroup analysis, these tests may lack power due to small subgroup sizes. Incorporating a shared estimator of the conditional average treatment effect (CATE) can substantially improve power across subgroups but requires sample splitting to preserve validity. To this end, we quantify each unit's contribution to estimation and testing using a certainty score, which measures how certain the unit's treatment assignment is given its covariates and outcome. We show that units with higher certainty scores are more valuable for testing but less important for CATE estimation, since their treatment assignments can be accurately imputed. Building on this insight, we propose AdaSplit, a sample-splitting procedure that adaptively allocates units between estimation and testing to maximize their overall contribution across tasks. We evaluate AdaSplit through simulation studies, demonstrating that it yields more powerful randomization tests than baselines that omit CATE estimation or rely on random sample-splitting. Finally, we apply AdaSplit to a blood pressure intervention trial, identifying patient subgroups with significant treatment effects. This is a joint work with Zijun Gao.
XX:YYzm Asymptotic FDR control with model-X knockoffs: is moments matching sufficient?
Yingying Fan (USC)
We propose a unified theoretical framework for studying the robustness of the model-X knockoffs framework by investigating the asymptotic false discovery rate (FDR) control of the practically implemented approximate knockoffs procedure. This procedure deviates from the model-X knockoffs framework by substituting the true covariate distribution with a user-specified distribution that can be learned from in-sample observations. By replacing the distributional exchangeability condition on the model-X knockoff variables with three conditions on the approximate knockoff statistics, we establish that the approximate knockoffs procedure achieves asymptotic FDR control. Using our unified framework, we further prove that arguably the most widely used knockoff variable generation method, the Gaussian knockoffs generator based on matching the first two moments, achieves asymptotic FDR control when two-moment-based knockoff statistics are employed in the knockoffs inference procedure. For the first time in the literature, our theoretical results formally justify the effectiveness and robustness of the Gaussian knockoffs generator. Simulation and real data examples validate the theoretical findings.
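For context, the sketch below draws Gaussian knockoff copies by matching the first two moments, using a simple conservative equal-value diagonal. It illustrates the kind of generator the theory covers, not the approximate knockoffs framework itself.

```python
# Sketch of second-moment-matching Gaussian knockoffs: given (estimates of) the mean
# and covariance of X, draw X_tilde so that (X, X_tilde) has the knockoff joint
# Gaussian covariance [[S, S - D], [S - D, S]]. The diagonal D below uses a simple
# conservative equal-value choice; practical generators tune D more carefully.
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, rng):
    n, p = X.shape
    c = np.linalg.eigvalsh(Sigma).min()            # any 0 < c <= 2*lambda_min(Sigma) is valid
    D = c * np.eye(p)
    Sigma_inv = np.linalg.inv(Sigma)
    cond_mean = X - (X - mu) @ Sigma_inv @ D       # E[X_tilde | X]
    cond_cov = 2 * D - D @ Sigma_inv @ D           # Cov[X_tilde | X]
    L = np.linalg.cholesky(cond_cov)
    return cond_mean + rng.standard_normal((n, p)) @ L.T

rng = np.random.default_rng(5)
p = 5
Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))    # equicorrelated covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=1000)
Xk = gaussian_knockoffs(X, np.zeros(p), Sigma, rng)
# Cross-covariance between X and its knockoffs should approximate Sigma - D.
print(np.round(np.cov(np.hstack([X, Xk]).T)[:p, p:], 2))
```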
XX:YYzm Statistical Inference for High-Dimensional and Functional Data via Bootstrapping
Zhenhua Lin (NUS)
Statistical inference in high-dimensional and functional settings is both central and challenging. We develop a set of bootstrap-based procedures for common inferential tasks, including high-dimensional analysis of variance, two-sample homogeneity testing, and hypothesis tests for the mean function and for the slope function in functional linear models. We establish asymptotic validity and consistency of the proposed methods and derive their convergence rates. As a by-product, we uncover a theoretical distinction between FPCA-based estimation and inference for the slope function. Numerical studies show accurate type-I error control and competitive power, especially with limited samples and weak signals.
XX:YYzm Leveraging synthetic data in statistical inference
Edgar Dobriban (Wharton)
Synthetic data, for instance generated by foundation models, may offer great opportunities to boost sample sizes in statistical analysis. However, the distribution of synthetic data may not be exactly the same as that of the real data, thus incurring the risk of faulty inferences. Motivated by these observations, we study how to use synthetic or auxiliary data in statistical inference problems ranging from predictive inference (conformal prediction) to hypothesis testing. We develop methods that are able to leverage synthetic or auxiliary data in addition to real data. If the synthetic data distribution is similar to that of the real data, our methods improve precision. At the same time, our methods maintain a guardrail level of coverage even if the synthetic data distribution is arbitrarily bad. We illustrate our methods with a variety of examples ranging from AI to the medical domain.
XX:YYzm Semi-Supervised Learning on Graphs with GNNs
Olga Klopp (ESSEC)
We study semi-supervised node prediction on graphs where responses arise from a graph-aware feature operator followed by a smooth regression map. Within a class combining skip-connected GCN propagation with a fully connected ReLU network, we (i) derive an oracle inequality for population risk under random label masks that separates approximation and estimation error and exposes dependence on the labeled fraction, covering numbers, and a receptive-field constant; (ii) show skip connections exactly represent multi-hop polynomial filters, mitigating over-smoothing; (iii) give covering-number bounds; and (iv) quantify robustness of our algorithm. These results link classical graph regularization and modern GNN design.
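A toy forward pass of the kind of skip-connected GCN-plus-ReLU architecture covered by these results is sketched below; all graphs and weights are random placeholders, not the paper's estimator.

```python
# Toy forward pass of a skip-connected GCN: each layer propagates over the normalized
# adjacency and adds a skip connection from the raw input features, followed by a
# fully connected ReLU head. Illustration of the architecture only.
import numpy as np

def normalized_adjacency(A):
    A_hat = A + np.eye(A.shape[0])                     # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def skip_gcn_forward(A, X, layer_weights, skip_weights, head_W):
    S = normalized_adjacency(A)
    H = X
    for W, W_skip in zip(layer_weights, skip_weights):
        H = np.maximum(S @ H @ W + X @ W_skip, 0.0)    # propagate + skip from raw features
    return np.maximum(H @ head_W, 0.0)                 # fully connected ReLU head

rng = np.random.default_rng(6)
n, d, hidden = 8, 4, 6
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                         # symmetric adjacency, no self-loops
X = rng.standard_normal((n, d))
Ws = [rng.standard_normal((d, hidden)) * 0.1, rng.standard_normal((hidden, hidden)) * 0.1]
Wskips = [rng.standard_normal((d, hidden)) * 0.1, rng.standard_normal((d, hidden)) * 0.1]
print(skip_gcn_forward(A, X, Ws, Wskips, rng.standard_normal((hidden, 1)) * 0.1).shape)
```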
XX:YYzm On the sample complexity of semi-supervised multi-objective learning
Fanny Yang (ETHZ)
In multi-objective learning (MOL), several possibly competing prediction tasks must be solved jointly by a single model. Achieving good trade-offs may require a model class G with larger capacity than what is necessary for solving the individual tasks. This, in turn, increases the statistical cost, as reflected in known MOL bounds that depend on the complexity of G. We show that this cost is unavoidable for some losses, even in an idealized semi-supervised setting, where the learner has access to the Bayes-optimal solutions for the individual tasks as well as the marginal distributions over the covariates. On the other hand, for objectives defined with Bregman losses, we prove that the complexity of G may come into play only in terms of unlabeled data. Concretely, we establish sample complexity upper bounds, showing precisely when and how unlabeled data can significantly alleviate the need for labeled data. These rates are achieved by a simple, semi-supervised algorithm via pseudo-labeling.
XX:YYzm Assessing the Quality of Denoising Diffusion Models in Wasserstein Distance: Noisy Score and Optimal Bounds
Arnak Dalalyan (ENSAE)
Generative modeling aims to produce new random examples from an unknown target distribution, given access to a finite collection of examples. Among the leading approaches, denoising diffusion probabilistic models (DDPMs) construct such examples by mapping a Brownian motion via a diffusion process driven by an estimated score function. In this work, we first provide empirical evidence that DDPMs are robust to constant-variance noise in the score evaluations. We then establish finite-sample guarantees in Wasserstein-2 distance that exhibit two key features: (i) they characterize and quantify the robustness of DDPMs to noisy score estimates, and (ii) they achieve faster convergence rates than previously known results. Furthermore, we observe that the obtained rates match those known in the Gaussian case, implying their optimality.
The talk is based on joint work with E. Vardanyan and V. Arsenyan.
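As a toy illustration of score-driven sampling with a possibly noisy score (the setting studied above), the sketch below runs an Euler discretization of the reverse Ornstein-Uhlenbeck SDE for a one-dimensional Gaussian target whose exact score is available in closed form; real DDPMs replace the exact score with an estimate.

```python
# Toy illustration of score-driven diffusion sampling, with an optional constant-variance
# perturbation of the score to mimic the noisy-score setting studied in the talk.
# The target is a 1-D Gaussian so the exact score of the noised marginals is known.
import numpy as np

m0, v0 = 2.0, 0.25          # data distribution N(m0, v0)

def score_t(x, t):
    # Exact score of the forward OU marginal p_t = N(m0*e^-t, v0*e^-2t + 1 - e^-2t).
    m_t = m0 * np.exp(-t)
    v_t = v0 * np.exp(-2 * t) + 1.0 - np.exp(-2 * t)
    return -(x - m_t) / v_t

def sample_reverse_sde(n, T=5.0, n_steps=500, score_noise_std=0.0, seed=7):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(n)                         # start near the stationary N(0, 1)
    for k in range(n_steps, 0, -1):
        t = k * dt
        s = score_t(x, t) + score_noise_std * rng.standard_normal(n)
        x = x + (x + 2.0 * s) * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n)
    return x

for noise in (0.0, 0.5):
    xs = sample_reverse_sde(20000, score_noise_std=noise)
    # Exact data mean and variance are 2.0 and 0.25; a noisy score inflates them only mildly.
    print(noise, xs.mean(), xs.var())
```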
XX:YYzm Regularized Fine-Tuning for Representation Multi-Task Learning: Adaptivity, Minimax Optimality, and Robustness
Yang Feng (NYU)
We study multi-task linear regression through the lens of regularized fine-tuning, where tasks share a latent low-dimensional structure but may deviate from it or include outliers. Unlike classical models that assume a common subspace, we allow each task’s subspace to drift within a similarity radius and permit an unknown fraction of tasks to violate the shared structure. We propose a penalized empirical-risk algorithm and a spectral method that adapt automatically to both the degree of subspace similarity and the proportion of outliers. We establish information-theoretic lower bounds and show that our methods achieve these rates up to constants, with the spectral method attaining exact minimax optimality in the absence of outliers. Moreover, our estimators are robust: they never perform worse than independent single-task regression and yield strict improvements when tasks are moderately similar and outliers are sparse. A thresholding scheme further adapts to unknown intrinsic dimension, and experiments validate the theory.
XX:YYzm Random Fields on Dynamic Metric Graphs
Emilio Porcu (Khalifa University)
XX:YYzm From Hypothesis Testing to Distribution Estimation
Nikita Zhivotovskiy (Berkeley)
Distinguishing between two distributions based on observed data is a classical problem in statistics and machine learning. But what if we aim to go further: not just test, but actually estimate a distribution close to the true one in, say, Kullback-Leibler divergence? Can we do this knowing only that the true distribution lies in a known class, without structural assumptions on the individual densities? In this talk, I will review classical results and present recent developments on this question. The focus will be on high-probability error bounds that are optimal up to constants in this general setting.
XX:YYzm PCS uncertainty quantification for regression and classification
Anthony Ozerov (Berkeley)
Trustworthy uncertainty quantification (UQ) is required for good, safe decision-making when using machine learning models. We discuss new UQ methods for regression and classification under the Predictability-Computability-Stability framework. This involves (1) training different models on the same dataset, (2) screening out those which make poor predictions, (3) creating a bootstrap ensemble, and (4) using calibration data to create prediction intervals or sets achieving a specified coverage level. By combining model selection (screening), uncertainty due to model misspecification and noise (ensembling), and calibration, we obtain predictions on real datasets which are sharper (narrower intervals or smaller sets) and more adaptive to data subgroups than standard one-model conformal prediction. Finally, we discuss one challenge that arose in this work: how to evaluate prediction sets in classification, balancing sharpness and adaptivity. Drawing from well-known scoring rules for probabilistic predictions, we propose new evaluation metrics for prediction sets and show how they can be used to choose between or tune prediction algorithms.
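A rough, simplified sketch of the four ingredients listed above (fit, screen, bootstrap-ensemble, calibrate) on simulated data is given below; it is a generic illustration, not the actual PCS procedure.

```python
# Rough sketch of the ingredients described in the abstract, in simplified form:
# (1)-(2) fit and screen candidate models on a validation split, (3) bootstrap the
# survivors into an ensemble, (4) calibrate an interval width on held-out data to hit
# a target coverage. Generic illustration only, not the actual PCS procedure.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(1500, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(1500)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# (1)-(2) train candidate models and screen out those with poor validation error
candidates = [LinearRegression(), RandomForestRegressor(n_estimators=50, random_state=0)]
fitted = [m.fit(X_tr, y_tr) for m in candidates]
errors = [np.mean((m.predict(X_cal) - y_cal) ** 2) for m in fitted]
kept = [m for m, e in zip(fitted, errors) if e <= 2 * min(errors)]

# (3) bootstrap ensemble of the kept models
ensemble = []
for m in kept:
    for b in range(10):
        idx = rng.integers(0, len(X_tr), len(X_tr))
        ensemble.append(type(m)(**m.get_params()).fit(X_tr[idx], y_tr[idx]))
preds_cal = np.mean([m.predict(X_cal) for m in ensemble], axis=0)

# (4) calibrate a symmetric width for 90% coverage, then check coverage on test data
q = np.quantile(np.abs(y_cal - preds_cal), 0.9)
preds_te = np.mean([m.predict(X_te) for m in ensemble], axis=0)
print("test coverage:", np.mean(np.abs(y_te - preds_te) <= q))
```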
XX:YYzm LLM predisposition
Xin Tong (HKU)
We study the faithfulness of LLM-mediated communication by modeling it as a generation-summarization process. Using a novel experimental framework and an adapted benchmark dataset, we introduce a quantitative metric to evaluate faithfulness. Our results reveal significant information distortion in current LLM-mediated communication.
XX:YYzm New Statistical Questions in the Age of Large Language Models
Amit Sharma (MSR India)
Statistical analysis has long relied on a division of labor: domain knowledge is provided by subject-matter experts, while inference is guided by formal statistical methods. Large language models (LLMs) blur this boundary by generating domain knowledge–like priors for a problem, offering new opportunities and statistical challenges. I will first demonstrate how LLMs can propose causal mechanisms across fields such as medicine and environmental science, suggest candidate variables, functional forms, and even robustness checks. Unlike expert knowledge, however, LLM-derived priors cannot be assumed valid—they introduce new, structured but unpredictable forms of error.
This motivates a broader statistical question: what would end-to-end inference look like in an LLM-assisted regime? For example, in causal effect estimation, LLMs may provide distributions over causal graphs that can then be used for effect estimation; conversely, given an effect, they may help check or refine the assumptions. Extending this idea, we arrive at the possibility of inference pipelines that move directly from a scientific question to study design to parameter estimation with LLM input at each stage. Such workflows raise new statistical challenges, including how to construct confidence intervals, quantify uncertainty, and calibrate inference when part of the prior comes from an unreliable but informative model. The talk will motivate these open problems with real-world case studies.
XX:YYzm Causal Modeling with Stationary Processes
Mathias Drton (TUM)
The ultimate aim of many data analyses is to infer cause-and-effect relationships between random variables of interest. While much of the available methodology for addressing causal questions relies on structural causal models, these models are best suited for systems without feedback loops. Extensions to accommodate feedback have been proposed, but often result in models that are challenging to interpret. In this lecture, we present an alternative approach to graphical causal modeling that considers stationary distributions of multivariate diffusion processes.
XX:YYzm Causality-Inspired Distributional Robustness for Nonlinear Models
Peter Buhlmann (ETHZ)
Distributional robustness is a central challenge in predictive modeling, as real-world data often exhibit substantial distribution shifts across environments. Causality offers a principled framework for modeling such distributional perturbations, enabling rigorous guarantees of robustness in nonlinear models through representation learning. We will discuss the framework and its theoretical foundations, and illustrate its applications in perturbation genomics and medical domain adaptation.
XX:YYzm From Intrinsic Dimension to Information Imbalance: Nearest-Neighbor Methods for Dimensionality Reduction, Nonparametric Variable Selection, and Causal Discovery
Antonietta Mira (Università della Svizzera italiana)
This talk presents recent advances in nearest-neighbor methods for understanding complex, high-dimensional data. In the first part, we focus on intrinsic dimension (ID), a simple yet powerful descriptor that reveals the effective number of degrees of freedom in a dataset. We show how ID can be estimated adaptively, self-consistently determining both the optimal scale of analysis and the number of variables required to describe the data without significant information loss. Moreover, different IDs may coexist within the same dataset, pointing to subsets of points lying on distinct manifolds and naturally yielding a clustering of the data. Applications range from gene expression and protein folding to pandemic evolution, fMRI, finance, and network data. In the second part, we introduce the concept of information imbalance (II) and its differentiable extension (DII), which provide nonparametric measures of variable informativeness and causal directionality. Applications to synthetic and real-world datasets, including the EU Emission Trading System, highlight the potential of this framework for dimensionality reduction, variable selection, and causal discovery.
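As a concrete entry point to the first part, the sketch below implements a basic two-nearest-neighbor intrinsic-dimension estimate via the Pareto maximum-likelihood formula; the adaptive, multi-scale procedures discussed in the talk refine this construction.

```python
# A minimal version of a nearest-neighbor intrinsic-dimension estimator (the "two-NN"
# idea): the ratio of second- to first-neighbor distances follows a Pareto law whose
# exponent is the intrinsic dimension, so its MLE gives an ID estimate.
import numpy as np
from scipy.spatial import cKDTree

def two_nn_id(X):
    """Maximum-likelihood intrinsic dimension from ratios of the two nearest neighbors."""
    dists, _ = cKDTree(X).query(X, k=3)      # k=3: self, first and second neighbor
    mu = dists[:, 2] / dists[:, 1]
    return len(X) / np.sum(np.log(mu))

rng = np.random.default_rng(9)
# 3-dimensional data linearly embedded in 10 ambient dimensions
latent = rng.standard_normal((2000, 3))
X = latent @ rng.standard_normal((3, 10))
print(two_nn_id(X))                           # should be close to 3
```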
XX:YYzm Bayesian predictive-based uncertainty quantification
Sonia Petrone (Bocconi)
In the rapid evolution of Statistics and AI, we still feel a tension between the “two cultures” - classic statistical inference versus algorithmic prediction. The Bayesian approach has prediction in its foundations, and may naturally combine both cultures. In a Bayesian predictive approach, one directly reasons on prediction of future observations, bypassing models and parameters, or possibly using them implicitly. In a nutshell, while Statistics traditionally goes from inference to prediction, here one goes from prediction to inference. This approach allows us to regard predictive algorithms - computationally convenient approximations of exact Bayesian solutions, or black-box predictive engines - as Bayesian predictive learning rules, and to provide them with full Bayesian uncertainty quantification. In the talk, I will review basic concepts and recent results, and discuss ongoing directions, such as calibrating the predictive rule for predictive ‘efficiency’ and good inferential properties.
XX:YYzm Interesting Title of a Talk
Alexander Giessing (NUS)
In this talk, we develop non-asymptotic Gaussian approximation results for the sampling distribution of suprema of empirical processes when the indexing function class \(\mathcal{F}_n\) varies with the sample size \(n\) and may not be Donsker. Prior approximations of this type required upper bounds on the metric entropy of \(\mathcal{F}_n\) and uniform lower bounds on the variance of \(f \in \mathcal{F}_n\), which both limited their applicability to high-dimensional inference problems. In contrast, our results hold under simpler conditions on the boundedness, continuity, and strong variance of the approximating Gaussian process. The results are broadly applicable and yield a novel procedure for bootstrapping the distribution of empirical process suprema based on the truncated Karhunen–Loève decomposition of the approximating Gaussian process. We demonstrate the flexibility of this new bootstrap procedure by applying it to three fundamental problems: simultaneous inference on parameter vectors, construction of simultaneous confidence bands for functions in reproducing kernel Hilbert spaces, and inference on shallow neural networks.
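A simplified picture of this style of bootstrap, over a finite function class and with a truncated eigen-decomposition of the estimated covariance, is sketched below; the talk's procedure is more general and comes with the stated guarantees.

```python
# Simplified illustration of bootstrapping the supremum of an empirical process over a
# finite function class via a truncated eigen (Karhunen-Loeve-type) decomposition of the
# estimated covariance of the approximating Gaussian process. Illustration only.
import numpy as np

rng = np.random.default_rng(10)
n, p, k = 500, 50, 10                       # sample size, number of functions, truncation level
X = rng.standard_normal((n, 2))
grid = np.linspace(-2, 2, p)
F = np.cos(np.outer(X[:, 0], grid)) + 0.5 * X[:, 1:2]   # f_j(X_i) evaluations, shape (n, p)

centered = F - F.mean(axis=0)
Sigma_hat = centered.T @ centered / n       # estimated covariance of the Gaussian proxy
evals, evecs = np.linalg.eigh(Sigma_hat)
evals, evecs = evals[-k:], evecs[:, -k:]    # keep the top-k eigenpairs (truncation)

B = 2000
Z = rng.standard_normal((B, k)) * np.sqrt(np.clip(evals, 0, None))
sup_draws = np.abs(Z @ evecs.T).max(axis=1) # simulated suprema of the truncated Gaussian process
print("bootstrap 95% critical value for the supremum:", np.quantile(sup_draws, 0.95))
```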