Kirill Skobelev

Bio

I am a pre-doctoral researcher at the Center for Applied AI at the University of Chicago Booth School of Business, and an incoming Ph.D. student in Computer Science at Northwestern University. I hold a B.A. in Economics from the New Economic School and Higher School of Economics. My research interests include AI and decision-making in long-horizon, sparse-reward domains such as medicine, business, policy, and law; interpretability methods that create a two-way supervision channel between humans and AI; and meta-science. Before joining the Center for Applied AI, I worked as a data scientist at Banco PLATA.

Research

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X.Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe · 2026

Abstract

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks --- including multimodal data integration, human interaction, and physical effects --- generally-capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to-what-extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply ``scaled away'' with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

arXiv preprint.

[paper]

DELM: a Python toolkit for Data Extraction with Language Models

Eric Fithian, Kirill Skobelev · 2025

Abstract

Large Language Models (LLMs) have become powerful tools for annotating unstructured data. However, most existing workflows rely on ad hoc scripts, making reproducibility, robustness, and systematic evaluation difficult. To address these challenges, we introduce DELM (Data Extraction with Language Models), an open-source Python toolkit designed for rapid experimental iteration of LLM-based data extraction pipelines and for quantifying the trade-offs between them. DELM minimizes boilerplate code and offers a modular framework with structured outputs, built-in validation, flexible data-loading and scoring strategies, and efficient batch processing. It also includes robust support for working with LLM APIs, featuring retry logic, result caching, detailed cost tracking, and comprehensive configuration management. We showcase DELM’s capabilities through two case studies: one featuring a novel prompt optimization algorithm, and another illustrating how DELM quantifies trade-offs between cost and coverage when selecting keywords to decide which paragraphs to pass to an LLM. DELM is available at \href{https://github.com/Center-for-Applied-AI/delm}{\texttt{github.com/Center-for-Applied-AI/delm}}.

arXiv preprint.

[code][paper]

Bank Regulators and Climate Action: Evidence from Supervisory Guidance

Kirill Skobelev, Rimmy E. Tomy · 2025

Abstract

U.S. bank regulators generally view their role as overseeing banks' exposure to climate-related risks rather than actively shaping climate policy. However, the boundary between risk oversight and policy advocacy is a gray area. We examine how U.S. bank regulators respond to climate advocacy pressures by analyzing textual data on public comments and comparing revisions between the draft and final interagency guidance on climate risk management. Utilizing large language models (LLMs) to classify the public comments, we find that these comments are highly polarized, with individuals and climate advocacy groups supporting climate action by bank regulators, and banks and banking associations opposing it. We create textual alignment measures between comments and the draft and final interagency guidance, and find that highly climate-engaged comments are less likely to be reflected in the final guidance, suggesting regulators discounted pro-climate advocacy. Finally, we use LLMs to simulate regulators with varying incentives and find that policies most similar to the actual outcome emerge when regulators follow their institutional mandates, while partisan political considerations produce less similar results.

Work in Progress. Reach out for the latest draft.

Statistical Learning Meets Analyst Forecasts of Corporate Earnings

Kirill Skobelev · 2024

Abstract

This paper explores the use and interpretation of ML methods in asset-pricing. First, I propose a robust market anomaly test based on analyst errors. I derive a link between excess returns and market's errors in earnings forecasts relative to a conditionally optimally estimator. I then contrast a Gradient Boosted Decision Tree algorithm to IBES consensus forecasts to show that companies for which analyst predictions exceeded ML forecasts earned lower out-of-sample returns and vice versa, which is consistent with the theoretical result. Further, I identify behavioral causes of analyst errors: I illustrate that analyst errors are associated with earnings growth in previous quarters (negatively), with book values (positively), and with capital expenditures (positively), thereby explaining some pricing anomalies. Second, I explore the application of ML for studying analyst behavior. I find that they submitted more Net Income forecasts for companies with intrinsically more predictable earnings as measured by R-Squared of ML forecasts after controlling for size and reporting history length.

Undergraduate thesis.

[paper]

Bitcoin Production Cost: Demystifying the Mining Industry

Kirill Skobelev · 2021

Abstract

This study examines the Bitcoin mining industry. It proposes a novel approach rooted in behavioral patterns for estimating the structure of the equipment utilized in the Bitcoin network. This study models industry’s capital expenditures, aggregate efficiency of mining hardware, and aggregate miner profitability. It finds that profits cluster in time and are relatively low compared to pure Bitcoin returns. This study considers two Bitcoin valuation methods proposed earlier in the literature: the marginal cost of production approach and 51-percent attack arbitrage. The study finds that the marginal cost hypothesis is invalid, while 51-percent attack arbitrage theory is well-grounded. This study introduces the hashrate adjustment hypothesis, which states that miners in aggregate adjust network hashrate to their revenues, which are primarily a function of the price of Bitcoin. The study presents evidence in its support.

Winner of the Student Research Paper Competition held by the HSE University.

[paper]