In this talk, we present a comprehensive overview of the construction of the Duolingo English Test (DET), a high-stakes, large-scale, fully online English language proficiency test built using human-in-the-loop (HIL) AI. The DET is a computerized adaptive test where test takers respond to items designed to assess their proficiency in speaking, writing, reading, and listening. Human oversight plays a critical role in ensuring the DET’s fairness, lack of bias, construct validity, and security. Additionally, AI and foundation models enable the scalability of key processes, including real-time security features and automated scoring.
We will take a tour of this process, organized into five parts. First, items are constructed using generative AI models like GPT, followed by human review to evaluate potential bias and fairness. Second, machine learning models automatically grade items by predicting expert human ratings based on construct-aligned AI features. Third, items are calibrated using explanatory item response theory models and custom item embeddings, aligning them with historical test taker performance. Fourth, tests are administered via Thompson sampling, motivated by framing computerized adaptive testing as a contextual bandit problem, and scored using a fully Bayesian approach. Finally, automated AI signals, such as eye-gaze tracking and LLM response detection, support proctors in detecting cheating and ensuring adherence to test rules. Throughout this process, humans and AI collaborate seamlessly to maintain the DET’s integrity and effectiveness.
The talk covers our systematic approach to building a reliable, scalable, and fair language assessment:
Item Construction: Leveraging generative AI models like GPT for initial item creation, followed by rigorous human review for bias and fairness evaluation.
Automated Scoring: Employing machine learning models that predict expert human ratings using construct-aligned AI features.
Item Calibration: Implementing explanatory item response theory models with custom item embeddings to align with historical performance data.
Adaptive Testing: Using Thompson sampling within a contextual bandit framework for test administration, combined with fully Bayesian scoring approaches.
Security and Integrity: Deploying automated AI signals including eye-gaze tracking and LLM response detection to support human proctors in maintaining test security.
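To make the adaptive-testing step concrete, here is a minimal sketch of Thompson-sampling item selection (an illustration of the general idea, not the DET's actual implementation; the Gaussian ability posterior and the Rasch/1PL item model are my simplifying assumptions): sample a plausible ability from the current posterior, then administer the most informative remaining item at that sampled ability.

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_info(theta, b):
    # Fisher information of a Rasch (1PL) item with difficulty b at ability theta.
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

def thompson_select(mu, sigma, difficulties, administered):
    # Sample a plausible ability from the current Gaussian posterior, then
    # pick the unadministered item most informative at that sampled ability.
    theta_tilde = rng.normal(mu, sigma)
    best_item, best_info = None, -np.inf
    for j, b in enumerate(difficulties):
        if j in administered:
            continue
        info = fisher_info(theta_tilde, b)
        if info > best_info:
            best_item, best_info = j, info
    return best_item

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
item = thompson_select(mu=0.5, sigma=1.0, difficulties=difficulties, administered={2})
```

Because the item is chosen at a posterior *sample* rather than the posterior mean, the selection explores in proportion to the remaining uncertainty about the test taker's ability, which is the contextual-bandit view of adaptive testing.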
You can view the full keynote details at NeurIPS 2024.
I am co-organizing a two-day workshop for AAAI 2024, “AI for Education: Bridging Innovation and Responsibility,” at the 38th AAAI Annual Conference on AI. For full info see ai4ed.cc/workshops/aaai2024.
Discover the transformative potential of GenAI and responsible AI in education! Join us in our two-day workshop to explore new research opportunities, technological advances, and the crucial ethical implications for a better, more equitable educational future. You are invited to engage in a pivotal discussion about AI’s impact on education: help influence how AI transforms learning, teaching, and assessment, and help shape responsible AI practices. Let’s foster an inclusive, effective educational ecosystem together!
This two-day workshop explores the innovations in artificial intelligence (AI), specifically generative-AI (GenAI), in educational applications, and discusses the related ethical implications of responsible-AI (RAI). Over two days, attendees will examine GenAI technologies, potential vulnerabilities, and the development of RAI standards in an educational context. Through a variety of formats like papers, demonstrations, posters, a global competition on Math reasoning, and opportunities to hear experts and representatives from various communities, participants will explore AI’s impact on instruction quality, learner outcomes, and ethics. This workshop ultimately aims to inspire novel ideas, foster partnerships, and navigate the ethical complexities of AI in education.
We welcome different kinds of submissions:
All submissions must follow the PMLR style template. To ensure a fair review process, all submissions will be evaluated through a double-blind review.
Accepted full papers will be invited to submit an extended version, addressing the remarks of the reviewers, to PMLR https://proceedings.mlr.press/ to be published as part of the Workshop Proceedings.
Global challenge on math problem solving and reasoning.
We invite researchers and practitioners worldwide to investigate the opportunities of automatically solving math problems via LLM approaches. More details about this competition and instructions for submission can be found at https://ai4ed.cc/competitions/aaai2024competition.
Important Dates
Workshop URL: https://ai4ed.cc/workshops/aaai2024. For any questions, email us at [email protected].
Thesis: Conditional Independence Testing for Neural Networks
While the title of Xiaoliu’s thesis is quite specific, the thesis itself was actually a diverse collection of practical and methodological work. This is because Xiaoliu was instrumental in the Healthy Davis Together effort, which pulled him away from the usual machine learning methods work of my students and into the fast-paced world of doing data science in real time for Covid response. With his help we were able to do the following…

The imputed concentration time series for the wastewater treatment plant compared to simpler imputation methods.
His work on conditional independence testing proposed and studied a new way to test the independence of binary variables (think treatment and effect) given a complex confounding variable (think medical images). His solution actually worked, and was not just a theoretical upper bound. I am still convinced that it is one of the few practical approaches to the problem (as far as I know). The publication remains to be completed, however! (cough, cough)
Thesis: Statistical Learning for High-Dimensional Networked Data in Transportation Systems
Ran is a researcher in transportation science and will be going on to do a postdoc at the University of Michigan! Ran has a clear research direction: finding and exploiting low-dimensional representations of traffic data in road networks. It is surprising how complicated analyses of traffic data over networks can get, and there are many complications (in contrast to the simplified problems that we typically study in graph signal processing). For example, the data may be traffic flows from sensors in traffic lanes. They are directed, noisy, and measured on the edges of the network. Also, traffic networks are heterogeneous and only partially observed (you have to do some work to go from the road network and traffic to get to the actual graph). He has three distinct chapters that address real and important problems in transportation science using new graph signal processing methods. Here is an image of a road network in LA…

I published (finally) my paper on Nearest Neighbor Matching (NNM) in IEEE Transactions on Information Theory. It basically shows that if you have a biased non-missing sample and an unbiased partially missing sample, where X is non-missing and Y is missing for the population of interest, then it works to just match the partially missing data to the closest non-missing biased sample to estimate the population mean of Y. This is NNM, and we now know that it is consistent when the two distributions satisfy a Rényi entropy constraint and the expectation of Y given X satisfies a moment constraint. Previously, it was known to be consistent under Lipschitz smoothness constraints (which is easy to show), and we even had rates! Unfortunately, these rates were suboptimal under the Lipschitz assumption. So debiasing approaches were proposed, but at that point it was unclear why you would not just use model-based or regression approaches. However, without the Lipschitz assumption it is much harder to show even consistency. This is what I established, but there are many questions that I still have!
One nice thing is that it works for any Borel measurable regression function, but a downside is that it requires finite dimensions and a density. In fact, we are awfully close to showing that this can work without the finite dimensionality assumption. I outlined in that paper how we can prove this for a broader set of separable metric spaces. But it requires some new results which would typically be lemmas for nearest neighbor regression. Then the remainder of my results would extend pretty naturally to this setting.
Nearest neighbor matching is very appealing because it can be used in combination with any metric and can benefit from data structures that target nearest neighbor lookups. For example, you could imagine a situation where an SQL query over a massive distributed data set, i.e. in search applications, is
SELECT AVG(Y) FROM bigtable WHERE X < 1
If Y were partially missing, then you could just match the missing entries to the non-missing entries and use NNM. This situation is pretty common in search, especially for logs. Any search engine with an HNSW backend can do this on the fly.
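A minimal one-dimensional sketch of the NNM estimator (function and variable names are mine, for illustration): each unit with Y missing is matched to the nearest biased unit by X, and the matched Y values are averaged.

```python
import numpy as np

def nnm_mean(x_biased, y_biased, x_target):
    """Estimate the target-population mean of Y by nearest neighbor matching.

    x_biased, y_biased: the biased sample, where Y is observed.
    x_target: covariates from the unbiased sample, where Y is missing.
    """
    x_biased = np.asarray(x_biased, dtype=float)
    y_biased = np.asarray(y_biased, dtype=float)
    x_target = np.asarray(x_target, dtype=float)
    # Match each target point to the closest biased point in |x - x'|.
    idx = np.abs(x_target[:, None] - x_biased[None, :]).argmin(axis=1)
    # Average the imputed Y values to estimate E[Y] over the target population.
    return float(y_biased[idx].mean())

# Toy example where Y = X exactly: 0.1 matches to 0.0 and 2.9 matches to 3.0.
est = nnm_mean([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0], [0.1, 2.9])
```

In large-scale settings, the brute-force matching line is exactly what an approximate nearest neighbor index such as HNSW would replace.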
J. Sharpnack, “On L2-consistency of Nearest Neighbor Matching,” in IEEE Transactions on Information Theory, doi: 10.1109/TIT.2022.3226479.
Distribution shift is when the training and test data in supervised learning have different distributions.
Domain adaptation means training a classifier, using only the training data, that achieves good accuracy on the test data under distribution shift.
In our recent ICLR submission, called RLSbench, we loosely defined relaxed label shift as
The label marginal distribution, \(p(y)\), can shift arbitrarily and the class conditionals, \(p(x\vert y)\), can shift in seemingly natural ways.
This is admittedly, purposefully vague. But you can see it in our gravitational lens detection problem.
I have written in more detail about it here, but the setting fits the relaxed label shift definition.
The basic idea is that we want to detect a rare phenomenon called a gravitational lens, when very distant (old) light bends around a distant massive object like a galaxy and we can see it.
In our study, we can simulate many realistic lenses over real astronomy images without lenses.

Non-lens (left), simulated lens (center), real lens (right)
You can assume that there are no real lenses in our training set (we actually curate it). Because we get many simulated lenses, but there are very few real ones in the test set, the marginal label distribution \(p(y)\) shifts (\(y=1\) means it is a lens, \(y=0\) means it is not). Also, because the simulations definitely look different than the real lenses (they aren’t THAT good), \(p(x\vert y=1)\) shifts. However, \(p(x\vert y=0)\) is the same, since the non-lenses in the training and test data are pulled from the same set of survey images. Hence, we have relaxed label shift, and typical assumptions like covariate shift (\(p(y\vert x)\) doesn’t shift) and label shift (\(p(x\vert y)\) doesn’t shift) don’t hold.
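A toy simulation of this relaxed label shift structure (the Gaussians, proportions, and means here are made up for illustration): non-lenses share one distribution in train and test, while both \(p(y)\) and \(p(x\vert y=1)\) shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, p_pos, pos_mean):
    # Non-lenses (y=0) come from the same N(0, 1) in train and test; lenses
    # (y=1) differ between train and test via pos_mean (simulated vs. real).
    y = rng.binomial(1, p_pos, size=n)
    x = rng.normal(np.where(y == 1, pos_mean, 0.0), 1.0)
    return x, y

# Train: many simulated lenses. Test: few real lenses with a shifted p(x|y=1).
x_tr, y_tr = sample(10_000, p_pos=0.5, pos_mean=2.0)   # simulated lenses
x_te, y_te = sample(10_000, p_pos=0.01, pos_mean=2.5)  # "real" lenses
```

Here \(p(y=1)\) drops from 0.5 to 0.01 and the positive-class conditional mean moves, while the negative-class conditional stays fixed, matching the relaxed label shift definition above.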
The RLSbench work does A LOT, and it is all due to Saurabh Garg. One takeaway of many is that FixMatch works quite well for domain adaptation on many of our benchmark datasets, and with appropriate modification it can almost always outperform standard supervised learning. This is important because we won’t have a chance to do model selection for our domain adaptation method without peeking at the test data, so we can’t really choose it. We want it to be safe in the sense that it typically improves performance.
Consistency regularization is where two independently augmented copies of the same test image are encouraged to produce similar predictions. This needs test images, but that is fine because we still won’t peek at the test labels - a setting called semi-supervised learning (SSL). The Pi-model algorithm directly uses consistency regularization. The idea is to take two random augmentations of the same data point \(x\) and compute the squared difference of the model outputs for the augmented copies. We use \(\text{aug}, \widetilde{\text{aug}}\) to denote two independent augmentations, which can be produced by selecting different randomization seeds. The unsupervised loss is then
\[\ell_U(X) = \left\| p_\Theta(\text{aug}(X)) - p_\Theta(\widetilde{\text{aug}}(X))\right\|^2\]This unsupervised loss is added to the supervised loss (usually with data augmentation as well). The choice of stochastic augmentation function is up to the modeler and will often be domain specific. Both FixMatch and MixMatch employ consistency regularization. MixMatch was originally proposed as a heuristic approach, and FixMatch was later derived as a more principled simplification of MixMatch and other related SSL methods. When we look at the relative performances of different SSL algorithms for detecting gravitational lenses, we see that consistency regularization does best. (This is combined with GAN and smart data augmentation.) For more details on the pipeline for detecting gravitational lenses, see our recent preprint.
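The Pi-model loss above can be sketched in a few lines of numpy (the linear "model" and Gaussian-jitter "augmentation" here are toy stand-ins, not what we use on survey images):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(x, W):
    # Toy model: class probabilities from a linear map.
    return softmax(x @ W)

def augment(x):
    # Toy stochastic augmentation: additive Gaussian jitter.
    return x + 0.1 * rng.standard_normal(x.shape)

def pi_model_loss(x, W):
    # Squared difference of predictions on two independent augmentations of x.
    p1 = predict(augment(x), W)
    p2 = predict(augment(x), W)
    return float(np.sum((p1 - p2) ** 2))

W = rng.standard_normal((4, 3))
x = rng.standard_normal((8, 4))
loss = pi_model_loss(x, W)
```

In practice this term is weighted and added to the supervised cross-entropy on the labeled training images, and the augmentation is domain specific (rotations, flips, and noise for astronomy cutouts).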
I received tenure in the summer of 2021 in the UC Davis Statistics department (currently the 13th best statistics department in the country). This was the culmination of a long process that was overall very gratifying. I am eternally grateful to my students, my department, my colleagues and advisors, and my supportive wife and beautiful children.
The basis for my tenure case was my research, teaching, and service (as usual).
My research has a few main threads:
My Ph.D. thesis and early career research studied and proposed nonparametric statistical and machine learning methods. Some of these problems were primarily theoretical, such as my work on nearest neighbor methods, scan statistics, and trend filtering. Some have been driven by the need for new methods, such as graph anomaly detection, scalable personalized recommendation systems, and robust contextual bandits. Initially I was the lead researcher, and then, as is the nature of academic labs, I shifted to PI work and led a team.
As my research progressed, I became less interested in methodological research, i.e. proposing new methods for idealized problems. There is nothing really wrong with this type of research, but it lacks the connection to applications. In contrast, my most gratifying project to date has probably been detecting gravitational lenses in real astronomical surveys. This is an end-to-end data science problem, where we started with real data and experiments, had to make strategic decisions, design a machine learning and signal processing pipeline, and implement it. It has led directly to new science and the discovery of new gravitational lenses. Some of the things that I learned in recent years are how to manage large teams and coordinate dependent efforts. These soft skills are hard to obtain and master, but are critical to a successful team.
Teaching data science is a mixed blessing in academia. It can be, and has been, extremely gratifying, such as when a star student goes on to be very successful (looking at you, Andrew Chin). It is also a very onerous task, and some courses are far more difficult to teach than others. Typically, applied data science courses are hard to teach because there is not enough canonical course material, there are not enough qualified or interested instructors (why teach when you can do and get paid way more?), and the administrative burden of running a computation-heavy course is large. I found that out the hard way, but I am very proud of and grateful for the experience (now that it is over).
I taught the following courses…
Teaching also doesn’t have to be this way. Most of the administrative work can be automated or distributed within the department, but we don’t prioritize this. Course material could be shared between institutions and we don’t need to give the same lectures every year. Office hours can be greatly expanded if we remove these unnecessary burdens, and then large classes will feel smaller to the students.
I have decided to take an indefinite hiatus from academia and explore options in industry (or anywhere except tenure-track professorships). I see a lot of notes from disgruntled former academics explaining why the academic job is so bad. I think that these are a bit overblown. I hold the opinion that the academic job is good for a very specific type of person. There are a few things that I have seen.
Academic freedom. You can have academic freedom if you decide to exercise it. However, as an early- or mid-career prof you can end up sacrificing your progression up the ranks (via citation count, awards, etc.) if you do not pursue “hot topics” in the field. It may be that you find those interesting, but for me there were some unknown things (like the consistency of nearest neighbor matching) that it seemed insane we didn’t know. I’m not complaining, but I am saying that these papers didn’t help my career; instead, my professorship enabled their study. One of the few ways that you can actually do both is to be established enough to promote your interests within the community and make them a viable research topic.
Mentorship. Mentoring Ph.D. students was one of the greatest privileges of my life. It is hard to give up, and I think it is the underappreciated part of a professorship. However, mentorship for its own sake is not directly appreciated, since it does not always lead to publications (some students are not as savvy about publishing as others).
Service. One has to be very motivated to be an influential member of their community to prioritize service. Service is a very nebulous thing, and it is unclear how to assess whether you are doing enough. I think that this subjectivity can lead to inequities, particularly towards women in our fields.
Teaching. Teaching is ostensibly what we are paid to do. But teaching loads can vary heavily. It is another place where inequity can creep in since we do not attempt to measure how hard a course is to teach.
Basically, the culture of the department and individual personalities have a huge impact on your career and life as a professor. My department has been very good to me, but I know of others that are not so good. I am very proud of my work in academia thus far, but I also have an itch to try my hand at industry. My goal is to lead a science team that is working toward a few difficult business goals, and to build a product or service.
There are several challenges to this study…
Addressing all of these concerns was a tour de force in data science, requiring more planning than the usual academic team. The main milestones were

Regions of upstream locations for a sampling site.

An illustration of how we form the crosswalk table between the wastewater regions and census block groups (CBGs).

The imputed concentration time series for the wastewater treatment plant compared to simpler imputation methods.
The paper has been published in ACS EST Water,
Safford, Hannah, et al. “Wastewater-Based Epidemiology for COVID-19: Handling qPCR Nondetects and Comparing Spatially Granular Wastewater and Clinical Data Trends.” ACS ES&T Water (2022).
My development setup uses pip, virtualenv, and PyCharm, and is good for simultaneously writing Python packages and doing data analyses in Jupyter. My preferred infrastructure solution is AWS. Disclaimer: I work for Amazon AWS.
I consolidate my EC2 setup using ~/.ssh/config, which greatly reduces my setup time. First, I fire up an EC2 instance, using an AMI such as one of the Deep Learning community images. I generate or use an existing key-pair pem file, and add it to my ~/.ssh folder on my local machine. I copy my Public IP; it looks like ec2-[IP].compute-1.amazonaws.com, but may differ. Then I add this to my ~/.ssh/config:
Host EC2dev
HostName ec2-[IP].compute-1.amazonaws.com
User ubuntu
IdentityFile ~/.ssh/[MY-PEM-FILE].pem
Now my EC2 server has the alias EC2dev which I use for the rest of my setup. This way every time I fire up a new instance, or stop and start this one, I just have to change this IP.
Now I am ready to set up my python package, which I usually will host on github. For example, if I am working on AutoGluon then I will clone the repo on both my server and my local machine. I use SSH keys for authentication so I need to add my rsa public keys for both machines to my github settings.
$ cd ~
$ git clone [email protected]:awslabs/autogluon.git
Alternatively, I can just scp over my repo on my local machine, which will be what we are effectively doing with pycharm deployment anyway. On my EC2 server, I will create and activate a virtualenv,
$ python3 -m pip install --upgrade pip
$ python3 -m pip install virtualenv
$ python3 -m virtualenv env
$ source env/bin/activate
This way I can install my development packages in a contained environment. I go into the package that I am writing (e.g. cd autogluon/shift) and install with pip install -e .. The -e flag means that my install updates as I make changes.
You should start by installing pycharm.
To set up PyCharm to work with your remote installation, you have to set up the remote interpreter and the deployment.
To set up the remote interpreter, you need to go to File > Preferences > Python Interpreter. In the settings, add a new interpreter and select SSH Interpreter. In Host put EC2dev and in Username put ubuntu. Then for the interpreter find the location of your virtualenv, i.e. /home/ubuntu/env/bin/python. I will also deploy my project to /home/ubuntu/autogluon so I add that mapping to the Sync folders.
To set up my deployment, I go to Tools > Deployment > Configuration and either find the SSH connection or add it. I make sure that the mapping from local path to deployment path matches my project folder in both: ~/autogluon -> /home/ubuntu/autogluon. You can test the connection.
With this set up you should be able to debug with breakpoints, unittest, and deploy your edits. If you turn on automatic upload then this should be done for you, but you can manually upload individual files as well.
First you need to have port forwarding set up on your local machine. In your ~/.ssh/config file, under the EC2dev host entry, add
LocalForward 8888 localhost:8888
On the remote instance you can launch jupyter and then access it via port forwarding. The way to do this is to launch a new screen or tmux on the remote server. Then activate the environment, install and run jupyter.
$ ssh EC2dev
$ tmux
$ source env/bin/activate
$ pip install jupyter
Then I have to add the virtual environment to jupyter:
$ pip install ipykernel
$ python -m ipykernel install --user --name=env
Then fire up jupyter:
$ jupyter notebook
The jupyter notebook server should start, and you should see a link with a token such as
http://localhost:8888/?token=...
Then I turn on ssh forwarding on my local machine with
$ ssh -N -f -L localhost:8888:localhost:8888 EC2dev
Then copy and paste the link into your browser, and you should see your jupyter instance.
There are a host of AWS services, but the ones that I have used the most are the following.
A gravitational lens occurs when the light from a background source like a galaxy bends around a massive object, forming a lens in space. However, because they are rare and this is a new survey, we are in a zero-shot learning situation, but we also have distributional shift because any training data from another survey will look completely different. We solve this by forming realistic simulated lenses, and pairing this with semi-supervised learning to transfer to the DLS.

Non-lens (left), simulated lens (center), real lens (right)
At no point do we use the real lenses to train or validate our model, rather only in testing the handful of trained models do we get to see our real lenses. The devil is in the details however, and the success of our method relied on a lot of engineering and method selection. The key components of our best method were
You can see our final work in the recent paper, which was published at AISTATS.
Stephen Sheng, Keerthi Vasan GC, Chi Po P Choi, James Sharpnack, and Tucker Jones. An unsupervised hunt for gravitational lenses. In International Conference on Artificial Intelligence and Statistics, pages 9827–9843. PMLR, 2022.
Strong gravitational lenses allow us to peer into the farthest reaches of space by bending the light from a background object around a massive object in the foreground. Unfortunately, these lenses are extremely rare, and manually finding them in astronomy surveys is difficult and time-consuming. We are thus tasked with finding them in an automated fashion with few if any, known lenses to form positive samples. To assist us with training, we can simulate realistic lenses within our survey images to form positive samples. Naively training a ResNet model with these simulated lenses results in a poor precision for the desired high recall, because the simulations contain artifacts that are learned by the model. In this work, we develop a lens detection method that combines simulation, data augmentation, semi-supervised learning, and GANs to improve this performance by an order of magnitude. We perform ablation studies and examine how performance scales with the number of non-lenses and simulated lenses. These findings allow researchers to go into a survey mostly “blind” and still classify strong gravitational lenses with high precision and recall.
We have confirmed two of the gravitational lenses that we have discovered, a process that was slowed down by the pandemic lockdowns. The first was confirmed spectroscopically (basically, seeing that it is really, really old light) at the Keck Observatory. Here is the image from the DLS survey.

The second lens was also confirmed at Keck, but we were also able to get a glimpse of it on Hubble. We can see the Hubble image (top) and the DLS image (bottom) below.

Qin Ding, Shitong Wei, me, Lifeng Wei (left to right)
Thesis: Advances in Stochastic Contextual Bandits
Qin has established herself as an expert in contextual bandits with an ambitious research agenda. She has focused on extending the scope and capabilities of contextual bandit algorithms. This includes robustness, automated hyperparameter tuning, fast stochastic updating, and non-stationarity. You can see her work via her Google Scholar page.
One thing that I think was very nice was her use of bandit-over-bandits to adapt to adversarial attacks. She found that the meta-bandit could be used to tune any number of parameters, and this actually worked very well in practice. This was even true for theoretically derived parameters such as what is used in UCB algorithms! You should not leave home without it.
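The bandit-over-bandits idea can be sketched as a meta-bandit over hyperparameter candidates (this is an illustrative toy, not Qin's algorithm: EXP3 is one standard choice for the outer bandit, and `run_block` here is a made-up stand-in for running the base bandit for a block of rounds):

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3_tune(param_grid, run_block, horizon, block_len, eta=0.1):
    # Outer EXP3 "meta-bandit": each arm is a candidate hyperparameter
    # (e.g., a UCB exploration constant). Each pull runs the base bandit
    # for one block and feeds back that block's mean reward in [0, 1].
    k = len(param_grid)
    weights = np.zeros(k)
    total = 0.0
    for _ in range(horizon // block_len):
        probs = np.exp(weights - weights.max())
        probs /= probs.sum()
        arm = rng.choice(k, p=probs)
        reward = run_block(param_grid[arm], block_len)
        total += reward * block_len
        weights[arm] += eta * reward / probs[arm]  # importance-weighted update
    return total

def run_block(alpha, block_len):
    # Toy stand-in: mean block reward peaks at exploration constant alpha = 1.
    return float(np.clip(1.0 - 0.3 * abs(alpha - 1.0)
                         + 0.05 * rng.standard_normal(), 0.0, 1.0))

total = exp3_tune([0.1, 1.0, 10.0], run_block, horizon=2000, block_len=100)
```

The outer bandit gradually concentrates its play on whichever hyperparameter earns the most reward, which is what makes the scheme usable even for theoretically derived constants.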
Thesis: Multidimensional Graph Trend Filtering
Shitong has produced a nice body of work on trend filtering for exponential families in multiple dimensions. One application was adaptive small area estimation for Covid-19 case proportions in California. A common approach to dealing with small counts in epidemiology is to average over large spatial regions. The idea behind her method is that if you have low counts during a time period, then you need a coarser spatial resolution. You can do this using spatio-temporal graph trend filtering with the Poisson likelihood over a mobility graph. In the image below you can see the different spatial regions identified during a low-count (left) and high-count (right) time period.
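One way to write the underlying objective (a sketch in my notation, not necessarily hers: \(y_i\) are case counts, \(n_i\) population offsets, \(\beta_i\) log-rates on the spatio-temporal graph nodes, and \(D\) the graph difference operator over the mobility graph's edges):

\[\hat\beta = \arg\min_{\beta} \sum_{i} \left( n_i e^{\beta_i} - y_i \beta_i \right) + \lambda \left\| D \beta \right\|_1\]

The first term is the Poisson negative log-likelihood, and the \(\ell_1\) penalty on graph differences fuses adjacent regions and time periods into constant pieces, which is why sparse data automatically yields coarser effective regions.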

Thesis: Applications of Statistics in Machine Learning Problems
Lifeng has a dissertation that represents his diverse interests and expertise. Centered around statistics in machine learning, his chapters are:
Lifeng’s efforts were critical to the operations of the modeling team in Healthy Davis Together. He worked with the SafeGraph mobility data to produce mobility networks that were used to identify commute patterns to and from Davis. One notable contribution was his work on mapping the Covid-19 wastewater sampling data to saliva testing. This was used to validate the wastewater sampling by correlating it with the test positivity within census blocks. He did this by associating nodes in the wastewater network with census blocks. Then all locations upstream of a maintenance hole were used to build the population that is upstream of the sampling location. You can see an example of upstream locations to a sampling site in the image below.
