The rapid proliferation of generative AI has raised questions about the competitiveness of lower-parameter, locally tunable, open-weight models relative to high-parameter, API-guarded, closed-weight models in terms of performance, domain adaptation, cost, and generalization. Centering under-resourced yet risk-intolerant settings in government, research, and healthcare, we see for-profit closed-weight models as incompatible with requirements for transparency, privacy, adaptability, and standards of evidence. Yet the performance penalty of using open-weight models, especially in low-data and low-resource settings, is unclear. We assess the feasibility of using smaller, open-weight models to replace GPT-4-Turbo in zero-shot, few-shot, and fine-tuned regimes, assuming access to only a single, low-cost GPU. We also assess value-sensitive issues around bias, privacy, and abstention on three additional tasks relevant to those topics. We find that with relatively low effort, very low absolute monetary cost, and relatively little data for fine-tuning, small open-weight models can achieve competitive performance on domain-adapted tasks without sacrificing generality. We then run experiments on practical issues in bias, privacy, and hallucination risk, finding that open models offer several benefits over closed models. We intend this work as a case study in understanding the opportunity cost of reproducibility and transparency over for-profit state-of-the-art zero-shot performance, finding this cost to be marginal under realistic settings.
An invasive species of grass known as "buffelgrass" contributes to severe wildfires and biodiversity loss in the Southwest United States. We tackle the problem of predicting buffelgrass "green-ups" (i.e., readiness for herbicidal treatment). To make our predictions, we explore temporal, visual, and multi-modal models that combine satellite sensing and deep learning. We find that all of our neural approaches improve over conventional buffelgrass green-up models, and we discuss how deploying neural models promises significant resource savings.
Documentation burnout contributes to clinician job dissatisfaction, and clinical notes often omit salient information. The automatic generation of notes from doctor-patient conversations using a computerized medical scribe, referred to as a Digital Scribe, provides an alternative, potentially time-saving documentation process. Generating notes from transcribed patient interviews requires reorganizing utterances by topical note sections, identifying the clinically significant information, and generating a medical-language summary. The MEDIQA-Sum task of ImageCLEF 2023 explores the development of Digital Scribes through the generation of clinical note summaries of transcribed patient visits. We participated in all three subtasks and made contributions related to note subsection classification and dialogue-note alignment. We achieved high classification accuracy (81.5%) for Subtask A by fine-tuning T5-large, which ranked 2nd among 10 participants. We explored the capabilities of state-of-the-art large language models in the Subtask B summarization task. For Subtask C, we manually annotated the alignment between dialogue transcripts and clinical notes for a subset of training examples to assist in learning the mapping from dialogue content to clinical note subsections.
Social determinants of health (SDOH) documented in the electronic health record through unstructured text are increasingly being studied to understand how SDOH impacts patient health outcomes. In this work, we utilize the Social History Annotation Corpus (SHAC), a multi-institutional corpus of de-identified social history sections annotated for SDOH, including substance use, employment, and living status information. We explore the automatic extraction of SDOH information with SHAC in both standoff and inline annotation formats using GPT-4 in a one-shot prompting setting. We compare GPT-4 extraction performance with a high-performing supervised approach and perform thorough error analyses. Our prompt-based GPT-4 method achieved an overall 0.652 F1 on the SHAC test set, similar to the 7th best-performing system among all teams in the n2c2 challenge with SHAC.
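The one-shot setting amounts to prepending a single annotated demonstration to each extraction request. The following is a sketch of such prompt assembly; the template, instruction wording, and inline-annotation format are invented for illustration and are not the paper's actual prompts or annotation schema:

```python
def build_one_shot_prompt(example_text, example_annotation, target_text):
    """Assemble a one-shot extraction prompt: a task instruction, a single
    annotated demonstration, then the note to annotate. Hypothetical
    template, not the paper's exact prompt."""
    return (
        "Extract social determinants of health (substance use, employment, "
        "living status) as inline annotations.\n\n"
        f"Note:\n{example_text}\nAnnotated:\n{example_annotation}\n\n"
        f"Note:\n{target_text}\nAnnotated:\n"
    )
```

The model's completion after the final "Annotated:" line would then be parsed back into the annotation format for scoring.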
Objective: Identifying study-eligible patients within clinical databases is a critical step in clinical research. However, accurate query design typically requires extensive technical and biomedical expertise. We sought to create a system capable of generating data model-agnostic queries while also providing novel logical reasoning capabilities for complex clinical trial eligibility criteria.
Materials and Methods: The task of query creation from eligibility criteria requires solving several text-processing problems, including named entity recognition and relation extraction, sequence-to-sequence transformation, normalization, and reasoning. We incorporated hybrid deep learning and rule-based modules for these tasks, as well as a knowledge base of the Unified Medical Language System (UMLS) and linked ontologies. To enable data-model-agnostic query creation, we introduce a novel method for tagging database schema elements using UMLS concepts. To evaluate our system, called LeafAI, we compared its capability to that of a human database programmer in identifying patients who had been enrolled in 8 clinical trials conducted at our institution. We measured performance by the number of actually enrolled patients matched by the generated queries.
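The schema-tagging idea can be illustrated with a toy sketch: each schema element carries UMLS concept unique identifiers (CUIs), so a criterion normalized to a CUI can be routed to the right table and column in any data model. All table, column, and CUI pairings below are hypothetical and are not LeafAI's actual tags or code:

```python
# Hypothetical schema tags: each (table, column) is labeled with the UMLS
# concepts (CUIs) it stores. The identifiers below are invented examples.
SCHEMA_TAGS = {
    ("condition_occurrence", "condition_concept_id"): {"C0012634"},
    ("drug_exposure", "drug_concept_id"): {"C0013227"},
    ("measurement", "measurement_concept_id"): {"C0022885"},
}

def route_criterion(criterion_cuis):
    """Return the schema elements whose UMLS tags overlap the criterion's
    CUIs (e.g., the CUI of a criterion entity or its ontology ancestors)."""
    return [element for element, tags in SCHEMA_TAGS.items()
            if tags & set(criterion_cuis)]
```

A criterion such as "history of type 2 diabetes," once normalized to a disease CUI, would be routed to the condition table of whatever data model the tags describe, which is what makes the generated queries data-model agnostic.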
Results: LeafAI matched a mean of 43% of enrolled patients, with 27,225 patients deemed eligible across the 8 clinical trials, compared with 27% matched and 14,587 deemed eligible by the human database programmer's queries. The human programmer spent 26 hours in total crafting queries, compared with several minutes for LeafAI.
Conclusions: Our work contributes a state-of-the-art data model-agnostic query generation system capable of conditional reasoning using a knowledge base. We demonstrate that LeafAI can rival a human programmer in finding patients eligible for clinical trials.
Keywords: clinical trials, natural language processing, machine learning, electronic health records, cohort definition
We adapt image inpainting techniques to impute large, irregular missing regions in urban settings characterized by sparsity, variance in both space and time, and anomalous events. Missing regions in urban data can be caused by sensor or software failures, data quality issues, interference from weather events, incomplete data collection, or varying data use regulations; any missing data can render the entire dataset unusable for downstream applications. To ensure coverage and utility, we adapt computer vision techniques for image inpainting to operate on 3D histograms (2D space + 1D time) commonly used for data exchange in urban settings.
Adapting these techniques to the spatiotemporal setting requires handling skew: urban data tend to follow population density patterns (small dense regions surrounded by large sparse areas); these patterns can dominate the learning process and fool the model into ignoring local or transient effects. To combat skew, we 1) train simultaneously in space and time, and 2) focus attention on dense regions by biasing the masks used for training to the skew in the data. We evaluate the core model and these two extensions using the NYC taxi data and the NYC bikeshare data, simulating different conditions for missing data. We show that the core model is effective qualitatively and quantitatively, and that biased masking during training reduces error in a variety of scenarios. We also articulate a tradeoff in varying the number of timesteps per training sample: too few timesteps and the model ignores transient events; too many timesteps and the model is slow to train with limited performance gain.
For more details, please refer to the repo.
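The biased-masking extension can be sketched as follows: sample the corners of training-time holes in proportion to cell counts, so dense regions are masked, and therefore reconstructed, more often than sparse ones. This is a minimal sketch under an assumed hole shape and sampling scheme, not the paper's implementation:

```python
import numpy as np

def biased_masks(hist, n_masks=4, hole=(4, 8, 8), rng=None):
    """Sample training-time holes whose locations are biased toward the
    dense cells of a 3D (time x lat x lon) histogram. Illustrative sketch
    of the biased-masking idea only; the hole shape and sampling scheme
    are assumptions."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = np.zeros_like(hist, dtype=bool)
    # Place hole corners with probability proportional to cell counts, so
    # dense (high-traffic) regions are hidden, and hence trained on,
    # more often than sparse ones.
    p = hist.ravel() / hist.sum()
    corners = rng.choice(hist.size, size=n_masks, p=p)
    dt, dy, dx = hole
    for c in corners:
        t, y, x = np.unravel_index(c, hist.shape)
        mask[t:t + dt, y:y + dy, x:x + dx] = True
    return mask  # True marks cells hidden from the model during training
```

During training, the model would be asked to reconstruct `hist` at the masked cells; because the holes concentrate where the data are dense, local and transient effects are less likely to be drowned out by the sparse background.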
We consider the use of AI techniques to expand the coverage, access, and equity of urban data. We aim to enable holistic research on city dynamics, steering AI research attention away from profit-oriented, societally harmful applications (e.g., facial recognition) and toward foundational questions in mobility, participatory governance, and justice. By making available high-quality, multi-variate, cross-scale data for research, we aim to link the macro-study of cities as complex systems with the reductionist view of cities as an assembly of independent prediction tasks. We identify four research areas in AI for cities as key enablers: interpolation and extrapolation of spatiotemporal data, using NLP techniques to model speech- and text-intensive governance activities, exploiting ontology modeling in learning tasks, and understanding the interaction of fairness and interpretability in sensitive contexts.
Topics include:
Interpolation of spatiotemporal data using deep learning
Trade-offs between distributive and procedural fairness
Hierarchical multi-label classification
Modeling governance behaviors
Objectives: We study interpretable recidivism prediction using machine learning (ML) models and analyze performance in terms of prediction ability, sparsity, and fairness. Unlike previous works, this study trains interpretable models that output probabilities rather than binary predictions, and uses quantitative fairness definitions to assess the models. This study also examines whether models can generalize across geographic locations.
Methods: We generated black-box and interpretable ML models on two different criminal recidivism datasets from Florida and Kentucky. We compared predictive performance and fairness of these models against two methods that are currently used in the justice system to predict pretrial recidivism: the Arnold PSA and COMPAS. We evaluated predictive performance of all models on predicting six different types of crime over two time spans.
Results: Several interpretable ML models can predict recidivism as well as black-box ML models and are more accurate than COMPAS or the Arnold PSA. These models are potentially useful in practice. Similar to the Arnold PSA, some of these interpretable models can be written down as a simple table. Others can be displayed using a set of visualizations. Our geographic analysis indicates that ML models should be trained separately for separate locations and updated over time. We also present a fairness analysis for the interpretable models.
Conclusions: Interpretable ML models can perform just as well as non-interpretable methods and currently-used risk assessment scales, in terms of both prediction accuracy and fairness. ML models might be more accurate when trained separately for distinct locations and kept up-to-date.
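As the Objectives note, the interpretable models studied output probabilities rather than binary predictions, and some can be written down as a simple table. The following is a hypothetical sketch of that model class; the features, point values, and coefficients are invented for illustration and are not the study's learned models:

```python
import math

def simple_score(age_under_23, priors_ge_2, prior_violent):
    """A hypothetical points-style risk table of the kind that fits on one
    page of paper; features and point values are invented."""
    return (2 if age_under_23 else 0) \
        + (3 if priors_ge_2 else 0) \
        + (1 if prior_violent else 0)

def score_to_probability(points, intercept=-2.0, scale=1.0):
    """Score-style interpretable models (e.g., RiskSLIM) pass the integer
    point total through a logistic link, so they output a recidivism
    probability rather than a yes/no label. Coefficients are illustrative."""
    return 1.0 / (1.0 + math.exp(-(intercept + scale * points)))
```

Because every point assignment is visible, a practitioner can audit exactly why an individual received a given probability, which is the property "interpretable" refers to throughout this study.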
Abstract of Main Results for Criminal Justice Practitioner
Our goal is to study the predictive performance, interpretability, and fairness of machine learning (ML) models for pretrial recidivism prediction. ML methods are known for their ability to automatically generate high-performance models from data alone, sometimes even surpassing human performance. However, many of the most common ML approaches produce "black-box" models: models that perform well but are too complicated for humans to understand. "Interpretable" ML techniques seek to produce the best of both worlds: models that perform as well as black-box approaches but are also understandable to humans. In this study, we generate multiple black-box and interpretable ML models. We compare the predictive performance and fairness of the ML models we generate against two tools currently used in the justice system to predict pretrial recidivism: the Risk of General Recidivism and Risk of Violent Recidivism scores from the COMPAS suite, and the New Criminal Activity and New Violent Criminal Activity scores from the Arnold Public Safety Assessment.
We first evaluate the predictive performance of all models, based on their ability to predict recidivism for six different types of crime: general, violent, drug, property, felony, and misdemeanor. Recidivism is defined as a new charge for which an individual is convicted within a specified time frame, which we specify as 6 months or 2 years. We consider each type of recidivism over the two time periods to control for time, rather than to consider predictions over an arbitrarily long or short pretrial period. Next, we examine whether a model constructed using data from one region suffers in predictive performance when applied to predict recidivism in another region. Finally, we consider the latest fairness definitions created by the ML community. Using these definitions, we examine the behavior of the interpretable models, COMPAS, and the Arnold Public Safety Assessment, on race and gender subgroups.
Our findings and contributions can be summarized as follows:
We contribute a set of interpretable ML models that can predict recidivism as well as black-box ML methods and better than COMPAS or the Arnold Public Safety Assessment for the location they were designed for. These models are potentially useful in practice. Similar to the Arnold Public Safety Assessment, some of these interpretable models can be written down as a simple table that fits on one page of paper. Others can be displayed using a set of visualizations.
We find that recidivism prediction models that are constructed using data from one location do not tend to perform as well when they are used to predict recidivism in another location, leading us to conclude that models should be constructed on data from the location where they are meant to be used, and updated periodically over time.
We reviewed the recent literature on algorithmic fairness, but most fairness criteria do not pertain to risk scores; they pertain only to yes/no classification decisions. Since we are interested in criminal justice risk scores in this work, the vast majority of algorithmic fairness criteria are not relevant. We chose to focus on the evaluation criteria that were relevant, namely calibration and balanced group AUC (BG-AUC). We present an analysis of these fairness measures for two of the interpretable models (RiskSLIM and Explainable Boosting Machine) and the Arnold Public Safety Assessment (New Criminal Activity score) on the two-year general recidivism outcome in Kentucky. We found that the fairness criteria were approximately met for both interpretable models for Black/white and male/female subgroups; that is, the models were fair according to these criteria. The Arnold Public Safety Assessment's New Criminal Activity score failed to satisfy calibration for higher values of the score. The results on fairness were not as consistent for the "Other" race category, and they are difficult to interpret due to the low resolution of the race data.
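The two retained criteria, calibration and per-group AUC, can be computed per subgroup with a short sketch like the following. The score bins are generic and the AUC uses the rank-sum identity; this is illustrative, not the study's exact evaluation code:

```python
import numpy as np

def group_fairness_report(scores, labels, groups):
    """Per-group calibration and AUC, the two criteria the study retained.
    Generic bin edges and a rank-sum AUC (ties not averaged); assumes both
    outcome classes are present in every group. Illustrative only."""
    report = {}
    for g in np.unique(groups):
        s, y = scores[groups == g], labels[groups == g]
        # Calibration: the observed recidivism rate inside each score bin
        # should track the bin's score level, for every subgroup.
        bins = np.linspace(0, 1, 6)
        idx = np.digitize(s, bins[1:-1])
        calibration = [y[idx == b].mean() if (idx == b).any() else np.nan
                       for b in range(len(bins) - 1)]
        # AUC via the Mann-Whitney rank-sum identity.
        order = np.argsort(s, kind="mergesort")
        ranks = np.empty(len(s))
        ranks[order] = np.arange(1, len(s) + 1)
        n_pos = y.sum()
        n_neg = len(y) - n_pos
        auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
        report[g] = {"calibration": calibration, "auc": auc}
    return report
```

Comparing the per-group calibration curves and AUCs across race and gender subgroups is what the approximate-fairness finding above refers to.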
Background and Purpose: Rates of emergency medical services (EMS) utilization for acute stroke remain low nationwide, despite the time-sensitive nature of the disease. Prior research suggests several demographic and social factors are associated with EMS use. We sought to evaluate which demographic or socioeconomic factors are associated with EMS utilization in our region, thereby informing future education efforts.
Methods: We performed a retrospective analysis of patients for whom the stroke code system was activated at 2 hospitals in our region. Univariate and logistic regression analysis was performed to identify factors associated with use of EMS versus private vehicle.
Results: EMS use was lower among patients who were younger, had higher incomes, were married, were more educated, or identified as Hispanic. Those arriving by EMS had significantly faster arrival-to-code, arrival-to-imaging, and arrival-to-thrombolytic-treatment times.
Conclusion: Analysis of regional data can identify specific populations underutilizing EMS for acute stroke symptoms. Factors affecting EMS utilization vary by region, and this information may be useful for targeted education programs promoting EMS use for acute stroke symptoms. EMS use results in more rapid evaluation and treatment of stroke patients.
Cognitive impairment (CI) is defined as the loss of ability in cognitive functions, such as remembering, learning, and concentrating, which negatively impacts affected individuals' daily activities. In the stage of mild cognitive impairment (MCI), affected individuals start to experience memory issues without seriously hindering their ability to execute daily activities. In the stage of severe cognitive impairment, referred to as dementia, individuals tend to lose basic abilities to comprehend, memorize, or even talk and write. Many diseases are associated with the development of CI, such as Alzheimer's disease (AD), vascular dementia, Parkinson's disease (PD), progressive supranuclear palsy, and Lewy body disease.
There have been previous studies on factors associated with cognitive impairment. For example, using discovery and multiple replication cohorts, Davies et al. identified several significant genetic loci associated with CI, such as rs2075650 and rs115566 located in TOMM40 and rs429358 located in the APOE region. Lv et al. probed the association between the rate of cognitive decline and the mortality rate, concluding that a faster rate of cognitive decline is associated with a higher mortality rate, specifically among individuals aged 65 to 79 and among cognitively normal individuals, regardless of their initial cognitive abilities. With respect to non-genetic factors, Casanova et al. constructed a random forest model to investigate important predictors of cognitive trajectories, identifying education, age, and gender as top predictors. Many other studies have also analyzed the genetic and non-genetic contributors to CI in community-based cohorts. Nevertheless, to our knowledge, few studies have analyzed the effects of both genetic and non-genetic factors on CI. Consequently, the primary goal of our research is to investigate which factors, both genetic and non-genetic, are significantly associated with cognitive impairment, in contrast to intact cognition, in late life.
In our study, we split 2156 individuals from the Chinese Longitudinal Healthy Longevity Survey (CLHLS) data into two groups, establishing a phenotype of exceptional longevity and normal cognition versus cognitive impairment. We conducted a genome-wide association study (GWAS) to identify significant genetic variants and biological pathways that are associated with cognitive impairment and used these results to construct polygenic risk scores. We elucidated the important and robust factors, both genetic and non-genetic, in predicting the phenotype, using several machine learning models.
The GWAS identified 28 significant SNPs at the $p < 3 \times 10^{-5}$ significance level, and we pinpointed four genes (ESR1, PHB, RYR3, and GRIK2) that are associated with the phenotype through immunological systems, brain function, metabolic pathways, inflammation, and diet in the CLHLS cohort. Using both genetic and non-genetic factors, four machine learning models achieved close prediction results for the phenotype, measured by Area Under the Curve: random forest (0.782), XGBoost (0.781), support vector machine with a linear kernel (0.780), and $\ell_2$-penalized logistic regression (0.780). The top four important and congruent features in predicting the phenotype identified by these four models are polygenic risk score, sex, age, and education.
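The polygenic risk score feature is, in its standard additive form, a weighted sum of effect-allele dosages over selected SNPs. Below is a minimal sketch of that construction; the study's pipeline includes additional steps, such as selecting SNPs from the GWAS results, that are omitted here:

```python
import numpy as np

def polygenic_risk_score(dosages, betas):
    """Standard additive PRS: for each individual, sum the GWAS effect
    sizes of the selected SNPs weighted by that individual's count of
    effect alleles (0, 1, or 2). Minimal sketch, not the study's pipeline."""
    dosages = np.asarray(dosages, dtype=float)  # (n_individuals, n_snps)
    betas = np.asarray(betas, dtype=float)      # (n_snps,)
    return dosages @ betas
```

The resulting per-individual score is then combined with non-genetic covariates (sex, age, education) as input features to the machine learning models compared above.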