Xuan Wang

AAAI 2025 Tutorial

2025-02-25T09:00:00-08:00

AI for Science in the Era of Large Language Models

Department of Computer Science, Virginia Tech, USA

Time: February 26, 2025, 8:30 am - 12:30 pm EST

Location: Philadelphia, Pennsylvania @ Philadelphia Convention Center, Room 119A

Abstract:

The capabilities of AI in the realm of science span a wide spectrum, from the atomic level, where it solves partial differential equations for quantum systems, to the molecular level, predicting chemical or protein structures, and even extending to societal predictions like infectious disease outbreaks. Recent advancements in large language models (LLMs), exemplified by models like ChatGPT, have showcased significant prowess in tasks involving natural language, such as translating languages, constructing chatbots, and answering questions. When we consider scientific data, we notice a resemblance to natural language in terms of sequences – scientific literature and health records presented as text, bio-omics data arranged in sequences, or sensor data like brain signals. The question arises: Can we harness the potential of these recent LLMs to drive scientific progress? In this tutorial, we will explore the application of large language models to three crucial categories of scientific data: 1) textual data, 2) biomedical sequences, and 3) brain signals. Furthermore, we will delve into LLMs’ challenges in scientific research, including ensuring trustworthiness, achieving personalization, and adapting to multi-modal data representation.

Tutorial Recording:

A recording of our tutorial can be found here.

Slides [Combined]:

Introduction [Slides]
Part I: Scientific Text [Slides]
Part II: Brain Signals [Slides]
Part III: Biological Sequences [Slides]
Summary [Slides]

Presenters:

Xuan Wang is an Assistant Professor in the Department of Computer Science Department at Virginia Tech. Her research interests are in natural language processing, data mining, AI for sciences, and AI for healthcare. Her current research directions include natural language understanding with limited supervision, complex reasoning and planning with large language models, and scientific discoveries with multi-modal science foundation models. She was a recipient of the Cisco Research Award 2025, NSF NAIRR Pilot Award 2024-2025, and NAACL Best Demo Paper Award 2021. She received a Ph.D. degree in Computer Science, an M.S. degree in Statistics, and an M.S. degree in Biochemistry from the University of Illinois Urbana-Champaign in 2022, 2017, and 2015, respectively, and a B.S. degree in Biological Science from Tsinghua University in 2013. She has delivered tutorials in IEEE-BigData 2019, WWW 2022, KDD 2022, and EMNLP 2024.

EMNLP 2024 Tutorial

2024-11-16T09:00:00-08:00

AI for Science in the Era of Large Language Models

Zhenyu Bi, Minghao Xu, Jian Tang, Xuan Wang

Department of Computer Science, Virginia Tech, USA

Mila - Quebec AI Institute, Canada

Time: November 16, 2024, 9 am - 12:30 pm EST

Location: Miami, Florida @ Monroe Ballroom (Terrace Level)

Abstract:

Tutorial Recording:

A recording of our tutorial can be found here.

Slides [Combined]:

Introduction [Slides]
Part I: Scientific Text [Slides]
Part II: Brain Signals [Slides]
Part III: Biological Sequences [Slides]
Summary [Slides]

Presenters:

	Zhenyu Bi is a Ph.D. student in the Computer Science Department at Virginia Tech. His research area lies in the field of natural language processing, emphasizing real-world applications of Large Language Models. He is mainly interested in information extraction with weak supervision, especially text mining and event extraction; as well as fact-checking and trustworthy NLP. He received an M.S. degree in Intelligent Information Systems from Carnegie Mellon University in 2023, a B.S. degree in Cognitive Science, and a B.S. Degree in Computer Science from the University of California, San Diego in 2021.
	Minghao Xu is a Ph.D. student at Mila - Quebec AI Institute, Canada. His research interests mainly lie in protein function understanding and protein design. He aims to understand diverse protein functions with joint guidance from protein sequences, structures, and biomedical text, especially boosted by large-scale multi-modal pre-training. He is also pursuing structure- and sequence-based protein design via generative AI, geometric deep learning and dry-wet experiment closed looping. He has given an Oral presentation at the main conference of ICML’23.
	Jian Tang is an Associate Professor at Mila - Quebec AI Institute, Canada. His long-term interests focus on understanding the language of life (DNA, RNAs, and Proteins) with generative AI and geometric deep learning, with applications in biomedicine and synthetic biology. His group has developed one of the first open-source machine learning frameworks on drug discovery, TorchDrug (for small molecules) and TorchProtein (for proteins), and developed the first diffusion models for 3D molecular structure generation, GeoDiff (among the 50 most cited AI paper in 2022). He has given a few tutorials at international AI and data mining conferences including KDD 2017, AAAI 2019, AAAI 2022.
	Xuan Wang is an Assistant Professor in the Computer Science Department at Virginia Tech. Her research focuses on natural language processing and text mining, emphasizing applications to science and healthcare domains. Her current projects include NLP and text mining with extremely weak supervision; text-augmented knowledge graph reasoning; fact-checking and trustworthy NLP, AI for science; and AI for healthcare. She received a Ph.D. degree in Computer Science, an M.S. degree in Statistics, and an M.S. degree in Biochemistry from the University of Illinois Urbana-Champaign in 2022, 2017, and 2015, respectively, and a B.S. degree in Biological Science from Tsinghua University in 2013. She has delivered tutorials in IEEE-BigData 2019, WWW 2022, and KDD 2022.

CS 6804 2024 Spring

2024-01-16T09:00:00-08:00

CS 6804: Trustworthiness of Large Language Models

Note: This page will not be updated after January 15, 2024. The most up-to-date course syllabus and materials can be found on Canvas.

Time and Location: Monday and Wednesday, 5:30 PM - 6:45 PM. DDS 180.

Instructor: Xuan Wang

Office: Data and Decision Sciences 376
Phone: (540) 231-4061
Email: xuanw [at] vt [dot] edu
Office Hours: Monday 4:00 PM - 5:00 PM (by appointment)

Course Description

Large Language Models (LLMs) and applications powered by them have recently received tremendous attention, especially since ChatGPT was released in November 2022. This course will provide an introduction to state-of-the-art methods and tools aimed at bolstering the trustworthiness of LLMs, spanning both their theoretical underpinnings and practical implementations. This course begins with an introduction to LLMs and a retrieval-augmented generation question-answering application. Various LLM trustworthiness metrics will be discussed in detail, encompassing diverse topics including relevance, groundedness, confidence, calibration, uncertainty, explainability, privacy, fairness, toxicity, and adversarial attacks. Applications in critical domains, such as healthcare, education, and security, will also be covered. Finally, students will work on a course project focusing on open research questions in trustworthy LLMs. The course materials are adapted from Stanford CS 329T: Trustworthy Machine Learning - Large Language Models and Applications.

Prerequisites

Students should have experience with machine learning, data analytics, and deep learning. Strong programming skills in a high-level language such as Python, as well as frameworks for rapid deep learning prototyping, e.g., PyTorch, Tensorflow, Keras, etc., are essential for implementing and experimenting with the concepts covered in this course. While not mandatory, familiarity with natural language processing would be advantageous.

Course Format

The course is a role-playing paper reading seminar that is structured around reading, presenting, and discussing weekly papers. Each class will involve the presentation and discussion of one paper. Each student will have a unique, rotating role per week. This role defines the lens through which each student reads the paper and determines what they prepare for the group in-class discussion. All students, irrespective of their role, are expected to have read the paper before each class and come to class ready to discuss. There will be no exams or traditional assignments. Instead, students will work on a course project focusing on open research questions in trustworthy LLMs.

Presentation Roles: This seminar is organized around the different “roles” students play each week, that define the lens through which students read the paper. Students will be divided into two groups, one group presenting on Tuesdays and the other on Thursdays. In a given class session, students in the presenting groups will each be given a rotating role (described below): Presenter, Reviewer, Archaeologist, Academic Researcher, Industry Expert, and Host. Presenting groups should create a formal presentation, i.e., have slides prepared for the group in-class discussion. For each student in a presenting group, their assigned role determines what they should include in the slides.

Depending on changes in course enrollment, the roles might change, for example, remove roles or make roles optional in case enrollment decreases or allow more students for each role in the event of enrollment increase.

Presenter (1-2 students): Create the main presentation for around 30 min, describing the motivation, problem definition, method, and experimental findings of this paper. Each student should serve as the presenter at least once.
Reviewer (1-5 students): Complete a full—critical but not necessarily negative—review of the paper. Follow the guidelines for NeurIPS reviewers (under “Review Content”). Please answer questions 1-6 under “Review Content”, and assign an Overall score (question 9) and a Confidence score (question 10). Skip the rest of the review, including writing a summary. Note that you can bypass questions by filling N/A. For example, you really liked the paper and can’t think of any disadvantages. Therefore you can skip the respective question (but use this skip option sparingly). Also, please note that this role does not require going over related work, and is not an exhaustive list of all arguments you can think of. The goal is to enhance your overall critical thinking. The instructor reserves the right to contact students who overuse the N/A option.
Archaeologist (1-5 students): Determine where this paper sits in the context of previous and subsequent work. Find and report on one older paper that has substantially influenced the current paper and one newer paper citing this current paper. Create 3-5 slides for each paper, summarizing the key points such as the research problem, methodology innovation, and main results.
Researcher (1-3 students): Propose an imaginary follow-up project that has now become possible due to the existence and success of the current paper.
Industry Expert (1-3 students): Propose a new application or company product for the method in the paper (not already discussed in class), and discuss at least one positive and negative impact of this application. Convince your industry boss that it’s worth investing time and money to implement this paper. Your arguments should be particularly applicable to the chosen industry market.
Host (1-2 students): Host the whole discussion in class, including introducing presenters and answering all the questions proposed by the non-presentating group on Piazza.

Non-presenter Assignment:

If you are not in the presenting group during a class session, please submit the day before class (due 11:59pm EST) at least one question on Piazza non-anonymously. The questions can be about either paper (e.g., something you’re confused about) or something you’d like to hear discussed more. Questions that open debates and make in-class discussions explore different viewpoints are a plus.
After class and before the end of the day 11:59pm EST, provide constructive feedback to the presenting group on Piazza non-anonymously. You may focus on one or more reading roles, or the presentation holistically. Evaluate the clarity of the presentation, the strength of the arguments, and the quality of visuals, if any. Highlighting strengths and areas for improvement. This feedback will be shared post each presentation.

Everyone, every week (Optional): After each class session, you may post your thoughts on Piazza. For example, which parts did you enjoy reading, what results and insights you found interesting, a missing result the paper could have included, any useful additional links and resources, etc. Whenever you agree with the comments of a student’s post, make sure to endorse their answer. You can also post a reply with your additional thoughts.

Final Project: The main project goal is to engage students in research on trustworthy LLMs. In particular, students should try to extend papers from topics covered in class and present the research outcomes as a research paper, in a standard conference paper format (NeurIPS or ACL). Students are encouraged to work in groups of no more than four members, taking into consideration that the work produced should be proportional to the number of members in a team. Groups are required to include a “contributions” section in the final project report, listing each member’s contributions in detail. Projects should include a written report accompanied by the code (in a zip file or on GitHub). In addition, groups will present their final projects during the last two class sessions. A PowerPoint or LaTex final presentation is required.

Technology: Piazza will be used for announcements, general questions, and discussions.

Grading Policy

Readings: 60 points: Each student will be in the presenting role for 10 classes and the non-presenting role for the other 10 classes. You can earn up to 4 points each time you present (all presenting roles are considered equal). You will receive full credit if you do a thorough job of undertaking your role and present it clearly and compellingly. When you aren’t presenting, you can earn up to 2 points by completing the non-presenting assignment and by participating in the class.

Final Project: 40 points divided into the following categories:

Proposal: 5 points.
In-class presentation: 15 points; your final presentation should be clear to the audience and provide a solid review of your work as if you were presenting at a conference.
Final report: 20 points:
- Novelty: 5 points; your paper should propose something new (a new method, application, or perspective).
- Clarity: 10 points; your paper should be readable, contain well-defined and clear motivation and contribution statements and appropriately make connections with related work. In general, your project report should follow standard machine learning conference paper formatting and style.
- Code: 5 points; the code accompanying your project should be well-documented and your experimental results should be reproducible. Your repository should include a README file with full instructions on how to run the code. Moreover, your code should be easy to run with one simple command; if there are multiple steps involved, please make a bash script.

Late and Missed Work

All assignments are due on the date assigned at the listed time. No late assignments will be accepted.

Course Schedule

This is a tentative lecture schedule. Please check the schedule on Canvas for recent updates. The papers can be found in this folder.

Date	Topic and Papers
01/17 (Wed)	Introduction
01/22 (Mon)	Introduction
	Groundedness
01/24 (Wed)	Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
01/29 (Mon)	TRUE: Re-evaluating Factual Consistency Evaluation
01/31 (Wed)	Measuring Reliability of Large Language Models through Semantic Consistency
02/05 (Mon)	RARR: Researching and Revising What Language Models Say, Using Language Models
02/07 (Wed)	Do Language Models Know When They’re Hallucinating References?
02/12 (Mon)	SELFCHECKGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
02/14 (Wed)	The Internal State of an LLM Knows When It’s Lying
	Confidence
02/19 (Mon)	Teaching models to express their uncertainty in words
02/21 (Wed)	Reducing conversational agents’ overconfidence through linguistic calibration
02/26 (Mon)	SEMANTIC UNCERTAINTY: LINGUISTIC INVARIANCES FOR UNCERTAINTY ESTIMATION IN NATURAL LANGUAGE GENERATION
02/28 (Wed)	GENERATING WITH CONFIDENCE: UNCERTAINTY QUANTIFICATION FOR BLACK-BOX LARGE LANGUAGE MODELS
03/04 (Mon)	Spring Break (No Class)
03/06 (Wed)	Spring Break (No Class)
03/11 (Mon)	Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
	Explainability
03/13 (Wed)	Axiomatic Attribution for Deep Networks
03/18 (Mon)	Representer Point Selection for Explaining Deep Neural Networks
03/20 (Wed)	The Explanation Game: Explaining Machine Learning Models Using Shapley Values
03/25 (Mon)	Estimating Training Data Influence by Tracing Gradient Descent
03/27 (Wed)	Understanding Black-box Predictions via Influence Functions
04/01 (Mon)	Influence Patterns for Explaining Information Flow in BERT
04/03 (Wed)	Studying Large Language Model Generalization with Influence Functions
04/08 (Mon)	Universal and Transferable Adversarial Attacks on Aligned Language Models
	Invited Talks
04/10 (Wed)	Yu Meng (University of Virginia) Title: Synthetic Supervision Generation with Large Language Models Abstract: Large language models (LLMs) have demonstrated remarkable text generation capability, and an emerging machine learning paradigm for NLP tasks is to replace manually-labeled training data use with synthetic data generated by LLMs. In this talk, I will present my past research on using LLMs as training data generators for zero-/few-shot learning in NLP. In the first part, I will introduce a simple method to prompt LLMs to generate training data for different types of NLP tasks. To overcome noise in the synthetic training data, two regularization methods will be employed. The resulting model trained on synthetic data demonstrates strong performance, even outperforming models trained with human annotations. In the second part, I will discuss how to extend the zero-shot training data generation method to few-shot scenarios by effectively leveraging a small amount of manually-annotated data. Collectively, the methods introduced in this talk represented early attempts to use LLMs for high-quality synthetic supervision generation. Bio: Yu Meng is an assistant professor of the Computer Science department at the University of Virginia. His research focuses on text representation learning and weakly-supervised learning, with 40+ papers published in machine learning and NLP conferences (NeurIPS, ICML, ICLR, ACL, EMNLP, NAACL, etc.) and 8 conference tutorials delivered in AI conferences (KDD, WWW, AAAI, etc.). He received the Google PhD Fellowship (2021). He has served as the area chair/action editor of ML/LLM venues including ICML, NeurIPS, TMLR and COLM.
04/15 (Mon)	Jessica Newman (UC Berkeley) Title: Bridging the Gap Between AI Trustworthiness and Risk Management Abstract: In this lecture, Jessica Newman will discuss the history and landscape of AI trustworthiness and risk management within policy debates in the United States and Europe. In particular, she will discuss the US National Institutes of Standards and Technology (NIST) AI Risk Management Framework (AI RMF) and NIST’s characteristics of trustworthiness for artificial intelligence. She will then introduce the Taxonomy she developed with her team at UC Berkeley, reviewing some of the 150 properties of trustworthiness they document, as well as how they connected the properties of trustworthiness to particular times during the AI lifecycle and to particular sub-categories of the AI RMF. Bio: Jessica Newman is the Founder and Director of the AI Security Initiative, housed at the UC Berkeley Center for Long-Term Cybersecurity. She is also the Founding Co-Director of the UC Berkeley AI Policy Hub and Co-Director of the university research group, the Algorithmic Fairness and Opacity Group (AFOG). Through these programs, she leads partnerships with the public and private sector, supports graduate student researchers, facilitates university events, and publishes and speaks widely on the governance, policy, and global security implications of artificial intelligence. Jessica has published more than 100 papers and articles on the societal implications of AI and emerging technologies in outlets including Brookings TechStream, Tech Policy Press, and The Georgetown Journal of International Affairs. She serves on several international AI governance and standards bodies including the OECD Expert Group on AI Risk & Accountability and the IEEE Working Group on Recommended Practice for Organizational Governance of Artificial Intelligence.
04/17 (Wed)	Haohan Wang (University of Illinois Urbana-Champaign) Title: Guardian of Trust in Language Models: Automatic Jailbreak and Systematic Defense Abstract: Large Language Models (LLMs) excel in Natural Language Processing (NLP) with human-like text generation, but the misuse of them has raised a significant concern. In this talk, we introduce an innovative system designed to address these challenges. Our system leverages LLMs to play different roles, simulating various user personas to generate “jailbreaks” – prompts that can induce LLMs to produce outputs contrary to ethical standards or specific guidelines. Utilizing a knowledge graph, our method efficiently creates new jailbreaks, testing the LLMs’ adherence to governmental and ethical guidelines. Empirical validation on diverse models, including Vicuna-13B, LongChat-7B, Llama-2-7B, and ChatGPT, has demonstrated its efficacy. The system’s application extends to Visual Language Models, highlighting its versatility in multimodal contexts. The second part of our talk shifts focus to defensive strategies against such jailbreaks. Recent studies have uncovered various attacks that can manipulate LLMs, including manual and gradient-based jailbreaks. Our work delves into the development of robust prompt optimization as a novel defense mechanism, inspired from principled solutions from trustworthy machine learning. This approach involves system prompts – parts of the input text inaccessible to users – and aims to counter both manual and gradient-based attacks effectively. Despite current methods, adaptive attacks like GCG remain a challenge, necessitating a formalized defensive objective. Our research proposes such an objective and demonstrates how robust prompt optimization can enhance the safety of LLMs, safeguarding against realistic threat models and adaptive attacks. Bio: Haohan Wang is an assistant professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. His research focuses on the development of trustworthy machine learning methods for computational biology and healthcare applications, such as decoding the genomic language of Alzheimer’s disease. In his work, he uses statistical analysis and deep learning methods, with an emphasis on data analysis using methods least influenced by spurious signals. Wang earned his PhD in computer science through the Language Technologies Institute of Carnegie Mellon University. In 2019, Wang was recognized as the Next Generation in Biomedicine by the Broad Institute of MIT and Harvard because of his contributions in dealing with confounding factors with deep learning.
04/22 (Mon)	Ziang Xiao (Johns Hopkins University) Title: AI-mediated Human-Information Interaction. Abstract: Bio: Ziang Xiao is an Assistant Professor in the Department of Computer Science at Johns Hopkins University. His main research areas are human-computer interaction and natural language processing. Ziang’s research is motivated by the fundamental question of understanding humans at scale. His goal is to study human-AI interaction to expand our knowledge about our behaviors and decision-making processes. His recent projects centered around three topics, AI for Social Science, Human-centered model evaluation, and human-information interaction.
04/24 (Wed)	Boxin Wang (Nvidia) Title: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models Abstract: Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance – where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives – including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/. Bio: Boxin Wang is a research scientist at NVIDIA. He obtained his Ph.D. from the Computer Science Department of the University of Illinois at Urbana-Champaign (UIUC). He was awarded with NeurIPS Outstanding Paper Award, NeurIPS Scholar Award, Yunni & Maxine Pao Memorial Fellowship, and has been selected as Two Sigma Fellowship Finalist and The Norton Labs Graduate Fellowship Finalist. He had multiple research internships at Google, Microsoft, and NVIDIA. His research interests are scalable and trustworthy large language models (LLMs), including scaling LLMs with retrieval augmentation, exploring the vulnerabilities of existing state-of-the-art LLMs, as well as designing robust, private, and generalizable models for social goods. Additional information is available at https://boxin.wang.
04/29 (Mon)	Project Presentations
05/01 (Wed)	Project Presentations

Accommodation Statement

Virginia Tech welcomes students with disabilities into the University’s educational programs. The University promotes efforts to provide equal access and a culture of inclusion without altering the essential elements of coursework. If you anticipate or experience academic barriers that may be due to disability, including but not limited to ADHD, chronic or temporary medical conditions, deaf or hard of hearing, learning disability, mental health, or vision impairment, please contact the Services for Students with Disabilities (SSD) office (540-231-3788, [email protected], or visit ssd.vt.edu to an external site). If you have an SSD accommodation letter, please meet with the instructors privately during office hours as early in the semester as possible to deliver your letter and discuss your accommodations. You must give the instructors a reasonable notice to implement your accommodations, which is generally 5 business days and 10 business days for final exams.

Academic Integrity Statement

The tenets of the Virginia Tech Graduate Honor Code will be strictly enforced in this course, and all assignments shall be subject to the stipulations of the Graduate Honor Code. Any suspected violations of the Honor Code will be promptly reported to the honor system. Honesty in your academic work will develop into professional integrity. The faculty and students of Virginia Tech will not tolerate any form of academic dishonesty. For more information on the Graduate Honor Code, please refer to the GHS Constitution.

Absence Policy

Regular class attendance is expected of all students. However, attendance will not be taken and will not be used in determining your final course grade in this class.

VT Principles of Community Statement

KDD 2022 Tutorial

2022-08-14T09:00:00-07:00

New Frontiers of Scientific Text Mining: Tasks, Data, and Tools

Xuan Wang, Hongwei Wang, Heng Ji, Jiawei Han

Department of Computer Science, University of Illinois at Urbana-Champaign

Time: Aug 14, 2022, 9am - 12pm ET

Location: Washington DC, USA

Abstract:

Exploring the vast amount of rapidly growing scientific text data is highly beneficial for real-world scientific discovery. However, scientific text mining is particularly challenging due to the lack of specialized domain knowledge in natural language context, complex sentence structures in scientific writing, and multi-modal representations of scientific knowledge. This tutorial presents a comprehensive overview of recent research and development on scientific text mining, focusing on the biomedical and chemistry domains. First, we introduce the motivation and unique challenges of scientific text mining. Then we discuss a set of methods that perform effective scientific information extraction, such as named entity recognition, relation extraction, and event extraction. We also introduce real-world applications such as textual evidence retrieval, scientific topic contrasting for drug discovery, and molecule representation learning for reaction prediction. Finally, we conclude our tutorial by demonstrating, on real-world datasets (COVID-19 and organic chemistry literature), how the information can be extracted and retrieved, and how they can assist further scientific discovery. We also discuss the emerging research problems and future directions for scientific text mining.

Tutorial Recording:

A recording of our tutorial will be available after the conference.

Slides [Combined]:

Introduction [Slides]
Part I: Scientific Information Extraction and Analysis [Slides]
Part II: Scientific Information Search and Evidence Mining [Slides]
Part III: Topic Discovery, Text Classification, and Multi-Dimensional Text Analysis [Slides]
Summary and Future Directions [Slides]

Presenters:

	Xuan Wang is a Ph.D. candidate at Computer Science Department, University of Illinois at Urbana-Champaign. Her research focuses on mining and constructing structured knowledge from massive unstructured corpora with minimum human supervision, emphasizing applications to biological and health sciences. She received M.S degree in Statistics and M.S. degree in Biochemistry from University of Illinois at Urbana-Champaign in 2017 and 2015, respectively, and B.S. degree in Biological Science from Tsinghua University in 2013. She is the recipient of YEE Fellowship Award in 2020-2021.
	Hongwei Wang is a postdoctoral researcher at Computer Science Department, University of Illinois Urbana-Champaign. His research interests include machine learning and data mining, particularly in graph representation learning mechanisms, algorithms, and their applications in real-world data mining scenarios such as knowledge graphs, recommender systems, social networks, and sentiment analysis. He received Ph.D. degree from Department of Computer Science, Shanghai Jiao Tong University in 2018, and B.E. degree from ACM Class, Shanghai Jiao Tong University in 2014. He was a postdoctoral researcher at Computer Science Department, Stanford University, from 2019 to 2021. He was one of the recipients of 2020 CCF (China Computer Federation) Outstanding Doctoral Dissertation Award and 2018 Google Ph.D. Fellowship.
	Heng Ji is a Professor at Computer Science Department of University of Illinois Urbana-Champaign, and an Amazon Scholar. She received her B.A. and M. A. in Computational Linguistics from Tsinghua University, and her M.S. and Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially on Multimedia Multilingual Information Extraction, Knowledge Base Population and Knowledge-driven Generation. She was selected as “Young Scientist” and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. The awards she received include “AI’s 10 to Watch” Award by IEEE Intelligent Systems in 2013, NSF CAREER award in 2009, Google Research Award in 2009 and 2014, IBM Watson Faculty Award in 2012 and 2014 and Bosch Research Award in 2014-2018, Amazon AWS Award in 2021, ACL2020 Best Demo Paper Award, and NAACL2021 Best Demo Paper Award. She has coordinated the NIST TAC Knowledge Base Population task since 2010. She has served as the Program Committee Co-Chair of many conferences including NAACL-HLT2018. She is elected as the North American Chapter of the Association for Computational Linguistics (NAACL) secretary 2020-2021. Additional information is available at my website.
	Jiawei Han is Michael Aiken Chair Professor, Department of Computer Science, University of Illinois at Urbana-Champaign. His research areas encompass data mining, text mining, data warehousing, and information network analysis, with over 1000 research publications. He is Fellow of ACM, Fellow of IEEE, and received numerous prominent awards, including ACM SIGKDD Innovation Award (2004) and IEEE Computer Society W. Wallace McDowell Award (2009). He delivered 50+ conference tutorials or keynote speeches (e.g., SIGKDD 2017-2021 tutorials and WSDM 2018 keynote).

The Web Conference 2022 Tutorial

2022-04-26T15:45:00-07:00

Modern Natural Language Processing Techniques for Scientific Web Mining: Tasks, Data, and Tools

Xuan Wang, Hongwei Wang, Heng Ji, Jiawei Han

Department of Computer Science, University of Illinois at Urbana-Champaign

Time: April 26, 2022, 15:45 PM - 17:15 PM (CET)

Location: Online, hosted by Lyon, France

Abstract:

This tutorial targets researchers and practitioners who are interested in natural language processing (NLP) technologies for scientific web mining. Exploring the vast amount of rapidly growing scientific literature available on the web is highly beneficial for scientific discovery. However, scientific web mining is particularly challenging due to the lack of specialized domain knowledge in natural language context, complex sentence structures in scientific writing, and multi-modal representations of scientific knowledge.

This tutorial presents a comprehensive overview of recent research and development on using NLP techniques for scientific web mining, focusing on the biomedical and chemistry domains. First, we introduce the motivation and unique challenges of web mining in the scientific domains. Then we discuss a set of methods that perform effective information extraction (named entity recognition, relation extraction, and event extraction), information retrieval (textual evidence retrieval, cross-modal molecule retrieval, and chemical reaction tracking) from scientific literature, and their applications on reaction prediction. Finally, we conclude our tutorial by demonstrating, on real-world datasets (COVID-19 and organic chemistry literature), how the information can be extracted and retrieved, and how they can assist further exploratory analysis. We also discuss the emerging research problems and future directions of using NLP techniques for scientific web mining.

Tutorial Recording:

A recording of our tutorial is available here.

Slides (Combined):

Introduction [Slides]
Scientific Information Extraction and Analysis [Slides]
Scientific Information Search and Evidence Mining [Slides]
Summary and Future Directions [Slides]

Presenters:

	Xuan Wang is a Ph.D. candidate at Computer Science Department, University of Illinois at Urbana-Champaign. Her research focuses on mining and constructing structured knowledge from massive unstructured corpora with minimum human supervision, emphasizing applications to biological and health sciences. She received M.S degree in Statistics and M.S. degree in Biochemistry from University of Illinois at Urbana-Champaign in 2017 and 2015, respectively, and B.S. degree in Biological Science from Tsinghua University in 2013. She is the recipient of YEE Fellowship Award in 2020-2021.
	Hongwei Wang is a postdoctoral researcher at Computer Science Department, University of Illinois Urbana-Champaign. His research interests include machine learning and data mining, particularly in graph representation learning mechanisms, algorithms, and their applications in real-world data mining scenarios such as knowledge graphs, recommender systems, social networks, and sentiment analysis. He received Ph.D. degree from Department of Computer Science, Shanghai Jiao Tong University in 2018, and B.E. degree from ACM Class, Shanghai Jiao Tong University in 2014. He was a postdoctoral researcher at Computer Science Department, Stanford University, from 2019 to 2021. He was one of the recipients of 2020 CCF (China Computer Federation) Outstanding Doctoral Dissertation Award and 2018 Google Ph.D. Fellowship.
	Heng Ji is a Professor at Computer Science Department of University of Illinois Urbana-Champaign, and an Amazon Scholar. She received her B.A. and M. A. in Computational Linguistics from Tsinghua University, and her M.S. and Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially on Multimedia Multilingual Information Extraction, Knowledge Base Population and Knowledge-driven Generation. She was selected as “Young Scientist” and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. The awards she received include “AI’s 10 to Watch” Award by IEEE Intelligent Systems in 2013, NSF CAREER award in 2009, Google Research Award in 2009 and 2014, IBM Watson Faculty Award in 2012 and 2014 and Bosch Research Award in 2014-2018, Amazon AWS Award in 2021, ACL2020 Best Demo Paper Award, and NAACL2021 Best Demo Paper Award. She has coordinated the NIST TAC Knowledge Base Population task since 2010. She has served as the Program Committee Co-Chair of many conferences including NAACL-HLT2018. She is elected as the North American Chapter of the Association for Computational Linguistics (NAACL) secretary 2020-2021. Additional information is available at my website.
	Jiawei Han is Michael Aiken Chair Professor, Department of Computer Science, University of Illinois at Urbana-Champaign. His research areas encompass data mining, text mining, data warehousing, and information network analysis, with over 1000 research publications. He is Fellow of ACM, Fellow of IEEE, and received numerous prominent awards, including ACM SIGKDD Innovation Award (2004) and IEEE Computer Society W. Wallace McDowell Award (2009). He delivered 50+ conference tutorials or keynote speeches (e.g., SIGKDD 2017-2021 tutorials and WSDM 2018 keynote).

IEEE BigData 2019 Tutorial

2019-12-11T14:15:00-08:00

Taming Unstructured Big Data: Automated Information Extraction from Massive Text

Xuan Wang, Yu Zhang, Qi Li, Jiawei Han

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA

Time: 10:05am-12:10pm, Wednesday, December 11, 2019

Location: Los Angeles, CA, USA

Slides:

Our tutorial slides can be downloaded here.

Abstract:

Text data is a powerful information source that covers almost every aspect of our life. Automated information extraction has attracted considerable attention with various approaches developed to mine structured knowledge from unstructured text. In this tutorial, we present an organized picture of automated information extraction from massive text to answer the need of a systematic review and comparison of the techniques. We first introduce major tasks of information extraction such as named entity recognition and relation extraction. Then we introduce downstream applications such as heterogeneous information network construction and claim mining that utilize the extracted information. Specifically, we focus on the methods that are scalable, effective, minimum supervised and working on various kinds of text (e.g., news and biomedical science). We also demonstrate on a real-world dataset, PubMed that includes over 29 million biomedical literature, how the heterogeneous information network can be constructed and how the scientific claims can be automatically retrieved based on automated information extraction. The covered topics will be interesting to both advanced researchers and beginners in data mining, text mining, natural language processing and machine learning.

Outline:

Introduction
- Motivations
- Overview of automatic information extraction from massive text
- Application of utilizing the extracted information
Named Entity Recognition
- What is Named Entity Recognition (NER)?
- Traditional Supervised Methods
  - CorNLL03 Shared Task
  - Sequence Labeling Framework
  - Conditional Random Fields (CRFs)
  - Handcrafted Features
- Modern Neural Models
  - Bidirectional Long Short-term Memory (BiLSTM)-based Models
  - Language Model and Contextualized Representations
  - End-to-end Neural Models
- Distantly Supervised Methods
  - Entity Typing
  - Learning from Domain-Specific Dictionaries
Relation Extraction
- What is Relation Extraction (RE)? – Supervised Methods
  - Multi-relational Embedding
  - Position-aware Neural Models
  - Dependency-path-based Neural Models
- Pattern-based Methods
  - Sequential Textual Patterns
  - Patterns with Linguistic Features
  - Synonym Pattern Grouping
- Open-domain Approaches
  - How to exploit local structure?
  - How to exploit global consistency?
  - High-arity OpenIE
- Weakly-supervised Methods
  - Bootstrapping Methods
  - Pattern-enhanced Embedding Learning
Heterogeneous Information Network Construction
- What is a Heterogenous Information Network (HIN)?
- Network-based Document Summarization
  - What is a concept map?
  - Constructing Concept Maps from Text
- Factual Network Construction and Exploration
  - Network Construction
  - Factual Knowledge Exploration
Claim Mining
- What is a claim?
- Comparison between Opinion Mining, Argument Mining and Claim Mining
- Context-independent claim mining
  - What is context-independent claim mining?
  - Supervised Machine Learning Methods
- Context-dependent claim mining
  - What is a topic and a context-dependent claim?
  - Supervised Machine Learning Methods
  - Weakly/Distantly-supervised Methods
  - Unsupervised Claim Mining
System Demos
- Named Entity Recognition & APIs
- Open Relation Extraction & APIs
- Scientific Claim Mining & APIs
- BioText Mining Demo
Summary and Future Directions
- Summary
  - Principles and Techniques
  - Advantages and Limitations
  - How to choose a method based on your application?
- Future Directions