rStar is a fairly simple method but has achieved great performance on small models. The basic idea is to introduce various types of queries (actions) during the MCTS process: decompose the question, answer sub-questions, or rephrase sub-questions. This mimics cognitive reasoning and adds exploration to the tree search.
Its codebase is interesting to read. The key elements live in a single Python file, run_src/MCTS_for_reasoning.py, which defines three key components:
Generator
Reasoning_MCTS_Node
search_for_answers
In short, what the codebase does is define a rich language-node data structure, Reasoning_MCTS_Node, and set up an MCTS searcher that navigates the search space by creating child nodes inside Reasoning_MCTS_Node, each of which inherits all necessary information.
I found it interesting because I was participating in another open-source project, OpenR (big shoutout to the team!), where we also support a vanilla version of MCTS reasoning. The main implementation difference is that rStar passes LLM calls between nested nodes, whereas OpenR follows the conventional RL framework and makes LLM calls in a centralized env entity. I feel the latter is more user-friendly, and I would like to port rStar to our developing codebase.
The language node class Reasoning_MCTS_Node contains basic attributes such as parent, depth, and node type, as well as the key information for building children, such as the generator function and the question status. A new node first inherits almost everything from its parent and then applies detailed rules for action generation. Simple as that.
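The inherit-then-override pattern can be sketched roughly like this (a minimal sketch with hypothetical attribute names, not the actual Reasoning_MCTS_Node fields):

```python
# Minimal sketch of a language node that inherits search state from its
# parent, in the spirit of Reasoning_MCTS_Node. Attribute names are
# illustrative assumptions, not the real field names.
class LanguageNode:
    def __init__(self, parent=None, node_type="root", generator=None,
                 user_question=None, partial_solution=""):
        self.parent = parent
        self.children = []
        self.node_type = node_type
        if parent is None:
            # Root node: everything comes from the constructor arguments.
            self.depth = 0
            self.generator = generator
            self.user_question = user_question
            self.partial_solution = partial_solution
        else:
            # Child node: inherit almost everything from the parent, then
            # override only what this node's action changed.
            self.depth = parent.depth + 1
            self.generator = parent.generator
            self.user_question = parent.user_question
            self.partial_solution = partial_solution or parent.partial_solution
```

The payoff of this design is that any node carries enough context to generate its own children without consulting a central controller.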
Its _create_children function is the essence of the project. There are five ways of generating actions, written as do_action_generate_xxx functions; each queries the generator with a specific prompt and turns the responses into child nodes. All children are added to the current language node, where they await selection and value updates.
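Schematically, the dispatch looks something like the sketch below (action names are paraphrases of the five rStar actions; the prompts and the generator interface are placeholder assumptions):

```python
# Sketch of a _create_children-style dispatch: each action type builds a
# specific prompt, queries the generator, and wraps every sampled response
# as a child node awaiting MCTS selection. Names are illustrative.
def create_children(node, generator, num_samples=2):
    actions = [
        "propose_one_step_thought",   # OST: extend reasoning by one step
        "propose_remaining_steps",    # i.e. a direct answer
        "propose_next_subquestion",   # decompose the question further
        "re_answer_subquestion",      # answer a sub-question again
        "rephrase_question",          # rephrase the (sub-)question
    ]
    children = []
    for action in actions:
        prompt = f"[{action}]\n{node['user_question']}"
        for response in generator(prompt, num_samples):
            children.append({"action": action, "text": response, "parent": node})
    node["children"] = children  # attach; value updates happen during search
    return children
```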
OK, now I have almost finished the re-implementation of rStar in OpenR. At least I made it executable :). The motivation is that I hope to scale it up by borrowing OpenR's usage of the Ray package. I have also noticed a tiny bug in the original rStar repo (I might be wrong): is_valid_solution_node seems to accept the DIRECT ANSWER, SUBQUESTION, and OST node types, but answer extraction throws OST away. My current task is to run experiments and demonstrate that this week's work is not wasted! Fingers crossed!
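A minimal sketch of the mismatch I mean (node types simplified to strings; function bodies are my paraphrase, not the repo's actual code):

```python
# The validity check accepts three node types, including OST...
def is_valid_solution_node(node_type):
    return node_type in {"DIRECT_ANSWER", "SUBQUESTION", "OST"}

# ...but answer extraction only knows how to read the first two, so an
# OST node can pass the validity check yet contribute no answer.
def extract_answer(node_type, node_answer):
    if node_type in {"DIRECT_ANSWER", "SUBQUESTION"}:
        return node_answer
    return None
```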
Yuwei Hu, Runlin Lei, Xinyi Huang, Zhewei Wei, Yongchao Liu
In graph reasoning tasks, traditional methods often use a single LLM, whereas this paper proposes a multi-agent collaboration framework: an LLM sits on each node of the graph and receives and passes messages until a maximum number of iterations is reached.
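The collaboration scheme, as I read it, is synchronous message passing over the graph; a hedged sketch, where `update` stands in for an LLM call (an assumption of mine, not the paper's API):

```python
# One agent per graph node; each iteration, every agent reads its
# neighbors' current states and updates its own. `update` is a stand-in
# for the per-node LLM call.
def run_message_passing(adjacency, states, update, max_iters):
    for _ in range(max_iters):
        new_states = {}
        for node, neighbors in adjacency.items():
            incoming = [states[nb] for nb in neighbors]
            new_states[node] = update(states[node], incoming)
        states = new_states  # synchronous update across all agents
    return states
```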
KeyWord: Multi-agent, Graph Reasoning
Qiyuan Zhang, Yufei Wang, Tiezheng Yu, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
Good related work on LLM-as-Judge, but the methodology seems trivial: for a generated response, the paper uses one LLM to generate a reference and another LLM to evaluate the input-output-reference combination.
KeyWord: LLM-as-Judge
Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua
A Speculative Decoding variant
KeyWord: LLM Inference
Haonan Li, Xudong Han, Hao Wang, Yuxia Wang, Minghan Wang, Rui Xing, Yilin Geng, Zenan Zhai, Preslav Nakov, Timothy Baldwin (LibrAI, MBZUAI, Monash University, The University of Melbourne)
The paper uses direct prompting to decompose fact checking into five stages; the sub-modules are: Decomposer, Checkworthiness Identifier, Query Generator, Evidence Retriever, and Claim Verifier. I like the idea of breaking the thinking process into pre-defined stages, and direct prompting is a straightforward implementation. The paper also discusses a practical parallel implementation.
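The five-stage pipeline can be wired up schematically like this (stage names follow the paper's module list; the callables are stand-ins for the actual LLM prompts):

```python
# Hypothetical wiring of the five sub-modules as plain functions; in the
# paper, each stage is an LLM prompted directly.
def fact_check(document, decompose, is_checkworthy, generate_queries,
               retrieve, verify):
    claims = decompose(document)                        # Decomposer
    claims = [c for c in claims if is_checkworthy(c)]   # Checkworthiness Identifier
    verdicts = {}
    for claim in claims:
        queries = generate_queries(claim)               # Query Generator
        evidence = [e for q in queries for e in retrieve(q)]  # Evidence Retriever
        verdicts[claim] = verify(claim, evidence)       # Claim Verifier
    return verdicts
```

Because each claim is processed independently, the per-claim loop is an obvious target for the parallel implementation the paper mentions.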
KeyWord: Direct Prompting
R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, Thomas L. Griffiths (Yale, OpenAI, Princeton)
A technical report finding that o1 scores substantially better on examples with high-probability outputs than on ones with low-probability outputs, while showing substantially less sensitivity to task frequency than other LLMs. The problem of auto-regression still exists.
KeyWord: Evaluation, O1
Sam Earle, Samyak Parajuli, Andrzej Banburski-Fahey
An LLM assistant for game design through direct prompting. The key is to decompose the task and assign the subtasks to individual agents.
KeyWord: Direct Prompting, Human-AI Interaction
Alexey Kutalev, Sergei Markoff
A survey on RLHF (well, it saves me time :))
KeyWord: Survey, RLHF
Shayekh Bin Islam, Md Asib Rahman, K S M Tozammel Hossain, Enamul Hoque, Shafiq Joty, Md Rizwan Parvez
LLM + MoE: train the model to generate retrieval/no_retrieval reflection tokens, and at inference measure the confidence of outputs conditioned on an enforced no_retrieval token, to decide whether to retrieve.
KeyWord: RAG
Spencer Frei, Gal Vardi (UC Davis)
Investigates the generalization abilities of a linear transformer on linear classification tasks. It generalizes nicely when the data has label-flipping noise; in ICL, the model memorizes the noise but still generalizes.
KeyWord: foundation model, generalization
Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu (University of Central Florida)
This one is interesting. The question it tries to answer: how can LLMs guide investment decisions by analyzing earnings call transcripts? The goal is to dig for key factors that are important for decision-making. The paper interprets this in a Bayesian-inference way: given an article, the LLM is first used (as a prior proposal) to summarize factors; they then train a Bradley-Terry model to score the likelihood (a conditional likelihood function). A direct posterior-sampling analogy.
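To make the likelihood part concrete, the Bradley-Terry model is just one line: given latent scores $s_i$, the probability that factor $i$ is preferred over factor $j$ is $e^{s_i}/(e^{s_i}+e^{s_j})$. The scores are what gets trained; the sketch below only evaluates the preference probability:

```python
import math

# Bradley-Terry preference probability from two latent factor scores.
# Higher score => more likely to be preferred; equal scores => 0.5.
def bt_preference(s_i, s_j):
    return math.exp(s_i) / (math.exp(s_i) + math.exp(s_j))
```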
KeyWord: factor analysis, bayesian sense
Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass (Harvard, MIT, Chicago, Meta)
The paper proposes an evaluation pipeline to test generalization. The key insight is to query the LLM for its in-distribution data and sample OOD data from its complement, then test the LLM on both sets of synthetic data. A dynamic generalization-evaluation pipeline, also interesting. I wonder what an open-ended version would look like.
KeyWord: evaluation, generalization
Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen (University of Pennsylvania)
LLM for bioinformatics! Always happy to see these cross-area papers. Data quality is always the bottleneck due to high dimensionality, missing data, or small sample size. The paper utilizes prior knowledge embedded in the LLM and applies self-consistent inference to select key features from tabular data.
KeyWord: bioinformatics, feature selection, application of LLM
Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu (Tsinghua)
The paper proposes a novel way to do selective KV cache eviction for long-text inference: they train a retaining head to estimate the causal importance of each cache unit. This training paradigm provides accurate token-importance scoring and can be integrated with other efficient-inference algorithms.
KeyWord: hardware-level optimization
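The eviction step itself is simple once scores exist; a sketch of score-driven eviction, where the importance scores are given as a plain list (in the paper they would come from the trained retaining head):

```python
# Keep the top-k cache units by importance score, preserving their
# original sequence order so positional structure survives eviction.
def evict_kv_cache(cache_units, importance_scores, keep_k):
    ranked = sorted(range(len(cache_units)),
                    key=lambda i: importance_scores[i], reverse=True)
    keep = sorted(ranked[:keep_k])
    return [cache_units[i] for i in keep]
```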
Eleftheria Briakou, Zhongtao Liu, Colin Cherry, Markus Freitag (Google)
In machine translation, verbosity is bad and prevalent. The paper explores what causes verbosity and how it affects evaluation.
KeyWord: NLP
Reshmi Ghosh, Rahul Seetharaman, Hitesh Wadhwa, Somyaa Aggarwal, Samyadeep Basu, Soundararajan Srinivasan, Wenlong Zhao, Shreyas Chaudhari, Ehsan Aghazadeh (Microsoft, University of Massachusetts Amherst, University of Maryland, College Park)
RAG-augmented LLMs have a shortcut to answer questions using context information instead of the model prior. The paper proposes three mechanistic ways to identify how the LLM utilizes information: a causal inference method, attention-weight checking, and attention knockout (remove one attention edge and see how performance degrades).
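A small sketch of the third method, attention knockout: block one attention edge by setting its pre-softmax score to negative infinity and renormalize; the downstream performance drop then measures that edge's importance (a toy rendition, not the paper's code):

```python
import math

# Numerically stable softmax over a list of scores.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Knock out one attention edge: its weight becomes exactly zero and the
# remaining probability mass is redistributed by the softmax.
def knock_out(attn_scores, blocked_idx):
    blocked = [(-math.inf if i == blocked_idx else s)
               for i, s in enumerate(attn_scores)]
    return softmax(blocked)
```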
KeyWord: RAG
Hongyin Luo & Wei Sun (BitEnergy AI, Inc)
Approximates tensor multiplication with integer addition. The paper does a detailed computational-complexity analysis and achieves performance similar to the full-scale model while using less energy. Worth reading.
KeyWord: computational complexity, hardware-level optimization
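For intuition (this is the classic Mitchell-style trick in the same spirit, not the paper's exact algorithm): adding the raw IEEE-754 bit patterns of two positive floats approximates their product, because the exponents add and the mantissa bits act as a rough logarithm.

```python
import struct

# Approximate a * b for positive floats via integer addition of their
# IEEE-754 float32 bit patterns. 0x3F800000 is the bit pattern of 1.0f,
# subtracted to cancel the doubled exponent bias.
def approx_mul(a, b):
    ia = struct.unpack("<I", struct.pack("<f", a))[0]
    ib = struct.unpack("<I", struct.pack("<f", b))[0]
    ic = (ia + ib - 0x3F800000) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", ic))[0]
```

The worst-case relative error of this family of approximations is around 11%, which is why the paper's careful error and complexity analysis matters.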
Kowe Kadoma, Danaë Metaxa, Mor Naaman
How much harm is caused to users when others perceive or suspect them of using AI? The paper runs social experiments to verify these perceptual harms.
KeyWord: AI for social good
Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh (University of Wisconsin-Madison, Adobe Research)
Solves the data imbalance problem in image-text alignment by generating negative samples from the positives.
KeyWord: data augmentation
Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, Werner Geyer (IBM)
The paper investigates how humans behave as evaluators of LLM outputs, trying to understand what practitioners prioritize in evaluation criteria when using LLMs as judges, and how these priorities differ. The paper also provides a tool to help practitioners refine evaluation criteria using both direct and pairwise assessment strategies.
KeyWord: Trustworthy AI, human-AI interaction