🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity
Published at 2025-02-17
#ML
The paper introduces a new method that improves language models' performance in proof-oriented programming, addressing data scarcity by creating synthetic training data. The method helps models both generate and repair proofs for function- and repository-level code, outperforming GPT-4o by a 64% relative margin and improving GPT-4o's performance by 54%....
Read More

CRANE: Reasoning with constrained LLM generation
Published at 2025-02-17
#ML
This research explains why strict constrained LLM generation can limit reasoning abilities and proposes CRANE, a new algorithm that balances constrained generation correctness with unconstrained flexibility, leading to better performance on symbolic reasoning benchmarks....
Read More

Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
Published at 2025-02-17
#ML
A new information extraction (IE) model called Cuckoo has been developed by leveraging the resources of large language models (LLMs). It is trained with a new paradigm, next tokens extraction (NTE), which converts LLMs' pre-training and post-training data into extractive data, and it adapts to traditional and complex instruction-following IE tasks under few-shot settings, outperforming existing pre-trained IE models....
Read More

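The next-tokens-extraction idea, recasting generation targets as spans to be tagged in the surrounding context, can be illustrated with a toy converter. This is our sketch, not Cuckoo's code; the token lists and tagging scheme are illustrative assumptions:

```python
def nte_labels(context_tokens, answer_tokens):
    """Tag the context tokens that realize the answer span with B/I
    (begin/inside), everything else with O, turning a generation target
    into an extractive labeling problem."""
    labels = ["O"] * len(context_tokens)
    n = len(answer_tokens)
    for i in range(len(context_tokens) - n + 1):
        if context_tokens[i:i + n] == answer_tokens:
            labels[i] = "B"
            for j in range(i + 1, i + n):
                labels[j] = "I"
            break  # tag only the first occurrence in this toy version
    return labels

print(nte_labels("Paris is the capital of France".split(), ["Paris"]))
# ['B', 'O', 'O', 'O', 'O', 'O']
```

Multi-token answers work the same way: `["capital", "of", "France"]` yields `['O', 'O', 'O', 'B', 'I', 'I']`.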
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening
Published at 2025-02-17
#ML
Diffusion-Sharpening is a fine-tuning method for diffusion models that optimizes sampling trajectories using a path integral framework and reward feedback, improving training efficiency, reducing inference cost, and outperforming other methods across diverse metrics....
Read More

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
Published at 2025-02-17
#ML
HermesFlow is proposed as a general framework to bridge the gap between understanding and generation in Multimodal Large Language Models (MLLMs). It aligns multimodal understanding and generation using homologous preference data, and experimental results show its superiority in narrowing the gap compared to prior methods....
Read More

One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs
Published at 2025-02-17
#ML
The authors propose a novel approach to enhance Large Language Models' (LLMs) mathematical reasoning and proof capabilities through counterexamples. They introduce a high-quality, university-level mathematical benchmark, CounterMATH, and a data engineering framework to improve model performance by providing counterexamples, thereby promoting a deeper understanding of mathematical concepts....
Read More

SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL
Published at 2025-02-17
#ML
The proposed SAFE-SQL framework addresses the limitations of existing Text-to-SQL methods by generating and filtering its own examples, enhancing SQL generation and execution accuracy, especially in unseen scenarios....
Read More

Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems
Published at 2025-02-17
#ML
The proposed Talk Structurally, Act Hierarchically (TalkHier) framework improves upon existing multi-agent systems by introducing a structured communication protocol and hierarchical refinement for better collaboration and output accuracy in complex tasks. TalkHier outperforms various state-of-the-art models in diverse tasks such as question answering and text generation, demonstrating its potential to become a new standard for LLM-based multi-agent systems....
Read More

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Published at 2025-02-18
#ML
This survey provides a comprehensive analysis of Multimodal Retrieval-Augmented Generation systems, addressing their benefits, challenges, and recent developments, aiming to support advancements in AI systems that effectively utilize multimodal dynamic external knowledge bases....
Read More

Better Embeddings with Coupled Adam
Published at 2025-02-18
#ML
This research discusses how anisotropic word representations in large language models (LLMs) can be improved by addressing the second moment in the Adam optimizer. The authors propose a new optimizer, Coupled Adam, which significantly enhances the quality of embeddings and overall performance in upstream and downstream tasks....
Read More

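A sketch of what a "coupled" second moment could look like. The coupling rule below, averaging Adam's second-moment estimate across embedding rows so every word vector receives the same adaptive scale per dimension, is our reading of the idea and should be treated as an assumption, not the paper's exact update:

```python
import numpy as np

def coupled_adam_step(E, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One toy update of a 'coupled' Adam for an embedding matrix E (V x d):
    the first moment stays per-parameter, while the second moment is
    averaged over the vocabulary axis so all embedding rows share it.
    The coupling rule here is our assumption, for illustration only."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * (grad ** 2).mean(axis=0)  # shared across rows
    m_hat = m / (1 - b1 ** t)                         # bias correction
    v_hat = v / (1 - b2 ** t)
    return E - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))               # toy embedding table
grad = rng.normal(size=(10, 4))
E_new, m, v = coupled_adam_step(E, grad, np.zeros((10, 4)), np.zeros(4), t=1)
```

With a per-parameter `v` (standard Adam), rarely updated embedding rows get disproportionately large steps; sharing `v` removes that per-row disparity.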
Can a Single Model Master Both Multi-turn Conversations and Tool Use? CALM: A Unified Conversational Agentic Language Model
Published at 2025-02-18
#ML
The paper presents CALM, a unified approach that addresses the limitations of current conversational agents, which struggle either with multi-turn conversations or with API usage. CALM is trained on a multi-task dataset, CALM-IT, and outperforms top domain-specific models across three popular benchmarks....
Read More

Data Valuation using Neural Networks for Efficient Instruction Fine-Tuning
Published at 2025-02-18
#ML
This research introduces a method using small neural networks, named InfluenceNetwork, to estimate data influence values, reducing costs by up to 99%. The authors apply this method, called NN-CIFT, to the task of subset selection for instruction fine-tuning, demonstrating that it performs as well as existing methods while being faster....
Read More

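The core idea, a cheap learned proxy for expensive influence scores, can be sketched as follows. Everything here is hypothetical (the features, the synthetic targets, and the linear model standing in for the small neural network); the actual InfluenceNetwork is trained on real influence values:

```python
import numpy as np

# Hypothetical setup: cheap per-example features stand in for model
# representations, and a synthetic linear target stands in for the
# expensive-to-compute influence scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))          # cheap features for 256 examples
y = X @ rng.normal(size=8)             # stand-in for costly influence values

# Tiny "InfluenceNetwork"-style regressor (linear here, for brevity),
# fit by full-batch gradient descent on squared error.
w = np.zeros(8)
for _ in range(500):
    w -= 0.1 * 2 * X.T @ (X @ w - y) / len(X)

mse = float(np.mean((X @ w - y) ** 2))
print(mse)  # near zero: the cheap proxy recovers the influence signal
```

Once fit, the proxy scores every candidate example at the cost of a forward pass, which is where the claimed cost reduction comes from.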
Dyve: Thinking Fast and Slow for Dynamic Process Verification
Published at 2025-02-18
#ML
Dyve is a dynamic process verifier that improves error detection in large language models by combining fast and slow thinking, inspired by Kahneman's Systems Theory. It uses a step-wise consensus-filtered process supervision technique to create high-quality supervision signals from noisy data and outperforms existing process-based verifiers....
Read More

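The fast/slow control flow can be sketched generically: trust a cheap System-1 check when it is confident, and escalate to a slower System-2 analysis otherwise. Both checks below are hypothetical stand-ins; in Dyve the verifiers are LLM-based:

```python
def verify_step(step, fast_check, slow_check, threshold=0.9):
    """Toy fast/slow cascade: return the fast verdict when its confidence
    clears the threshold, otherwise escalate to the expensive slow check."""
    verdict, confidence = fast_check(step)
    if confidence >= threshold:
        return verdict
    return slow_check(step)

# Hypothetical checks over arithmetic reasoning steps.
def fast(step):   # quick heuristic, only confident on small numbers
    ok = step["claimed"] == step["a"] + step["b"]
    return ok, 0.95 if abs(step["claimed"]) < 100 else 0.5

def slow(step):   # full recomputation
    return step["claimed"] == step["a"] + step["b"]

print(verify_step({"a": 2, "b": 3, "claimed": 5}, fast, slow))      # True (fast path)
print(verify_step({"a": 60, "b": 70, "claimed": 131}, fast, slow))  # False (escalated)
```

The cascade spends the expensive check only on the uncertain cases, which is the efficiency argument behind the fast/slow split.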
EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
Published at 2025-02-18
#ML
The paper presents EQ-VAE, a regularization technique that encourages equivariance in the latent space of generative models, improving their performance. EQ-VAE simplifies the latent space, enhancing the quality of image synthesis by pre-trained autoencoders, and is compatible with both continuous and discrete autoencoders....
Read More

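The equivariance idea can be illustrated with a toy penalty: encoding a transformed image should match transforming the encoded latent. This is our illustration only (EQ-VAE's actual objective regularizes pre-trained autoencoders under spatial transforms such as scaling and rotation); the toy encoder below is average pooling, which is exactly equivariant to horizontal flips:

```python
import numpy as np

def equivariance_penalty(encode, t_img, t_lat, x):
    # Mismatch between encode(transform(x)) and transform(encode(x)).
    return float(np.mean((encode(t_img(x)) - t_lat(encode(x))) ** 2))

# Toy "encoder": 2x2 average pooling over a single-channel image.
def encode(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

flip = lambda a: a[:, ::-1]     # the transform, applied in image and latent space
x = np.arange(16.0).reshape(4, 4)
print(equivariance_penalty(encode, flip, flip, x))  # 0.0
```

A non-equivariant encoder (e.g., one that crops a fixed corner) yields a strictly positive penalty, which is the signal the regularizer pushes against.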
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Published at 2025-02-18
#ML
The study develops a scalable method to create a large web trajectory dataset containing over 94K web agent trajectories, 720K screenshots, and 33M web elements. Utilizing this dataset, they train a multimodal web agent named Explorer, which exhibits strong performance on offline and online benchmarks, emphasizing the importance of data scaling in enhancing web agent capabilities....
Read More

How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Published at 2025-02-18
#ML
The paper explores how Large Language Models (LLMs) acquire new knowledge and the process of embedding it in their neural computations through knowledge circuits. Key findings include the influence of relevance to pre-existing knowledge, a phase shift from formation to optimization, and a deep-to-shallow evolution pattern. Understanding these mechanisms can help improve continual pre-training strategies for better model performance....
Read More

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Published at 2025-02-18
#ML
The proposed ThinkDiff approach enhances text-to-image diffusion models with multimodal in-context understanding and reasoning abilities by utilizing vision-language models (VLMs). ThinkDiff overcomes existing challenges in multimodal diffusion finetuning by using vision-language training with an LLM decoder as a proxy task, significantly improving accuracy on the CoBSAT benchmark without complex training or datasets....
Read More

IHEval: Evaluating Language Models on Following the Instruction Hierarchy
Published at 2025-02-18
#ML
IHEval is a new benchmark for evaluating language models' ability to follow the instruction hierarchy, covering cases of either alignment or conflict between instructions. The benchmark reveals that popular language models have difficulty recognizing instruction priorities and struggle with conflicting instructions, highlighting the need for further optimization....
Read More

ILIAS: Instance-Level Image retrieval At Scale
Published at 2025-02-18
#ML
ILIAS is a new, large-scale dataset for object recognition in images. It is designed to evaluate the performance of models in challenging conditions and diverse domains, and benchmarking with ILIAS shows areas where models can improve, such as generalizing across domains and handling background clutter....
Read More

Intuitive physics understanding emerges from self-supervised pretraining on natural videos
Published at 2025-02-18
#ML
This study shows that deep neural network models trained for predicting masked regions in natural videos can develop intuitive physics understanding, as evidenced by their performance in violation-of-expectation tasks. This is in contrast to models trained in pixel space or multimodal large language models, suggesting that learning an abstract representation space while predicting missing sensory input is key to acquiring intuitive physics knowledge....
Read More

Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance
Published at 2025-02-18
#ML
The study examines how well large language models (LLMs) can compute the LIX readability metric and perform dependency parsing on Swedish essays. ChatGPT-o1-mini performs most consistently, and its LIX computation accuracy correlates strongly with its performance on the Massive Multitask Language Understanding benchmark, suggesting that language complexity measurement can serve as a noisy zero-shot proxy for assessing LLMs' general capabilities....
Read More

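The LIX metric the models are asked to compute has a simple closed form: average sentence length plus the percentage of long words (more than six letters). A minimal illustrative implementation (not the paper's code; the tokenization here is deliberately naive):

```python
import re

def lix(text: str) -> float:
    """LIX readability: words per sentence + 100 * long_words / words,
    where a long word has more than six letters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    words = [w for w in words if w]
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

print(lix("Hello world. This is a simple readability example."))
# 29.0  (8 words / 2 sentences + 100 * 2 long words / 8 words)
```

Because the formula only requires counting, it makes a convenient probe: a model that miscounts here is likely to stumble on other quantitative tasks too.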
Large Language Models and Mathematical Reasoning Failures
Published at 2025-02-18
#ML
This study uses 50 new high-school-level word problems to examine the mathematical reasoning of eight advanced language models. The findings show that while newer models perform better, all models still make errors in spatial reasoning, strategic planning, and arithmetic, and have trouble with problems needing multi-step deduction or real-world knowledge....
Read More

Learning Getting-Up Policies for Real-World Humanoid Robots
Published at 2025-02-18
#ML
This study presents a learning framework for humanoid robots to recover from falls in various configurations and terrains. The approach involves a two-phase method that first discovers a getting-up trajectory and then refines it into a smooth and slow motion, enabling a real-world G1 humanoid robot to get up from lying face up or face down on flat, deformable, and slippery surfaces as well as slopes....
Read More

MagicArticulate: Make Your 3D Models Articulation-Ready
Published at 2025-02-18
#ML
MagicArticulate is a framework to convert static 3D models into articulation-ready assets automatically. It formulates skeleton generation as a sequence modeling problem, predicts skinning weights with a functional diffusion process, and is trained on Articulation-XL, a large-scale benchmark....
Read More

Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning
Published at 2025-02-18
#ML
MIKASA, a new benchmark for memory reinforcement learning, is proposed. It consists of a classification framework for memory-intensive tasks, a unified benchmark called MIKASA-Base, and a novel benchmark called MIKASA-Robo specifically designed for tabletop robotic manipulation tasks....
Read More

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Published at 2025-02-18
#ML
The paper introduces NSA, a hardware-aligned and natively trainable sparse attention mechanism designed for long-context modeling. NSA combines coarse-grained token compression and fine-grained token selection using a dynamic hierarchical sparse strategy and achieves substantial speedups through arithmetic intensity-balanced algorithm design, enabling end-to-end training for long-context tasks and instruction-based reasoning....
Read More

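The compress-then-select idea can be sketched in a few lines. This toy numpy version is an illustration only (NSA itself uses learned compression, sliding windows, and hardware-aligned kernels): mean-pool key blocks for coarse scoring, then run exact attention over the top-scoring blocks per query:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block_sparse_attention(q, k, v, block=4, topk=1):
    """Toy compress-then-select attention: score mean-pooled key blocks
    (coarse stage), keep the top-k blocks per query (fine selection),
    then run ordinary attention over the tokens in the kept blocks only."""
    n_blocks = k.shape[0] // block
    k_coarse = k.reshape(n_blocks, block, -1).mean(axis=1)   # compressed keys
    keep = np.argsort(q @ k_coarse.T, axis=1)[:, -topk:]     # top-k blocks per query
    out = np.zeros((q.shape[0], v.shape[1]))
    for i in range(q.shape[0]):
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep[i]])
        out[i] = softmax(q[i] @ k[idx].T) @ v[idx]           # exact attention on subset
    return out
```

With `topk` equal to the total number of blocks this reduces to ordinary dense attention, which makes the sparsification easy to sanity-check.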
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
Published at 2025-02-18
#ML
PhysReason is a new benchmark with 1,200 problems for physics-based reasoning in large language models. Top models struggle with it, especially the hard problems, in four main areas: applying physics theorems, understanding physics processes, performing calculations, and analyzing physics conditions....
Read More

ReLearn: Unlearning via Learning for Large Language Models
Published at 2025-02-18
#ML
The proposed ReLearn method uses data augmentation and fine-tuning for effective unlearning in large language models, avoiding the performance degradation and linguistic coherence issues of reverse optimization. It also introduces Knowledge Forgetting Rate, Knowledge Retention Rate, and Linguistic Score as a comprehensive evaluation framework for measuring knowledge-level preservation and generation quality....
Read More

SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Published at 2025-02-18
#ML
SURGE, a comprehensive benchmark, is introduced to investigate the capability of large language models (LLMs) as general-purpose surrogate code executors. The study reveals that while LLMs can predict code execution results in specific cases, they have limitations in general-purpose surrogate execution, providing empirical insights into the feasibility of using LLMs as surrogate code executors....
Read More

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Published at 2025-02-18
#ML
SWE-Lancer is a benchmark of over 1,400 freelance software engineering tasks from Upwork, with end-to-end tests and monetary value, which can help researchers understand the economic impact of AI model development....
Read More

Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking
Published at 2025-02-18
#ML
The paper discusses the need for explainable automated fact-checking tools in the context of increasing misinformation and examines fact-checkers' requirements through interviews, finding unmet explanation needs and identifying criteria for effective fact-checking explanations....
Read More

System Message Generation for User Preferences using Open-Source Models
Published at 2025-02-18
#ML
The study presents SysGen, a pipeline for generating system messages that improve assistant responses' alignment with user instructions. SysGen, which builds on supervised fine-tuning data that lacks system messages, shows significant improvements in model response alignment with system messages and user instructions across various open-source models on the Multifacet benchmark, while minimally impacting other unseen benchmarks such as Open LLM Leaderboard 2....
Read More

The Mirage of Model Editing: Revisiting Evaluation in the Wild
Published at 2025-02-18
#ML
The study challenges the near-perfect results of model editing in controlled environments by evaluating its performance in real-world applications with a new benchmark and evaluation framework. It reveals that current editing methods are less effective than previously reported, partly due to issues in evaluation practices such as the inappropriate use of teacher forcing, and that they fail drastically after only a small number of edits, suggesting the need for a reevaluation of existing editing methods and their ...
Read More

Towards Data-Efficient Pretraining for Atomic Property Prediction
Published at 2025-02-18
#ML
A study challenges the notion that larger datasets are always better for atomic property prediction. It introduces the Chemical Similarity Index (CSI) to select the most relevant dataset for pretraining. The results show that pretraining on a smaller, focused dataset can outperform large-scale pretraining, even when the larger datasets contain relevant data....
Read More

video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Published at 2025-02-18
#ML
This open-source tool, video-SALMONN-o1, is the first reasoning-enhanced audio-visual LLM for general video understanding tasks. It uses a reasoning-intensive dataset and pDPO to improve accuracy by 3-8% over the LLaVA-OneVision baseline and 6-8% over the supervised fine-tuning model on RivaBench, while enabling zero-shot synthetic video detection capabilities....
Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the developer's social media