🤗 Daily Paper Newsletter

Hope you found some gems!
This newsletter delivers a curated list of papers from 🤗 Daily Papers.

Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity
Published at 2025-02-17
#ML
The paper introduces a new method that improves language models' performance in proof-oriented programming, addressing data scarcity by creating synthetic training data. The method helps models both generate and repair proofs for function- and repository-level code, outperforming GPT-4o by a 64% relative margin and improving GPT-4o's performance by 54%....
Read More

CRANE: Reasoning with constrained LLM generation
Published at 2025-02-17
#ML
This research explains why strict constrained LLM generation can limit reasoning abilities and proposes CRANE, a new algorithm that balances constrained generation correctness with unconstrained flexibility, leading to better performance on symbolic reasoning benchmarks....
Read More

Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
Published at 2025-02-17
#ML
A new information extraction (IE) model called Cuckoo has been developed by leveraging the resources of large language models (LLMs). It is trained with a new paradigm, next tokens extraction (NTE), which converts LLMs' pre-training and post-training data into extractive data, and it adapts to traditional and complex instruction-following IE tasks under few-shot settings, outperforming existing pre-trained IE models....
Read More

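The next-tokens-extraction idea, recasting generation targets as spans to be tagged in the surrounding context, can be illustrated with a toy converter. This is our sketch, not Cuckoo's code; the token lists and tagging scheme are illustrative assumptions:

```python
def nte_labels(context_tokens, answer_tokens):
    """Tag the context tokens that realize the answer span with B/I
    (begin/inside), everything else with O, turning a generation target
    into an extractive labeling problem."""
    labels = ["O"] * len(context_tokens)
    n = len(answer_tokens)
    for i in range(len(context_tokens) - n + 1):
        if context_tokens[i:i + n] == answer_tokens:
            labels[i] = "B"
            for j in range(i + 1, i + n):
                labels[j] = "I"
            break  # tag only the first occurrence in this toy version
    return labels

print(nte_labels("Paris is the capital of France".split(), ["Paris"]))
# ['B', 'O', 'O', 'O', 'O', 'O']
```

Multi-token answers work the same way: `["capital", "of", "France"]` yields `['O', 'O', 'O', 'B', 'I', 'I']`.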
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening
Published at 2025-02-17
#ML
Diffusion-Sharpening is a fine-tuning method for diffusion models that optimizes sampling trajectories using a path integral framework and reward feedback, improving training efficiency, reducing inference cost, and outperforming other methods across diverse metrics....
Read More

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
Published at 2025-02-17
#ML
HermesFlow is proposed as a general framework to bridge the gap between understanding and generation in Multimodal Large Language Models (MLLMs). It aligns multimodal understanding and generation using homologous preference data, and experimental results show its superiority in narrowing the gap compared to prior methods....
Read More

One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs
Published at 2025-02-17
#ML
The authors propose a novel approach to enhance Large Language Models' (LLMs) mathematical reasoning and proof capabilities through counterexamples. They introduce a high-quality, university-level mathematical benchmark, CounterMATH, and a data engineering framework to improve model performance by providing counterexamples, thereby promoting a deeper understanding of mathematical concepts....
Read More

SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL
Published at 2025-02-17
#ML
The proposed SAFE-SQL framework addresses the limitations of existing Text-to-SQL methods by generating and filtering its own examples, enhancing SQL generation and execution accuracy, especially in unseen scenarios....
Read More

Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems
Published at 2025-02-17
#ML
The proposed Talk Structurally, Act Hierarchically (TalkHier) framework improves upon existing multi-agent systems by introducing a structured communication protocol and hierarchical refinement for better collaboration and output accuracy in complex tasks. TalkHier outperforms various state-of-the-art models in diverse tasks such as question answering and text generation, demonstrating its potential to become a new standard for LLM-based multi-agent systems....
Read More

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Published at 2025-02-18
#ML
This survey provides a comprehensive analysis of Multimodal Retrieval-Augmented Generation systems, addressing their benefits, challenges, and recent developments, aiming to support advancements in AI systems that effectively utilize multimodal dynamic external knowledge bases....
Read More

Better Embeddings with Coupled Adam
Published at 2025-02-18
#ML
This research discusses how anisotropic word representations in large language models (LLMs) can be improved by addressing the second moment in the Adam optimizer. The authors propose a new optimizer, Coupled Adam, which significantly enhances the quality of embeddings and overall performance in upstream and downstream tasks....
Read More

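A sketch of what a "coupled" second moment could look like. The coupling rule below, averaging Adam's second-moment estimate across embedding rows so every word vector receives the same adaptive scale per dimension, is our reading of the idea and should be treated as an assumption, not the paper's exact update:

```python
import numpy as np

def coupled_adam_step(E, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One toy update of a 'coupled' Adam for an embedding matrix E (V x d):
    the first moment stays per-parameter, while the second moment is
    averaged over the vocabulary axis so all embedding rows share it.
    The coupling rule here is our assumption, for illustration only."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * (grad ** 2).mean(axis=0)  # shared across rows
    m_hat = m / (1 - b1 ** t)                         # bias correction
    v_hat = v / (1 - b2 ** t)
    return E - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))               # toy embedding table
grad = rng.normal(size=(10, 4))
E_new, m, v = coupled_adam_step(E, grad, np.zeros((10, 4)), np.zeros(4), t=1)
```

With a per-parameter `v` (standard Adam), rarely updated embedding rows get disproportionately large steps; sharing `v` removes that per-row disparity.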
Can a Single Model Master Both Multi-turn Conversations and Tool Use? CALM: A Unified Conversational Agentic Language Model
Published at 2025-02-18
#ML
The paper presents CALM, a unified approach that addresses the limitations of current conversational agents, which struggle either with multi-turn conversations or with API usage. CALM is trained on a multi-task dataset, CALM-IT, and outperforms top domain-specific models across three popular benchmarks....
Read More

Data Valuation using Neural Networks for Efficient Instruction Fine-Tuning
Published at 2025-02-18
#ML
This research introduces a method using small neural networks, named InfluenceNetwork, to estimate data influence values, reducing costs by up to 99%. The authors apply this method, called NN-CIFT, to the task of subset selection for instruction fine-tuning, demonstrating that it performs as well as existing methods while being faster....
Read More

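The core idea, a cheap learned proxy for expensive influence scores, can be sketched as follows. Everything here is hypothetical (the features, the synthetic targets, and the linear model standing in for the small neural network); the actual InfluenceNetwork is trained on real influence values:

```python
import numpy as np

# Hypothetical setup: cheap per-example features stand in for model
# representations, and a synthetic linear target stands in for the
# expensive-to-compute influence scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))          # cheap features for 256 examples
y = X @ rng.normal(size=8)             # stand-in for costly influence values

# Tiny "InfluenceNetwork"-style regressor (linear here, for brevity),
# fit by full-batch gradient descent on squared error.
w = np.zeros(8)
for _ in range(500):
    w -= 0.1 * 2 * X.T @ (X @ w - y) / len(X)

mse = float(np.mean((X @ w - y) ** 2))
print(mse)  # near zero: the cheap proxy recovers the influence signal
```

Once fit, the proxy scores every candidate example at the cost of a forward pass, which is where the claimed cost reduction comes from.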
Dyve: Thinking Fast and Slow for Dynamic Process Verification
Published at 2025-02-18
#ML
Dyve is a dynamic process verifier that improves error detection in large language models by combining fast and slow thinking, inspired by Kahneman's Systems Theory. It uses a step-wise consensus-filtered process supervision technique to create high-quality supervision signals from noisy data and outperforms existing process-based verifiers....
Read More

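The fast/slow control flow can be sketched generically: trust a cheap System-1 check when it is confident, and escalate to a slower System-2 analysis otherwise. Both checks below are hypothetical stand-ins; in Dyve the verifiers are LLM-based:

```python
def verify_step(step, fast_check, slow_check, threshold=0.9):
    """Toy fast/slow cascade: return the fast verdict when its confidence
    clears the threshold, otherwise escalate to the expensive slow check."""
    verdict, confidence = fast_check(step)
    if confidence >= threshold:
        return verdict
    return slow_check(step)

# Hypothetical checks over arithmetic reasoning steps.
def fast(step):   # quick heuristic, only confident on small numbers
    ok = step["claimed"] == step["a"] + step["b"]
    return ok, 0.95 if abs(step["claimed"]) < 100 else 0.5

def slow(step):   # full recomputation
    return step["claimed"] == step["a"] + step["b"]

print(verify_step({"a": 2, "b": 3, "claimed": 5}, fast, slow))      # True (fast path)
print(verify_step({"a": 60, "b": 70, "claimed": 131}, fast, slow))  # False (escalated)
```

The cascade spends the expensive check only on the uncertain cases, which is the efficiency argument behind the fast/slow split.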
EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
Published at 2025-02-18
#ML
The paper presents EQ-VAE, a regularization technique that encourages equivariance in the latent space of generative models, improving their performance. EQ-VAE simplifies the latent space, enhancing the quality of image synthesis by pre-trained autoencoders, and is compatible with both continuous and discrete autoencoders....
Read More

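The equivariance idea can be illustrated with a toy penalty: encoding a transformed image should match transforming the encoded latent. This is our illustration only (EQ-VAE's actual objective regularizes pre-trained autoencoders under spatial transforms such as scaling and rotation); the toy encoder below is average pooling, which is exactly equivariant to horizontal flips:

```python
import numpy as np

def equivariance_penalty(encode, t_img, t_lat, x):
    # Mismatch between encode(transform(x)) and transform(encode(x)).
    return float(np.mean((encode(t_img(x)) - t_lat(encode(x))) ** 2))

# Toy "encoder": 2x2 average pooling over a single-channel image.
def encode(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

flip = lambda a: a[:, ::-1]     # the transform, applied in image and latent space
x = np.arange(16.0).reshape(4, 4)
print(equivariance_penalty(encode, flip, flip, x))  # 0.0
```

A non-equivariant encoder (e.g., one that crops a fixed corner) yields a strictly positive penalty, which is the signal the regularizer pushes against.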
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Published at 2025-02-18
#ML
The study develops a scalable method to create a large web trajectory dataset containing over 94K web agent trajectories, 720K screenshots, and 33M web elements. Utilizing this dataset, they train a multimodal web agent named Explorer, which exhibits strong performance on offline and online benchmarks, emphasizing the importance of data scaling in enhancing web agent capabilities....
Read More

How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Published at 2025-02-18
#ML
The paper explores how Large Language Models (LLMs) acquire new knowledge and the process of embedding it in their neural computations through knowledge circuits. Key findings include the influence of relevance to pre-existing knowledge, a phase shift from formation to optimization, and a deep-to-shallow evolution pattern. Understanding these mechanisms can help improve continual pre-training strategies for better model performance....
Read More

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Published at 2025-02-18
#ML
The proposed ThinkDiff approach enhances text-to-image diffusion models with multimodal in-context understanding and reasoning abilities by utilizing vision-language models (VLMs). ThinkDiff overcomes existing challenges in multimodal diffusion finetuning by using vision-language training with an LLM decoder as a proxy task, significantly improving accuracy on the CoBSAT benchmark without complex training or datasets....
Read More

IHEval: Evaluating Language Models on Following the Instruction Hierarchy
Published at 2025-02-18
#ML
IHEval is a new benchmark for evaluating language models' ability to follow the instruction hierarchy, covering cases of either alignment or conflict between instructions. The benchmark reveals that popular language models have difficulty recognizing instruction priorities and struggle with conflicting instructions, highlighting the need for further optimization....
Read More

ILIAS: Instance-Level Image retrieval At Scale
Published at 2025-02-18
#ML
ILIAS is a new, large-scale dataset for object recognition in images. It is designed to evaluate the performance of models in challenging conditions and diverse domains, and benchmarking with ILIAS shows areas where models can improve, such as generalizing across domains and handling background clutter....
Read More

Intuitive physics understanding emerges from self-supervised pretraining on natural videos
Published at 2025-02-18
#ML
This study shows that deep neural network models trained for predicting masked regions in natural videos can develop intuitive physics understanding, as evidenced by their performance in violation-of-expectation tasks. This is in contrast to models trained in pixel space or multimodal large language models, suggesting that learning an abstract representation space while predicting missing sensory input is key to acquiring intuitive physics knowledge....
Read More

Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance
Published at 2025-02-18
#ML
The study examines how well large language models (LLMs) can compute the LIX readability metric and perform dependency parsing on Swedish essays. ChatGPT-o1-mini performs most consistently, and its LIX computation accuracy correlates strongly with its performance on the Massive Multitask Language Understanding benchmark, suggesting that language complexity measurement can serve as a noisy zero-shot proxy for assessing LLMs' general capabilities....
Read More

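The LIX metric the models are asked to compute has a simple closed form: average sentence length plus the percentage of long words (more than six letters). A minimal illustrative implementation (not the paper's code; the tokenization here is deliberately naive):

```python
import re

def lix(text: str) -> float:
    """LIX readability: words per sentence + 100 * long_words / words,
    where a long word has more than six letters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    words = [w for w in words if w]
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

print(lix("Hello world. This is a simple readability example."))
# 29.0  (8 words / 2 sentences + 100 * 2 long words / 8 words)
```

Because the formula only requires counting, it makes a convenient probe: a model that miscounts here is likely to stumble on other quantitative tasks too.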
Large Language Models and Mathematical Reasoning Failures
Published at 2025-02-18
#ML
This study uses 50 new high-school-level word problems to examine the mathematical reasoning of eight advanced language models. The findings show that while newer models perform better, all models still make errors in spatial reasoning, strategic planning, and arithmetic, and have trouble with problems needing multi-step deduction or real-world knowledge....
Read More

Learning Getting-Up Policies for Real-World Humanoid Robots
Published at 2025-02-18
#ML
This study presents a learning framework for humanoid robots to recover from falls in various configurations and terrains. The approach involves a two-phase method that first discovers a getting-up trajectory and then refines it into a smooth and slow motion, enabling a real-world G1 humanoid robot to get up from lying face up or face down on flat, deformable, and slippery surfaces as well as slopes....
Read More

MagicArticulate: Make Your 3D Models Articulation-Ready
Published at 2025-02-18
#ML
MagicArticulate is a framework to convert static 3D models into articulation-ready assets automatically. It formulates skeleton generation as a sequence modeling problem, predicts skinning weights with a functional diffusion process, and is trained on Articulation-XL, a large-scale benchmark....
Read More

Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning
Published at 2025-02-18
#ML
MIKASA, a new benchmark for memory reinforcement learning, is proposed. It consists of a classification framework for memory-intensive tasks, a unified benchmark called MIKASA-Base, and a novel benchmark called MIKASA-Robo specifically designed for tabletop robotic manipulation tasks....
Read More

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Published at 2025-02-18
#ML
The paper introduces NSA, a hardware-aligned and natively trainable sparse attention mechanism designed for long-context modeling. NSA combines coarse-grained token compression and fine-grained token selection using a dynamic hierarchical sparse strategy and achieves substantial speedups through arithmetic intensity-balanced algorithm design, enabling end-to-end training for long-context tasks and instruction-based reasoning....
Read More

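The compress-then-select idea can be sketched in a few lines. This toy numpy version is an illustration only (NSA itself uses learned compression, sliding windows, and hardware-aligned kernels): mean-pool key blocks for coarse scoring, then run exact attention over the top-scoring blocks per query:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block_sparse_attention(q, k, v, block=4, topk=1):
    """Toy compress-then-select attention: score mean-pooled key blocks
    (coarse stage), keep the top-k blocks per query (fine selection),
    then run ordinary attention over the tokens in the kept blocks only."""
    n_blocks = k.shape[0] // block
    k_coarse = k.reshape(n_blocks, block, -1).mean(axis=1)   # compressed keys
    keep = np.argsort(q @ k_coarse.T, axis=1)[:, -topk:]     # top-k blocks per query
    out = np.zeros((q.shape[0], v.shape[1]))
    for i in range(q.shape[0]):
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep[i]])
        out[i] = softmax(q[i] @ k[idx].T) @ v[idx]           # exact attention on subset
    return out
```

With `topk` equal to the total number of blocks this reduces to ordinary dense attention, which makes the sparsification easy to sanity-check.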
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
Published at 2025-02-18
#ML
PhysReason is a new benchmark with 1,200 problems for physics-based reasoning in large language models. Top models struggle with it, especially the hard problems, in four main areas: applying physics theorems, understanding physics processes, performing calculations, and analyzing physics conditions....
Read More

ReLearn: Unlearning via Learning for Large Language Models
Published at 2025-02-18
#ML
The proposed ReLearn method uses data augmentation and fine-tuning for effective unlearning in large language models, avoiding the performance degradation and linguistic coherence issues of reverse optimization. It also introduces Knowledge Forgetting Rate, Knowledge Retention Rate, and Linguistic Score as a comprehensive evaluation framework for measuring knowledge-level preservation and generation quality....
Read More

SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Published at 2025-02-18
#ML
SURGE, a comprehensive benchmark, is introduced to investigate the capability of large language models (LLMs) as general-purpose surrogate code executors. The study reveals that while LLMs can predict code execution results in specific cases, they have limitations in general-purpose surrogate execution, providing empirical insights into the feasibility of using LLMs as surrogate code executors....
Read More

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Published at 2025-02-18
#ML
SWE-Lancer is a benchmark of over 1,400 freelance software engineering tasks from Upwork, with end-to-end tests and monetary value, which can help researchers understand the economic impact of AI model development....
Read More

Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking
Published at 2025-02-18
#ML
The paper discusses the need for explainable automated fact-checking tools in the context of increasing misinformation and examines fact-checkers' requirements through interviews, finding unmet explanation needs and identifying criteria for effective fact-checking explanations....
Read More

System Message Generation for User Preferences using Open-Source Models
Published at 2025-02-18
#ML
The study presents SysGen, a pipeline for generating system messages that improve assistant responses' alignment with user instructions. SysGen, which builds on supervised fine-tuning data that lacks system messages, shows significant improvements in model response alignment with system messages and user instructions across various open-source models on the Multifacet benchmark, while minimally impacting other unseen benchmarks such as Open LLM Leaderboard 2....
Read More

The Mirage of Model Editing: Revisiting Evaluation in the Wild
Published at 2025-02-18
#ML
The study challenges the near-perfect results of model editing in controlled environments by evaluating its performance in real-world applications with a new benchmark and evaluation framework. It reveals that current editing methods are less effective than previously reported, partly due to issues in evaluation practices such as the inappropriate use of teacher forcing, and that they fail drastically after only a small number of edits, suggesting the need for a reevaluation of existing editing methods and their ...
Read More

Towards Data-Efficient Pretraining for Atomic Property Prediction
Published at 2025-02-18
#ML
A study challenges the notion that larger datasets are always better for atomic property prediction. It introduces the Chemical Similarity Index (CSI) to select the most relevant dataset for pretraining. The results show that pretraining on a smaller, focused dataset can outperform large-scale pretraining, even when the larger datasets contain relevant data....
Read More

video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Published at 2025-02-18
#ML
This open-source tool, video-SALMONN-o1, is the first reasoning-enhanced audio-visual LLM for general video understanding tasks. It uses a reasoning-intensive dataset and pDPO to improve accuracy by 3-8% over the LLaVA-OneVision baseline and 6-8% over the supervised fine-tuning model on RivaBench, while enabling zero-shot synthetic video detection capabilities....
Read More

Tags are generated by Google's Gemini Pro API, and the summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) The full paper is translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun.
Visit the developer's social media