🤗 Daily Paper Newsletter |
 |
Hope you found some gems! |
This newsletter delivers a curated list of papers from 🤗 Daily Papers. |
|
CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark |
Published at 2025-05-22 |
|
#ML
|
The researchers have created a large-scale dataset and model suite called CASS for translating GPU code between different architectures, specifically from Nvidia to AMD. This tool can translate both the source code and assembly language, and it performs much better than commercial translation tools. The researchers also made a benchmark to test the translations and released all their resources for others to use and improve upon.... |
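For a sense of what rule-based transpilers do, here is a toy sketch (ours, not CASS) of dictionary-based CUDA-to-HIP renaming; CASS instead trains models to translate both source and assembly, and this naive approach is the kind of baseline it outperforms.

```python
# Toy sketch of rule-based source translation (hipify-style renaming).
# CASS itself learns the translation; this only illustrates the naive idea.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def naive_transpile(cuda_src: str) -> str:
    """Rename CUDA runtime calls to their HIP equivalents via substring replacement."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        cuda_src = cuda_src.replace(cuda_name, hip_name)
    return cuda_src

print(naive_transpile("cudaMalloc(&p, n); cudaFree(p);"))
# hipMalloc(&p, n); hipFree(p);
```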
Read More |
|
|
|
DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers |
Published at 2025-05-24 |
|
#ML
|
The authors present DiffDecompose, a new method for separating semi-transparent or transparent layers in images using diffusion transformers. They also introduce AlphaBlend, a large-scale dataset for training and testing this method, which includes various real-world scenarios like removing flares or decomposing glassware.... |
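For context, the forward model being inverted here is standard alpha-over compositing; a minimal sketch of that forward model (ours, not the paper's code):

```python
import numpy as np

# Standard alpha-over compositing: I = alpha * F + (1 - alpha) * B.
# DiffDecompose learns the inverse problem (recovering layers from I);
# this sketch only shows the forward model the method inverts.
def alpha_composite(fg, bg, alpha):
    """Blend a foreground layer over a background with per-pixel alpha in [0, 1]."""
    return alpha * fg + (1.0 - alpha) * bg

fg = np.full((2, 2), 1.0)      # white foreground layer
bg = np.zeros((2, 2))          # black background layer
alpha = np.full((2, 2), 0.25)  # 25% opaque everywhere
print(alpha_composite(fg, bg, alpha))  # 2x2 array of 0.25
```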
Read More |
|
|
|
|
DLP: Dynamic Layerwise Pruning in Large Language Models |
Published at 2025-05-27 |
|
#ML
|
The authors present a new method called Dynamic Layerwise Pruning (DLP) that determines the importance of each layer in Large Language Models (LLMs) by integrating model weights with input activation information, thereby improving performance at high sparsity levels. DLP reduces perplexity and improves accuracy compared to existing methods and can be combined with other LLM compression techniques.... |
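For readers who want the gist in code, here is a toy sketch (ours, not the paper's) of the idea: score each layer by combining weight magnitudes with input activation norms, then assign less pruning to important layers. DLP's exact metric and allocation follow the paper, not this toy.

```python
import numpy as np

# Toy layer-importance score: mean of |W_ij| * ||x_j||_2 per layer
# (a Wanda-style weight-times-activation metric, aggregated layerwise).
def layer_importance(weights, activations):
    """weights: list of (out, in) matrices; activations: list of (samples, in) inputs."""
    scores = []
    for W, x in zip(weights, activations):
        act_norm = np.linalg.norm(x, axis=0)  # per-input-feature norm
        scores.append(float(np.mean(np.abs(W) * act_norm)))
    return scores

def allocate_sparsity(scores, target=0.7):
    """Give important layers less pruning: sparsity inversely scaled by score,
    normalized so the mean sparsity equals the target."""
    s = np.array(scores)
    inv = (1.0 / s) / np.sum(1.0 / s)
    return list(np.clip(target * len(s) * inv, 0.0, 1.0))
```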
Read More |
|
|
|
Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models |
Published at 2025-05-29 |
|
#ML
|
The authors present a new framework called Segment Policy Optimization (SPO) for improving large language models using reinforcement learning. SPO offers a middle ground between existing token-level and trajectory-level methods by using segment-level advantage estimation, resulting in more accurate credit assignment and better performance on various tasks.... |
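A toy sketch of the segment-level idea (ours; SPO's estimators are more sophisticated): rewards are summed per segment and compared against a baseline, so credit lands on segments rather than on single tokens or whole trajectories.

```python
# Toy segment-level credit assignment. The segment boundaries and the scalar
# baseline here are illustrative stand-ins, not SPO's actual estimators.
def segment_advantages(token_rewards, boundaries, baseline):
    """Sum rewards within each segment and subtract a baseline value."""
    advantages, start = [], 0
    for end in boundaries:
        seg_return = sum(token_rewards[start:end])
        advantages.append(seg_return - baseline)
        start = end
    return advantages

# Three segments over ten tokens; every token in a segment shares its advantage.
adv = segment_advantages([0.1] * 10, boundaries=[3, 7, 10], baseline=0.2)
print(adv)  # roughly [0.1, 0.2, 0.1], up to float rounding
```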
Read More |
|
|
|
|
TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence |
Published at 2025-05-30 |
|
#ML
|
The study presents a new method called TimeHC-RL to improve the social intelligence of Large Language Models (LLMs), which have already shown progress in areas like math and coding. The method focuses on the social world's unique timeline and cognitive requirements, outperforming a widely used method and helping a 7B model compete with more advanced ones.... |
Read More |
|
|
|
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation |
Published at 2025-05-31 |
|
#ML
|
BenchHub is a new tool that helps researchers and developers test large language models more effectively by collecting and organizing various benchmark datasets from different domains, making it easier to evaluate models for specific needs or use cases. The tool highlights the importance of testing models in specific domains and can improve dataset reuse, model comparisons, and identify areas that need more attention in existing benchmarks.... |
Read More |
|
|
|
|
RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents |
Published at 2025-05-31 |
|
#ML
|
The authors present RiOSWorld, a benchmark for evaluating the risks of AI agents using computers in real-world scenarios. RiOSWorld includes various risky tasks and categorizes them into user-originated and environmental risks, demonstrating the significant safety risks current computer-use agents face.... |
Read More |
|
|
|
Follow the Flow: Fine-grained Flowchart Attribution with Neurosymbolic Agents |
Published at 2025-06-02 |
|
#ML
|
This study presents a new method for improving the accuracy and reliability of language models when interpreting flowcharts, which are essential for visualizing decision-making processes. The proposed neurosymbolic agent, FlowPathAgent, reduces visual hallucinations in model responses by segmenting, analyzing, and interacting with the flowchart's structure, outperforming existing methods on a new benchmark dataset.... |
Read More |
|
|
|
|
Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation |
Published at 2025-06-02 |
|
#ML
|
This study proposes a new data augmentation method using diffusion techniques to improve knowledge distillation, particularly in cases of unknown covariate shift. By generating challenging samples that maximize disagreement between a teacher and student model, the approach significantly enhances accuracy and robustness on various datasets compared to existing methods.... |
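The selection criterion can be sketched in a few lines (ours, not the paper's code, which generates the candidate samples with diffusion models): score candidates by how much the student disagrees with the teacher, and keep the most disagreed-on ones.

```python
import math

# Toy disagreement scoring: plain KL divergence between the teacher's and
# student's predictive distributions. The paper's method generates such
# hard samples with diffusion; this only shows the selection criterion.
def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def pick_hardest(samples, teacher_probs, student_probs, k=1):
    """Return the k samples where the student most disagrees with the teacher."""
    scored = sorted(
        zip(samples, teacher_probs, student_probs),
        key=lambda t: kl_divergence(t[1], t[2]),
        reverse=True,
    )
    return [s for s, _, _ in scored[:k]]

samples = ["easy", "hard"]
teacher = [[0.9, 0.1], [0.9, 0.1]]
student = [[0.88, 0.12], [0.2, 0.8]]  # student is badly wrong on "hard"
print(pick_hardest(samples, teacher, student))  # ['hard']
```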
Read More |
|
|
|
Small Language Models are the Future of Agentic AI |
Published at 2025-06-02 |
|
#ML
|
The article argues that small language models (SLMs) are more powerful, suitable, and economical than large language models (LLMs) for many agentic AI tasks. The authors advocate for a shift towards SLMs, discussing potential barriers and proposing a conversion algorithm from LLMs to SLMs, to reduce AI costs and promote efficient use of resources.... |
Read More |
|
|
|
|
Ψ-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models |
Published at 2025-06-02 |
|
#ML
|
The Psi-Sampler framework uses an SMC-based method with a new algorithm called pCNL to improve sampling efficiency in reward alignment tasks for score-based generative models. By initializing particles from the reward-aware posterior and using pCNL for high-dimensional posterior sampling, this method outperforms existing methods in various tasks like layout-to-image generation and aesthetic-preference generation.... |
Read More |
|
|
|
A Controllable Examination for Long-Context Language Models |
Published at 2025-06-03 |
|
#ML
|
The authors present LongBioBench, a new benchmark using artificial biographies to test long-context language models. Experiments show that most models still struggle with understanding and reasoning over long contexts, and LongBioBench is more interpretable and controllable than existing synthetic benchmarks.... |
Read More |
|
|
|
|
Beyond the Surface: Measuring Self-Preference in LLM Judgments |
Published at 2025-06-03 |
|
#ML
|
This research presents a new method, DBG score, to measure self-preference bias in large language models (LLMs) by using gold judgments as proxies for response quality. The study examines the impact of response text style and post-training data on self-preference bias and explores possible explanations from an attention-based perspective.... |
Read More |
|
|
|
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback |
Published at 2025-06-03 |
|
#ML
|
The study addresses limitations of reinforcement learning with numerical feedback in large language models by proposing Critique-GRPO, an online framework that uses both natural language and numerical feedback for policy optimization. Experiments show that Critique-GRPO significantly outperforms other methods in various reasoning tasks, improving average scores by around 5%.... |
Read More |
|
|
|
|
DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models |
Published at 2025-06-03 |
|
#ML
|
The study presents DenseDPO, a method that improves upon the Direct Preference Optimization technique used for text-to-video diffusion models. DenseDPO addresses limitations in fine-grained comparisons and motion bias by creating aligned video pairs and labeling preferences on short segments, resulting in better motion generation with fewer labels. Additionally, DenseDPO can use Vision Language Models for automatic preference annotation, achieving performance close to human labels.... |
Read More |
|
|
|
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning |
Published at 2025-06-03 |
|
#ML
|
The researchers present FinChain, a new tool for testing complex, step-by-step financial reasoning in AI models. They created a large dataset with various financial topics and reasoning difficulties, along with a new way to evaluate both the final answers and the intermediate steps. When they tested 30 AI models on this dataset, they found that even the best models need improvement in multi-step financial reasoning.... |
Read More |
|
|
|
|
IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation |
Published at 2025-06-03 |
|
#ML
|
The research presents IllumiCraft, a new video generation framework that uses three types of inputs: lighting maps, synthetically lit images, and 3D geometry data, to create videos with controlled lighting and appearance. This method produces more detailed and coherent videos compared to existing techniques, allowing for better background and text-based video relighting.... |
Read More |
|
|
|
Quantitative LLM Judges |
Published at 2025-06-03 |
|
#ML
|
The study presents a new method to enhance the evaluation of large language models (LLMs) by aligning their scores with human scores using regression models. This approach is more efficient and adaptable than traditional methods, as demonstrated through experiments on various datasets. (ELI5: Think of it as teaching a computer to better understand and rate human language by learning from human feedback, making the computer's judgment more accurate and efficient.)... |
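The core recipe, regressing raw judge scores onto human scores, can be sketched as a one-feature least-squares fit (ours; the paper's regression models use richer features):

```python
import numpy as np

# Toy calibration: fit a least-squares line mapping judge scores to human
# scores. A stand-in for the paper's regression models, not their code.
def fit_judge_calibration(judge_scores, human_scores):
    """Return (slope, intercept) of the least-squares line judge -> human."""
    A = np.column_stack([judge_scores, np.ones(len(judge_scores))])
    (slope, intercept), *_ = np.linalg.lstsq(A, np.array(human_scores), rcond=None)
    return slope, intercept

judge = [1.0, 2.0, 3.0, 4.0]
human = [2.0, 4.0, 6.0, 8.0]  # in this toy the human scale is exactly 2x
slope, intercept = fit_judge_calibration(judge, human)
print(f"human ~ {slope:.2f} * judge + {intercept:.2f}")  # slope ~2, intercept ~0
```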
Read More |
|
|
|
|
RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions |
Published at 2025-06-03 |
|
#ML
|
The authors present RefEdit, a new model for improving image editing in complex scenes with multiple objects, which outperforms existing models trained on millions of data samples. They also introduce RefEdit-Bench, a real-world benchmark for evaluating this task, and release their data and model for others to replicate their results.... |
Read More |
|
|
|
Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting |
Published at 2025-06-03 |
|
#ML
|
The authors present a new framework called Asymmetric Dual 3DGS that creates more stable and consistent 3D reconstructions from challenging in-the-wild images by leveraging the randomness of visual artifacts. This method uses two parallel 3D Gaussian Splatting models with a consistency constraint and a divergent masking strategy to reduce shared error modes, and a lightweight variant called Dynamic EMA Proxy to improve training efficiency.... |
Read More |
|
|
|
|
Robustness in Both Domains: CLIP Needs a Robust Text Encoder |
Published at 2025-06-03 |
|
#ML
|
This research focuses on enhancing the robustness of text encoders in CLIP, an AI model, by introducing LEAF, a method for creating robust text encoders that can be scaled for large CLIP models. The improved text encoders help maintain performance in various applications like text-to-image generation and multimodal retrieval tasks, even under adversarial attacks.... |
Read More |
|
|
|
SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation |
Published at 2025-06-03 |
|
#ML
|
The study presents SVGenius, a comprehensive benchmark for evaluating large language models and multimodal LLMs in processing SVG files, covering 2,377 queries across three dimensions: understanding, editing, and generation. The benchmark assesses 22 models across various categories and reveals that while proprietary models outperform open-source ones, all models struggle with complexity, suggesting that reasoning-enhanced training is more effective than scaling for improving performance. |
Read More |
|
|
|
|
Solving Inverse Problems with FLAIR |
Published at 2025-06-03 |
|
#ML
|
This study presents FLAIR, a new method that uses flow-based generative models to improve the quality of solutions for inverse imaging problems. By introducing a variational objective for flow matching and combining it with deterministic trajectory adjustments, FLAIR can recover rare, atypical data modes and consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity.... |
Read More |
|
|
|
TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models |
Published at 2025-06-03 |
|
#ML
|
The authors describe TalkingMachines, a system that creates real-time, audio-driven character animations using pretrained video generation models and an audio large language model, enabling natural conversations. They achieve this by adapting a pretrained model, enabling infinite video streaming, and designing a high-throughput inference pipeline with various engineering optimizations.... |
Read More |
|
|
|
|
Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem |
Published at 2025-06-03 |
|
#ML
|
This study presents a new method called Critique Fine-Tuning (CFT) that efficiently unlocks the reasoning potential of large language models (LLMs) by training them on only one problem. CFT uses diverse model-generated solutions and detailed critiques from teacher LLMs to improve performance on various reasoning tasks, requiring significantly less computational power than existing methods.... |
Read More |
|
|
|
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning |
Published at 2025-06-03 |
|
#ML
|
The authors present a new framework called Video-Skill-CoT that tailors video reasoning to specific domains by creating skill-based reasoning annotations and training specialized expert modules. They show that this approach outperforms existing methods on various video understanding tasks, and provide detailed analysis on the effectiveness of their method.... |
Read More |
|
|
|
|
Adapt before Continual Learning |
Published at 2025-06-04 |
|
#ML
|
This study presents a new method called ACL that improves the ability of pre-trained models to learn new tasks while retaining old knowledge, by refining the model's backbone through a special adaptation phase. The method enhances the model's learning capacity and prevents forgetting, as demonstrated through various experiments and benchmarks.... |
Read More |
|
|
|
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning |
Published at 2025-06-04 |
|
#ML
|
This study improves multimodal reasoning in large language models by optimizing cold start initialization, addressing gradient stagnation in reinforcement learning, and introducing a staged training approach that balances perceptual and cognitive development, resulting in a new state-of-the-art model called ReVisual-R1.... |
Read More |
|
|
|
|
AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment |
Published at 2025-06-04 |
|
#ML
|
The researchers present AmbiK, a dataset of ambiguous instructions for a robot in a kitchen environment, which aims to provide a universal benchmark for comparing various methods of task ambiguity detection. AmbiK, which was created using Large Language Models and validated by humans, contains 1000 pairs of ambiguous and unambiguous tasks, along with their corresponding details, for a total of 2000 tasks.... |
Read More |
|
|
|
CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents |
Published at 2025-06-04 |
|
#ML
|
The researchers have developed a new method called CRAWLDoc to effectively rank relevant web documents by starting from a publication's URL and retrieving all linked resources. They tested CRAWLDoc on a new dataset of 600 publications and found that it can robustly rank documents across different publishers and formats, improving metadata extraction from varied web documents.... |
Read More |
|
|
|
|
Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis |
Published at 2025-06-04 |
|
#ML
|
This study identifies that data contamination in language model evaluations is often due to shortcut solutions in model training. The researchers propose a new method to detect and suppress these shortcut neurons, resulting in more accurate and trustworthy evaluations, with a strong correlation to a recent benchmark.... |
Read More |
|
|
|
HTSC-2025: A Benchmark Dataset of Ambient-Pressure High-Temperature Superconductors for AI-Driven Critical Temperature Prediction |
Published at 2025-06-04 |
|
#ML
|
A new dataset called HTSC-2025 has been created to help improve AI predictions for high-temperature superconducting materials. This dataset includes various superconducting materials discovered from 2023 to 2025 and is available for use and updates online.... |
Read More |
|
|
|
|
Image Editing As Programs with Diffusion Models |
Published at 2025-06-04 |
|
#ML
|
The authors present a new method called Image Editing As Programs (IEAP) that improves instruction-driven image editing with diffusion models. IEAP breaks down complex editing instructions into simpler operations, which are then executed in sequence to achieve the desired edit, resulting in better accuracy and semantic fidelity, especially for complex edits.... |
Read More |
|
|
|
LayerFlow: A Unified Model for Layer-aware Video Generation |
Published at 2025-06-04 |
|
#ML
|
LayerFlow is a system that creates videos with separate foreground, background, and blended scenes using text prompts for each layer. It uses a trained model to generate smooth videos with desired layers, starting from a text-to-video diffusion transformer and a multi-stage training strategy.... |
Read More |
|
|
|
|
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos |
Published at 2025-06-04 |
|
#ML
|
The study presents MMR-V, a video benchmark that tests multimodal deep reasoning in videos, focusing on long-range, multi-frame reasoning, hidden information, and reliability. Experiments show that current models struggle with this task, even with advanced reasoning strategies, highlighting the need for further research in this area.... |
Read More |
|
|
|
MiMo-VL Technical Report |
Published at 2025-06-04 |
|
#ML
|
The researchers have developed two advanced vision-language models, MiMo-VL-7B-SFT and MiMo-VL-7B-RL, which outperform existing models in various tasks, including multimodal reasoning and GUI grounding. The models were trained using a combination of pre-training and mixed reinforcement learning, and the researchers have also created a comprehensive evaluation suite to assess their performance.... |
Read More |
|
|
|
|
OpenThoughts: Data Recipes for Reasoning Models |
Published at 2025-06-04 |
|
#ML
|
The OpenThoughts project created an open-source dataset called OpenThoughts2-1M for training reasoning models, leading to the development of OpenThinker2-32B, the first model trained on public reasoning data to match a state-of-the-art model on standard reasoning benchmarks. Further improvements to the dataset resulted in the OpenThinker3-7B model, which achieved state-of-the-art results on various reasoning benchmarks.... |
Read More |
|
|
|
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games |
Published at 2025-06-04 |
|
#ML
|
The authors present Orak, a new benchmark for training and evaluating language model agents in various video games. Orak includes 12 popular games across major genres, allowing for a comprehensive study of language model capabilities and essential modules for complex gameplay, and it provides a plug-and-play interface and a fine-tuning dataset for consistent evaluation.... |
Read More |
|
|
|
|
POSS: Position Specialist Generates Better Draft for Speculative Decoding |
Published at 2025-06-04 |
|
#ML
|
This research presents Position Specialists (PosS) to enhance speculative decoding in Large Language Models (LLMs). PosS improves token accuracy at later positions by using position-specialized draft layers, reducing error accumulation and boosting acceptance rates. Experiments show that PosS outperforms baselines in average acceptance length and speed-up ratio on various datasets.... |
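For context, here is the verify-and-accept loop that speculative decoding relies on, as a toy sketch (ours): a draft model proposes tokens, the target model verifies them, and the longest agreeing prefix is accepted. PosS improves the draft side with position-specialized layers; this sketch only shows why later-position accuracy matters.

```python
# Toy greedy acceptance for speculative decoding. Real implementations verify
# against the target model's distribution; this compares greedy tokens only.
def accepted_prefix(draft_tokens, target_tokens):
    """Keep draft tokens until they first diverge from the target's tokens."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted.append(d)
    return accepted

# The draft degrades at the last position: exactly the late-position error
# accumulation that position-specialized draft layers aim to reduce.
print(accepted_prefix(["the", "cat", "sat", "on"], ["the", "cat", "sat", "in"]))
# ['the', 'cat', 'sat']
```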
Read More |
|
|
|
Rectified Sparse Attention |
Published at 2025-06-04 |
|
#ML
|
The authors present a new method called Rectified Sparse Attention (ReSA) that enhances the efficiency of long-sequence generation in Large Language Models. ReSA minimizes approximation errors and preserves generation quality by combining block-sparse attention with periodic dense rectification, resulting in faster processing times and maintaining near-perfect generation quality.... |
Read More |
|
|
|
|
Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective |
Published at 2025-06-04 |
|
#ML
|
This study explores the balance between stability and plasticity in neural networks from an architectural perspective. The researchers find that deeper networks are more adaptable, while wider networks are more stable, and propose a new framework called Dual-Arch to improve existing continual learning methods by combining the strengths of two specialized networks.... |
Read More |
|
|
|
Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning |
Published at 2025-06-04 |
|
#ML
|
The study presents Rex-Thinker, a model that uses a chain-of-thought reasoning approach for object referring, which enhances explainability and accuracy. Rex-Thinker identifies potential object candidates and assesses them step-by-step to match a given expression, supported by a large-scale CoT-style referring dataset. The model outperforms standard baselines in precision and interpretability, and demonstrates an improved ability to reject hallucinated outputs along with strong generalization in out-of-domain settings. |
Read More |
|
|
|
|
Sounding that Object: Interactive Object-Aware Image to Audio Generation |
Published at 2025-06-04 |
|
#ML
|
The authors present a model that generates sounds for specific objects in an image, allowing users to interactively choose which object's sound to produce. This model uses a conditional latent diffusion model and multi-modal attention to connect image regions with their corresponding sounds, outperforming existing methods in aligning objects with their sounds.... |
Read More |
|
|
|
SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models |
Published at 2025-06-04 |
|
#ML
|
The authors present SuperWriter-Agent, a new framework that improves long-form text generation by incorporating structured thinking and planning stages, similar to a professional writer's process. This framework, combined with a hierarchical optimization technique, results in a state-of-the-art language model that outperforms larger baseline models in both automatic and human evaluations.... |
Read More |
|
|
|
|
Survey of Active Learning Hyperparameters: Insights from a Large-Scale Experimental Grid |
Published at 2025-06-04 |
|
#ML
|
This study explores the impact of hyperparameters on Active Learning (AL) performance by conducting the largest AL experiment to date, with over 4.6 million hyperparameter combinations. The findings aim to provide guidelines for setting up AL, increase trust in its effectiveness, and promote reproducible AL research.... |
Read More |
|
|
|
TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems |
Published at 2025-06-04 |
|
#ML
|
This review explores how to manage trust, risk, and security in advanced AI systems that use large language models and work together in groups, focusing on unique challenges and solutions for these complex systems.... |
Read More |
|
|
|
|
VLMs Can Aggregate Scattered Training Patches |
Published at 2025-06-04 |
|
#ML
|
The study reveals that vision-language models can reassemble harmful image fragments that have been split and scattered across many training samples, posing safety risks. This ability, called visual stitching, allows models to learn dangerous content by associating it with benign descriptions, enabling the generation of harmful responses during use.... |
Read More |
|
|
|
VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation |
Published at 2025-06-04 |
|
#ML
|
Researchers created VisCode-200K, a large dataset with over 200K examples for training LLMs in generating executable Python visualization code, including 45K multi-turn correction dialogues. They fine-tuned Qwen2.5-Coder-Instruct on this dataset to develop VisCoder, which significantly outperforms open-source baselines and is close to proprietary models in generating accurate, executable visualization code.... |
Read More |
|
|
|
|
Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation |
Published at 2025-06-04 |
|
#ML
|
This study presents Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image with user-defined camera path, without requiring 3D reconstruction pipelines. The method uses three key components: World-Consistent Video Diffusion, Long-Range World Exploration, and Scalable Data Engine, resulting in improved visual quality and geometric accuracy for various applications.... |
Read More |
|
|
|
|
|
Tags are generated with Google's Gemini Pro API; summaries and translations are generated by Upstage's SOLAR mini chat model, derived from the SOLAR-10.7B open LLM.
(Experimental) Full papers are translated into Korean with the enko-t5-small-v0 model developed by Kim Kihyun. |