HF Daily Papers
huggingface.co · Release Trackers · 105 items
Proposes using frozen vision-language models as teacher signals to improve video reasoning in smaller models through adaptive test-time optimization.
Introduces Brain-IT-VQA, a benchmark and model for answering visual questions grounded in brain-imaging data (fMRI/EEG), enabling neural-decoding-based VQA.
StreamChar presents a decoupled orchestration framework for generating coherent long-horizon character audio-video streams, separating high-level planning from low-level synthesis.
Combines pipeline parallelism with speculative decoding to eliminate pipeline-bubble overhead and improve token acceptance rates in large-model inference.
Investigates whether foundation models can actively navigate to a specified viewpoint through visual exploration, revealing promising capabilities alongside systematic failure modes.
ESPO introduces an early-stopping criterion for PPO that prevents reward over-optimization during RLHF training, improving alignment stability across diverse preference datasets.
NITP proposes predicting latent implicit token representations rather than explicit tokens during pre-training, improving LLM reasoning quality without extra inference cost.
X-Stream frames multimodal large language models as multiplexers that simultaneously process and reason over multiple heterogeneous input streams, enabling richer multi-source understanding.
Empirically compares discriminative vision-language and generative video diffusion pretraining for spatial intelligence tasks, finding generation-focused models often learn richer 3D representations.
Draft-OPD trains speculative decoding draft models via on-policy distillation from the target model, improving draft token quality and boosting overall inference throughput.
SkillAdaptor enables LLM agents to automatically refine reusable skills by learning from both successful and failed action trajectories, improving generalization to novel tasks.
K-BrowseComp benchmarks web-browsing agents on navigation, search, and comprehension tasks anchored in Korean-language websites.
VideoMLA applies low-rank KV cache compression to autoregressive video diffusion transformers, enabling minute-scale video generation while keeping memory footprint tractable.
Crafter introduces a multi-agent pipeline that generates and iteratively edits publication-quality scientific figures from diverse inputs including tables, captions, and reference images.
Explores how parameter-efficient fine-tuning methods scale when simultaneously training millions of personal LoRA adapters on a single shared trillion-parameter base model.
Research shows LLM agent populations spontaneously develop compressed private languages that improve token efficiency but risk evading human oversight.
Proposes using on-policy feedback to iteratively self-improve reward models in RLHF, addressing distributional mismatch that degrades reward quality at inference time.
Presents a vision-language model architecture that scales linearly with video length, enabling practical long-video understanding without quadratic attention cost.
Introduces a method for automatically discovering latent failure signals in vision-language-action agent trajectories to enable runtime safety monitoring.
Studies how embedding models bind multiple concepts in shared vector spaces, finding that superposition is the key binding strategy.
An automated benchmark for systematically auditing the quality and diversity of skills available to LLM agents in open-domain skill ecosystems.
A benchmark revealing that current LLM agents systematically fail on long-horizon data analysis tasks requiring sustained multi-step planning and context management.
Demonstrates that standard vision-language models can learn 3D spatial representations natively without specialized 3D architectures, given appropriate training signals.
Proposes a decoupled memory architecture that separates long-term world state from short-term rendering, enabling consistent video world generation at minute-long timescales.
A knowledge distillation method for efficiently selecting essential keyframes from video streams, reducing compute while preserving task-critical visual information.
Enables open-ended skill acquisition in AI agents through self-play with co-evolving policies, generating increasingly challenging tasks without a fixed reward function.
A self-aware RL framework that trains agentic search systems to recognize when to stop exploring, preventing over-search that wastes compute without improving answer quality.
Introduces a benchmark and trajectory synthesis pipeline targeting policy-induced errors in GUI agents, enabling training of more error-resilient automation systems.
Analyzes trojan backdoor attacks on agentic AI harnesses spanning from prompt injection to persistent control hijacking, and proposes corresponding defense strategies.
Investigates AI agents autonomously running data engineering pipelines to generate specialized training data for fine-tuning foundation models on new domains.
A large-scale evaluation of neural TTS systems across diverse real-world scenarios, exposing consistent quality gaps in long-form and spontaneous speech generation.
Proposes selective task-focused memorization for multimodal agents, retaining only relevant information while discarding noise across long interaction histories.
Introduces a mixture-of-experts architecture for diffusion language models with learnable block-level experts that improve generation quality and computational efficiency.
Shows that teacher-student disagreements vary in learnability during on-policy distillation, and proposes token teachability scores to focus training on tractable disagreements.
Trains LLM agents to directly query text corpora via grep-style operations, improving information retrieval accuracy by bypassing dense vector indexing.
Automates the generation of reusable AI agent skills by distilling domain expertise from human experts, enabling systematic skill library construction at scale.
Applies trust-region constraints to blend teacher and student behaviors during on-policy distillation, improving training stability and final model performance.
Introduces representation forcing — aligning intermediate hidden states — to train unified multimodal models without the quality bottleneck imposed by discrete tokenization.
A zero-shot speech synthesis system that generates expressive long-form audio for both monologue and natural multi-speaker dialogue without speaker-specific fine-tuning.
Trains LLMs to reason over long contexts by learning from search-agent trajectories guided by rubric-based reward signals.
Technical report for Mellum2, JetBrains' updated code-focused language model optimized for developer tooling tasks including completion and code understanding.
A 100K-image dataset using generative models to produce high-quality ground truth for training generalizable image restoration networks on diverse real-world degradations.
Generates structured 3D indoor scene layouts from functional textual descriptions, bridging the gap between natural-language room requirements and geometric scene generation.
Proposes a streaming spatial audio generation system synchronized to video in real time using an autoregressive diffusion transformer, enabling immersive on-the-fly audio synthesis.
SANA-Streaming enables real-time video editing using a hybrid diffusion transformer that balances generation quality and latency for streaming applications.
AdaState introduces self-evolving anchor states that adapt dynamically to video content, enabling more coherent and efficient streaming video generation.
A unified decoding framework lets LLMs reason before applying output constraints, improving constrained generation quality across diverse tasks.
3D geometric priors from foundation models improve cross-image semantic correspondence learning by providing view-consistent structural cues.
Larger neural networks retain more knowledge because they experience less cross-task interference and have greater capacity to memorize rare training examples.
ChildVox is a new benchmark for evaluating large audio-language models on children's speech and environmental sound understanding across developmental stages.
Parallax is a parameterized local linear attention mechanism for language models that achieves strong performance with improved computational efficiency over standard attention.
PANDO enables efficient multimodal agents by distilling task-specific skills online from large foundation models into smaller specialized subagents.
A unified risk map framework integrates uncertainty from partial observability in autonomous driving to improve safety and planning decisions in real-world environments.
CoHyDE iteratively co-trains an LLM query rewriter and a dense retriever to significantly improve tool retrieval accuracy for AI agents.
SmartDirector generates cinematic videos conditioned on keyframes with narrative pacing control, enabling story-driven video synthesis from minimal inputs.
CONF-KV uses token confidence scores to guide KV cache eviction and mixed-precision storage, reducing memory usage in long-context LLM inference with minimal quality loss.
A new method generates multi-view consistent 3D Gaussian head avatars without requiring multi-view data during training, significantly reducing data requirements.
REPOT improves LLM reasoning by inserting checkpoints during program-of-thought generation that allow the model to detect and repair errors mid-task.
Corpus-grounded process supervision extends verifiable-reward RL training to factual QA beyond math and code, enabling open-domain factual reasoning improvements.
NeuROK generates 4D neural representations of object kinematics, enabling realistic synthesis of objects in motion using learned physical motion priors.
Proposes a multi-agent harness that interleaves retrieval, reasoning, and writing steps to produce verifiable multimodal deep-research reports.
Introduces UI-KOBE, a knowledge-oriented exploration framework that uses lightweight graph guidance to improve GUI agent behavior coverage.
Presents PRISM, a multi-dimensional benchmark for measuring how well LLMs perform scientific peer review across quality, coverage, and reasoning axes.
Proposes RUBRIC-ARROW, an alternating rubric reward modeling approach for post-training LLMs in domains where correctness cannot be automatically verified.
Introduces PhyGenHOI, a physically-aware diffusion model for generating plausible 4D dynamic human-object interaction sequences that obey physics constraints.
Presents DynaFLIP, a tri-modal dynamics-guided representation framework that improves robotic perception by fusing visual, force, and proprioceptive signals.
Introduces WorldMemArena, a benchmark that evaluates multimodal agent memory by measuring how well agents retain and use context across long action-world interactions.
Studies when LLMs should update their beliefs given new context, finding that most models are miscalibrated—either too stubborn or too easily swayed.
Shows that using colored (correlated) noise instead of white noise during diffusion sampling improves sample quality and coherence with minimal overhead.
Provides mechanistic explanations for how dense retrieval models build and use internal representations, making their behavior more interpretable.
Presents CausaLab, a scalable interactive environment where AI agents can design experiments, observe outcomes, and perform causal discovery autonomously.
Investigates whether position bias in dense retrieval—where earlier documents rank higher—is baked into model architecture or shaped by training data, finding both factors contribute.
Reports practical lessons from deploying hybrid cloud-device multi-agent systems, covering latency, state management, failure handling, and cost tradeoffs.
Introduces LiteCoder-Terminal, a scalable terminal environment for training language agents on long-horizon coding and system-administration tasks.
Presents AsyncTool, a benchmark that tests LLMs on asynchronous parallel function-calling scenarios where multiple tools must be invoked and coordinated concurrently.
Presents Qwen-VLA, a unified vision-language-action model that generalizes across diverse robotic tasks, environments, and embodiments using a single pretrained backbone.
Proposes OmniRetrieval, a unified framework that retrieves from heterogeneous knowledge sources—documents, tables, images—using a single query interface.
Introduces CollectionLoRA, which distills 50 distinct visual style effects into a single LoRA adapter through multi-teacher on-policy distillation.
Releases minWM, a full-stack open-source framework for building and running real-time interactive video world models with low latency.
Analyzes the gap between modern video generation and true world modeling through a causality lens, finding that current models lack causal grounding.
Probes how vision-language models represent spatial concepts, finding systematic biases in how they encode directionality and distance.
Presents GenClaw, an agentic image generation system that produces and executes code to control image synthesis, enabling precise programmatic edits.
Derives a parametric memory law showing how LoRA adapter rank and training steps jointly determine how much new knowledge a fine-tuned LLM retains.
Proposes EarlyTom, which compresses video token sequences early in transformer processing to accelerate video understanding with minimal accuracy loss.
Introduces a native audio-visual alignment method for generation that jointly models audio and video without requiring separate synchronization post-processing.
Presents UniSteer, a text-guided activation-space steering method using flow matching that can direct LLM behavior across diverse objectives without fine-tuning.
Proposes LaRA, which uses layer-wise representation analysis to detect when RL post-training data has been contaminated or over-represented.
Introduces Skill0.5, a framework for jointly internalizing reusable skills and learning when to deploy them, enabling out-of-distribution generalization in agentic RL.
Presents LoMo, a local modality substitution technique that replaces visual tokens with contextually matched language tokens to enable deeper vision-language fusion.
Presents AgentDoG 1.5, a lightweight alignment framework for AI agents that adds safety guardrails and security checks with minimal performance overhead.