  1. Submissions | OpenReview

    Jan 22, 2025 · Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers Lorenzo Pacchiardi, Marko Tesic, Lucy G Cheke, Jose Hernandez-Orallo 27 …

  2. EvoTest: Evolutionary Test-Time Learning for Self-Improving …

    Sep 16, 2025 · A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like “clever but clueless interns” in novel …

  3. CLEVER: A Curated Benchmark for Formally Verified Code …

    Jul 8, 2025 · TL;DR: We introduce CLEVER, a hand-curated benchmark for verified code generation in Lean. It requires full formal specs and proofs. No few-shot method solves all stages, making …

  4. STAIR: Improving Safety Alignment with Introspective Reasoning

    May 1, 2025 · One common approach is training models to refuse unsafe queries, but this strategy can be vulnerable to clever prompts, often referred to as jailbreak attacks, which can …

  5. Towards Faithful Reasoning in Remote Sensing: A...

    Sep 18, 2025 · The semi-automated pipeline used to create it is a clever and practical approach to large-scale data generation. Strong and Comprehensive Empirical Validation: The paper …

  6. Contrastive Learning Via Equivariant Representation - OpenReview

    Sep 25, 2024 · In this paper, we revisit the roles of augmentation strategies and equivariance in improving CL's efficacy. We propose CLeVER (Contrastive Learning Via Equivariant …

  7. LongWriter: Unleashing 10,000+ Word Generation from Long …

    Jan 22, 2025 · The work includes a new benchmark (LongBench-Write) for evaluating ultra-long generation. Reviewers highlighted the paper's clear identification of the problem, the clever and …

  8. Do Histopathological Foundation Models Eliminate Batch Effects?

    Oct 11, 2024 · Deep learning has led to remarkable advancements in computational histopathology, e.g., in diagnostics, biomarker prediction, and outcome prognosis. Yet, the lack …

  9. Can ChatGPT Defend its Belief in Truth? Evaluating LLM …

    Oct 7, 2023 · Upon mitigating the Clever Hans effect, our task requires the LLM to not only achieve the correct answer on its own, but also be able to hold and defend its belief instead of blindly …

  10. Evaluating the Robustness of Neural Networks: An Extreme Value...

    Feb 15, 2018 · Our analysis yields a novel robustness metric called CLEVER, which is short for Cross Lipschitz Extreme Value for nEtwork Robustness. The proposed CLEVER score is attack …