new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

May 18

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi-agent framework for web-to-table search that supports both regimes. Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel. Through a closed-loop run--verify--reflect process, the framework jointly improves decomposition and execution over time via persistent, human-readable external memory, with self-evolving updates to each single-agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of 38.50 (7.5times the second best at 5.10), Row F1 of 63.53 (+25.03 over the second best), and Item F1 of 80.12 (+14.42 over the second best). It also generalises to depth-oriented search on XBench-DeepSearch, achieving 73.0 accuracy. Code is available at https://github.com/web2bigtable/web2bigtable.

  • 9 authors
·
Apr 28 5

A hybrid deep-learning-metaheuristic framework for bi-level network design problems

This study proposes a hybrid deep-learning-metaheuristic framework with a bi-level architecture for road network design problems (NDPs). We train a graph neural network (GNN) to approximate the solution of the user equilibrium (UE) traffic assignment problem and use inferences made by the trained model to calculate fitness function evaluations of a genetic algorithm (GA) to approximate solutions for NDPs. Using three test networks, two NDP variants and an exact solver as benchmark, we show that on average, our proposed framework can provide solutions within 1.5% gap of the best results in less than 0.5% of the time used by the exact solution procedure. Our framework can be utilized within an expert system for infrastructure planning to determine the best infrastructure planning and management decisions under different scenarios. Given the flexibility of the framework, it can easily be adapted to many other decision problems that can be modeled as bi-level problems on graphs. Moreover, we foreseen interesting future research directions, thus we also put forward a brief research agenda for this topic. The key observation from our research that can shape future research is that the fitness function evaluation time using the inferences made by the GNN model was in the order of milliseconds, which points to an opportunity and a need for novel heuristics that 1) can cope well with noisy fitness function values provided by deep learning models, and 2) can use the significantly enlarged efficiency of the evaluation step to explore the search space effectively (rather than efficiently). This opens a new avenue for a modern class of metaheuristics that are crafted for use with AI-powered predictors.

  • 2 authors
·
Mar 10, 2023

Accelerating Scientific Discovery with Autonomous Goal-evolving Agents

There has been unprecedented interest in developing agents that expand the boundary of scientific discovery, primarily by optimizing quantitative objective functions specified by scientists. However, for grand challenges in science, these objectives may only be imperfect proxies. We argue that automating objective function design is a central, yet unmet need for scientific discovery agents. In this work, we introduce the Scientific Autonomous Goal-evolving Agent (SAGA) to address this challenge. SAGA employs a bi-level architecture in which an outer loop of LLM agents analyzes optimization outcomes, proposes new objectives, and converts them into computable scoring functions, while an inner loop performs solution optimization under the current objectives. This bi-level design enables systematic exploration of the space of objectives and their trade-offs, rather than treating them as fixed inputs. We demonstrate the framework through a wide range of design applications, including antibiotics, nanobodies, functional DNA sequences, inorganic materials, and chemical processes. Notably, our experimental validation identifies a structurally novel hit with promising potency and safety profiles for E. coli in the antibiotic design task, and three de novo PD-L1 binders in the nanobody design task. These results suggest that automating objective formulation can substantially improve the effectiveness of scientific discovery agents.

  • 28 authors
·
Mar 29

Advancing Model Pruning via Bi-level Optimization

The deployment constraints in practical applications necessitate the pruning of large-scale deep learning models, i.e., promoting their weight sparsity. As illustrated by the Lottery Ticket Hypothesis (LTH), pruning also has the potential of improving their generalization ability. At the core of LTH, iterative magnitude pruning (IMP) is the predominant pruning method to successfully find 'winning tickets'. Yet, the computation cost of IMP grows prohibitively as the targeted pruning ratio increases. To reduce the computation overhead, various efficient 'one-shot' pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as IMP. This raises the question of how to close the gap between pruning accuracy and pruning efficiency? To tackle it, we pursue the algorithmic advancement of model pruning. Specifically, we formulate the pruning problem from a fresh and novel viewpoint, bi-level optimization (BLO). We show that the BLO interpretation provides a technically-grounded optimization base for an efficient implementation of the pruning-retraining learning paradigm used in IMP. We also show that the proposed bi-level optimization-oriented pruning method (termed BiP) is a special class of BLO problems with a bi-linear problem structure. By leveraging such bi-linearity, we theoretically show that BiP can be solved as easily as first-order optimization, thus inheriting the computation efficiency. Through extensive experiments on both structured and unstructured pruning with 5 model architectures and 4 data sets, we demonstrate that BiP can find better winning tickets than IMP in most cases, and is computationally as efficient as the one-shot pruning schemes, demonstrating 2-7 times speedup over IMP for the same level of model accuracy and sparsity.

  • 8 authors
·
Oct 8, 2022

Communication Learning in Multi-Agent Systems from Graph Modeling Perspective

In numerous artificial intelligence applications, the collaborative efforts of multiple intelligent agents are imperative for the successful attainment of target objectives. To enhance coordination among these agents, a distributed communication framework is often employed. However, indiscriminate information sharing among all agents can be resource-intensive, and the adoption of manually pre-defined communication architectures imposes constraints on inter-agent communication, thus limiting the potential for effective collaboration. Moreover, the communication framework often remains static during inference, which may result in sustained high resource consumption, as in most cases, only key decisions necessitate information sharing among agents. In this study, we introduce a novel approach wherein we conceptualize the communication architecture among agents as a learnable graph. We formulate this problem as the task of determining the communication graph while enabling the architecture parameters to update normally, thus necessitating a bi-level optimization process. Utilizing continuous relaxation of the graph representation and incorporating attention units, our proposed approach, CommFormer, efficiently optimizes the communication graph and concurrently refines architectural parameters through gradient descent in an end-to-end manner. Additionally, we introduce a temporal gating mechanism for each agent, enabling dynamic decisions on whether to receive shared information at a given time, based on current observations, thus improving decision-making efficiency. Extensive experiments on a variety of cooperative tasks substantiate the robustness of our model across diverse cooperative scenarios, where agents are able to develop more coordinated and sophisticated strategies regardless of changes in the number of agents.

  • 4 authors
·
Nov 1, 2024

MedSAM-CA: A CNN-Augmented ViT with Attention-Enhanced Multi-Scale Fusion for Medical Image Segmentation

Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning, where accurate boundary delineation is essential for precise lesion localization, organ identification, and quantitative assessment. In recent years, deep learning-based methods have significantly advanced segmentation accuracy. However, two major challenges remain. First, the performance of these methods heavily relies on large-scale annotated datasets, which are often difficult to obtain in medical scenarios due to privacy concerns and high annotation costs. Second, clinically challenging scenarios, such as low contrast in certain imaging modalities and blurry lesion boundaries caused by malignancy, still pose obstacles to precise segmentation. To address these challenges, we propose MedSAM-CA, an architecture-level fine-tuning approach that mitigates reliance on extensive manual annotations by adapting the pretrained foundation model, Medical Segment Anything (MedSAM). MedSAM-CA introduces two key components: the Convolutional Attention-Enhanced Boundary Refinement Network (CBR-Net) and the Attention-Enhanced Feature Fusion Block (Atte-FFB). CBR-Net operates in parallel with the MedSAM encoder to recover boundary information potentially overlooked by long-range attention mechanisms, leveraging hierarchical convolutional processing. Atte-FFB, embedded in the MedSAM decoder, fuses multi-level fine-grained features from skip connections in CBR-Net with global representations upsampled within the decoder to enhance boundary delineation accuracy. Experiments on publicly available datasets covering dermoscopy, CT, and MRI imaging modalities validate the effectiveness of MedSAM-CA. On dermoscopy dataset, MedSAM-CA achieves 94.43% Dice with only 2% of full training data, reaching 97.25% of full-data training performance, demonstrating strong effectiveness in low-resource clinical settings.

  • 4 authors
·
Jun 30, 2025

Pangu-Weather: A 3D High-Resolution Model for Fast and Accurate Global Weather Forecast

In this paper, we present Pangu-Weather, a deep learning based system for fast and accurate global weather forecast. For this purpose, we establish a data-driven environment by downloading 43 years of hourly global weather data from the 5th generation of ECMWF reanalysis (ERA5) data and train a few deep neural networks with about 256 million parameters in total. The spatial resolution of forecast is 0.25^circtimes0.25^circ, comparable to the ECMWF Integrated Forecast Systems (IFS). More importantly, for the first time, an AI-based method outperforms state-of-the-art numerical weather prediction (NWP) methods in terms of accuracy (latitude-weighted RMSE and ACC) of all factors (e.g., geopotential, specific humidity, wind speed, temperature, etc.) and in all time ranges (from one hour to one week). There are two key strategies to improve the prediction accuracy: (i) designing a 3D Earth Specific Transformer (3DEST) architecture that formulates the height (pressure level) information into cubic data, and (ii) applying a hierarchical temporal aggregation algorithm to alleviate cumulative forecast errors. In deterministic forecast, Pangu-Weather shows great advantages for short to medium-range forecast (i.e., forecast time ranges from one hour to one week). Pangu-Weather supports a wide range of downstream forecast scenarios, including extreme weather forecast (e.g., tropical cyclone tracking) and large-member ensemble forecast in real-time. Pangu-Weather not only ends the debate on whether AI-based methods can surpass conventional NWP methods, but also reveals novel directions for improving deep learning weather forecast systems.

  • 6 authors
·
Nov 3, 2022

Population Aware Diffusion for Time Series Generation

Diffusion models have shown promising ability in generating high-quality time series (TS) data. Despite the initial success, existing works mostly focus on the authenticity of data at the individual level, but pay less attention to preserving the population-level properties on the entire dataset. Such population-level properties include value distributions for each dimension and distributions of certain functional dependencies (e.g., cross-correlation, CC) between different dimensions. For instance, when generating house energy consumption TS data, the value distributions of the outside temperature and the kitchen temperature should be preserved, as well as the distribution of CC between them. Preserving such TS population-level properties is critical in maintaining the statistical insights of the datasets, mitigating model bias, and augmenting downstream tasks like TS prediction. Yet, it is often overlooked by existing models. Hence, data generated by existing models often bear distribution shifts from the original data. We propose Population-aware Diffusion for Time Series (PaD-TS), a new TS generation model that better preserves the population-level properties. The key novelties of PaD-TS include 1) a new training method explicitly incorporating TS population-level property preservation, and 2) a new dual-channel encoder model architecture that better captures the TS data structure. Empirical results in major benchmark datasets show that PaD-TS can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining a performance comparable to state-of-the-art models on individual-level authenticity.

  • 5 authors
·
Jan 1, 2025 2

A Fully Automated DM-BIM-BEM Pipeline Enabling Graph-Based Intelligence, Interoperability, and Performance-Driven Early Design

Artificial intelligence in construction increasingly depends on structured representations such as Building Information Models and knowledge graphs, yet early-stage building designs are predominantly created as flexible boundary-representation (B-rep) models that lack explicit spatial, semantic, and performance structure. This paper presents a robust, fully automated framework that transforms unstructured B-rep geometry into knowledge-graph-based Building Information Models and further into executable Building Energy Models. The framework enables artificial intelligence to explicitly interpret building elements, spatial topology, and their associated thermal and performance attributes. It integrates automated geometry cleansing, multiple auto space-generation strategies, graph-based extraction of space and element topology, ontology-aligned knowledge modeling, and reversible transformation between ontology-based BIM and EnergyPlus energy models. Validation on parametric, sketch-based, and real-world building datasets demonstrates high robustness, consistent topological reconstruction, and reliable performance-model generation. By bridging design models, BIM, and BEM, the framework provides an AI-oriented infrastructure that extends BIM- and graph-based intelligence pipelines to flexible early-stage design geometry, enabling performance-driven design exploration and optimization by learning-based methods.

  • 6 authors
·
Jan 23

RARTS: An Efficient First-Order Relaxed Architecture Search Method

Differentiable architecture search (DARTS) is an effective method for data-driven neural network design based on solving a bilevel optimization problem. Despite its success in many architecture search tasks, there are still some concerns about the accuracy of first-order DARTS and the efficiency of the second-order DARTS. In this paper, we formulate a single level alternative and a relaxed architecture search (RARTS) method that utilizes the whole dataset in architecture learning via both data and network splitting, without involving mixed second derivatives of the corresponding loss functions like DARTS. In our formulation of network splitting, two networks with different but related weights cooperate in search of a shared architecture. The advantage of RARTS over DARTS is justified by a convergence theorem and an analytically solvable model. Moreover, RARTS outperforms DARTS and its variants in accuracy and search efficiency, as shown in adequate experimental results. For the task of searching topological architecture, i.e., the edges and the operations, RARTS obtains a higher accuracy and 60\% reduction of computational cost than second-order DARTS on CIFAR-10. RARTS continues to out-perform DARTS upon transfer to ImageNet and is on par with recent variants of DARTS even though our innovation is purely on the training algorithm without modifying search space. For the task of searching width, i.e., the number of channels in convolutional layers, RARTS also outperforms the traditional network pruning benchmarks. Further experiments on the public architecture search benchmark like NATS-Bench also support the preeminence of RARTS.

  • 3 authors
·
Aug 10, 2020

BIMgent: Towards Autonomous Building Modeling via Computer-use Agents

Existing computer-use agents primarily focus on general-purpose desktop automation tasks, with limited exploration of their application in highly specialized domains. In particular, the 3D building modeling process in the Architecture, Engineering, and Construction (AEC) sector involves open-ended design tasks and complex interaction patterns within Building Information Modeling (BIM) authoring software, which has yet to be thoroughly addressed by current studies. In this paper, we propose BIMgent, an agentic framework powered by multimodal large language models (LLMs), designed to enable autonomous building model authoring via graphical user interface (GUI) operations. BIMgent automates the architectural building modeling process, including multimodal input for conceptual design, planning of software-specific workflows, and efficient execution of the authoring GUI actions. We evaluate BIMgent on real-world building modeling tasks, including both text-based conceptual design generation and reconstruction from existing building design. The design quality achieved by BIMgent was found to be reasonable. Its operations achieved a 32% success rate, whereas all baseline models failed to complete the tasks (0% success rate). Results demonstrate that BIMgent effectively reduces manual workload while preserving design intent, highlighting its potential for practical deployment in real-world architectural modeling scenarios. Project page: https://tumcms.github.io/BIMgent.github.io/

  • 4 authors
·
Jun 8, 2025 1

DataLab: A Unifed Platform for LLM-Powered Business Intelligence

Business intelligence (BI) transforms large volumes of data within modern organizations into actionable insights for informed decision-making. Recently, large language model (LLM)-based agents have streamlined the BI workflow by automatically performing task planning, reasoning, and actions in executable environments based on natural language (NL) queries. However, existing approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS. The fragmentation of tasks across different data roles and tools lead to inefficiencies and potential errors due to the iterative and collaborative nature of BI. In this paper, we introduce DataLab, a unified BI platform that integrates a one-stop LLM-based agent framework with an augmented computational notebook interface. DataLab supports a wide range of BI tasks for different data roles by seamlessly combining LLM assistance with user customization within a single environment. To achieve this unification, we design a domain knowledge incorporation module tailored for enterprise-specific BI tasks, an inter-agent communication mechanism to facilitate information sharing across the BI workflow, and a cell-based context management strategy to enhance context utilization efficiency in BI notebooks. Extensive experiments demonstrate that DataLab achieves state-of-the-art performance on various BI tasks across popular research benchmarks. Moreover, DataLab maintains high effectiveness and efficiency on real-world datasets from Tencent, achieving up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks.

  • 21 authors
·
Dec 3, 2024

Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph

Business Intelligence (BI) is crucial in modern enterprises and billion-dollar business. Traditionally, technical experts like database administrators would manually prepare BI-models (e.g., in star or snowflake schemas) that join tables in data warehouses, before less-technical business users can run analytics using end-user dashboarding tools. However, the popularity of self-service BI (e.g., Tableau and Power-BI) in recent years creates a strong demand for less technical end-users to build BI-models themselves. We develop an Auto-BI system that can accurately predict BI models given a set of input tables, using a principled graph-based optimization problem we propose called k-Min-Cost-Arborescence (k-MCA), which holistically considers both local join prediction and global schema-graph structures, leveraging a graph-theoretical structure called arborescence. While we prove k-MCA is intractable and inapproximate in general, we develop novel algorithms that can solve k-MCA optimally, which is shown to be efficient in practice with sub-second latency and can scale to the largest BI-models we encounter (with close to 100 tables). Auto-BI is rigorously evaluated on a unique dataset with over 100K real BI models we harvested, as well as on 4 popular TPC benchmarks. It is shown to be both efficient and accurate, achieving over 0.9 F1-score on both real and synthetic benchmarks.

  • 3 authors
·
Jun 21, 2023

BiBench: Benchmarking and Analyzing Network Binarization

Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in the actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that function at the operator level and with extensive influence. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research. The code for our BiBench is released https://github.com/htqin/BiBench .

  • 8 authors
·
Jan 26, 2023

BiFormer: Vision Transformer with Bi-Level Routing Attention

As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions (\ie, routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query adaptive manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer.

  • 5 authors
·
Mar 15, 2023

GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

Agentic LLM frameworks that rely on prompted orchestration, where the model itself determines workflow transitions, often suffer from hallucinated routing, infinite loops, and non-reproducible execution. We introduce GraphBit, an engine-orchestrated framework that defines workflows explicitly and deterministically as a directed acyclic graph (DAG). Unlike prompted orchestration, agents in GraphBit operate as typed functions, while a Rust-based engine governs routing, state transitions, and tool invocation, ensuring reproducibility and auditability. The engine supports parallel branch execution, conditional control flow over structured state predicates, and configurable error recovery. A three-tier memory architecture consisting of ephemeral scratch space, structured state, and external connectors isolates context across stages, preventing cascading context bloat that degrades reasoning in long-running pipelines. Across GAIA benchmark tasks spanning zero-tool, document-augmented, and web-enabled workflows, GraphBit outperforms six existing frameworks, achieving the highest accuracy (67.6 percent), zero framework-induced hallucinations, the lowest latency (11.9 ms overhead), and the highest throughput. Ablation studies demonstrate that each memory tier contributes measurably to performance, with deterministic execution providing the greatest gains on tool-intensive tasks representative of real-world deployments.

  • 4 authors
·
Mar 7

The Evolution of Multimodal Model Architectures

This work uniquely identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain. Distinct from recent survey papers that present general information on multimodal architectures, this research conducts a comprehensive exploration of architectural details and identifies four specific architectural types. The types are distinguished by their respective methodologies for integrating multimodal inputs into the deep neural network model. The first two types (Type A and B) deeply fuses multimodal inputs within the internal layers of the model, whereas the following two types (Type C and D) facilitate early fusion at the input stage. Type-A employs standard cross-attention, whereas Type-B utilizes custom-designed layers for modality fusion within the internal layers. On the other hand, Type-C utilizes modality-specific encoders, while Type-D leverages tokenizers to process the modalities at the model's input stage. The identified architecture types aid the monitoring of any-to-any multimodal model development. Notably, Type-C and Type-D are currently favored in the construction of any-to-any multimodal models. Type-C, distinguished by its non-tokenizing multimodal model architecture, is emerging as a viable alternative to Type-D, which utilizes input-tokenizing techniques. To assist in model selection, this work highlights the advantages and disadvantages of each architecture type based on data and compute requirements, architecture complexity, scalability, simplification of adding modalities, training objectives, and any-to-any multimodal generation capability.

  • 4 authors
·
May 28, 2024

BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation

The low-level details and high-level semantics are both essential to the semantic segmentation task. However, to speed up the model inference, current approaches almost always sacrifice the low-level details, which leads to a considerable accuracy decrease. We propose to treat these spatial details and categorical semantics separately to achieve high accuracy and high efficiency for realtime semantic segmentation. To this end, we propose an efficient and effective architecture with a good trade-off between speed and accuracy, termed Bilateral Segmentation Network (BiSeNet V2). This architecture involves: (i) a Detail Branch, with wide channels and shallow layers to capture low-level details and generate high-resolution feature representation; (ii) a Semantic Branch, with narrow channels and deep layers to obtain high-level semantic context. The Semantic Branch is lightweight due to reducing the channel capacity and a fast-downsampling strategy. Furthermore, we design a Guided Aggregation Layer to enhance mutual connections and fuse both types of feature representation. Besides, a booster training strategy is designed to improve the segmentation performance without any extra inference cost. Extensive quantitative and qualitative evaluations demonstrate that the proposed architecture performs favourably against a few state-of-the-art real-time semantic segmentation approaches. Specifically, for a 2,048x1,024 input, we achieve 72.6% Mean IoU on the Cityscapes test set with a speed of 156 FPS on one NVIDIA GeForce GTX 1080 Ti card, which is significantly faster than existing methods, yet we achieve better segmentation accuracy.

  • 6 authors
·
Apr 4, 2020

Optimizing NOTEARS Objectives via Topological Swaps

Recently, an intriguing class of non-convex optimization problems has emerged in the context of learning directed acyclic graphs (DAGs). These problems involve minimizing a given loss or score function, subject to a non-convex continuous constraint that penalizes the presence of cycles in a graph. In this work, we delve into the optimization challenges associated with this class of non-convex programs. To address these challenges, we propose a bi-level algorithm that leverages the non-convex constraint in a novel way. The outer level of the algorithm optimizes over topological orders by iteratively swapping pairs of nodes within the topological order of a DAG. A key innovation of our approach is the development of an effective method for generating a set of candidate swapping pairs for each iteration. At the inner level, given a topological order, we utilize off-the-shelf solvers that can handle linear constraints. The key advantage of our proposed algorithm is that it is guaranteed to find a local minimum or a KKT point under weaker conditions compared to previous work and finds solutions with lower scores. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in terms of achieving a better score. Additionally, our method can also be used as a post-processing algorithm to significantly improve the score of other algorithms. Code implementing the proposed method is available at https://github.com/duntrain/topo.

  • 4 authors
·
May 26, 2023

CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

Given a large and evolving codebase, the ability to automatically generate holistic, architecture-aware documentation that captures not only individual functions but also cross-file, cross-module, and system-level interactions remains an open challenge. Comprehensive documentation is essential for long-term software maintenance and collaboration, yet current automated approaches still fail to model the rich semantic dependencies and architectural structures that define real-world software systems. We present CodeWiki, a unified framework for automated repository-level documentation across seven programming languages. CodeWiki introduces three key innovations: (i) hierarchical decomposition that preserves architectural context across multiple levels of granularity, (ii) recursive multi-agent processing with dynamic task delegation for scalable generation, and (iii) multi-modal synthesis that integrates textual descriptions with visual artifacts such as architecture diagrams and data-flow representations. To enable rigorous evaluation, we introduce CodeWikiBench, a comprehensive benchmark featuring multi-dimensional rubrics and LLM-based assessment protocols. Experimental results show that CodeWiki achieves a 68.79\% quality score with proprietary models, outperforming the closed-source DeepWiki baseline (64.06\%) by 4.73\%, with particularly strong improvements on high-level scripting languages (+10.47\%). We open-source CodeWiki to foster future research and community adoption.

  • 4 authors
·
Oct 28, 2025

ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design

Machine learning is a prevalent approach to tame the complexity of design space exploration for domain-specific architectures. Using ML for design space exploration poses challenges. First, it's not straightforward to identify the suitable algorithm from an increasing pool of ML methods. Second, assessing the trade-offs between performance and sample efficiency across these methods is inconclusive. Finally, lack of a holistic framework for fair, reproducible, and objective comparison across these methods hinders progress of adopting ML-aided architecture design space exploration and impedes creating repeatable artifacts. To mitigate these challenges, we introduce ArchGym, an open-source gym and easy-to-extend framework that connects diverse search algorithms to architecture simulators. To demonstrate utility, we evaluate ArchGym across multiple vanilla and domain-specific search algorithms in designing custom memory controller, deep neural network accelerators, and custom SoC for AR/VR workloads, encompassing over 21K experiments. Results suggest that with unlimited samples, ML algorithms are equally favorable to meet user-defined target specification if hyperparameters are tuned; no solution is necessarily better than another (e.g., reinforcement learning vs. Bayesian methods). We coin the term hyperparameter lottery to describe the chance for a search algorithm to find an optimal design provided meticulously selected hyperparameters. The ease of data collection and aggregation in ArchGym facilitates research in ML-aided architecture design space exploration. As a case study, we show this advantage by developing a proxy cost model with an RMSE of 0.61% that offers a 2,000-fold reduction in simulation time. Code and data for ArchGym is available at https://bit.ly/ArchGym.

  • 11 authors
·
Jun 15, 2023

AlphaGo Moment for Model Architecture Discovery

While AI systems demonstrate exponentially improving capabilities, the pace of AI research itself remains linearly bounded by human cognitive capacity, creating an increasingly severe development bottleneck. We present ASI-Arch, the first demonstration of Artificial Superintelligence for AI research (ASI4AI) in the critical domain of neural architecture discovery--a fully autonomous system that shatters this fundamental constraint by enabling AI to conduct its own architectural innovation. Moving beyond traditional Neural Architecture Search (NAS), which is fundamentally limited to exploring human-defined spaces, we introduce a paradigm shift from automated optimization to automated innovation. ASI-Arch can conduct end-to-end scientific research in the domain of architecture discovery, autonomously hypothesizing novel architectural concepts, implementing them as executable code, training and empirically validating their performance through rigorous experimentation and past experience. ASI-Arch conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures. Like AlphaGo's Move 37 that revealed unexpected strategic insights invisible to human players, our AI-discovered architectures demonstrate emergent design principles that systematically surpass human-designed baselines and illuminate previously unknown pathways for architectural innovation. Crucially, we establish the first empirical scaling law for scientific discovery itself--demonstrating that architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process. We provide comprehensive analysis of the emergent design patterns and autonomous research capabilities that enabled these breakthroughs, establishing a blueprint for self-accelerating AI systems.

  • 7 authors
·
Jul 23, 2025 1

Exploring the Reasoning Depth of Small Language Models in Software Architecture: A Multidimensional Evaluation Framework Towards Software Engineering 2.0

In the era of "Software Engineering 2.0" (SE 2.0), where intelligent agents collaborate with human engineers, Generative AI is advancing beyond code generation into Software Architecture (SA). While Large Language Models (LLMs) demonstrate superior capabilities, computational costs and data privacy concerns drive interest in Small Language Models (SLMs) with fewer than 7 billion parameters. However, the reasoning limits of these resource-constrained models remain unexplored. This study benchmarks 10 state-of-the-art SLMs on Architectural Decision Records generation, introducing a multi-dimensional framework evaluating Technical Compliance and Semantic Diversity. Our empirical results reveal a significant reasoning gap: models above the 3B-parameter threshold demonstrate robust zero-shot capabilities, while sub-2B models show the strongest BERTScore gains from Fine-Tuning, though compliance improvements are not guaranteed. Contrary to assumptions regarding context saturation, Few-Shot prompting serves as a highly effective calibration mechanism for select mid-sized models with short context windows. Furthermore, high semantic diversity in off-the-shelf small models often correlates with hallucination rather than productive exploration. These findings establish a rigorous baseline for deploying sustainable, locally hosted architectural assistants.

  • 5 authors
·
Mar 6

Towards Responsible AI in the Era of ChatGPT: A Reference Architecture for Designing Foundation Model-based AI Systems

The release of ChatGPT, Bard, and other large language model (LLM)-based chatbots has drawn huge attention on foundations models worldwide. There is a growing trend that foundation models will serve as the fundamental building blocks for most of the future AI systems. However, incorporating foundation models in AI systems raises significant concerns about responsible AI due to their black box nature and rapidly advancing super-intelligence. Additionally, the foundation model's growing capabilities can eventually absorb the other components of AI systems, introducing the moving boundary and interface evolution challenges in architecture design. To address these challenges, this paper proposes a pattern-oriented responsible-AI-by-design reference architecture for designing foundation model-based AI systems. Specially, the paper first presents an architecture evolution of AI systems in the era of foundation models, from "foundation-model-as-a-connector" to "foundation-model-as-a-monolithic architecture". The paper then identifies the key design decision points and proposes a pattern-oriented reference architecture to provide reusable responsible-AI-by-design architectural solutions to address the new architecture evolution and responsible AI challenges. The patterns can be embedded as product features of foundation model-based AI systems and can enable organisations to capitalise on the potential of foundation models while minimising associated risks.

  • 5 authors
·
Apr 13, 2023

CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design

Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (\eg images, layouts, and texts), which pose multi-condition control challenges for existing methods. Specifically, previous single-condition control models demonstrate effectiveness only within their specialized domains but fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.

  • 9 authors
·
May 25, 2025

Efficient and Scalable Agentic AI with Heterogeneous Systems

AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the promised benefits of AI to enterprises and consumers. Unlike conventional software or static inference, agentic workloads are dynamic and structurally complex. Often these agents are directed graphs of compute and IO operations that span multi-modal data input and conversion), data processing and context gathering (e.g vector DB lookups), multiple LLM inferences, tool calls, etc. To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure. To tackle this challenge, in this paper, we present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure spanning CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. The system delivers several building blocks: a framework for planning and optimizing agentic AI execution graphs using cost models that account for compute, memory, and bandwidth constraints of different HW; a MLIR based representation and compilation system that can decompose AI agent execution graphs into granular operators and generate code for different HW options; and a dynamic orchestration system that can place the granular components across a heterogeneous compute infrastructure and stitch them together while meeting an end-to-end SLA. Our design performs a systems level TCO optimization and preliminary results show that leveraging a heterogeneous infrastructure can deliver significant TCO benefits. A preliminary surprising finding is that for some workloads a heterogeneous combination of older generation GPUs with newer accelerators can deliver similar TCO as the latest generation homogenous GPU infrastructure design, potentially extending the life of deployed infrastructure.

  • 3 authors
·
Jul 25, 2025

A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications

This survey examines the rapidly evolving field of Deep Research systems -- AI-powered applications that automate complex research workflows through the integration of large language models, advanced information retrieval, and autonomous reasoning capabilities. We analyze more than 80 commercial and non-commercial implementations that have emerged since 2023, including OpenAI/Deep Research, Gemini/Deep Research, Perplexity/Deep Research, and numerous open-source alternatives. Through comprehensive examination, we propose a novel hierarchical taxonomy that categorizes systems according to four fundamental technical dimensions: foundation models and reasoning engines, tool utilization and environmental interaction, task planning and execution control, and knowledge synthesis and output generation. We explore the architectural patterns, implementation approaches, and domain-specific adaptations that characterize these systems across academic, scientific, business, and educational applications. Our analysis reveals both the significant capabilities of current implementations and the technical and ethical challenges they present regarding information accuracy, privacy, intellectual property, and accessibility. The survey concludes by identifying promising research directions in advanced reasoning architectures, multimodal integration, domain specialization, human-AI collaboration, and ecosystem standardization that will likely shape the future evolution of this transformative technology. By providing a comprehensive framework for understanding Deep Research systems, this survey contributes to both the theoretical understanding of AI-augmented knowledge work and the practical development of more capable, responsible, and accessible research technologies. The paper resources can be viewed at https://github.com/scienceaix/deepresearch.

  • 2 authors
·
Jun 14, 2025

UI-CUBE: Enterprise-Grade Computer Use Agent Benchmarking Beyond Task Accuracy to Operational Reliability

While current Computer Use Agent (CUA) benchmarks measure task completion effectively, they provide limited assessment of enterprise deployment readiness, emphasizing functional correctness over the operational reliability required for production systems. We present UI-CUBE (UiPath Computer Use BEnchmark), a systematic benchmark comprising 226 tasks across two difficulty tiers designed to expose fundamental architectural limitations in current CUAs. Our evaluation covers simple UI interactions (136 tasks) and complex workflows including copy-paste tasks (50 tasks) and enterprise application scenarios (40 tasks), with systematic interface variation coverage, multi-resolution testing and automated validation of task success through the application state. Evaluation of five state-of-the-art models reveals a sharp capability cliff rather than gradual performance degradation. Simple UI interactions achieve 67-85% success rates (compared to 97.9% human performance), but complex workflows drop precipitously to 9-19%. Human evaluators with no prior application experience achieve only 61.2% on complex tasks despite near-perfect performance on simple tasks, establishing realistic performance ceilings. This discontinuous performance pattern -- where agents achieve 68-87% of human performance on simple tasks but only 15-32% on complex workflows -- indicates fundamental architectural limitations in memory management, hierarchical planning, and state coordination rather than incremental capability gaps addressable through better training or prompting. UI-CUBE functions as an enterprise-readiness diagnostic, revealing that while current CUAs can manipulate individual interface elements, they cannot yet function as reliable workflow automation tools. These findings provide architectural insights essential for developing production-ready CUAs capable of managing complex, multi-step enterprise processes.

  • 6 authors
·
Nov 21, 2025

STEP-LLM: Generating CAD STEP Models from Natural Language with Large Language Models

Computer-aided design (CAD) is vital to modern manufacturing, yet model creation remains labor-intensive and expertise-heavy. To enable non-experts to translate intuitive design intent into manufacturable artifacts, recent large language models-based text-to-CAD efforts focus on command sequences or script-based formats like CadQuery. However, these formats are kernel-dependent and lack universality for manufacturing. In contrast, the Standard for the Exchange of Product Data (STEP, ISO 10303) file is a widely adopted, neutral boundary representation (B-rep) format directly compatible with manufacturing, but its graph-structured, cross-referenced nature poses unique challenges for auto-regressive LLMs. To address this, we curate a dataset of ~40K STEP-caption pairs and introduce novel preprocessing tailored for the graph-structured format of STEP, including a depth-first search-based reserialization that linearizes cross-references while preserving locality and chain-of-thought(CoT)-style structural annotations that guide global coherence. We integrate retrieval-augmented generation to ground predictions in relevant examples for supervised fine-tuning, and refine generation quality through reinforcement learning with a specific Chamfer Distance-based geometric reward. Experiments demonstrate consistent gains of our STEP-LLM in geometric fidelity over the Text2CAD baseline, with improvements arising from multiple stages of our framework: the RAG module substantially enhances completeness and renderability, the DFS-based reserialization strengthens overall accuracy, and the RL further reduces geometric discrepancy. Both metrics and visual comparisons confirm that STEP-LLM generates shapes with higher fidelity than Text2CAD. These results show the feasibility of LLM-driven STEP model generation from natural language, showing its potential to democratize CAD design for manufacturing.

  • 11 authors
·
Jan 18

Bi-Mem: Bidirectional Construction of Hierarchical Memory for Personalized LLMs via Inductive-Reflective Agents

Constructing memory from users' long-term conversations overcomes LLMs' contextual limitations and enables personalized interactions. Recent studies focus on hierarchical memory to model users' multi-granular behavioral patterns via clustering and aggregating historical conversations. However, conversational noise and memory hallucinations can be amplified during clustering, causing locally aggregated memories to misalign with the user's global persona. To mitigate this issue, we propose Bi-Mem, an agentic framework ensuring hierarchical memory fidelity through bidirectional construction. Specifically, we deploy an inductive agent to form the hierarchical memory: it extracts factual information from raw conversations to form fact-level memory, aggregates them into thematic scenes (i.e., local scene-level memory) using graph clustering, and infers users' profiles as global persona-level memory. Simultaneously, a reflective agent is designed to calibrate local scene-level memories using global constraints derived from the persona-level memory, thereby enforcing global-local alignment. For coherent memory recall, we propose an associative retrieval mechanism: beyond initial hierarchical search, a spreading activation process allows facts to evoke contextual scenes, while scene-level matches retrieve salient supporting factual information. Empirical evaluations demonstrate that Bi-Mem achieves significant improvements in question answering performance on long-term personalized conversational tasks.

  • 7 authors
·
Jan 10

From CISC to RISC: language-model guided assembly transpilation

The transition from x86 to ARM architecture is becoming increasingly common across various domains, primarily driven by ARM's energy efficiency and improved performance across traditional sectors. However, this ISA shift poses significant challenges, mainly due to the extensive legacy ecosystem of x86 software and lack of portability across proprietary ecosystems and software stacks. This paper introduces CRT, a lightweight LLM-based transpiler that automatically converts x86 assembly to ARM assembly. Our approach bridges the fundamental architectural gap between x86's CISC-based and ARM's RISC-based computing paradigms while preserving program semantics and optimizing performance. We evaluate CRT on diverse real-world applications, achieving 79.25% translation accuracy from x86 to ARMv5 on our comprehensive test suite, and an 88.68% accuracy from x86 to RISC-V. In practical deployments on Apple M2 hardware (ARMv8), our transpiled code achieves 1.73times speedup compared to Apple's Rosetta 2 virtualization engine, while delivering 2.41times memory efficiency and 1.47times better energy consumption. Through testing and analysis, we show that CRT successfully navigates the CISC/RISC divide and generates correctly executable RISC code despite machine ``language'' barriers. We release our code, models, training datasets, and benchmarks at: https://ahmedheakl.github.io/asm2asm/.

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in compute-to-memory ratio of modern AI accelerators have created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.

  • 30 authors
·
May 13, 2024

ChiseLLM: Unleashing the Power of Reasoning LLMs for Chisel Agile Hardware Development

The growing demand for Domain-Specific Architecture (DSA) has driven the development of Agile Hardware Development Methodology (AHDM). Hardware Construction Language (HCL) like Chisel offers high-level abstraction features, making it an ideal language for HCL-Based AHDM. While Large Language Models (LLMs) excel in code generation tasks, they still face challenges with Chisel generation, particularly regarding syntax correctness and design variability. Recent reasoning models have significantly enhanced code generation capabilities through test-time scaling techniques. However, we found that reasoning models without domain adaptation cannot bring substantial benefits to Chisel code generation tasks. This paper presents ChiseLLM, a solution comprising data processing and transformation, prompt-guided reasoning trace synthesis, and domain-adapted model training. We constructed high-quality datasets from public RTL code resources and guided the model to adopt structured thinking patterns through prompt enhancement methods. Experiments demonstrate that our ChiseLLM-7B and ChiseLLM-32B models improved syntax correctness by 18.85% and 26.32% respectively over base models, while increasing variability design ability by 47.58% compared to baseline reasoning models. Our datasets and models are publicly available, providing high-performance, cost-effective models for HCL-Based AHDM, and offering an effective baseline for future research. Github repository: https://github.com/observerw/ChiseLLM

  • 6 authors
·
Apr 27, 2025 2

The revenge of BiSeNet: Efficient Multi-Task Image Segmentation

Recent advancements in image segmentation have focused on enhancing the efficiency of the models to meet the demands of real-time applications, especially on edge devices. However, existing research has primarily concentrated on single-task settings, especially on semantic segmentation, leading to redundant efforts and specialized architectures for different tasks. To address this limitation, we propose a novel architecture for efficient multi-task image segmentation, capable of handling various segmentation tasks without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, that leverages the efficiency of two-stream semantic segmentation architectures and it extends them into a mask classification framework. Our approach maintains the efficient spatial and context paths to capture detailed and semantic information, respectively, while leveraging an efficient transformed-based segmentation head that computes the binary masks and class probabilities. By seamlessly supporting multiple tasks, namely semantic and panoptic segmentation, BiSeNetFormer offers a versatile solution for multi-task segmentation. We evaluate our approach on popular datasets, Cityscapes and ADE20K, demonstrating impressive inference speeds while maintaining competitive accuracy compared to state-of-the-art architectures. Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, and multi-task segmentation networks, bridging the gap between model efficiency and task adaptability.

  • 5 authors
·
Apr 15, 2024

Quantized Spike-driven Transformer

Spiking neural networks are emerging as a promising energy-efficient alternative to traditional artificial neural networks due to their spike-driven paradigm. However, recent research in the SNN domain has mainly focused on enhancing accuracy by designing large-scale Transformer structures, which typically rely on substantial computational resources, limiting their deployment on resource-constrained devices. To overcome this challenge, we propose a quantized spike-driven Transformer baseline (QSD-Transformer), which achieves reduced resource demands by utilizing a low bit-width parameter. Regrettably, the QSD-Transformer often suffers from severe performance degradation. In this paper, we first conduct empirical analysis and find that the bimodal distribution of quantized spike-driven self-attention (Q-SDSA) leads to spike information distortion (SID) during quantization, causing significant performance degradation. To mitigate this issue, we take inspiration from mutual information entropy and propose a bi-level optimization strategy to rectify the information distribution in Q-SDSA. Specifically, at the lower level, we introduce an information-enhanced LIF to rectify the information distribution in Q-SDSA. At the upper level, we propose a fine-grained distillation scheme for the QSD-Transformer to align the distribution in Q-SDSA with that in the counterpart ANN. By integrating the bi-level optimization strategy, the QSD-Transformer can attain enhanced energy efficiency without sacrificing its high-performance advantage.For instance, when compared to the prior SNN benchmark on ImageNet, the QSD-Transformer achieves 80.3% top-1 accuracy, accompanied by significant reductions of 6.0times and 8.1times in power consumption and model size, respectively. Code is available at https://github.com/bollossom/QSD-Transformer.

  • 10 authors
·
Jan 23, 2025

SceneHGN: Hierarchical Graph Networks for 3D Indoor Scene Generation with Fine-Grained Geometry

3D indoor scenes are widely used in computer graphics, with applications ranging from interior design to gaming to virtual and augmented reality. They also contain rich information, including room layout, as well as furniture type, geometry, and placement. High-quality 3D indoor scenes are highly demanded while it requires expertise and is time-consuming to design high-quality 3D indoor scenes manually. Existing research only addresses partial problems: some works learn to generate room layout, and other works focus on generating detailed structure and geometry of individual furniture objects. However, these partial steps are related and should be addressed together for optimal synthesis. We propose SCENEHGN, a hierarchical graph network for 3D indoor scenes that takes into account the full hierarchy from the room level to the object level, then finally to the object part level. Therefore for the first time, our method is able to directly generate plausible 3D room content, including furniture objects with fine-grained geometry, and their layout. To address the challenge, we introduce functional regions as intermediate proxies between the room and object levels to make learning more manageable. To ensure plausibility, our graph-based representation incorporates both vertical edges connecting child nodes with parent nodes from different levels, and horizontal edges encoding relationships between nodes at the same level. Extensive experiments demonstrate that our method produces superior generation results, even when comparing results of partial steps with alternative methods that can only achieve these. We also demonstrate that our method is effective for various applications such as part-level room editing, room interpolation, and room generation by arbitrary room boundaries.

  • 6 authors
·
Feb 16, 2023

Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration

Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high-performance and energy-efficiency. As the systems grow in complexity, fine-tuning architectural parameters across multiple sub-systems (e.g., datapath, memory blocks in different hierarchies, interconnects, compiler optimization, etc.) quickly results in a combinatorial explosion of design space. This makes domain-specific customization an extremely challenging task. Prior work explores using reinforcement learning (RL) and other optimization methods to automatically explore the large design space. However, these methods have traditionally relied on single-agent RL/ML formulations. It is unclear how scalable single-agent formulations are as we increase the complexity of the design space (e.g., full stack System-on-Chip design). Therefore, we propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem. The key idea behind using MARL is an observation that parameters across different sub-systems are more or less independent, thus allowing a decentralized role assigned to each agent. We test this hypothesis by designing domain-specific DRAM memory controller for several workload traces. Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines such as Proximal Policy Optimization and Soft Actor-Critic over different target objectives such as low power and latency. To this end, this work opens the pathway for new and promising research in MARL solutions for hardware architecture search.

  • 7 authors
·
Nov 29, 2022

The Architecture Tradeoff and Risk Analysis Framework (ATRAF): A Unified Approach for Evaluating Software Architectures, Reference Architectures, and Architectural Frameworks

Modern software systems are guided by hierarchical architectural concepts -- software architectures, reference architectures, and architectural frameworks -- each operating at a distinct level of abstraction. These artifacts promote reuse, scalability, and consistency, but also embed tradeoffs that shape critical quality attributes such as modifiability, performance, and security. Existing evaluation methods, such as the Architecture Tradeoff Analysis Method (ATAM), focus on system-specific architectures and are not designed to address the broader generality and variability of higher-level architectural forms. To close this gap, we introduce the Architecture Tradeoff and Risk Analysis Framework (ATRAF) -- a unified, scenario-driven framework for evaluating tradeoffs and risks across architectural levels. ATRAF encompasses three methods: the Architecture Tradeoff and Risk Analysis Method (ATRAM), extending ATAM with enhanced risk identification for concrete systems; the Reference Architecture Tradeoff and Risk Analysis Method (RATRAM), adapting ATRAM to the evaluation of domain-level reference architectures; and the Architectural Framework Tradeoff and Risk Analysis Method (AFTRAM), supporting the evaluation of architectural frameworks that guide entire system families. All three methods follow an iterative spiral process that enables the identification of sensitivities, tradeoffs, and risks while supporting continuous refinement of architectural artifacts. We demonstrate ATRAF through progressively abstracted examples derived from the Remote Temperature Sensor (RTS) case, originally introduced in the ATAM literature. ATRAF equips architects, reference modelers, and framework designers with a practical, systematic approach for analyzing design alternatives and managing quality attribute tradeoffs early in the lifecycle and across all levels of architectural abstraction.

Dracodes Dracodes
·
May 1, 2025 1

Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template & design semantics and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.

  • 5 authors
·
Apr 6

InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs

The rise of deep neural networks (DNNs) has driven an increased demand for computing power and memory. Modern DNNs exhibit high data volume variation (HDV) across tasks, which poses challenges for FPGA acceleration: conventional accelerators rely on fixed execution patterns (dataflow or sequential) that can lead to pipeline stalls or necessitate frequent off-chip memory accesses. To address these challenges, we introduce the Inter-Task Auto-Reconfigurable Accelerator (InTAR), a novel accelerator design methodology for HDV applications on FPGAs. InTAR combines the high computational efficiency of sequential execution with the reduced off-chip memory overhead of dataflow execution. It switches execution patterns automatically with a static schedule determined before circuit design based on resource constraints and problem sizes. Unlike previous reconfigurable accelerators, InTAR encodes reconfiguration schedules during circuit design, allowing model-specific optimizations that allocate only the necessary logic and interconnects. Thus, InTAR achieves a high clock frequency with fewer resources and low reconfiguration time. Furthermore, InTAR supports high-level tools such as HLS for fast design generation. We implement a set of multi-task HDV DNN kernels using InTAR. Compared with dataflow and sequential accelerators, InTAR exhibits 1.8times and 7.1 times speedups correspondingly. Moreover, we extend InTAR to GPT-2 medium as a more complex example, which is 3.65 sim 39.14times faster and a 1.72 sim 10.44times more DSP efficient than SoTA accelerators (Allo and DFX) on FPGAs. Additionally, this design demonstrates 1.66 sim 7.17times better power efficiency than GPUs. Code: https://github.com/OswaldHe/InTAR

  • 4 authors
·
Feb 12, 2025

LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones

Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture-wise and component-wise is mandatory to excel in the speedaccuracy trade-off. Most publications focus on maximizing accuracy and utilize MACs (multiply accumulate operations) as an efficiency metric. The latter however often do not measure accurately how fast a model actually is due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions taken from that analysis to create a recipe for increasing hardware-efficiency in macro design. Additionally we introduce a simple slimmed-down version of MultiHead Self-Attention, that aligns with our analysis. We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state-of-the-art efficient backbones. In order to prove the generalizability of our hardware-efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks object detection and semantic segmentation profit from our hardware-efficient architecture. Code and models are available at https://github.com/ altair199797/LowFormer.

  • 3 authors
·
Sep 5, 2024

BRIDGES: Bridging Graph Modality and Large Language Models within EDA Tasks

While many EDA tasks already involve graph-based data, existing LLMs in EDA primarily either represent graphs as sequential text, or simply ignore graph-structured data that might be beneficial like dataflow graphs of RTL code. Recent studies have found that LLM performance suffers when graphs are represented as sequential text, and using additional graph information significantly boosts performance. To address these challenges, we introduce BRIDGES, a framework designed to incorporate graph modality into LLMs for EDA tasks. BRIDGES integrates an automated data generation workflow, a solution that combines graph modality with LLM, and a comprehensive evaluation suite. First, we establish an LLM-driven workflow to generate RTL and netlist-level data, converting them into dataflow and netlist graphs with function descriptions. This workflow yields a large-scale dataset comprising over 500,000 graph instances and more than 1.5 billion tokens. Second, we propose a lightweight cross-modal projector that encodes graph representations into text-compatible prompts, enabling LLMs to effectively utilize graph data without architectural modifications. Experimental results demonstrate 2x to 10x improvements across multiple tasks compared to text-only baselines, including accuracy in design retrieval, type prediction and perplexity in function description, with negligible computational overhead (<1% model weights increase and <30% additional runtime overhead). Even without additional LLM finetuning, our results outperform text-only by a large margin. We plan to release BRIDGES, including the dataset, models, and training flow.

  • 6 authors
·
Apr 7, 2025

iHAS: Instance-wise Hierarchical Architecture Search for Deep Learning Recommendation Models

Current recommender systems employ large-sized embedding tables with uniform dimensions for all features, leading to overfitting, high computational cost, and suboptimal generalizing performance. Many techniques aim to solve this issue by feature selection or embedding dimension search. However, these techniques typically select a fixed subset of features or embedding dimensions for all instances and feed all instances into one recommender model without considering heterogeneity between items or users. This paper proposes a novel instance-wise Hierarchical Architecture Search framework, iHAS, which automates neural architecture search at the instance level. Specifically, iHAS incorporates three stages: searching, clustering, and retraining. The searching stage identifies optimal instance-wise embedding dimensions across different field features via carefully designed Bernoulli gates with stochastic selection and regularizers. After obtaining these dimensions, the clustering stage divides samples into distinct groups via a deterministic selection approach of Bernoulli gates. The retraining stage then constructs different recommender models, each one designed with optimal dimensions for the corresponding group. We conduct extensive experiments to evaluate the proposed iHAS on two public benchmark datasets from a real-world recommender system. The experimental results demonstrate the effectiveness of iHAS and its outstanding transferability to widely-used deep recommendation models.

  • 5 authors
·
Sep 14, 2023

LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration

Text-to-image generation (T2I) has become a key area of research with broad applications. However, existing methods often struggle with complex spatial relationships and fine-grained control over multiple concepts. Many existing approaches require significant architectural modifications, extensive training, or expert-level prompt engineering. To address these challenges, we introduce LayerCraft, an automated framework that leverages large language models (LLMs) as autonomous agents for structured procedural generation. LayerCraft enables users to customize objects within an image and supports narrative-driven creation with minimal effort. At its core, the system includes a coordinator agent that directs the process, along with two specialized agents: ChainArchitect, which employs chain-of-thought (CoT) reasoning to generate a dependency-aware 3D layout for precise instance-level control, and the Object-Integration Network (OIN), which utilizes LoRA fine-tuning on pre-trained T2I models to seamlessly blend objects into specified regions of an image based on textual prompts without requiring architectural changes. Extensive evaluations demonstrate LayerCraft's versatility in applications ranging from multi-concept customization to storytelling. By providing non-experts with intuitive, precise control over T2I generation, our framework democratizes creative image creation. Our code will be released upon acceptance at github.com/PeterYYZhang/LayerCraft

  • 3 authors
·
Mar 25, 2025

Working Paper: Towards a Category-theoretic Comparative Framework for Artificial General Intelligence

AGI has become the Holly Grail of AI with the promise of level intelligence and the major Tech companies around the world are investing unprecedented amounts of resources in its pursuit. Yet, there does not exist a single formal definition and only some empirical AGI benchmarking frameworks currently exist. The main purpose of this paper is to develop a general, algebraic and category theoretic framework for describing, comparing and analysing different possible AGI architectures. Thus, this Category theoretic formalization would also allow to compare different possible candidate AGI architectures, such as, RL, Universal AI, Active Inference, CRL, Schema based Learning, etc. It will allow to unambiguously expose their commonalities and differences, and what is even more important, expose areas for future research. From the applied Category theoretic point of view, we take as inspiration Machines in a Category to provide a modern view of AGI Architectures in a Category. More specifically, this first position paper provides, on one hand, a first exercise on RL, Causal RL and SBL Architectures in a Category, and on the other hand, it is a first step on a broader research program that seeks to provide a unified formal foundation for AGI systems, integrating architectural structure, informational organization, agent realization, agent and environment interaction, behavioural development over time, and the empirical evaluation of properties. This framework is also intended to support the definition of architectural properties, both syntactic and informational, as well as semantic properties of agents and their assessment in environments with explicitly characterized features. We claim that Category Theory and AGI will have a very symbiotic relation.

  • 3 authors
·
Apr 7

PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at https://github.com/ChenhongyiYang/PlainMamba

  • 7 authors
·
Mar 26, 2024

COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design

Graphic design, which has been evolving since the 15th century, plays a crucial role in advertising. The creation of high-quality designs demands design-oriented planning, reasoning, and layer-wise generation. Unlike the recent CanvaGPT, which integrates GPT-4 with existing design templates to build a custom GPT, this paper introduces the COLE system - a hierarchical generation framework designed to comprehensively address these challenges. This COLE system can transform a vague intention prompt into a high-quality multi-layered graphic design, while also supporting flexible editing based on user input. Examples of such input might include directives like ``design a poster for Hisaishi's concert.'' The key insight is to dissect the complex task of text-to-design generation into a hierarchy of simpler sub-tasks, each addressed by specialized models working collaboratively. The results from these models are then consolidated to produce a cohesive final output. Our hierarchical task decomposition can streamline the complex process and significantly enhance generation reliability. Our COLE system comprises multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for design-aware layer-wise captioning, layout planning, reasoning, and the task of generating images and text. Furthermore, we construct the DESIGNINTENTION benchmark to demonstrate the superiority of our COLE system over existing methods in generating high-quality graphic designs from user intent. Last, we present a Canva-like multi-layered image editing tool to support flexible editing of the generated multi-layered graphic design images. We perceive our COLE system as an important step towards addressing more complex and multi-layered graphic design generation tasks in the future.

  • 13 authors
·
Nov 28, 2023

The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward

The most important architectural problem in AI is not the size of the model but the absence of a layer that carries forward what the model has come to understand. Sessions end. Context windows fill. Memory APIs return flat facts that the model has to reinterpret from scratch on every read. The result is intelligence that is powerful per session and amnesiac across time. This position paper argues that the layer which fixes this, the continuity layer, is the most consequential piece of infrastructure the field has not yet built, and that the engineering work to build it has begun in public. The formal evaluation framework for the property described here is the ATANT benchmark (arXiv:2604.06710), published separately with evaluation results on a 250-story corpus; a companion paper (arXiv:2604.10981) positions this framework against existing memory, long-context, and agentic-memory benchmarks. The paper defines continuity as a system property with seven required characteristics, distinct from memory and from retrieval; describes a storage primitive (Decomposed Trace Convergence Memory) whose write-time decomposition and read-time reconstruction produce that property; maps the engineering architecture to the theological pattern of kenosis and the symbolic pattern of Alpha and Omega, and argues this mapping is structural rather than metaphorical; proposes a four-layer development arc from external SDK to hardware node to long-horizon human infrastructure; examines why the physics limits now constraining the model layer make the continuity layer newly consequential; and argues that the governance architecture (privacy implemented as physics rather than policy, founder-controlled class shares on non-negotiable architectural commitments) is inseparable from the product itself.

Kenotic-Labs Kenotic Labs
·
Apr 18 2

AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias

Fairness is an increasingly important concern as machine learning models are used to support decision making in high-stakes applications such as mortgage lending, hiring, and prison sentencing. This paper introduces a new open source Python toolkit for algorithmic fairness, AI Fairness 360 (AIF360), released under an Apache v2.0 license {https://github.com/ibm/aif360). The main objectives of this toolkit are to help facilitate the transition of fairness research algorithms to use in an industrial setting and to provide a common framework for fairness researchers to share and evaluate algorithms. The package includes a comprehensive set of fairness metrics for datasets and models, explanations for these metrics, and algorithms to mitigate bias in datasets and models. It also includes an interactive Web experience (https://aif360.mybluemix.net) that provides a gentle introduction to the concepts and capabilities for line-of-business users, as well as extensive documentation, usage guidance, and industry-specific tutorials to enable data scientists and practitioners to incorporate the most appropriate tool for their problem into their work products. The architecture of the package has been engineered to conform to a standard paradigm used in data science, thereby further improving usability for practitioners. Such architectural design and abstractions enable researchers and developers to extend the toolkit with their new algorithms and improvements, and to use it for performance benchmarking. A built-in testing infrastructure maintains code quality.

  • 18 authors
·
Oct 3, 2018

Bi-Mamba: Towards Accurate 1-Bit State Space Models

The typical Selective State-Space Model (SSM) used in Mamba addresses several limitations of Transformers, such as the quadratic computational complexity with respect to sequence length and the significant memory requirements during inference due to the key-value (KV) cache. However, the increasing size of Mamba models continues to pose challenges for training and deployment, particularly due to their substantial computational demands during both training and inference. In this work, we introduce Bi-Mamba, a scalable and powerful 1-bit Mamba architecture designed to enable more efficient large language models (LLMs), with model sizes of 780M, 1.3B, and 2.7B parameters. Bi-Mamba models are trained from scratch on a standard LLM-scale dataset using an autoregressive distillation loss. Extensive experiments on language modeling benchmarks demonstrate that Bi-Mamba achieves performance comparable to its full-precision (FP16 or BF16) counterparts, while outperforming post-training binarization (PTB) Mamba and binarization-aware training (BAT) Transformer baselines. Moreover, Bi-Mamba drastically reduces memory usage and computational cost compared to the original Mamba. Our work pioneers a new line of linear-complexity LLMs under low-bit representation and provides the way for the design of specialized hardware optimized for efficient 1-bit Mamba-based models. Code and the pre-trained weights are available at https://github.com/Tangshengku/Bi-Mamba.

  • 5 authors
·
Nov 18, 2024

Bridging the Initialization Gap: A Co-Optimization Framework for Mixed-Size Global Placement

Global placement is a critical step with high computational complexity in VLSI physical design. Modern analytical placers formulate the placement problem as a nonlinear optimization, where initialization strongly affects both convergence behavior and final placement quality. However, existing initialization methods exhibit a trade-off: area-aware initializers account for cell areas but are computationally expensive and can dominate total runtime, while fast point-based initializers ignore cell area, leading to a modeling gap that impairs convergence and solution quality. We propose a lightweight co-optimization framework that bridges this initialization gap through two strategies. First, an area-hint refinement initializer incorporates heuristic cell area information into a signed graph signal by augmenting the netlist graph with virtual nodes and negative-weight edges, yielding an area-aware and spectrally smooth placement initialization. Second, a macro-schedule placement procedure progressively restores area constraints, enabling a smooth transition from the refined initializer to the full area-aware objective and producing high-quality placement results. We evaluate the framework on macro-heavy ISPD2005 academic benchmarks and two real-world industrial designs across two technology nodes (12 cases in total). Experimental results show that our method consistently improves half-perimeter wirelength (HPWL) over point-based initializers in 11 out of 12 cases, achieving up to 2.2% HPWL reduction, while running approximately 100 times faster than the state-of-the-art area-aware initializer.

  • 7 authors
·
Nov 12, 2025

Closing the Performance Gap with Modern C++

On the way to Exascale, programmers face the increasing challenge of having to support multiple hardware architectures from the same code base. At the same time, portability of code and performance are increasingly difficult to achieve as hardware architectures are becoming more and more diverse. Today's heterogeneous systems often include two or more completely distinct and incompatible hardware execution models, such as GPGPU's, SIMD vector units, and general purpose cores which conventionally have to be programmed using separate tool chains representing non-overlapping programming models. The recent revival of interest in the industry and the wider community for the C++ language has spurred a remarkable amount of standardization proposals and technical specifications in the arena of concurrency and parallelism. This recently includes an increasing amount of discussion around the need for a uniform, higher-level abstraction and programming model for parallelism in the C++ standard targeting heterogeneous and distributed computing. Such an abstraction should perfectly blend with existing, already standardized language and library features, but should also be generic enough to support future hardware developments. In this paper, we present the results from developing such a higher-level programming abstraction for parallelism in C++ which aims at enabling code and performance portability over a wide range of architectures and for various types of parallelism. We present and compare performance data obtained from running the well-known STREAM benchmark ported to our higher level C++ abstraction with the corresponding results from running it natively. We show that our abstractions enable performance at least as good as the comparable base-line benchmarks while providing a uniform programming API on all compared target architectures.

  • 5 authors
·
May 30, 2022

ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages

Building multi-modal language models has been a trend in the recent years, where additional modalities such as image, video, speech, etc. are jointly learned along with natural languages (i.e., textual information). Despite the success of these multi-modal language models with different modalities, there is no existing solution for neural network architectures and natural languages. Providing neural architectural information as a new modality allows us to provide fast architecture-2-text and text-2-architecture retrieval/generation services on the cloud with a single inference. Such solution is valuable in terms of helping beginner and intermediate ML users to come up with better neural architectures or AutoML approaches with a simple text query. In this paper, we propose ArchBERT, a bi-modal model for joint learning and understanding of neural architectures and natural languages, which opens up new avenues for research in this area. We also introduce a pre-training strategy named Masked Architecture Modeling (MAM) for a more generalized joint learning. Moreover, we introduce and publicly release two new bi-modal datasets for training and validating our methods. The ArchBERT's performance is verified through a set of numerical experiments on different downstream tasks such as architecture-oriented reasoning, question answering, and captioning (summarization). Datasets, codes, and demos are available supplementary materials.

Autonomous Business System via Neuro-symbolic AI

Modern business environments demand continuous reconfiguration of cross-functional processes, yet most enterprise systems remain organized around siloed departments, rigid workflows, and hard-coded automation. Meanwhile, large language models (LLMs) demonstrate strong capabilities in interpreting natural language and synthesizing unstructured information, but they lack deterministic, auditable execution of complex business logic. We introduce Autonomous Business System (AUTOBUS), a system that integrates LLM-based AI agents, predicate-logic programming, and business-semantics-centric enterprise data into a unified neuro-symbolic architecture for executing end-to-end business initiatives. AUTOBUS models a business initiative as a network of interrelated tasks with explicit pre- and post-conditions, required data, evaluation rules, and API-level actions. Enterprise data is organized as a knowledge graph, whose entities, relationships, and constraints are translated into logic facts and foundational rules that ground reasoning and ensure semantic consistency. Core AI agents synthesize task instructions, enterprise semantics, and available tools into task-specific logic programs, which are executed by a logic engine that enforces constraints, coordinates auxiliary tools, and produces deterministic outcomes. Humans specify task instructions, define and maintain business semantics and policies, curate tools, and supervise high-impact or ambiguous decisions, ensuring accountability and adaptability. We detail the AUTOBUS architecture, the structure of AI-generated logic programs, and the human-AI collaboration model and present a case study that demonstrates accelerated time to market in a data-rich organization. A reference implementation of the case study is available at https://github.com/cecilpang/autobus-paper.

  • 2 authors
·
Jan 21

The whole brain architecture approach: Accelerating the development of artificial general intelligence by referring to the brain

The vastness of the design space created by the combination of a large number of computational mechanisms, including machine learning, is an obstacle to creating an artificial general intelligence (AGI). Brain-inspired AGI development, in other words, cutting down the design space to look more like a biological brain, which is an existing model of a general intelligence, is a promising plan for solving this problem. However, it is difficult for an individual to design a software program that corresponds to the entire brain because the neuroscientific data required to understand the architecture of the brain are extensive and complicated. The whole-brain architecture approach divides the brain-inspired AGI development process into the task of designing the brain reference architecture (BRA) -- the flow of information and the diagram of corresponding components -- and the task of developing each component using the BRA. This is called BRA-driven development. Another difficulty lies in the extraction of the operating principles necessary for reproducing the cognitive-behavioral function of the brain from neuroscience data. Therefore, this study proposes the Structure-constrained Interface Decomposition (SCID) method, which is a hypothesis-building method for creating a hypothetical component diagram consistent with neuroscientific findings. The application of this approach has begun for building various regions of the brain. Moving forward, we will examine methods of evaluating the biological plausibility of brain-inspired software. This evaluation will also be used to prioritize different computational mechanisms, which should be merged, associated with the same regions of the brain.

  • 1 authors
·
Mar 5, 2021

Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.

  • 5 authors
·
Dec 30, 2025

K-Dense Analyst: Towards Fully Automated Scientific Analysis

The complexity of modern bioinformatics analysis has created a critical gap between data generation and developing scientific insights. While large language models (LLMs) have shown promise in scientific reasoning, they remain fundamentally limited when dealing with real-world analytical workflows that demand iterative computation, tool integration and rigorous validation. We introduce K-Dense Analyst, a hierarchical multi-agent system that achieves autonomous bioinformatics analysis through a dual-loop architecture. K-Dense Analyst, part of the broader K-Dense platform, couples planning with validated execution using specialized agents to decompose complex objectives into executable, verifiable tasks within secure computational environments. On BixBench, a comprehensive benchmark for open-ended biological analysis, K-Dense Analyst achieves 29.2% accuracy, surpassing the best-performing language model (GPT-5) by 6.3 percentage points, representing nearly 27% improvement over what is widely considered the most powerful LLM available. Remarkably, K-Dense Analyst achieves this performance using Gemini 2.5 Pro, which attains only 18.3% accuracy when used directly, demonstrating that our architectural innovations unlock capabilities far beyond the underlying model's baseline performance. Our insights demonstrate that autonomous scientific reasoning requires more than enhanced language models, it demands purpose-built systems that can bridge the gap between high-level scientific objectives and low-level computational execution. These results represent a significant advance toward fully autonomous computational biologists capable of accelerating discovery across the life sciences.

  • 5 authors
·
Aug 9, 2025

WaveMix: A Resource-efficient Neural Network for Image Analysis

We propose WaveMix -- a novel neural architecture for computer vision that is resource-efficient yet generalizable and scalable. WaveMix networks achieve comparable or better accuracy than the state-of-the-art convolutional neural networks, vision transformers, and token mixers for several tasks, establishing new benchmarks for segmentation on Cityscapes; and for classification on Places-365, five EMNIST datasets, and iNAT-mini. Remarkably, WaveMix architectures require fewer parameters to achieve these benchmarks compared to the previous state-of-the-art. Moreover, when controlled for the number of parameters, WaveMix requires lesser GPU RAM, which translates to savings in time, cost, and energy. To achieve these gains we used multi-level two-dimensional discrete wavelet transform (2D-DWT) in WaveMix blocks, which has the following advantages: (1) It reorganizes spatial information based on three strong image priors -- scale-invariance, shift-invariance, and sparseness of edges, (2) in a lossless manner without adding parameters, (3) while also reducing the spatial sizes of feature maps, which reduces the memory and time required for forward and backward passes, and (4) expanding the receptive field faster than convolutions do. The whole architecture is a stack of self-similar and resolution-preserving WaveMix blocks, which allows architectural flexibility for various tasks and levels of resource availability. Our code and trained models are publicly available.

  • 4 authors
·
May 28, 2022

AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation

Neural network architecture design requires making many crucial decisions. The common desiderata is that similar decisions, with little modifications, can be reused in a variety of tasks and applications. To satisfy that, architectures must provide promising latency and performance trade-offs, support a variety of tasks, scale efficiently with respect to the amounts of data and compute, leverage available data from other tasks, and efficiently support various hardware. To this end, we introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks. We revisit the key design principles of hybrid architectures and propose a simple and effective asymmetric architecture, where the distribution of convolutional and transformer blocks is asymmetric, containing more convolutional blocks in the earlier stages, followed by more transformer blocks in later stages. AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation, and features a superior trade-off between performance and latency. We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance compared to the most recent public and commercial models. Notably, even without any computation optimization for transformer blocks, our models still yield faster inference speed than existing works featuring efficient attention mechanisms, highlighting the advantages and the value of our approach.

  • 8 authors
·
Nov 7, 2024

Configurable Foundation Models: Building LLMs from a Modular Perspective

Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with part of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitation of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.

openbmb OpenBMB
·
Sep 4, 2024 2

Aeon: High-Performance Neuro-Symbolic Memory Management for Long-Horizon LLM Agents

Large Language Models (LLMs) are fundamentally constrained by the quadratic computational cost of self-attention and the "Lost in the Middle" phenomenon, where reasoning capabilities degrade as context windows expand. Existing solutions, primarily "Flat RAG" architectures relying on vector databases, treat memory as an unstructured bag of embeddings, failing to capture the hierarchical and temporal structure of long-horizon interactions. This paper presents Aeon, a Neuro-Symbolic Cognitive Operating System that redefines memory as a managed OS resource. Aeon structures memory into a Memory Palace (a spatial index implemented via Atlas, a SIMD-accelerated Page-Clustered Vector Index) and a Trace (a neuro-symbolic episodic graph). This architecture introduces three advances: (1) Symmetric INT8 Scalar Quantization, achieving 3.1x spatial compression and 5.6x math acceleration via NEON SDOT intrinsics; (2) a decoupled Write-Ahead Log (WAL) ensuring crash-recoverability with statistically negligible overhead (<1%); and (3) a Sidecar Blob Arena eliminating the prior 440-character text ceiling via an append-only mmap-backed blob file with generational garbage collection. The Semantic Lookaside Buffer (SLB) exploits conversational locality to achieve sub-5us retrieval latencies, with INT8 vectors dequantized to FP32 on cache insertion to preserve L1-resident lookup performance. Benchmarks on Apple M4 Max demonstrate that the combined architecture achieves 4.70ns INT8 dot product latency, 3.09us tree traversal at 100K nodes (3.4x over FP32), and P99 read latency of 750ns under hostile 16-thread contention via epoch-based reclamation.

  • 1 authors
·
Jan 14

ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans

We introduce ResPlan, a large-scale dataset of 17,000 detailed, structurally rich, and realistic residential floor plans, created to advance spatial AI research. Each plan includes precise annotations of architectural elements (walls, doors, windows, balconies) and functional spaces (such as kitchens, bedrooms, and bathrooms). ResPlan addresses key limitations of existing datasets such as RPLAN (Wu et al., 2019) and MSD (van Engelenburg et al., 2024) by offering enhanced visual fidelity and greater structural diversity, reflecting realistic and non-idealized residential layouts. Designed as a versatile, general-purpose resource, ResPlan supports a wide range of applications including robotics, reinforcement learning, generative AI, virtual and augmented reality, simulations, and game development. Plans are provided in both geometric and graph-based formats, enabling direct integration into simulation engines and fast 3D conversion. A key contribution is an open-source pipeline for geometry cleaning, alignment, and annotation refinement. Additionally, ResPlan includes structured representations of room connectivity, supporting graph-based spatial reasoning tasks. Finally, we present comparative analyses with existing benchmarks and outline several open benchmark tasks enabled by ResPlan. Ultimately, ResPlan offers a significant advance in scale, realism, and usability, providing a robust foundation for developing and benchmarking next-generation spatial intelligence systems.

  • 2 authors
·
Aug 19, 2025

AI Co-Artist: A LLM-Powered Framework for Interactive GLSL Shader Animation Evolution

Creative coding and real-time shader programming are at the forefront of interactive digital art, enabling artists, designers, and enthusiasts to produce mesmerizing, complex visual effects that respond to real-time stimuli such as sound or user interaction. However, despite the rich potential of tools like GLSL, the steep learning curve and requirement for programming fluency pose substantial barriers for newcomers and even experienced artists who may not have a technical background. In this paper, we present AI Co-Artist, a novel interactive system that harnesses the capabilities of large language models (LLMs), specifically GPT-4, to support the iterative evolution and refinement of GLSL shaders through a user-friendly, visually-driven interface. Drawing inspiration from the user-guided evolutionary principles pioneered by the Picbreeder platform, our system empowers users to evolve shader art using intuitive interactions, without needing to write or understand code. AI Co-Artist serves as both a creative companion and a technical assistant, allowing users to explore a vast generative design space of real-time visual art. Through comprehensive evaluations, including structured user studies and qualitative feedback, we demonstrate that AI Co-Artist significantly reduces the technical threshold for shader creation, enhances creative outcomes, and supports a wide range of users in producing professional-quality visual effects. Furthermore, we argue that this paradigm is broadly generalizable. By leveraging the dual strengths of LLMs-semantic understanding and program synthesis, our method can be applied to diverse creative domains, including website layout generation, architectural visualizations, product prototyping, and infographics.

  • 2 authors
·
Nov 26, 2025

Mechanistic Design and Scaling of Hybrid Architectures

The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and evaluation. We set out to simplify this process by grounding it in an end-to-end mechanistic architecture design (MAD) pipeline, encompassing small-scale capability unit tests predictive of scaling laws. Through a suite of synthetic token manipulation tasks such as compression and recall, designed to probe capabilities, we identify and test new hybrid architectures constructed from a variety of computational primitives. We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis, training over 500 language models between 70M to 7B parameters. Surprisingly, we find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures via isolated proxy tasks. The new architectures found via MAD, based on simple ideas such as hybridization and sparsity, outperform state-of-the-art Transformer, convolutional, and recurrent architectures (Transformer++, Hyena, Mamba) in scaling, both at compute-optimal budgets and in overtrained regimes. Overall, these results provide evidence that performance on curated synthetic tasks can be predictive of scaling laws, and that an optimal architecture should leverage specialized layers via a hybrid topology.

  • 12 authors
·
Aug 18, 2024

Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90% of inference time, efficient GEMM acceleration is critical for edge AI. The Adaptive Intelligent Engines available in the AMD Versal adaptive SoCs are well suited for this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores -- an approach that fails on resource-limited edge SoCs due to physical implementation failures, bandwidth saturation, and excessive resource consumption. We propose Tempus, a Resource-Invariant Temporal GEMM framework for the AMD Versal AI Edge SoC. Rather than expanding hardware resources with matrix size, Tempus employs a fixed compute block of 16 AIE-ML cores, achieving scalability through iterative graph execution and algorithmic data tiling and replication in the Programmable Logic. High-speed cascade streaming ensures low-latency partial sum reduction at Initiation Interval (II) of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse. Evaluated on GEMM workloads, Tempus achieves 607 GOPS at 10.677 W total on-chip power. By characterizing system-level efficiency through the Platform-Aware Utility (PAU) metric, we prove that Tempus achieves a 211.2x higher prominence factor than the leading spatial SOTA (ARIES). Furthermore, the framework maintains a 0.00% utilization of URAM/DSP, yielding 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand, establishing a sustainable, scalable foundation for edge LLM inference.