SDXL extends Stable Diffusion with a larger U-Net backbone, multi-scale generation, and flexible text conditioning, enabling high-resolution, semantically rich image synthesis across diverse prompts and resolutions.
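As a concrete usage sketch (not part of the SDXL paper itself), here is minimal text-to-image generation with the Hugging Face diffusers pipeline; the model ID assumes the public SDXL base checkpoint, and the prompt and output path are placeholders:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the public SDXL base checkpoint (assumed model ID) in half precision.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# SDXL targets high-resolution synthesis; 1024x1024 is its native resolution.
image = pipe("a watercolor fox in a snowy forest", height=1024, width=1024).images[0]
image.save("fox.png")
```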
UMT is a unified framework for video highlight detection and moment retrieval that flexibly integrates visual, audio, and optional text modalities to identify key moments in both query-based and query-free scenarios.
SAM2 extends promptable visual segmentation to video, combining a streaming spatio-temporal memory with interactive prompting, and relies on a data engine to build its training data, enabling fine-grained, efficient, and class-agnostic object segmentation across frames.
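A rough sketch of the interactive video workflow, with API names (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video) and file paths assumed from the public sam2 repository's README rather than verified here:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint paths are placeholders; see the sam2 repo for the real files.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state("video_frames/")  # sets up per-video memory state

    # Prompt object 1 with a single positive click on frame 0.
    predictor.add_new_points_or_box(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[300, 200]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompted mask through the video via the memory mechanism.
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
        pass  # masks holds per-object segmentation logits for this frame
```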
The SlowFast network employs dual pathways: a Slow pathway that operates at a low frame rate to capture spatial semantics, and a lightweight Fast pathway that operates at a high frame rate to capture fine-grained motion, fused via lateral connections for video recognition.
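To make the two-pathway idea concrete, here is an illustrative PyTorch skeleton, not the paper's architecture: the Slow branch sees temporally strided frames with many channels, the Fast branch sees every frame with few channels, and their pooled features are fused:

```python
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    def __init__(self, alpha=4, slow_ch=64, fast_ch=8, num_classes=400):
        super().__init__()
        self.alpha = alpha  # the Fast path samples alpha x more frames than the Slow path
        self.slow = nn.Conv3d(3, slow_ch, kernel_size=(1, 7, 7),
                              stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, fast_ch, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.head = nn.Linear(slow_ch + fast_ch, num_classes)

    def forward(self, clip):  # clip: (B, 3, T, H, W)
        slow_in = clip[:, :, :: self.alpha]          # temporally strided input for the Slow path
        s = self.slow(slow_in).mean(dim=(2, 3, 4))   # global-average-pooled Slow features
        f = self.fast(clip).mean(dim=(2, 3, 4))      # Fast path sees all frames, fewer channels
        return self.head(torch.cat([s, f], dim=1))   # late fusion of both pathways

logits = TinySlowFast()(torch.randn(2, 3, 32, 64, 64))  # (2, 400)
```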
DETR reframes object detection as direct set prediction: a Transformer encoder-decoder with global attention operates on CNN-extracted image features, and bipartite (Hungarian) matching assigns predictions to ground-truth objects, removing hand-designed components such as anchor generation and non-maximum suppression.
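The matching step can be sketched with SciPy's Hungarian solver; the cost below is plain L1 box distance for brevity, whereas DETR's actual matching cost also includes classification and generalized-IoU terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

pred_boxes = np.random.rand(100, 4)   # N predicted boxes (cx, cy, w, h)
gt_boxes = np.random.rand(5, 4)       # M ground-truth boxes

# (N, M) pairwise L1 cost between every prediction and every ground-truth box.
cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)

# Optimal one-to-one assignment; unmatched predictions are supervised
# toward the "no object" class in DETR's set-prediction loss.
pred_idx, gt_idx = linear_sum_assignment(cost)
```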
Variational Autoencoders (VAEs) employ a probabilistic approach to latent variable modeling, optimizing a variational lower bound to perform efficient approximate posterior inference and learning of generative models with continuous latent variables.
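A minimal PyTorch sketch of the objective, assuming a Gaussian posterior and a standard-normal prior (function names are illustrative): the negative ELBO, plus the reparameterization trick that keeps sampling differentiable:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)] up to constants, for a Gaussian likelihood.
    recon = F.mse_loss(recon_x, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal-Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # negative ELBO, minimized during training

def reparameterize(mu, logvar):
    # z = mu + sigma * eps: sampling stays differentiable w.r.t. encoder outputs.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps
```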
The Transformer is a deep learning architecture that enhances natural language processing by using self-attention mechanisms to capture long-range dependencies and contextual relationships in text.
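The core operation is scaled dot-product attention; a minimal, single-head sketch:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V: every token attends to every other token,
    # so long-range dependencies are captured within a single layer.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # pairwise token affinities
    return F.softmax(scores, dim=-1) @ v            # attention-weighted sum of values

q = k = v = torch.randn(1, 10, 64)  # (batch, tokens, dim); self-attention uses one source
out = attention(q, k, v)            # (1, 10, 64)
```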
BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model that improves natural language understanding by pre-training on large unlabeled text corpora with masked language modeling, capturing context from both the left and the right of each token.
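A small demonstration of the masked-language-modeling objective via the Hugging Face transformers library, using the standard bert-base-uncased checkpoint:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Mask one token; BERT predicts it from context on both sides.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(-1)
print(tokenizer.decode(predicted_id))  # expected: "paris"
```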
A.pl offers a modular SDK that enables blockchain-based autonomous agents to securely generate interaction data, addressing the scarcity of such data in Web2. It uses asynchronous methods to work around blockchain latency and concurrency constraints.
APT-based Pipeline is an end-to-end insurance analysis system that uses watt-tool-8B for function-calling orchestration and Mistral-small-24B for detailed output generation.
ACON is a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs, designed for rigorous evaluation of cross-modal transfer.
InfoCausalQA is a benchmark of 494 infographic-text pairs with 1,482 human-revised multiple-choice questions generated via GPT-4o. It tests quantitative trend reasoning and five semantic causal types (cause, effect, intervention, counterfactual, temporal), and shows that current VLMs fall far below human performance, struggling with genuinely grounded causal inference from infographics.