Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Blog Posts

[Paper Review] Stable Diffusion & SDXL

3 minute read

Published:

SDXL extends Stable Diffusion with a larger U-Net backbone, multi-scale generation, and flexible text conditioning, enabling high-resolution, semantically rich image synthesis across diverse prompts and resolutions.

[Paper Review] UMT (Unified Multimodal Transformers)

3 minute read

Published:

UMT is a unified framework for video highlight detection and moment retrieval that flexibly integrates visual, audio, and optional text modalities to identify key moments in both query-based and query-free scenarios.

[Paper Review] Segment Anything Model 2 (SAM2)

5 minute read

Published:

SAM2 generalizes promptable visual segmentation to video by integrating spatio-temporal memory, interactive prompting, and a data engine for fine-grained, efficient, and class-agnostic object segmentation across frames.

[Paper Review] SlowFast Networks for Video Recognition

4 minute read

Published:

The SlowFast network employs two pathways: a Slow pathway operating at a low frame rate to capture spatial semantics, and a Fast pathway operating at a high frame rate to capture rapid motion, combining the two for strong video recognition.

[Paper Review] End-to-End Object Detection with Transformers (DETR)

5 minute read

Published:

DETR reframes object detection as a direct set prediction problem, combining CNN-extracted image features with the Transformer's global attention and a bipartite matching loss that removes hand-crafted components such as anchors and non-maximum suppression.

[Paper Review] VAE (Variational AutoEncoder)

4 minute read

Published:

Variational Autoencoders (VAEs) employ a probabilistic approach to latent variable modeling, optimizing a variational lower bound to perform efficient approximate posterior inference and learning of generative models with continuous latent variables.

[Paper Review] Transformers

5 minute read

Published:

The Transformer is a deep learning architecture that enhances natural language processing by using self-attention mechanisms to capture long-range dependencies and contextual relationships in text.

[Paper Review] DDIM (Denoising Diffusion Implicit Models)

5 minute read

Published:

DDIM (Denoising Diffusion Implicit Models) accelerates diffusion-based image generation by replacing the stochastic denoising process with a non-Markovian, deterministic sampling procedure that produces high-quality images in far fewer steps.

[Paper Review] BERT

7 minute read

Published:

BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model that improves natural language understanding by pre-training on vast amounts of text to capture context from both directions.

Projects

Agent PLayground (A.PL)

Published:

A.PL offers a modular SDK that enables blockchain-based autonomous agents to securely generate interaction data, addressing Web2 data scarcity. It uses asynchronous methods to mitigate blockchain latency and concurrency issues.

Publications

InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?

Published in preprint, 2025

InfoCausalQA is a benchmark of 494 infographic–text pairs with 1,482 human-revised multiple-choice questions generated via GPT-4o. It tests quantitative trend reasoning and five semantic causal types (cause, effect, intervention, counterfactual, temporal), and shows that current VLMs fall far below human performance, struggling with genuinely grounded causal inference from infographics.

arXiv (abs)