SDXL extends Stable Diffusion with a larger U-Net backbone, multi-scale generation, and flexible text conditioning, enabling high-resolution, semantically rich image synthesis across diverse prompts and resolutions.
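As a concrete usage sketch (not part of the SDXL paper itself), here is minimal text-to-image generation with the Hugging Face diffusers pipeline; the model ID assumes the public SDXL base checkpoint, and the prompt and output path are placeholders:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the public SDXL base checkpoint (assumed model ID) in half precision.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# SDXL targets high-resolution synthesis; 1024x1024 is its native resolution.
image = pipe("a watercolor fox in a snowy forest", height=1024, width=1024).images[0]
image.save("fox.png")
```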
UMT is a unified framework for video highlight detection and moment retrieval that flexibly integrates visual, audio, and optional text modalities to identify key moments in both query-based and query-free scenarios.
SAM2 extends promptable visual segmentation to video, combining a streaming spatio-temporal memory with interactive prompting, and relies on a data engine to build its training data, enabling fine-grained, efficient, and class-agnostic object segmentation across frames.
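A rough sketch of the interactive video workflow, with API names (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video) and file paths assumed from the public sam2 repository's README rather than verified here:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint paths are placeholders; see the sam2 repo for the real files.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state("video_frames/")  # sets up per-video memory state

    # Prompt object 1 with a single positive click on frame 0.
    predictor.add_new_points_or_box(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[300, 200]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompted mask through the video via the memory mechanism.
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
        pass  # masks holds per-object segmentation logits for this frame
```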
The SlowFast network employs dual pathways: a Slow pathway that operates at a low frame rate to capture spatial semantics, and a lightweight Fast pathway that operates at a high frame rate to capture fine-grained motion, fused via lateral connections for video recognition.
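To make the two-pathway idea concrete, here is an illustrative PyTorch skeleton, not the paper's architecture: the Slow branch sees temporally strided frames with many channels, the Fast branch sees every frame with few channels, and their pooled features are fused:

```python
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    def __init__(self, alpha=4, slow_ch=64, fast_ch=8, num_classes=400):
        super().__init__()
        self.alpha = alpha  # the Fast path samples alpha x more frames than the Slow path
        self.slow = nn.Conv3d(3, slow_ch, kernel_size=(1, 7, 7),
                              stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, fast_ch, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.head = nn.Linear(slow_ch + fast_ch, num_classes)

    def forward(self, clip):  # clip: (B, 3, T, H, W)
        slow_in = clip[:, :, :: self.alpha]          # temporally strided input for the Slow path
        s = self.slow(slow_in).mean(dim=(2, 3, 4))   # global-average-pooled Slow features
        f = self.fast(clip).mean(dim=(2, 3, 4))      # Fast path sees all frames, fewer channels
        return self.head(torch.cat([s, f], dim=1))   # late fusion of both pathways

logits = TinySlowFast()(torch.randn(2, 3, 32, 64, 64))  # (2, 400)
```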
DETR reframes object detection as direct set prediction: a Transformer encoder-decoder with global attention operates on CNN-extracted image features, and bipartite (Hungarian) matching assigns predictions to ground-truth objects, removing hand-designed components such as anchor generation and non-maximum suppression.
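The matching step can be sketched with SciPy's Hungarian solver; the cost below is plain L1 box distance for brevity, whereas DETR's actual matching cost also includes classification and generalized-IoU terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

pred_boxes = np.random.rand(100, 4)   # N predicted boxes (cx, cy, w, h)
gt_boxes = np.random.rand(5, 4)       # M ground-truth boxes

# (N, M) pairwise L1 cost between every prediction and every ground-truth box.
cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)

# Optimal one-to-one assignment; unmatched predictions are supervised
# toward the "no object" class in DETR's set-prediction loss.
pred_idx, gt_idx = linear_sum_assignment(cost)
```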
Variational Autoencoders (VAEs) employ a probabilistic approach to latent variable modeling, optimizing a variational lower bound to perform efficient approximate posterior inference and learning of generative models with continuous latent variables.
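A minimal PyTorch sketch of the objective, assuming a Gaussian posterior and a standard-normal prior (function names are illustrative): the negative ELBO, plus the reparameterization trick that keeps sampling differentiable:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)] up to constants, for a Gaussian likelihood.
    recon = F.mse_loss(recon_x, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal-Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # negative ELBO, minimized during training

def reparameterize(mu, logvar):
    # z = mu + sigma * eps: sampling stays differentiable w.r.t. encoder outputs.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps
```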
The Transformer is a deep learning architecture that enhances natural language processing by using self-attention mechanisms to capture long-range dependencies and contextual relationships in text.
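The core operation is scaled dot-product attention; a minimal, single-head sketch:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V: every token attends to every other token,
    # so long-range dependencies are captured within a single layer.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # pairwise token affinities
    return F.softmax(scores, dim=-1) @ v            # attention-weighted sum of values

q = k = v = torch.randn(1, 10, 64)  # (batch, tokens, dim); self-attention uses one source
out = attention(q, k, v)            # (1, 10, 64)
```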
BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model that improves natural language understanding by pre-training on large unlabeled text corpora with masked language modeling, capturing context from both the left and the right of each token.
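A small demonstration of the masked-language-modeling objective via the Hugging Face transformers library, using the standard bert-base-uncased checkpoint:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Mask one token; BERT predicts it from context on both sides.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(-1)
print(tokenizer.decode(predicted_id))  # expected: "paris"
```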
A.pl offers a modular SDK that enables blockchain-based autonomous agents to securely generate interaction data, addressing the scarcity of such data in Web2. It uses asynchronous methods to work around blockchain latency and concurrency constraints.
APT-based Pipeline is an end-to-end insurance analysis system that uses watt-tool-8B for function-calling orchestration and Mistral-small-24B for detailed output generation.
ACON is a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs, designed for rigorous evaluation of cross-modal transfer.
InfoCausalQA is a benchmark of 494 infographic-text pairs with 1,482 human-revised multiple-choice questions generated via GPT-4o. It tests quantitative trend reasoning and five semantic causal types (cause, effect, intervention, counterfactual, temporal), and shows that current VLMs fall far below human performance, struggling with genuinely grounded causal inference from infographics.