Posts by Tags

Attention

[Paper Review] Transformers

5 minute read

Published:

The Transformer is a deep learning architecture that enhances natural language processing by using self-attention mechanisms to capture long-range dependencies and contextual relationships in text.

Autoencoder

[Paper Review] VAE (Variational AutoEncoder)

4 minute read

Published:

Variational Autoencoders (VAEs) employ a probabilistic approach to latent variable modeling, optimizing a variational lower bound to perform efficient approximate posterior inference and learning of generative models with continuous latent variables.

Computer Vision

[Paper Review] Segment Anything Model 2 (SAM2)

5 minute read

Published:

SAM2 generalizes promptable visual segmentation to video by integrating spatio-temporal memory, interactive prompting, and a data engine for fine-grained, efficient, and class-agnostic object segmentation across frames.

Decoder

[Paper Review] Transformers

5 minute read

Published:

The Transformer is a deep learning architecture that enhances natural language processing by using self-attention mechanisms to capture long-range dependencies and contextual relationships in text.

Diffusion

[Paper Review] Stable Diffusion & SDXL

3 minute read

Published:

SDXL extends Stable Diffusion with a larger U-Net backbone, multi-scale generation, and flexible text conditioning, enabling high-resolution, semantically rich image synthesis across diverse prompts and resolutions.

[Paper Review] DDIM (Denoising Diffusion Implicit Models)

5 minute read

Published:

DDIM (Denoising Diffusion Implicit Models) is a generative model that accelerates image generation by replacing the stochastic diffusion denoising process with a deterministic, non-Markovian sampling procedure.

Encoder

[Paper Review] Transformers

5 minute read

Published:

The Transformer is a deep learning architecture that enhances natural language processing by using self-attention mechanisms to capture long-range dependencies and contextual relationships in text.

[Paper Review] BERT

7 minute read

Published:

BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model that improves natural language understanding by pre-training on vast amounts of text to capture context from both directions.

Generative Models

[Paper Review] Stable Diffusion & SDXL

3 minute read

Published:

SDXL extends Stable Diffusion with a larger U-Net backbone, multi-scale generation, and flexible text conditioning, enabling high-resolution, semantically rich image synthesis across diverse prompts and resolutions.

[Paper Review] VAE (Variational AutoEncoder)

4 minute read

Published:

Variational Autoencoders (VAEs) employ a probabilistic approach to latent variable modeling, optimizing a variational lower bound to perform efficient approximate posterior inference and learning of generative models with continuous latent variables.

[Paper Review] DDIM (Denoising Diffusion Implicit Models)

5 minute read

Published:

DDIM (Denoising Diffusion Implicit Models) is a generative model that accelerates image generation by replacing the stochastic diffusion denoising process with a deterministic, non-Markovian sampling procedure.

Image Segmentation

[Paper Review] Segment Anything Model 2 (SAM2)

5 minute read

Published:

SAM2 generalizes promptable visual segmentation to video by integrating spatio-temporal memory, interactive prompting, and a data engine for fine-grained, efficient, and class-agnostic object segmentation across frames.

NLP

[Paper Review] Transformers

5 minute read

Published:

The Transformer is a deep learning architecture that enhances natural language processing by using self-attention mechanisms to capture long-range dependencies and contextual relationships in text.

[Paper Review] BERT

7 minute read

Published:

BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model that improves natural language understanding by pre-training on vast amounts of text to capture context from both directions.

Transformers

[Paper Review] UMT (Unified Multimodal Transformers)

3 minute read

Published:

UMT is a unified framework for video highlight detection and moment retrieval that flexibly integrates visual, audio, and optional text modalities to identify key moments in both query-based and query-free scenarios.

[Paper Review] Transformers

5 minute read

Published:

The Transformer is a deep learning architecture that enhances natural language processing by using self-attention mechanisms to capture long-range dependencies and contextual relationships in text.

[Paper Review] BERT

7 minute read

Published:

BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model that improves natural language understanding by pre-training on vast amounts of text to capture context from both directions.

Video Highlight Detection

[Paper Review] UMT (Unified Multimodal Transformers)

3 minute read

Published:

UMT is a unified framework for video highlight detection and moment retrieval that flexibly integrates visual, audio, and optional text modalities to identify key moments in both query-based and query-free scenarios.

Video Segmentation

[Paper Review] Segment Anything Model 2 (SAM2)

5 minute read

Published:

SAM2 generalizes promptable visual segmentation to video by integrating spatio-temporal memory, interactive prompting, and a data engine for fine-grained, efficient, and class-agnostic object segmentation across frames.

Video Understanding

[Paper Review] SlowFast Networks for Video Recognition

4 minute read

Published:

The SlowFast network employs dual pathways for video recognition: a Slow pathway operating at a low frame rate to capture spatial semantics, and a Fast pathway operating at a high frame rate to capture fine-grained temporal motion.

[Paper Review] End-to-End Object Detection with Transformers (DETR)

5 minute read

Published:

DETR reframes object detection as a direct set-prediction problem, combining the Transformer's global attention mechanism with CNN-extracted image features and using bipartite matching to assign predictions to ground-truth objects across varied scales.