by Benjamin Midtvedt, Jesús Pineda, Henrik Klein Moberg, Harshith Bachimanchi, Joana B. Pereira, Carlo Manzo, Giovanni Volpe
No Starch Press, San Francisco (CA), 2025
ISBN-13: 9781718503922
https://nostarch.com/deep-learning-crash-course
-
Attention and Transformers for Sequence Processing
Introduces attention mechanisms, transformer models, and vision transformers (ViTs), applying them to natural language processing (NLP) tasks such as improved text translation and sentiment analysis, as well as to image classification.
Code 8-1: Understanding Attention
Builds attention modules from scratch (dot-product, trainable dot-product, additive) and applies them to toy examples, visualizing attention maps that show which tokens focus on which other tokens. It illustrates how pre-trained embeddings (GloVe) can highlight semantic relationships (like she-her) even without fine-tuning. It clarifies the difference between non-learnable and learnable key/value embeddings.
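As a rough illustration of the non-learnable variant, dot-product attention can be sketched in a few lines of PyTorch; the function name, toy shapes, and the scaling by the key dimension below are illustrative conventions, not the book's code:

    import torch
    import torch.nn.functional as F

    def dot_product_attention(queries, keys, values):
        """Non-learnable dot-product attention.

        queries: (num_queries, dim); keys and values: (num_keys, dim).
        Returns the attended values and the attention map.
        """
        scores = queries @ keys.T / keys.shape[-1] ** 0.5  # similarity scores
        attention = F.softmax(scores, dim=-1)              # each row sums to 1
        return attention @ values, attention

    # Toy example: 3 query tokens attending over 5 key/value tokens.
    q, k, v = torch.randn(3, 8), torch.randn(5, 8), torch.randn(5, 8)
    out, attn = dot_product_attention(q, k, v)
    print(out.shape, attn.shape)  # torch.Size([3, 8]) torch.Size([3, 5])

The attention map returned here is exactly what the notebook visualizes: one row per query token, showing how strongly it focuses on each key token.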
Code 8-A: Translating with Attention
Improves the seq2seq model from Chapter 7 with dot-product cross-attention to focus on the most relevant parts of source sentences during translation. It demonstrates how attention helps align multi-word phrases and resolve ambiguities. The model surpasses the earlier RNN-based approach by dynamically highlighting the crucial source tokens, showing that "attention is all you need" for better translations.
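A minimal sketch of how dot-product cross-attention lets a decoder step weigh the encoder outputs of the source sentence (the shapes and names are illustrative assumptions, not the chapter's implementation):

    import torch
    import torch.nn.functional as F

    def cross_attention(decoder_state, encoder_outputs):
        """One decoder step attends over the encoder outputs of the source.

        decoder_state: (batch, hidden); encoder_outputs: (batch, src_len, hidden).
        Returns a context vector and the attention weights over source tokens.
        """
        scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
        weights = F.softmax(scores, dim=-1)                       # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights

    # Toy shapes: batch of 2 sentences, 7 source tokens, hidden size 16.
    enc, dec = torch.randn(2, 7, 16), torch.randn(2, 16)
    context, weights = cross_attention(dec, enc)
    print(context.shape, weights.shape)  # torch.Size([2, 16]) torch.Size([2, 7])

The context vector is then fed into the decoder alongside its own hidden state, so each translated word can draw on the most relevant source tokens.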
Code 8-B: Performing Sentiment Analysis with a Transformer
Implements an encoder-only Transformer using multi-head self-attention and feedforward layers to classify the sentiment of IMDB reviews as positive or negative. It details the entire pipeline: tokenizing, building a vocabulary, batching sequences with padding masks, stacking multiple Transformer blocks, and adding a dense top for binary classification. The approach yields strong sentiment prediction accuracy, highlighting the parallel-processing benefits of Transformers over RNN-based models.
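The overall architecture can be approximated with PyTorch's built-in encoder layers. The sketch below omits positional encodings and the tokenization pipeline, and its hyperparameters (vocabulary size, embedding dimension, and so on) are illustrative assumptions rather than the book's settings:

    import torch
    import torch.nn as nn

    class SentimentTransformer(nn.Module):
        """Encoder-only Transformer for binary sentiment classification."""

        def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads,
                dim_feedforward=256, batch_first=True,
            )
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.classifier = nn.Linear(embed_dim, 1)  # dense top for binary output

        def forward(self, tokens, padding_mask=None):
            x = self.embedding(tokens)                         # (batch, seq, embed)
            x = self.encoder(x, src_key_padding_mask=padding_mask)
            x = x.mean(dim=1)                                  # pool over tokens
            return self.classifier(x).squeeze(-1)              # logits

    model = SentimentTransformer(vocab_size=20_000)
    tokens = torch.randint(1, 20_000, (8, 64))                 # batch of 8 reviews
    print(model(tokens).shape)                                 # torch.Size([8])

Because every token attends to every other token in a single matrix operation, the whole review is processed in parallel rather than step by step as in an RNN.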
Code 8-C: Classifying Images with a Vision Transformer
Shows how Transformers can replace convolutions for image tasks by splitting images into patch embeddings. The ViT model is trained from scratch on CIFAR-10, using CutMix to compensate for its weaker inductive biases compared to CNNs. It achieves notable performance, and excels especially when a pretrained backbone is fine-tuned. This underscores ViT's flexibility and potential to rival or outperform CNNs on visual data.
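The patch-embedding step that lets a Transformer consume images can be sketched as follows; the patch size, embedding dimension, and class name are illustrative assumptions for CIFAR-10-sized inputs, not the book's code:

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Splits an image into patches and projects each patch to an embedding.

        A convolution whose kernel size and stride both equal the patch size
        computes all patch embeddings in a single pass.
        """

        def __init__(self, image_size=32, patch_size=4, channels=3, embed_dim=128):
            super().__init__()
            self.project = nn.Conv2d(channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
            num_patches = (image_size // patch_size) ** 2
            self.positions = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

        def forward(self, images):
            x = self.project(images)              # (batch, embed, H/P, W/P)
            x = x.flatten(2).transpose(1, 2)      # (batch, num_patches, embed)
            return x + self.positions             # learned positional embeddings

    # CIFAR-10-sized input: 32x32 RGB images split into 4x4 patches.
    images = torch.randn(8, 3, 32, 32)
    print(PatchEmbedding()(images).shape)         # torch.Size([8, 64, 128])

The resulting sequence of patch embeddings is fed to a standard Transformer encoder, just as word embeddings are in the text examples above.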