Meme Context Clustering: NLP-Based Approach to Thematic Grouping of Internet Memes

Overview

Memes look chaotic. They're not.

Beneath the humor, irony, and internet slang lies a surprisingly consistent taxonomy - patterns of genre, tone, cultural reference, and emotional register that repeat across thousands of posts. This project sets out to reverse-engineer that hidden structure using a fully interpretable NLP pipeline, built from scratch with standard Python libraries.

"Meme culture is not noise - it's compressed cultural signal. Our job is to decompress it."

The goal: cluster 5,819 meme descriptions into meaningful, human-interpretable thematic groups - using no external datasets, no pre-trained models, and no AutoML.

1. Dataset Description

The competition dataset comprises 5,819 meme records, each described by three fields.

Field	Type	Description
`id`	Integer	Unique identifier for each meme
`input`	String	Textual description of the meme → includes caption, scene context, poster intent, and entity mapping
`url`	String	Source image link, used as a secondary contextual signal

The primary analytical signal is the input column. Each entry follows an implicit structure of the form: caption → scene description → poster intent → entity mapping providing multi-layered semantic information within a single text field. The url column was used supplementarily to extract domain-level metadata that could hint at community context.

Technical Pipeline (Methodology)

Pipeline Overview (TL;DR)

The end-to-end pipeline follows a sequential transformation from raw text to labelled clusters:

Stage	Method	Purpose
Preprocessing	Rule-based text cleaning	Remove noise, normalize vocabulary
Vectorization	TF-IDF (unigrams + bigrams)	Represent term importance relative to corpus
Dimensionality Reduction	LSA via TruncatedSVD	Capture latent semantic relationships
Normalization	L2 normalization	Ensure scale invariance across vectors
Clustering	K-Means (K=7)	Partition memes into thematic groups
Validation	Agglomerative Clustering + ARI	Verify cluster stability
Interpretation	Per-cluster TF-IDF + manual review	Assign human-readable cluster labels

a) Text Preprocessing

Raw text descriptions were cleaned through the following sequential steps:

Step	Operation
1	Stripped the structural `TEXT:` prefix present in every `input` entry
2	Converted all characters to lowercase
3	Removed embedded URLs, punctuation, and numeric tokens
4	Collapsed excess whitespace
5	Applied a curated list of approximately 60 English stopwords
6	Applied lightweight suffix-stripping stemming (`-ing`, `-tion`, `-ness`, `-ment`, `-er`, `-ed`, `-ly`, `-es`, `-s`)
7	Enforced a minimum word length of 3 characters

b) Feature Engineering

Two complementary representations were constructed from the cleaned text.

TF-IDF Vectorization

Parameter	Value	Rationale
`max_features`	2,000	Limits vocabulary to high-signal terms
`ngram_range`	(1, 2)	Captures unigrams and bigrams for multi-word concepts
`min_df`	2	Excludes ultra-rare terms unlikely to generalize
`max_df`	0.85	Down-weights near-universal boilerplate terms
`sublinear_tf`	True	Applies log-scaling to term frequency

Bigrams preserve meaningful multi-word concepts like "meme poster", "trying convey", and "not related" that carry structural meaning a unigram model would miss.

Latent Semantic Analysis (LSA)

LSA projects the 2,000-dimensional sparse TF-IDF space into a dense 50-dimensional semantic space, capturing latent co-occurrence patterns that pure term matching misses entirely.

Technique	Input Dimensions	Output Dimensions	Purpose
TF-IDF	Raw text	2,000 (sparse)	Term importance
LSA / TruncatedSVD	2,000	50 (dense)	Semantic compression
L2 Normalization	50	50 (unit vectors)	Scale invariance

c) Optimal Cluster Selection

The optimal number of clusters was determined empirically using two independent metrics evaluated across K = 4 to K = 14:

Elbow Method → Inertia flattens at the inflection point
Silhouette Score → Peaks at the optimal cluster count

→ K = 7 selected, yielding clusters large enough for statistical stability and specific enough for thematic coherence.

d) Clustering

Primary Method → K-Means

K-Means clustering was applied to the L2-normalized LSA vectors with K = 7, using 20 random initializations to reduce sensitivity to centroid initialization.

Validation Method → Agglomerative Clustering (Ward Linkage)

Method	Type	Linkage / Init
K-Means	Centroid-based	n_init = 20, seed = 42
Agglomerative	Hierarchical	Ward linkage

Using two fundamentally different algorithms (centroid-based vs hierarchical) and measuring their agreement via Adjusted Rand Index (ARI) provides a strong statistical argument that the discovered clusters are real structure, not noise.

e) Visualization

t-SNE was applied to the 50-dimensional LSA vectors to produce a 2-dimensional projection for visual inspection. The resulting scatter plot, saved as cluster_viz.png, confirms meaningful spatial separation between cluster regions, corroborating the quantitative validation results.

f) Cluster Interpretation

Cluster labels were assigned through a two-step interpretive process:

Keyword extraction - A per-cluster TF-IDF analysis was run to identify the top 10 most distinctive terms for each group
Manual review - A minimum of 15 meme entries per cluster were manually read to validate that keyword-derived labels accurately reflected the full range of content within each group

3. Results (Cluster Summary)

Seven thematic clusters were identified, each with a human-assigned label grounded in keyword analysis and manual content review.

Cluster	Label	Representative Keywords	Count
1	Relatable Everyday Struggles	`woman`, `man`, `trying`, `convey`, `look`, `hold`, `right`	292
2	Relationship & Social Dynamics	relationship cues, social situations, interpersonal dynamics	351
3	Surreal & Absurdist Humor	bizarre imagery, non-sequitur logic, chaotic scenarios	1908
4	Gaming & Tech Culture	`game`, `pc`, `console`, `download`, `update`	1035
5	Political & Social Commentary	`society`, `government`, `commentary`, `news`	675
6	Pop Culture & Entertainment	`movie`, `character`, `scene`, `reference`	181
7	Animals & Pet Behavior	`cat`, `dog`, `pet`, `animal`	1376

All cluster labels emerged organically from keyword convergence and content inspection (no labels were pre-defined)

4. Challenges

Challenge	Description	Mitigation Strategy
Short Text Lengths	Many descriptions contain fewer than 15 words, limiting per-meme signal	Bigrams and LSA leverage corpus-wide co-occurrence to compensate
Informal & Slang Language	Slang, abbreviations, and deliberate misspellings are prevalent	Preprocessing was calibrated to preserve meaningful informal terms
Boilerplate Text patterns	Repeated structural phrases appear across hundreds of entries	`max_df = 0.85` in TF-IDF down-weights near-universal terms automatically
Cluster Boundary ambiguity	Some memes are thematically multi-class	Acknowledged as a hard clustering limitation; soft clustering proposed as future work
Absence of Word Embeddings	Competition rules prohibit pre-trained models	LSA partially compensates by capturing latent semantic relationships across the full vocabulary

5. Tools & Libraries

Category	Library / Tool
Language	Python 3.12
Data handling	Pandas, NumPy
Vectorization	`TfidfVectorizer`, `CountVectorizer` (Scikit-learn)
Dimensionality reduction	`TruncatedSVD` (Scikit-learn)
Clustering	`KMeans`, `AgglomerativeClustering` (Scikit-learn)
Evaluation	`silhouette_score`, `adjusted_rand_score` (Scikit-learn)
Visualization	`TSNE` (Scikit-learn), Matplotlib, Seaborn
Serialization	Pickle
Environment	Jupyter Notebook

6. Reproducibility

Every component of this pipeline is fully reproducible:

All random seeds fixed to 42
Notebook executes top-to-bottom without errors
Model artifacts saved as .pkl for independent inspection
No external data downloads or API calls required
output.txt generated programmatically from the notebook

7. Future Directions

The following improvements are proposed conceptually. Here's what an unconstrained v2.0 could look like:

Multimodal Analysis

The url column points to actual meme images. A future system could combine OCR (to extract overlaid text) with image classifiers (to identify visual meme templates like the Drake format or Distracted Boyfriend). Visual + textual fusion would dramatically improve cluster fidelity.

Sentiment-Aware Clustering

An emotional layer (positive / negative / sarcastic / dark) would split clusters that share a theme but differ in tone — separating "motivational memes" from "motivational sarcasm", for example.

Hierarchical Taxonomy

A two-level system would mirror how humans naturally categorize memes:

Humor    →  [Dark Humor, Self-deprecating, Absurdist]
Culture  →  [Gaming, Anime, Sports, Politics]
Emotion  →  [Wholesome, Relatable, Rage-bait]

Temporal Trend Detection

With timestamps, tracking cluster growth and decay over time could reveal meme lifecycle patterns and predict emerging formats before they go viral.

Community-Context Features

URL domains (Reddit subreddits, Know Your Meme, Imgur) encode deep community context. Graph-based clustering that weights memes from the same origin community could significantly improve thematic coherence.

Human-in-the-Loop Refinement

A semi-supervised approach where domain experts label a seed set of 50–100 memes could bootstrap constrained K-Means or label propagation — combining machine scale with human judgment.

Real-World Applications

Content Moderation — Automatically flag offensive meme clusters for human review
Recommendation Engines — Suggest memes from contextually related clusters
Cultural Analytics — Track the evolution of internet humor across communities and time

Repository Structure

Meme Context Clustering
├── NOTICE
├── README.md
├── LICENSE
├── data/ 
    └── meme.csv           -> Original Data
├── models/ 
    ├── vectorizer.pkl     -> Trained TF-IDF vectorizer
    └── cluster_model.pkl  -> Trained K-Means model
├── report.txt             -> Whole Approach report
├── output.txt             -> Cluster listings + per-meme reasoning + methodology + pitch
└── meme_clustering.ipynb  -> Full reproducible pipeline notebook

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Meme Context Clustering: NLP-Based Approach to Thematic Grouping of Internet Memes

Overview

1. Dataset Description

Technical Pipeline (Methodology)

Pipeline Overview (TL;DR)

a) Text Preprocessing

b) Feature Engineering

c) Optimal Cluster Selection

d) Clustering

e) Visualization

f) Cluster Interpretation

3. Results (Cluster Summary)

4. Challenges

5. Tools & Libraries

6. Reproducibility

7. Future Directions

Multimodal Analysis

Sentiment-Aware Clustering

Hierarchical Taxonomy

Temporal Trend Detection

Community-Context Features

Human-in-the-Loop Refinement

Real-World Applications

Repository Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
data		data
models		models
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
meme_clustering.ipynb		meme_clustering.ipynb
output.txt		output.txt
report.txt		report.txt

Folders and files

Latest commit

History

Repository files navigation

Meme Context Clustering: NLP-Based Approach to Thematic Grouping of Internet Memes

Overview

1. Dataset Description

Technical Pipeline (Methodology)

Pipeline Overview (TL;DR)

a) Text Preprocessing

b) Feature Engineering

c) Optimal Cluster Selection

d) Clustering

e) Visualization

f) Cluster Interpretation

3. Results (Cluster Summary)

4. Challenges

5. Tools & Libraries

6. Reproducibility

7. Future Directions

Multimodal Analysis

Sentiment-Aware Clustering

Hierarchical Taxonomy

Temporal Trend Detection

Community-Context Features

Human-in-the-Loop Refinement

Real-World Applications

Repository Structure

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages