GitHub - archivearchive/TikTokStreamingAnalytics

Project Overview

This project presents a real-time anomaly detection system that monitors unusual activity on the social media platform TikTok.

Introduction

Brands today face the challenge of timely community management and reputation maintenance across multiple social outlets. Word travels quickly across the Internet, and consumer sentiment can decline rapidly if brands do not act immediately on emerging topics and crises. While social media sites such as Twitter and Facebook offer a relatively easy way to track the latest activity for a given keyword, TikTok, the trending platform favored by Gen Z, does not provide a latest-first view of search results, which makes community management harder for brand strategists.

This project presents a streaming pipeline that ingests and processes TikTok data in real time and detects abnormal activity for a given keyword. It also sinks the streaming data to S3 and then stores it in PostgreSQL databases. Airflow orchestrates the tasks and runs a daily batch analysis to gain insights into trending challenges, musicians, etc.

Dataset

The dataset is kindly provided by pushshift.io. It is about 95 GB and stored in .ndjson format. It has a nested structure with four main fields: authorInfos, challengeInfoList, itemInfos, and musicInfos. A fifth field, time_stamp, is added when the Kafka producer sends each message, to simulate a streaming situation.
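The timestamping step described above could be sketched roughly as follows. This is a minimal illustration, not the code in KafkaProducer.py: the function names `add_time_stamp` and `stream_file`, the epoch-seconds format, and the `delay_seconds` parameter are all assumptions made for the example. The `producer` argument is any object with a `send(topic, value)` method, such as a kafka-python `KafkaProducer` configured with a JSON value serializer.

```python
import json
import time


def add_time_stamp(raw_line, now=None):
    """Parse one .ndjson line and attach a `time_stamp` field.

    The timestamp is epoch seconds (an assumed format); `now` can be
    passed explicitly to make the function deterministic in tests.
    """
    record = json.loads(raw_line)
    record["time_stamp"] = time.time() if now is None else now
    return record


def stream_file(lines, producer, topic, delay_seconds=0.0):
    """Send each non-empty line to `producer`, stamped at send time.

    `producer` only needs a `send(topic, value)` method. An optional
    delay between messages imitates a live stream. Returns the number
    of messages sent.
    """
    sent = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the .ndjson file
        producer.send(topic, add_time_stamp(line))
        sent += 1
        if delay_seconds:
            time.sleep(delay_seconds)
    return sent
```

Keeping the enrichment in its own function means it can be checked without a running Kafka broker; only `stream_file` touches the producer.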

Architecture

Toolkit

Apache Kafka
Apache Spark & Spark Streaming
S3
PostgreSQL
AWS EC2
Apache Airflow

Pipeline Flow

KafkaProducer.py sends raw JSON messages to Kafka and adds a streaming timestamp to each one.
TikTokSparkStream.py consumes the Kafka content, sinks the streaming data to a data lake in S3 for backup, and joins it with historical statistics to detect outliers.
static_tiktok.py reads data from S3 and parses it into structured data, which is stored in a Postgres database and can be queried for ad-hoc analysis.
summarystats.py reads data from the Postgres database, generates the historical statistics used for outlier detection, and loads them into a Postgres table.
tiktok_dag.py schedules a daily Airflow batch job that runs static_tiktok.py and summarystats.py.
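The README does not state which rule is used to flag outliers, so the following is only a sketch of how historical statistics and a live count might be combined. It assumes a z-score test against a per-keyword mean and standard deviation; the function names, the dictionary layout, and the threshold of 3.0 are hypothetical and not taken from summarystats.py or TikTokSparkStream.py.

```python
from statistics import mean, pstdev


def summarize(history):
    """Build per-keyword baseline statistics from past counts.

    `history` maps a keyword to a list of counts observed in past
    time windows (an assumed shape for the batch output).
    """
    return {
        keyword: {"mean": mean(counts), "std": pstdev(counts)}
        for keyword, counts in history.items()
        if counts  # keywords with no history get no baseline
    }


def is_outlier(keyword, count, stats, threshold=3.0):
    """Flag `count` if it deviates from the baseline by more than
    `threshold` standard deviations (assumed z-score rule).

    Keywords without a baseline, or with zero variance, are never
    flagged because no z-score can be computed for them.
    """
    baseline = stats.get(keyword)
    if baseline is None or baseline["std"] == 0:
        return False
    z = (count - baseline["mean"]) / baseline["std"]
    return abs(z) > threshold
```

In the pipeline, the batch side would produce something like the `stats` table once a day, and the streaming side would look up each keyword's baseline for every window of incoming messages.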

Environment Setup

Zookeeper
Scala
Java 8
Kafka 2.6.0
Spark 2.4.7
Py4j 0.10.9.1
Python 3
Airflow

Next-steps

Refine the text processing steps to strip unhelpful content and save database space
Add a GUI component
Improve the outlier detection algorithm to factor in attributes such as seasonality and macro trends

