Introduction

This code accompanies the blog post "Full Guide: Creating Flask Callback Server to Store LinkedIn Profiles in MySQL using Crawlbase Crawler".

Getting Started

Software Needed

Setup MySQL Database Schema

Create a user

CREATE USER 'linkedincrawler'@'localhost' IDENTIFIED BY 'linked1nS3cret';

Create a database

CREATE DATABASE linkedin_crawler_db;

Grant permission

GRANT ALL PRIVILEGES ON linkedin_crawler_db.* TO 'linkedincrawler'@'localhost';

Set current database

USE linkedin_crawler_db;

Create tables

CREATE TABLE IF NOT EXISTS `crawl_requests` (
  `id` INT AUTO_INCREMENT PRIMARY KEY,
  `url` TEXT NOT NULL,
  `status` VARCHAR(30) NOT NULL,
  `crawlbase_rid` VARCHAR(255) NOT NULL
);
CREATE INDEX `idx_crawl_requests_status` ON `crawl_requests` (`status`);
CREATE INDEX `idx_crawl_requests_crawlbase_rid` ON `crawl_requests` (`crawlbase_rid`);
CREATE INDEX `idx_crawl_requests_status_crawlbase_rid` ON `crawl_requests` (`status`, `crawlbase_rid`);

CREATE TABLE IF NOT EXISTS `linkedin_profiles` (
  `id` INT AUTO_INCREMENT PRIMARY KEY,
  `crawl_request_id` INT NOT NULL,
  `title` VARCHAR(255),
  `headline` VARCHAR(255),
  `summary` TEXT,

  FOREIGN KEY (`crawl_request_id`) REFERENCES `crawl_requests`(`id`)
);

CREATE TABLE IF NOT EXISTS `linkedin_profile_experiences` (
  `id` INT AUTO_INCREMENT PRIMARY KEY,
  `linkedin_profile_id` INT NOT NULL,
  `title` VARCHAR(255),
  `company_name` VARCHAR(255),
  `description` TEXT,
  `is_current` BIT NOT NULL DEFAULT 0,

  FOREIGN KEY (`linkedin_profile_id`) REFERENCES `linkedin_profiles`(`id`)
);
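The foreign keys above imply an insert order: a row in linkedin_profiles must exist before its rows in linkedin_profile_experiences. The sketch below illustrates that order with any DB-API 2.0 cursor. It is not taken from the project's own code: the `save_profile` helper, the shape of the `profile` dict, and the `placeholder` parameter are all assumptions made for illustration.

```python
from typing import Any, Mapping


def save_profile(cursor: Any, crawl_request_id: int,
                 profile: Mapping[str, Any], placeholder: str = "%s") -> int:
    """Insert one profile and its experiences; return the new profile id.

    `cursor` is any DB-API 2.0 cursor. The `profile` dict layout is
    hypothetical, chosen only to mirror the table columns above.
    """
    p = placeholder  # parameter marker only; values are always bound, never interpolated

    # Parent row first: linkedin_profiles references crawl_requests(id).
    cursor.execute(
        "INSERT INTO linkedin_profiles "
        "(crawl_request_id, title, headline, summary) "
        f"VALUES ({p}, {p}, {p}, {p})",
        (crawl_request_id, profile.get("title"),
         profile.get("headline"), profile.get("summary")),
    )
    profile_id = cursor.lastrowid

    # Child rows next: each experience references linkedin_profiles(id).
    for exp in profile.get("experiences", []):
        cursor.execute(
            "INSERT INTO linkedin_profile_experiences "
            "(linkedin_profile_id, title, company_name, description, is_current) "
            f"VALUES ({p}, {p}, {p}, {p}, {p})",
            (profile_id, exp.get("title"), exp.get("company_name"),
             exp.get("description"), 1 if exp.get("is_current") else 0),
        )
    return profile_id
```

The default `%s` marker matches the common MySQL drivers; pass `placeholder="?"` for drivers that use question marks. Committing the transaction is left to the caller.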

Settings

In PROJECT_FOLDER/settings.yml, set the token and crawler values to match what you configured in Crawlbase Crawler.

Example:

# PROJECT_FOLDER/settings.yml
token: mynormalcrawlbasetoken
crawler: linkedin-profile-crawler
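A missing or empty value in settings.yml is easy to overlook, so a small check at startup can help. The following is a minimal sketch, assuming PyYAML is available; `validate_settings` and `load_settings` are hypothetical helper names and may not match how the project reads its settings.

```python
from typing import Any, Mapping

REQUIRED_KEYS = ("token", "crawler")


def validate_settings(settings: Mapping[str, Any]) -> dict:
    """Return the required settings as stripped strings, or raise ValueError."""
    cleaned = {}
    for key in REQUIRED_KEYS:
        value = str(settings.get(key) or "").strip()
        if not value:
            raise ValueError(f"settings.yml is missing a value for '{key}'")
        cleaned[key] = value
    return cleaned


def load_settings(path: str = "settings.yml") -> dict:
    """Read and validate the YAML file (assumes PyYAML is installed)."""
    import yaml  # imported lazily so validate_settings works without PyYAML

    with open(path, encoding="utf-8") as handle:
        return validate_settings(yaml.safe_load(handle) or {})
```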

List of URLs

Next, make sure PROJECT_FOLDER/urls.txt contains the URLs to crawl, one valid URL per line. By default, it lists the 5 most-followed people on LinkedIn.
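As an illustration of the one-URL-per-line format, the sketch below reads the file, skips blank lines, and rejects anything that is not an http or https URL. The `read_urls` helper is hypothetical and is not necessarily how crawl.py reads the file.

```python
from pathlib import Path
from typing import List
from urllib.parse import urlparse


def read_urls(path: str = "urls.txt") -> List[str]:
    """Return the URLs in `path`, one per line, ignoring blank lines."""
    urls = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are tolerated
        parsed = urlparse(line)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"{path}:{line_number}: not a valid URL: {line!r}")
        urls.append(line)
    return urls
```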

Setup Python Virtual Environment

1. Create a virtual environment in our project folder

PROJECT_FOLDER$ python3 -m venv .venv

2. Activate the virtual environment

PROJECT_FOLDER$ . .venv/bin/activate

3. Install dependencies

PROJECT_FOLDER$ pip install -r requirements.txt

I. Running the Crawler Callback Server

1. Start ngrok.

Open a new terminal and run the command below:

$ ngrok http 5000

Then note the forwarding URL in the ngrok output, which looks like this:

Forwarding                    https://4e15-180-190-160-114.ngrok-free.app -> http://localhost:5000

You will need the https://4e15-180-190-160-114.ngrok-free.app value later, when registering the callback in step 4.

2. Then run the callback server script.

Open a new terminal and run the command below to activate the Python virtual environment for this terminal:

PROJECT_FOLDER$ . .venv/bin/activate

Then run the callback server

PROJECT_FOLDER$ python callback_server.py

3. Test the server

In a new terminal, run the following:

$ curl -i -X POST 'http://localhost:5000/crawlbase_crawler_callback' -H 'RID: dummyrequest' -H 'Accept: application/json' -H 'Content-Type: gzip/json' -H 'User-Agent: Crawlbase Monitoring Bot 1.0' -H 'Content-Encoding: gzip' --data-binary '"\x1F\x8B\b\x00+\xBA\x05d\x00\x03\xABV*\xCALQ\xB2RJ)\xCD\xCD\xAD,J-,M-.Q\xD2QJ\xCAO\xA9\x04\x8A*\xD5\x02\x00L\x06\xB1\xA7 \x00\x00\x00' --compressed

If the server is running normally, you should see this message in its console:

[app][2023-08-10 17:42:16] Callback server is working
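The curl command above sends a gzip-compressed body together with an RID header. The sketch below shows the same idea in Python: one function builds a comparable test request, and another shows how a receiver could decompress the body. The JSON payload is an assumption and may differ byte for byte from the one embedded in the curl command; `build_test_request` and `decode_callback_body` are hypothetical names, not functions from callback_server.py.

```python
import gzip
import json
import urllib.request
from typing import Mapping

CALLBACK_URL = "http://localhost:5000/crawlbase_crawler_callback"


def build_test_request(url: str = CALLBACK_URL) -> urllib.request.Request:
    """Build (but do not send) a gzip-compressed dummy callback request."""
    # Assumed dummy payload; the real bytes in the curl command may differ.
    payload = json.dumps({"rid": "dummyrequest", "body": ""}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=gzip.compress(payload),
        method="POST",
        headers={
            "RID": "dummyrequest",
            "Accept": "application/json",
            "Content-Type": "gzip/json",
            "Content-Encoding": "gzip",
            "User-Agent": "Crawlbase Monitoring Bot 1.0",
        },
    )


def decode_callback_body(headers: Mapping[str, str], raw: bytes) -> bytes:
    """Return the body, decompressed when Content-Encoding says gzip."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw
```

With the callback server running, `urllib.request.urlopen(build_test_request())` sends the request.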

4. Finally, register the ngrok URL in the Crawlbase Crawler dashboard

Example:

https://4e15-180-190-160-114.ngrok-free.app/crawlbase_crawler_callback
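The URL to register is the ngrok forwarding URL with the callback path appended. A minimal sketch of that composition follows; the `callback_url` helper is hypothetical, and the path is taken from the curl test in step 3.

```python
from urllib.parse import urlsplit, urlunsplit

CALLBACK_PATH = "/crawlbase_crawler_callback"


def callback_url(forwarding_url: str, path: str = CALLBACK_PATH) -> str:
    """Combine the ngrok forwarding URL with the callback route."""
    parts = urlsplit(forwarding_url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a usable forwarding URL: {forwarding_url!r}")
    # Keep only scheme and host, then attach the callback path.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```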

II. Running the Periodic Processor

Open a new terminal and run the command below to activate the Python virtual environment for this terminal:

PROJECT_FOLDER$ . .venv/bin/activate

Then run the processor

PROJECT_FOLDER$ python process.py

The processor loops indefinitely, waiting for data from Crawlbase to process.
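The general shape of such a polling loop can be sketched as follows. This is not the actual process.py: `run_processor`, `fetch_pending`, `handle`, and `max_cycles` are hypothetical names, and the real script may select and mark its work differently.

```python
import time
from typing import Any, Callable, Iterable, Optional


def run_processor(fetch_pending: Callable[[], Iterable[Any]],
                  handle: Callable[[Any], None],
                  poll_interval: float = 5.0,
                  max_cycles: Optional[int] = None) -> int:
    """Poll for pending items and handle them; return how many were handled.

    With max_cycles=None the loop never ends, like a long-running worker.
    """
    processed = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        batch = list(fetch_pending())
        for item in batch:
            handle(item)
            processed += 1
        cycles += 1
        if not batch:
            # Nothing to do yet: wait before asking again.
            time.sleep(poll_interval)
    return processed
```

Sleeping only when a batch comes back empty keeps the loop responsive under load while avoiding a busy wait when idle.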

III. Initiating Crawling

Open a new terminal and run the command below to activate the Python virtual environment for this terminal:

PROJECT_FOLDER$ . .venv/bin/activate

Then run the crawl script.

PROJECT_FOLDER$ python crawl.py
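For orientation, pushing a URL to the crawler amounts to a request that carries the token and crawler name from settings.yml plus the target URL. The sketch below only builds such a request URL. The endpoint and the parameter names (`token`, `crawler`, `callback`, `url`) are assumptions based on Crawlbase's public documentation and should be verified there; `build_push_url` is a hypothetical helper, not code from crawl.py.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the Crawlbase documentation.
API_ENDPOINT = "https://api.crawlbase.com/"


def build_push_url(token: str, crawler: str, target_url: str) -> str:
    """Return a request URL that asks the named crawler to fetch target_url."""
    query = urlencode({
        "token": token,
        "crawler": crawler,
        "callback": "true",  # assumed flag for asynchronous delivery to the callback
        "url": target_url,   # urlencode percent-escapes the nested URL
    })
    return f"{API_ENDPOINT}?{query}"
```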
