
Here are the questions I have collected for data engineer interview preparation.

The continuum of skills

General Questions

1. What is a data engineer?

keywords:

  • Advanced data structures, distributed computing, concurrent programming, data storage, data streams; designing and maintaining databases, setting up data pools.

  • Building and managing data pipelines for large datasets.

  • ETL: making sure data is efficiently collected from its source, cleaned and preprocessed, and retrieved when needed.

  • Toolkits: Hadoop, Spark, Kafka, Hive, ...

  • Lambda Architecture

Data Pipeline:

  • A distributed system is a must.
    • It must handle heavy load.
    • There is a higher chance of partial failure, so design for fault tolerance.
  • Design for the whole company.
    • Highly scalable, ready for future growth.
    • Generic enough to support different teams.

The data pipeline that tech giants love: how to automate your data work?

ETL best practices with Airflow documentation site
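Since Airflow comes up often in pipeline questions, here is a minimal sketch of a daily extract -> transform -> load DAG, assuming Airflow 2.4+ (where `schedule` replaced `schedule_interval`). The DAG id and the three callables are hypothetical placeholders, not a prescribed implementation.

```python
# Minimal Airflow 2.4+ DAG sketch: a daily extract -> transform -> load pipeline.
# The dag_id and the three callables are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the source system")


def transform():
    print("clean and reshape the extracted data")


def load():
    print("write the result to the warehouse")


with DAG(
    dag_id="example_etl",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run once per day
    catchup=False,                   # do not backfill past runs
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task  # declare task ordering
```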

2. How can I clean this dataset without loading it all in RAM? (A sketch combining several of these tips appears after the reference below.)
  • TIP 1 - Deleting unused variables and calling gc.collect()
  • TIP 2 - Presetting the datatypes
  • TIP 3 - Importing selected rows of a file (including generating your own subsamples)
  • TIP 4 - Importing in batches and processing each individually
  • TIP 5 - Importing just selected columns
  • TIP 6 - Creative data processing
  • TIP 7 - Using Dask

How to Work with BIG Datasets on Kaggle Kernels (16G RAM)
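As a concrete illustration of TIPs 2, 4, and 5, here is a minimal pandas sketch. The file name `big.csv` and its columns (`id`, `category`, `value`) are invented for the example.

```python
# Sketch of TIPs 2, 4, and 5: preset dtypes, read only selected columns,
# and process the file in batches. "big.csv" and its columns are hypothetical.
import pandas as pd

dtypes = {"id": "int32", "category": "category", "value": "float32"}  # TIP 2

totals = {}
# TIP 4 + TIP 5: stream the file in 100k-row chunks, keeping only needed columns.
for chunk in pd.read_csv("big.csv", usecols=list(dtypes), dtype=dtypes,
                         chunksize=100_000):
    # Aggregate each chunk independently so the full file never sits in RAM.
    for cat, total in chunk.groupby("category", observed=True)["value"].sum().items():
        totals[cat] = totals.get(cat, 0.0) + total

print(totals)  # per-category sums computed without loading the whole file
```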

Server Related:

  1. How do I build a pipeline that can handle 10,000 requests per minute?
  2. How do you scale a web service (e.g., with AWS)?
  3. What techniques make a distributed system highly available?

Analytics Related:

SQL Related:

1. How to optimize a Hive query

  2. Window Function

A window function is an SQL function whose input values are taken from a "window" of one or more rows in the result set of a SELECT statement. Unlike an aggregate with GROUP BY, it returns a value for every row.
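A small runnable illustration using Python's built-in sqlite3 module (window functions require SQLite 3.25+, bundled with recent Python builds); the `sales` table and its rows are invented for the example.

```python
# Window function demo with Python's built-in sqlite3 (needs SQLite >= 3.25).
# The "sales" table and its rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 300), ("west", 200), ("west", 50)])

# RANK() ranks rows within each region's window, ordered by amount;
# note that every input row is kept in the output, unlike GROUP BY.
rows = conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
""").fetchall()

for row in rows:
    print(row)  # e.g. ('east', 300, 1), ('east', 100, 2), ...
```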

Distributed Systems Related:

  1. How does MapReduce work?

    The idea is divide and conquer.

    The main process is highly parallel:

    Input -> split -> map -> shuffle -> reduce -> finalize

    There are map workers, reduce workers, and a master: the master assigns tasks on behalf of the user program; each map worker processes its split of the data locally in memory; each reduce worker loads the intermediate data, reduces it, and finalizes the results. (A toy word-count sketch follows this list.)

    Hive is a Hadoop-based data warehouse tool that maps structured data files onto database tables and provides SQL-like querying; its SQL commands are compiled into MapReduce jobs.

  2. Data replication strategies

  3. Message delivery guarantees
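To make the split -> map -> shuffle -> reduce flow from question 1 concrete, here is a toy single-process word count. Real MapReduce distributes these phases across workers; this sketch only shows what each phase does to the data.

```python
# Toy word count illustrating the MapReduce data flow in one process.
# Real MapReduce runs map and reduce tasks on separate workers; this
# sketch only demonstrates the phases.
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]  # input splits


def map_phase(doc):
    # Map: emit a (key, value) pair for every word in one split.
    return [(word, 1) for word in doc.split()]


def shuffle(mapped):
    # Shuffle: group all emitted values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)


mapped = [map_phase(doc) for doc in documents]  # map runs per split
results = [reduce_phase(k, v) for k, v in shuffle(mapped).items()]
print(sorted(results))  # [('brown', 1), ('dog', 2), ('fox', 1), ...]
```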

Cloud Computing Related:

  1. How to sync data across S3 buckets in different AWS accounts?
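One common approach: a bucket policy on the source grants the destination account read access, then a copy job runs from the destination account (the `aws s3 sync` CLI command can do the same). Below is a minimal boto3 sketch; the bucket names are hypothetical and the cross-account policy is assumed to already be in place.

```python
# Sketch: copy every object from a source bucket (owned by another AWS
# account) into a bucket in this account. Assumes the source bucket's
# policy already grants this account s3:GetObject and s3:ListBucket;
# both bucket names are hypothetical.
import boto3

SRC_BUCKET = "other-account-source-bucket"    # hypothetical
DST_BUCKET = "my-account-destination-bucket"  # hypothetical

s3 = boto3.client("s3")

# Page through the source bucket and issue a server-side copy per object,
# so the data never travels through this machine.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        s3.copy({"Bucket": SRC_BUCKET, "Key": obj["Key"]}, DST_BUCKET, obj["Key"])
```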

What exactly is stream computing?