Here are the questions and notes I have collected for data engineer interview preparation.
keywords:
- Advanced data structures, distributed computing, concurrent programming, data storage, data streams; design and maintain databases, set up data pools.
- Build and manage data pipelines for large datasets.
- ETL: make sure the data is efficiently collected from its source, cleaned and preprocessed, and can be retrieved when needed.
- Toolkits: Hadoop, Spark, Kafka, Hive, ...
Data Pipeline:
- A distributed system is a must.
- Must handle heavy load.
- Higher chance of component failure, so design for fault tolerance.
- Designed for the whole company.
- Highly scalable, ready for future growth.
- Generic enough to support different teams.
The Data Pipeline that tech giants love: how to automate your data work?
ETL best practices with Airflow documentation site
- TIP 1 - Deleting unused variables and calling gc.collect()
- TIP 2 - Presetting the datatypes
- TIP 3 - Importing selected rows of a file (including generating your own subsamples)
- TIP 4 - Importing in batches and processing each individually
- TIP 5 - Importing just selected columns
- TIP 6 - Creative data processing
- TIP 7 - Using Dask
How to Work with BIG Datasets on Kaggle Kernels (16G RAM)
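A few of the tips above can be sketched with pandas. This is a minimal illustration, assuming pandas is available; the tiny in-memory CSV stands in for a large file on disk:

```python
import gc
import io

import pandas as pd

# Stand-in for a large CSV file on disk.
csv_text = "id,score,label\n1,0.5,a\n2,0.7,b\n3,0.9,a\n4,0.1,b\n"

# TIP 2 + TIP 5: preset the datatypes and import just selected columns.
# Smaller dtypes (int32/float32) halve memory vs the int64/float64 defaults.
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=["id", "score"],
                 dtype={"id": "int32", "score": "float32"})

# TIP 4: import in batches (chunksize) and process each chunk individually,
# so the whole file never has to fit in memory at once.
total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["score"].sum()

# TIP 1: delete unused variables and force a garbage-collection pass.
del df
gc.collect()
```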
- A window function is an SQL function whose input values are taken from a "window" of one or more rows in the result set of a SELECT statement.
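As a quick illustration, SQLite (3.25+) supports window functions, so they can be tried directly from Python's standard library. The `sales` table and column names here are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dept TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", 100), ("a", 300), ("b", 200)])

# The window is "all rows in the same dept": unlike GROUP BY,
# every input row stays in the output, annotated with its dept total.
rows = conn.execute("""
    SELECT dept, amount,
           SUM(amount) OVER (PARTITION BY dept) AS dept_total
    FROM sales
""").fetchall()
```

Note the contrast with `GROUP BY`, which would collapse each department to a single row; the window function keeps all three rows and attaches the aggregate to each.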
- MapReduce: the idea is divide and conquer. The main process is highly parallel:
Input -> split -> map -> shuffle -> reduce -> finalize
There are map workers, reduce workers, and a master. The master is forked from the user program and assigns tasks; each map worker processes its input split, buffering intermediate key/value pairs in memory and spilling them to local disk; reduce workers then load that intermediate data, run the reduce function, and finalize the results.
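The split -> map -> shuffle -> reduce flow above can be sketched in a few lines of single-process Python (a word count, the classic MapReduce example; `run_mapreduce`, `word_map`, and `word_reduce` are illustrative names, and the real system distributes each stage across workers):

```python
from collections import defaultdict
from itertools import chain

def run_mapreduce(chunks, map_fn, reduce_fn):
    # map: each "map worker" emits (key, value) pairs from its input split
    mapped = [list(map_fn(chunk)) for chunk in chunks]
    # shuffle: group all intermediate values by key
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    # reduce: each "reduce worker" combines the values for its keys
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def word_map(chunk):
    for word in chunk.split():
        yield word, 1

def word_reduce(key, values):
    return sum(values)

result = run_mapreduce(["big data big", "data pipeline"], word_map, word_reduce)
# result == {"big": 2, "data": 2, "pipeline": 1}
```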
Hive is a Hadoop-based data warehouse tool. It maps structured data files onto database tables, provides SQL-like query functionality (HiveQL), and can compile SQL statements into MapReduce jobs.


