
Summary of common big data knowledge points for data engineers


I've been a data engineer for 4 years, two years each at a small company and a big company. I'm preparing to switch jobs, so I put together a summary of big data knowledge points, mainly focused on Spark and stream processing.
~~~Please send some 大米 so I can read interview reports~~

Resilient Distributed Datasets (RDD): an abstraction of a dataset distributed across the cluster, usually held in main memory.
Operations: transformations or actions performed on an RDD. A Spark program is normally defined as a sequence of transformations and actions applied to a dataset.
Spark Context: the object that connects Spark to the program being developed. It can be accessed as a variable in the program to use cluster resources.

Cluster: 1 driver + N executors; a collection of JVMs running Spark.
Driver: the main JVM, which orchestrates data processing across the executors. Executors: worker JVMs that run the tasks.
Costs to consider: EC2, S3, EMR.
In the Spark History Server, the Jobs/Stages tabs show the task time breakdown, help identify task skew and data skew, and also expose task-level logs.

How to calculate memory and the other usual Spark parameters?
One node has multiple cores. Spark tasks are run by executors, and an executor lives within one node. You specify how many cores each executor gets (usually 5 for good HDFS throughput). On each node, leave 1 core for the Hadoop/YARN daemons; the remaining cores can be assigned to executors. Across all nodes, leave 1 executor's worth of resources for the driver/application master (driver memory and cores can be set differently, but it is recommended to use the same values as the executors). Node memory is fixed and split evenly among its executors. executor.memoryOverhead is max(384 MB, 10% of executor memory). Round down everywhere, except round up when calculating memoryOverhead. default.parallelism = sql.shuffle.partitions = total number of executor cores * 2; spark.sql.shuffle.partitions only applies to DataFrame/SQL operations.
Always set the virtual and physical memory check flags to false: "yarn.nodemanager.vmem-check-enabled":"false", "yarn.nodemanager.pmem-check-enabled":"false".
E.g. 10 nodes, 16 cores/node, 64 GB memory/node. Then executor.cores = 5. Leave 1 core per node for Hadoop/YARN daemons, so 15 cores are available per node, and the number of executors per node = 15/5 = 3. executor.instances = 3*10 - 1 = 29. Total memory per executor = 64 GB/3 = 21 GB. executor.memoryOverhead = max(384 MB, 10% of 21 GB) ≈ 3 GB (rounded up), so the actual executor.memory = 21 - 3 = 18 GB. driver.memory = executor.memory = 18 GB, driver.cores = executor.cores = 5. default.parallelism = sql.shuffle.partitions = executor.instances * executor.cores * 2 = 29*5*2 = 290. (See the sizing sketch below.)
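
A minimal Python sketch of the sizing arithmetic above; the node counts and sizes are just this example's assumptions, not recommendations:

# Executor sizing arithmetic for the example cluster (10 nodes, 16 cores, 64 GB each).
import math

nodes, cores_per_node, mem_per_node_gb = 10, 16, 64
executor_cores = 5                                    # ~5 cores/executor for good HDFS throughput

cores_avail = cores_per_node - 1                      # leave 1 core for Hadoop/YARN daemons -> 15
executors_per_node = cores_avail // executor_cores    # 15 // 5 = 3 (round down)
executor_instances = executors_per_node * nodes - 1   # leave 1 executor for the driver/AM -> 29

mem_per_executor = mem_per_node_gb // executors_per_node        # 64 // 3 = 21 GB (round down)
overhead_gb = max(0.384, math.ceil(0.10 * mem_per_executor))    # max(384 MB, 10%), rounded up -> 3 GB
executor_memory_gb = mem_per_executor - overhead_gb             # 21 - 3 = 18 GB

shuffle_partitions = executor_instances * executor_cores * 2    # 29 * 5 * 2 = 290
print(executor_instances, executor_memory_gb, shuffle_partitions)  # 29 18 290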

Spark join strategy https://towardsdatascience.com/s ... k-join-c0e7b4572bcf
Broadcast Hash Join: builds a hash table from the join key of the smaller relation, then loops over the larger relation to match the hashed join-key values. Only supported for '=' joins; does not support full outer join.
Shuffle Hash Join: moves data with the same join-key value to the same executor node, then performs a hash join, since data for the same key is now on the same executor. Pick it when one side is small enough to build the local hash map and is much smaller than the other side. Only supported for '=' joins; does not support full outer join.
Sort-Merge Join: shuffle sort-merge join shuffles the data so the same join key lands on the same worker, then performs a sort-merge join at the partition level on the worker nodes. Requires that the join keys be sortable. Only supported for '=' joins.
Cartesian Join: the cartesian product (as in SQL) of the two relations is computed to evaluate the join. Only for inner joins.
Broadcast Nested Loop Join: the fallback strategy, very slow:
for record_1 in relation_1:
  for record_2 in relation_2:
    # join condition is executed
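
A minimal PySpark sketch of steering the planner toward a broadcast hash join with an explicit hint; the table paths and column name are assumptions:

# Hint the optimizer to broadcast the small relation so it picks a broadcast hash join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet("s3://my-bucket/orders/")        # large relation (path assumed)
countries = spark.read.parquet("s3://my-bucket/countries/")  # small relation (path assumed)

joined = orders.join(broadcast(countries), "country_code", "inner")
joined.explain()  # the physical plan should show BroadcastHashJoin

# Equivalent SQL hint form:
# SELECT /*+ BROADCAST(c) */ * FROM orders o JOIN countries c ON o.country_code = c.country_code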

Operations requiring a shuffle (an Exchange in the Spark plan): join, groupByKey, reduceByKey, partitionBy, etc.
Whenever a shuffle happens, a new stage starts. The shuffle data is written to disk at the end of the previous stage, so the shuffle can read it from any node and can be retried on failure. Shuffle is slow because of disk I/O, data serialization and deserialization, and network I/O, so try to reduce shuffles and tune spark.sql.shuffle.partitions; shuffles show up in the physical plan, as in the sketch below.
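
A small sketch showing how a shuffle surfaces as an Exchange node in the physical plan; the dataset path and column names are assumptions:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .config("spark.sql.shuffle.partitions", "290")  # tune to ~2x total executor cores
         .getOrCreate())

events = spark.read.parquet("s3://my-bucket/events/")    # path assumed
events.groupBy("user_id").agg(F.count("*").alias("n")).explain()
# The plan contains something like: Exchange hashpartitioning(user_id, 290)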

How to deal with data skew? https://developer.aliyun.com/article/741111
If the data is skewed at the input source, repartition or rewrite it.
Change the join key or partition key to make the data evenly distributed.
Increase parallelism so that no single task gets too much data.
Use a larger cluster.
Reduce the number of sort-merge joins.
Try broadcasting if one of the tables is under ~2 GB; use broadcast hints.
Preprocess the data to remove unnecessary rows. If the join key has many null values, split into two datasets (non-null and null), process them separately, then union them.
Specify the hint ` /*+ SKEW ('<table_name>') */ ` for a join, describing the column and the values on which skew is expected. Based on that information, the engine automatically handles the skewed values appropriately.
Key salting: for one large and one small table, salt the large table's key and explode the small table across all salt values; for two large tables, both can be salted randomly and then joined on the original key values (effectively a partial cartesian expansion). See the salting sketch below.
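
A hypothetical salting sketch for a skewed '=' join between a large, skewed table and a small table; the table paths, column names, and salt factor are assumptions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
SALT = 16

big_df = spark.read.parquet("s3://my-bucket/clicks/")    # large side, skewed on join_key (path assumed)
small_df = spark.read.parquet("s3://my-bucket/users/")   # small side (path assumed)

# Salt the large side with a random bucket 0..SALT-1.
big_salted = big_df.withColumn("salt", (F.rand() * SALT).cast("int"))

# Explode the small side so each row appears once per salt value.
small_salted = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT)]))
)

# Join on (join_key, salt); the hot key is now spread across SALT tasks.
joined = big_salted.join(small_salted, ["join_key", "salt"]).drop("salt")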

How to tune Spark performance?
Favor batch reads over individual record updates; Spark is more analysis- than transaction-focused. Keep data in a read-efficient format like Parquet, and in a flattened schema if you want to use S3 Select (it works for CSV and JSON, but not nested JSON).
Avoid too many small files.
Filter out data as early as possible in the application pipeline.
Use DataFrame/Dataset over RDD, because they benefit from Spark's built-in optimizers (Catalyst, Tungsten).
Data objects should be ready to join with other datasets, without needing to derive data based on service implementation logic.
Avoid UDFs; use built-in functions.
Replace union with flatMap, or cache before union, because each DataFrame in a union is computed independently before the shuffle; a self-union duplicates computation.
Cache.
Repartition.
Coalesce: merges partitions together, reducing the number of partitions/files; faster than repartition since it avoids a full shuffle. Coalesce down to roughly the number of cores.
Parallelize: the degree of parallelism should match the number of cores.
Reduce distinct unions.
Bucketing: if a large table is used in joins multiple times, write it to S3 bucketed by the join key, defining the number of buckets and the bucketBy key (the join key); see the bucketing sketch after this list.
Aggregate before the shuffle.
Set up an appropriate garbage collector when handling large volumes of data through Spark.
Tune spark.driver.memory, executor-memory, num-executors, executor-cores, spark.sql.shuffle.partitions, and yarn.scheduler.maximum-allocation-mb.
https://aws.amazon.com/cn/blogs/ ... ions-on-amazon-emr/
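
A short sketch of the bucketing idea above; the table name, bucket count, and key are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet("s3://my-bucket/events/")   # path assumed

# Persist the table pre-hashed into buckets on the join key; later joins on
# join_key against a table bucketed the same way can skip the shuffle.
(df.write
   .bucketBy(64, "join_key")
   .sortBy("join_key")
   .format("parquet")
   .mode("overwrite")
   .saveAsTable("events_bucketed"))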

How to choose an AWS EMR node type?
For memory-intensive applications, prefer R-type instances. For compute-intensive applications, prefer C-type instances. For applications balanced between memory and compute, prefer M-type general-purpose instances.
You can run spark-submit with the --verbose option. Also, you can use Ganglia and the Spark UI to monitor application progress, cluster RAM usage, network I/O, etc.

Streaming processing
A very comprehensive summary: https://medium.com/@chandanbaran ... essing-91ea3f04675b
What is unbounded data and bounded data?
Unlike batch processing, where the data is bounded with a start and an end and the job finishes after processing that finite data, streaming processes unbounded data arriving continuously in real time, for days, months, years, indefinitely.

What needs to be considered for streaming processing?
Delivery guarantees: at-least-once, at-most-once, exactly-once (the most desirable but hardest to achieve).
Fault tolerance: distributed replicas; checkpointing the streaming state to persistent storage from time to time.
State management: stateful processing requires maintaining some state (e.g. counts of each distinct word seen in the records); the framework should provide a mechanism to preserve and update state information.
Performance.
Advanced features: event-time processing, watermarks, windowing.
Maturity: proven at big companies, strong community support, compatible with the company's existing tools, reasonable learning curve.
Whether a Lambda architecture is needed.

How do Spark Streaming and Flink, the two most popular engines, compare?
Both do in-memory processing and both are fast. Both support a Lambda architecture, though Flink's batch mode is less proven at big companies. Spark uses micro-batch processing, while Flink uses native stream processing. Flink has fewer parameters to tune. Spark's streaming is comparatively stateless and lacks many of Flink's advanced features. Spark has better community support and a larger user base.
How does native stream processing compare with micro-batch processing?
Native streaming: every record is processed as soon as it arrives, giving the minimum latency. But it is hard to achieve fault tolerance without compromising throughput, since each record must be tracked and checkpointed once processed. State management is easy, because long-running processes can maintain the required state.
Micro-batching: fault tolerance is straightforward since each micro-batch is essentially a batch job, and throughput is high because processing and checkpointing are done in one shot for a group of records. But it adds latency, and efficient state management is harder to maintain.

Kafka
Kafka is a distributed event-streaming, publish/subscribe messaging platform. It acts as a storage broker between data publishers and consumers.
Streaming is different from a message queue: an MQ is point-to-point, where a consumer can pick which individual messages to consume; streaming is publish/subscribe over all events in a topic. Properties of Kafka: keeps stream history, scalable consumption, immutable data, scalable, highly available.
Benefits of Kafka: 1. loose coupling to support microservices, small containers; 2. fully distributed; 3. event-based; 4. zero downtime; 5. easy to scale; 6. no vendor lock-in.
To send an event: 1. choose a topic; 2. decide whether to use a key; 3. choose the acknowledgement level (0 = fire and forget, 1 = wait for one broker, ALL = wait for all brokers); 4. decide whether to retry. No retry risks losing messages; retrying risks duplicate messages.
To receive an event: 1. choose a topic; 2. decide where to start; 3. decide how to manage commit offsets (automatic, manual asynchronous, manual synchronous); 4. choose a consumer group. (See the client sketch below.)
A message is like a row and a topic is like a table. Kafka has multiple brokers (nodes); one topic is distributed by partition across multiple brokers. Within one partition the topic is ordered by the key you defined; across partitions/brokers you have to merge-sort using the functions Kafka provides.
Partitions are replicated: one replica is the leader, the others are followers. Always connect to the leader only. When the leader goes down, Kafka chooses one of the followers as the new leader and updates the metadata.
Kafka Connect allows you to continuously ingest data from external systems into Kafka, and vice versa.
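
A hypothetical sketch of the send/receive choices above using the kafka-python client (one of several Python clients); the broker address, topic name, and group id are assumptions:

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",   # 0 = fire and forget, 1 = wait for the leader, "all" = wait for all in-sync replicas
    retries=3,    # retrying risks duplicates; no retry risks lost messages
)
producer.send("clickstream", key=b"user-42", value=b'{"event": "page_view"}')
producer.flush()

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # consumer group
    auto_offset_reset="earliest",  # where to start when there is no committed offset
    enable_auto_commit=False,      # manage commit offsets manually
)
for record in consumer:
    print(record.key, record.value)
    consumer.commit()              # manual synchronous commit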

~~~Please send some 大米 so I can read interview reports~~

chenwang9527 2021-5-13 01:31, replying to qinshimingyue's question about the DE role and its career path:

In my personal view, DE is going through a period of renewal. Many non-first-tier tech companies are gradually accumulating more data, and their needs around data storage, processing, and analysis are growing. Data processing and analysis that used to be doable with tools like SSIS or Informatica can no longer meet their needs. So many of today's DE openings are at these non-first-tier tech companies, and the experience requirements are not that high; after all, the mainstream data processing frameworks haven't been around that long. A lot of DE postings on LinkedIn ask for Python, Spark, Airflow, AWS, SQL, and data modeling. I think these are the more important DE skills right now. Some also want data streaming, but that's not really a hard requirement; you can learn it on the job, it's nothing difficult. From these skill requirements you can see fairly clearly what DE asks for: Python is general coding, Spark is big data processing/cleaning, Airflow is data pipeline management, AWS is system architecture and simple design, and SQL is a very important data processing skill. Data modeling is mainly about designing how a new data pipeline integrates with existing data, and how to better capture and design the data structures.
Just my personal take.


nn960208 2021-5-12 11:19 (from the app), replying to koupayio, who used to be an MS SQL DBA and wants to move into cloud Data Engineer roles:
This is a very good question. I'm a DE at a small unicorn. The company is preparing to IPO and urgently needs to hire; for the past half year we've had interviews basically every week, but we still haven't found the people we want. First, I think DE requirements differ quite a bit between companies. Some companies only need someone to handle ad-hoc reports and the like, or bigger companies have a fairly complete org structure and the DE work is very narrowly scoped; at a place like ours, though, you're expected to do everything, from acquiring data and building pipelines to maintaining independent infrastructure. Just one of our clickstream datasets is 30-50M rows per day. Experience handling data like that is hard to get from practice exercises. On one hand I constantly see people who want to move into data-related work; on the other hand we can't find the people we need. Having been through it, I have to admit this experience takes time to build, but right now the company doesn't even have the bandwidth to train anyone.

Follow-up (2021-05-12 12:06 +08:00):
I'm not sure my previous reply really helps you, but it seems I can't delete it, so let me add a bit. I think you can approach this from two angles: one is to strengthen other skills such as SQL; the other is, when practicing pipelines or thinking through a problem, to ask what you would do if the data size were very large. Here's a question that comes up often in our interviews: our main language is Python and our main ETL tool is Airflow, and we often ask candidates to write a simple pipeline. Many candidates implement the data transformations with pandas. But most don't stop to think: pandas is fine for a dataset of a couple hundred rows, but if the dataset is millions of rows, is pandas still a good choice?
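
For example, the same kind of aggregation written with PySpark instead of pandas so it scales beyond a single machine; the file paths and column names below are just placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Instead of pandas.read_csv(...).groupby(...), read and aggregate distributedly.
clicks = spark.read.csv("s3://my-bucket/clickstream/*.csv", header=True)
daily = (clicks
         .withColumn("day", F.to_date("event_time"))
         .groupBy("day", "page")
         .agg(F.count("*").alias("views")))
daily.write.mode("overwrite").parquet("s3://my-bucket/clickstream_daily/")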

雨天愁浪 (OP) 2021-5-11 13:06, replying to qinshimingyue's question about the DE role and career path:

DE is mainly about building pipelines. The bar for system design is relatively high, you need to know the various ETL tools, and the bar for algorithms is relatively low. The career path is separate from SDE; you usually work a lot with data teams, and some DEs sit directly under a data org. Once you have more experience, it's still better to go to a big company that has a standalone DE team you can lead.

HHHHarold 2021-5-10 20:07: Thanks for sharing.

BlairXiao 2021-5-10 20:57: Thank you so much! This is amazing!!

redeye1 2021-5-10 20:59: mark big data

Janiceqiu 2021-5-10 21:21: Thanks, OP!

Lenin 2021-5-11 10:11: Great stuff, thanks OP.

qinshimingyue 2021-5-11 11:23: Thanks for sharing. I had a vague interest in DE before, and only after interviewing did I get a rough understanding of the role. OP, what do you think of the DE position? What does the career path usually look like?

Fredbeijixiong 2021-5-11 11:37: Marking this.

Damn, I lead a DE team and I can barely understand this; looks like I'm falling behind.
