Amazon · System Design · 60 min · Members
Distributed Training Data Pipeline (FAR)
Design a distributed pipeline for LLM training-data preprocessing: ingest from S3 / streaming, tokenize, deduplicate, quality-filter, and output sharded training corpora. Amazon FAR onsite Staff-level...
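The stages named in the prompt (tokenize, deduplicate, quality-filter, shard) can be sketched on a single machine as below. This is a minimal illustration under stated assumptions, not a reference answer: whitespace splitting stands in for a real subword tokenizer, deduplication is exact-match on normalized text (a production pipeline would likely add near-duplicate detection), the quality filter is a hypothetical minimum-length check, and every function name and the `num_shards` parameter are invented for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace split as a placeholder for a subword tokenizer.
    return text.split()


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    # Exact dedup on case-folded, whitespace-normalized text.
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def passes_quality(tokens: List[str], min_tokens: int = 3) -> bool:
    # Hypothetical filter: drop documents that are too short.
    return len(tokens) >= min_tokens


def shard_for(doc: str, num_shards: int) -> int:
    # Stable hash so the same document always maps to the same shard.
    digest = hashlib.sha256(doc.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def run_pipeline(docs: Iterable[str], num_shards: int = 4) -> Dict[int, List[List[str]]]:
    shards: Dict[int, List[List[str]]] = defaultdict(list)
    for doc in dedup_exact(docs):
        tokens = tokenize(doc)
        if passes_quality(tokens):
            shards[shard_for(doc, num_shards)].append(tokens)
    return dict(shards)
```

In a distributed setting each function would map to a stage that can be scaled independently, with the hash-based shard assignment keeping output placement deterministic across workers and retries.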
MLE
RE
Infra Eng
system-design
storage
scaling
dedup
hard
Frequency: Single report
Last asked: 2026-03-21
Stage: onsite-system-design
