Data Engineer Roadmap
- Roadmap: https://roadmap.sh/data-engineer
1. Introduction
- 1.1 What is Data Engineering
- 1.2 Data Engineering vs Data Science
- 1.3 Data Engineering Lifecycle
- 1.4 Skills and Responsibilities
2. Programming Skills
- 2.1 Python
- 2.2 Java
- 2.3 Scala
- 2.4 Go
- 2.5 SQL (Learn SQL)
- 2.6 Linux Basics
- 2.7 Git and GitHub
- 2.8 Data Structures and Algorithms
3. Database Fundamentals
- 3.1 Database Fundamentals
- 3.2 Relational Databases
- 3.3 NoSQL Databases
- 3.4 OLTP vs OLAP
- 3.5 Transactions
- 3.6 Indexing
- 3.7 Data Normalization
- 3.8 Idempotency
4. Relational Databases
- 4.1 PostgreSQL
- 4.2 MySQL
- 4.3 MariaDB
- 4.4 MS SQL
- 4.5 Oracle
5. NoSQL Databases
- 5.1 Document (MongoDB, CouchDB)
- 5.2 Key-Value (Redis, Memcached, DynamoDB)
- 5.3 Column (Cassandra, HBase, Bigtable)
- 5.4 Graph (Neo4j, Neptune)
- 5.5 What and Why Use Them
6. Data Warehousing
- 6.1 What is a Data Warehouse
- 6.2 Data Warehousing Architectures
- 6.3 Star vs Snowflake Schema
- 6.4 Data Modelling Techniques
- 6.5 Slowly Changing Dimensions (SCD)
- 6.6 Data Mart
- 6.7 Data Lake
- 6.8 Data Hub
- 6.9 Data Mesh
- 6.10 Data Fabric
- 6.11 Metadata-First Architecture
7. Cloud Warehouses
- 7.1 Snowflake
- 7.2 Amazon Redshift
- 7.3 Google BigQuery
- 7.4 Databricks Delta Lake
- 7.5 Onehouse
8. Data Ingestion
- 8.1 Types of Data Ingestion
- 8.2 Sources of Data
- 8.3 Data Collection Considerations
- 8.4 Batch
- 8.5 Streaming
- 8.6 Realtime
- 8.7 Apache Kafka
- 8.8 RabbitMQ
- 8.9 AWS SNS / SQS
- 8.10 Messages vs Streams
- 8.11 Messaging Systems
9. Data Pipelines and ETL
- 9.1 Data Pipelines
- 9.2 ETL vs Reverse ETL
- 9.3 Extract Data
- 9.4 Transform Data
- 9.5 Load Data
- 9.6 Apache Airflow
- 9.7 Luigi
- 9.8 Prefect
- 9.9 dbt
- 9.10 Glue ETL
- 9.11 Data Factory ETL
- 9.12 Dataflow
10. Reverse ETL
- 10.1 Reverse ETL Concepts
- 10.2 Use Cases
- 10.3 Tools: Hightouch, Census, Segment
11. Big Data Tools
- 11.1 Big Data Tools Overview
- 11.2 Apache Hadoop YARN
- 11.3 HDFS
- 11.4 Apache Spark
- 11.5 MapReduce
12. Cluster Computing
- 12.1 What is Cluster Computing
- 12.2 Cluster Computing Basics
- 12.3 Cluster Management Tools
13. Distributed Systems
- 13.1 Distributed Systems Basics
- 13.2 CAP Theorem
- 13.3 Horizontal vs Vertical Scaling
- 13.4 Async vs Sync Communication
- 13.5 Distributed File Systems
14. Cloud Computing
- 14.1 Cloud Computing Basics
- 14.2 Cloud Architectures
- 14.3 Hybrid
- 14.4 Serverless Options
- 14.5 AWS (EC2, RDS, Aurora, S3, EKS, CDK)
- 14.6 Azure (Blob Storage, SQL DB, VMs)
- 14.7 GCP (Compute Engine, Cloud Storage, GKE, Cloud SQL, Deployment Manager)
15. Containers and Orchestration
- 15.1 Containers and Orchestration
- 15.2 Docker
- 15.3 Kubernetes
- 15.4 ArgoCD
16. Infrastructure as Code
- 16.1 Infrastructure as Code (IaC)
- 16.2 Declarative vs Imperative
- 16.3 Terraform
- 16.4 OpenTofu
17. CI/CD
- 17.1 CI/CD Concepts
- 17.2 GitHub Actions
- 17.3 GitLab CI
- 17.4 CircleCI
- 17.5 Environment Management
18. Data Governance and Security
- 18.1 Authentication vs Authorization
- 18.2 Encryption
- 18.3 Data Masking
- 18.4 Data Obfuscation
- 18.5 Tokenization
- 18.6 Data Quality
- 18.7 Data Lineage
- 18.8 Metadata Management
- 18.9 Data Interoperability
19. Compliance
- 19.1 GDPR
- 19.2 ECPA
- 19.3 EU AI Act
20. Testing
- 20.1 Unit Testing
- 20.2 Integration Testing
- 20.3 End-to-End Testing
- 20.4 Smoke Testing
- 20.5 Functional Testing
- 20.6 Load Testing
- 20.7 A/B Testing
- 20.8 Data Generation
21. Monitoring and Observability
- 21.1 Monitoring
- 21.2 Logs
- 21.3 Prometheus
- 21.4 Datadog
- 21.5 New Relic
- 21.6 Sentry
22. Analytics and BI
- 22.1 Data Analytics
- 22.2 Business Intelligence
- 22.3 Tableau
- 22.4 Looker
- 22.5 Microsoft Power BI
- 22.6 Streamlit
23. Machine Learning / MLOps
- 23.1 Machine Learning Basics
- 23.2 MLOps
24. Other Topics
- 24.1 APIs
- 24.2 IoT
- 24.3 Mobile Apps
- 24.4 Choosing the Right Technologies
- 24.5 Reusability
- 24.6 Best Practices
- 24.7 Job Scheduling