Data Engineer Roadmap

  • Roadmap: https://roadmap.sh/data-engineer

1. Introduction

  • 1.1 What is Data Engineering
  • 1.2 Data Engineering vs Data Science
  • 1.3 Data Engineering Lifecycle
  • 1.4 Skills and Responsibilities

2. Programming Skills

  • 2.1 Python
  • 2.2 Java
  • 2.3 Scala
  • 2.4 Go
  • 2.5 SQL
  • 2.6 Linux Basics
  • 2.7 Git and GitHub
  • 2.8 Data Structures and Algorithms

3. Database Fundamentals

  • 3.1 Database Fundamentals
  • 3.2 Relational Databases
  • 3.3 NoSQL Databases
  • 3.4 OLTP vs OLAP
  • 3.5 Transactions
  • 3.6 Indexing
  • 3.7 Data Normalization
  • 3.8 Idempotency

4. Relational Databases

  • 4.1 PostgreSQL
  • 4.2 MySQL
  • 4.3 MariaDB
  • 4.4 MS SQL
  • 4.5 Oracle

5. NoSQL Databases

  • 5.1 Document (MongoDB, CouchDB)
  • 5.2 Key-Value (Redis, Memcached, DynamoDB)
  • 5.3 Column (Cassandra, HBase, Bigtable)
  • 5.4 Graph (Neo4j, Neptune)
  • 5.5 What and Why Use Them

6. Data Warehousing

  • 6.1 What is a Data Warehouse
  • 6.2 Data Warehousing Architectures
  • 6.3 Star vs Snowflake Schema
  • 6.4 Data Modelling Techniques
  • 6.5 Slowly Changing Dimensions (SCD)
  • 6.6 Data Mart
  • 6.7 Data Lake
  • 6.8 Data Hub
  • 6.9 Data Mesh
  • 6.10 Data Fabric
  • 6.11 Metadata-First Architecture

7. Cloud Warehouses

  • 7.1 Snowflake
  • 7.2 Amazon Redshift
  • 7.3 Google BigQuery
  • 7.4 Databricks Delta Lake
  • 7.5 Onehouse

8. Data Ingestion

  • 8.1 Types of Data Ingestion
  • 8.2 Sources of Data
  • 8.3 Data Collection Considerations
  • 8.4 Batch
  • 8.5 Streaming
  • 8.6 Realtime
  • 8.7 Apache Kafka
  • 8.8 RabbitMQ
  • 8.9 AWS SNS / SQS
  • 8.10 Messages vs Streams
  • 8.11 Messaging Systems

9. Data Pipelines and ETL

  • 9.1 Data Pipelines
  • 9.2 ETL vs Reverse ETL
  • 9.3 Extract Data
  • 9.4 Transform Data
  • 9.5 Load Data
  • 9.6 Apache Airflow
  • 9.7 Luigi
  • 9.8 Prefect
  • 9.9 dbt
  • 9.10 Glue ETL
  • 9.11 Data Factory ETL
  • 9.12 Dataflow

10. Reverse ETL

  • 10.1 Reverse ETL Concepts
  • 10.2 Use Cases
  • 10.3 Tools: Hightouch, Census, Segment

11. Big Data Tools

  • 11.1 Big Data Tools Overview
  • 11.2 Apache Hadoop YARN
  • 11.3 HDFS
  • 11.4 Apache Spark
  • 11.5 MapReduce

12. Cluster Computing

  • 12.1 What is Cluster Computing
  • 12.2 Cluster Computing Basics
  • 12.3 Cluster Management Tools

13. Distributed Systems

  • 13.1 Distributed Systems Basics
  • 13.2 CAP Theorem
  • 13.3 Horizontal vs Vertical Scaling
  • 13.4 Async vs Sync Communication
  • 13.5 Distributed File Systems

14. Cloud Computing

  • 14.1 Cloud Computing Basics
  • 14.2 Cloud Architectures
  • 14.3 Hybrid
  • 14.4 Serverless Options
  • 14.5 AWS (EC2, RDS, Aurora, S3, EKS, CDK)
  • 14.6 Azure (Blob Storage, SQL DB, VMs)
  • 14.7 GCP (Compute Engine, Cloud Storage, GKE, Cloud SQL, Deployment Manager)

15. Containers and Orchestration

  • 15.1 Containers and Orchestration
  • 15.2 Docker
  • 15.3 Kubernetes
  • 15.4 ArgoCD

16. Infrastructure as Code

  • 16.1 Infrastructure as Code (IaC)
  • 16.2 Declarative vs Imperative
  • 16.3 Terraform
  • 16.4 OpenTofu

17. CI/CD

  • 17.1 CI/CD Concepts
  • 17.2 GitHub Actions
  • 17.3 GitLab CI
  • 17.4 CircleCI
  • 17.5 Environment Management

18. Data Governance and Security

  • 18.1 Authentication vs Authorization
  • 18.2 Encryption
  • 18.3 Data Masking
  • 18.4 Data Obfuscation
  • 18.5 Tokenization
  • 18.6 Data Quality
  • 18.7 Data Lineage
  • 18.8 Metadata Management
  • 18.9 Data Interoperability

19. Compliance

  • 19.1 GDPR
  • 19.2 ECPA
  • 19.3 EU AI Act

20. Testing

  • 20.1 Unit Testing
  • 20.2 Integration Testing
  • 20.3 End-to-End Testing
  • 20.4 Smoke Testing
  • 20.5 Functional Testing
  • 20.6 Load Testing
  • 20.7 A/B Testing
  • 20.8 Data Generation

21. Monitoring and Observability

  • 21.1 Monitoring
  • 21.2 Logs
  • 21.3 Prometheus
  • 21.4 Datadog
  • 21.5 New Relic
  • 21.6 Sentry

22. Analytics and BI

  • 22.1 Data Analytics
  • 22.2 Business Intelligence
  • 22.3 Tableau
  • 22.4 Looker
  • 22.5 Microsoft Power BI
  • 22.6 Streamlit

23. Machine Learning / MLOps

  • 23.1 Machine Learning Basics
  • 23.2 MLOps

24. Other Topics

  • 24.1 APIs
  • 24.2 IoT
  • 24.3 Mobile Apps
  • 24.4 Choosing the Right Technologies
  • 24.5 Reusability
  • 24.6 Best Practices
  • 24.7 Job Scheduling