What you will be doing
- Design and develop pipelines using Python, PySpark, and SQL
- Use GitLab as the version control system
- Utilize S3 buckets for storing large volumes of raw and processed data
- Implement and manage complex data workflows using Apache Airflow (MWAA) to orchestrate tasks
- Utilize Apache Iceberg (or similar) for managing and organizing data in the data lake
- Create and maintain data catalogs using AWS Glue Catalog to organize metadata
- Use AWS Athena for interactive querying
- Apply data modeling techniques to support analytics and reporting requirements, with an understanding of the data journey stages within a data lake (Medallion Architecture)
What we are looking for
- Ideally, a degree in Information Technology, Computer Science, or a related field
- Ideally, 5+ years of experience in data engineering
- Strong expertise in Python, PySpark, SQL, and the overall AWS data ecosystem
- Strong problem-solving and analytical skills
- Ability to explain technical concepts to non-technical users
- Proficiency in working with GitHub
- Experience with Terraform and CI/CD pipelines is a great ‘nice-to-have’