Vacancy expired!
- Cleanse, manipulate, and analyze large datasets (structured and unstructured data: XML, JSON, PDF) on the Hadoop platform.
- Develop Python, PySpark, and Spark scripts to filter, cleanse, map, and aggregate data.
- Build dashboards in R/Shiny for end-user consumption.
- Manage and implement data processes (e.g., data-quality reports).
- Develop data-profiling, deduplication, and record-matching logic for analysis.
- Programming experience in Python, PySpark, and Spark for data ingestion.
- Programming experience on big-data platforms, particularly Hadoop.
- Present ideas and recommendations to management on the best use of Hadoop and other technologies.
- 5+ years of experience processing large volumes and varieties of data (structured and unstructured: XML, JSON, PDF), including writing code for parallel processing - Mandatory
- 5+ years of programming experience in Python and PySpark for data processing and analysis - Mandatory
- Strong SQL experience - Mandatory
- 3+ years of experience using the Hadoop platform for analysis, including familiarity with Hadoop cluster environments and resource-management configuration for analysis work
- 2+ years of experience with containerization and orchestration.
- Hands-on experience with AWS, Kubernetes, Kubeflow, Docker, etc.
- Detail-oriented, with excellent verbal and written communication skills
- Must be able to manage multiple priorities and meet deadlines
- Degree in Statistics, Economics, Business, Mathematics, Computer Science, or a related field
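
For candidates gauging the cleanse/dedupe/aggregate expectation above, here is a minimal sketch of that pattern in plain Python. The records, field names, and normalization rule are hypothetical illustrations; in the role itself, the same logic would typically be expressed over PySpark DataFrames (e.g., with `groupBy`/`agg`) on a Hadoop cluster.

```python
import re
from collections import defaultdict

# Hypothetical raw records with inconsistent formatting, standing in for
# the structured/unstructured inputs (XML, JSON, PDF extracts) the role mentions.
records = [
    {"name": "  Acme Corp ", "amount": "100"},
    {"name": "acme corp",    "amount": "250"},
    {"name": "Globex",       "amount": "75"},
]

def normalize(name: str) -> str:
    """Cleanse step: trim, lowercase, and collapse whitespace to form
    a simple matching key for deduplication."""
    return re.sub(r"\s+", " ", name.strip().lower())

# Aggregate step: sum amounts per deduplicated key, so the two
# "Acme Corp" variants collapse into one record.
totals = defaultdict(int)
for rec in records:
    totals[normalize(rec["name"])] += int(rec["amount"])

print(dict(totals))  # → {'acme corp': 350, 'globex': 75}
```

The matching key here is deliberately naive; real record-matching logic would likely layer on fuzzy matching or domain-specific rules.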