Description:
We seek an experienced and highly skilled Azure Infrastructure and Site Reliability Engineer to join Greystar’s Data, Digital, and AI team (D2AI). As an Azure infrastructure engineer, you will design, implement, and manage Azure-based infrastructure and cloud solutions, with additional expertise in Azure Databricks. The ideal candidate will be responsible for ensuring system scalability, security, and performance by leveraging DevOps best practices, Infrastructure as Code (IaC), and advanced data processing workflows. This is a 100% hands-on role that requires deep technical expertise and the ability to collaborate effectively across teams.
As an SRE on the D2AI team, you will be responsible for ensuring the stability, performance, and availability of our cloud-based products, both internal and external customer-facing. Your role will be crucial in ensuring seamless operations and rapid issue resolution.
Job Description
What You Will Do:
Azure Architecture and Resource Management:
- Design, implement, and manage Azure solutions to meet technical and operational needs.
- Documentation: Maintain comprehensive network and server documentation, including infrastructure diagrams, server configurations, standard operating procedures, and incident reports.
- Optimize Azure resource configuration for performance, cost, and security.
- Monitor the health and reliability of Azure resources, ensuring high availability.
- Continuously monitor network/server performance, using advanced network management and server administration tools to identify issues proactively.
- Monitor and manage the health, performance, and availability of our applications running on Azure (including ADF pipelines).
Support for Application Development Teams:
- Collaborate with development teams to align Azure architecture with application requirements.
- Provide guidance on best practices for Azure resource provisioning, scaling, and configuration.
- Enable teams to leverage Azure services, including Databricks, for analytics and data workflows.
- Incident Detection and Response: Detect and analyze network and server anomalies, security threats, and performance bottlenecks. Initiate incident response procedures and coordinate with relevant teams for swift resolution.
- Troubleshooting: Investigate and resolve infrastructure, network, and server issues, escalate complex problems to higher-level teams, and maintain detailed incident documentation.
Azure Databricks Expertise:
- Set up and manage Azure Databricks environments for big data processing and advanced analytics.
- Support and optimize Databricks pipelines for data engineers and scientists.
- Troubleshoot and resolve Databricks-related challenges effectively.
DevOps and Infrastructure Management:
- Develop and maintain Infrastructure as Code (IaC) scripts using Terraform.
- Implement DevOps practices, including CI/CD pipelines, automated testing, and monitoring.
- Streamline workflows by collaborating with IT and operations teams.
Troubleshooting and Issue Resolution:
- Act as the primary point of contact for Azure-related issues within the project.
- Investigate, diagnose, and resolve complex technical issues in collaboration with development and operations teams.
- Implement preventive measures to minimize downtime and disruptions.
Continuous Improvement:
- Analyze trends and metrics to identify areas for improvement and optimization.
- Identify and implement cost optimization opportunities within Azure infrastructure and services.
- Conduct regular reviews of Azure cost management using cost management tools.
- Stay updated on emerging Azure technologies, AI, and cloud computing trends to drive innovation.
- Identify opportunities to improve processes, tools, and systems to enhance efficiency and scalability.