AI Ops Engineer

Description:

The AIOps Engineer is responsible for integrating machine learning and advanced analytics into our existing monitoring and logging systems. This role will leverage artificial intelligence to automate operations, deliver phased improvements toward operational excellence, detect anomalies proactively, and implement self-healing frameworks that enhance the stability and performance of our infrastructure. The ideal candidate will be proactive in identifying gaps and helping to develop solutions.

Key Responsibilities

Apply machine learning algorithms to existing operational data (logs, metrics, events) to predict system failures and proactively address potential incidents.

Implement automation for routine DevOps practices including automated scaling, resource optimization, and controlled restarts.

Develop and maintain self-healing systems to reduce manual intervention and enhance system reliability.

Build anomaly detection models to quickly identify and address unusual operational patterns.

Collaborate closely with SREs, developers, and infrastructure teams to continuously enhance the operational stability and performance of the system.

Provide insights and improvements through visualizations and reports leveraging AI-driven analytics.

Create a phased roadmap to incrementally enhance operational capabilities and align with strategic business goals.

Required Skills and Qualifications

Strong experience with AI/ML frameworks and tools (e.g., TensorFlow, PyTorch, scikit-learn).

Proficiency in data processing and analytics tools (e.g., Splunk, Prometheus, Grafana, ELK stack).

Solid background in scripting and automation (Python, Bash, Ansible, etc.).

Experience with cloud environments and infrastructure automation.

Proven track record in implementing proactive monitoring, anomaly detection, and self-healing techniques.

Excellent analytical, problem-solving, and strategic planning skills.

Strong communication skills and the ability to collaborate effectively across teams.

Preferred Experience

Background in DevOps/Site Reliability Engineering.

Familiarity with containerization and orchestration platforms (Kubernetes, Docker).

Experience in building scalable, distributed systems.

This role is pivotal in enabling our organization to achieve and sustain Operational Excellence through intelligent automation and proactive monitoring practices.

Organization	Highbrow LLC
Industry	Engineering Jobs
Occupational Category	AI Ops Engineer
Job Location	Texas, USA
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Intermediate
Experience	2 Years
Posted at	2025-08-09 7:51 am
Expires on	2026-08-06