Description:
We’re seeking an experienced AI Platforms Leader to own the strategy, architecture, and operation of our end‑to‑end AI Platform, spanning on‑prem GPU clusters and cloud services (AWS/GCP/Azure). You’ll lead a high‑caliber engineering team to deliver reliable, secure, and cost‑efficient infrastructure for training, fine‑tuning, inference, retrieval, and agentic orchestration, including agent‑to‑agent (A2A) patterns and Model Context Protocol (MCP) servers. If you love turning complex AI/ML requirements into robust, self‑service platform capabilities for builders across the company, this is your role.
This role requires full-time onsite work in San Diego, CA (5 days per week).
Key Responsibilities
- Own the AI Platform strategy & roadmap
  - Define the multi‑year vision for a multi‑tenant, hybrid (on‑prem + cloud) AI platform, aligned to business needs, developer productivity, and cost efficiency.
  - Establish clear platform SLAs/SLOs, reliability goals, and security/compliance guardrails.
- Run GPU‑based compute at scale
  - Operate and optimize on‑prem GPU clusters (e.g., Kubernetes + GPU operator and/or Slurm), including capacity planning, scheduling, partitioning, NCCL, and high‑throughput storage/networking.
  - Drive GPU utilization efficiency, right‑sizing, and cost transparency across training and inference workloads.
- Deliver MLOps & LLMOps as a product
  - Provide golden paths for data prep, training/fine‑tuning, model registry, lineage, governance, evaluation, red‑teaming, and safe deployment (batch, online, streaming).
  - Implement CI/CD for models, prompts, and agents; automate evaluations and rollout/rollback with canaries, A/B, and shadow deployments.
- Agentic AI, A2A, and MCP ecosystem
  - Lead the design and operation of agentic orchestration (A2A patterns), tool integration, and MCP (Model Context Protocol) servers to securely expose enterprise tools and data.
  - Standardize agent capability schemas, guardrails, observability, and policy enforcement.
- Cloud AI/ML platforms
  - Leverage AWS/GCP/Azure AI services for training and inference (e.g., Bedrock/SageMaker/EKS; Vertex AI/GKE; Azure AI Studio/Azure ML/AKS/Azure OpenAI) with robust networking, identity, secrets, and cost controls.
  - Establish multi‑cloud patterns for portability, resilience, and vendor risk management.
- Platform engineering & DevOps excellence
  - Own core platform services: identity/RBAC, secrets, service meshes, observability (logs/metrics/traces), data access controls, vector stores, feature stores, and model gateways (e.g., KServe/Triton/vLLM).
  - Use GitOps/IaC (Terraform/Bicep/Helm) and secure software supply chain practices (SBOMs, image signing, policy as code).
- Operational leadership
  - Lead a global team of ~10 engineers (platform, SRE, MLOps/LLMOps) with 24×7 readiness and a healthy on‑call rotation.
  - Drive incident response, post‑mortems, and continuous improvement; partner with Security, Legal, and Compliance on model/data governance.
- Stakeholder & vendor management
  - Partner with product, data, and application teams to enable high‑impact AI use cases.
  - Manage strategic vendors (e.g., cloud, GPU, enterprise AI tooling) and negotiate licenses/SOWs aligned to roadmap and budget.
Required Qualifications
- 15+ years overall engineering/technology experience, including ~10 years building and operating large‑scale platforms (AI/ML, data, or high‑performance computing).
- Leadership: Proven experience leading a team of ~10 engineers for 5+ years, across platform/SRE/MLOps/LLMOps, with coaching, hiring, performance management, and clear execution rhythms.
- GPU cluster expertise: Hands‑on operations for on‑prem GPU clusters (Kubernetes + GPU operator and/or Slurm), scheduling, capacity planning, performance tuning, and reliability.
- MLOps & LLMOps: Strong experience with model lifecycle (data → training → registry → deployment), model/agent evaluation, safety/guardrails, and observability.
- Cloud (AWS/GCP/Azure): Deep experience with AI/ML services and managed Kubernetes (EKS/AKS/GKE), networking, security, identity, and cost management.
- DevOps/Platform Engineering: CI/CD, GitOps, IaC (Terraform/Bicep/Helm), containerization (Docker), Kubernetes, and secure SDLC practices.
- Agentic AI & MCP: Solid understanding of agent orchestration, A2A patterns, tool abstractions, and operating MCP servers in production.
- Operational excellence: Demonstrated success running AI or computing clusters with SLOs, on‑call, incident management, and post‑mortems.
- Global collaboration: Experience leading a distributed engineering team across time zones.
- Education: Bachelor’s degree in Engineering, Computer Science, or related field.