Site Reliability Engineer (SRE) at LockedIn AI
33 Irving Pl, Manhattan, New York, United States
About the job
About LockedIn AI
LockedIn AI is the #1 real-time AI interview and meeting copilot, trusted by over 1 million users worldwide. We build AI-powered systems that help users perform better in live interviews, coding assessments, and professional conversations.
Our infrastructure powers real-time AI experiences where latency, uptime, and reliability directly impact user success.
Job Overview
We are hiring a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our production systems.
This is a high-impact role where you will own the stability of real-time AI infrastructure that serves over 1M users. When users are in live interviews, system delays or failures directly affect their outcomes, and your work is what keeps those delays and failures from reaching them.
You will design resilient systems, automate operations, and improve observability across our entire AI stack, including inference pipelines, APIs, and cloud infrastructure.
Employment Details
Type: Full-Time
Work Model: Remote (US-based) with optional hybrid in New York, NY
Reports To: Co-Founder / CEO
Compensation: $145,000 – $200,000 USD per year + equity
Key Responsibilities
Reliability, Availability & Performance
Own uptime, latency, and reliability for all production systems
Define and manage SLIs, SLOs, and error budgets
Design fault-tolerant and self-healing architectures
Optimize system performance under high load and real-time AI traffic
Improve latency for inference, APIs, and streaming systems
Infrastructure as Code & Cloud Systems
Build and maintain infrastructure using Terraform, Pulumi, or CloudFormation
Design scalable cloud architectures on AWS, GCP, or Azure
Manage Kubernetes clusters and containerized microservices
Implement cost-efficient and scalable infrastructure strategies
Observability & Monitoring
Build monitoring systems using Prometheus, Grafana, Datadog, or equivalent
Design alerting systems with low noise and clear escalation paths
Implement distributed tracing and centralized logging
Monitor AI-specific metrics such as inference latency, GPU usage, and throughput
Incident Response & Reliability Engineering
Lead incident response during production outages
Perform root cause analysis and write blameless postmortems
Build runbooks and escalation procedures for engineering teams
Continuously improve MTTR and system resilience
CI/CD & Deployment Systems
Build and maintain CI/CD pipelines for code and AI model deployments
Implement canary, blue-green, and automated rollback strategies
Ensure safe, fast, and reliable production deployments
Add validation gates and automated testing for deployments
Security & Compliance
Implement security best practices across infrastructure
Manage IAM, secrets, encryption, and network security
Ensure privacy-first compliance for all systems
Maintain vulnerability scanning and patch management processes
Required Qualifications
Experience
3+ years in SRE, DevOps, or infrastructure engineering roles
Experience owning production systems at scale
Strong incident management and postmortem experience
Experience working in fast-paced startup environments
Technical Skills
Strong programming skills in Python, Go, or similar
Deep knowledge of AWS, GCP, or Azure
Experience with Kubernetes and Docker in production
Hands-on expertise with Infrastructure as Code (Terraform, Pulumi, etc.)
Experience with monitoring and observability tools
Experience building CI/CD pipelines
Soft Skills
Strong reliability-first engineering mindset
Calm and structured thinking during incidents
Strong communication and documentation skills
Ownership-driven and proactive problem solver
Preferred Qualifications
Experience with AI/ML infrastructure or GPU-based systems
Familiarity with real-time or streaming systems
Experience with chaos engineering practices
Knowledge of AI observability and model monitoring
Multi-cloud infrastructure experience
Open-source or startup background (Seed to Series A preferred)
What We Offer
Equity & Ownership
Meaningful early-stage equity
High ownership of critical production systems
Impact
Systems you build directly impact 1M+ users
Your work defines reliability of real-time AI experiences
Culture & Team
Small, fast-moving engineering team
High autonomy and trust-based execution
Strong focus on engineering excellence
Flexibility
Remote-first (US-based) culture
Optional hybrid workspace in New York, NY
Growth
Opportunity to build reliability systems for cutting-edge AI infrastructure
Work at the intersection of distributed systems and real-time AI
Why Join LockedIn AI?
We are building one of the most latency-sensitive AI products in the world. Reliability is not a backend concern—it is the product itself.
If you want to build systems where every millisecond matters and your work directly impacts user success in high-stakes moments, this is the role for you.
Organization Information
Address
LockedIn AI
33 Irving Pl, Manhattan, New York, United States
New York, NY 10003, USA
Contact
LockedIn AI
+1 (862) 279-7219
[email protected]
Website
www.lockedinai.com