Site Reliability Engineer
Parabola · Technology
Senior
Salary Range (USD)
Negotiable
Location
San Francisco, USA
Visa Support
Not mentioned
Funding Stage
Unknown
Job Responsibilities
- Own our monitoring and alerting infrastructure
- Drive incident response
- Work closely with engineering to make sure we have deep visibility into everything that matters
- Use LLMs heavily in your work: writing runbooks, generating alert configs, drafting postmortems, and building dashboards
Required Skills
- Hands-on experience with Prometheus and Grafana (or similar: Datadog, Honeycomb, etc.)
- Strong instincts for what to instrument and what good alerting actually looks like
- Comfort debugging distributed systems across the full stack
- Experience owning on-call and incident response end to end
- AWS familiarity and enough IaC experience to get things done (CDK or Terraform)
Engineering Culture & Tech Stack
Prometheus, Grafana, AWS, LLMs, CDK
- Passionate about observability and keeping systems healthy and understandable
- Already comfortable making LLM use feel like a natural part of the work
Raw Post
Parabola | Site Reliability Engineer | San Francisco, CA (Hybrid) | Full-time
Parabola is a no-code data workflow automation tool that helps operations teams move, transform, and automate their data without writing code. LLMs are a core part of our product — we use them to help users build and reason about their workflows — and they're increasingly part of how we run infrastructure too. We're a small, product-focused team and our infrastructure runs on AWS.
We're looking for an SRE who's passionate about observability and keeping systems healthy and understandable. You'll own our monitoring and alerting infrastructure, drive incident response, and work closely with engineering to make sure we have deep visibility into everything that matters. We expect you to use LLMs heavily in your work — writing runbooks, generating alert configs, drafting postmortems, building dashboards — and we want someone who's already figured out how to make that feel natural.
What you'll work on:
Observability stack — Prometheus, Grafana, dashboards, alerting, and on-call workflows
Incident response and postmortems — building a culture of learning from failures
SLIs, SLOs, and error budgets — helping the team make data-driven reliability decisions
Monitoring LLM-specific infrastructure: latency, token throughput, model error rates, cost attribution
AWS infrastructure across our stack (Lambda, ECS, RDS, OpenSearch, CloudFront, etc.)
CDK-based IaC and CI/CD pipelines as needed
What we're looking for:
Hands-on experience with Prometheus and Grafana (or similar — Datadog, Honeycomb, etc.)
Strong instincts for what to instrument and what good alerting actually looks like
Comfort debugging distributed systems across the full stack
Experience owning on-call and incident response end to end
AWS familiarity and enough IaC experience to get things done (CDK or Terraform)
Someone who reaches for an LLM before writing boilerplate from scratch — and knows when not to
Nice to have: experience instrumenting LLM pipelines, familiarity with TypeScript/Node.js, startup experience, or a background in security and compliance.
Reach out at cj@parabola.io or apply at https://jobs.ashbyhq.com/parabola-io/75141699-0666-4baa-b03f....
AI Risk Insights
No major risk signals detected.
Recent News
No recent updates
Data Source
Content parsed by LLM from Hacker News raw data. Confidence: HIGH