AI Insights

Software Engineer - Production Engineering

Snowflake · Bozeman, Montana, US

Full-time · Mid (3-6 yrs) · Posted 71d ago

Software Engineering · IC2 · IC · On-site

Stack: Go · Kubernetes · Linux · AWS · Azure · GCP · Distributed Systems · Observability · Incident Response · SLOs · SLAs · Containers · Cloud Infrastructure · Automation · Monitoring · Postmortem Analysis

Summary

A Production Engineering Software Engineer role at Snowflake focused on reliability engineering, SLO ownership, observability, and large-scale distributed systems. The team drives proactive reliability culture including incident management, blameless postmortems, and automation-first operations across Snowflake's cloud infrastructure.

About the role

At Snowflake, we are powering the era of the agentic enterprise. To usher in this new era, we seek AI-native thinkers across every function who are energized by the opportunity to reinvent how they work. You don’t just use tools; you possess an innate curiosity, treating AI as a high-trust collaborator that is core to how you solve problems and accelerate your impact. We look for low-ego individuals who thrive in dynamic and fast-moving environments and move with an experimental mindset — who rapidly test emerging capabilities to discover simpler, more powerful ways to deliver results. At Snowflake, your role isn't just to execute a function, but to help redefine the future of how work gets done.

The Production Engineering Team at Snowflake is responsible for driving the reliability tools and processes that ensure Snowflake consistently delivers a top-tier experience for its customers. This includes championing Service Level Objectives (SLOs) across all of Engineering, building the infrastructure necessary for rapid detection of reliability issues, and deeply engaging in system health verification after releases. We think about production reliability end-to-end: how do we proactively prevent issues, quickly detect and diagnose problems when they arise, and efficiently resolve them to minimize impact. We drive the culture of learning from every incident.

RESPONSIBILITIES:

Improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Scale systems sustainably through automation; participate in changes that improve reliability and velocity.
Establish and practice low-noise incident response rotations and blameless postmortems to prevent problem recurrence.
Write and review code. Develop documentation and capacity plans, and debug the hardest problems on large distributed systems.
Collaborate with software engineers to establish, maintain, and optimize functional and performance SLOs.
Participate in a 12x7 on-call rotation.

MINIMUM QUALIFICATIONS:

Bachelor's degree in Computer Science, a related technical field involving software engineering, or equivalent practical experience.
Proficient in at least one modern programming language, preferably Golang.
Systematic problem-solving methods, effective communication skills.

PREFERRED QUALIFICATIONS:

3+ years of industry experience building and supporting large-scale systems in production.
Experience with modern observability tools and production monitoring practices.
Experience with containers and container orchestration systems such as Kubernetes.
Experience deploying, managing, and operating scalable, fault-tolerant Linux infrastructure.
Hands-on experience with one or more public cloud providers (AWS, Azure, or GCP).

Snowflake is growing fast, and we’re scaling our team to help enable and accelerate our growth. We are looking for people who share our values, challenge ordinary thinking, and push the pace of innovation while building a future for themselves and Snowflake.

How do you want to make your impact?

For jobs located in the United States, please visit the job posting on the Snowflake Careers Site for salary and benefits information: careers.snowflake.com

What you'll do

1. Improve the full lifecycle of services from inception and design through deployment, operation, and refinement

2. Scale systems sustainably through automation and participate in reliability and velocity improvements

3. Establish and lead low-noise incident response rotations and blameless postmortems to prevent recurrence

4. Write and review code, develop documentation and capacity plans, and debug complex distributed systems issues

5. Collaborate with software engineers to establish, maintain, and optimize functional and performance SLOs

6. Participate in a 12x7 on-call rotation

Requirements

■ Proficiency in at least one modern programming language, preferably Go, to build reliability tooling and automation

■ Systematic problem-solving approach to debug and resolve issues in large-scale distributed systems

■ Strong communication skills to collaborate cross-functionally with software engineers on SLO design and capacity planning

■ Experience or strong aptitude for reliability engineering practices including on-call rotations, incident response, and blameless postmortems

Nice to have

■ Go

■ Kubernetes

■ AWS

■ Azure

■ GCP

■ Linux

■ Prometheus

■ Grafana

■ Datadog

■ OpenTelemetry

■ Terraform

■ Docker

Role overview

Role family

Software Engineering

Level

IC2 (DevOps/SRE)

Experience

3–6 years

Type

Individual Contributor

Remote policy

On-site

Visa sponsorship

Not offered

Tech stack analysis

LANGUAGES

INFRASTRUCTURE

Kubernetes · Docker · AWS · Azure · GCP · Linux

TOOLS

Observability tooling (unspecified) · Production monitoring platforms

Salary estimate

$140K – $185K

AI-estimated salary range

Confidence: 72%

Reasoning

Snowflake is a high-growth, publicly traded cloud data company known for competitive, above-market compensation. For a mid-level software engineer in Production Engineering/SRE at a tier-1 tech company, total cash (base + bonus) typically ranges from $140K to $185K USD. The Bozeman, MT location may slightly compress base pay relative to SF/NYC, but Snowflake has historically paid competitively regardless of office location. RSU grants would likely be substantial on top of base pay; this estimate excludes equity.

Green flags

5 items

■ Blameless postmortem culture explicitly called out: signals a psychologically safe, learning-oriented engineering environment (category: culture)

Hiring insights

JD quality

7/10

Urgency

medium

Autonomy

medium

Team size

medium (5-15)

Red flags

3 items

■ 12x7 on-call rotation is significant: more demanding than a typical one-week-per-month on-call schedule, which could impact work-life balance (category: work-life balance)

Interview insights

Rounds

Duration

4 wks

Difficulty

hard

Take-home

Yes

Career path

Next roles

Senior Software Engineer - SRE · Staff Production Engineer · Engineering Manager - Reliability

About the company

Snowflake

www.snowflake.com

Snowflake is the cloud data platform that enables organizations to consolidate data, power analytics, and build data applications at massive scale. Processing exabytes of data for over 9,800 customers including Capital One, Adobe, and AT&T, Snowflake's unique architecture separates compute and storage across AWS, Azure, and GCP.

HQ: Bozeman, MT, USA

Interview difficulty: hard

Build vs. maintain: both

Cross-functional: Yes
