Software Engineer (Site Reliability) - Mercari

Salary not provided

Kubernetes, Go, Python, Shell

English or Japanese

English: Fluent OR Japanese: Fluent

Software Engineer (Site Reliability)

Employment Status: Full-time
Work Hours: Full Flextime (no core time)
Office: Roppongi

About the Role

Mission: Circulate all forms of value to unleash the potential in all people.
Vision: Utilizing technology to connect people worldwide and provide opportunities for everyone to realize their dreams.

See more about our values and mission in our Culture Doc.

Team Mission

Engineering Principles:

Passion For The Product
Grow Together
Solve Through Mechanisms
Collaborate Openly

Learn more about our engineering culture.

The SRE Team is responsible for the reliability, scalability, and operational excellence of large-scale production services. The team works across Google Cloud and Kubernetes and focuses on observability through critical user journey (CUJ) SLOs, improving incident response, reducing toil, building resilient systems, and advancing AI-driven operations.

Work Responsibilities

Operate hundreds of production microservices on Google Cloud (Kubernetes, managed services) under SLO targets, including participation in a support rotation for urgent issues.
Lead end-to-end reliability epics independently, from design through rollout, monitoring, and post-launch iteration.
Define and operate SLOs and SLIs for critical user journeys, using error budgets for prioritization with product teams.
Lead incident response, improve the team's postmortem culture, and drive follow-ups to prevent recurrence.
Build autonomous AI agents for detection, triage, and recovery with clear safety protocols.
Write Infrastructure as Code with Terraform and develop automation to reduce toil and support large-scale operations.
Build and maintain monitoring, alerting, and tracing on Datadog, optimizing for user impact and rapid detection-to-mitigation.
Perform reliability and performance tuning on production workloads (capacity planning, autoscaling, load shedding, dependency hardening).
Collaborate with product and platform teams on production readiness, capacity planning, and new infrastructure adoption.
Strengthen reliability governance through engineering (risk assessment, audit response, compliance-as-code).

Unique Challenges

Improving reliability across a broad business portfolio (Marketplace and Fintech) at scale, leading with CUJ SLOs.
Shaping SRE culture in an AI-driven environment, collaborating with both engineers and autonomous agents.
Partnering with engineering teams who act on data for compounding reliability improvements.
Balancing approximately 50% reactive work (alerts, support, incident response) with 50% project delivery work.
Working in a bilingual (Japanese/English) environment.

Qualifications

Required

Production SRE experience covering service ownership, availability targets, toil reduction, and operational readiness, including using SLOs and SLIs with development teams.
Experience operating production services at scale (over 10K QPS or several microservices) under SLOs.
Production experience with Google Cloud (compute, networking, managed services) and Kubernetes workloads.
Infrastructure-as-Code knowledge (Terraform) and scripting ability (Go, Python, or shell).
Hands-on experience with monitoring and observability (Datadog or similar), including alert design and reducing alert fatigue.
Ownership of incident response, postmortems, and on-call or support rotations.
Ability to lead epics end-to-end independently.
Willingness to learn and apply AI to operational workflows beyond core SRE work.

Preferred

Designing or running platform-wide SLO programs across multiple services/business units.
Applying AI to operational workflows (e.g., log analysis, alert summarization, RCA, remediation) with quality/safety evaluation.
Experience with high-scale Kubernetes platforms or distributed systems internals (scheduling, consistency, failure recovery).
Leading reliability or platform initiatives spanning multiple teams.
Strengthening reliability governance (compliance-as-code, automated audit evidence, risk assessment).

Language

Japanese: Independent (CEFR – C1)
- OR
English: Independent (CEFR – C1)

(See more about CEFR levels here.)

Learn More

Recruiting Process

Application screening
Skill assessment (for engineering positions; on HackerRank or GitHub)
Interviews (number varies by position)
Reference check (online, near final interview)
Offer (after final interview and reference check)

Equal Opportunity Hiring

We aim for a world in which no one's potential is limited by their background. Our inclusion & diversity mindset drives us to eliminate discrimination based on age, gender, sexual orientation, race, religion, physical disability, or any other factor.

Read our Inclusion & Diversity statement.
See our Privacy Policy.