Site Reliability Engineer – Platform Foundation
Salary: ¥4,000,000 - ¥10,000,000 per year
Minimum years of experience: 3
SalesMarker
Job Description: Site Reliability Engineer – Platform Foundation
The Team
The Common Foundation team empowers engineering teams to move faster by providing scalable, reliable, and reusable systems as the platform for product development. The Platform Foundation side focuses on reliability, performance, security, and developer productivity across cloud infrastructure and a Kubernetes platform. This involves building paved roads, automating operations, and enabling application teams to ship safely at speed.
We are seeking a Site Reliability Engineer to take ownership of the health and evolution of our platform. You will design and operate AWS and Kubernetes environments, lead reliability initiatives, and collaborate with product engineers to embed best practices in availability, observability, and performance. The role turns complex infrastructure into simple, well-documented, self-service building blocks.
Responsibilities
- Operate and improve our Kubernetes platform (EKS): cluster lifecycle, upgrades, scaling, networking, and multi-tenant isolation.
- Design, provision, and manage AWS infrastructure (VPC, RDS/Aurora, OpenSearch, S3, SQS, Lambda, API Gateway, Batch, Glue) with emphasis on security, reliability, and developer experience.
- Build infrastructure as code using Terraform and AWS CDK; establish module standards, environments, and change management via GitOps.
- Drive end-to-end observability: metrics, logs, traces, SLOs, error budgets, actionable dashboards and alerts (Datadog).
- Partner with backend engineers to enhance service reliability, performance, and cost efficiency; promote best practices in testing, rollouts, and production readiness.
- Automate operations and repetitive work with tooling and pipelines; reduce MTTR with improved runbooks, diagnostics, and incident tooling.
- Lead incident response and post-incident reviews; advance operations through blameless retrospectives, remediation plans, and reliability roadmaps.
- Strengthen platform security: identity/access control, secrets management, network policies, patching, vulnerability management.
- Support data workloads and pipelines with robust, scalable infrastructure and monitoring.
- Contribute to platform documentation, paved paths, and self-service developer workflows to accelerate delivery.
Requirements
- 3+ years in SRE, Platform, or Infrastructure Engineering, with production ownership of cloud-native systems.
- Strong experience running Kubernetes in production: upgrades, scaling, workload reliability.
- Deep hands-on expertise with AWS services (networking, compute, storage, databases, messaging) and secure-by-default architectures.
- Proficiency with IaC (Terraform and/or AWS CDK), modularization, and environment management.
- Solid observability fundamentals: metrics, logging, tracing, SLOs/error budgets, actionable alerting.
- Proven track record improving reliability, performance, and developer experience in partnership with application teams.
- Experience running incident response and driving post-incident improvements.
Nice to Have
- Experience with identity and access management patterns, Cognito, JWT, and service-to-service auth.
- Background in multi-tenant architectures, capacity planning, and cost optimization.
- History of handling major incidents at scale and building tooling to reduce MTTR/MTTD.
- Contributions to internal developer platforms, golden paths, or shared libraries.
- Fluency in English or Japanese.
Why Join?
- Work with one of the fastest-growing SaaS startups in Japan, with strong financial growth.
- Opportunity for innovative product development and to build systems from scratch.
- Abundant leadership and career development opportunities.
- Remote-friendly and fully flexible work schedules.
- Global team and English-speaking environment.
- Attractive benefits and perks: Resort Worx, book purchasing, weekly free lunch, company offsites, and more.
Working Style
Hybrid Work
A combination of office and remote work, with in-office days varying by role. Remote collaboration is conducted via Zoom, Google Meet, and Gather.
Flex Work
Customizable working hours to match your personal needs. For business and client-facing teams, schedules are frequently client-driven.
Global Environment
Work alongside teammates from over 20 countries. Projects move forward across languages and cultures, with English and Japanese used naturally in daily work.
Position Details
- Title: Site Reliability Engineer – Platform Foundation
- Employment Type: Full-time
- Compensation: ¥4,000,000 – ¥10,000,000 annual gross (final offer based on experience and skill)
- Location: Tokyo, Japan (Hybrid/on-site at an office near Ebisu Station)
- Probation Period: 6 months (same conditions as after probation)