Software Engineer (Site Reliability) - Mercari Crossborder

Salary not provided

GCP, Kubernetes

English or Japanese

English: Fluent OR Japanese: Fluent

Location: Minato City, Tokyo, Japan
Employment Status: Full-time
Work Hours: Full Flextime (no core time)
Office: Roppongi
Mode: Hybrid (Mid-Career, Full time)

Description

Mercari has operated localised marketplaces in Japan and the United States for years. The Cross Border team recently launched a new global app—already live in Hong Kong, Taiwan, and the United States—with the ambition to expand into over 100 countries in the coming years. The engineering goal is to build a global app platform that lets existing and future product services run consistently across every region, without the fragmentation that comes from excessive localisation.

The cross-border nature of the business means reliability is never a purely technical problem. You will work alongside product engineers, operations teams, and third-party logistics and vendor partners to ensure the platform holds up under the real-world pressures of multi-region, multi-currency, and multi-regulatory commerce. Speed and stability are both requirements—the team is expected to make deliberate trade-offs and move forward with a clear long-term foundation in mind.

Work Responsibilities

Platform Reliability

Own the SRE function for the global app platform as it expands into new regions—defining SLIs/SLOs, leading production readiness reviews, and ensuring reliability is built in from design through to launch.
Operate the platform reliably across multiple time zones: roll out and run services with each region’s traffic patterns in mind to meet the reliability requirements of a global, always-on service.
Scale databases and infrastructure as the user base grows, and evolve the architecture from today’s single-region setup toward a multi-region configuration.
Track and improve reliability metrics (change failure rate, incident frequency, time to detect, and time to resolve) to guide trade-off decisions between delivery speed and platform stability.

Observability & Incident Management

Build and own the observability stack—metrics, distributed tracing, structured logging—and continuously improve production debugging so on-call engineers can resolve issues independently at any hour.
Lead incident response: rapid detection, mitigation, resolution, and blameless post-mortems. Participate in the on-call rotation, with a long-term goal of reducing alert volume through AI-assisted automation.

Developer Enablement

Own the developer-facing side of the platform: CI/CD pipelines, deployment tooling, development environments, and the systems that determine how fast and safely engineers can go from code to production.
Improve the productivity and development experience of every engineer building the global app so the team can ship more efficiently, increasingly by using AI to accelerate the entire development workflow.
Identify and eliminate friction in the engineering workflow—through automation, self-service tooling, AI-assisted workflows, and platform improvements—so engineers spend more time building and less time waiting or debugging.

Cross-functional Collaboration

Partner with platform engineering, product engineering, and operations to bridge the gap between infrastructure health and product-level reliability—working across time zones with teams in Japan, APAC, and beyond.
Collaborate with operations teams and third-party vendor partners on reliability requirements that extend into logistics and external integrations; share learnings across engineering teams to raise the collective bar.

Unique Challenges

Global scale: You are building the reliability foundation for a platform targeting 100+ countries, where a single time zone no longer bounds the problem.
Multi-region evolution: Growing the user base means scaling data and infrastructure and moving from today’s single-region service toward a globally distributed, multi-region architecture.
SRE meets enablement: You will directly shape how the entire engineering team develops, ships, and operates software.
AI-native approach: You will work at the front line of AI-accelerated development and the stronger operations it demands, helping shape the SRE culture that emerges as engineers and autonomous agents share the work.
Business context: You will work directly with product, engineering, operations, and business stakeholders across regions, connecting reliability engineering to real commercial outcomes, not just uptime numbers.

Qualifications

Required Experience/Skills

Hands-on experience operating production services on cloud infrastructure, particularly GCP and Kubernetes.
Experience defining reliability indicators and objectives (SLIs/SLOs) and continuously improving service reliability based on them, including participating in on-call rotations and running blameless post-mortems.
Operational experience with databases and a solid understanding of database fundamentals.
Hands-on experience applying AI/LLMs to improve engineering productivity, and/or experience building and providing tooling to developers.
A proactive mindset toward incident response and troubleshooting, and a willingness to embed directly into product teams when the service requires it—improving reliability at the product level, not only the infrastructure level.
Demonstrated experience learning from incidents and working cross-functionally to design and implement preventive measures and automation.
Foundational knowledge of web services architecture.

Preferred Experience/Skills

Experience setting SLIs/SLOs not only within engineering but by communicating and aligning with business and product leaders.
Experience in product development within a team environment.
Experience developing AI agents.

Language

Japanese: Independent (CEFR – C1)
OR
English: Independent (CEFR – C1)

Learn more:

Mercari Careers site
Mercan
Social media: X / LinkedIn
