Software Engineer (Site Reliability) - Mercari Crossborder

Salary not provided

Kubernetes, GCP
English or Japanese
English: Fluent OR Japanese: Fluent
Mercari

Location: Minato City, Tokyo, Japan
Employment Status: Full-time
Work Hours: Full Flextime (no core time)
Office: Roppongi
Mode: Hybrid (Mid-Career, Full time)


Description

Mercari has operated localised marketplaces in Japan and the United States for years. The Cross Border team recently launched a new global app—already live in Hong Kong, Taiwan, and the United States—with the ambition to expand into over 100 countries in the coming years. The engineering goal is to build a global app platform that lets existing and future product services run consistently across every region, without the fragmentation that comes from excessive localisation.

The cross-border nature of the business means reliability is never a purely technical problem. You will work alongside product engineers, operations teams, and third-party logistics and vendor partners to ensure the platform holds up under the real-world pressures of multi-region, multi-currency, and multi-regulatory commerce. Speed and stability are both requirements—the team is expected to make deliberate trade-offs and move forward with a clear long-term foundation in mind.


Work Responsibilities

Platform Reliability

  • Own the SRE function for the global app platform as it expands into new regions—defining SLIs/SLOs, leading production readiness reviews, and ensuring reliability is built in from design through to launch.
  • Operate the platform reliably across multiple time zones: roll out and run services with each region’s traffic patterns in mind to meet the reliability requirements of a global, always-on service.
  • Scale databases and infrastructure as the user base grows, and evolve the architecture from today’s single-region setup toward a multi-region configuration.
  • Track and improve reliability metrics (change failure rate, incident frequency, time to detect and resolve) to guide trade-off decisions between delivery speed and platform stability.

Observability & Incident Management

  • Build and own the observability stack—metrics, distributed tracing, structured logging—and continuously improve production debugging so on-call engineers can resolve issues independently at any hour.
  • Lead incident response: rapid detection, mitigation, resolution, and blameless post-mortems; participate in the on-call rotation, with a long-term goal of reducing alert volume through AI-assisted automation.

Developer Enabling

  • Own the developer-facing side of the platform: CI/CD pipelines, deployment tooling, development environments, and the systems that determine how fast and safely engineers can go from code to production.
  • Improve the productivity and development experience of every engineer building the global app so the team can ship more efficiently, increasingly by using AI to accelerate the entire development workflow.
  • Identify and eliminate friction in the engineering workflow—through automation, self-service tooling, AI-assisted workflows, and platform improvements—so engineers spend more time building and less time waiting or debugging.

Cross-functional Collaboration

  • Partner with platform engineering, product engineering, and operations to bridge the gap between infrastructure health and product-level reliability—working across time zones with teams in Japan, APAC, and beyond.
  • Collaborate with operations teams and third-party vendor partners on reliability requirements that extend into logistics and external integrations; share learnings across engineering teams to raise the collective bar.

Unique Challenges

  • Global scale: You are building the reliability foundation for a platform targeting 100+ countries, where a single time zone no longer bounds the problem.
  • Multi-region evolution: Growing the user base means scaling data and infrastructure and moving from today’s single-region service toward a globally distributed, multi-region architecture.
  • SRE meets enabling: You will directly shape how the entire engineering team develops, ships, and operates software.
  • AI-native approach: You will work at the front line of AI-accelerated development and the stronger operations it demands, helping shape the SRE culture that emerges as engineers and autonomous agents share the work.
  • Business context: You will work directly with product, engineering, operations, and business stakeholders across regions, connecting reliability engineering to real commercial outcomes, not just uptime numbers.

Qualifications

Required Experience/Skills

  • Hands-on experience operating production services on cloud infrastructure, particularly GCP and Kubernetes.
  • Experience defining reliability indicators such as SLIs/SLOs and continuously improving service reliability based on them, including on-call rotations and blameless post-mortem processes.
  • Operational experience with databases and a solid understanding of database fundamentals.
  • Hands-on experience applying AI/LLMs to improve engineering productivity, and/or experience building and providing tooling to developers.
  • A proactive mindset toward incident response and troubleshooting, and a willingness to embed directly into product teams when the service requires it—improving reliability at the product level, not only the infrastructure level.
  • Demonstrated experience learning from incidents and working cross-functionally to design and implement preventive measures and automation.
  • Foundational knowledge of web services architecture.

Preferred Experience/Skills

  • Experience setting SLIs/SLOs not only within engineering but also in alignment with business and product leaders.
  • Experience in product development within a team environment.
  • Experience developing AI agents.

Language

  • Japanese: Independent (CEFR – C1)
    OR
  • English: Independent (CEFR – C1)
