Software Engineer (Site Reliability) - Mercari Crossborder
Salary not provided
Location: Minato City, Tokyo, Japan
Employment Status: Full-time
Work Hours: Full Flextime (no core time)
Office: Roppongi
Mode: Hybrid (Mid-Career, Full-time)
Description
Mercari has operated localised marketplaces in Japan and the United States for years. The Cross Border team recently launched a new global app—already live in Hong Kong, Taiwan, and the United States—with the ambition to expand into over 100 countries in the coming years. The engineering goal is to build a global app platform that lets existing and future product services run consistently across every region, without the fragmentation that comes from excessive localisation.
The cross-border nature of the business means reliability is never a purely technical problem. You will work alongside product engineers, operations teams, and third-party logistics and vendor partners to ensure the platform holds up under the real-world pressures of multi-region, multi-currency, and multi-regulatory commerce. Speed and stability are both requirements—the team is expected to make deliberate trade-offs and move forward with a clear long-term foundation in mind.
Work Responsibilities
Platform Reliability
- Own the SRE function for the global app platform as it expands into new regions—defining SLIs/SLOs, leading production readiness reviews, and ensuring reliability is built in from design through to launch.
- Operate the platform reliably across multiple time zones: roll out and run services with each region’s traffic patterns in mind to meet the reliability requirements of a global, always-on service.
- Scale databases and infrastructure as the user base grows, and evolve the architecture from today’s single-region setup toward a multi-region configuration.
- Track and improve reliability metrics (change failure rate, incident frequency, time to detect and resolve) to guide trade-off decisions between delivery speed and platform stability.
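For readers less familiar with the metrics named above, the following is a minimal Python sketch of how change failure rate, mean time to detect and resolve, and an SLO error budget are commonly computed. It is illustrative only and does not describe Mercari's tooling: the record shapes (`Deployment`, `Incident`), the function names, and the sample numbers are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Deployment:
    """One production change (hypothetical record shape)."""
    deployed_at: datetime
    caused_incident: bool


@dataclass
class Incident:
    """One production incident (hypothetical record shape)."""
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


def change_failure_rate(deployments: Sequence[Deployment]) -> float:
    """Share of deployments that led to an incident (0.0 when there are none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_incident)
    return failed / len(deployments)


def mean_time_to_detect(incidents: Sequence[Incident]) -> timedelta:
    """Average gap between an incident starting and being detected."""
    if not incidents:
        return timedelta(0)
    total = sum((i.detected_at - i.started_at for i in incidents), timedelta(0))
    return total / len(incidents)


def mean_time_to_resolve(incidents: Sequence[Incident]) -> timedelta:
    """Average gap between an incident starting and being resolved."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta(0))
    return total / len(incidents)


def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for a ratio-based SLO.

    1.0 means nothing spent; 0.0 means fully spent; negative means overspent.
    """
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


# Illustrative sample data (made up for this sketch).
t0 = datetime(2024, 1, 1, 12, 0)
deployments = [
    Deployment(t0, False),
    Deployment(t0 + timedelta(hours=1), True),
    Deployment(t0 + timedelta(hours=2), False),
    Deployment(t0 + timedelta(hours=3), False),
]
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=65)),
]

cfr = change_failure_rate(deployments)
mttd = mean_time_to_detect(incidents)
mttr = mean_time_to_resolve(incidents)
budget = error_budget_remaining(0.999, good_events=999_500, total_events=1_000_000)
```

A team might compare these figures over time: a rising change failure rate or a shrinking error budget argues for slowing delivery, while healthy values leave room to ship faster.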
Observability & Incident Management
- Build and own the observability stack—metrics, distributed tracing, structured logging—and continuously improve production debugging so on-call engineers can resolve issues independently at any hour.
- Lead incident response: rapid detection, mitigation, resolution, and blameless post-mortems; participate in the on-call rotation, with the long-term goal of reducing alert volume through AI-assisted automation.
Developer Enablement
- Own the developer-facing side of the platform: CI/CD pipelines, deployment tooling, development environments, and the systems that determine how fast and safely engineers can go from code to production.
- Lift the productivity and development experience of every engineer building the global app so the team can ship a global product more efficiently, increasingly by leveraging AI to accelerate the entire development workflow.
- Identify and eliminate friction in the engineering workflow—through automation, self-service tooling, AI-assisted workflows, and platform improvements—so engineers spend more time building and less time waiting or debugging.
Cross-functional Collaboration
- Partner with platform engineering, product engineering, and operations to bridge the gap between infrastructure health and product-level reliability—working across time zones with teams in Japan, APAC, and beyond.
- Collaborate with operations teams and third-party vendor partners on reliability requirements that extend into logistics and external integrations; share learnings across engineering teams to raise the collective bar.
Unique Challenges
- Global scale: You are building the reliability foundation for a platform targeting 100+ countries, where a single time zone no longer bounds the problem.
- Multi-region evolution: Growing the user base means scaling data and infrastructure and moving from today’s single-region service toward a globally distributed, multi-region architecture.
- SRE meets enablement: You will directly shape how the entire engineering team develops, ships, and operates software.
- AI-native approach: You will work at the front line of AI-driven development velocity and the operational rigor it demands, helping shape the SRE culture that emerges as engineers and autonomous agents share the work.
- Business context: You will work directly with product, engineering, operations, and business stakeholders across regions, connecting reliability engineering to real commercial outcomes, not just uptime numbers.
Qualifications
Required Experience/Skills
- Hands-on experience operating production services on cloud infrastructure, particularly GCP and Kubernetes.
- Experience defining reliability indicators and objectives (SLIs/SLOs) and continuously improving service reliability against them, including participating in on-call rotations and running blameless post-mortem processes.
- Operational experience with databases and a solid understanding of database fundamentals.
- Hands-on experience applying AI/LLMs to improve engineering productivity, and/or experience building and providing tooling to developers.
- A proactive mindset toward incident response and troubleshooting, and a willingness to embed directly into product teams when the service requires it—improving reliability at the product level, not only the infrastructure level.
- Demonstrated experience learning from incidents and working cross-functionally to design and implement preventive measures and automation.
- Foundational knowledge of web services architecture.
Preferred Experience/Skills
- Experience setting SLIs/SLOs not only within engineering but by communicating and aligning with business and product leaders.
- Experience in product development within a team environment.
- Experience developing AI agents.
Language
- Japanese: Independent (CEFR – C1)
OR
- English: Independent (CEFR – C1)
Learn more:
- Mercari Careers site
- Mercan
- Social media: X / LinkedIn