Job Description

Business Overview
The Rakuten AI & Data Division (AIDD) creates powerful, customer-focused search, recommendation, data science, advertising, marketing, pricing, and inventory optimization solutions for businesses across the commerce, fintech, and mobile industries.

Department Overview
The Rakuten Institute of Technology Worldwide (RIT), the AI R&D engine of Rakuten Group, Inc., is a global network of research labs spanning Tokyo, Singapore, Boston, San Mateo, Bengaluru, and Paris. We are dedicated to pioneering advancements in core AI technologies, with a focus on machine learning, deep learning, and generative AI. Our researchers actively explore use cases for large language models, intelligent agent systems, and other cutting-edge applications, driving innovation across Rakuten's diverse ecosystem.

Position: Why We Hire
To establish and support domain-leading LLMs across critical sectors such as Fintech, Booking services, and E-commerce, we are building a foundational Senior Data Engineering team. This team will play a critical role in designing, building, and maintaining the robust data infrastructure essential for the entire LLM lifecycle - from data collection and preparation for pre-training and fine-tuning, to serving and monitoring. You will work closely with ML Engineers, Data Scientists, and Researchers to ensure data quality, accessibility, and scalability, directly impacting the success and performance of our in-house LLM initiatives.

Position Details
- Data Pipeline Development for LLMs: Design, develop, and maintain highly scalable, reliable, and efficient data pipelines (ETL/ELT) for ingesting, transforming, and loading the diverse datasets critical for LLM pre-training, fine-tuning, and evaluation. This includes structured, semi-structured, and unstructured text data.
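As a rough illustration of the kind of preprocessing such pipelines perform, the sketch below shows exact deduplication by content hash plus simple regex-based PII masking over a small text corpus. It is a minimal, self-contained example; the regex patterns, placeholder tokens, and function names are illustrative assumptions, not a description of Rakuten's actual pipeline (production systems typically use dedicated PII detectors and near-duplicate detection such as MinHash).

```python
import hashlib
import re

# Illustrative PII patterns (assumptions for this sketch); a production
# pipeline would use a dedicated PII detection library instead.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{2,4}-\d{2,4}-\d{3,4}\b")

def mask_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

def dedup_and_clean(docs):
    """Exact-deduplicate documents by content hash, then mask PII."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.split())  # collapse whitespace noise
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        yield mask_pii(normalized)

corpus = [
    "Contact me at alice@example.com for details.",
    "Contact me at   alice@example.com for details.",  # duplicate after normalization
    "Call 03-1234-5678 tomorrow.",
]
print(list(dedup_and_clean(corpus)))
# → ['Contact me at <EMAIL> for details.', 'Call <PHONE> tomorrow.']
```

At scale the same per-document logic would run inside a distributed framework (e.g., Spark), but the cleaning steps themselves stay this shape.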
- High-Quality Dataset Creation & Curation: Implement advanced techniques for data cleaning and preprocessing, including deduplication, noise reduction, PII masking, tokenization, and formatting of large text corpora. Explore and implement methods for expanding and enriching datasets for LLM training, such as data augmentation and synthesis. Establish and enforce rigorous data quality standards, implement automated data validation checks, and ensure data privacy and security compliance (e.g., GDPR, CCPA).
- Data Job Management: Establish robust systems for data versioning, lineage tracking, and reproducibility of datasets used across the LLM development lifecycle. Identify and resolve performance bottlenecks within data pipelines, optimizing data storage, retrieval, and processing for efficiency and cost-effectiveness.
- Data Infrastructure & Orchestration: Build and maintain scalable data warehouses and data lakes designed specifically for LLM data, on both on-premise and public cloud environments. Implement and manage data orchestration tools (e.g., Apache Airflow, Prefect, Dagster) to automate and manage complex data workflows for LLM dataset preparation.

Mandatory Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related quantitative field, and 3+ years of professional experience in Data Engineering with a significant focus on building and managing data pipelines for large-scale machine learning or data science initiatives, especially those involving large text/image/voice datasets.
- Direct experience with data engineering for Large Language Models (LLMs), including pre-training, fine-tuning, and evaluation datasets.
- Familiarity with common challenges and techniques for preprocessing massive text corpora (e.g., handling noise, deduplication, PII detection/masking, tokenization at scale).
- Experience with data versioning and lineage tools/platforms (e.g., DVC, Pachyderm, lakeFS, or the data versioning features of MLOps platforms such as MLflow).
- Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) from a data loading and preparation perspective.
- Experience designing and implementing data annotation workflows and pipelines.
- Strong proficiency in Python and extensive experience with its data ecosystem.
- Proficiency in SQL and a good understanding of data warehousing concepts, data modeling, and schema design.

Other Information:
English Qualification: Fluent

#engineer #infrastructureengineer #aianddatadiv

In Japanese, Rakuten stands for 'optimism.' It means we believe in the future. It's an understanding that, with the right mind-set, we can make the future better by what we do today. So we challenge ourselves to evolve, innovate, and experiment, to create a better, brighter future for everyone. Today, our 70+ businesses span e-commerce, digital content, communications, and fintech, bringing the joy of discovery to almost 1.3 billion members across the world.

If you have any trouble logging in, please contact Rakuten Group, Inc.: rakuten-recruiting-info@mail.rakuten.com

Please read the Application Requirements (EN / JP) before applying.

Our Diversity & Inclusion Policy and Application Documents
Rakuten's corporate mission is to "contribute to society by creating value through innovation and entrepreneurship." We foster a culture that provides equal opportunities to those who share this founding philosophy and take on the challenge to transform society, regardless of age, gender, nationality, or any other status. Diversity is one of Rakuten's core strategies and a driving force for innovation. Because of this, you are not required to submit any of the following information in order to apply for our job positions.
- Gender
- Age
- Photo
- Nationality
- Information not related to business, such as ideological beliefs, family structure, etc.

* For legal compliance, we may ask you about your work eligibility.