About the Team
Our team is responsible for the collection, storage, and processing of large-scale datasets generated by autonomous vehicles and delivery robots, including sensor data from cameras, lidar, radar, and other onboard systems. Scaling reliable storage and providing efficient compute tools are essential for supporting downstream teams such as machine learning, simulation, and algorithm development. Our data processing stack incorporates specialized algorithms similar to those deployed directly on autonomous systems in the field.
About the Role
As a Software Engineer, Data Platform at Avride, you will be responsible for designing, building, and maintaining the core data and machine learning infrastructure, with a strong focus on software design and code quality. You will design systems that ingest, process, and organize petabytes of telemetry and sensor data into a globally distributed data lake, enabling high-throughput, low-latency access for both model training and online inference. Your work will help ML engineers and data scientists iterate faster and deliver better-performing systems.
What You'll Do
- Build and maintain robust data pipelines and core datasets to support simulation, analytics, and machine learning workflows, as well as business use cases
- Design and implement scalable database architectures to manage massive and complex datasets, optimizing for performance, cost, and usability
- Collaborate closely with internal teams such as Simulation, Perception, Prediction, and Planning to understand their data requirements and workflows
- Evaluate, integrate, and extend open-source tools (e.g., Apache Spark, Ray, Apache Beam, Argo Workflows) as well as internal systems
What You'll Need
- Strong proficiency in Python (required); experience with C++ is highly desirable
- Proven ability to write high-quality, maintainable code and design scalable, robust systems
- Experience with Kubernetes for deploying and managing distributed systems
- Hands-on experience with large-scale open-source data infrastructure (e.g., Kafka, Flink, Cassandra, Redis)
- Deep understanding of distributed systems and big data platforms, with experience managing petabyte-scale datasets
Nice to Have
- Experience building and operating large-scale ML systems
- Understanding of ML/AI workflows and experience with machine learning pipelines
- Experience optimizing resource usage and performance in distributed environments
- Familiarity with data visualization and dashboarding tools (e.g., Grafana, Apache Superset)
- Experience with cloud-based infrastructure (e.g., AWS, GCP, Microsoft Azure)
Candidates must be authorized to work in the U.S. Relocation sponsorship is not offered, and remote work options are not available.