Facilitating Scalable ML and AI: Pachyderm’s Open-Source Platform Delivers Version-Controlled Data Science

TL;DR: Pachyderm, designed to solve practical data science problems regardless of size or complexity, provides a solid foundation for machine learning (ML) and artificial intelligence (AI) projects. The platform brings together data lineage and end-to-end Kubernetes pipelines, making enterprise-grade ML scalable and facilitating collaboration. With a focus on partnerships and integrations, Pachyderm aims to play a significant role in the ML/AI framework of the future.

Git — a version-control system for tracking code changes and coordinating work among multiple developers — has been beloved for 15 years for its ability to streamline code management.

Today, as artificial intelligence (AI) continues to power businesses worldwide, the team at Pachyderm has created a platform it describes as Git for data scientists. The technology offers complete version control for data while providing data science teams with the same first-class tools that software developers know and love.

“Advances in AI algorithms get the most press, but everything that goes into making the algorithm — the plumbing of data science — constitutes 90% of AI,” said Dan Jeffries, Chief Technology Evangelist at Pachyderm. “We’re dealing with that 90%, automating away the infrastructure hurdles and all of the little intermediate steps.”

Dan Jeffries, Chief Technology Evangelist, gave us the scoop on Pachyderm’s data science platform.

Pachyderm makes enterprise-grade software scalable by merging data lineage with end-to-end Kubernetes pipelines. The platform is ideal for building machine learning (ML) pipelines and Extract, Transform, and Load (ETL) workflows. And, since everything in Pachyderm is containerized, data scientists are free to choose the languages or libraries they want to use without infrastructure concerns.

The platform comes in three editions. Pachyderm Community Edition, an open-source version backed by a community of experts, allows users to quickly build, train, and deploy data science workloads for free. Pachyderm Hub, currently in beta, eliminates infrastructure hassles with a hosted and managed platform. Pachyderm Enterprise provides extra security and can be deployed on an enterprise’s own infrastructure.

Moving forward, Pachyderm’s focus on partnerships and integrations will help the platform play a significant role in the ML/AI framework of the future.

Bringing Cutting-Edge Development Tools to Data Scientists

Joe Doliner and Joey Zwicker founded the San Francisco-based company in 2014.

“When the founders were working at different startups across the world, including Airbnb dealing with anti-money laundering data science projects, they noticed that there was a dearth of AI tools,” Dan said. “AI was also many years behind software programming in terms of scale.”

The goal was to provide data scientists with a collaborative data versioning platform that would empower them to leverage containers, such as those made with Docker, when running data analytics and processing jobs.

“On the data versioning side, say you’re training a model to do video processing on a directory with a million video files, and you run a bunch of training jobs on that,” Dan said. “Then an administrator comes in later and crunches them down to a smaller size. All of those training runs are ruined because you can’t go back and reproduce them and you’ve essentially lost the previous state of the data. Since AI models can be extremely sensitive to even minor data changes, being able to version control your data and track the lineage of every step in your workflow makes a huge difference.”
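To make the scenario Dan describes concrete, here is a minimal Python sketch of content-addressed data versioning. It is a toy illustration of the general idea, not Pachyderm's implementation: the helper names (`snapshot`, `commit_id`, `changed_paths`) and the sample file paths are hypothetical. Each training run records an identifier derived from the exact bytes it was trained on, so a later in-place change to the data becomes detectable.

```python
import hashlib


def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Map each path to the SHA-256 digest of its contents."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def commit_id(snap: dict[str, str]) -> str:
    """Derive one identifier for the whole snapshot from its sorted entries."""
    h = hashlib.sha256()
    for path in sorted(snap):
        h.update(f"{path}:{snap[path]}\n".encode())
    return h.hexdigest()


def changed_paths(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List paths that were added, removed, or modified between two snapshots."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


# Hypothetical training data: two "video files" represented as raw bytes.
original = {"videos/a.mp4": b"full-resolution-a", "videos/b.mp4": b"full-resolution-b"}
before = snapshot(original)

# The training run records which version of the data it used.
training_run = {"model": "run-1", "data_commit": commit_id(before)}

# An administrator later recompresses one file in place.
recompressed = {**original, "videos/b.mp4": b"smaller-b"}
after = snapshot(recompressed)

print(changed_paths(before, after))                      # which files differ
print(training_run["data_commit"] == commit_id(after))   # does the run still match?
```

In this sketch the run's recorded identifier no longer matches the current data, which is the signal that the run cannot be reproduced from what is on disk. A real versioning system would also keep the old contents so the earlier state could be restored, which this toy example does not attempt.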

Pachyderm combines data versioning with end-to-end pipelines on Kubernetes.

Dan told us that data scientists aren’t your typical enterprise developers who build tools with white-glove features and perfect documentation.

“They’re often researchers who build an algorithm and put out some Python code that they may or may not maintain,” he said. “Often, they have to stick together five or six changing libraries or cutting-edge pieces like beads on a string. That’s incredibly challenging because the frameworks just aren’t designed to have that level of plug-and-play functionality.”

Pachyderm makes it easy to build end-to-end data science workflows while automating manual processes regardless of what data, language, or framework (such as Spark, R, Python, or OpenCV) is used.

Use Cases for BioTech, Banking, the Automotive Industry and Beyond

Pachyderm is designed for compatibility with everything from data-heavy biotechnology processes and highly regulated banking workflows to the data-fueled automotive industry.

In the biotech industry, for example, firms often scale faster than the capabilities of their data management systems, leaving data scientists struggling to keep up with demand. Pachyderm helps streamline biotech development with automated ML and AI pipelines that track data lineage, reducing manual work and helping firms meet compliance requirements.

The automotive industry also depends on data and ML to boost efficiency, lower costs, and spur innovation in areas such as autonomous vehicles and driver assistance systems. These technologies require the explainable, repeatable, and scalable data science pipelines that Pachyderm delivers.

Pachyderm provides data scientists with a foundation for ML and AI projects.

On the banking side, many financial institutions now depend on automated trading, ML, and AI to push their businesses forward. At the same time, they need to provide regulators with critical information using clear and easily accessible data lineage.

Of course, the wide range of Pachyderm use cases doesn’t stop there. LogMeIn, for example, uses Pachyderm when working with natural language processing (NLP) audio. Previously, they tried to run their audio processing with the biggest containers they could get on AWS, but it was taking seven weeks to process a single iteration of training data.

“With Pachyderm, they wrote a bunch of little scripts to clean up the data and pulled in a small NLP library that would never have been supported by any centralized and opinionated cloud provider,” Dan said. “Once they ran their data through Pachyderm, they reduced their processing from seven weeks to seven hours. Pachyderm would split it up, schedule it out in a bunch of containers, and do all the preprocessing for them.”
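The speedup Dan describes comes from treating each piece of data as an independent unit of work that can be processed in parallel. The Python sketch below illustrates only that general pattern: threads in a single process stand in for containers, and `clean` and the sample clips are hypothetical placeholders for LogMeIn's preprocessing scripts and audio data. It says nothing about how Pachyderm actually schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(clip: str) -> str:
    """Stand-in for a small preprocessing script applied to one audio clip."""
    return clip.strip().lower()


def process_in_parallel(clips: list[str], workers: int = 4) -> list[str]:
    """Apply clean() to every clip independently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean, clips))


# Hypothetical inputs; each item can be processed without the others.
clips = ["  Hello ", "WORLD", " Call Recording 3 "]
results = process_in_parallel(clips)
print(results)
```

Because no item depends on any other, adding workers shortens the wall-clock time roughly in proportion until some shared resource becomes the bottleneck.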

Allowing Teams to Develop and Operate with Agility

LogMeIn’s experience is the perfect example of the impact of Pachyderm’s speed and scale.

“Our models are more accurate, and they are getting to production and the customer’s hands much faster,” said Eyal Heldenberg, Voice AI Product Manager at the LogMeIn AI Center of Excellence, in a case study. “If we can go from weeks to hours processing data, that greatly affects everyone. This way, we can focus on the fun stuff: research, manipulating the models, and making better models.”

In addition to speed, Dan said Pachyderm’s primary value proposition is centered on developing with agility through collaboration and reproducibility.

“AI lives and breathes on data, and you’ve got to be able to keep track of every single change to it in order to legitimately be able to reproduce your results again and again,” he said. “You should be able to say, ‘OK, cool. We went down this road of 50 different algorithms, and we were correct with algorithm version two. We want to go back there now and iterate on that.’”

Beyond data versioning and lineage, Dan said it’s crucial that data scientists be able to pull together any framework in an agnostic way.

“If you can package up your stuff into a Docker container, it is super easy to go ahead and call that Docker container for the next stage in your pipeline,” he said. “A lot of these systems require that you know Python or have a Java plugin. For us, it doesn’t matter in the least.”
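As a rough illustration of what "calling a Docker container for the next stage" can look like, the Python sketch below builds two pipeline definitions as JSON. The field names follow the general shape of Pachyderm's JSON pipeline specification as best understood here and should be checked against the official documentation for the version in use. The helper `make_pipeline_spec`, the image names, the commands, and the repository names are all hypothetical. The sketch also assumes that a pipeline's output is exposed as a repository named after the pipeline, which is how the second stage reads from the first.

```python
import json


def make_pipeline_spec(name, image, cmd, input_repo, glob="/*"):
    """Build a pipeline definition: which container to run, and on which data."""
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


# Stage 1 (hypothetical): an R script packaged in its own image.
clean_stage = make_pipeline_spec(
    name="clean-audio",
    image="example/clean-audio:1.0",
    cmd=["Rscript", "/code/clean.R"],
    input_repo="raw-audio",
)

# Stage 2 (hypothetical): a Python script that reads stage 1's output.
train_stage = make_pipeline_spec(
    name="train-model",
    image="example/train:1.0",
    cmd=["python3", "/code/train.py"],
    input_repo=clean_stage["pipeline"]["name"],
)

print(json.dumps(clean_stage, indent=2))
print(json.dumps(train_stage, indent=2))
```

The two stages use different languages, yet the definitions differ only in the image and command. That is the framework-agnostic point Dan is making: the pipeline cares about the container, not about what runs inside it.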

Long-Term Partnership Programs and Integration Options

Dan told us he’s excited to be working on partnerships and integrations with projects such as Seldon and Kubeflow.

“I think there’s going to be a canonical stack in the next three to five years where you have three or four tools that make up a complete and robust end-to-end AI/ML framework for development and pipelines,” he said. “We want to be a part of that.”

In the meantime, Pachyderm is open to working with long-term partners and making integrations a breeze.

“I’m starting to see the integrated stack potentially come together,” Dan said.

Christine Preusler
