While novel machine learning (ML) research can be complicated and time-consuming, building a proof of concept (POC) is often the opposite. The purpose of a POC is to demonstrate quickly that an application or idea is feasible with ML, and that does not need to be done from scratch. Rather than training a model or producing anything completely new, one can leverage existing third-party or open source resources.
This series on implementing an ML POC has discussed translating business objectives to ML metrics, building a project-specific dataset, and structuring the experimentation environment. Now, we will discuss how to put those stages to use in developing the POC model. Revisiting the initial project roadmap will offer a guide on how to approach development, particularly in terms of when to leverage the many available resources in the ML ecosystem that are constantly being updated. This article describes some of the key tools, software packages, and pre-trained model hubs that help build an ML POC successfully.
To develop a model that performs well enough to be considered a successful POC, one should look at the initial roadmap that was made while translating objectives into metrics. This typically begins with off-the-shelf models or those available via a third-party application programming interface (API), if allowable in the business context. Next, the roadmap moves to architectures or pre-trained models that can be adjusted via fine-tuning or retraining for the particular task, before finishing with those that need to be implemented from scratch with perhaps more complicated training regimes. Fortunately, with the growth of the ML ecosystem, many tools, resources, software libraries, and even open-source research papers with code repositories can be modified as needed for each stage.
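The escalating roadmap above can be sketched as a simple decision helper. Note that everything here is illustrative: the tier names, the input flags, and the 1,000-example cutoff are hypothetical stand-ins for the project-specific criteria set during objective translation, not part of any standard tooling.

```python
# Illustrative sketch of the POC roadmap's escalation logic.
# Tier names, flags, and the data threshold are hypothetical placeholders.

def choose_poc_approach(third_party_allowed: bool,
                        task_is_common: bool,
                        labeled_examples: int) -> str:
    """Pick the cheapest roadmap tier likely to work for the task."""
    if third_party_allowed and task_is_common:
        # Off-the-shelf models or hosted APIs need little data or expertise.
        return "off-the-shelf / third-party API"
    if labeled_examples >= 1000:
        # Enough data to adapt a pre-trained model via fine-tuning.
        return "fine-tune a pre-trained model"
    # Bespoke, data-sparse tasks may need custom implementation and training.
    return "implement and train from scratch"

print(choose_poc_approach(True, True, 0))       # off-the-shelf / third-party API
print(choose_poc_approach(False, False, 5000))  # fine-tune a pre-trained model
```

The point of encoding the roadmap this way is that each tier is only attempted after the cheaper one is ruled out, which keeps POC effort proportional to task difficulty.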
Sufficiently demonstrating the use of ML for a particular application doesn’t need to start from scratch or even involve training an ML model if performance can be shown through other means. Therefore, one should certainly use existing tools and pre-trained models if possible. Across all ML domains—such as computer vision (CV), time series prediction, regression, and natural language processing (NLP)—a variety of third-party and open source software packages and models or paid service options are available to bootstrap an ML POC. They vary in their ease of use, expertise needed to operate or integrate them, amount of data required, and their flexibility or customizability for new tasks. Most will be compatible with the experimentation environment created in the previous POC stages.
Roughly speaking, the more pre-packaged a tool or model is, the less flexible it is in terms of retraining for new tasks—but this matters less when the task is not highly bespoke, data-sparse, or complex. For example, CV models for object detection are generally available from cloud providers, such as Google's Cloud Vision or Amazon Web Services' Rekognition. They are likely to perform well at identifying common objects, like produce items on a conveyor belt or in a warehouse, but would struggle with highly specific tasks, like defect detection, without more data and fine-tuning where the services allow it.
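One quick way to make this "is pre-packaged good enough?" call is to run the candidate model on a small labeled sample and compare its accuracy against the success metric chosen earlier in the project. A minimal sketch, where `predict` stands in for any hosted or local model and the 0.8 threshold is a placeholder for the real POC metric:

```python
# Sketch: decide whether an off-the-shelf model clears the POC bar.
# `predict` stands in for any hosted or local model; the 0.8 threshold is a
# placeholder for the metric chosen when translating objectives to metrics.

def accuracy(predict, labeled_sample):
    """Fraction of (input, expected_label) pairs the model gets right."""
    correct = sum(1 for x, y in labeled_sample if predict(x) == y)
    return correct / len(labeled_sample)

def good_enough(predict, labeled_sample, threshold=0.8):
    return accuracy(predict, labeled_sample) >= threshold

# Toy stand-in model that labels anything containing "apple" as "produce".
toy_predict = lambda x: "produce" if "apple" in x else "other"
sample = [("red apple", "produce"), ("cardboard box", "other"),
          ("green apple", "produce"), ("forklift", "other")]
print(good_enough(toy_predict, sample))  # True: 4/4 correct
```

If the pre-packaged model falls below the bar on this sample, that is the signal to move down the roadmap toward fine-tuning or custom training.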
Typically trained on millions of data points, these foundation models are served via APIs or interactive user interfaces, which makes them easy to use while still performing well on a variety of tasks. They are most commonly accessed through cloud providers (across the different ML domains like NLP or CV), but other companies offer them as well. For example, OpenAI has large language models (LLMs) for NLP, such as GPT-4, as well as Whisper for speech recognition. With these tools, the user does not carry the burden of developing infrastructure to train the model and then host it for further use. Additionally, these tools reduce the risk of the POC dataset not containing enough data for the task.
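Integrating such an API-served model usually amounts to sending a JSON payload over HTTPS. The sketch below shows the general shape of that integration; the field names and feature type are hypothetical, so the real schema must come from the chosen provider's documentation:

```python
# Sketch of preparing a request for a hosted vision API. The JSON field
# names and feature type are hypothetical; real providers (Cloud Vision,
# Rekognition, etc.) define their own request schemas.
import base64
import json

def build_detection_request(image_bytes: bytes, max_results: int = 10) -> str:
    """Base64-encode an image and wrap it in a JSON payload for a
    (hypothetical) object-detection endpoint."""
    payload = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "features": [{"type": "OBJECT_DETECTION", "max_results": max_results}],
    }
    return json.dumps(payload)

body = build_detection_request(b"<raw image bytes here>")
print(json.loads(body)["features"][0]["type"])  # OBJECT_DETECTION
```

In practice, an HTTP client would POST this body to the provider's endpoint with an API key in the headers, and the POC code would only need to parse the JSON response—no training or hosting infrastructure required.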
In the cases where these types of pre-built, pre-packaged tools do not work well enough to complete the POC, are too expensive, or are prohibited due to privacy or network considerations, other options are available. However, they may require the user to have more expertise, develop infrastructure, or have more data for training. Several large software players in the ML ecosystem have open-source hubs for pre-trained models that can be downloaded, with further code written to repurpose and retrain them for different tasks.
In particular, Hugging Face is one of the most popular open-source platforms, with models across all domains available for free reuse, architecture documentation explaining how they work, and code snippets showing how to use and modify them. Another popular option is Google's TensorFlow Hub. Even Meta and Microsoft have thrown their hats into the ring by releasing various open-source models. Depending on the size of the model chosen and the degree of modification needed, complete retraining, training from scratch, or additional infrastructure might be required (as might ML engineering expertise). For the ambitious user, solutions might also be found on arXiv, an open library of frequently updated academic papers that describe novel ML approaches, often with code repositories used to demonstrate the reproducibility of the research.
When launching the ML development portion of POC-building, one should revisit the initial roadmap created during the stages of generating metrics and project specifications. Resources exist for each stage in that roadmap—from easy-to-use but potentially less flexible or performant third-party models to more complex but customizable software packages that can be fit for purpose as needed. Fortunately, with the maturity of the ML ecosystem, several high-quality software tools can be deployed and connected easily to cover all aspects of the ML development life cycle. This is even the case for newer models like LLMs.
So far, this series has covered four important first steps to success with an ML project: establishing goals and translating them to metrics, getting the dataset ready, creating a development environment to ensure reliability, and in this blog, developing the POC model. Keep following to learn the remaining stages of developing an ML POC and putting the results into production. The remaining blogs will cover important guidelines for extending the POC developed in this stage to a production-ready version and advice about what to anticipate and monitor post-deployment, especially regarding updating the model.
Becks Simpson is a Machine Learning Lead at AlleyCorp Nord, where developers, product designers, and ML specialists work alongside clients to bring their AI product dreams to life. In her spare time, she also works with Whale Seeker, another startup using AI to detect whales so that industry and these gentle giants can coexist profitably. She has worked across the spectrum of deep learning and machine learning, from investigating novel methods and applying research directly to real-world problems, to architecting pipelines and platforms for training and deploying AI models in the wild, to advising startups on their AI and data strategies.