GPT Problems Are Back in the News with Evidence of Model Drift
|
NEWS
|
ChatGPT has sparked the imagination of consumers globally. But the enterprise market remains stuck in the mud. There are many barriers to enterprise deployment and, among these, model drift is becoming an increasingly visible problem. Over the past month, several reports have emerged showing that public-facing, closed-source chatbots like ChatGPT are getting steadily worse at many tasks, including math and writing code. Researchers compared the accuracy and performance of GPT models over time and showed that accuracy (for certain use cases, such as identifying prime numbers) fell from over 90% to around 2%. They also found that ChatGPT became less transparent, no longer showing its step-by-step reasoning. Model drift, the degradation of a model’s predictive power over time, is caused by several factors:
- Changes in Data Distribution: The distribution of data a model encounters can shift over time relative to the data it was trained on, altering the model’s output. This can be caused by shifts in user behavior, changes in the underlying population, or changes in data collection methods (a minimal detection sketch follows this list).
- New Features: Adding new capabilities like summarization or chatbots means that old features become less informative, and outputs will be less accurate.
- Negative or Incorrect Reinforcement: Behavioral fine-tuning is one of the most popular methods to “improve” a model’s performance: users give model output a thumbs up or a thumbs down based on the perceived accuracy of the response. If this feedback is wrong, e.g., a thumbs down when the model generates a correct answer, the model is fine-tuned on erroneous feedback and its output may degrade over time.
- Conceptual Changes: Over time, the prompts a model receives change in difficulty or context, which may negatively affect the quality of the model’s responses.
- Forgetting: When models are fine-tuned on new datasets, they may forget or lose the ability to generate content that was “learned” in previous training.
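To make the data distribution point above more concrete, the sketch below shows one common way monitoring tools quantify distribution shift: the Population Stability Index (PSI), computed here over a hypothetical monitored feature (prompt length). The feature, function names, and the 0.2 alert threshold are illustrative assumptions, not the method of any specific vendor platform.

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """Population Stability Index between a baseline (training-time)
    sample and a current (production) sample of a numeric feature."""
    # Bin edges are derived from the baseline distribution
    edges = np.percentile(baseline, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values

    baseline_pct = np.histogram(baseline, edges)[0] / len(baseline)
    current_pct = np.histogram(current, edges)[0] / len(current)

    # Small floor avoids division by zero and log(0)
    baseline_pct = np.clip(baseline_pct, 1e-6, None)
    current_pct = np.clip(current_pct, 1e-6, None)

    return float(np.sum((current_pct - baseline_pct) * np.log(current_pct / baseline_pct)))

# Hypothetical example: prompt lengths seen at training time vs. in production
rng = np.random.default_rng(0)
train_prompt_lengths = rng.normal(60, 15, 10_000)
live_prompt_lengths = rng.normal(85, 25, 2_000)

psi = population_stability_index(train_prompt_lengths, live_prompt_lengths)
if psi > 0.2:  # common rule of thumb: PSI > 0.2 indicates a significant shift
    print(f"Data distribution drift detected (PSI = {psi:.2f}); review/retraining advised")
```

A metric like PSI, tracked per feature over time, gives operations teams an early, quantitative signal of the distribution changes described above, before output accuracy visibly degrades.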
Model drift is mostly a result of incorrect data inputs during training or fine-tuning, and it is not limited to “closed” models like ChatGPT. Other models better suited to enterprise deployment, such as LLaMA or Cohere, will also suffer. Model drift in the consumer domain is not ideal, but it is mostly harmless. As we move toward the enterprise market, however, it could be damaging and erode enterprise confidence. Model drift creates significant trustworthiness concerns: how can enterprises deploy a tool that could generate false output or mark the wrong inputs as fraudulent? In addition, rectifying the problem brings operational costs, which further erode the business case for deployment. Obviously, any new technology has a risk-reward trade-off, but model drift adds another dimension to the already complex case for generative AI deployment.
What Other Challenges Are Enterprises Facing?
|
IMPACT
|
Model drift is just one of the many challenges facing enterprise generative AI deployment. Other issues are explored below.
- Hallucinations: Generative AI models generate content based on probabilistic weights and biases developed through training, with limited understanding of context or the question itself. This means the model can produce false or incorrect responses, which could negatively impact enterprise workflows or hinder customer experience and relations.
- Bias: Poor or incorrect training datasets can result in generated text with inherent biases. Biases can be introduced in several ways, including social/cultural bias within training datasets and inaccurate feedback/negative reinforcement.
- Transparency: Many models simply take in prompts and generate output without explaining the reasoning behind the generated content. This means that enterprises do not know why the model generates certain content or how they can change it. This undermines confidence and hinders a user’s ability to use AI-generated content effectively.
- Performance: Speed, energy efficiency, and response accuracy all impact generative AI model performance for enterprises. Slow inferencing can significantly diminish the applicability of generative AI.
Model drift, bias, hallucinations, transparency, and performance all impact generative AI’s “trustworthiness,” which plays a significant role in the enterprise risk-reward debate. Enterprise use cases require consistency, fairness, and transparency, directly affecting the operational impact that generative AI can have on business outcomes and workflows. Therefore, stakeholders looking to drive enterprise deployment must look to implement processes and tools that target trustworthiness throughout the entire Machine Learning (ML) lifecycle.
What Do Stakeholders Need to Do to Accelerate Enterprise Deployment?
|
RECOMMENDATIONS
|
Enterprise interest in generative AI deployment is high, as there is a clear opportunity to create significant value. However, model drift headlines are going to make Chief Technology Officers (CTOs) and Chief Information Officers (CIOs), who are already worried about model trustworthiness, slam on the brakes. Encouraging enterprise deployment requires Machine Learning Operations (MLOps) platforms and support that specifically target trustworthiness and stability concerns:
- Implementation of Integrated Monitoring and Observability Platforms: MLOps tools should be combined to effectively support end-to-end deployment and management of data sources throughout the training and fine-tuning processes. Data curation, management, and monitoring tools should be implemented across platforms, with a focus on continuous improvement across the entire ML lifecycle.
- Clear Model Performance Metrics: Providing MLOps management tools that track prompt and output accuracy, with automated alerts and Continuous Integration/Continuous Delivery (CI/CD) integration, will ensure that enterprises are aware of model drift or hallucinations and can quickly mitigate their impact (a minimal sketch of such an automated gate follows this list).
- Focus on Model Transparency: Vendors should look to provide platforms/tools that help enterprises understand why certain prompts lead to certain outputs. This will help ensure that end users can use the tools effectively and understand how they can be iteratively improved to drive specific outcomes.
- Implement Behavioral Fine-Tuning to Align Output with User Expectations: Although this comes with additional risks, if done correctly, it can quickly align model output with expectations and drive accuracy.
- Fine-Tune Large Language Models (LLMs) for Specific Uses, Not Just Enterprises: Inaccuracy and poor performance can often be caused by using models for multiple use cases. Focusing smaller models (sub-15 billion parameters) on individual use cases can ensure that models do not suffer from conceptual changes or negative reinforcement-led drift.
- Vertical or Use Case-Specific Data Lakes: For many use cases and verticals, the availability of good-quality training/fine-tuning data is a major challenge. For example, manufacturing use cases like damage detection lack good data points upon which a model can be effectively trained. Developing data lakes with these data points would be a valuable service that improves access to good data and mitigates the risks of model drift/hallucinations.
- Provide Managed Services to Eliminate Model Control Risk: ML skills are limited for most enterprise verticals, with the majority cornered by the supply side of the market. Implementing managed services that provide oversight, support, and iterative development of generative AI models could bridge this skills gap, while mitigating trustworthiness risks inherent within these tools. This will not only encourage enterprise adoption, but will offer a new revenue opportunity for MLOps vendors.
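To illustrate the “Clear Model Performance Metrics” recommendation, the sketch below shows what an automated drift gate inside a CI/CD pipeline could look like: a fixed evaluation set is re-scored for each model version, and the pipeline raises an alert if accuracy falls materially below the recorded baseline. The evaluation prompts, the tolerance value, and the `model` callable are placeholder assumptions rather than any particular vendor’s API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical fixed evaluation set: (prompt, expected answer) pairs,
# echoing the prime-number test cited in the drift research above.
EVAL_SET: List[Tuple[str, str]] = [
    ("Is 7919 a prime number? Answer yes or no.", "yes"),  # 7919 is prime
    ("Is 1009 a prime number? Answer yes or no.", "yes"),  # 1009 is prime
    ("Is 1001 a prime number? Answer yes or no.", "no"),   # 1001 = 7 * 11 * 13
]

def evaluate(model: Callable[[str], str]) -> float:
    """Score a model callable against the fixed evaluation set."""
    correct = sum(
        1 for prompt, expected in EVAL_SET
        if model(prompt).strip().lower().startswith(expected)
    )
    return correct / len(EVAL_SET)

def drift_gate(model: Callable[[str], str],
               baselines: Dict[str, float],
               model_name: str,
               tolerance: float = 0.05) -> bool:
    """CI/CD gate: return False (and alert) if accuracy drops more than
    `tolerance` below the last recorded baseline for this model."""
    accuracy = evaluate(model)
    baseline = baselines.get(model_name, accuracy)
    if accuracy < baseline - tolerance:
        print(f"ALERT: {model_name} accuracy {accuracy:.0%} "
              f"vs. baseline {baseline:.0%} - possible model drift")
        return False
    baselines[model_name] = max(baseline, accuracy)  # ratchet the baseline upward
    return True

# Usage with a stand-in model; in practice, `model` would wrap the LLM endpoint under test
def degraded_model(prompt: str) -> str:
    return "no"  # deliberately broken stand-in that answers "no" to everything

baselines: Dict[str, float] = {"demo-llm": 1.0}
assert drift_gate(degraded_model, baselines, "demo-llm") is False
```

Wiring a check like this into the deployment pipeline turns the kind of accuracy regression reported for GPT models into a gating event rather than a post-hoc headline.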
Generative AI deployment will always mean enterprises increase their risk exposure—and model drift headlines will not ease concerns. However, vendors have plenty of opportunities to reduce these fears, while potentially even creating new revenue opportunities. Conceptually, MLOps vendors need to move from isolated tools toward end-to-end lifecycle ML platforms, as this will help with data quality and observability, which will quickly mitigate trustworthiness and accuracy fears.