Improving Classification Performance With Human Feedback: Label a Few, We Label the Rest

Eden Chung, Liang Zhang, Katherine Jijo, Thomas Clifford, Natan Vidra

arXiv:2401.09555v1 [cs.LG] 17 Jan 2024

Abstract

In the realm of artificial intelligence, where a vast majority of data is unstructured, obtaining substantial amounts of labeled data to train supervised machine learning models poses a significant challenge. To address this, we delve into few-shot and active learning, where our goal is to improve AI models with human feedback on a few labeled examples. This paper focuses on understanding how a continuous feedback loop can refine models, thereby enhancing their accuracy, recall, and precision through incremental human input. By employing Large Language Models (LLMs) such as GPT-3.5, BERT, and SetFit, we aim to analyze the efficacy of using a limited number of labeled examples to substantially improve model accuracy. We benchmark this approach on the Financial Phrasebank, Banking, Craigslist, Trec, and Amazon Reviews datasets to show that with just a few labeled examples we are able to surpass the accuracy of zero-shot large language models and provide enhanced text classification performance. We demonstrate that rather than needing to manually label millions of rows of data, we only need to label a few and the model can effectively predict the rest.

1 Introduction

In the world of AI, a significant challenge is handling massive amounts of data. Only 15 percent of this data is structured, while the remaining 85 percent is unstructured. For AI/ML models to work effectively, they usually need large labeled datasets, but obtaining such data is challenging. Traditional AI methods rely heavily on millions of rows to train models, pairing inputs (such as pictures or text) with the correct labels. The big question is: how can we gather this massive amount of labeled data efficiently?

Right now, businesses have to choose between leaving their data unlabeled (which is risky) or labeling it manually, possibly with AI assistance. While tools like Labelbox, Heartex, Datasaur, and Prodigy can help with labeling, they are not perfect. Training data often requires the expertise of subject-matter professionals, such as doctors, legal analysts, and financial analysts. Manual data labeling is tedious, time-consuming, and costly, and when business requirements change, relabeling the data over and over again is not sustainable. Furthermore, manual labeling does not even ensure the correctness of the data; in fact, it can often be incorrect, which is one of the limitations of relying solely on manual labeling. Even state-of-the-art datasets such as MNIST and ImageNet can contain incorrectly labeled data [1]. On the other hand, AI models such as GPT-3 and Claude, although certainly very helpful, can still hallucinate and return false data. Therefore, an approach that integrates human expertise with AI can accelerate labeling while ensuring accuracy and minimizing errors, especially for diverse data types from different industries.

One such approach that combines human effort with AI is programmatic labeling [2]. Instead of manually labeling each data point one by one, the user writes labeling functions that capture the reasoning behind the labeling, which can then be generalized to larger amounts of unlabeled data through AI. This has been a great step forward, allowing for scalability and adaptability.
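To make programmatic labeling concrete, here is a minimal sketch in the spirit of labeling functions: a few hand-written rules encode the labeling reasoning, and a simple vote generalizes them over unlabeled rows. The rules, label constants, and example texts are illustrative assumptions, not the paper's implementation or any particular library's API.

```python
# Illustrative sketch of programmatic labeling: hand-written labeling functions
# encode the reasoning, and a vote over their outputs labels the remaining rows.
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_contains_prize(text: str) -> int:
    # Heuristic (assumed): prize/lottery language usually indicates spam.
    return SPAM if any(w in text.lower() for w in ("won", "prize", "free")) else ABSTAIN

def lf_greeting(text: str) -> int:
    # Heuristic (assumed): ordinary conversational greetings are usually not spam.
    return NOT_SPAM if text.lower().startswith(("hi,", "hello", "hey")) else ABSTAIN

def lf_mentions_account(text: str) -> int:
    # Heuristic (assumed): unsolicited account/balance messages are often spam.
    return SPAM if "account" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_prize, lf_greeting, lf_mentions_account]

def programmatic_label(text: str) -> int:
    """Majority vote over labeling functions; abstain when no function fires."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

unlabeled = [
    "Congratulations! You've won a million dollars!",
    "Hi, How are you doing? Let's catch up soon.",
]
print([programmatic_label(t) for t in unlabeled])  # -> [1, 0]
```

In practice the voting step is replaced by a learned label model (as in the programmatic-labeling tooling cited above), but the division of labor is the same: humans write the reasoning once, and the system applies it to the rest of the data.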
However, there are several limitations to programmatic labeling. Because labeling functions generalize over the data, the model may struggle with ambiguous or nuanced cases, and labeling functions may fail to capture subtle patterns that a human would catch. In addition, as mentioned before, humans can make errors in labeling, and if these errors occur in the labeling functions themselves, they propagate throughout the dataset instead of being limited to a single incorrect data point. Finally, like manual labeling, programmatic labeling is still relatively resource-intensive.

Novel NLP research in Large Language Models and few-shot learning has changed the way data labeling is done. GPT-3, introduced in "Language Models Are Few-Shot Learners" [3] in 2020, demonstrated the ability of LLMs to learn from minimal data. LaMDA [4] and PaLM [5] contributed significantly to the enhancement of language models. In 2021, "Want To Reduce Labeling Cost? GPT-3 Can Help" was published [6]. The paper highlighted that labels produced by GPT-3 are notably more economical, both in terms of computational resources and potentially time, than labels acquired from human experts. Furthermore, even with these cost benefits, models trained on GPT-3-generated labels demonstrated performance comparable to models trained on human-provided labels.

Out of this research, an approach that combines AI and human effort, called few-shot learning, became practical. Instead of labeling millions of data points, we can work with just a few thousand. This approach lets AI learn from a small amount of labeled data and then make educated guesses for the rest. Large language models such as GLaM [7], Flamingo [8], and the Alexa Teacher Model [9] all demonstrate the expanding ability of AI models to learn efficiently from limited data. These models have set new benchmarks in the field, showcasing the power of few-shot learning in diverse applications, from language understanding to visual recognition and beyond. In 2022, Hugging Face announced SetFit, an efficient framework for few-shot fine-tuning of Sentence Transformers that achieves high accuracy with little labeled data [10].

Few-shot learning is a technique that guides a model using only a minimal number of examples. By incorporating feedback from industry experts and integrating few-shot learning techniques, we can significantly improve the overall accuracy of a model with a small amount of labeled data provided by the expert. By combining state-of-the-art transformers with few-shot learning techniques, the system learns in real time, assisting data annotators with the labeling process.
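As a concrete illustration of few-shot fine-tuning with SetFit, the sketch below trains on a handful of labeled sentences and then predicts the rest. It assumes the pre-1.0 SetFitTrainer API, and the dataset, configuration, checkpoint name, and sample sizes are illustrative choices rather than the paper's exact setup.

```python
from datasets import load_dataset
from setfit import SetFitModel, SetFitTrainer

# A few labeled examples stand in for expert-provided labels (sizes are illustrative).
dataset = load_dataset("financial_phrasebank", "sentences_allagree")["train"]
dataset = dataset.shuffle(seed=42)
train_ds = dataset.select(range(32))        # the "label a few" part
eval_ds = dataset.select(range(32, 532))    # held-out sentences for evaluation

# Start from a pretrained Sentence Transformer and fine-tune it contrastively.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    column_mapping={"sentence": "text", "label": "label"},
)
trainer.train()
print(trainer.evaluate())  # e.g. {'accuracy': ...}

# The fine-tuned model then labels the remaining, unlabeled rows.
print(model.predict(["Operating profit rose compared to the previous quarter."]))
```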
2 Background

LLMs have evolved vastly over time, mainly due to the architecture behind the models. Several main architectural approaches have been used; we will discuss Recurrent Neural Networks (RNNs) and Transformers.

2.1 Recurrent Neural Networks (RNN)

Recurrent Neural Networks (RNNs), the earlier architecture behind LLMs, were first conceived in 1986, but the architecture was only fully realized in 1997. RNNs have evolved through three distinct phases: Overloaded Single Memory (Vanilla RNN), Multiple Gate Memories (LSTM), and the Encoder-Decoder architecture [11].

2.2 Transformers

With technology constantly changing and deep learning models improving over time, RNNs began to be replaced with a new architecture: the transformer. First introduced in 2017, transformers have since revolutionized the field of LLMs and NLP. Beginning with BERT in 2018, models have continued to evolve with GPT-3 in 2020 and LaMDA in 2021 [12], showing that LLMs continue to improve, and it is clear that these models will only get more capable with time. Transformers are now the core architecture behind many NLP models such as ChatGPT and Bard. Instead of having to consider words sequentially, as RNN models do, they can analyze all words in a sequence simultaneously. As a result, transformers adapt well, and many LLMs exist today for many different use cases.

Figure 1: Transformer architecture [13]

Footnotes:
[1] Curtis Northcutt, Anish Athalye, and Jonas Mueller. Label Errors in ML Test Sets. https://labelerrors.com, accessed 2023-12-23.
[2] Programmatic labeling. https://snorkel.ai/programmatic-labeling/, accessed 2023-12-23.
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language Models are Few-Shot Learners, 2020. arXiv:2005.14165.
[4] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, et al. LaMDA: Language Models for Dialog Applications, 2022. arXiv:2201.08239.
[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, et al. PaLM: Scaling Language Modeling with Pathways, 2022. arXiv:2204.02311.
[6] Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. Want To Reduce Labeling Cost? GPT-3 Can Help, 2021. arXiv:2108.13487.
[7] Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, 2022. arXiv:2112.06905.
[8] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022. arXiv:2204.14198.
[9] Jack FitzGerald, Shankar Ananthakrishnan, Konstantine Arkoudas, Davide Bernardi, et al. Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 2022. DOI: 10.1145/3534678.3539173.
[10] Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, and Oren Pereg. Efficient Few-Shot Learning Without Prompts, 2022. arXiv:2209.11055.
[11] Chen Yanhui. A Battle Against Amnesia: A Brief History and Introduction of Recurrent Neural Networks. https://towardsdatascience.com/a-battle-against-amnesia-a-brief-history-and-introduction-of-recurrent-neural-networks-50496aae6740, accessed 2023-12-17.
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need, 2023. arXiv:1706.03762.
[13] Liz McQuillan. Deep Learning 101: What Is a Transformer and Why Should I Care? https://www.saltdatalabs.com/blog/deep-learning-101/what-is-a-transformer-and-why-should-i-care, accessed 2023-12-17.

3 Process and Work

To improve the accuracy of ML models, a human feedback approach is introduced to enhance text classification accuracy. Similar to providing students with clear examples for better understanding, the strategy involves fine-tuning models through human feedback, that is, by providing additional labeled examples.

To evaluate whether few-shot learning is really effective, this study uses a continual test-and-evaluation loop. The process consists of repeatedly measuring a dataset's accuracy, precision, and recall while varying the number of labeled data points provided. Starting from 10 labeled data points, human feedback is applied after each iteration: 10 incorrect predictions are chosen and fed back into the model with the correct answers, and the cycle repeats. So, on each iteration, the number of labeled data points increases by 10.
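The loop below is a minimal sketch of this test-and-evaluate cycle: start from 10 labeled rows and, on each iteration, hand 10 misclassified rows back to the model with their correct labels. A TF-IDF plus logistic regression classifier stands in for the LLM-based models used in the paper, and the function and parameter names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def feedback_loop(texts, labels, iterations=5, batch=10, seed=0):
    """Grow the labeled pool by `batch` human-corrected errors per iteration."""
    rng = np.random.default_rng(seed)
    texts, labels = np.array(texts), np.array(labels)
    # Initial pool of `batch` labeled rows (assumes it covers at least two classes).
    labeled = list(rng.choice(len(texts), size=batch, replace=False))

    for it in range(iterations):
        # Stand-in classifier; the paper fine-tunes BERT, SetFit, or GPT-3.5 instead.
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(texts[labeled], labels[labeled])

        unlabeled = [i for i in range(len(texts)) if i not in labeled]
        preds = model.predict(texts[unlabeled])
        acc = accuracy_score(labels[unlabeled], preds)
        print(f"iteration {it}: labeled pool = {len(labeled)}, accuracy = {acc:.3f}")

        # Human feedback step: take up to `batch` misclassified rows and add them
        # to the pool with their correct (ground-truth) labels.
        wrong = [i for i, p in zip(unlabeled, preds) if p != labels[i]]
        labeled.extend(wrong[:batch])
    return model
```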
3.1 Data Sets

In our machine learning research, we utilized six distinct datasets for various analyses. The Amazon reviews dataset consisted of a training set of 6,001 rows and a testing set of 2,001 rows, with labels 'Excellent', 'Very Good', 'Neutral', 'Good', and 'Bad'. The banking dataset had a training set of 200 rows and a testing set of 2,000 rows, with 77 distinct labels such as 'cash received', 'fiat currency support', and 'pin blocked'. The Craigslist dataset had 201 rows in its training set and 1,001 rows in its testing set, covering categories such as phone, furniture, housing, electronics, and car. The Financial Phrasebank dataset had 4,850 rows of data with the categories positive, negative, and neutral. Lastly, the Trec dataset offers both a coarse label set of 6 labels (ABBR, ENTY, DESC, HUM, LOC, and NUM) and a fine label set of 50 detailed labels such as expression abbreviated, animals, organ of body, color, invention, book, and other creative pieces. The Trec dataset consists of 5,452 rows in training and 500 rows in testing.

Figure 2: Sample of the Trec Data Set
Figure 3: Sample of the Finance Phrasebank Set

4 Approach

In text classification, key metrics play a pivotal role. The probability metric indicates the confidence level of a model's predictions, while the entropy metric measures the uncertainty of a prediction within the dataset. Higher entropy suggests that the predicted label has a greater impact on the overall classification or decision-making process.

To initiate our process, we utilize zero-shot models such as Claude, GPT, BERT, and SetFit to derive initial predictions from the provided data. To make our predictions concrete for a spam versus non-spam dataset, we first examine the zero-shot model predictions (Table 1). The first step in active learning from human feedback is to generate a prioritized list of edge cases: instances where the model's predictions demonstrate uncertainty or reduced confidence. Once this list is established, human evaluators can review the cases and assign the appropriate labels. In Table 2, we can see that the actual label of the reviewed row is not spam. By iterating through this process, the model is once again organized based on entropy and probability values. Upon this second evaluation, you will notice a decrease in entropy and an increase in probability scores, highlighting the model's enhanced confidence and refined predictions.

We iterate on this process again. The model is once again sorted by entropy, and we review the row (edge case) with the highest potential for impact. Here, the sentence "Dear customer, Your account balance is low." has the highest entropy of the non-labeled data points. We label this row as spam, largely because the sentence refers to an account balance. Notice how this row of text data, with its corresponding label, is added to the annotation history once we confirm the annotation.
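Below is a minimal sketch of the entropy-based prioritization described above, assuming a model that returns a probability distribution over labels for each text (as the zero-shot models in Table 1 do). The helper names and the toy probabilities are illustrative; the entropy and confidence computations themselves are standard.

```python
import numpy as np

def prediction_entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy of each row of class probabilities (higher = less certain)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log2(p)).sum(axis=1)

def prioritize_for_annotation(texts, probs, k=10):
    """Return the k texts whose predictions are most uncertain (highest entropy)."""
    entropy = prediction_entropy(probs)
    confidence = probs.max(axis=1)
    order = np.argsort(-entropy)[:k]
    return [(texts[i], float(confidence[i]), float(entropy[i])) for i in order]

# Toy example mirroring the spam / not-spam setting of Table 1.
texts = ["Dear customer, Your account balance is low.",
         "Congratulations! You've won a million dollars!"]
probs = np.array([[0.58, 0.42],   # uncertain prediction: high entropy, reviewed first
                  [0.03, 0.97]])  # confident prediction: low entropy
print(prioritize_for_annotation(texts, probs, k=1))
```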
Table 1: Zero-shot model predictions

Text Body | Predicted | Probability | Entropy
Important notice: Your package has been delivered. | not spam | 0.65 | 0.88
Dear customer, Your account balance is low. | not spam | 0.58 | 0.82
Hi, How are you doing? Let's catch up soon. | not spam | 0.58 | 0.75
Urgent notice: Last chance to update your personal information. | spam | 0.92 | 0.51
Hi there, You have won a free vacation! Claim now! | spam | 0.95 | 0.42
Congratulations! You've won a million dollars! | spam | 0.97 | 0.36

Table 2: Spam Classification Results

Text | Actual Label | Predicted | Probability | Entropy
Important notice: Your package has been delivered. | not spam | not spam | 1.0 | 0
Urgent notice: Last chance to update your personal information. | - | not spam | 0.88 | 0.28
Dear customer, Your account balance is low. | - | spam | 0.70 | 0.76
Hi, How are you doing? Let's catch up soon. | - | not spam | 0.65 | 0.68
Congratulations! You've won a million dollars! | - | spam | 0.92 | 0.22
Hi there, You have won a free vacation! Claim now! | - | spam | 0.80 | 0.25

Table 3: Iterated Model Predictions - First Iteration

Text | Actual Label | Predicted | Probability | Entropy
Important notice: Your package has been delivered. | not spam | not spam | 1.0 | 0
Dear customer, Your account balance is low. | - | spam | 0.70 | 0.76
Hi, How are you doing? Let's catch up soon. | - | not spam | 0.65 | 0.68
Urgent notice: Last chance to update your personal information. | - | not spam | 0.88 | 0.28
Hi there, You have won a free vacation! Claim now! | - | spam | 0.80 | 0.25
Congratulations! You've won a million dollars! | - | spam | 0.92 | 0.22

After this second label, the model's predictions, uncertainty, and entropy are adjusted once more.

Table 4: Iterated Model Predictions - Second Iteration

Text | Actual Label | Predicted | Probability | Entropy
Important notice: Your package has been delivered. | not spam | not spam | 1.0 | 0
Dear customer, Your account balance is low. | spam | spam | 1.0 | 0
Hi, How are you doing? Let's catch up soon. | - | not spam | 0.80 | 0.62
Urgent notice: Last chance to update your personal information. | - | spam | 0.86 | 0.18
Hi there, You have won a free vacation! Claim now! | - | spam | 0.75 | 0.20
Congratulations! You've won a million dollars! | - | spam | 0.88 | 0.18

Notice how the entropy decreases over time as more human labels are added and the model becomes more stable. After a few labels, the entropy stabilizes and no further labels are needed (additional labels improve accuracy by less than the labeling effort is worth).

5 Evaluating Method

We primarily employed the BERT, SetFit, and GPT-3.5 Turbo models for our evaluation. Initially, we tasked these models with predicting outcomes on the testing set. This helped us identify areas where the models exhibited weakness or lower confidence. For incorrect predictions made by the models, we adjusted the labels of 10 rows to reflect the correct classifications. We then fine-tuned the models on this corrected data, simulating human feedback, and measured accuracy, precision, and recall on the testing set after the feedback. Subsequently, we moved these 10 rows from the testing set into the training set. We repeated this process iteratively, measuring accuracy, precision, and recall at each step, and continued up to 150 labels.
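To make the per-iteration bookkeeping explicit, the helper below computes the three metrics reported in this paper (accuracy, precision, and recall) for one evaluation round. Macro averaging and the zero_division handling are assumptions, since the paper does not state which averaging it uses.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluation_round(y_true, y_pred, n_labels_used):
    """One row of the results history: metrics on the current testing set."""
    return {
        "labels_used": n_labels_used,  # 10, 20, ..., up to 150
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }

# Example: metrics after one feedback iteration on a tiny spam / not-spam test set.
history = []
history.append(evaluation_round(
    y_true=["spam", "not spam", "spam", "not spam"],
    y_pred=["spam", "not spam", "not spam", "not spam"],
    n_labels_used=10,
))
print(history[-1])
```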
6 Results

After fine-tuning the models with human feedback and additional labeled examples, we observed a consistent improvement in accuracy across the different datasets. Notably, our experiments revealed that targeted training in areas where the model is weak played a pivotal role. This iterative approach allowed the model to gradually enhance its proficiency in handling specific domains. To see the raw results and code in more detail, please access them here.

Figure 4: Amazon Dataset
Figure 5: Amazon Dataset Plot
Figure 6: Banking Dataset
Figure 7: Banking Dataset Plot
Figure 8: Craigslist Dataset
Figure 9: Craigslist Dataset Plot
Figure 10: Financial Phrasebank Dataset
Figure 11: Financial Phrasebank Dataset Plot
Figure 12: Trec Coarse Label Dataset
Figure 13: Trec Coarse Dataset Plot
Figure 14: Trec Fine Label Dataset
Figure 15: Trec Fine Label Dataset Plot

7 Conclusion

Our investigation into few-shot learning and active learning methodologies shows promise for enhancing language models with minimal labeled data. By using a continuous feedback loop and integrating human expertise, we produced notable improvements in model accuracy, recall, and precision across diverse domain-specific datasets. The ability to leverage a minimal number of labels to refine a model will be extremely beneficial to businesses: companies will be able to maintain model performance while minimizing the resources typically required for manual labeling.

While our research demonstrates the potential of few-shot learning with human feedback, there are some limitations. Our study focused on a limited number of models, namely GPT, BERT, and SetFit. To gain a more comprehensive understanding of the applicability of few-shot learning across various model architectures, the work could be expanded to a more diverse array of models, for example newer architectures such as T5, Transformer-XL, or different BERT and GPT-4 variants. Experimenting with additional models could provide deeper insight into the effectiveness of these approaches.

8 Next Steps

To address these limitations, follow-up research could train the models on datasets with complex taxonomies and intricate label hierarchies. Employing datasets from domains such as the medical, legal, or financial fields with complex taxonomies would challenge the model's ability to generalize. For classification problems with millions of rows of data and hundreds to thousands of categories and subcategories, the accuracy of these models will be of significant importance for ROI in specific domains. This will establish the foundation for using complex, industry-specific data to train more accurate classification models, bridging the gap between model performance and real-world application.

9 References

1. What are Large Language Models (LLM)? https://aws.amazon.com/what-is/large-language-model/, accessed 2023-12-17.
2. Chen Yanhui. A Battle Against Amnesia: A Brief History and Introduction of Recurrent Neural Networks. https://towardsdatascience.com/a-battle-against-amnesia-a-brief-history-and-introduction-of-recurrent-neural-networks-50496aae6740, accessed 2023-12-17.
3. Kiel Dang. Language Model History — Before and After Transformer: The AI Revolution. https://medium.com/@kirudang/language-model-history-before-and-after-transformer-the-ai-revolution-bedc7948a130, accessed 2023-12-17.
4. Sanchit Goel. Evolution of Transformers — Part 1. https://sanchman21.medium.com/evolution-of-transformers-part-1-faac3f19d780, accessed 2023-12-17.
5. Liz McQuillan. Deep Learning 101: What Is a Transformer and Why Should I Care? https://www.saltdatalabs.com/blog/deep-learning-101/what-is-a-transformer-and-why-should-i-care, accessed 2023-12-17.
6. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.
Attention Is All You Need, 2023. arXiv:1706.03762.
7. Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. Want To Reduce Labeling Cost? GPT-3 Can Help, 2021. arXiv:2108.13487.
8. Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, and Oren Pereg. Efficient Few-Shot Learning Without Prompts, 2022. arXiv:2209.11055.
9. Curtis Northcutt, Anish Athalye, and Jonas Mueller. Label Errors in ML Test Sets. https://labelerrors.com, accessed 2023-12-23.
10. Programmatic labeling. https://snorkel.ai/programmatic-labeling/, accessed 2023-12-23.
11. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language Models are Few-Shot Learners, 2020. arXiv:2005.14165.
12. Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, et al. LaMDA: Language Models for Dialog Applications, 2022. arXiv:2201.08239.
13. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, et al. PaLM: Scaling Language Modeling with Pathways, 2022. arXiv:2204.02311.
14. Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, 2022. arXiv:2112.06905.
15. Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022. arXiv:2204.14198.
16. Jack FitzGerald, Shankar Ananthakrishnan, Konstantine Arkoudas, Davide Bernardi, et al. Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 2022. DOI: 10.1145/3534678.3539173.

10 Appendix A: Dataset Descriptions

1. Amazon Dataset
Description: Amazon product reviews
Source: Aymeric Roucher on HuggingFace
Size: 6,000 rows
URL: https://huggingface.co/datasets/A-Roucher/amazon_product_reviews_datafiniti

2. Finance Phrasebank Dataset
Description: Finance phrases from news along with a rating of whether they are negative, neutral, or positive
Source: HuggingFace
Size: 4,850 rows
URL: https://huggingface.co/datasets/financial_phrasebank

3. TREC Dataset
Description: Text Retrieval Conference (TREC) questions with coarse labels and fine labels
Source: HuggingFace
Size: 5,450 rows
URL: https://huggingface.co/datasets/trec
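For reference, these datasets can be pulled directly from the Hugging Face Hub. The sketch below uses dataset IDs inferred from the URLs above and assumes the 'sentences_allagree' configuration for the Financial Phrasebank; exact IDs, configurations, and splits may differ from what was used in the experiments.

```python
from datasets import load_dataset

# Dataset IDs inferred from the Appendix URLs; configurations are assumptions.
amazon = load_dataset("A-Roucher/amazon_product_reviews_datafiniti")
phrasebank = load_dataset("financial_phrasebank", "sentences_allagree")
trec = load_dataset("trec")

print(amazon)             # available splits and row counts
print(trec["train"][0])   # one TREC question with its coarse and fine labels
```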