Leveraging LLMs for Email Processing in Customer Centres
Document Details
2024
Joanna Lenczuk
Summary
This presentation describes how large language models (LLMs) can improve the processing of customer emails in customer centres, with benefits such as increased efficiency and reduced costs. It presents a case study with UK Power Networks in which LLMs classify emails, assess their sentiment, summarise them, and draft responses, with the aim of streamlining the process and improving customer satisfaction.
Full Transcript
LEVERAGING LLMS FOR EMAIL PROCESSING IN CUSTOMER CENTRES
Joanna Lenczuk | CKDelta
06/12/2024
©2024 Databricks Inc. — All rights reserved

AGENDA
Leveraging LLMs for Email Processing in Customer Centres
- About CKDelta
- Overview of the Challenge
- Workflow Walkthrough
- Email Classification
- Email Sentiment
- Email Summarisation
- Email Responses
- Model Performance & Cost Comparison
- Challenges & Conclusions

ABOUT CKDELTA

ABOUT US
CKDelta builds intelligent applications designed to provide enhanced insight into business performance
Industries:
- Utilities
- Logistics
- Transport
- Retail
- Financial services
Our goals:
- Reduce costs by creating efficiency
- Increase revenue by driving innovation
- Enhance safety
- Improve sustainability

ABOUT ME
My team and I delivered the first implementation of the Virtual Customer Agent
Joanna Lenczuk
- Data scientist at CKDelta
- Predictive modelling for utilities and logistics
- MLOps implementations
- LLMs for customer communication support

OVERVIEW OF THE CHALLENGE

ABOUT THE CUSTOMER
The first implementation was adopted for UK Power Networks, but it is a common use case and can be adopted for different industries and modes of contact
UK Power Networks:
- The largest electricity distributor in the UK
- Maintains electricity cables and lines in London, the South-East and East of England
- Supplies energy to 19 million people
Customer Centre:
- Email is the main channel of communication
- 20 agents working full time on handling requests and inquiries
- B2B inbox where engineers can raise technical questions and requests
PROBLEM TO SOLVE
The customer email review process is long and prone to errors
- Slow response time
- Manual process of handling and distributing emails
- Increasing number of emails
- Long and complex inquiries
- Low customer satisfaction
- Risk of regulatory fines

SOLUTION OVERVIEW
Improving and accelerating the process of reviewing customer emails
Flow:
1. Categorise (Job, Question, Lower-Priority)
2. Sentiment
3. Summarise
4. Respond

BENEFITS
Tangible benefits for both UK Power Networks and their customers
- 2.5h saved per day: The team leader spent ~3h every day manually categorising emails. Automating email classification saves 30% of their time. An agent spent ~3h every day reviewing emails. Reading summaries takes on avg. 17% of that time, freeing 2.5h each day.
- +19% improvement in identifying the most urgent emails: The Customer Centre team used to manually identify 79% of the most urgent emails. Our model identifies 98% of them, significantly reducing the risk of regulatory fines.
- 5s time to process an email: Providing email categories, sentiment, summaries and draft responses takes 5-10 secs, which translates to higher customer satisfaction. The manual process used to take 1.5 days.

WORKFLOW WALKTHROUGH

WORKFLOW OVERVIEW
Building an end-to-end solution in partnership with Databricks and Microsoft
1. Load email data
2. Store emails in Azure Blob Storage
3. Trigger workflow when a new email arrives
4. Clean email data, then:
   1. Assign email category
   2. Identify email sentiment
   3. Write email summary
   4. Prepare email response
5. Send model outputs: tag email with a category & sentiment, add a summary and draft a response
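The workflow steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the implementation used in the project: `clean_email`, `pick_label`, `process_email`, the prompts and the `fake_llm` stand-in are all hypothetical names, and the "Unknown" fallback for sentiment is an assumption (the deck only mentions "Unknown" for categories). A real deployment would pass in a function that calls a hosted model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a model call: takes a prompt, returns text.
LLM = Callable[[str], str]

CATEGORIES = ("Job", "Question", "Lower-Priority")
SENTIMENTS = ("Urgent", "Low-Pressure")


@dataclass
class ProcessedEmail:
    category: str
    sentiment: str
    summary: str
    draft_response: str


def clean_email(raw: str) -> str:
    """Placeholder cleaning step: trim each line and drop blank lines."""
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def pick_label(answer: str, allowed: tuple[str, ...], fallback: str) -> str:
    """Map a free-text model answer onto one allowed label, else the fallback."""
    normalised = answer.strip().lower()
    for label in allowed:
        if normalised == label.lower():
            return label
    return fallback


def process_email(raw: str, llm: LLM) -> ProcessedEmail:
    """Run the four model steps in order on a cleaned email body."""
    body = clean_email(raw)
    category = pick_label(
        llm(f"Classify as one of {CATEGORIES}:\n{body}"), CATEGORIES, "Unknown"
    )
    # "Unknown" as a sentiment fallback is an assumption of this sketch.
    sentiment = pick_label(
        llm(f"Label sentiment as one of {SENTIMENTS}:\n{body}"), SENTIMENTS, "Unknown"
    )
    summary = llm(f"Summarise this email in two sentences:\n{body}").strip()
    draft = llm(f"Draft a reply to this email:\n{body}").strip()
    return ProcessedEmail(category, sentiment, summary, draft)


def fake_llm(prompt: str) -> str:
    """Canned answers so the sketch runs without any API access."""
    if prompt.startswith("Classify"):
        return "Job"
    if prompt.startswith("Label sentiment"):
        return "urgent"
    if prompt.startswith("Summarise"):
        return "Customer requests a new connection."
    return "Thank you for your email. We have logged your request."


result = process_email("Hello,\n\n  Please arrange a new connection urgently.\n", fake_llm)
print(result)
```

Passing the model call in as a plain function keeps the steps testable offline; any answer that does not match an allowed label falls through to "Unknown" so it can be routed to manual review.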
TECHNICAL SOLUTION
All email-processing steps were implemented using Databricks
OpenAI
- Clean email body: customer details, internal numbers and departments, disclaimers, warnings, signatures, separators, …
- Email classification: Job, Question, Lower-Priority (LP), Unknown (if class can't be determined)
- Email sentiment: Urgent, Low-Pressure
- Email summary: Job, Question
- Email response: Request (template filled with details provided in the email)
- LP: no action
- Unknown: manual review

EMAIL CLASSIFICATION

EMAIL CATEGORIES
Out of three email categories, "Job" is the most important one
JOB
- Customer requests for actions
- Typical requests: new connections, service alterations, upgrades, downgrades, substation work
- Indication of the category: emails with attachments, including filled forms
QUESTION
- Requests for updates on existing jobs or questions regarding them
- Inquiries potentially leading to creating new jobs
- Indication of the category: communication for other teams' attention
LOWER-PRIORITY (LP)
- Automated responses, gratitude expressions, meeting notifications
- Spam emails
- Indication of the category: subject line includes `meeting forward notification:`, `automatic reply:`, `accepted:`

EVALUATION DATASET
Evaluating on 600 emails with an equal representation of each category
- 600 randomly selected emails, 200 from each category
- Complexity in retrieving labels
- Each email with a clean email body and a subject line free from post-processing alterations

| Email category | Number of emails |
| Job | 200 |
| Question | 200 |
| LP | 200 |
Table 1. Email categories representation

CLASSIFICATION RESULTS
LLMs improved the identification of job-related emails by 19 percentage points
Model based on OpenAI GPT 3.5 Turbo
- Attained overall accuracy of 75%
- Identified 98% of Jobs
- ~1.5% of emails fell under the 'Unknown' category and required manual review
Human benchmark
- Overall accuracy: 68%
- Jobs identified: 79%

| True category \ Predicted category | Job | Question | LP | Unknown |
| Job | 195 | 3 | 0 | 1 |
| Question | 75 | 125 | 1 | 0 |
| LP | 32 | 29 | 131 | 8 |
| Unknown | 0 | 0 | 0 | 0 |
Table 1. Confusion matrix

| | Precision | Recall | F1-score | Support |
| Job | 0.65 | 0.98 | 0.78 | 199 |
| Question | 0.80 | 0.62 | 0.70 | 201 |
| LP | 0.99 | 0.66 | 0.79 | 200 |
| Unknown | 0.00 | 0.00 | 0.00 | 0.00 |
| Accuracy | 0.75 | 0.75 | 0.75 | 0.75 |
| Weighted avg. | 0.81 | 0.75 | 0.76 | 600 |
Table 2. Classification error metrics

EMAIL SENTIMENT

EMAIL SENTIMENT CLASSIFICATION
The priority is to identify and address urgent emails
URGENT (negative sentiment)
- Customer clearly states the request is urgent
- Customer seems to be impatient
- Emails have been back and forth without a clear resolution
LOW-PRESSURE (neutral sentiment)
- Requests with formal tones
- BAU approach
- An email chain can be long, but there's an agent assigned to resolve the query

EMAIL SENTIMENT CLASSIFICATION RESULTS
Using LLMs enabled identifying 80% of urgent emails
- Labels assigned manually after consulting the SMEs
- Model based on OpenAI GPT 3.5 Turbo
- Attained an overall accuracy of 93%
- Identified 80% of all urgent emails

| Sentiment class | Precision | Recall | F1-score |
| Urgent | 1.0 | 0.8 | 0.89 |
| Low-Pressure | 0.9 | 1.0 | 0.95 |
| Accuracy | 0.93 | 0.93 | 0.93 |
Table 1. Error metrics for sentiment classification

EMAIL SUMMARISATION

SUMMARISATION RESULTS
Summaries reduced the time needed to review emails by up to 90%
- Model based on OpenAI GPT 3.5 Turbo
- Attained a semantic textual overlap (similarity of the meaning, regardless of the phrasing) of 83%
- The reading time of long email chains was reduced by 82-90%
- SMEs confirmed the reliability and completeness of summaries after manual review

| Metric name | Score |
| Semantic Textual Overlap | 0.83 |
| Precision (Information Retrieval Metric) | 0.68 |
| Key-phrase Overlap | 0.56 |
Table 1. Error metrics for summarisation

| | Word count | Reading time | Word count of summary | Reading time of summary |
| Job | 300 | 90s | 52 | 15.6s |
| Question | 587 | 176.1s | 54 | 16.2s |
Table 2. Reading-time metrics for summarisation

EMAIL RESPONSES

TEMPLATE-BASED EMAIL RESPONSES
Determining if the customer's query matches a scenario covered by a template
Scenario Matching:
1. Inquiring about the fuse sizes of an existing connection
2. Inquiring about the available capacity of a connection
3. Inquiring about the available capacity of a network
Response Generation (if a scenario matches):
- Using a provided template to generate a response
- Modifying the template based on details in the customer's email
- Addressing the customer's inquiry and guiding them through the steps

EMAIL RESPONSES RESULTS
The initial results highlight the complexity of each template scenario
- 100% of emails matching the template scenario were identified and responded to.
- 25% of emails not related to any template received an unnecessary response.
Learnings:
- More labelled examples are needed
- The subjectivity of generated responses is a significant challenge
- Focusing on a narrow use case is the first step to generating reliable responses

MODEL PERFORMANCE & COST COMPARISON
DISTILBERT RESULTS
DistilBERT shows promising results and can reduce costs in the future
- Model evaluated on the same 600 emails used for the original task
- DistilBERT fine-tuned on a different batch of 600 balanced-class emails
- The model is much smaller and can be run on a small GPU cluster or on CPU
- Promising results for optimising costs in the future

| | DistilBERT | OpenAI |
| Job (Recall) | 0.80 | 0.98 |
| Question (F1) | 0.71 | 0.70 |
| LP (Precision) | 0.9 | 0.99 |
| Overall (Accuracy) | 0.79 | 0.75 |
Table 1. Error metrics for DistilBERT vs. OpenAI

META LLAMA V2 RESULTS
Meta Llama V2 performs significantly worse than OpenAI GPT 3.5 Turbo
- Model evaluated on the same 600 emails used for the original task
- Two versions of the Llama V2 model tested: 13B and 70B
- Both versions have significantly worse results than OpenAI GPT 3.5 Turbo, especially regarding queries

| | 13B Meta Llama V2 | 70B Meta Llama V2 | OpenAI |
| Job (Recall) | 0.98 | 0.81 | 0.98 |
| Question (F1) | 0.19 | 0.46 | 0.70 |
| LP (Precision) | 0.95 | 0.99 | 0.99 |
| Overall (Accuracy) | 0.55 | 0.67 | 0.75 |
Table 1. Error metrics for Meta Llama V2 vs. OpenAI

ASSUMPTIONS FOR COSTS COMPARISON
The costs comparison assumes 3 million tokens are used every hour
Costs based on the average usage seen on sample runs:
- 3 million tokens per hour

| Assumption | Value |
| Hours online | 10 |
| Days working | 6 |
| Number of weeks | 4 |
Table 1. Assumptions for costs comparison

COSTS COMPARISON
There is potential for reducing operating costs of using LLMs in the long run
[Figure: relative cost against number of tokens in millions. Labels: small open-source models; serverless endpoints; using LLMs for the most complex tasks]

CONCLUSIONS & CHALLENGES
CHALLENGES
The biggest challenge is reducing costs while maintaining high performance

| Step | Challenge | Approach |
| 0 Cleaning | High costs associated with using OpenAI API | Hybrid approach (rule-based & ML & open-source LLMs) |
| 1 Classification | Misclassification of job-emails | LLM-as-Judge; misclassification intervention |
| 2 Sentiment | Sentiment difficult to distinguish | Continuous improvement based on feedback |
| 3 Summaries | LLM evaluation and internal knowledge | LLM-as-Judge; fine-tuning LLMs |
| 4 Responses | Risks associated with direct exposure to customers | Human in the loop; focusing on limited scenarios |

CONCLUSIONS
The solution can be easily implemented for different industries and channels of customer communication
- Using LLMs improves the identification of the most urgent emails by 19 percentage points.
- It saves 2.5 hours of the team's time every day, allowing them to focus on the most complex queries and personalised support.
- Automated email processing makes it possible to reply to the customer in 5 seconds instead of 1.5 days.
- Using OpenAI and tuning prompts enables fast iterations, crucial at the early stage of development.
- There's potential for reducing operating costs by using open-source models on small machines and limiting the use cases handled by LLMs.
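The headline figures in the conclusions can be re-derived from numbers reported earlier in the deck. The following is a minimal sketch of that arithmetic, using the confusion matrix from the classification results, the 79% human benchmark, the quoted 17% reading-time share, and the cost-comparison assumptions. The function and variable names are illustrative, and treating the cost assumptions as a single four-week month is an assumption of this sketch.

```python
# Confusion matrix from the CLASSIFICATION RESULTS slide.
# Rows are true categories, columns are predicted categories.
LABELS = ["Job", "Question", "LP", "Unknown"]
MATRIX = [
    [195, 3, 0, 1],    # true Job
    [75, 125, 1, 0],   # true Question
    [32, 29, 131, 8],  # true LP
    [0, 0, 0, 0],      # true Unknown (none in the evaluation set)
]


def recall(i: int) -> float:
    """Share of emails truly in class i that were predicted as class i."""
    row_total = sum(MATRIX[i])
    return MATRIX[i][i] / row_total if row_total else 0.0


def precision(i: int) -> float:
    """Share of emails predicted as class i that truly belong to class i."""
    col_total = sum(row[i] for row in MATRIX)
    return MATRIX[i][i] / col_total if col_total else 0.0


total = sum(sum(row) for row in MATRIX)
accuracy = sum(MATRIX[i][i] for i in range(len(LABELS))) / total
unknown_share = sum(row[3] for row in MATRIX) / total

# Improvement over the human benchmark, in percentage points.
job_recall = recall(0)
human_job_recall = 0.79
gain_pp = (job_recall - human_job_recall) * 100

# Time freed per agent: ~3h of daily review, summaries take ~17% of that time.
hours_freed = 3 * (1 - 0.17)

# Token volume implied by the cost assumptions:
# 3 million tokens/hour x 10 hours x 6 days x 4 weeks.
tokens_millions = 3 * 10 * 6 * 4

for i, label in enumerate(LABELS[:3]):
    print(f"{label}: precision={precision(i):.2f} recall={recall(i):.2f}")
print(f"accuracy={accuracy:.2f} unknown share={unknown_share:.3f}")
print(f"gain over human benchmark: {gain_pp:.0f} percentage points")
print(f"hours freed per day: {hours_freed:.1f}")
print(f"tokens over four weeks: {tokens_millions} million")
```

Under these assumptions the sketch reproduces the reported 98% Job recall, 75% overall accuracy, a gain of about 19 percentage points, roughly 2.5 hours freed per day, and 720 million tokens over the four-week period.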