Summary

These slides present an introduction to recommender systems, covering collaborative filtering, evaluation metrics, and the main paradigms of recommender systems. They also discuss the interplay between algorithms and practical application, examining how recommender systems are evaluated offline and online. The slides were prepared for presentation purposes, not for assessment or grading.

Full Transcript

introduction to recommender systems – Inês Gomes (slides adapted from Carlos Soares, based on materials from the book "Recommender Systems – An Introduction" by Dietmar Jannach, Markus Zanker, Alexander Felfernig, Gerhard Friedrich, Cambridge University Press)

data you worked with so far: back to tables… (RS 2)
– strongly structured, strongly independent, reasonably well distributed
– https://www.smartsheet.com/database-relationships

data in the world: … can be sparse (RS 3–4)
– user–item data can be very sparse
– https://github.com/CSKrishna/Recommender-Systems-for-Implicit-Feedback-datasets

reference textbook: Aggarwal, Recommender Systems: The Textbook – https://link.springer.com/book/10.1007/978-3-319-29659-3 (RS 5)

plan (chapter references from Aggarwal) (RS 6)
– introduction – ch. 1
– collaborative filtering – ch. 2 (.1–.3)
– evaluation – ch. 7
– more algorithms
  – more about CF approaches – ch. 2 (.1–.3) and ch. 3 (.6.1–.6.3; .1–.5 review standard algorithms)
  – content-based algorithms – ch. 4 (.1–.2 + .5–.6)
  – knowledge-based – ch. 5 (.1)
  – hybrid approaches – ch. 6
– challenges – ch. 8 (.1) and 12 (.1)

gps (RS 7)
– introduction
  – purpose and success criteria
  – definition: feedback, interaction matrix
  – paradigms of recommender systems
– collaborative filtering
– evaluation

purpose and success criteria (RS 8, repeated on RS 10–12)
– Recommendation: items "unknown" to the user (user); serendipity (system); optimize sales margins and profit (company)
– Retrieval: users know in advance what they want (user); provide "correct"/"relevant" proposals (system); reduce search costs (company)
– Prediction: estimate the degree of interest of users in an item (system); optimize sales margins and profit (company)
– Interaction: give users a "good feeling" (user); convince/persuade users, explain (system); increase "hit", "clickthrough", "lookers to bookers" rates (company)

serendipity and the long tail (RS 9)
[or why the best place to hide a dead body is the 2nd page of Google search results]
serendipity: "the fact of finding interesting or valuable things by chance" [Cambridge dictionary]
▪ 20% of items accumulate 74% of all positive ratings – less likely to be interesting recommendations
▪ recommend widely unknown items that users might actually like! – much harder

RS: definition (RS 13)
given
– a user model, e.g. ratings, preferences, demographics, situational context
– items, with or without a description of item characteristics
find
– a relevance score for a set of items, typically used for ranking
Recommender systems reduce information overload by estimating relevance.
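The task defined above (score the items a user has not seen, then rank them) can be summarised in a few lines of Python. This is only an illustrative sketch; `predict_score`, the data layout and all names are assumptions for illustration, not something given on the slides:

```python
from typing import Callable, Dict, List, Tuple

Ratings = Dict[str, Dict[str, float]]  # user -> {item: rating}


def recommend_top_n(user: str,
                    ratings: Ratings,
                    all_items: List[str],
                    predict_score: Callable[[str, str], float],
                    n: int = 10) -> List[Tuple[str, float]]:
    """Score every item the user has not interacted with and return the n best."""
    seen = set(ratings.get(user, {}))
    unseen = [item for item in all_items if item not in seen]
    scored = [(item, predict_score(user, item)) for item in unseen]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # rank by estimated relevance
    return scored[:n]
```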
RS: definition – feedback (RS 14)
User + Item = feedback
– explicit: direct indications of user preferences, such as numerical ratings and written reviews
– implicit: user interactions and behaviors, such as clicks and views, which indicate preferences indirectly

RS: definition – explicit feedback (RS 15)
– typical choices: 1 to 5 or 1 to 7 Likert response scales; probably the most precise ratings
– … possibly multidimensional, e.g. separate ratings for actors and sound as opposed to the movie as a whole
– main challenge: users are not always willing to rate many items

RS: definition – implicit feedback (RS 16)
– a user action is interpreted as a rating, e.g. access to content on social media, access to a product's page and/or buying it
– easy to collect transparently, without additional effort from the user
– main challenge: an action does not necessarily have the same meaning as a rating, e.g. the user might not like all the books he or she has bought, or might have bought a book for someone else

RS: definition – interaction matrix (RS 17)
[figure: a user–item rating matrix (values 1–5) next to its binary version]
– a rating matrix can be transformed into a binary interaction matrix

RS: process (RS 18)
[figure: pipeline producing ranked recommendations, e.g. i2, i1, i7]

RS: paradigms (RS 20–23)
– collaborative: "Tell me what's popular among my peers"
– content-based: "Show me more of the same that I've liked"
– knowledge-based: "Tell me what fits based on my needs"
– hybrid: combinations of various inputs and/or composition of different mechanisms

RS: paradigms – overview (RS 24)
Al-Ghuribi, S. M., Noah, S. A. M.: A Comprehensive Overview of Recommender System and Sentiment Analysis (2021)

gps (RS 25)
– introduction
– collaborative filtering
  – pure CF approaches
  – user-based nearest-neighbor (UBCF)
– evaluation

Collaborative Filtering (CF) (RS 26)
– the most prominent approach
  – used by large, commercial e-commerce sites
  – well understood, various algorithms and variations exist
  – applicable in many domains (books, movies, DVDs, …)
– users give ratings to catalog items, implicitly or explicitly
– uses the "wisdom of the crowd" to generate recommendations
– assumption: customers who had similar tastes in the past will have similar tastes in the future

pure CF approaches (RS 27)
– a prediction of the degree to which the current user will like or dislike a certain item
– a top-N list of recommended items

User-based nearest-neighbor CF: basic algorithm (RS 28–29)
– given an "active user" u and an item i not yet seen by u:
  – find a set of users (peers / nearest neighbors) who liked the same items as u in the past and who have rated item i
  – combine their ratings to predict whether u will like item i, e.g. by averaging
– repeat this for all items that u has not seen and recommend the best-rated items

User-based nearest-neighbor CF: basic assumptions (RS 30)
– if users had similar tastes in the past, they will have similar tastes in the future
– user preferences remain stable and consistent over time

user-based nearest-neighbor CF (RS 31)

        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3
User2     4      3      4      3
User3     3      3      1      5      4
User4     1      5      5      2      1
User5     5      4      4      3

– calculate whether the neighbors' ratings for the unseen item p are higher or lower than their average rating
– weight each neighbor's deviation by the similarity with the active user u
– add/subtract the weighted average deviation to/from the active user's average rating:

pred(u, p) = \bar{r}_u + \frac{\sum_{b \in N} sim(u, b) \cdot (r_{b,p} - \bar{r}_b)}{\sum_{b \in N} sim(u, b)}

User-based nearest-neighbor CF: example (RS 32)
– same matrix as above, now annotated with similarities: sim(u, b1) = 0.85, sim(u, b2) = 0.70, sim(u, b4) = -0.79
– here u = Alice, p = Item5, and the neighborhood N = {User1, User2, …}; r_{b4,p5} denotes, for example, User4's rating of Item5
– Alice's average rating: \bar{r}_u = (5 + 3 + 4 + 4) / 4 = 4
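A small Python sketch of the prediction formula above, assuming Pearson correlation over co-rated items as the similarity measure. The rating table used here is the four-user version of this example from the Jannach et al. book that the slides credit (there, User2 has also rated Item5 with a 5); function names are mine:

```python
from math import sqrt

# user -> {item: rating}; four-user example table from Jannach et al.
ratings = {
    "Alice": {"Item1": 5, "Item2": 3, "Item3": 4, "Item4": 4},
    "User1": {"Item1": 3, "Item2": 1, "Item3": 2, "Item4": 3, "Item5": 3},
    "User2": {"Item1": 4, "Item2": 3, "Item3": 4, "Item4": 3, "Item5": 5},
    "User3": {"Item1": 3, "Item2": 3, "Item3": 1, "Item4": 5, "Item5": 4},
    "User4": {"Item1": 1, "Item2": 5, "Item3": 5, "Item4": 2, "Item5": 1},
}


def pearson(u, b):
    """Pearson correlation between users u and b over their co-rated items."""
    common = sorted(set(ratings[u]) & set(ratings[b]))
    if len(common) < 2:
        return 0.0
    mu = sum(ratings[u][i] for i in common) / len(common)
    mb = sum(ratings[b][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu) * (ratings[b][i] - mb) for i in common)
    den = sqrt(sum((ratings[u][i] - mu) ** 2 for i in common)) * \
          sqrt(sum((ratings[b][i] - mb) ** 2 for i in common))
    return num / den if den else 0.0


def predict(u, item, k=2):
    """pred(u,p) = mean_u + sum(sim(u,b)*(r_bp - mean_b)) / sum(sim(u,b)) over k neighbours."""
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    candidates = [b for b in ratings if b != u and item in ratings[b]]
    # keep the k most similar, positively correlated neighbours
    neighbours = sorted(candidates, key=lambda b: pearson(u, b), reverse=True)[:k]
    neighbours = [b for b in neighbours if pearson(u, b) > 0]
    if not neighbours:
        return mean_u
    num = sum(pearson(u, b) * (ratings[b][item] - sum(ratings[b].values()) / len(ratings[b]))
              for b in neighbours)
    den = sum(pearson(u, b) for b in neighbours)
    return mean_u + num / den


print(round(pearson("Alice", "User1"), 2))   # 0.85
print(round(pearson("Alice", "User4"), 2))   # -0.79
print(round(predict("Alice", "Item5"), 2))   # ~4.87, using User1 and User2 as neighbours
```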
gps (RS 33)
– introduction
– collaborative filtering
– evaluation
  – introduction
  – offline evaluation
  – metrics
  – online evaluation

evaluate, we must… (RS 34)
business questions
– do customers like/buy recommended items?
– do customers buy items they otherwise would not have bought?
– are they satisfied with a recommendation after purchase?
… lead to empirical evaluation
– is the approach efficient with respect to specific criteria such as accuracy, user satisfaction, response time, serendipity, online conversion, …
– during development and in deployment

… because the no-free-lunch theorem is out to get you! (RS 35)
– many techniques available [will be discussed later], so…
  – which one is the best in a given application domain?
  – what are the success factors of different techniques?
  – comparative analysis based on an optimality criterion?

offline evaluation method: data (RS 36)
– data collected in your problem
– benchmark datasets

offline evaluation method (RS 37)
– performance estimation methods: split ratings, split users … not trivial
– metrics: IR/binary classification (i.e. based on the confusion matrix), regression, ranking

performance estimation: split ratings (RS 38)
– training set: a randomly selected share of the known ratings, used to build the model
– testing set: the remaining, withheld ratings; the ground truth used to evaluate the model's quality by comparing it with its recommendations
– the split can also take time into account rather than being random

performance estimation: split users (RS 39)
– training set: a randomly selected share of the known users, used to build the model
– testing set: the remaining, withheld users; recommendations are based on their observed items and compared to their hidden items
– the split can also take time into account rather than being random, and useless data can be removed

metrics (RS 40)
– rating interactions → MAE, RMSE
– binary interactions → precision, recall, F1 score, ROC AUC
– rank metrics → average precision, DCG

metrics (RS 41)
[figure: a rating matrix next to a binary matrix – regression evaluation metrics vs. classification evaluation metrics]

metrics: rating prediction – MAE and RMSE (RS 42)
– ground truth = ratings, i.e. a regression problem
– Mean Absolute Error (MAE) computes the deviation between predicted ratings and actual ratings:
  MAE = \frac{1}{n} \sum_{i=1}^{n} |p_i - r_i|
– Root Mean Square Error (RMSE) is similar to MAE, but places more emphasis on larger deviations:
  RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - r_i)^2}
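Both error measures translate directly into code. A minimal sketch (function names are mine), using the ten ratings and predictions from the "effect of data sparsity" example later in the deck (RS 56):

```python
from math import sqrt


def mae(pred, actual):
    """Mean Absolute Error between predicted and actual ratings."""
    return sum(abs(p - r) for p, r in zip(pred, actual)) / len(actual)


def rmse(pred, actual):
    """Root Mean Square Error; squares the deviations, so large errors weigh more."""
    return sqrt(sum((p - r) ** 2 for p, r in zip(pred, actual)) / len(actual))


# ratings r_i and predictions p_i from the data-sparsity example (RS 56)
actual = [5, 4, 5, 3, 5, 4, 4, 3, 4, 5]
pred   = [4.5, 5, 5, 5, 4.5, 4.1, 3.9, 3, 4.2, 4.8]

print(round(mae(pred, actual), 2))   # 0.46
print(round(rmse(pred, actual), 2))  # 0.75
```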
from ratings to binary (RS 43)
[figure: the rating matrix converted to a binary matrix]
– if rating > 3 → True (relevant), otherwise False
– regression evaluation metrics apply to ratings, classification evaluation metrics to the binary matrix

metrics: binary prediction – borrowing from information retrieval (IR) (RS 44)
– confusion matrix (prediction vs. reality):
  – rated good and actually good → true positive (tp)
  – rated good and actually bad → false positive (fp)
  – rated bad and actually good → false negative (fn)
  – rated bad and actually bad → true negative (tn)
– "all recommended items": e.g. we recommend all items with rating > 3
– "all good items": e.g. all items that the user actually bought

metrics: binary prediction – precision and recall (RS 45)
– recommendation is viewed as an information retrieval task, i.e. retrieve (recommend) all items predicted to be "good"
– precision: a measure of exactness; the fraction of relevant items among all retrieved items, e.g. the proportion of recommended movies that are actually good
– recall: a measure of completeness; the fraction of relevant items retrieved out of all relevant items, e.g. the proportion of all good movies that were recommended

metrics: binary prediction – ROC, AUC and friends (RS 46)
– typically, when a recommender system is tuned to increase precision, recall decreases as a result, or vice versa

metrics: binary prediction – F1 metric (RS 47)
– the F1 metric combines precision and recall into a single value for comparison purposes and may be used to gain a more balanced view of performance
– F1 gives equal weight to precision and recall; other Fβ metrics weight recall with a factor of β

metrics: rank (RS 48)
– for a user, the recommended top-3 list is compared with the actually good items (ground truth)
  – example: ground truth = {Item A, Item C}; recommended top-3 = [Item G, Item A, Item F]; only rank 2 (Item A) is a hit
– ground truth (i.e. the hidden items) is not ordered; it is just the fact that users interacted with those items
  – because it is based on binary interactions
  – if you have ratings, you must convert them to GOOD/BAD and use the GOOD ones as ground truth
– a recommendation is a hit if it is part of the ground truth

metrics: rank – precision/recall @K (RS 49)
– Precision@k = (number of relevant items in the top k) / k
– Recall@k = (number of relevant items in the top k) / (total number of relevant items)
– example: with six recommendations and hits at ranks 2, 3, 4 and 6, P@3 = 2/3 ≈ 0.67 and R@3 = 2/4 = 0.5

metrics: rank – average precision (RS 50)
– a ranked precision metric that places emphasis on highly ranked correct predictions (hits)
– the average of the precision values determined after each successful prediction (hit)
– example: two five-item lists, one with hits at ranks 2, 3 and 4, another with hits at ranks 1, 4 and 5

metrics: rank – discounted cumulative gain (RS 51–52)
– discounted cumulative gain (DCG): hits are accumulated with a logarithmic reduction factor, so hits at lower ranks count less: DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, where k is the number of elements in the ranking, i the position of an element, and rel_i = 1 for a hit and 0 otherwise
– idealized discounted cumulative gain (IDCG): the DCG obtained if the items were ordered by decreasing relevance (ideal positions)
– normalized discounted cumulative gain: nDCG@k = DCG@k / IDCG@k, normalized to the interval [0, 1]
– see https://en.wikipedia.org/wiki/Binary_logarithm

example (RS 54)
– five recommendations with hits at ranks 2, 3 and 4:
  DCG_5 = \frac{1}{\log_2(2+1)} + \frac{1}{\log_2(3+1)} + \frac{1}{\log_2(4+1)} \approx 1.56
  IDCG_5 = \frac{1}{\log_2(1+1)} + \frac{1}{\log_2(2+1)} + \frac{1}{\log_2(3+1)} \approx 2.13
  nDCG_5 = DCG_5 / IDCG_5 \approx 0.73
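A compact sketch of these rank metrics for binary hit lists (helper names are mine); it reproduces the P@3, R@3 and nDCG numbers from the examples above, and the average-precision values are computed for the two example hit lists as reconstructed in RS 50:

```python
from math import log2


def precision_at_k(hits, k):
    """Fraction of the top-k positions that are hits."""
    return sum(hits[:k]) / k


def recall_at_k(hits, k, total_relevant):
    """Fraction of all relevant items that appear in the top k."""
    return sum(hits[:k]) / total_relevant


def average_precision(hits):
    """Mean of the precision values taken at each hit position."""
    precisions = [precision_at_k(hits, i + 1) for i, h in enumerate(hits) if h]
    return sum(precisions) / len(precisions) if precisions else 0.0


def ndcg(hits, k):
    """Binary-relevance nDCG@k with the 1/log2(i+1) discount used on the slides."""
    dcg = sum(h / log2(i + 2) for i, h in enumerate(hits[:k]))  # i is 0-based here
    ideal = sorted(hits, reverse=True)[:k]
    idcg = sum(h / log2(i + 2) for i, h in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


# P@3 / R@3 example: six recommendations, hits at ranks 2, 3, 4 and 6
hits = [0, 1, 1, 1, 0, 1]
print(round(precision_at_k(hits, 3), 2))   # 0.67
print(round(recall_at_k(hits, 3, 4), 2))   # 0.5

# average precision: hits at ranks 2, 3, 4 vs. hits at ranks 1, 4, 5
print(round(average_precision([0, 1, 1, 1, 0]), 2))  # 0.64
print(round(average_precision([1, 0, 0, 1, 1]), 2))  # 0.7

# nDCG example: five recommendations, hits at ranks 2, 3 and 4
print(round(ndcg([0, 1, 1, 1, 0], 5), 2))  # 0.73
```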
effect of data sparsity in evaluation (RS 56)

Nr.  UserID  MovieID  Rating (ri)  Prediction (pi)  |pi-ri|  (pi-ri)^2
1    1       134      5            4.5              0.5      0.25
2    1       238      4            5                1        1
3    1       312      5            5                0        0
4    2       134      3            5                2        4
5    2       767      5            4.5              0.5      0.25
6    3       68       4            4.1              0.1      0.01
7    3       212      4            3.9              0.1      0.01
8    3       238      3            3                0        0
9    4       68       4            4.2              0.2      0.04
10   4       112      5            4.8              0.2      0.04
                                   sums:            4.6      5.6

– all 10 lines: MAE = 0.46, RMSE = 0.75
– removing line nr. 4: MAE = 0.29, RMSE = 0.42
– removing lines 1, 2, 4 and 5: MAE = 0.1, RMSE = 0.13

online evaluation: example (RS 57)
– effectiveness of different algorithms for recommending cell phone games [Jannach, Hegelich 09]
– involved 150,000 users on a commercial mobile internet portal
– comparison of recommender methods, with random assignment of users to a specific method

online evaluation: characteristics of methods (RS 58)
– who is the subject in the focus of the research? online customers, students, historical online sessions, computers, …
– what research methods are applied? experiments, quasi-experiments, non-experimental research
– in which setting does the research take place? lab or real-world scenarios

online evaluation: settings (RS 59)
– lab studies: expressly created for the purpose of the study; extraneous variables can be controlled more easily by selecting study participants, who should behave as they would in a real-world environment … but doubts may exist about participants motivated by money, prizes or social pressure
– field studies: conducted in a preexisting real-world environment; users are intrinsically motivated to use the system

research method: experimental (RS 60)

research method: quasi-experiments (RS 62)
– lack random assignment of units to different treatments
– problem: questions of causality cannot be answered
  – are users of the recommender system more likely to convert? does the recommender system itself cause users to convert?
  – … due to the lack of random assignment, some hidden exogenous variable might influence both the choice to use the RS and the conversion

research method: non-experimental (RS 63)
non-experimental / observational research:
– surveys / questionnaires
– longitudinal research: observations over a long period of time, e.g. customer lifetime value, returning customers
– case studies: focus on answering research questions about how and why, e.g. how did recommendation technology contribute to Amazon.com becoming the world's largest book retailer?
– focus groups: interviews, think-aloud protocols

analysis of results (RS 64)
– are observed differences statistically meaningful or due to chance?
  – a standard procedure for testing the statistical significance of two deviating metrics is the pairwise analysis of variance (ANOVA)
  – null hypothesis H0: the observed differences are due to chance
  – if the test statistic rejects H0, the significance of the findings can be reported
– practical importance of the differences?
  – size of the effect and its practical impact
  – external validity or generalizability of the observed effects
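As a rough illustration of the significance check described above: the slides mention pairwise ANOVA; the sketch below uses a paired t-test, its two-system special case, on hypothetical per-user MAE values (all numbers are invented for illustration):

```python
from scipy import stats

# hypothetical per-user MAE of two recommenders, measured on the same users
mae_system_a = [0.46, 0.52, 0.39, 0.61, 0.48, 0.55, 0.43, 0.50]
mae_system_b = [0.41, 0.47, 0.40, 0.55, 0.44, 0.49, 0.42, 0.45]

# H0: the observed differences between the two systems are due to chance
t_stat, p_value = stats.ttest_rel(mae_system_a, mae_system_b)
if p_value < 0.05:
    print(f"reject H0 (p = {p_value:.3f}): the difference is statistically significant")
else:
    print(f"cannot reject H0 (p = {p_value:.3f})")
```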
dilemma of establishing ground truth (RS 65)
▪ IR measures are frequently applied, however:
– offline experimentation: ratings and transactions from historic sessions (not all recommended items are rated); ratings of unrated items are unknown but interpreted as "bad" (default assumption, since users tend to rate only good items); if the default assumption does not hold, true positives may be too small, false negatives may be too small, precision may increase and recall may vary
– online experimentation: ratings and feedback from live interaction (all recommended items are rated); "good/bad" ratings of items that were not recommended are unknown, so false/true negatives cannot be determined; precision is ok, recall is questionable
– results from offline experimentation have limited predictive power for online user behavior

discussion & summary (RS 67)
– overview of the problem of recommendation
– collaborative filtering approaches
– focus on evaluation
  – general principles of empirical research and the current state of practice in evaluating recommendation techniques were presented
  – focus on how to perform empirical evaluations on historical datasets
  – discussion of different methodologies and metrics for measuring the accuracy or coverage of recommendations
  – overview of which research designs are commonly used in practice
  – from a technical point of view, measuring the accuracy of predictions is a well-accepted evaluation goal
  – … but other aspects that may potentially impact the overall effectiveness of a recommendation system remain largely underdeveloped
– discussion of some "complexities" of the data involved
