Systematization of Knowledge (SoK): A Systematic Review of Software Based Web Phishing Detection
Zuochao Dou, Issa Khalil, Abdallah Khreishah, Ala Al-Fuqaha, Mohsen Guizani
Summary
This publication systematically examines software-based methods for detecting web phishing. It covers different approaches, evaluation datasets, detection features, and metrics. The article aims to provide insights for developing more effective phishing detection systems.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/COMST.2017.2752087, IEEE Communications Surveys & Tutorials.

1553-877X (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Systematization of Knowledge (SoK): A Systematic Review of Software Based Web Phishing Detection

Zuochao Dou, Student Member, IEEE, Issa Khalil, Member, IEEE, Abdallah Khreishah, Member, IEEE, Ala Al-Fuqaha, Senior Member, IEEE, and Mohsen Guizani, Fellow, IEEE

Zuochao Dou is with the Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, USA.
Issa Khalil is with the Qatar Computing Research Institute (QCRI), Hamad bin Khalifa University (HBKU), Doha, Qatar.
Abdallah Khreishah is with the Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, USA.
Ala Al-Fuqaha is with the NEST Research Lab, College of Engineering & Applied Sciences Computer Science Department, Western Michigan University, Kalamazoo, USA.
Mohsen Guizani is with the Electrical and Computer Engineering Department, University of Idaho, Moscow, USA.

Abstract: Phishing is a form of cyber attacks that leverages social engineering approaches and other sophisticated techniques to harvest personal information from users of websites. The average annual growth rate of the number of unique phishing websites detected by the Anti Phishing Working Group (APWG) is 36.29% for the past six years and 97.36% for the past two years. In the wake of this rise, alleviating phishing attacks has received a growing interest from the cyber security community. Extensive research and development have been conducted to detect phishing attempts based on their unique content, network, and URL characteristics. Existing approaches differ significantly in terms of intuitions, data analysis methods as well as evaluation methodologies. This warrants a careful systematization so that the advantages and limitations of each approach, as well as the applicability in different contexts, could be analyzed and contrasted in a rigorous and principled way. This paper presents a systematic study of phishing detection schemes, especially, software based ones. Starting from the phishing detection taxonomy, we study evaluation datasets, detection features, detection techniques and evaluation metrics. Finally, we provide insights that we believe will help guide the development of more effective and efficient phishing detection schemes.

I. INTRODUCTION

Phishing, one form of cyber-attacks, continues to be a growing concern not only to cyber security specialists but also to e-business users and owners. The severity of such cyber attack vector is continuously growing with the exponential increase in digital information generation and the increased reliance of people and business on cyber space. The Anti-Phishing Working Group (APWG) has seen rapid growth in the number of unique phishing websites detected from 2014 to 2016. The average annual growth rate is 97.36% and is expected to continue to grow. Estimates of annual direct financial loss to the US economy caused by phishing activities range from $61 million to $3 billion.

To mitigate the increasing damage caused by phishing, a broad range of anti-phishing mechanisms have been proposed over the past two decades. These anti-phishing techniques can be categorized into three broad groups: (1) Detective solutions (e.g., website filtering); (2) Preventive solutions (e.g., strong authentication); and (3) Corrective solutions (e.g., site takedown). In this paper, we focus on detective solutions. More specifically, we look at software-based phishing detection schemes that are specialized in identifying and classifying phishing websites. This class of approaches is arguably more important than other approaches because it helps in reducing human errors. Preventative and corrective solutions take a different approach, but if the user behind the keyboard has been successfully tricked by a phishing attempt, and willingly submitted sensitive information, then no firewall, encryption software, certificates, or authentication mechanism can help in preventing the attack from materializing. Software-based phishing detection also delivers improved results compared to detection by user education because phishing attacks normally aim at exploiting human weaknesses. For example, a study of phishing detection using user education shows a 29% false negative rate (FNR) for the best performance, while the software based approaches that are surveyed by the same study have FNR in the range of 0.1% to 10%. For this reason, we focus our study on software based phishing detection systems, and the term “phishing detection” will refer only to this form of detection in the rest of the paper.

Although the research area of phishing detection and classification is relatively rich, there is a lack of systematic analysis of the requirements, the capabilities, and the shortcomings of the existing anti-phishing techniques. For example, websites that offer identification and classification of phishing as a service have been popular in recent years, however, those services leverage different evaluation datasets from various sources at different time periods to validate their outcomes. Albeit those schemes may have similar performance results (e.g., in terms of false positive rate, true positive rate, etc.), it is difficult to compare their performance because of the variation in the evaluation datasets employed. Consequently, a systematic assessment of the datasets used to validate phishing detection approaches is desired, as well as necessary, in order to provide a foundation for comprehensive comparisons among different phishing detection schemes, and ultimately, select the best in practice.
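The rates used in the introduction (false negative rate, false positive rate, true positive rate) can be made concrete with a minimal Python sketch. It assumes that "phishing" is the positive class; the function name `detection_rates` and the example counts are hypothetical and were chosen only so that the FNR comes out at 29%, the figure quoted for user education.

```python
def detection_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return FPR, FNR and TPR, treating 'phishing' as the positive class."""
    positives = tp + fn  # sites that really are phishing
    negatives = tn + fp  # sites that really are legitimate
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one phishing and one legitimate sample")
    return {
        "FPR": fp / negatives,  # legitimate sites wrongly flagged
        "FNR": fn / positives,  # phishing sites that were missed
        "TPR": tp / positives,  # phishing sites correctly flagged
    }


# Hypothetical counts: 100 phishing sites (29 missed), 100 legitimate sites (5 flagged).
rates = detection_rates(tp=71, fp=5, tn=95, fn=29)
```

Two schemes that report the same rates are still not directly comparable unless the counts come from the same evaluation dataset, which is the point the paragraph above makes.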
In this work, we complement the existing survey papers on phishing detection by providing a broad systematic analysis of software based anti-phishing approaches. In one of these surveys, the authors focus on studying, analyzing, and classifying the most significant and novel detection techniques, and point out the advantages and disadvantages of each approach. In contrast, we present a more comprehensive systematic review of phishing detection schemes, not only from the perspective of detection algorithms, but also from a broader perspective that covers other important aspects including the phishing detection life cycle, taxonomy of phishing detection schemes, evaluation datasets, detection features, and evaluation metrics and strategies. Another survey focuses more on the attack side of phishing. More specifically, it presents details about phishing attacks including the anatomy of such attacks, why people fall for phishing attacks, and how bad phishing is. However, it only provides a high level analysis of the state-of-the-art phishing countermeasures. In order to provide a systematic review of the phishing detection research, we first present the necessary information about phishing attacks by answering three questions: (1) What is phishing? (2) How does phishing work? and (3) What is the current status of phishing? Then, we conduct a systematic review of phishing detection schemes in a detailed and comprehensive manner. Finally, Khonji et al. present a literature survey about anti-phishing solutions (e.g., user training, email filtering and website detection, etc.), including their classification, detection techniques and evaluation metrics. Compared to that survey, we focus on the software based phishing website detection schemes, which have proved to be the most effective anti-phishing solutions and are not systematically studied there.

In a nutshell, the objective of this paper is to provide a systematic understanding of existing phishing detection studies and provide a comprehensive way to evaluate phishing detection approaches from different perspectives in order to guide future developments and validations of new or upgraded anti-phishing techniques. We summarize our contributions in this work as follows:

- Compile a comprehensive profile of phishing through its various definitions, detailed ecosystem (i.e., in terms of phishing life cycle, actors involved and their operations, etc.), and the state-of-the-art phishing trends.
- Present a systematic review of the software based phishing detection schemes from different perspectives including the life cycle, taxonomy, evaluation datasets, detection features, detection techniques and evaluation metrics.
- Introduce a novel feature, Network Round Trip Time (NRTT), for efficient and real time detection of phishing attacks.
- Provide detailed takeaway lessons for researchers and practitioners in the area of phishing detection that we believe will help guide the development of effective phishing detection schemes.

The rest of the paper is organized as follows: Section II describes the state-of-the-art phishing attacks, and presents the life cycle of phishing detection approaches. Section III introduces the taxonomy of phishing detection schemes with the corresponding literature review. Section IV presents a systematic review of software based phishing detection schemes from different perspectives: (1) phishing detection datasets; (2) phishing detection features; (3) phishing detection techniques; and (4) evaluation metrics. Section V provides detailed takeaway lessons for researchers and practitioners in the area of phishing detection. Section VI concludes the paper.

TABLE I: MOST POPULAR DEFINITIONS OF PHISHING
Source | Definition
PhishTank | Phishing is a fraudulent attempt, usually made through email, to steal your personal information.
APWG | Phishing is a criminal mechanism employing both social engineering and technical subterfuge to steal consumers’ personal identity data & financial account credentials.
Xiang et al. | Phishing is a form of identity theft, in which criminals build replicas of target Web sites and lure unsuspecting victims to disclose their sensitive information like passwords, personal identification numbers (PINs), etc.
Whittaker et al. | A phishing page is any web page that, without permission, alleges to act on behalf of a third party with the intention of confusing viewers into performing an action with which the viewer would only trust a true agent of the third party.
Khonji et al. | Phishing is a type of computer attack that communicates socially engineered messages to humans via electronic communication channels in order to persuade them to perform certain actions for the attacker’s benefit.
Ramesh et al. | Phishing is a fraudulent act to acquire sensitive information from unsuspecting users by masking as a trustworthy entity in an electronic commerce.

TABLE II: TARGETS AND STRATEGIES OF PHISHING
Source | Target | Strategy
PhishTank | Personal information | Social engineering
APWG | Identity data, financial account credentials | Social engineering
Xiang et al. | Sensitive information | Not specified
Whittaker et al. | Not specified | Not specified
Khonji et al. | Not specified | Social engineering
Ramesh et al. | Sensitive information | Not specified

II. BACKGROUND

A. State-of-the-art Phishing Attacks

In this section, we first present the various definitions of phishing, then we introduce some statistics about phishing between January 2010 and June 2016. Finally, we describe the phishing ecosystem.

1) What is Phishing?: There is no consensus on how phishing should be defined. Different phishing definitions lead to different research directions and approaches (e.g., email filtering or website detection). It is important to clearly identify the target of any phishing detection approach to avoid confusion about its applicability in different scenarios. The target and scope of a phishing detection approach can be inferred from the definition of phishing that the approach adopts. Therefore, presenting a background on the different definitions of phishing can help the readers understand the scope and the capabilities of different approaches. Table I summarizes the popular definitions of phishing. On one hand, the definitions of PhishTank, APWG, Xiang et al., and Ramesh et al. cover the majority of cases in which phishers aim at stealing sensitive personal information such as authentication credentials.
Table II shows the comparison of those phishing definitions based on phishing target and phishing strategy. The most dominant phishing strategies are social engineering (e.g., through fraudulent emails) and technical subterfuge (e.g., malware infection). However, sophisticated techniques (e.g., pharming) are also used to harvest users’ personal information from the Internet. On the other hand, the definitions of Whittaker et al. and Khonji et al. do not limit the attacker’s target (e.g., to sensitive personal information). They describe the phishing strategy (e.g., phishing website or socially engineered messages) without stating a specific phishing target (i.e., they only state the attacker’s benefit). To sum up, the definition of Whittaker et al. is the most general among those reviewed, while APWG defines the most commonly used phishing attacks in a specific manner.

2) How Does Phishing Work?: In this section, we introduce the ecosystem of phishing in terms of the phishing process, the actors involved, and their actions and interactions.

(i) Phishing Process: In a generic/traditional phishing scenario (i.e., mass-email phishing campaigns), an attacker hosts a fake website, and presents users of a web service with convincing emails containing a link to the fake website. When a user of the web service opens the link and enters her sensitive data, the data is collected by the server hosting the fake website. As shown in Figure 1, Mihai and Giurea suggest that a generic phishing process can be identified in five steps: (1) Reconnaissance: Phishers look for famous web service brands with a broad customer base; (2) Weaponization: Phishers design the phishing websites and craft socially engineered spam emails; (3) Distribution: Phishers deliver emails to the victims; (4) Exploitation: Phishers exploit weaknesses of humans to lure the victims into phishing traps via socially engineered emails; (5) Exfiltration: Phishers collect sensitive data from the phishing databases.

Unlike generic phishing attacks, spear phishing targets particular individuals or organizations. Spear phishing attacks typically extract sensitive data from their victims by attaching a type of malware to emails or embedding it in the phishing website. Industry statistics indicate that spear phishing attacks have a success rate of 19%, while the success rate of generic phishing attacks is less than 5%.

For the purpose of this paper, we will not consider email filtering as a phishing detection method. Our focus is on detection of website phishing for both generic and spear phishing attacks.

(ii) Phishing Actors: There are six actors involved in a typical phishing life cycle (see Figure 2), as defined in the following paragraphs:

- Phisher: Individuals or organizations that conduct phishing attacks in order to obtain a certain type of benefit, such as financial gain, identity hiding (i.e., phishers do not use the stolen identities directly, but rather sell them to interested criminals and cyber attackers), fame and notoriety, etc.
- Web service provider: Companies that provide a certain type of service (e.g., email, social network, e-banking, on-line shopping, etc.) on the Internet (usually through a website).
- Web service subscriber: Customers who subscribe to web services provided by the web service provider. Subscribers are the potential targets of traditional phishing attacks.
- Web hosting provider: Companies that provide website hosting services to web service companies.
- Anti-phishing institutes: Institutes that support those tackling the phishing menace and provide advice on anti-phishing controls and information on current trends.
- Spear phishing targets: Specific individuals or companies targeted by phishers.

Each actor involved in the phishing process has different actions and reactions (summarized in Table III). Phishers try to use sophisticated techniques to evade phishing detection approaches (e.g., DNS poisoning). In addition, there is a growing trend in which phishers have decoupled the process of phishing website hosting from the process of sending phishing emails in order to evade the anti-phishing solutions (Han, Kheir, & Balzarotti).

TABLE III: OPERATIONS OF DIFFERENT PLAYERS INVOLVED IN PHISHING
Actor | Basic Operations | Advanced Operations
A | 1. Data collection; 2. Website development; 3. Email engineering | 1. Evasion of anti-phishing techniques
B | 1. Blacklist announcement | 1. User behavior detection; 2. Strong authentications
C | 1. Human detection | 1. Phishing detection toolbar; 2. Browser filter
D | 1. Policy enforcement | 1. Brand monitoring
E | 1. Phishing data analysis; 2. Anti-phishing solutions | 1. Law enforcement; 2. Government coalition
F | 1. Employee training | 1. Email filtering; 2. Phishing detection software
A: Phisher; B: Web service provider; C: Web service subscriber; D: Web hosting provider; E: Anti-phishing institute; F: Spear phishing targets

Web service providers usually announce blacklists of phishing websites and recommend that users adopt strong authentication schemes. Additionally, web service subscribers highly depend on browser filters (e.g., Google Safe Browsing) and other third party anti-phishing toolbars (e.g., Netcraft) to detect and block phishing attempts.

The role of web hosting providers is rather ambiguous in the phishing process. Reputable providers usually enforce strict “Terms of Use” and offer certain anti-phishing solutions (e.g., brand monitoring). Due to financial constraints, many free-to-use web hosting providers may not be able to afford deploying good anti-phishing security measures, which leaves their customers not only vulnerable, but even worse, attractive targets for phishing.

Anti-phishing institutes collect and analyze phishing data (e.g., suspicious websites reported by users) from various sources (e.g., users’ reports via anti-phishing toolbars), and provide anti-phishing suggestions and solutions (e.g., up-to-date phishing website blacklist, phishing detection toolbars, etc.). In addition, they may also cooperate with government agencies such as public security and law enforcement to detect and prevent cyber attacks.

Fig. 1: Illustration of the phishing process.

Fig. 2: Players involved in the phishing process.

Fig. 3: The number of unique phishing sites per month from Jan. 2010 to Jun. 2016 (vertical axis: 0 to 160,000 sites).

Fig. 4: The number of phishing sites that use HTTPS (re-printed from a cited source).
3) What is the Current State of Phishing?: According to phishing activity trends reports published by APWG from Jan. 2010 to Jun. 2016 (shown in Figure 3), the number of unique phishing websites established per month has increased significantly since 2015 (i.e., the average number for 2016 is 2.93 times the average from prior years). It is clear that phishers profit from this type of cyber attack, which results in financial loss for both web subscribers and business owners. Therefore, agile techniques to mitigate phishing will continue to be a pressing need.

Phishing attacks tend to employ advanced techniques to lure web service users into their rogue websites. Using the database from Trend Micro web reputation technology, Pajares reports that the number of phishing sites that use HTTPS connections increased significantly in 2014 compared to 2010 (shown in Figure 4). Attackers have become more cautious and attentive when designing phishing websites to evade existing phishing detection methods. Some phishing groups are both capable of and intent on performing more advanced phishing attacks. Avalanche (commonly known as the Avalanche Gang) is a criminal syndicate involved in phishing attacks. In 2010, APWG reported that Avalanche was responsible for two-thirds of all phishing attacks in the second half of 2009, describing it as “one of the most sophisticated and damaging on the Internet” and “the world’s most prolific phishing gang”. It has been discovered that Avalanche uses different techniques to evade the anti-phishing mechanisms.

In addition, more and more sophisticated techniques are being used to implement phishing attacks. For example, the pharming attack, a refined version of phishing attacks, is designed to steal users’ credentials by redirecting them to fraudulent websites using DNS-based techniques. Many computer security experts predict that the use of pharming attacks will continue to grow as more criminals embrace these techniques.

B. Life cycle of phishing detection

As mentioned in Section I, we do not incorporate phishing detection approaches that rely on user education due to their poor performance. In addition, we do not cover phishing detection methods that perform email filtering because it is a different detection theme that warrants a separate comprehensive study on its own. We re-emphasize here that our focus is on the area of software-based phishing detection, which aims at detecting or blocking phishing websites.

The life cycle of software-based phishing detection is illustrated in Figure 5. Starting from the initial inputs, the detection scheme extracts phishing detection features (also called heuristics, as detailed in Section IV-B) and/or blacklists from various sources (e.g., URL related information, trusted third party, WHOIS server, etc.) via different feature mining approaches (e.g., search engines, target identification algorithms, etc.). Then, it applies different data mining algorithms and/or detection strategies to the engineered features to achieve its objectives (e.g., identifying phishing links, blocking phishing websites, etc.). To evaluate the performance of phishing detection schemes, various evaluation datasets are collected from different sources (e.g., PhishTank, Yahoo directory, etc.). Finally, leveraging the collected datasets and following various validation strategies (e.g., cross validation), the proposed scheme is evaluated based on multiple metrics (e.g., False Positive Rate, False Negative Rate, etc.).

In the coming sections, following the life cycle of software-based phishing detection schemes, we present a comprehensive study of the phishing detection research from five different perspectives, namely, classification of phishing detection techniques, validation datasets, detection features, detection techniques and detection criteria.
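The life cycle described above (feature extraction, detection strategy, evaluation on a labeled dataset) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than any surveyed scheme: the four URL features, the threshold rule standing in for a data mining algorithm, and the five labeled URLs in `DATASET` are all hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def extract_features(url: str) -> dict:
    """Feature mining step: derive simple, hypothetical URL heuristics."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "host_is_ip": host_is_ip,
        "has_at_sign": "@" in parsed.netloc,
        "many_subdomains": host.count(".") >= 4,
        "long_url": len(url) > 75,
    }


def classify(features: dict, threshold: int = 1) -> bool:
    """Detection strategy: flag as phishing if enough heuristics fire."""
    return sum(features.values()) >= threshold


def evaluate(dataset) -> tuple:
    """Evaluation step: return (FPR, FNR) over (url, is_phishing) pairs.

    Assumes the dataset contains at least one sample of each class.
    """
    tp = fp = tn = fn = 0
    for url, is_phishing in dataset:
        flagged = classify(extract_features(url))
        if is_phishing and flagged:
            tp += 1
        elif is_phishing:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return fp / (fp + tn), fn / (fn + tp)


# Hypothetical evaluation dataset: (URL, ground-truth label).
DATASET = [
    ("http://192.0.2.10/login", True),
    ("http://user@example.test/verify", True),
    ("http://examp1e-login.test/", True),  # evades all four heuristics
    ("https://www.example.com/", False),
    ("https://docs.example.org/help", False),
]

fpr, fnr = evaluate(DATASET)
```

On this toy dataset one phishing URL slips past every heuristic, so the false negative rate is one in three while the false positive rate is zero; real schemes replace `classify` with a trained model and `DATASET` with sources such as PhishTank.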
Fig. 5: Life Cycle of typical phishing detection schemes (boxes shown: heuristic/blacklist data sources; URL/blacklist features; feature mining approach; phishing/legitimate data sources; machine learning algorithm; detection strategy; evaluation data set; evaluation criteria; validation strategy).

III. PHISHING DETECTION SCHEMES: TAXONOMY AND THE CORRESPONDING LITERATURE REVIEW

In phishing literature, software-based phishing detection schemes are usually categorized into heuristic and blacklist based schemes. Heuristic-based approaches examine the contents of the web pages including: (1) surface level content (e.g., the URL); (2) textual content (e.g., terms or words that appear on a given web page); (3) visual content (e.g., the layout, and the block regions etc.). These methods can detect phishing attacks as soon as they are launched but also introduce relatively high false positive rates (FPR). Blacklist-based approaches have a higher level of accuracy. However, they do not defend against zero-hour attacks. Combinations of heuristic and blacklist based approaches provide more robust and flexible defense against phishing attacks than either one on a standalone basis.

In this paper, we classify phishing detection approaches as either public phishing detection toolbars or academic phishing detection/classification schemes. Phishing detection toolbars use blacklists and/or selected heuristics to identify phishing websites. There is usually little information about what heuristics these toolbars use and how they are used. Academic phishing detection solutions are similar to phishing detection toolbars, but usually apply more complex technologies and are usually not available/feasible for public use. Most academic phishing classification schemes feed combinations of heuristic features into various data mining algorithms to enhance the classification accuracy. Table IV summarizes the differences between phishing detection toolbars and academic phishing detection/classification schemes. Note that the “scheme details” column in Table IV estimates the amount of publicly available details about detection schemes, such as detection methodology, data mining algorithms, and datasets.

Furthermore, based on the heuristic/blacklist classification, we further classify the academic phishing detection approaches into more specific and fine-grained sub-categories, namely, (1) heuristic: URL based methods; (2) heuristic: page content based methods; (3) heuristic: visual similarity based methods; (4) heuristic: other methods; (5) blacklist based methods; (6) hybrid methods. Details about each category are introduced in Section IV-B.
TABLE IV: SUMMARY OF DIFFERENCES BETWEEN PHISHING DETECTION TOOLBARS AND ACADEMIC PHISHING DETECTION/CLASSIFICATION SCHEMES
Scheme | Public usage availability | Data Mining Algorithms | Exact Heuristics | Scheme Details | Blacklist | Goal
Publicly available phishing detection toolbars | Yes | Very little | Very little | Yes | Little | Detect and block phishing websites
Academic phishing detection/classification schemes | Very little | Yes | Yes | Little | Yes | Detect or classify phishing websites

A. Public phishing detection toolbars

Many freely available anti-phishing toolbars offer detection and blocking services against Internet phishing attacks. These toolbars typically come in the form of web browser extensions (i.e., default extensions or third party extensions) that warn users about a suspicious phishing site after clicking on its URL.

Publicly available anti-phishing toolbars are either embedded in the browser as default extensions (e.g., Microsoft SmartScreen Filter) or can be downloaded from third party websites (e.g., Netcraft). They both display security warnings on screen when certain actions are triggered in the browser. These security warnings can be classified into two types:

- Passive warnings: Passive warnings display various information (e.g., user ratings, site suggestions, etc.) about the website that is currently being visited but do not block the content of the website, as depicted in Figure 6.
- Active warnings: Active warnings display warning information about the website a user is trying to visit and block the content of the website, as depicted in Figure 7.

Many studies have shown that the majority of web service users ignore security warnings provided by anti-phishing toolbars. Furthermore, Egelman et al. found that active warnings are much more effective than passive warnings (79% of participants paid attention to active warnings while only 13% of participants paid attention to passive warnings). Table V summarizes the information gathered about the state-of-the-art anti-phishing toolbars. In the following paragraphs, we discuss the details of those toolbars:

TABLE V: INFORMATION ABOUT SELECTED STATE-OF-THE-ART ANTI-PHISHING TOOLBARS
Toolbar | Technology | Warning Type | Platform
Google Safe Browsing | Blacklist, Heuristics (Suspected) | Active | Internet Explorer, Firefox and Chrome
McAfee SiteAdvisor | Heuristics, Blacklist (Manual verification) | Passive | Internet Explorer, Firefox and Chrome
Netcraft Anti-Phishing Toolbar | Blacklist, Heuristics | Active | Internet Explorer, Firefox and Chrome
SpoofGuard | Heuristics | Passive | Internet Explorer
Microsoft SmartScreen Filter | Blacklist | Active | Internet Explorer
EarthLink Toolbar | Blacklist (User ratings and manual verification), Heuristics (Unknown) | Passive | Internet Explorer and Firefox
eBay Toolbar | Blacklist, Heuristics | Passive | Internet Explorer
GeoTrust TrustWatch Toolbar | Blacklist (Third-party reputation services and certificate authorities) | Passive | Internet Explorer
WOT (Web of Trust) | Blacklist (User ratings, Third-party information) | Active | Internet Explorer, Firefox and Chrome

Google Safe Browsing: It uses a browser to check URLs against Google’s constantly updated blacklist of unsafe web resources (e.g., phishing websites) and provides active warnings to the end users. According to Google Safe Browsing’s website, for different platform and threat types, it examines pages against the safe browsing lists. It also issues reminders before users access risky links.

McAfee SiteAdvisor: This is a web application that reports on the identity of websites by scanning them for potential malware and spam. The detection result is decided according to a combination of heuristics and manual verification, such as the age and country of the domain registration, the number of links to other known-good sites, third-party cookies, and user reviews. In addition, it provides passive warnings.

Netcraft Anti-Phishing Toolbar: Provides Internet security services including anti-fraud and anti-phishing services, application testing and PCI scanning. According to its website, Netcraft’s toolbar screens and identifies the deceiving contents in URLs. It also ensures that the navigational controls (e.g., toolbar and address bar) are activated in order to prevent pop-up windows (particularly for Firefox). In addition, it shows the geographic information of the hosting location of the sites and analyzes fraudulent URLs (e.g., the real citibank.com or barclays.co.uk sites are unlikely to be hosted in the former Soviet Union).

SpoofGuard: A heuristics-based anti-phishing toolbar developed for Internet Explorer with passive warnings. The heuristics used include (1) Domain name check: examines if the domain name for the attempted URL matches recent entries; (2) URL Check: checks if the username, the port number, as well as the domain name, are suspicious; (3) Email Check: determines whether the browser was directed to the current URL via email; (4) Password Field Check: determines whether the document contains input fields of type “password”; (5) Link Check: searches for risky links in the body of the document; (6) Image Check: analyzes the images of the new site vs. the previous sites; (7) Password Tracking: prevents the user from typing the same username and password for multiple sites.
In that case, it issues a warning message technologies and various machine learning algorithms. Table while blocking the site for user’s safety. In addition, security VI shows the time-based (from 2005 to 2016) development checks are also performed when the user starts a download of 41 selected academic phishing detection/classification ap- from the site. Moreover, SmartScreen compares the download proaches. In order to choose the most representative studies, to a list of existing downloads by other users. A warning is in this paper, we comply with the following criteria based on issued if it’s a brand new download. state-of-the-art literature: EarthLink Toolbar: Helps to protect the user from on-line Pioneering: Research that introduces new ideas or meth- scams by displaying a security rating (i.e., passive warning) ods to the literature. for all the websites the user visited previously. Additionally, Attention: Research that receives more attentions in it alerts the user if he tries to access a previously known terms of the number of citations. fraudulent website. It appears to rely on a combination of Completeness: Research that presents their work follow- heuristics, user ratings, and manual verification. ing the entire life cycle of phishing detection in depth. eBay Toolbar: Helps the buyers and sellers with real time Based on the proposed criteria, all of the 41 selected alerts and keeps users safe from spoofing and fraudulent works are introduced in the following sections and twelve attacks by detecting fake sites via a combination of heuristics representative studies are chosen as examples to illustrate the and blacklists through passive warnings. detailed detection methodology in each category. They are GeoTrust TrustWatch Toolbar: Provides website veri- listed in Table VI and introduced below. fication service that alerts the users to potentially unsafe, Visual similarity based methods: Chen et. al. 
describe or phishing web sites based on the information of several a novel heuristic anti-phishing system that explicitly employs third-party reputation services and certificate authorities via gestalt and decision theory concepts to model perceptual passive warnings. TrustWatch notifies the users that the similarity. More specifically, they apply logistic regression website has passed the verification scan based on a list of algorithm to a set of normalized page content features. The disreputable sites. It would also recommend additional caution proposed scheme can achieve 100% true positive rate and when inputting sensitive information to the website. Further- 0.74% false positive rate. 1553-877X (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/COMST.2017.2752087, IEEE Communications Surveys & Tutorials 8 Fig. 6: Passive warnings from Netcraft anti-phishing toolbar. Reprinted from: http://toolbar.netcraft.com/. Fig. 7: Active warning from Google Safe Browsering. Reprinted from: https://googleblog.blogspot.com/2015/03/protecting-people-across-web-with.html. 1553-877X (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. 
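To make the logistic regression step mentioned for Chen et al. more concrete, the following is a minimal sketch of how a logistic model scores a page from normalized features. The feature values, weights, bias, and the function names `min_max_normalize` and `phishing_probability` are all hypothetical and chosen only for illustration; they are not the features or parameters used by Chen et al.

```python
import math

def min_max_normalize(value, lower, upper):
    """Scale a raw feature into [0, 1]; values outside the range are clipped."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    scaled = (value - lower) / (upper - lower)
    return min(1.0, max(0.0, scaled))

def phishing_probability(features, weights, bias):
    """Logistic regression score: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example: two raw page-content measurements for one page.
features = [min_max_normalize(80, 0, 100), min_max_normalize(3, 0, 10)]
weights = [4.0, 2.0]  # illustrative values, not learned from any dataset
bias = -3.0

score = phishing_probability(features, weights, bias)
is_phishing = score >= 0.5  # assumed decision threshold
```

In practice the weights and bias would be fitted on labeled phishing and legitimate pages; the 0.5 threshold is likewise an assumption and could be tuned to trade true positives against false positives.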
TABLE VI: TIME-LINE BASED DEVELOPMENT OF PHISHING DETECTION SCHEMES FROM 2005 TO 2016
Research | Year | Represented (Y/N)
EMD based visual similarity for detection of phishing webpages | 2005 | N
Phishing web page detection | 2005 | N
Anomaly based web phishing page detection | 2006 | N
Detecting phishing web pages with visual similarity assessment based on earth mover’s distance (EMD) | 2006 | Y
Cantina: a content-based approach to detecting phishing web sites | 2007 | Y
A framework for detection and measurement of phishing attacks | 2007 | Y
Anti-phishing based on automated individual white-list | 2008 | N
A phishing sites blacklist generator | 2008 | N
Visual similarity-based phishing detection | 2008 | N
B-apt: Bayesian anti-phishing toolbar | 2008 | N
Phishzoo: An automated web phishing detection approach based on profiling and fuzzy matching | 2009 | N
Fighting phishing with discriminative keypoint features | 2009 | N
Beyond blacklists: Learning to detect malicious web sites from suspicious URLs | 2009 | Y
A hybrid phish detection approach by identity discovery and keywords retrieval | 2009 | N
Identifying suspicious URLs: An application of large-scale online learning | 2009 | Y
Visual similarity-based phishing detection without victim site information | 2009 | N
Automatic detection of phishing target from phishing webpage | 2010 | N
Phishnet: predictive blacklisting to detect phishing attacks | 2010 | Y
Large-scale automatic classification of phishing pages | 2010 | Y
Lexical feature based phishing URL detection using online learning | 2010 | N
Detecting visually similar web pages: Application to phishing detection | 2010 | N
Intelligent phishing detection system for e-banking using fuzzy data mining | 2010 | N
Using domain top-page similarity feature in machine learning-based web phishing detection | 2010 | N
Textual and visual content based anti-phishing: A bayesian approach | 2011 | N
Cantina+: A feature-rich machine learning framework for detecting phishing web sites | 2011 | Y
PhishDef: URL names say it all | 2011 | N
Design and evaluation of a real-time URL spam filtering service | 2011 | Y
Antiphishing through phishing target discovery | 2012 | N
Using visual website similarity for phishing detection and reporting | 2012 | N
PhishAri: Automatic realtime phishing detection on twitter | 2012 | N
Phishing Website detection using latent Dirichlet allocation and AdaBoost | 2012 | N
Phishing detection plug-in toolbar using intelligent Fuzzy-classification mining techniques | 2013 | N
Phishstorm: Detecting phishing with streaming analytics | 2014 | N
An efficacious method for detecting phishing webpages through target domain identification | 2014 | Y
An anti-phishing system employing diffused information | 2014 | Y
Predicting phishing websites based on self-structuring neural network | 2014 | N
Examination of data, rule generation and detection of phishing URL using online logistic regression | 2014 | N
Feature extraction and classification phishing websites based on URL | 2015 | N
New rule-based phishing detection method | 2016 | N
Know Your Phish: Novel Techniques for Detecting Phishing Sites and their Targets | 2016 | Y
PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder | 2016 | N

The most representative work in this category is done by Fu et al. They propose an effective phishing website detection approach via visual similarity assessment based on Earth Mover’s Distance (EMD). The detection process contains two phases, namely, generating the signatures of web pages and computing the visual similarity score from the EMD.

The web page processing phase (i.e., generating the signature) contains three steps: (1) obtain the image of a web page from its URL using the Graphics Device Interface (GDI) API; (2) perform image normalization (the normalized image size is 100 x 100, and the Lanczos algorithm is used to resize the image); (3) represent the web page image by a visual signature. The signature is comprised of the image color tuple using the [Alpha, Red, Green, and Blue] (ARGB) scheme and the centroid of its position in the image.

The second phase is to compute the EMD between the visual similarity signatures of the two web pages (legitimate site and phishing site). Firstly, the normalized Euclidean distances between the degraded ARGB colors and between the centroids are computed. Then the two distances are added up with their corresponding weights (i.e., $p$ and $q$, $p + q = 1$). The normalized feature distance between $\phi_i$ and $\phi_j$ is defined as:

$$d_{ij} = ND_{feature}(\phi_i, \phi_j) = p \cdot ND_{color}(dc_i, dc_j) + q \cdot ND_{centroid}(C_{dc_i}, C_{dc_j})$$

where $\phi_i = \langle dc_i, C_{dc_i} \rangle$, $dc = \langle dA, dR, dG, dB \rangle$ is the color tuple, and $C_{dc}$ is the centroid value. Suppose we have signature $S_{s,a}$ and signature $S_{s,b}$; the EMD between $S_{s,a}$ and $S_{s,b}$ can be calculated as:

$$EMD\{S_{s,a}, S_{s,b}\} = \frac{\sum_{i,j} f_{ij} \cdot d_{ij}}{\sum_{i,j} f_{ij}}$$

where $f_{ij}$ is the flow matrix calculated through linear programming. Note that if EMD = 0, the two images are identical; if EMD = 1, they are completely different.

Finally, the EMD-based visual similarity of two images is defined as:

$$VS\{S_{s,a}, S_{s,b}\} = 1 - \left[EMD\{S_{s,a}, S_{s,b}\}\right]^{\alpha}$$

where $\alpha \in (0, +\infty)$ is an amplification factor that limits the skewness of the visual similarity so that it is well distributed in the (0, 1) range.

Large-scale experiments with 10,281 suspected web pages are carried out and the proposed scheme achieves a 0.71% false positive rate and an 89% true positive rate.

Similar works based on visual similarity include [...].

Page content based methods: Zhang et al. propose CANTINA, a novel content-based approach for detecting phishing web sites based on the Term Frequency/Inverse Document Frequency (TF-IDF) information retrieval metric. In addition, the false positive rate is reduced by using some heuristics. Generally, CANTINA works as follows:

1) CANTINA calculates the TF-IDF score of each term of the content in the given website.
2) CANTINA generates a lexical signature by taking the five terms with the highest TF-IDF weights.
3) CANTINA sends the lexical signature to a search engine (i.e., in their case, Google Search).
4) If the domain name of the current website matches the domain name of one of the top N search results, it is considered to be a legitimate website. Otherwise, it is concluded to be a phishing site. Note that the value of N affects the false positives.

CANTINA with TF-IDF alone results in a relatively high false positive rate. Therefore, several heuristics are used to reduce the false positive rate, including:

• Age of Domain: it examines the age of the domain name. If the domain has been registered for more than 12 months, the heuristic returns +1 (i.e., legitimate), otherwise it returns -1 (phishing).
• Known Images: it examines whether a page contains inconsistent well-known logos.
• Suspicious URL: it examines if the URL contains an “@” or a “-” in the domain name.
• Suspicious Links: for each link in the webpage, it performs the above three URL checks.
• IP Address: it examines if the URL contains an IP address.
• Dots in URL: it examines the number of dots in the URL.
• Forms: it examines if a web page contains any HTML text entry form requesting sensitive personal data (e.g., password).

In addition, CANTINA uses a simple forward linear model to make the decision:

$$S = f\left(\sum_i w_i \cdot h_i\right)$$

where $h_i$ is the result of each heuristic, $w_i$ is the weight of each heuristic, and $f$ is a simple threshold function: $f(x) = 1$ if $x > 0$, and $f(x) = -1$ otherwise.

[...] learning techniques to detect phishing. It has been shown to achieve a 0.4% false positive rate and over a 92% true positive rate.

Similar works based on page content include [...].

URL based methods: Garera et al. claim that it is often possible to tell whether or not a URL belongs to a phishing attack without requiring any knowledge of the corresponding page data. By feeding several selected features (i.e., page rank, domain name white list and URL based features) into a logistic regression learning algorithm, the proposed scheme is efficient and has a high accuracy.

The most representative work in this category is done by Ma et al. They propose a phishing detection approach to automatically classify URLs based on different data mining algorithms across both lexical and host based URL features.

The lexical features selected in this method include the length of the hostname, the length of the entire URL, as well as the number of dots in the URL. In addition, the authors create a binary feature for each token in the hostname (delimited by “.”) and in the path URL (strings delimited by “/”, “?”, “.”, “=”, “-” and “_”). The host-based features contain: (1) IP address properties (e.g., is the IP address in a blacklist?); (2) WHOIS properties (e.g., the date of registration, update, and expiration); (3) Domain name properties (e.g., the time-to-live (TTL) value for the DNS records associated with the hostname); (4) Geographic properties (e.g., the continent/country/city that the IP address belongs to).

All the features of the URL are encoded into high dimensional feature vectors and then different types of classifiers are applied to them. Here are some examples of the classifiers:

• Naive Bayes: Let $x$ denote the feature vector and $y \in \{0, 1\}$ denote the label of the website, with $y = 1$ for malicious and $y = 0$ for legitimate ones. $P(x|y)$ denotes the conditional probability of the feature vector given its label. Then, assuming that malicious and legitimate websites are equally probable, the posterior probability that the feature vector $x$ belongs to a malicious URL is computed as:

$$P(y = 1 | x) = \frac{P(x | y = 1)}{P(x | y = 1) + P(x | y = 0)}$$

Finally, the right hand side of the equation is thresholded to predict the binary label of the feature vector $x$.

• Support Vector Machine (SVM): The decision using SVMs is expressed in terms of a kernel function $K(x, x')$ that computes the similarity between two feature vectors and non-negative coefficients $\alpha_i$ that indicate which training examples lie close to the decision boundary. SVMs [...]

99% False Positive Rate 0.71% 1% 0.7% 0.1% 3%
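As a hedged illustration of the Naive Bayes rule described above for URL feature vectors, the sketch below computes $P(y = 1 | x)$ under equal class priors and thresholds it. It assumes binary features that are conditionally independent given the label (the usual "naive" assumption). The per-feature probabilities, the 0.5 threshold, and the function names are hypothetical and are not taken from Ma et al.

```python
import math

def naive_bayes_posterior(x, p_given_malicious, p_given_legit):
    """Return P(y=1 | x) for a binary feature vector x, assuming equal priors.

    p_given_malicious[i] is P(x_i = 1 | y = 1); p_given_legit[i] is
    P(x_i = 1 | y = 0). Features are treated as conditionally independent.
    """
    log_like_mal = 0.0
    log_like_leg = 0.0
    for xi, p1, p0 in zip(x, p_given_malicious, p_given_legit):
        log_like_mal += math.log(p1 if xi else 1.0 - p1)
        log_like_leg += math.log(p0 if xi else 1.0 - p0)
    like_mal = math.exp(log_like_mal)  # P(x | y = 1)
    like_leg = math.exp(log_like_leg)  # P(x | y = 0)
    return like_mal / (like_mal + like_leg)

def predict_label(x, p_given_malicious, p_given_legit, threshold=0.5):
    """Threshold the posterior to obtain the binary label (1 = malicious)."""
    posterior = naive_bayes_posterior(x, p_given_malicious, p_given_legit)
    return 1 if posterior >= threshold else 0

# Hypothetical binary features, e.g. "token 'login' in path", "host is a .com".
p_mal = [0.8, 0.3]  # illustrative P(x_i = 1 | malicious)
p_leg = [0.2, 0.6]  # illustrative P(x_i = 1 | legitimate)

posterior = naive_bayes_posterior([1, 0], p_mal, p_leg)
label = predict_label([1, 0], p_mal, p_leg)
```

With these illustrative numbers the likelihoods are 0.8 x 0.7 = 0.56 for the malicious class and 0.2 x 0.4 = 0.08 for the legitimate class, so the posterior is 0.56 / 0.64 = 0.875 and the URL is labeled malicious. For the very high dimensional vectors mentioned in the text, the two likelihoods would normally be compared in log space to avoid underflow.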