A Novel Phishing Page Detection Mechanism Using HTML Source ...

September 27, 2016 | Author: Anonymous | Category: HTML/HTML5


A Novel Phishing Page Detection Mechanism Using HTML Source Code Comparison and Cosine Similarity

Roopak.S, Tony Thomas
School of Computer Science & IT
Indian Institute of Information Technology and Management - Kerala
Thiruvananthapuram, India
Email: [email protected]

Abstract: Phishing is a social engineering technique used by hackers to steal information, and sometimes money, from online users. Phishing web sites imitate legitimate web sites. Our aim is to detect phishing pages and block them. In this paper, we propose a novel method for detecting phishing pages: it searches the web for pages similar to the one being visited and compares them by matching their HTML source code and by computing the cosine similarity of their textual content. We then developed a browser capable of detecting phishing pages. The browser was tested with more than 20 phishing sites from Phishtank.com, using different tag match percentage and cosine similarity values. The results indicate that the detection rate of the proposed mechanism is high compared to other existing methods.

Keywords: phishing; social engineering; web mining; cosine similarity

I. INTRODUCTION

Phishing is the act of attempting to acquire information such as usernames, passwords, and credit card details (and sometimes, indirectly, money) by masquerading as a trustworthy entity in an electronic communication [1][2][14]. It occurs on social networking sites, e-commerce sites, auction sites, banking sites and so on. Phishing is a technique used by hackers to trap online users. It is commonly carried out by e-mailing the victim a link to a malicious web site; the source address may be a spoofed e-mail address. When the victim clicks the link, a malicious copy of the legitimate site opens. The victim mistakes it for the original web site and enters a username and password or credit card details. The duplicate site runs on the attacker's server, so the entered details are delivered to the attacker's system, for example in the form of text files. The attacker can then use this information to log in to the original site or to carry out financial transactions. This is now a serious security problem, as it can result in loss of data and money. It is usually difficult to distinguish between the legitimate site and its malicious duplicate because they may look exactly alike.

Phishing is a well-known social engineering attack. Social engineering, in the context of information security, refers to the psychological manipulation of people into performing actions or divulging confidential information [4]. The first recorded mention of the term 'phishing' is found in the hacking tool AOHell (according to its creator), which included a function for stealing the passwords or financial details of America Online users [4].

A typical phishing attack is performed as follows. The attacker first saves a copy of the legitimate web site to the hard disk and then edits that page in an editor. If there is a login form, its action attribute is changed so that the form data is redirected to a program on the attacker's PC. The duplicate site is hosted on a server installed on the attacker's PC, and its IP address is sent to the user in an e-mail. When a victim clicks the link to that IP address, the malicious web site is loaded. When the victim enters login details and other information on the duplicate site, the attacker receives the information.

The non-technical ways to prevent phishing attacks are the following:
• Do not reply to emails with any personal information. If you have reason to believe that the request is genuine, call the institution or company directly;
• Do not click on links in an email message. If you have reason to believe that the request is genuine, type the web address of the company or institution directly into your web browser.

The rest of the paper is organized as follows. Section II describes the related work and its limitations. Section III describes the proposed method and its implementation. Section IV provides the results of our experiments. Section V provides concluding remarks and some directions for research.

II. RELATED WORKS

The Google Safe Browsing API [5] maintains a list of detected phishing URLs which is frequently updated by Google. It compares the URL entered by a user with the list. If a match is found, it raises a phishing alert.
The Google Safe Browsing API is used by web browsers such as Mozilla Firefox and Google Chrome. This protocol is still at an experimental stage. Its drawback is that phishing sites do not exist for a long time. The attackers continuously change the URLs and IP addresses or may delete the phishing site from their server. Hence, maintaining a huge blacklist of phishing URLs and updating it frequently is not easy.

CANTINA [7] is based on the TF-IDF information retrieval algorithm. TF stands for 'term frequency', the number of times a particular word appears in a document. IDF stands for 'inverse document frequency', which measures how rare a word is across a collection of documents: the fewer documents contain the word, the higher its IDF. Both are often used in information retrieval and text mining. CANTINA is available as a toolbar for Internet Explorer. Its method is to search Google with the 5 words of the page that have the highest TF-IDF values. If the first n results contain the URL of the page, it is treated as a legitimate site, otherwise as a phishing site. Its drawback is that legitimate sites that are not indexed by the search engine also get blocked and shown as phishing sites.

Hara, M., Yamada, A., and Miyake, Y. suggested a method [8] based on comparing browser-rendered screenshots for visual similarity. Eric Medvet and Engin Kirda suggested a method [6] based on comparing web page signatures. The signatures are generated from the text, the images and the overall browser-rendered image of the web page, and the text similarity, image similarity and overall similarity of the web pages are considered. Its drawback is that, because web sites are dynamic, their contents change continuously over time, and the method cannot keep up with these changes.

Phishers generally copy an entire web site and turn it into a phishing site. This means that they generally reuse the HTML, CSS and image files of the site. In [9], Brad Wardman and Gary Warner suggested a method that downloads all the files of the web site and computes the MD5 hash value of the main page as well as of the component files. 'Digital PhishNet' (DPN) is a database that contains the MD5 hash values of phishing pages and their associated files. If the MD5 checksum of the site matches one in the DPN database, the site is confirmed as a phishing site and is shared with the DPN database. Its drawback is that if just one bit of a file in the malicious phishing site differs from the stored copy, the MD5 hashes differ and the detection process fails.

In the layout-similarity-based approach for detecting phishing pages [10], the authors compare two web sites based on HTML source code analysis. They extract the DOM tree representation of the HTML code of each page and compare the trees for similarity. They used two comparison methods: one is a simple tag comparison method and the other is based on isomorphic subgraph identification. The tag comparison method is more effective than the other method. These methods have certain disadvantages. In modern web sites tags are repeated many times, and this can result in multiple matching.
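The fragility of exact hash matching noted above for the MD5-based approach can be seen in a minimal Python sketch. The page content and the helper name `md5_hex` are made up for illustration and are not taken from the cited work; only the standard `hashlib` module is used.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical page content standing in for a stored file.
original = b"<html><body><form action='login'></form></body></html>"

# Flip a single bit in the first byte to simulate a minimally altered copy.
altered = bytearray(original)
altered[0] ^= 0x01
altered = bytes(altered)

print(md5_hex(original))
print(md5_hex(altered))
print("hashes match:", md5_hex(original) == md5_hex(altered))
```

Because the two digests differ, a lookup keyed on the exact hash would miss the altered copy, which is the limitation that motivates similarity-based comparison instead of exact matching.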

III. THE PROPOSED METHOD AND ITS IMPLEMENTATION

The proposed method searches Google for web pages similar to the page being visited and compares them with it by HTML source code matching. If there is no match in the HTML source code, the cosine similarity of the pages is compared. The flowchart of the proposed method is shown in Figure 1.
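The overall flow can be sketched in Python as follows. This is an illustrative reading of the method, not the authors' implementation: the function names, the tokenisation rule, and the way candidates are passed in (as already-fetched pairs of IP address and HTML, in search-rank order) are assumptions. The signature follows the description in Section III-A (title plus the 5 words with the highest term frequency) and the IP-address rule follows Section III-B. Two further assumptions are made where the text is silent: the decision is taken on the first similar result, and a page with no similar result is not flagged.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Tuple


def build_signature(title: str, body_text: str, k: int = 5) -> str:
    """Search signature: page title plus the k most frequent words.

    Assumption: words are lowercase alphanumeric runs, no stop-word removal.
    """
    words = re.findall(r"[a-z0-9]+", body_text.lower())
    top_words = [word for word, _ in Counter(words).most_common(k)]
    return " ".join([title.strip()] + top_words)


def classify(
    primary_ip: str,
    primary_html: str,
    candidates: Iterable[Tuple[str, str]],
    tag_similar: Callable[[str, str], bool],
    text_similar: Callable[[str, str], bool],
) -> str:
    """Classify the primary page against the secondary pages from the search.

    candidates: (ip_address, html) pairs for the first n search results.
    tag_similar / text_similar: the two comparison tests; the text test is
    only evaluated when the tag test fails.
    """
    for ip, html in candidates:
        if tag_similar(primary_html, html) or text_similar(primary_html, html):
            # A similar page served from a different IP address suggests a copy.
            return "legitimate" if ip == primary_ip else "phishing"
    # Assumption: no similar page found means the page is not flagged.
    return "legitimate"
```

Passing the two similarity tests in as functions keeps the sketch independent of how tag matching and cosine similarity are actually computed.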

Figure 1. Flow chart

A. Mining URL by Google Search

First, a signature is built from the title of the web page and the 5 words of the page with the highest TF (term frequency) values. This signature is submitted to Google search and the first n result pages are retrieved. The next step is to compare the web page at the user-entered URL with these pages for similarity.

B. Compare The Pages

The next step is to compare the web pages with each other by HTML code matching and by checking the textual part of the pages for cosine similarity. If either of the two tests matches, the pages are assumed to be similar. For this, we treat the page at the user-entered URL as the primary web page and the pages returned by the Google search as secondary pages. We compare the primary page with these n secondary pages. If a match is found, the IP addresses of the pages are checked. If the IP addresses are different, the page is considered a phishing page; otherwise it is considered a legitimate page.

C. HTML Source Code Comparison Method

In this step, the primary web page is compared with the secondary web page by matching the HTML tags and their attributes. The numbers of matching and mismatching tags of the web pages are counted and the percentage of matching tags is computed. If this 'percentage tag matching similarity' exceeds 50%, the pages are treated as similar. The algorithm is given below.

Data: Vector A, Vector B
Result: matchfound
i = 0; match = 0;
while A(i) is not empty do
    j = 0;
    while B(j) is not empty do
        if A(i) == B(j) and A(i).attribute == B(j).attribute then
            match = match + 1;
            break;
        end
        j = j + 1;
    end
    i = i + 1;
end
if ((match / totaltags) * 100) is greater than 50 then
    matchfound = true;
else
    matchfound = false;
end
Algorithm 1: HTML code comparison algorithm
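A minimal Python sketch of this tag comparison is given below, using only the standard `html.parser` module rather than the jsoup parser mentioned in the paper. Several details are assumptions, since the algorithm leaves them open: `totaltags` is taken to be the number of start tags in the primary page, a tag matches only if its name and its full set of attribute name/value pairs are equal, and each tag of the secondary page can be matched at most once so that repeated tags are not counted multiple times.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List, Tuple

Tag = Tuple[str, Tuple[Tuple[str, str], ...]]


class TagCollector(HTMLParser):
    """Collects every start tag together with its sorted attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[Tag] = []

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None (e.g. <input disabled>); use "" instead.
        normalised = tuple(sorted((name, value or "") for name, value in attrs))
        self.tags.append((tag, normalised))


def extract_tags(html: str) -> List[Tag]:
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def tag_match_percentage(primary_html: str, secondary_html: str) -> float:
    """Percentage of primary-page tags that also occur in the secondary page."""
    primary = extract_tags(primary_html)
    if not primary:
        return 0.0
    available = Counter(extract_tags(secondary_html))
    match = 0
    for tag in primary:
        if available[tag] > 0:
            match += 1
            available[tag] -= 1  # consume the tag so it is matched only once
    return 100.0 * match / len(primary)


def pages_similar(primary_html: str, secondary_html: str, threshold: float = 50.0) -> bool:
    """True when the tag match percentage exceeds the threshold (50% in the paper)."""
    return tag_match_percentage(primary_html, secondary_html) > threshold
```

For example, a copied login page whose only change is the form's `action` attribute still shares its remaining tags with the original, so the percentage stays well above the 50% threshold.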

D. Comparison Based on Cosine Similarity

The textual part of the primary web page is compared with that of the secondary web page using 'cosine similarity'. If the cosine similarity exceeds 0.50, the pages are treated as similar.

E. Implementation

The proposed algorithm is implemented in Java with the jsoup HTML parser and the Gson package. We built a test browser using Java. The jsoup HTML parser is used to parse the HTML source code obtained from the user-entered URL. The Gson package is used when searching Google, to obtain the results as URLs.

IV. EXPERIMENTAL RESULTS

The proposed method was tested with many phishing sites taken from phishtank.com. PhishTank is an anti-phishing site launched in October 2006 by David Ulevitch. It is used by Opera, Yahoo, McAfee and so on. It is a community-based system where a user can submit a suspected phishing site and others can vote on whether or not it is a phishing site.

Table 1: False Positive and Negative Rates in HTML Tag Match Similarity

Tag Match Percentage    False Positive (%)    False Negative (%)
0                       100                   0
10                      80                    0
20                      75                    0
30                      60                    0
40                      20                    0
50                      0                     0
60                      0                     25
70                      0                     40
80                      0                     65
90                      0                     85
100                     0                     100

In Table 1, we show the percentage of false positives (FP) and false negatives (FN) against the tag match similarity, with the tag match percentage used as the threshold. The FP and FN values are obtained for different match percentage values. Compared to the layout-similarity-based approach for detecting phishing pages [10], the false positive rate we obtained is very low, as our algorithm also checks for similarity of attributes. The graph thus obtained is shown in Figure 2.

In Table 2, we show the percentage of false positives (FP) and false negatives (FN) against the cosine similarity, with a particular cosine similarity value used as the threshold. The FP and FN values are obtained for different cosine similarity values. The graph thus obtained is shown in Figure 3.
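The cosine similarity test of Section III-D can be sketched in Python as follows. For two term vectors A and B it computes (A · B) / (|A| |B|). The paper does not say how the term vectors are built, so the choices here are assumptions: the text is split into lowercase alphanumeric words and each word is weighted by its raw count. The function names are illustrative.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words vector: word -> raw count (assumed weighting)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term vectors of the two texts."""
    vec_a, vec_b = term_vector(text_a), term_vector(text_b)
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def texts_similar(text_a: str, text_b: str, threshold: float = 0.50) -> bool:
    """True when the cosine similarity exceeds the threshold (0.50 in the paper)."""
    return cosine_similarity(text_a, text_b) > threshold
```

With raw counts the value lies between 0 (no shared words) and 1 (same word distribution), so it remains usable when the HTML structure of a copied page has been changed but its visible text has not.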

Figure 2. False Positive and Negative Rates in HTML Tag Match Similarity

[2] Van der Merwe, Loock M., Dabrowski M., Characteristics and Responsibilities involved in a Phishing Attack, Winter International Symposium on Information and Communication Technologies, Cape Town, January 2005.
[3] Langberg, Mike, AOL Acts to Thwart Hackers, San Jose Mercury News, September 8, 1995.
[4] Rekouche, Koceilah, Early Phishing, arXiv:1106.4692, 2011.
[5] Google, "Google safe browsing API", http://code.google.com/apis/safebrowsing/, accessed Oct 2011.
[6] Eric Medvet, Engin Kirda, Visual-Similarity-Based Phishing Detection, Proceedings of the 4th international conference on Security and privacy in communication networks, 2008, ACM, New York, NY, USA.

Figure 3. False Positive and Negative Rates in Cosine Similarity

Table 2: False Positive and Negative Rates in Cosine Similarity

Cosine Similarity    False Positive (%)    False Negative (%)
0                    100                   0
0.10                 90                    0
0.20                 80                    0
0.30                 65                    0
0.40                 30                    0
0.50                 15                    0
0.60                 0                     15
0.70                 0                     35
0.80                 0                     55
0.90                 0                     75
1                    0                     100
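One way to read Tables 1 and 2 is to look for the threshold at which the combined error is smallest. The Python sketch below does this with the values reported in the two tables. The selection rule (minimise FP + FN, preferring the lowest threshold on ties) is an assumption made for illustration; the paper does not state how its thresholds of 50% and 0.50 were chosen.

```python
from typing import List, Tuple

# (threshold, false positive %, false negative %) as reported in Table 1.
TAG_MATCH_RATES: List[Tuple[float, int, int]] = [
    (0, 100, 0), (10, 80, 0), (20, 75, 0), (30, 60, 0), (40, 20, 0), (50, 0, 0),
    (60, 0, 25), (70, 0, 40), (80, 0, 65), (90, 0, 85), (100, 0, 100),
]

# (threshold, false positive %, false negative %) as reported in Table 2.
COSINE_RATES: List[Tuple[float, int, int]] = [
    (0.0, 100, 0), (0.10, 90, 0), (0.20, 80, 0), (0.30, 65, 0), (0.40, 30, 0),
    (0.50, 15, 0), (0.60, 0, 15), (0.70, 0, 35), (0.80, 0, 55), (0.90, 0, 75),
    (1.0, 0, 100),
]


def best_threshold(rates: List[Tuple[float, int, int]]) -> Tuple[float, int]:
    """Return (threshold, FP + FN) with the smallest combined error.

    min() keeps the first minimum, so ties go to the lowest threshold.
    """
    threshold, fp, fn = min(rates, key=lambda row: row[1] + row[2])
    return threshold, fp + fn


print("tag match:", best_threshold(TAG_MATCH_RATES))
print("cosine:", best_threshold(COSINE_RATES))
```

Under this rule the tag match table has a single error-free point, while the cosine table has two thresholds with the same combined error, one trading false positives for false negatives.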

V. CONCLUSION

In this paper, we proposed a novel mechanism to detect phishing pages. The experiments carried out show that the false positive rate of the proposed mechanism is comparatively low, as we utilize the Google page ranking information. Its detection rate is also high compared to other existing mechanisms. The proposed phishing page detection algorithm is also computationally efficient, as it analyses only the source code, and it requires only limited memory. The proposed mechanism may not work under code obfuscation, where the HTML source code is in an obfuscated format that is very difficult for a human to understand. This may be overcome to an extent by analysing browser-rendered screenshots, which can be a future direction for research.

REFERENCES

[1] Ramzan, Zulfikar, Phishing attacks and countermeasures, Handbook of Information and Communication Security, Page No: 433, 2010.

[7] Yue Zhang, Jason Hong, and Lorrie Cranor, CANTINA: A content based approach to detecting phishing web sites, Proceedings of the 16th international conference on World Wide Web, 2007, ACM, New York, NY, USA.
[8] Hara, M., Yamada, A., Miyake, Y., Visual Similarity-based Phishing Detection without Victim Site Information, Computational Intelligence in Cyber Security, 2009, CICS '09, Nashville, TN.
[9] Brad Wardman, Gary Warner, Automating Phishing Website Identification through Deep MD5 Matching, eCrime Researchers Summit, 2008, Atlanta, GA.
[10] Rosiello, Angelo P.E., Kirda, E., Kruegel, C., Ferrandi, F., A Layout-Similarity-Based Approach for Detecting Phishing Pages, Security and Privacy in Communications Networks and the Workshops, 2007, SecureComm 2007, Nice, France.
[11] Khonji, M., Iraqi, Y., Jones, A., Phishing Detection: A Literature Survey, Communications Surveys Tutorials, IEEE (Volume: 15, Issue: 4), 2013.
[12] H. Zhang, G. Liu, T. Chow, and W. Liu, Textual and visual content based anti-phishing: A bayesian approach, IEEE Transactions on Neural Networks, Vol. 22, No. 10, Oct. 2011.
[13] Obfuscation (software), en.wikipedia.org/wiki/Obfuscation(Software)
[14] Phishing, en.wikipedia.org/wiki/phishing(Software)
[15] Phish Tank Join and Fight Against Phishing, https://www.phishtank.com/
