Unfortunately, phishing is lucrative, difficult to detect, and relatively easy to get involved in. With digital transformation accelerating around the world, phishing is likely to continue its explosive growth.
According to PhishLabs, the number of phishing attempts increased by nearly 50% in Q1 2021, and there is no reason to believe it will stop rising.
This means increased levels of digital damage and risk. To counteract this increase, new approaches to phishing detection must be tested or current ones improved. One way to improve existing approaches is to make use of web scraping.
Phishers would be hard-pressed to replicate the original site completely. Placing every URL identically, reproducing all images, faking domain age, and so on would require more effort than most are willing to put in.
Also, a perfect replica would likely have a lower success rate, since the target could wander off (for example, by clicking an unrelated URL). Finally, as with any other scam, it isn’t necessary to fool everyone, so a perfect replica would be wasted effort in most cases.
However, phishers are not stupid, or at least the successful ones aren’t. They do their best to make a believable replica with the least effort required. It might not fool the tech-savvy, but then even a perfect replica might not fool the cautious. In short, phishing depends on being “good enough”.
So, due to the nature of the activity, there are always one or two glaring holes that can be discovered. Two good ways to get a head start are to look for similarities between frequently phished sites (e.g., fintech, SaaS, etc.) and suspected phishing sites, or to collect known attack patterns and work from there.
Unfortunately, with the volume of phishing sites appearing daily and the intent to target less tech-savvy people, solving the problem may not be as simple as it seems at first glance. Of course, as is often the case, the answer is automation.
Looking for phishing
Numerous detection methods have been developed over the years. A 2018 overview article published on ScienceDirect lists URL-based detection, layout recognition, and content-based detection. The first often lags behind phishers, as databases are updated more slowly than new websites appear. Layout recognition is based on human heuristics and is therefore more prone to failure. Content-based detection is computationally heavy.
We’ll pay a little more attention to layout recognition and content-based detection, as these are cumbersome processes that benefit greatly from web scraping. Years ago, a group of researchers created a framework for detecting phishing sites called CANTINA. It was a content-based approach that checked signals like TF-IDF scores, domain age, suspicious URLs, misuse of punctuation marks, and so on. However, the study was released in 2007, when automation opportunities were limited.
Web scraping can improve such a framework immensely. Instead of manually hunting for outliers, automated applications can browse websites and download the relevant content. Important details such as those described above can then be extracted, analyzed, and evaluated.
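As a rough illustration, a collector along these lines might fetch a page and pull out a handful of content-based signals. This is only a sketch: it assumes the requests and beautifulsoup4 packages are available, and the URL and chosen features are placeholders rather than a vetted feature set.

```python
# Minimal sketch of an automated content collector (illustrative only).
# Assumes the `requests` and `beautifulsoup4` packages; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

def collect_page_features(url: str) -> dict:
    """Download a page and extract a few content-based signals."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    text = soup.get_text(separator=" ", strip=True)
    links = [a.get("href", "") for a in soup.find_all("a")]

    return {
        "url": url,
        "text": text,  # raw text kept for later TF-IDF analysis
        "title": soup.title.string if soup.title else "",
        "num_links": len(links),
        "num_external_links": sum(1 for href in links if href.startswith("http")),
        "num_forms": len(soup.find_all("form")),  # login forms are a common phishing tell
    }

features = collect_page_features("https://example.com")
print(features["title"], features["num_links"])
```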
Building a network
CANTINA, as developed by the researchers, had one drawback – it was only used to prove a hypothesis. For that purpose, a database of phishing and legitimate websites was compiled, and the status of each was known a priori.
Such methods are suitable for proving a hypothesis, but they are less useful in practice, where we don’t know the status of sites in advance. Practical applications of projects similar to CANTINA would require a significant amount of manual effort – at some point, they would no longer be “practical”.
Theoretically, though, content-based recognition appears to be a strong contender. Phishing sites need to reproduce content almost identically to the original, since inconsistencies such as misplaced images, misspellings, or missing blocks of text can trigger suspicion. They can never stray too far from the original, which means metrics like TF-IDF have to be similar out of necessity.
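To make the TF-IDF point concrete, here is a minimal sketch of comparing a suspected copy against the legitimate page using scikit-learn’s TfidfVectorizer and cosine similarity. The sample texts and the interpretation of the score are illustrative assumptions.

```python
# Hedged sketch: compare a suspected page's text against the legitimate original
# using TF-IDF vectors and cosine similarity. Texts below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_similarity(original_text: str, suspect_text: str) -> float:
    """Return the cosine similarity between TF-IDF vectors of two pages."""
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform([original_text, suspect_text])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# A near-identical copy scores close to 1.0; a crude imitation scores lower.
score = content_similarity("Log in to your account to continue",
                           "Log in to your acount to continue")
print(f"TF-IDF similarity: {score:.2f}")
```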
The downside of content-based recognition has always been the slow and expensive manual work involved. Web scraping, however, shifts most of that manual effort to automation. In other words, it allows us to use existing detection methods at a significantly larger scale.
First, instead of manually collecting URLs or pulling them from an existing database, scraping can quickly build your own. URLs can be harvested from any content that links to the suspected phishing sites, in any form or format.
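A sketch of that collection step might look like the following, using BeautifulSoup to pull every hyperlink out of a document (a spam email rendered as HTML, a forum post, and so on). The function name and filtering rule are assumptions for illustration.

```python
# Illustrative sketch: harvest candidate URLs from any content that links out.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_candidate_urls(html: str, base_url: str) -> set[str]:
    """Pull every hyperlink out of a document for later screening."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        urljoin(base_url, a["href"])           # resolve relative links
        for a in soup.find_all("a", href=True)
        if a["href"].startswith(("http", "/"))  # skip mailto:, javascript:, etc.
    }
```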
Second, a scraper can work through a collection of URLs faster than any human ever could. A manual overview has its benefits, such as seeing the structure and content of a site as rendered rather than as raw HTML.
Visual representations, however, are of little use when we rely on mathematical detection metrics such as link depth and TF-IDF. They can even serve as a distraction, our heuristics pulling us away from the important details.
Parsing itself also becomes a path to detection. Parsers often break when a site’s layout or design changes. If parsing a suspected copy produces unusual errors compared to the same process run on the parent site, those errors can indicate a phishing attempt.
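A crude way to operationalize that idea is to run selectors tuned to the legitimate site against the suspected copy and count how many fail. The selectors below are hypothetical placeholders, not taken from any real site.

```python
# Sketch of using parsing failures as a signal: selectors "tuned" to the
# legitimate site are hypothetical placeholders for illustration.
from bs4 import BeautifulSoup

EXPECTED_SELECTORS = ["#main-nav", "footer .legal-links", "form#login"]  # hypothetical

def parsing_anomalies(html: str) -> list[str]:
    """Return the selectors that fail on a suspected copy of the site."""
    soup = BeautifulSoup(html, "html.parser")
    return [sel for sel in EXPECTED_SELECTORS if soup.select_one(sel) is None]

# Many misses on a page claiming to be the original are cause for suspicion.
```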
In the end, web scraping doesn’t produce completely new methods, at least as far as I can see, but it does breathe new life into older ones. It provides a path to scaling methods that might otherwise be too expensive to implement.
Launching a net
With the right web scraping infrastructure, millions of websites can be scanned daily. Since a scraper collects the source HTML, we have all the text content stored wherever we want. With some further processing, the plain-text content can be used to calculate TF-IDF. A project would likely start by collecting all the important metrics from popular phishing targets and then move on to detection.
There is also a lot of other interesting information we can extract from the source. Any internal links can be visited and stored in an index to build a representation of the site’s overall link depth.
You can detect phishing attempts by building a site tree with a web crawler. Most phishing sites will be shallow for the reasons described above, whereas the well-established companies they imitate have deep link structures. Shallowness alone can therefore be an indicator of a phishing attempt.
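A minimal breadth-first crawler is enough to estimate that depth. The sketch below is illustrative only: a production crawler would need politeness delays, robots.txt handling, and far better error handling.

```python
# Minimal sketch: estimate link depth with a breadth-first crawl of internal links.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def max_link_depth(start_url: str, limit: int = 200) -> int:
    """Crawl internal links breadth-first and return the deepest level reached."""
    domain = urlparse(start_url).netloc
    seen, queue, deepest = {start_url}, deque([(start_url, 0)]), 0

    while queue and len(seen) < limit:
        url, depth = queue.popleft()
        deepest = max(deepest, depth)
        try:
            soup = BeautifulSoup(requests.get(url, timeout=5).text, "html.parser")
        except requests.RequestException:
            continue  # dead links are common on hastily built phishing sites
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return deepest

# A shallow tree on a site imitating a large, established brand is suspicious.
```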
More generally, the data collected can be used to compare TF-IDF, keywords, link depth, domain age, and so on against the metrics of the legitimate website. A mismatch is cause for suspicion.
There is one caveat that has to be decided on the fly – what margin of difference warrants investigation? A line in the sand must be drawn somewhere, and at least at first it will have to be fairly arbitrary.
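One way to draw that line is a weighted deviation score with a hand-picked threshold. The metric names, weights, and threshold below are assumptions meant to show the shape of the rule, not tested values.

```python
# Sketch of an (admittedly arbitrary) scoring rule: compare collected metrics
# against the legitimate site's baseline and flag large deviations.
def suspicion_score(suspect: dict, baseline: dict) -> float:
    """Weighted sum of deviations from the legitimate site's metrics (illustrative)."""
    score = 0.0
    score += 3.0 * (1 - suspect["tfidf_similarity"])                        # content drift
    score += 1.0 * max(0, baseline["link_depth"] - suspect["link_depth"])   # shallowness
    score += 2.0 * (suspect["domain_age_days"] < 30)                        # freshly registered
    return score

THRESHOLD = 2.5  # the "line in the sand" -- arbitrary at first, tuned over time
```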
There is also an important consideration regarding IP addresses and locations. Some content on a phishing site may be visible only to IP addresses from a specific geographic location (or hidden from them). Getting around such restrictions is normally challenging, but proxies provide an easy solution.
Since a proxy always has a location and an IP address associated with it, a sufficiently large pool will provide global coverage. Whenever a geo-based block is found, a simple proxy switch is all that is needed to jump over the hurdle.
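In code, the switch can be as simple as passing a different proxy to the request. The proxy addresses below are placeholders, not a real provider’s endpoints.

```python
# Hedged sketch of fetching a page through region-specific proxies with `requests`.
# The proxy endpoints are placeholders.
import requests

PROXIES_BY_REGION = {
    "us": {"https": "http://us.proxy.example:8080"},
    "de": {"https": "http://de.proxy.example:8080"},
}

def fetch_from_region(url: str, region: str) -> str:
    """Request the page through a proxy located in the given region."""
    response = requests.get(url, proxies=PROXIES_BY_REGION[region], timeout=10)
    return response.text

# If the page body differs wildly by region, the site may be cloaking content.
us_html = fetch_from_region("https://example.com", "us")
de_html = fetch_from_region("https://example.com", "de")
```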
Finally, web scraping, by its nature, produces a lot of data about a specific topic. Most of it is unstructured (something usually fixed by parsing) and unlabeled (something usually fixed by humans). Structured and labeled data is fertile ground for machine learning models.
Building an automated phishing detector through web scraping produces a lot of data for evaluation. Once evaluated, the data would normally lose its value. However, as with recycling, this information can be reused with a few tweaks.
Machine learning models have the disadvantage of requiring huge amounts of data before they start making predictions of acceptable quality. However, if phishing detection algorithms start using web scraping, that volume of data will be produced naturally. Of course, labeling might still be necessary, which would require a considerable amount of manual effort.
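As a sketch of where that data would end up, the snippet below trains a scikit-learn classifier on a handful of toy rows standing in for the scraped, manually labeled dataset; the feature names mirror the metrics discussed above.

```python
# Illustrative sketch: training a classifier on labeled features produced by the
# scraping pipeline. The rows below are toy placeholders, not real measurements.
from sklearn.ensemble import RandomForestClassifier

# Each row: [tfidf_similarity, link_depth, domain_age_days, num_login_forms]
features = [
    [0.97, 1, 14, 1],    # shallow, freshly registered clones
    [0.95, 2, 30, 1],
    [0.99, 9, 4100, 1],  # the legitimate originals
    [0.98, 11, 2600, 1],
]
labels = [1, 1, 0, 0]    # 1 = phishing, 0 = legitimate (manually labeled)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(features, labels)

# Score a newly scraped page (hypothetical values)
print(model.predict_proba([[0.96, 1, 7, 1]]))
```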
Regardless, the information would already be structured well enough to produce acceptable results. While all machine learning models are black boxes, they are not entirely opaque: we can predict that data structured and labeled in a certain way will produce certain results.
For clarity, machine learning models can be thought of the way we think of applying mathematics to physics. Certain mathematical models fit natural phenomena like gravity exceptionally well: gravitational attraction can be calculated by multiplying the gravitational constant by the masses of two objects and dividing the result by the square of the distance between them (F = G·m₁·m₂ / r²). However, being able to plug numbers into the formula does not, by itself, give us an understanding of gravity.
Machine learning models are much the same. A certain data structure produces the expected results, even though how the models arrive at their predictions is unclear. Still, everything else behaves as expected, so outside of marginal cases the “black box” nature does not detract much from the results.
Furthermore, machine learning models seem to be among the most effective methods for detecting phishing. Some automated crawlers with ML implementations can achieve 99% accuracy, according to research published on SpringerLink.
The future of web scraping
Web scraping seems to be the perfect complement to any current anti-phishing solution. After all, most of cybersecurity comes down to combing through vast arrays of data to make the right protection decisions, and phishing is no different – at least through the lens of cybersecurity.
There seems to be a holy trinity in cybersecurity waiting to be harnessed to its fullest potential – analytics, web scraping, and machine learning. There have been some attempts to combine two of the three. However, I have yet to see all three harnessed to their full potential.