The Idea
First, you need a proxy capable of handling whatever the load will be. For this discussion, I will assume
a building or school, but the idea could easily be scaled to the size of an ISP network using multiple
proxies for different areas. Anyway, the idea is that the proxy will check each request and see
if the domain name is registered as a phishing site. Next it will do several other checks, including
a reverse DNS lookup if the host is an IP address, a check of the country of origin, etc., and develop
a score much like most spam software does.
If the URL is a registered phishing site, or anything else prohibited, the proxy will obviously block
the page. Nothing new there. The new part comes into play with the phishing score and built-in AI,
which will:
1. Remove all JavaScript and Flash from the page.
2. Display an HTML element at the top of the page that explains the page might be fake, but that
has a button to close the page if the person does not care.
In this way, pages found to be scam sites due to high scores (for instance, a page from China linking
to images from paypal.com) will be rendered static by removing the JavaScript, and a warning will be
placed right on the page. The removal of JavaScript is important because JavaScript can remove warnings
or rewrite the entire page on load. The way the score is calculated is what is really important.
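As a rough sketch of the cleaning step, here is one way it could look in Python using only the
standard library. The names StaticPageCleaner, make_static and WARNING_BANNER are my own placeholders,
and the banner markup is an assumption. Since the cleaned page cannot run scripts, the banner uses a
plain link rather than a scripted button. A real proxy would want a hardened sanitizer; this only
shows the shape of the idea.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical banner markup; a plain link is used because scripts are removed.
WARNING_BANNER = (
    '<div style="background:#fc0;padding:8px">'
    'Warning: this page may be a fake. '
    '<a href="about:blank">Leave this page</a></div>'
)

BLOCKED_TAGS = {"script", "object", "embed", "applet"}
VOID_BLOCKED = {"embed"}  # blocked tags that never have a closing tag


class StaticPageCleaner(HTMLParser):
    """Re-emits a page without scripts, plugin content, or event handlers."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a blocked element

    def _render(self, tag, attrs, self_closing=False):
        parts = [tag]
        for name, value in attrs:
            if name.startswith("on"):
                continue  # drop inline handlers such as onload or onclick
            if value is not None and value.strip().lower().startswith("javascript:"):
                continue  # drop javascript: URLs
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        return "<" + " ".join(parts) + (" />" if self_closing else ">")

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED_TAGS:
            if tag not in VOID_BLOCKED:
                self.skip_depth += 1
            return
        if self.skip_depth:
            return
        self.out.append(self._render(tag, attrs))
        if tag == "body":
            self.out.append(WARNING_BANNER)  # warning goes at the top of the page

    def handle_startendtag(self, tag, attrs):
        if tag in BLOCKED_TAGS or self.skip_depth:
            return
        self.out.append(self._render(tag, attrs, self_closing=True))

    def handle_endtag(self, tag):
        if tag in BLOCKED_TAGS:
            if tag not in VOID_BLOCKED and self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def make_static(html_text):
    """Return html_text with scripts and plugins removed and a banner added."""
    cleaner = StaticPageCleaner()
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.out)
```

Comments are dropped on purpose, and an unclosed blocked tag swallows the rest of the page, which
errs on the side of showing too little rather than too much.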
Calculating the Score
First, check known databases of URLs (cache results too); a match gets a score of 100/100. Otherwise,
check the local list of domain names and scores to see if the score is known and up to date. If it is
unknown or too old, recalculate. Note that the score is stored for the domain name, even though it is
calculated from the exact URL. This is because hacked servers usually hold multiple phishing sites,
so until the score is recalculated, the entire domain should be blocked.
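The lookup order above might be sketched like this. DomainScoreCache, the one-day MAX_AGE_SECONDS
window, and the calculate callback are all assumptions of mine, and a plain set of domains stands in
for the external phishing databases.

```python
import time
from urllib.parse import urlsplit

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed freshness window, not a tuned value
BLOCKLIST_SCORE = 100           # a database match scores 100/100


class DomainScoreCache:
    """Keeps one score per domain, recalculated from whichever URL is seen."""

    def __init__(self, known_phishing_domains, calculate,
                 max_age=MAX_AGE_SECONDS, clock=time.time):
        self.known = set(known_phishing_domains)  # stand-in for real databases
        self.calculate = calculate                # hypothetical scoring function
        self.max_age = max_age
        self.clock = clock
        self.scores = {}                          # domain -> (score, timestamp)

    def score_for(self, url):
        domain = (urlsplit(url).hostname or "").lower()
        if domain in self.known:
            return BLOCKLIST_SCORE
        now = self.clock()
        cached = self.scores.get(domain)
        if cached is not None and now - cached[1] <= self.max_age:
            return cached[0]          # known and still up to date
        score = self.calculate(url)   # unknown or too old: recalculate
        self.scores[domain] = (score, now)
        return score
```

Because the key is the domain, a second URL on the same host reuses the stored score until it
expires, which is the whole-domain behaviour described above.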
If a score is to be calculated, start with the country of origin. Although the country does not affect the
score directly, it will matter to other steps. Next, check the links on the page. If the majority of
links go to another URL, check whether that URL probably belongs to a bank. Also check if the site linked to has
a good majority of the exact same html. If so, the site would seem to be copying, and thus is probably
a scam site. Next, check any forms. Any text near a form element such as "PIN", "credit card", or
"Expiration Date" means it could be a scam. Next, check the
images or other files linked to by the page. If they come from another server, especially one that
is found to be a bank, it could be a scam. Next, check the URL. If the URL contains a directory that
begins with a period (meaning it is a hidden directory) it is probably a scam. If the URL uses an
IP address instead of a domain name, it is probably a scam.
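A few of these checks are simple enough to sketch right away. Everything below is a simplification
under my own assumptions: the function and signal names are made up, "text near a form" is
approximated as "anywhere on a page that contains a form", only img tags are inspected, and
bank_hosts is a hypothetical list of hosts already judged to be banks. The country, link-majority
and copied-HTML checks are left out.

```python
import ipaddress
import re
from urllib.parse import urlsplit

SENSITIVE_WORDS = ("pin", "credit card", "expiration date")


def host_is_ip(url):
    """True if the URL uses a raw IP address instead of a domain name."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def has_hidden_directory(url):
    """True if any directory in the path begins with a period."""
    segments = urlsplit(url).path.split("/")
    # The last segment is treated as the file name, not a directory.
    return any(seg.startswith(".") and seg not in (".", "..")
               for seg in segments[:-1])


def form_asks_for_secrets(html_text):
    """Crude check: the page has a form and mentions a sensitive term."""
    lowered = html_text.lower()
    if "<form" not in lowered:
        return False
    text = re.sub(r"<[^>]+>", " ", lowered)  # strip tags, keep visible text
    return any(re.search(r"\b" + re.escape(word) + r"\b", text)
               for word in SENSITIVE_WORDS)


def external_image_hosts(page_url, html_text):
    """Hosts other than the page's own that serve its images."""
    page_host = urlsplit(page_url).hostname
    hosts = set()
    for src in re.findall(r'''<img[^>]+src=["']([^"']+)["']''', html_text, re.I):
        host = urlsplit(src).hostname
        if host and host != page_host:
            hosts.add(host)
    return hosts


def collect_signals(url, html_text, bank_hosts=frozenset()):
    """Return the names of the checks that fired for this page."""
    signals = []
    if host_is_ip(url):
        signals.append("ip_address_host")
    if has_hidden_directory(url):
        signals.append("hidden_directory")
    if form_asks_for_secrets(html_text):
        signals.append("sensitive_form")
    if external_image_hosts(url, html_text) & set(bank_hosts):
        signals.append("bank_resources")
    return signals
```

Note that legitimate paths such as /.well-known/ would also trip the hidden-directory check, so a
real version would need an allow list.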
Many other checks can be written, but these are the major ones in my eyes. After that, you can calculate
a score, and clean the HTML if needed. Although I'm not saying exactly how to calculate the score, I
will hopefully be doing that later on since this idea is in an early phase.
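Until the exact formula is worked out, one plausible shape is an additive score of the kind spam
filters use: each check that fires adds points, and the total is capped at 100. Every signal name,
weight and threshold below is a made-up placeholder for illustration, not a tuned value.

```python
# Hypothetical weights; nothing here has been measured or tuned.
WEIGHTS = {
    "ip_address_host": 30,
    "hidden_directory": 30,
    "sensitive_form": 20,
    "bank_resources": 25,
    "copied_html": 40,
}
WARN_THRESHOLD = 50  # assumed cut-off for cleaning the page and adding a warning


def phishing_score(signals):
    """Sum the weights of the distinct signals that fired, capped at 100."""
    total = sum(WEIGHTS.get(signal, 0) for signal in set(signals))
    return min(total, 100)


def should_clean(signals):
    """True if the page should be made static and given a warning."""
    return phishing_score(signals) >= WARN_THRESHOLD
```

With these placeholder numbers, a single weak signal such as a form asking for a PIN stays under
the threshold, while two signals together can push a page over it.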