Review Paper for Spam URL Detection and Image Spam Filtering Using Machine Learning

Table of Contents
Introduction
Related Work
Proposed Idea
Idea 1: Pseudo-OCR for Image Spam Filtering
Idea 2: Character Feature Based on Key Points
Idea 3: Image Spam Filtering
Idea 4: Spam URL Detection Using SVM Algorithm
Conclusion

The growing volume of harmful content on social media calls for automated methods to detect and remove it. This article describes a machine learning classification model built to detect the spread of malicious content in online social networks (OSNs) and online social media. Features from multiple sources were used to detect social network posts containing malicious Uniform Resource Locators (URLs). These URLs can direct users to websites that host malicious content, drive-by download attacks, phishing, spam, and scams. Data was collected through the Twitter streaming application programming interface (API), and VirusTotal was used to label the dataset. A random forest classification model was trained on a combination of features derived from a range of sources.

Phishing is the fraudulent practice of sending emails in order to obtain personal data, login credentials, and other confidential information. It targets users' private information, such as passwords, bank account details, and credit card numbers, which an attacker can subsequently misuse. Our goal is to use key visual characteristics of a web page's appearance as the basis for detecting page similarities, and we propose a new solution to detect phishing web pages efficiently. Page layout and content are key features of a web page's appearance. Since the standard way to specify page layout is through Cascading Style Sheets (CSS), we develop an algorithm to detect similarities in key CSS-related elements.

In this paper, we propose a system that combines the SVM technique with an image spam filter and a MapReduce-based spam filter training platform to achieve higher accuracy in detecting spam URLs and image spam.
After further investigation, parameter optimization, and feature selection, we were able to improve the performance of the classifier.

Introduction

The main challenge for social network security administrators is not only to protect the management system and database of the social network, but also to protect OSN users from exposure to malicious content spread on the network. 60% of social network users have received or been exposed to harmful content such as spam, scams, and drive-by downloads. A number of OSNs are now developing systems to detect such malicious content; for example, Facebook's Immune System detects suspicious activity such as like-jacking, social bots, and fake content.

Phishing is a form of identity theft in which a malicious website masquerades as a legitimate one in order to obtain sensitive information such as passwords, bank account details, or credit card numbers. Phishing relies on spoofed emails that look exactly like genuine ones. These emails are sent to a large number of users and appear to come from legitimate sources such as banks, e-commerce websites, and payment gateways. The illegitimate websites are made to look exactly like the legitimate ones, so users cannot easily tell the difference. Phishing attackers use various social engineering tactics to lure users, such as promising attractive offers simply for visiting the site.

A malicious URL is a URL created for malicious purposes, such as downloading malware onto the victim's computer or improving a site's search engine ranking through black-hat SEO techniques. Such URLs can be embedded in spam or phishing messages. A smart malicious URL detection system is an anti-phishing technique that safeguards the web experience.
Alongside Chinese image spam filtering, our approach uses lexical features, host-based features, and site popularity features of a website to detect suspicious or phishing websites. These features are extracted from the page's source code, taking the URL as input, and are then fed to the classification algorithm. The results of our experiment show that the proposed methodology is very effective in preventing such attacks; performance was measured with a confusion matrix for all classifiers.

Related Work

Most studies in this area aim to find the most predictive features that can be acquired and the best algorithm for building a classification model. Researchers mainly focus on finding new features with high discriminative power and on developing the most accurate machine learning model. Finding highly discriminative features in the field of Internet security and social networks is a real challenge because of the variety of attacks and techniques used by spammers. Because spammers are inventive, spam detection systems are bypassed after some time, and the feature set used for spam detection must be reviewed regularly. Just as security researchers study attacks, spammers and hackers study detection systems; they may therefore modify user properties, content, or the delivery mechanism to get around particular restrictions or detection rules.

For example, a study on spam detection on Twitter suggested that the number of followers is one of the features with the highest discriminative power. However, that power has been steadily weakened by spammers who have made their accounts look more popular. They do this by running spam campaigns that link their fake accounts with other fake accounts, increasing their number of followers.

Burnap et al. used a completely different method to detect malicious URLs.
They implemented a high-interaction honeynet to collect system state changes, such as packets sent and received and CPU usage. The training dataset contained 2,000 examples with a 1:1 spam/non-spam ratio. Ten attributes were used to build a classifier that reflected the system state changes after the URL in a tweet was opened. Burnap et al. examined the shortest time needed to give advance warning that a particular URL leads to malicious content. The best result was reported for a Multilayer Perceptron (MLP) using features acquired after 210 seconds (an F-measure of 0.723). The features used by Burnap et al. require complex data analysis; however, they make it difficult for spam sites to disguise their true nature.

Although recent literature has compared several algorithms, there is a lack of information on important steps in building a machine learning model. In particular, little is said about how feature selection is handled and how parameter tuning is conducted. We address this issue in Section IV.

In another study, the author introduces a method that combines a fingerprint technique with big data processing to detect spam emails. Support Vector Machine (SVM) is the machine learning technique used for spam filtering. Because SVM training is computationally expensive, the spam filter training was run on a MapReduce platform. The author used a content-based spam filter, in which an email is classified as spam or ham based on the data in its content; the header section is therefore ignored. That paper specifically compares implementations of the Fisher-Robinson inverse chi-square function, the AdaBoost classifier, and the KNN classifier.
Proposed Idea

This section details the main steps of this study, starting with data collection and labeling of the dataset, followed by a brief comparison of the most common techniques used in related studies. The main purpose of the system is not only to protect the management system and database of social networks, but also to protect OSN users from exposure to harmful content spread on such networks, since many social network users have received or been exposed to malicious content such as spam, scams, and drive-by downloads.

Idea 1: Pseudo-OCR for Image Spam Filtering

Image spam production technology makes spam images increasingly similar to legitimate ones, and therefore harder to identify directly from image characteristics without any information about the content. More seriously, some advanced applications require more contextual information from the filtering process than a simple spam/ham result. We therefore believe it is essential for an anti-spam system to obtain some information about the content of the image, which apparently can be obtained only through long-established OCR-based methods. Given the disadvantages of traditional OCR, however, it is not our first choice. Instead, we propose the idea of pseudo-OCR, which avoids those defects while still extracting sufficient content information.

Compared with the established technology, our proposed pseudo-OCR offers the following improvements for Chinese image spam filtering. First, pseudo-OCR places less demanding requirements on character recognition: it only needs to determine whether a given character feature belongs to a spam image, not to recognize the character itself. Second, pseudo-OCR can effectively process a very wide range of images, even those with complex backgrounds and man-made interference, which are usually difficult for traditional OCR-based methods.
Finally, for Chinese character recognition, the proposed pseudo-OCR generates template features from selected training images instead of from a set of standard Chinese characters.

Feedback gives the system the learning ability it needs to maintain adequate performance over a long period. It is well known in anti-spam communities that spammers tend to modify their image spam patterns over time, which inevitably degrades the performance of methods based on near-duplicates. Although our proposed methods are not strictly based on near-duplicates, they adopt a similar methodology to extract template character features from known spam images. To handle this predictable defect, a feedback mechanism is introduced in our system. By using detected spam as an additional source of template characters, obsolete template character features can be replaced with new ones, thus sustaining better performance.

Idea 2: Character Feature Based on Key Points

To meet the requirements of pseudo-OCR, the extracted Chinese character feature must also be modified. Considering only some key points of a character, we have devised a new character feature that is probably unsuitable for traditional character recognition but preserves enough content information for pseudo-OCR. The core of extracting this feature is a two-step procedure. In the first stage, the keypoints and their connectivity information are extracted and stored as an adjacency matrix using a DFS-based algorithm; in the second stage, the actual feature is calculated from this adjacency matrix. To identify image spam using this feature, each character feature extracted from a given image is first compared with those of the templates to determine its category, and the distribution of the categories of all these characters is then used for the final judgment.
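The two-stage judgment just described (a category for each character feature, then an image-level decision from the distribution of categories) can be sketched as follows. This is a minimal illustration and not the authors' implementation. The L1 distance and the 0.25 spam ratio come from the filtering step described under Idea 3; the function names, the "unknown" label for features that match no template, the category threshold value, and the toy 3-dimensional vectors (the text describes 20-dimensional features) are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Feature = Sequence[float]
Template = Tuple[Feature, str]  # (feature vector, "spam" or "ham")


def l1_distance(a: Feature, b: Feature) -> float:
    """Sum of absolute coordinate differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def categorize(feature: Feature, templates: List[Template],
               category_threshold: float) -> str:
    """Label a character feature by its nearest template (minimum L1 distance).

    Assumption: a feature farther than category_threshold from every
    template is labeled "unknown".
    """
    best_label, best_dist = "unknown", float("inf")
    for template_feature, label in templates:
        dist = l1_distance(feature, template_feature)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= category_threshold else "unknown"


def is_spam_image(features: List[Feature], templates: List[Template],
                  category_threshold: float,
                  spam_ratio_threshold: float = 0.25) -> bool:
    """Flag an image as spam when the share of spam-labeled character
    features reaches spam_ratio_threshold."""
    if not features:
        return False
    labels = [categorize(f, templates, category_threshold) for f in features]
    return labels.count("spam") / len(labels) >= spam_ratio_threshold


# Toy data: hypothetical 3-dimensional features, for illustration only.
templates = [((0, 0, 0), "spam"), ((10, 10, 10), "ham")]
image_features = [(0, 1, 0), (9, 10, 10), (50, 50, 50), (1, 0, 0)]
print(is_spam_image(image_features, templates, category_threshold=3))
```

In a feedback setting such as the one described above, features from images flagged as spam could be appended to `templates` so that outdated template features are gradually supplemented; how the original system replaces obsolete templates is not specified in the text.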
Idea 3: Image Spam Filtering

Through the feature extraction described above, any input image is converted into a set of 20-dimensional keypoint-based character features. To use these features for filtering image spam, the category of each feature must first be determined. For a given character feature, the minimum L1 distance between it and all template features is calculated and compared with a threshold to determine its category. This threshold is called the category threshold to distinguish it from the predefined threshold introduced below.

Once the categories of all character features of an image are known, the distribution of those categories is used to make the final judgment. All template character features in our implemented system fall into two categories, spam or ham. The proportion of spam features in the image is therefore compared with a predefined threshold, which is calculated during training as the minimum spam feature ratio over all training spam images, to determine whether the image is spam. In our system, a minimum threshold of 0.25 is obtained from a total of 82 training spam images. The experimental results show that our proposed Chinese image spam filtering system using pseudo-OCR usually performs better than the traditional OCR-based method.

Idea 4: Spam URL Detection Using SVM Algorithm

Phishing is a form of identity theft in which a malicious website masquerades as a legitimate one in order to obtain sensitive information such as passwords, bank account details, or credit card numbers. Phishing relies on spoofed emails that look exactly like genuine ones. These emails are sent to a large number of users and appear to come from legitimate sources such as banks, e-commerce sites, and payment gateways. Machine learning is a subset of artificial intelligence.
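To make the classifier input more concrete, the sketch below turns a URL into a handful of lexical features of the kind mentioned in the Introduction. The text does not list the exact features used, so every feature here, the function name `lexical_features`, and the `SUSPICIOUS_TOKENS` list are assumptions chosen for illustration. Host-based and site popularity features, as well as the SVM training itself, are outside this sketch.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; the source does not specify one.
SUSPICIOUS_TOKENS = ("login", "verify", "secure", "account", "update")

# Matches a dotted-quad host such as 192.168.0.1 (octet ranges not validated).
IPV4_PATTERN = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def lexical_features(url: str) -> dict:
    """Extract simple lexical features from a URL string."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_ip_host": bool(IPV4_PATTERN.fullmatch(host)),
        "has_at_symbol": "@" in url,
        "num_suspicious_tokens": sum(tok in lowered for tok in SUSPICIOUS_TOKENS),
    }


features = lexical_features("http://192.168.0.1/login")
print(features)
```

The resulting dictionary can be flattened into a numeric vector (booleans as 0 or 1) and combined with the other feature groups before being passed to an SVM or any other classifier.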