Index
Introduction
Data Mining
Association Rule Mining
Hadoop
Literary Survey
Researchers Mohit K. Gupta and Geeta Sikka's work on multi-objective genetic algorithms
Objectives
Conclusion and Future Scope

The Internet has come to occupy a large space in human life and has become prominent in almost every industry in the world. Its basic advantage is fast communication and the rapid transfer of information through various modes. With the evolution of technology, the Internet is used not only for gaining knowledge but also for communication: it has become a means to exchange and express one's ideas. In the current scenario, people mostly use social networking sites to connect with other people and share information with them. A social network is a large network of individuals interconnected by interpersonal relationships. Individuals exchange large amounts of information in the form of images, videos, and text, and the data thus generated is known as social network data. When analyzed, this data helps reveal various aspects of the community. Data mining is the process of inspecting data from different perspectives to discover previously unknown information. One of the significant tasks of data mining, which helps in discovering associations, correlations, statistically relevant patterns, causality, and emerging patterns in social networks, is known as association rule mining.

Introduction

Previously, people communicated verbally or non-verbally, with non-verbal communication taking place through letters, newspapers, drafts, and so on. These channels were limited, and there were few means of non-verbal communication. The Internet, also known as the network of networks, has enabled people to obtain information globally in various respects. Initially the only use of the web was to collect and share information. Today, the Internet occupies a much larger space in human life and has become prominent in almost all sectors of the world. Its basic advantage is fast communication and the rapid transfer of information through various modes. As time went by, the need to gather information in order to share, contribute, and make an impact increased, and this eventually gave the impetus to collect, analyze, and channel huge volumes of data precisely. The creation, collection, storage, retrieval, and presentation of data have become an integral part of the knowledge society.

Ultimately, the Internet is not only a means of gaining knowledge but is now also used as a means of communication. Today, millions of people use the Internet as a way to express their ideas and share information, mostly through social networking sites or blogs that let them connect with other people. Social networking has therefore spread throughout the world with noteworthy speed. Many social networking sites, such as Facebook and Twitter, are now available; Facebook alone had more than 1.44 billion active users in 2015, which reflects a drastic boom in the emergence of social sites. Twitter, for example, is one social networking site that became popular in a short span of time thanks to simple and innovative features such as tweets, which are short text messages that can be posted and read quickly. The millions of tweets sent every day can be gathered as information that helps in making decisions.
A social network is basically a network of individuals connected by interpersonal relationships. Social network data refers to the data generated by people socializing on these social media sites. When analyzed and mined, this user-generated data helps examine the different resources of the socializing community. This can be accomplished through social network analysis (SNA), the mapping and measuring of relationships. SNA therefore plays a decisive role in representing the various resources of the socializing community.

Data Mining

Data from various social networking sites is stored in files and other repositories. Analyzing and interpreting this huge amount of data together yields a great deal of interesting knowledge that can support further decisions. Data mining, also known as the knowledge discovery process [4], is the process of finding unknown information by analyzing data from different perspectives: patterns are discovered in large datasets, and information is extracted from a dataset and reshaped. The terms data mining and knowledge discovery in databases (KDD) are often used as substitutes for each other, but data mining is actually one step of the knowledge discovery process.

Association Rule Mining

One of the significant tasks of data mining, the discovery of associations, correlations, statistically relevant patterns, causality, and emerging patterns in social networks, is performed through association rule mining. A related technique, frequent itemset mining, plays a significant role in many data mining tasks that attempt to discover interesting patterns from databases, such as association rules, correlations, sequences, classifiers, and clusters; extracting association rules is one of the main problems among these. Recognizing the sets of items, products, symptoms, or characteristics that often appear together in a given database can be seen as one of the most primitive tasks of data mining. For example, the association rule {bread, potatoes} -> {sandwich} would reveal that if a customer buys bread and potatoes together, it is likely that they will also buy a sandwich. Here {bread, potatoes} is the antecedent of the rule and {sandwich} is its consequent, and the strength of the rule is measured by its support and confidence; a short sketch at the end of this section makes these notions concrete. This knowledge can be used for decision-making purposes.

Consider a social network environment that collects and shares user-generated text documents (e.g. discussion threads, blogs, etc.). It would be useful to know which words people generally use when discussing a specific topic, or which sets of words are often used together. For example, in a discussion thread related to "American Elections", frequent use of the word "Economy" shows that the economy is the most important aspect of the political debate. A frequent itemset of size one can therefore be a good indicator of the central topic of the discussion, and frequent itemsets of size two can show what the other important factors are. Hence, a frequent itemset mining algorithm run over the text documents produced on a social network can surface the central topic of discussion and the word-usage patterns in discussion threads and blogs. However, as social network data grows exponentially toward a terabyte or more, it has become increasingly difficult to analyze the data on a single machine.
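To make support and confidence concrete, the following is a minimal Python sketch; the toy posts and the 0.5 threshold are invented for illustration and do not come from any cited work. It counts frequent word sets of size one and two across a handful of short posts, as in the election example above, and computes the confidence of one rule.

    from itertools import combinations

    # Toy "posts" standing in for user-generated social network text.
    posts = [
        {"economy", "election", "jobs"},
        {"economy", "election", "debate"},
        {"economy", "jobs"},
        {"election", "debate"},
    ]

    min_support = 0.5  # a frequent itemset must appear in at least half the posts

    def support(itemset, transactions):
        """Fraction of transactions containing every item of the itemset."""
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Frequent itemsets of size one hint at the central topic; size two
    # shows which topics are discussed together.
    items = sorted(set().union(*posts))
    for k in (1, 2):
        for combo in combinations(items, k):
            s = support(combo, posts)
            if s >= min_support:
                print(f"frequent itemset {set(combo)}: support = {s:.2f}")

    # Confidence of the rule {economy} -> {election}:
    # support(economy and election together) / support(economy)
    conf = support({"economy", "election"}, posts) / support({"economy"}, posts)
    print(f"confidence({{economy}} -> {{election}}) = {conf:.2f}")

Even this brute-force enumeration scans every post for every candidate set, which already hints at why dedicated algorithms such as Apriori and FP-growth, and eventually distributed platforms, become necessary.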
At that scale, the Apriori algorithm [6], one of the best-known methods for extracting frequent itemsets from a transactional database, proves inefficient at managing constantly increasing data. To address this problem, the MapReduce framework [7], a technique for cloud computing, is used.

Hadoop

Hadoop is an open-source platform under the Apache v2 license that provides the analytical technologies and computing power needed to work with large volumes of data. The Hadoop framework allows the user to store and process big data in a distributed environment across many computers connected in clusters, using simple programming models. It is designed to scale from a single server to thousands of machines, each offering local storage and computation. It breaks data into manageable chunks, replicates them, and distributes multiple copies across the nodes of a cluster so that the data can later be processed quickly and reliably. Rather than relying on hardware to ensure high availability, the Apache Hadoop software library itself is designed to detect and handle failures at the application layer, thus providing a highly available service on top of a cluster of computers. Hadoop is also used to conduct data analysis. The main components of Apache Hadoop are a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce.

Literary Survey

Methods for discovering relationships between variables in large databases are collectively called association rule mining. It was introduced by Rakesh Agrawal to check for regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems. For example, the items bread, tomatoes, and mayonnaise point directly to a sandwich: according to various supermarket sales data, if a customer buys tomatoes and mayonnaise together, they might also buy a sandwich. This data can then be used to make decisions.

T. Karthikeyan and N. Ravikumar, in their article, conclude after examination and observation that a lot of attention and focus has been given to the performance and scalability of the algorithms, but not to the quality of the generated rules. According to them, the algorithms could be improved to reduce execution time and complexity while also improving accuracy. Furthermore, they conclude that more attention is needed toward designing an efficient algorithm with fewer I/O operations, by reducing database scans in the association rule mining process. Their paper provides a theoretical investigation of some existing association rule mining algorithms: the underlying concept is introduced at the beginning, followed by an overview of the research works, and the pros and cons of each algorithm are discussed and concluded with an inference.

Rakesh Agrawal and Ramakrishnan Srikant proposed a seed-set concept for generating new large itemsets, called candidate itemsets, whose actual support is counted at the end of each pass; the process repeats until no new large itemsets are found. The two algorithms they designed for finding association rules between items in a large database of sales transactions were named Apriori and AprioriTid.
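As a rough single-machine illustration of this level-wise scheme, the following Python sketch (the function and variable names are my own, and the implementation is deliberately simplified rather than being the authors' exact algorithm) generates size-(k+1) candidates from the frequent size-k itemsets, prunes them, and counts actual support with one database scan per pass.

    from itertools import combinations

    def apriori(transactions, min_count):
        """Level-wise frequent itemset mining (simplified Apriori sketch)."""
        transactions = [frozenset(t) for t in transactions]
        # Pass 1: frequent single items form the initial "seed set".
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s for s, c in counts.items() if c >= min_count}
        result = {s: c for s, c in counts.items() if c >= min_count}
        k = 1
        while frequent:
            # Join step: size-(k+1) candidates from unions of frequent k-itemsets.
            candidates = {a | b for a in frequent for b in frequent
                          if len(a | b) == k + 1}
            # Prune step: every k-subset of a candidate must itself be frequent.
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent
                                 for s in combinations(c, k))}
            # Count actual support with one scan over the database.
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            frequent = {c for c, n in counts.items() if n >= min_count}
            result.update((c, n) for c, n in counts.items() if n >= min_count)
            k += 1
        return result

    baskets = [{"bread", "potatoes", "sandwich"},
               {"bread", "potatoes"},
               {"bread", "sandwich"},
               {"potatoes", "sandwich", "bread"}]
    for itemset, count in sorted(apriori(baskets, 2).items(),
                                 key=lambda x: len(x[0])):
        print(set(itemset), count)

Each pass requires a full scan of the transaction database, which is precisely the I/O cost that the FP-tree-based and distributed approaches surveyed next try to reduce.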
J. Han, J. Pei, and Y. Yin developed a systematic FP-tree-based mining method called FP-growth for extracting frequent patterns based on the concept of fragment growth. The problem was addressed in three aspects: first, the data structure called the FP-tree, in which only frequent items have nodes in the tree; second, an FP-tree-based model that examines an item's conditional pattern base, builds its conditional FP-tree, and mines it recursively; and third, a divide-and-conquer method used in place of the bottom-up search technique.

A new strategy for extracting frequent itemsets from terabyte-scale datasets on cluster systems was developed by S. Cong, J. Han, J. Hoeflinger, and D. Padua, who focused on the idea of a sampling-based framework for parallel data mining. The whole idea of targeted data mining was included in the algorithm, and processor performance, the memory hierarchy, and the available network were taken into account. The developed algorithm extended the fastest sequential algorithm to work in parallel and thus used all the available resources effectively.

A new approach to data mining, known as GPUMiner, was introduced by P. V. Sander, W. Fang, and K. K. Lau; it uses next-generation graphics processing units (GPUs). The system depends on the multi-threaded SIMD (Single Instruction, Multiple Data) architecture provided by GPUs. GPUMiner consists of three components: a CPU-based storage and buffer manager that handles data and I/O transfer between the graphics processing unit and the central processing unit, a CPU-GPU co-processing parallel mining module, and a GPU-based mining visualization module.

Two FP-tree-based techniques, a lock-free dataset-tiling parallelization and cache-conscious FP-arrays, were proposed in "Optimizing Frequent Itemset Mining on Multi-Core Processors"; they address the low utilization of multi-core systems, effectively improve data locality, and exploit hardware and software prefetching. The FP-tree construction algorithm can also be parallelized via the lock-free scheme.

To divide the task of frequent itemset mining in a top-down fashion, C. Aykanat, E. Ozkural, and B. Ucar developed a transaction-database distribution scheme. This method works on a graph whose vertices correspond to frequent items and whose edges correspond to frequent itemsets of size two. A vertex separator partitions this graph so that the distribution of items can be decided and each part mined independently. Two new mining algorithms were developed from this scheme; both replicate the items that fall on the separator, with one of the algorithms replicating the corresponding work while the second recomputes it.

Algorithms of this kind, when run on a single machine, are rendered ineffective by limited memory and CPU resources, which motivates association rule mining based on the MapReduce model. The paper by S. Ghemawat and J. Dean describes an improved Apriori algorithm that can handle huge datasets with a huge number of nodes on the Hadoop platform and can address many problems involving larger, multi-dimensional datasets.

For cloud computing, Jongwook Woo and Yuhang Xu proposed a market-basket analysis algorithm in (key, value) form whose code can be executed on the Map/Reduce platform. The algorithm uses a merge-function technique to produce paired items, and each transaction is sorted alphabetically before the (key, value) pairs are generated, to avoid errors from the same pair appearing in two different orders.
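A minimal sketch of this sorted pair-generation idea, in the style of a Hadoop Streaming mapper and reducer written in Python, might look as follows; the data, function names, and local test harness are my own for illustration, not Woo and Xu's code. Under Hadoop Streaming, the mapper and reducer would run as separate scripts reading from standard input, with HDFS supplying the input splits and the shuffle phase grouping equal keys between the two steps.

    from itertools import combinations

    def mapper(lines):
        """Map step: emit each two-item combination of a transaction once.
        Sorting the transaction first gives every pair a canonical order,
        so (bread, potatoes) and (potatoes, bread) are not counted apart."""
        for line in lines:
            items = sorted(set(line.strip().split(",")))
            for a, b in combinations(items, 2):
                yield (f"{a},{b}", 1)

    def reducer(pairs):
        """Reduce step: sum the counts for each (key, value) pair."""
        totals = {}
        for key, value in pairs:
            totals[key] = totals.get(key, 0) + value
        return totals

    if __name__ == "__main__":
        # Local stand-in for running the mapper and reducer under Hadoop
        # Streaming, where HDFS would supply the input and the framework
        # would shuffle and group the keys between the two phases.
        transactions = ["bread,potatoes,sandwich",
                        "potatoes,bread",
                        "sandwich,bread"]
        for pair, count in sorted(reducer(mapper(transactions)).items()):
            print(pair, count)

Because every transaction is sorted before pairs are emitted, the shuffle phase aggregates all occurrences of a pair under a single canonical key, which is exactly the error the alphabetical sort in Woo and Xu's scheme is meant to avoid.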
A new efficient implementation of frequent itemset mining based on the Map/Reduce framework was proposed by Nick Cercone and Zahra Farzanyar [20]; the framework was then applied to social network data, reducing the running time of the MapReduce-based Apriori algorithm.