An introduction to data mining, including data cleaning, the application of statistical and machine learning techniques to discover patterns in data, and the analysis of the quality and meaning of results. Machine learning topics may include algorithms for discovering association rules, classification, prediction, and clustering. Lab assignments provide practice applying specific techniques and analyzing results. An independent project gives students the opportunity to guide a project from data selection and cleaning through to the presentation of results. Prerequisite: CSCI 362 and statistics (MATH 235, MATH 333, or MATH 335), or permission of instructor.
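As a hedged illustration of one of the techniques named above, the short Python sketch below mines association rules from a toy transaction set; the transactions and the support/confidence thresholds are invented for demonstration and are not part of any course material.

```python
# Minimal association-rule sketch on a toy transaction set (illustrative only).
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support, min_confidence = 0.4, 0.7

# Count the support of every itemset of size 1 and 2 (enough for a toy example).
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[frozenset(itemset)] += 1

n = len(transactions)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}

# Derive rules {a} -> {b} from frequent 2-itemsets:
# confidence({a} -> {b}) = support({a, b}) / support({a}).
for itemset, supp in frequent.items():
    if len(itemset) == 2:
        for a in itemset:
            b = next(iter(itemset - {a}))
            conf = supp / frequent[frozenset({a})]
            if conf >= min_confidence:
                print(f"{{{a}}} -> {{{b}}}  support={supp:.2f}, confidence={conf:.2f}")
```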
Formidable advances in computing power, data acquisition, data storage, and connectivity have created unprecedented amounts of data. Data mining, i.e., the science of extracting knowledge from these masses of data, has therefore established itself as an interdisciplinary branch of computer science.
Data mining techniques have been applied to many industrial, scientific, and social problems, and are expected to have an ever deeper impact on society. The course objective is to provide an introduction to the basic concepts of data mining and the process of extracting knowledge, with insights into analytical models and the most common algorithms.
The second part of the course complements the first module with a review of advanced mining techniques for tabular data and for new forms of data. The advanced classification techniques cover neural networks, SVMs, and ensemble methods. New problems addressed include outlier detection, transactional clustering, time series forecasting, and sequential pattern mining. In addition, problems related to the explainability of classifiers are analyzed.
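As a minimal sketch of two of the topics listed above (ensemble classifiers and a simple form of explainability), the following example trains a random forest on synthetic tabular data and inspects its impurity-based feature importances; it assumes scikit-learn is available and is not taken from the course materials.

```python
# Ensemble classification on synthetic tabular data plus a crude global
# "explanation" via impurity-based feature importances (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Which features the ensemble relies on most.
for i, importance in enumerate(model.feature_importances_):
    print(f"feature {i}: importance {importance:.3f}")
```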
To verify the knowledge acquired during the course, students take a written test covering all the topics presented in class. In addition, students are asked to organize themselves into groups and collaborate on a project whose goal is to analyze a dataset with the different mining methods presented in class. Finally, students take two oral tests on the topics covered in the two modules; the examination format is the same for both modules.
Students will develop teamwork skills. They will also acquire and/or develop an appropriate sensitivity to the design choices involved in setting up an analytical process. Finally, they will learn how to interpret analytical results and how to visualize them appropriately.
During the exam, the project choices made by the student group and its ability to process the data with analytical and mining tools will be evaluated, as will the accuracy and precision the group applied in the design activities.
The exam consists of an oral test on the topics covered in class to verify theoretical knowledge, during which the student also simulates the mining algorithms through written exercises, and a group project whose report is delivered and discussed during the oral exam.
Recent progress in scientific and engineering applications has accumulated huge volumes of high-dimensional data, stream data, unstructured and semi-structured data, and spatial and temporal data. Highly scalable and sophisticated data mining tools for such applications represent one of the most active research frontiers in data mining. Here, we outline the related challenges in several emerging domains.
Biology is in the midst of a revolution, with an unprecedented flood of data forcing biologists to rethink their approach to scientific discovery. First, large-scale data-collection techniques have emerged for a number of data sources limited by throughput, or the amount of available data. Examples of the data glut include: systematic genome DNA sequencing of organisms; high-throughput determination of small molecule structures, as well as large macromolecular structures (such as proteins, RNA, and DNA); large-scale measurements of molecular interactions; and simultaneous measurement of the expression level of all genes (thousands to tens of thousands) in a population of cells. Second, the availability of this data requires biologists to create systems for organizing, storing, and disseminating it, thus creating a need for standard terminologies and the development of standards for interchange and annotation. Third, because of the apparent opportunities for automated learning from the data sets, a market for robust machine learning and data mining algorithms has emerged to take advantage of previous knowledge, without being overly biased in the search for new knowledge. As a result, biology has changed from a field dominated by an attitude of "formulate hypothesis, conduct experiment, evaluate results" to more of a big-science attitude of "collect and store data, mine for new hypotheses, confirm with data or supplemental experiment." The long-term significance of the new data of molecular biology is that it can be combined with clinical medical data to achieve a higher-resolution understanding of the causes for and treatment of disease. A major challenge for data mining in biomedicine is therefore the organization of molecular data, cellular data, and clinical data in ways allowing them to be integrated for the sake of knowledge extraction.
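As one hedged, self-contained illustration of mining expression data for hypotheses, the sketch below clusters a synthetic gene-expression-like matrix so that genes with similar profiles are grouped together; the data are simulated and scikit-learn is assumed.

```python
# Clustering a synthetic gene-expression-like matrix (rows = genes,
# columns = conditions) to group genes with similar expression profiles.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three planted expression patterns, each replicated with noise across 50 "genes".
patterns = rng.normal(size=(3, 10))
expression = np.vstack([p + 0.3 * rng.normal(size=(50, 10)) for p in patterns])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(expression)
for c in range(3):
    print(f"cluster {c}: {np.sum(labels == c)} genes")
```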
A major additional source of information is the published medical literature, increasingly available online in full-text form or as useful (but unstructured) summaries of the main data and biomedical hypotheses.
Data mining flourishes in telecommunications due to the availability of vast quantities of high-quality data. A significant stream of this data consists of call records collected at network switches and used primarily for billing; these records enable data mining applications in toll-fraud detection [1] and consumer marketing [2].
In toll-fraud detection, data mining has been instrumental in completely changing the landscape for how anomalous behaviors are detected. Nearly all fraud detection systems in the telecommunications industry 10 years ago were based on global threshold models; they can be expressed as rule sets of the form "If a customer makes more than X calls per hour to country Y, then apply treatment Z." The placeholders X, Y, and Z are parameters of these rule sets applied to all customers.
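The global threshold model described above can be made concrete with a small, hedged sketch; the call records, the field names, and the values of X, Y, and Z are all invented for illustration.

```python
# Global threshold rule: the same parameters X, Y, Z are applied to every customer.
from collections import defaultdict

call_records = [
    {"customer": "A", "hour": 14, "country": "XY"},
    {"customer": "A", "hour": 14, "country": "XY"},
    {"customer": "B", "hour": 14, "country": "XY"},
    # ... in practice, millions of switch-level records per day
]

X, Y, Z = 1, "XY", "flag for review"  # parameters of the global rule set (invented)

# Count calls per (customer, hour) to country Y.
calls_per_hour = defaultdict(int)
for r in call_records:
    if r["country"] == Y:
        calls_per_hour[(r["customer"], r["hour"])] += 1

for (customer, hour), count in calls_per_hour.items():
    if count > X:  # the same threshold X for all customers
        print(f"customer {customer}, hour {hour}: {count} calls to {Y} -> {Z}")
```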
Methods of this type were presumably in place in the credit card industry a few years before emerging in telecom. But the size of the transaction streams is far greater in telecom, necessitating new approaches to the problem.
It is expected that algorithms based on call-graph analysis and customized monitoring will become more prevalent in both toll-fraud detection and marketing of telecommunications services. The emphasis on so-called "relational data" is an emerging area for data mining research, and telecom provides relational data of unprecedented size and scope.
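A minimal, assumed illustration of the call-graph idea: build a directed graph from (caller, callee) pairs and compute a simple per-customer relational feature such as the number of distinct numbers called. Real systems operate on vastly larger graphs.

```python
# Build a tiny call graph and compute out-degree per calling number.
from collections import defaultdict

calls = [("555-0001", "555-0100"), ("555-0001", "555-0101"),
         ("555-0001", "555-0102"), ("555-0002", "555-0100")]

out_neighbors = defaultdict(set)
for caller, callee in calls:
    out_neighbors[caller].add(callee)

for number, neighbors in out_neighbors.items():
    print(f"{number} called {len(neighbors)} distinct numbers")
```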
The scope, coverage and volume of digital geographic data sets have grown rapidly in recent years due to the progress in data collection and data processing technologies. These data sets include digital data of all sorts, created, processed, and disseminated by government- and private-sector agencies on land use and socioeconomic infrastructure; vast amounts of georeferenced digital imagery and video data acquired through high-resolution remote sensing systems and other monitoring devices; geographic and spatiotemporal data collected by global positioning systems, as well as other position-aware devices, including cellular phones, in-vehicle navigation systems, and wireless Internet clients; and digital geographic data repositories on the Web. Moreover, information infrastructure initiatives, including the U.S. National Spatial Data Infrastructure, facilitate data sharing and interoperability, making enormous amounts of space-related data sharable and analyzable worldwide.
The increasing volume and diversity of digital geographic data easily overwhelm traditional spatial analysis techniques, which handle only limited and homogeneous data sets and carry a high computational burden. To discover new and unexpected patterns, trends, and relationships embedded within large and diverse geographic data sets, several recent studies of geospatial data mining [4] have developed a number of sophisticated and scalable spatial clustering algorithms, outlier analysis techniques, spatial classification and association analysis methods, and spatial data-cleaning and integration tools.
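As a hedged sketch of one technique in this family, the example below applies density-based spatial clustering (DBSCAN) to synthetic georeferenced points; points that do not belong to any dense region are flagged as noise, a simple form of spatial outlier analysis. The coordinates are simulated and scikit-learn is assumed.

```python
# Density-based spatial clustering on synthetic (lat, lon) points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense "hot spots" of georeferenced points plus a few scattered outliers.
cluster_a = rng.normal(loc=[43.7, 10.4], scale=0.01, size=(100, 2))
cluster_b = rng.normal(loc=[45.5, 9.2], scale=0.01, size=(100, 2))
outliers = rng.uniform(low=[42.0, 8.0], high=[46.0, 12.0], size=(5, 2))
points = np.vstack([cluster_a, cluster_b, outliers])

labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(points)
print("clusters found:", len(set(labels) - {-1}))
print("points labeled as noise/outliers:", int(np.sum(labels == -1)))
```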