
Towards Online Concept Drift Detection with Feature Selection for Data Stream Classification

 

Authors: Mahmood Hammoodi, Frederic Stahl and Mark Tennant

Abstract

Data Streams are unbounded, sequential data instances that are generated very rapidly. The storage, querying and mining of such rapid flows of data is computationally very challenging. Data Stream Mining (DSM) is concerned with the mining of such data streams in real-time using techniques that require only one pass through the data. DSM techniques need to be adaptive to reflect changes of the pattern encoded in the stream (concept drift). The relevance of features for a DSM classification task may change due to concept drifts and this paper describes the first step towards a concept drift detection method with online feature tracking capabilities.

The manuscript is accessible online at:

http://centaur.reading.ac.uk/68360/1/asPrintedOpenAccess.pdf
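The detection method itself is set out in the linked manuscript. Purely as a rough illustration of the general idea (not the authors' algorithm), the sketch below keeps a fixed reference window and a sliding current window per feature and reports the features whose distribution has shifted; the class name, window size and threshold are assumptions made only for this example.

```python
from collections import deque

class SimpleFeatureDriftMonitor:
    """Illustrative sliding-window drift monitor (not the paper's method).

    Keeps a fixed reference window and a moving current window per feature
    and flags a feature as drifting when its recent mean moves more than
    `threshold` reference standard deviations away from the reference mean.
    """

    def __init__(self, n_features, window_size=200, threshold=3.0):
        self.window_size = window_size
        self.threshold = threshold
        self.reference = [[] for _ in range(n_features)]
        self.current = [deque(maxlen=window_size) for _ in range(n_features)]

    def add(self, instance):
        """Feed one instance (a list of numeric feature values)."""
        for i, value in enumerate(instance):
            if len(self.reference[i]) < self.window_size:
                self.reference[i].append(value)   # still filling the reference window
            else:
                self.current[i].append(value)

    def drifting_features(self):
        """Return indices of features whose recent mean left the reference range."""
        drifted = []
        for i, (ref, cur) in enumerate(zip(self.reference, self.current)):
            if len(ref) < self.window_size or len(cur) < self.window_size:
                continue  # not enough data yet
            ref_mean = sum(ref) / len(ref)
            ref_std = (sum((v - ref_mean) ** 2 for v in ref) / len(ref)) ** 0.5 or 1e-9
            cur_mean = sum(cur) / len(cur)
            if abs(cur_mean - ref_mean) / ref_std > self.threshold:
                drifted.append(i)
        return drifted
```

Feeding instances one at a time and re-running feature selection whenever `drifting_features()` is non-empty would mimic, very loosely, the kind of online feature tracking the abstract refers to.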

 


BCS SGAI Workshop on Data Stream Mining Techniques and Applications

Introduction
============
The four main dimensions of Big Data are Volume, referring to the size of the data; Velocity, referring to the rapid rate at which data is generated; Veracity, referring to uncertainty in the data; and Variety, referring to data from different kinds of sources, such as text, structured and video data. This workshop’s focus is on the Velocity dimension of Big Data. The analysis of high velocity data has many applications, such as topic detection in Twitter, traffic control and network intrusion detection. The difference compared with data that is stored on disk is that real-time data may change its characteristics over time. At the same time, decision support applications rely on the recency of their supporting data; hence, data generated at high velocity needs to be processed ‘on the fly’. Other applications are more interested in the actual change of the data, e.g. intrusion detection and network fault detection. Hence there is a need for computationally efficient real-time techniques that take changes of the data into consideration.
This workshop welcomes not only papers on data stream mining of high velocity data but also applications from various domains, such as science, engineering, finance, the web, etc. The workshop’s aim is to bring together researchers in this field to present their latest work and to discuss challenges and future directions of research in Data Stream Mining.
Submitted extended abstracts (2 pages) will be reviewed. The authors of the best abstracts will be invited to submit full workshop papers, which will be further reviewed.
Publication of Workshop Papers
==========================
Accepted papers will be published in a special issue of the BCS SGAI publication Expert Update: http://expertupdate.org/
Workshop Website
===============
Topics of interest
==================
* High Velocity Data Stream mining algorithms and techniques
* Big Data Streams
* Concept Drift Detection
* Real-time data mining applications
* Real-time event detection from streaming data.
Important dates
==============
* Extended Abstract Submission (2 pages, any format): extended until 12th August 2016
* Invitation to submit full papers (8 pages): 19th August 2016
* Submission deadline for full papers: 9th September 2016
* Notification of acceptance: 3rd October 2016
* Camera-ready papers and workshop registration: 14th October 2016
* Workshop: 13th December 2016
Workshop chair
==============
* Frederic Stahl, University of Reading, UK
Programme committee
===================
* Frederic Stahl (University of Reading, UK)
* Max Bramer (University of Portsmouth, UK)
* Mohamed Medhat Gaber (Robert Gordon University, UK)
* Joao Gomes (DataRobot, Singapore)
* Thien Le (University of Reading, UK)
Paper submission
================
Extended Abstracts can be sent directly to Dr Frederic Stahl (F.T.Stahl@reading.ac.uk).
Workshop Registration
=====================
One author per paper must present their work at the workshop and be registered for the workshop day of the AI2016 conference: http://www.bcs-sgai.org/ai2016/
* Regular Rate: £120
* Student Rate: £75
* VAT is charged at 20%

Scaling Up Classification Rule Induction Through Parallel Processing

This paper has been published in Cambridge University Press’s  Knowledge Engineering Review. The manuscript is accessible online at:

https://fredericstahl.files.wordpress.com/2012/02/paper13.pdf

Authors: Frederic Stahl and Max Bramer

Abstract

The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelisation seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelisation in the field of classification rule induction.


Towards cost-sensitive adaptation: When is it worth updating your predictive model?

This paper has been published in Elsevier’s Neurocomputing.  The manuscript is accessible online at:

http://centaur.reading.ac.uk/38834/1/NEUCOM-D-13-01447R2.pdf

Authors: Indre Zliobaite, Marcin Budka and Frederic Stahl

Abstract

Our digital universe is rapidly expanding: more and more daily activities are digitally recorded, data arrives in streams, and it needs to be analyzed in real time and may evolve over time. In the last decade many adaptive learning algorithms and prediction systems, which can automatically update themselves with the new incoming data, have been developed. The majority of those algorithms focus on improving the predictive performance and assume that a model update is always desired as soon as possible and as frequently as possible. In this study we consider a potential model update as an investment decision, which, as in the financial markets, should be taken only if a certain return on investment is expected. We introduce and motivate a new research problem for data streams: cost-sensitive adaptation. We propose a reference framework for analyzing adaptation strategies in terms of costs and benefits. Our framework allows us to characterize and decompose the costs of model updates, and to assess and interpret the gains in performance due to model adaptation for a given learning algorithm on a given prediction task. Our proof-of-concept experiment demonstrates how the framework can aid in analyzing and managing adaptation decisions in the chemical industry.
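The paper's framework decomposes adaptation costs and benefits in far more detail than this, but as a back-of-the-envelope illustration of the "return on investment" view (my own toy example with assumed figures, not the authors' framework), a model update can be weighed against its expected payoff like this:

```python
def update_is_worthwhile(error_without_update,
                         error_with_update,
                         cost_per_error,
                         instances_until_next_update,
                         update_cost):
    """Decide whether retraining pays off over the next batch of predictions.

    All inputs are illustrative assumptions: the error rates are expected
    misclassification rates in [0, 1], the costs are in one monetary unit.
    """
    # Expected saving from the reduced error rate over the coming instances.
    expected_gain = (error_without_update - error_with_update) \
                    * instances_until_next_update * cost_per_error
    return expected_gain > update_cost


# Example: a 2% error reduction over 10,000 predictions at a cost of 0.5 per
# error saves 100 units, so an update costing 80 units would be worthwhile.
print(update_is_worthwhile(0.12, 0.10, 0.5, 10_000, 80))  # True
```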

A Scalable Expressive Ensemble Learning Using Random Prism: A MapReduce Approach

This paper has been published in Springer’s Transactions on large-scale data and knowledge-centered systems.  The manuscript is accessible online at:

http://centaur.reading.ac.uk/39793/1/typeinst.pdf

Authors: Frederic Stahl, David May, Hugo Mills, Max Bramer and Mohamed Medhat Gaber

Abstract

The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as in commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study, which show that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. The expressiveness of the decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user’s trust in the system.
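Parallel Random Prism itself distributes Prism base learners over a Hadoop/MapReduce cluster, as described in the manuscript. The sketch below is not that implementation; it merely mimics the map/reduce pattern of ensemble training on a single machine, with Python's multiprocessing as the "map" stage and scikit-learn decision trees standing in for Prism base classifiers. The dataset, pool size and number of base learners are arbitrary illustrative choices.

```python
from multiprocessing import Pool

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def train_on_sample(args):
    """'Map' step: train one base classifier on a bootstrap sample of the data."""
    X, y, seed = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    return DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])


def majority_vote(classifiers, X):
    """'Reduce' step: combine the base classifiers' predictions by majority vote."""
    votes = np.array([clf.predict(X) for clf in classifiers])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)


if __name__ == "__main__":
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    with Pool(processes=4) as pool:  # the 'mappers'
        ensemble = pool.map(train_on_sample, [(X, y, seed) for seed in range(10)])
    predictions = majority_vote(ensemble, X)
    print("training accuracy of the combined ensemble:", (predictions == y).mean())
```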


A Survey of Data Mining Techniques for Social Network Analysis

This paper has been published in Episciences’s Journal of Data Mining and Digital Humanities.  The manuscript is accessible online at:

http://centaur.reading.ac.uk/40754/1/AuthorFinal.pdf

Authors: Mariam Adedoyin-Olowe, Mohamed Medhat Gaber and Frederic Stahl

Abstract

Social networks have gained remarkable attention in the last decade. Accessing social network sites such as Twitter, Facebook, LinkedIn and Google+ through the internet and Web 2.0 technologies has become more affordable. People are becoming more interested in and reliant on social networks for information, news and the opinions of other users on diverse subject matters. The heavy reliance on social network sites causes them to generate massive data characterised by three computational issues, namely size, noise and dynamism. These issues often make social network data very complex to analyse manually, resulting in the pertinent use of computational means of analysing them. Data mining provides a wide range of techniques for detecting useful knowledge from massive datasets, such as trends, patterns and rules [44]. Data mining techniques are used for information retrieval, statistical modelling and machine learning. These techniques employ data pre-processing, data analysis and data interpretation processes in the course of data analysis. This survey discusses different data mining techniques used in mining diverse aspects of social networks over the decades, going from the historical techniques to the up-to-date models, including our novel technique named TRCM. All the techniques covered in this survey are listed in Table 1, including the tools employed as well as the names of their authors.


Random Prism: a noise‐tolerant alternative to Random Forests

This paper has been published in Wiley’s Expert Systems.  The manuscript is accessible online at:

http://centaur.reading.ac.uk/32914/1/AI2011%20%281%29.pdf

Authors: Frederic Stahl and Max Bramer

Abstract

Ensemble learning can be used to increase the overall classification accuracy of a classifier by generating multiple base classifiers and combining their classification results. A frequently used family of base classifiers for ensemble learning is decision trees. However, alternative approaches can potentially be used, such as the Prism family of algorithms, which also induces classification rules. Compared with decision trees, Prism algorithms generate modular classification rules that cannot necessarily be represented in the form of a decision tree. Prism algorithms produce a classification accuracy similar to that of decision trees. However, in some cases, for example if there is noise in the training and test data, Prism algorithms can outperform decision trees by achieving a higher classification accuracy. Nevertheless, Prism still tends to overfit on noisy data; hence, ensemble learners have been adopted in this work to reduce the overfitting. This paper describes the development of an ensemble learner using a member of the Prism family as the base classifier in order to reduce the overfitting of Prism algorithms on noisy datasets. The developed ensemble classifier is compared with a stand-alone Prism classifier in terms of classification accuracy and resistance to noise.
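Prism is not available in the common Python machine learning libraries, so the sketch below only reproduces the flavour of the comparison described above: a single learner versus a bagging ensemble of the same learner, both trained on data with artificially corrupted labels. Decision trees stand in for Prism, and the dataset, noise level and ensemble size are all assumed values, not those used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Flip 20% of the training labels to simulate noisy training data.
rng = np.random.default_rng(1)
flip = rng.random(len(y_train)) < 0.2
y_noisy = np.where(flip, 1 - y_train, y_train)

# A single tree versus a bagged ensemble of trees (the default base estimator
# of BaggingClassifier is a decision tree), both trained on the noisy labels.
single = DecisionTreeClassifier(random_state=1).fit(X_train, y_noisy)
ensemble = BaggingClassifier(n_estimators=50, random_state=1).fit(X_train, y_noisy)

print("single learner accuracy on clean test data :", single.score(X_test, y_test))
print("bagged ensemble accuracy on clean test data:", ensemble.score(X_test, y_test))
```

On label-noise-corrupted training data the bagged ensemble typically degrades less than the single learner, which is the effect the paper investigates for Prism-based ensembles.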