Project title: Unsupervised Predictive Algorithms for Data Streaming Mining coping with unlabelled classes input
Supervisors: Dr Frederic Stahl, Prof. Atta Badii
Project Overview: The field of Data Stream Mining is concerned with the analytics of high-velocity Big Data Streams. A data stream is a sequence of consecutive data instances that is infinite and generated in real-time. Thus, applications such as data mining can only read the sequence once, using limited computing and storage capabilities. Predictive analytics is one of the most important types of data mining techniques, where an unknown variable in a dataset is predicted. For example, imagine a sequence of Twitter posts that is generated in real-time. One application could be to predict if a tweet is related to a specific topic, e.g. politics. A data stream predictor would learn a model that can then be applied to new tweets in order to predict whether they are related to politics. A particular challenge here is the generation of data mining models that automatically adapt to changes in the pattern encoded in the stream (concept drift). In this example, a concept drift could be “breaking news” related to politics that influences the topics being discussed on Twitter. Further application examples are the detection of performance bottlenecks in computer networks and traffic congestion forecasting in smart cities.
The aim of this PhD project is to develop new cutting-edge predictive analytics methods/algorithms for Big Data Streams that can forecast events ahead of time and adapt to concept drift. The project is in collaboration with StreamCentral Data Insights Limited (www.streamcentraldata.com), an industry partner that will contribute real-world case studies and data stream processing infrastructure.
- Applicants should hold or expect to gain a minimum of a 2:1 Bachelor's degree or equivalent in Computer Science, Mathematics or a related subject.
- Due to restrictions on the funding, this studentship is open to UK/EU students.
- Starts September 2019
- 3-year award
- Tuition fees plus RCUK stipend
How to apply:
To apply for this studentship please submit an application for a PhD in Computer Science at http://www.reading.ac.uk/graduateschool/prospectivestudents/gs-how-to-apply.aspx.
- Please quote the reference ‘GS19-025’ in the ‘Scholarships applied for’ box which appears within the Funding Section of your online application.
- When you are prompted to upload a research proposal, please omit this step.
Please note that, where a candidate is successful in being awarded funding, this will be confirmed via a formal studentship award letter; this will be provided separately from any Offer of Admission and will be subject to standard checks for eligibility and other criteria.
For further details please contact Dr Frederic Stahl: F.T.Stahl@reading.ac.uk, tel. +44(0)118 378 8983
Invitation to submit a paper to the Data Stream Analytics (DSM) track of the 33rd INTERNATIONAL CONFERENCE ON MODELLING AND SIMULATION in Napoli, Italy
Going to attend this event on Artificial Intelligence, mainly because of the excellent tracks related to data science. This year there is also a walking tour through Cambridge, including King's College.
Will be attending this long-standing conference on AI. Will I see you there?
Authors: Mahmood Shakir Hammoodi, Frederic Stahl, Mark Tennant, Atta Badii
Data streams are unbounded sequences of data instances that are generated at high velocity. Classifying sequential data instances is a very challenging problem in machine learning, with applications in network intrusion detection, financial markets and sensor networks. Data stream classification is concerned with the automatic labelling of unseen instances from the stream in real-time. For this, the classifier needs to adapt to concept drifts and can only have a single pass through the data if the stream is fast. This research paper presents our work on a real-time pre-processing technique, in particular a feature tracking technique that takes concept drift into consideration. The feature tracking technique is designed to improve Data Stream Mining (DSM) classification algorithms by enabling real-time feature selection. The technique is based on adaptive summaries of the data and class distributions, known as Micro-Clusters. Currently, the technique is able to detect concept drift and identify which features are involved.
Please request a copy here.
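To make the idea of summary-based feature tracking more concrete, here is a minimal Python sketch. It is not the technique from the paper: the class `MicroCluster`, the function `drifting_features` and the `threshold` parameter are hypothetical names chosen for illustration, and the drift test (a mean shift measured in reference standard deviations) is an assumption. The sketch only shows how a summary that stores a count, a per-feature sum and a per-feature sum of squares can be updated in a single pass and then used to point at the features whose distribution has moved.

```python
import math


class MicroCluster:
    """Incremental summary of a stream: count, per-feature sum and squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # linear sum per feature
        self.ss = [0.0] * dim  # sum of squares per feature

    def add(self, x):
        """Absorb one instance; the instance itself is not stored."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def mean(self):
        return [s / self.n for s in self.ls]

    def std(self):
        # Variance from the summaries: E[x^2] - E[x]^2, clipped at zero.
        return [
            math.sqrt(max(ss / self.n - (ls / self.n) ** 2, 0.0))
            for ls, ss in zip(self.ls, self.ss)
        ]


def drifting_features(reference, recent, threshold=3.0, eps=1e-9):
    """Return indices of features whose recent mean has moved by more than
    `threshold` reference standard deviations (an illustrative rule only)."""
    flagged = []
    for i, (m_ref, m_new, s_ref) in enumerate(
        zip(reference.mean(), recent.mean(), reference.std())
    ):
        if abs(m_new - m_ref) / max(s_ref, eps) > threshold:
            flagged.append(i)
    return flagged


# Two summaries over consecutive parts of a synthetic stream:
# feature 0 shifts by +10 in the recent part, feature 1 stays the same.
reference, recent = MicroCluster(2), MicroCluster(2)
for i in range(30):
    reference.add((i % 5, i % 3))
    recent.add((i % 5 + 10, i % 3))

print(drifting_features(reference, recent))
```

On this synthetic data only feature 0 is flagged. A real system would also need a policy for ageing or resetting the reference summary, which this sketch leaves out.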
Authors: Thien Le, Frederic Stahl, Mohamed Medhat Gaber, João Bártolo Gomes, Giuseppe Di Fatta
Mining data streams is a core element of Big Data Analytics. It represents the velocity of large datasets, which is one of the four aspects of Big Data, the other three being volume, variety and veracity. As data streams in, models are constructed using data mining techniques tailored towards continuous and fast model update. The Hoeffding Inequality has been among the most successful approaches in learning theory for data streams. In this context, it is typically used to provide a statistical bound for the number of examples needed in each step of an incremental learning process. It has been applied to both classification and clustering problems. Despite the success of the Hoeffding Tree classifier and other data stream mining methods, such models fall short of explaining how their results (i.e., classifications) are reached (black boxing). The expressiveness of decision models in data streams is an area of research that has attracted less attention, despite its paramount practical importance. In this paper, we address this issue, adopting the Hoeffding Inequality as an upper bound to build decision rules which can help decision makers with informed predictions (white boxing). We termed our novel method Hoeffding Rules with respect to the use of the Hoeffding Inequality in the method, for estimating whether a rule induced from a smaller sample would be of the same quality as a rule induced from a larger sample. The new method brings in a number of novel contributions, including handling uncertainty through abstaining, dealing with continuous data through Gaussian statistical modelling, and an experimentally proven fast algorithm. We conducted a thorough experimental study using benchmark datasets, showing the efficiency and expressiveness of the proposed technique when compared with the state-of-the-art.
Access a copy here.
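As a small illustration of the statistical bound mentioned above, the following Python sketch computes the standard Hoeffding bound: for a variable with range R observed n times, the true mean lies within eps = sqrt(R^2 * ln(1/delta) / (2n)) of the sample mean with probability at least 1 - delta. The function names `hoeffding_bound` and `examples_needed` are hypothetical, and the sketch shows only the generic inequality, not the Hoeffding Rules algorithm itself.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Error margin eps such that, with probability at least 1 - delta, the
    true mean is within eps of the mean observed over n examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(
        value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    )


# The margin shrinks as more examples from the stream are seen.
for n in (100, 1000, 10000):
    print(n, round(hoeffding_bound(1.0, 0.05, n), 4))

# Examples required for a margin of 0.05 at 95% confidence (range 1.0).
print(examples_needed(1.0, 0.05, 0.05))
```

In an incremental learner, a check of this kind can be used to decide when enough examples have arrived for a decision based on the sample so far to be trusted.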
Authors: Mark Tennant, Frederic Stahl, Omer Rana, João Bártolo Gomes
Inducing adaptive predictive models in real-time from high throughput data streams is one of the most challenging areas of Big Data Analytics. The fact that data streams may contain concept drifts (changes of the pattern encoded in the stream over time) and are unbounded imposes unique challenges in comparison with predictive data mining from batch data. Several real-time predictive data stream algorithms exist; however, most approaches are not naturally parallel and are thus limited in their scalability. This paper highlights the Micro-Cluster Nearest Neighbour (MC-NN) data stream classifier. MC-NN is based on statistical summaries of the data stream and a nearest neighbour approach, which makes MC-NN naturally parallel. In its serial version, MC-NN is able to handle data streams: the data does not need to reside in memory and is processed incrementally. MC-NN is also able to adapt to concept drifts. This paper provides an empirical study on the serial algorithm’s speed, adaptivity and accuracy. Furthermore, this paper discusses the new parallel implementation of MC-NN and its parallel properties, and provides an empirical scalability study.
Access a copy here.
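The combination of statistical summaries and a nearest neighbour rule described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the MC-NN algorithm from the paper: it keeps a single summary per class and has no mechanism for adapting to concept drift or for parallel execution. The names `NearestMicroClusterClassifier`, `learn_one`, `predict_one` and `prequential_accuracy` are hypothetical.

```python
import math


class NearestMicroClusterClassifier:
    """One summary (count, per-feature sum) per class; an instance is
    assigned the label of the nearest summary centroid."""

    def __init__(self):
        self.clusters = {}  # label -> [count, per-feature linear sum]

    def learn_one(self, x, label):
        """Update the summary for `label`; the instance is not stored."""
        if label not in self.clusters:
            self.clusters[label] = [0, [0.0] * len(x)]
        cluster = self.clusters[label]
        cluster[0] += 1
        for i, v in enumerate(x):
            cluster[1][i] += v

    def predict_one(self, x):
        """Return the label of the closest centroid, or None if untrained."""
        best_label, best_dist = None, math.inf
        for label, (n, ls) in self.clusters.items():
            centroid = [s / n for s in ls]
            dist = math.dist(x, centroid)
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label


def prequential_accuracy(stream, clf):
    """Test-then-train: each instance is first predicted, then learned,
    so the stream is read exactly once."""
    correct = total = 0
    for x, y in stream:
        if clf.clusters:
            correct += clf.predict_one(x) == y
            total += 1
        clf.learn_one(x, y)
    return correct / total if total else 0.0


stream = [((0, 0), "a"), ((10, 10), "b"), ((0, 1), "a"), ((10, 11), "b")]
print(prequential_accuracy(stream, NearestMicroClusterClassifier()))
```

Because each summary is updated independently of the others, summaries of this kind can in principle be spread across workers, which is the intuition behind calling such an approach naturally parallel.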