IBLStreams MOA Extension
![]() |
Ammar Shaker, Eyke Hüllermeier, and Jürgen Beringer. Knowledge Engineering & Bioinformatics (KEBI) Department of Mathematics and Computer Science Philipps-Universität Marburg |
Abstract
IBLStreams is an instance-based learning algorithm for classification and regression on data streams. The method is able to handle large streams with low requirements in terms of memory and computational power. Moreover, it disposes of mechanisms for adapting to concept drift and concept shift. The implementation of IBLStreams is supposed to be used as an extension to the MOA (Massive Online Analysis) framework for data stream mining. Thus, prior to using the IBLStreams software, you need to install MOA, which can be downloaded directly from this link. Both MOA and its extensions are implemented in JAVA.
Documentation
Prediction Strategies
In instance-based learning, a prediction for the query instance is obtained by combining, in one way or the other, the outputs of the neighbors of this instance in the training data. The type of aggregation depends on the type of problem to be solved. We offer four different prediction schemes, namely the WeightedMode for classification, the WeightedMedian for ordinal classification, and the WeightedMean and LocalLinearRegression for regression problems.
Classification
Regression
Adaptation Strategies
IBLStreams offers two possibilities to adapt the size of the neighborhood, which has an important influence on the performance of an instance-based learner. Still, in case you don't want to allow the learner to adapt, the AdaptationStrategy parameter can be set to none by using the option "-a none". The two adaptation strategies are:
Adapting k
- InitialKValue: The initial value for k. The
default value is 16, while we recommend to set k equal to 4 times the
number of attributes.
- MinKValue: The smallest allowed value for k. The
default value is 4, while we recommend 2 times the number of
attributes.
- MaxKValue: The largest allowed value for k. The
default value is 40, while we recommend 10 times the number of
attributes.
Please note that setting both MinKValue and MaxKValue is meaningless unless AdaptK is used as an adaptation strategy.
- UseDefaultK: Activation of this flag by using the
"-g" option in the command line or enabling the corresponding checkbox
in the GUI means that the initial value for k is set to 4*(num.
dimensions); setting this flag overrides the InitialKValue
parameter.
Adapting the kernel width
The second strategy controls the size of the neighborhood indirectly via the weighting function or, more specifically, the corresponding kernel width. Using this adaptation strategy only makes sense in combination with the Gaussian or the exponential kernel. To activate this adaptation strategy, use the option "-a AdaptSigma", i.e., AdaptSigma for the AdaptationStrategy parameter. Like in the case of AdaptK, the algorithm will then continuously check whether increasing or decreasing the kernel width sigma (by a certain percentage) appears to be beneficial. The following parameters are related to the kernel width:- InitialSigma: The initial value of sigma. The
default value is 0.5. Usage "-q 0.6".
- UseDefaultSigma: By activating this flag, sigma
will be set to the recommended value sqrt(num. dimensions)/10. Setting
this flag overrides the sigma value even if it was defined. Usage
"-h".
Weighting Methods
We offer five different methods for weighting the influence of the k neighbors of the query instances:
- equal: All neighbors are given equal weight,
namely 1/k. Usage: "-w equal"
- inverseDistance: The weight of an instance is
proportional to 1/d, where d is its distance from the query. Usage: "-w
inverseDistance"
- linear: The weight function is a linearly
decreasing function of the distance d. The slope is determined such
that the neighbor with the highest distance has a small weight of 0.001
(while an instance with distance 0 would have a weight of 1). Usage:
"-w linear"
- GaussianKernel: Neighbors are weighted by
centering a Gaussian kernel at the query instance. Usage: "-w
GaussianKernel"
- exponentialKernel: Neighbors are weighted by
centering an exponential kernel at the query instance. Usage: "-w
exponentialKernel"
Required Libraries
The required libraries can be downloaded with our package in the same compressed file.
MOA (Massive Online Analysis), the MOA framework under which our package is working.
XXL (eXtensible and fleXible Library) offers a rich collection of index structures.
JAMA (Java Matrix Package) is a basic linear algebra package for Java.
Additional Parameters
Two further parameters have not been explained so far:
- InitialWidth defines the size of the initial set
of instances on which IBLStreams produces an intial model (in a batch
manner). After having seen this number of instances, the algorithm
switches to its incremental mode of learning. The default value is 1000
instances.
- MaxInstancebaseSize upper-bounds the size of the
case base (training set); the default value is 5000 instances.

The following block shows how the help looks like:
| -i InitialWidth (default: 1000) Size of first Window for training learner. -s PredictionStrategy (default: WeightedMode(Classification)) The way the target value is predicted. WModeClass ==> WeightedMode (Classification) WMedianOClass ==> WeightedMedian (OrdinalClassification) WMeanReg ==> WeightedMean (Regression) LocLinReg ==> Local Linear Regression -a AdaptationStrategy (default: AdaptK) The used adaptation Strategy. AdaptK ==> AdaptK AdaptSigma ==> AdaptSigma none ==> none -w WeightingMethod (default: equal) The used weighting method. equal ==> equal inverseDistance ==> inverseDistance linear ==> linear GaussianKernel ==> GaussianKernel exponentialKernel ==> exponentialKernel -m MaxInstancebaseSize (default: 5000) The maximum allowed size for the instancebase. -j MinKValue (default: 4) Minimum k value, recomended 2 times the number of attributes. This value is considered when the used adaptation strategy is AdaptK. -k InitialKValue (default: 16) Initial k value, recomended 4 times the number of attributes. -l MaxKValue (default: 40) Maximum k value, recomended 10 times the number of attributes. This value is considered when the used adaptation strategy is AdaptK. -q InitialSigma (default : 0.5) Initial sigma value. This value is considered when the used adaptation strategy is AdaptSigma. -h UseDefaultSigma In this case the default sigma is used, when a Gaussian or an exponential kernel is used. sig=sqrt(num. dimensions)/10. -g UseDefaultK In this case the default k. k=(num. dimensions)* 4. |
Examples
- Learning a binary classifier from the hyperplane generator, in 4
dimensional spaces.
-The initial learning is done on a block on 1000 instances at the beginning of the whole process.
-The adaptation strategy is to adapt k, while the prediction is done using the weighted mode. Equal weighting is used to weight instances.
-The incremental learning and the evaluations is done on a streams of 120000 examples.
-The stream is divided into blocks, where the classifier is trained incrementally on a block of 5000 instances and then evaluated (but no longer adapted) on the next 1000 instances, then again trained on the next 5000 and tested on the subsequent 1000 instances, and so forth.
-The final results of every evaluation is logged into the file C:\HyperplaneBinary.csv
java -cp IBLStreams.jar;Jama-1.0.2.jar;xxl-core-2.0.beta.jar;moa.jar -javaagent:sizeofag.jar moa.DoTask "EvaluatePeriodicHeldOutTest -e BasicClassificationPerformanceEvaluator -l (LearnModel -l (moa.classifiers.IBLStreams -s WModeClass -a AdaptK -w equal -i 1000) -s (generators.HyperplaneGeneratorReg -m Binary -i 1 -a 4) -m 1000) -s (generators.HyperplaneGeneratorReg -m Binary -i 1 -a 4) -n 1000 -i 100000 -f 5000 -d C:\\HyperplaneBinary.csv"
-In order to add a concept drift, we merge the data from two different hyperplane generators. The stream starts from one pure hyperplane, later on the concept drift starts after 50000 examples, this is when instances have a higher probability to be from the second hyperplane.
java -cp IBLStreams.jar;Jama-1.0.2.jar;xxl-core-2.0.beta.jar;moa.jar -javaagent:sizeofag.jar moa.DoTask "EvaluatePeriodicHeldOutTest -e BasicClassificationPerformanceEvaluator -l (LearnModel -l (moa.classifiers.IBLStreams -s WModeClass -a AdaptK -w equal -i 1000) -s (generators.HyperplaneGeneratorReg -m Binary -i 1 -a 4) -m 1000) -s (ConceptDriftStream -s (generators.HyperplaneGeneratorReg -m Binary -i 1 -a 4) -d (generators.HyperplaneGeneratorReg -m Binary -i 13 -a 4) -p 50000 -w 1000) -n 1000 -i 100000 -f 5000 -d C:\\HyperplaneBinary.csv"
- Learning a regression model form a data generator which generates
the distance from a point to a hyperplane, again the hyperplane
generator is in 4 dimensional spaces. The stream starts from one
hyperplane and ends with a different one.
-The initial learning is done on a block on 1000 instances at the beginning of the whole process.
-The adaptation strategy is to adapt k, while the prediction is done using the weighted mean. Linear kernel is used for the weighting.
-The way the incremental learning is interleaved with the evaluation is similar to the previous case.
-The final results of every evaluation is logged into the file C:\HyperplaneDistance.csv
java -cp IBLStreams.jar;Jama-1.0.2.jar;xxl-core-2.0.beta.jar;moa.jar -javaagent:sizeofag.jar moa.DoTask "EvaluatePeriodicHeldOutTest -e BasicRegressionPerformanceEvaluator -l (LearnModel -l (moa.classifiers.IBLStreams -s WMeanReg -a AdaptK -w linear -i 300) -s (generators.HyperplaneGeneratorReg -m Distance -i 1 -a 4) -m 1000) -s (ConceptDriftStream -s (generators.HyperplaneGeneratorReg -a 4 -m Distance) -d (generators.HyperplaneGeneratorReg -i 13 -a 4 -m Distance) -p 50000 -w 10000) -n 1000 -i 100000 -f 5000 -d C:\\HyperplaneDistance.csv"
- This case is similar to the previous case except that the data is
generated as the square distance from a point to a hyperplane
-The adaptation strategy is to adapt the width of the Gaussian kernel and the prediction is done by fitting a local linear regression in the neighborhood instances.
-The final results of every evaluation is logged into the file C:\HyperplaneSquareDistance.csv
java -cp IBLStreams.jar;Jama-1.0.2.jar;xxl-core-2.0.beta.jar;moa.jar -javaagent:sizeofag.jar moa.DoTask "EvaluatePeriodicHeldOutTest -e BasicRegressionPerformanceEvaluator -l (LearnModel -l (moa.classifiers.IBLStreams -s LocLinReg -a AdaptSigma -w GaussianKernel -i 300) -s (generators.HyperplaneGeneratorReg -m SquareDistance -i 1 -a 4) -m 1000) -s (ConceptDriftStream -s (generators.HyperplaneGeneratorReg -a 4 -m SquareDistance) -d (generators.HyperplaneGeneratorReg -i 13 -a 4 -m SquareDistance) -p 50000 -w 10000) -n 1000 -i 100000 -f 5000 -d C:\\HyperplaneSquareDistance.csv"
Publications
J. Beringer and E. Hüllermeier.Efficient Instance-Based Learning on Data Streams.
Intelligent Data Analysis 11(6):627-650, 2007.
[Draft-PDF ]
More Publications are to be added soon...
Download
Extension (jar file)Extension with all required libraries (zip file)


