# MySpider

*MySpider* is an extension of Spider. *Spider* is an advanced data mining *Matlab* toolbox released under the GNU GPL v3 license. It is an object-oriented set of classes that allows building complex data-analysis flows. It includes tools for data preprocessing and a large set of classification algorithms, but also algorithms for regression, data clustering, feature selection, model selection, etc. Because *Spider* is object oriented, building a complex flow is very simple.

In *MySpider* some of the basic classes were modified or corrected, so that a number of bugs are fixed. It also changes the algorithm-configuration procedure, so that there is now a common standard for setting any parameter.

MySpider can be obtained from Spider (this version also includes the Spider toolbox), or the newest version, together with the full Spider, can be downloaded from the SVN repository. However, to get the SVN version you need access rights on the SVN server, so please contact us. SVN link: https://hpc.kzi.polsl.pl:8443/svn/myspider

Access to the repository is available with user: guest, password: guest

# Installation

To install MySpider, simply download the zip file or check out the SVN version and put it into the spider folder (the myspider directory is a subdirectory of the spider folder). Then the Matlab path must be updated. To do this you can run the use_spider function available in spider\use_spider.m, which builds the folder structure and automatically adds it to the Matlab path. Alternatively, the spider_subdirs function returns the full path structure as a cell array of strings, so you can process it however you want.
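Assuming MySpider has been unpacked into the spider folder, the two path-setup options described above can be sketched as follows (the install location is hypothetical):

```matlab
% Option 1: build the folder structure and add it to the Matlab path
cd('C:\toolboxes\spider');      % hypothetical install location
use_spider;

% Option 2: fetch the path structure yourself and add it manually
dirs = spider_subdirs;          % cell array of path strings
for i = 1:numel(dirs)
    addpath(dirs{i});
end
savepath;                       % optional: persist the path across sessions
```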

# MySpider structure

*MySpider* implements the following algorithms:

- basic - a set of basic operators
- nne - nonparametric noise estimation, a tool that estimates the level of noise in regression problems
- resampple - a class that resamples the input dataset
- outlier - outlier elimination based on the IQR statistic
- mynormalize - a new, bug-free normalization class that improves on the original one
- myparam - a modified, much more powerful version of the param class
- statistics - a class for extracting data statistics
- link_data - allows for data concatenation
- dcv/pcv - distributed/parallel cross-validation test
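These operators follow Spider's common train/test interface; a minimal sketch, using only calls that also appear in the examples below:

```matlab
d = gen(toy);                      % generate a toy dataset
[r, nrm] = train(mynormalize, d);  % fit the normalization and apply it to d
r2 = test(nrm, d);                 % apply the already trained normalization to data
```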

- clust - data clustering
- myfcm - fuzzy c-means algorithm
- em - Expectation Maximization algorithm
- cfcm - conditional fuzzy c-means
- dendrogramclust - dendrogram clustering (requires the Statistics Toolbox)
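The clustering classes use the same train interface; a hedged sketch (the parameter cell syntax mirrors the one used with clust_sel in Example 3 below and may need adjusting):

```matlab
d = gen(toy);                        % toy dataset
[r, m] = train(myfcm({'k', 3}), d);  % fuzzy c-means with 3 clusters (assumed parameter syntax)
```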

- feat_sel - a set of feature selection and dimensionality reduction algorithms
- fs_infosel - feature selection algorithms based on the Infosel++ library (requires the Infosel++ library)
- fs_pca - dimensionality reduction by PCA
- fs_ranker - feature selection based on ranking
- fs_rw - feature selection based on ranking, where a classifier is used to rank features
- fs_srw - similar to fs_rw, but after ranking it uses the rank values to start the search process
- fs_sfs - feature selection based on best-first search (forward selection and backward elimination)
- fs_clust_pca - under construction
- fs_trans_ranker - under construction
- fs_featcor - under construction
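Feature selection classes can be chained before a classifier in the usual way; a minimal sketch, assuming fs_ranker behaves as a standard Spider preprocessing object:

```matlab
d = gen(toy);                 % toy dataset
m = chain({fs_ranker, knn});  % rank-based feature selection followed by kNN
c = cv(m);                    % wrap the chain in cross-validation
[r, m] = train(c, d);
get_mean(r)                   % mean loss across folds
```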

- fuzzy
- myanfis - ANFIS model (requires the Fuzzy Logic Toolbox)
- nfident - uses nfidnet for neuro-fuzzy function approximation

- pat - models for classification
- noptdl - Ordered Prototype Threshold Decision List, a classification algorithm that extracts prototype-threshold rules, based on gradient optimization of rule properties
- soptdl - Ordered Prototype Threshold Decision List, a classification algorithm that extracts prototype-threshold rules, based on search optimization of rule properties
- cv_committee - a committee of models obtained from a cross-validation test
- DistanceTree - a decision tree based on a distance matrix (requires the Statistics Toolbox)
- matlabTree - implements the decision tree from the Statistics Toolbox
- Gtree and cvGtree - a generalized tree, where the node function can be any decision model; cvGtree is used to prune the tree
- SimpleSplit - a simple node function for the decision tree
- rbf_net - RBF network
- mlp - MLP based on the Neural Networks Toolbox

- redser - SVM reduced set methods
- res_cfcm - reduced set method based on Conditional fuzzy C-means
- res_wlvq - reduced set method based on Weighted LVQ algorithm
- res_proto - reduced set method based on any instance selection method

- proto - a set of instance selection and optimization algorithms
- c_clust_sel - a generalized model for instance selection in classification problems based on conditional clustering
- cfcm_l_sel - CFCM-based instance selection for large datasets
- cfcm_sel - CFCM-based instance selection
- clust_sel - generalized instance selection based on any clustering algorithm
- cnn_sel - CNN algorithm
- ELGrow_sel - ELGrow algorithm
- elh - a class implementing the Encoding Length Heuristic
- ELH_sel - ELH algorithm
- enn_sel - ENN algorithm
- Explore - Explore algorithm
- ge_sel - Gabriel Editing algorithm
- IB3_sel - IB3 algorithm
- lvq_sel - a family of Learning Vector Quantization algorithms (LVQ1, LVQ2, LVQ3, LVQ2.1, OLVQ, WLVQ); some of these algorithms are implemented in C
- manual_sel - allows for manual selection of instances
- random_sel - random selection algorithm
- rank_sel - instance selection based on ranking coefficients known from feature selection
- rcnn_sel - a modification of the CNN algorithm
- reg_c_clust_sel - instance selection for regression based on conditional clustering
- reg_clust_sel - instance selection for regression based on clustering
- renn_sel - Repeated ENN algorithm
- rng_sel - Relative Neighbor Graph algorithm
- rnd_sel - a simple algorithm that selects the best subset of instances from several random runs
- simple_mean_sel - replaces all vectors of each class by their class mean
- proto_sel - the base class for all instance selection algorithms

# Examples

## Example 1

This is a first simple example of Spider and MySpider:

```matlab
d = gen(toy);
m = chain({cnn_sel, knn});
[r, m] = train(m, d);
r = test(m, d);
loss(r)
```

Line 1 generates the dataset; in line 2 a chain is constructed so that, when processing the data, the CNN algorithm is executed first, followed by kNN. Line 3 starts the training, line 4 applies the already trained model to the test set, and line 5 computes the resulting loss.

## Example 2

This example is similar to the previous one, except that cross-validation is used to estimate the accuracy of the system.

```matlab
d = gen(toy);
m = chain({mynormalize, enn_sel, cnn_sel, ge_sel, knn});
c = cv(m);
[r, m] = train(c, d);
get_mean(r)
```

Line 2 extends the chain by adding normalization and the ENN, CNN, and ge_sel algorithms. In line 3 this chain is plugged into cross-validation, and after training (line 4) the system calculates the mean and standard deviation (line 5).

## Example 3

This example shows how to optimize model parameters. Here the prototype initialization for the LVQ algorithm is optimized over a series of instance selection and clustering algorithms:

```matlab
d = gen(toy);
d = train(mynormalize, d);
m = chain({lvq_sel, knn});
p = myparam(m, {{'lvq_sel' 'proto' enn_sel cnn_sel clust_sel(myfcm, {'k', 3})}});
c = cv(p);
[r, m] = train(c, d);
get_mean(r)
```

In this example a model generator is used in line 4. This model generator generates new models (here the LVQ algorithm with different settings) and executes them. *myparam* is constructed from the name of the class whose parameter we want to set, a property of that class (*proto*), and then a list of possible values, here the ENN, CNN, and FCM-clustering algorithms.
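Following the same pattern, *myparam* can presumably also sweep a scalar hyperparameter; a hedged sketch (the property name 'k' for knn is an assumption, and the cell syntax mirrors the class/property/values triple used above):

```matlab
d = gen(toy);
p = myparam(knn, {{'knn' 'k' 1 3 5}});  % try k = 1, 3, 5 (assumed property name)
c = cv(p);
[r, m] = train(c, d);
get_mean(r)
```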

## Example 4

A simple process that compares the classification error of two algorithms:

```matlab
d = gen(toy);
d = train(mynormalize, d);
m = group({noptdl, soptdl});
c = cv(m);
[r, m] = train(c, d);
get_mean(r)
```

In this process, data normalization is performed in line 2; in line 3 a group of two algorithms is constructed; this group is plugged into a cross-validation test (line 4); the test is executed in line 5; and in line 6 we obtain the final results.