fuzzytrees package

Submodules

fuzzytrees.fdt_base module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

class fuzzytrees.fdt_base.BaseFuzzyDecisionTree(disable_fuzzy, X_fuzzy_dms, fuzzification_options, criterion_func, max_depth, min_samples_split, min_impurity_split, **kwargs)[source]

Bases: object

Base fuzzy decision tree class that encapsulates all base functions to be inherited by all derived classes (and attributes, if required).

Warning

This interface should not be used directly. Use derived algorithm classes instead.

Attention

See FuzzyDecisionTreeWrapper for descriptions of all parameters and attributes in this class.

fit(X_train, y_train)[source]
predict(X)[source]
predict_proba(X)[source]
print_tree(tree=None, indent='  ', delimiter='=>')[source]
class fuzzytrees.fdt_base.BinarySubtrees(subset_true_X=None, subset_true_y=None, subset_false_X=None, subset_false_y=None)[source]

Bases: object

A class that encapsulates the two subtrees under a node. Each subtree holds two subsets of the split samples: their feature values and their target values.

Parameters
  • subset_true_X (array-like of shape (n_samples, n_features)) – The subset of feature values of the samples that meet the split_rule after splitting.

  • subset_true_y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The subset of target values of the samples that meet the split_rule after splitting.

  • subset_false_X (array-like of shape (n_samples, n_features)) – The subset of feature values of the samples that do not meet the split_rule after splitting.

  • subset_false_y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – The subset of target values of the samples that do not meet the split_rule after splitting.
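As a minimal sketch of how the four subsets above could be produced, the following partitions samples by a single rule. The function name split_by_rule and the "greater than or equal to" comparison for numeric features are assumptions made for illustration, not the library's confirmed behaviour.

```python
from typing import Any, List, Sequence, Tuple


def split_by_rule(
    X: Sequence[Sequence[float]],
    y: Sequence[Any],
    feature_idx: int,
    split_value: float,
) -> Tuple[List, List, List, List]:
    """Partition samples into (true_X, true_y, false_X, false_y)."""
    true_X, true_y, false_X, false_y = [], [], [], []
    for row, target in zip(X, y):
        # Assumed rule: a sample "meets" the rule when its value is >= split_value.
        if row[feature_idx] >= split_value:
            true_X.append(list(row))
            true_y.append(target)
        else:
            false_X.append(list(row))
            false_y.append(target)
    return true_X, true_y, false_X, false_y


X = [[1.0, 5.0], [2.0, 3.0], [3.0, 1.0]]
y = ["a", "b", "b"]
subsets = split_by_rule(X, y, feature_idx=0, split_value=2.0)
```

The four returned lists correspond, in order, to subset_true_X, subset_true_y, subset_false_X and subset_false_y.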

class fuzzytrees.fdt_base.DecisionTreeInterface[source]

Bases: object

Interface for decision tree classes based on different algorithms.

Warning

This interface should not be used directly. Use derived algorithm classes instead.

Attention

The purpose of this interface is to establish protocols for functions (excluding constructors and attributes) in the classification and regression decision trees that are to be developed.

abstract fit(X_train, y_train)[source]
abstract predict(X)[source]
abstract predict_proba(X)[source]
abstract print_tree(tree=None, indent='  ', delimiter='=>')[source]
class fuzzytrees.fdt_base.FuzzificationOptions(r_seed=0, conv_size=1, conv_k=3, num_iter=1, feature_filter_func=None, feature_filter_func_param=None, dataset_df=None, dataset_mms_df=None, X_fuzzy_dms=None)[source]

Bases: object

A protocol message class that encapsulates all the options (excluding functions) of the fuzzification settings used by a fuzzy model.

class fuzzytrees.fdt_base.FuzzyDecisionTreeWrapper(fdt_class=None, disable_fuzzy=False, X_fuzzy_dms=None, fuzzification_options=None, criterion_func=None, max_depth=inf, min_samples_split=2, min_impurity_split=1e-07, **kwargs)[source]

Bases: fuzzytrees.fdt_base.DecisionTreeInterface

Wrapper class for different decision trees.

Attention

The role of this class is to unify the external calls of different decision tree classes and implement dependency injection for those decision tree classes.

The arguments of the constructors for different decision trees should belong to a subset of the following parameters.

Parameters
  • fdt_class (Class, default=None) – The fuzzy decision tree estimator specified.

  • disable_fuzzy (bool, default=False) – Set whether the specified fuzzy decision tree uses the fuzzification. If disable_fuzzy=True, the specified fuzzy decision tree is equivalent to a naive decision tree.

  • X_fuzzy_dms (array-like of shape (n_samples, n_features)) – Three-dimensional array in which each element along the first dimension is a two-dimensional array holding the fuzzy sets of the corresponding feature. Each two-dimensional array has shape (n_samples, n_fuzzy_sets) and holds the membership degrees of the feature values in the corresponding fuzzy sets.

  • fuzzification_options (FuzzificationOptions, default=None) – Protocol message class that encapsulates all the options of the fuzzification settings used by the specified fuzzy decision tree.

  • criterion_func ({"gini", "entropy"} for a classifier, {"mse", "mae"} for a regressor) – The criterion function used by the function that calculates the impurity gain of the target values.

  • max_depth (int, default=float("inf")) – The maximum depth of the tree.

  • min_samples_split (int, default=2) – The minimum number of samples required to split a node. If a node has a sample number above this threshold, it will be split, otherwise it becomes a leaf node.

  • min_impurity_split (float, default=1e-7) – The minimum impurity required to split a node. If a node’s impurity is above this threshold, it will be split, otherwise it becomes a leaf node.
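The three stopping parameters above can be read together as one predicate. The sketch below is an illustration only: the function name should_split is hypothetical, and the handling of the boundary cases (>= for the sample count, < for the depth) is an assumption rather than the library's confirmed behaviour.

```python
import math


def should_split(
    n_samples: int,
    impurity: float,
    depth: int,
    max_depth: float = math.inf,
    min_samples_split: int = 2,
    min_impurity_split: float = 1e-7,
) -> bool:
    """Return True if a node may be split, False if it should become a leaf."""
    return (
        depth < max_depth                     # assumed: stop once max_depth is reached
        and n_samples >= min_samples_split    # assumed: inclusive threshold
        and impurity > min_impurity_split     # split only if impurity exceeds the threshold
    )
```

With the defaults, a node holding a single sample or a pure node (impurity 0) becomes a leaf.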

root

The root node of a decision tree.

Type

Node

_impurity_gain_calculation_func

The function to calculate the impurity gain of the target values.

Type

function
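To make the idea of impurity gain concrete, here is a minimal sketch using Gini impurity and entropy. The helper names gini, entropy and impurity_gain are hypothetical and this is not the library's own implementation; it only shows the usual definition, parent impurity minus the size-weighted impurity of the two children.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def gini(y: Sequence) -> float:
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(y)
    return 1.0 - sum((c / n) ** 2 for c in Counter(y).values())


def entropy(y: Sequence) -> float:
    """Shannon entropy in bits."""
    n = len(y)
    return -sum((c / n) * math.log2(c / n) for c in Counter(y).values())


def impurity_gain(
    y: Sequence,
    y_true: Sequence,
    y_false: Sequence,
    criterion: Callable[[Sequence], float] = gini,
) -> float:
    """Parent impurity minus the weighted impurity of the two child subsets."""
    n = len(y)
    weighted = (len(y_true) / n) * criterion(y_true) + (len(y_false) / n) * criterion(y_false)
    return criterion(y) - weighted


y = ["a", "a", "b", "b"]
gini_gain = impurity_gain(y, ["a", "a"], ["b", "b"], criterion=gini)
entropy_gain = impurity_gain(y, ["a", "a"], ["b", "b"], criterion=entropy)
```

A split that separates the two classes perfectly recovers the full parent impurity as gain.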

_leaf_value_calculation_func

The function to calculate the predicted value if the current node is a leaf: In a classification tree, it gives the target value with the highest probability. In a regression tree, it gives the average of all the target values.

Type

function
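The two leaf-value rules described above can be sketched as follows. The names majority_vote and mean_value are hypothetical; tie-breaking in the classification case follows collections.Counter and is an assumption.

```python
from collections import Counter
from statistics import mean
from typing import Any, Sequence


def majority_vote(y: Sequence[Any]) -> Any:
    """Classification leaf: the most frequent target value."""
    return Counter(y).most_common(1)[0][0]


def mean_value(y: Sequence[float]) -> float:
    """Regression leaf: the average of the target values."""
    return mean(y)
```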

_is_one_dim

The Boolean value that indicates whether the y is a multi-dimensional set, which means whether y is one-hot encoded.

Type

bool

_best_split_rule

The split rule including the index of the best feature to be used, and the best value in the best feature.

Type

SplitRule

_best_binary_subtrees

The binary subtrees under a node, where each subtree is a subset of the samples that have been split. It is one of the attributes of a node (including the root node) in a decision tree.

Type

BinarySubtrees

_best_impurity_gain

The best impurity gain calculated based on the current split subtrees during a tree building process.

Type

float

_fuzzy_sets

All the coefficients of the degree of membership sets based on the current estimator. They will be used to calculate the degree of membership of the features of new samples before predicting those samples. Therefore, their life cycle is consistent with that of the current estimator. They are generated in the feature fuzzification before training the current estimator. NB: To be used in version 1.0.

Type

array-like of shape (n_features, n_coefficients)

fit(X_train, y_train)[source]

Train a decision tree estimator from the training set (X_train, y_train).

Parameters
  • X_train (array-like of shape (n_samples, n_features)) – Training instances.

  • y_train (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target values (class labels) as integers or strings.

plot_fuzzy_reg_vs_err(filename=None)[source]

Plot the fuzzy regulation coefficient versus training error and test error for each number of fuzzy clusters.

Illustrate how the performance on unseen data (test data) is different from the performance on training data.

Parameters

filename (str, default None) – Fetch the data from the specified file if filename is not None. Otherwise try from memory and the latest file in the default directory in turn.

predict(X)[source]

Predict the target values of the input samples X.

In classification, a predicted target value is the one with the largest number of samples of the same class in a leaf.

In regression, the predicted target value is the mean of the target values in a leaf.

Parameters

X (array-like of shape (n_samples, n_features)) – Input instances to be predicted.

Returns

pred_y – The target values of the input instances.

Return type

list of n_outputs such arrays if n_outputs > 1

predict_proba(X)[source]

Predict the probabilities of the target values of the input samples X.

Parameters

X (array-like of shape (n_samples, n_features)) – Input instances to be predicted.

Returns

pred_y – The probabilities of the target values of the input instances.

Return type

list of n_outputs such arrays if n_outputs > 1

print_tree(tree=None, indent='  ', delimiter='-->')[source]

Recursively (in a top-to-bottom approach) print the built decision tree.

Parameters
  • tree (Node) – The root node of a decision tree.

  • indent (str) – The indentation symbol used when printing subtrees.

  • delimiter (str) – The delimiter between split rules and results.
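A top-to-bottom recursive printout of this kind might look like the sketch below. PrintableNode and format_tree are hypothetical stand-ins, and the exact line layout is an assumption; only the roles of indent and delimiter follow the parameter descriptions above.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class PrintableNode:
    """Hypothetical stand-in for a tree node."""
    feature_idx: Optional[int] = None
    split_value: Optional[float] = None
    leaf_value: Any = None
    branch_true: Optional["PrintableNode"] = None
    branch_false: Optional["PrintableNode"] = None


def format_tree(node: PrintableNode, indent: str = "  ",
                delimiter: str = "-->", depth: int = 0) -> List[str]:
    """Return the printed lines of the tree, visiting the root first."""
    pad = indent * depth
    if node.leaf_value is not None:
        return [f"{pad}{node.leaf_value}"]
    lines = [f"{pad}feature {node.feature_idx} >= {node.split_value}?"]
    lines.append(f"{pad}{indent}True{delimiter}")
    lines += format_tree(node.branch_true, indent, delimiter, depth + 2)
    lines.append(f"{pad}{indent}False{delimiter}")
    lines += format_tree(node.branch_false, indent, delimiter, depth + 2)
    return lines


tree = PrintableNode(0, 2.0,
                     branch_true=PrintableNode(leaf_value="b"),
                     branch_false=PrintableNode(leaf_value="a"))
print("\n".join(format_tree(tree)))
```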

search_fuzzy_params_4_clf(ds_name_list, conv_k_lim, fuzzy_reg_lim)[source]

Search for fuzzy parameters to evaluate and choose from by fitting several groups of FDT classifiers on the specified datasets in parallel (multi-process, master-worker mode).

The fuzzy feature extraction before pretraining is based on specified fuzzy regulation coefficients and a number of fuzzy clusters that each feature belongs to.

Attention

Use this function to prepare evaluation and plotting data when you need to evaluate the effect of different degrees of fuzzification on model training in advance.

Parameters
  • ds_name_list (array-like) –

  • fuzzy_reg_lim (tuple, (start, stop, step)) –

  • conv_k_lim (tuple, (start, stop, step)) –

class fuzzytrees.fdt_base.MultiProcessOptions(n_cpu_cores_req=None, allow_growth=False)[source]

Bases: object

A protocol message class that encapsulates all the options (excluding functions) of the multi-process settings.

Parameters
  • n_cpu_cores_req (int, default=None) – The number of CPU cores to request. If left to None this is automatically set to the number of all CPU cores available.

  • allow_growth (bool, default=False) – Whether to dynamically request more CPU resources.
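The default for n_cpu_cores_req can be pictured with the small sketch below. The function name resolve_n_processes is hypothetical, and clamping the request to the range from 1 to the number of available cores is an assumption, not documented behaviour.

```python
import os
from typing import Optional


def resolve_n_processes(n_cpu_cores_req: Optional[int] = None) -> int:
    """Map the requested core count to a usable number of worker processes."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    if n_cpu_cores_req is None:
        return available             # None means "use all available cores"
    # Assumed: never go below 1 or above what the machine offers.
    return max(1, min(n_cpu_cores_req, available))
```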

class fuzzytrees.fdt_base.Node(split_rule=None, leaf_value=None, leaf_proba=None, branch_true=None, branch_false=None)[source]

Bases: object

A class that encapsulates the data of a node (a root, internal, or leaf node) in a decision tree.

Parameters
  • split_rule (SplitRule, default=None) – The split rule represented by the feature selected as a node, and branching decisions are made based on this rule.

  • leaf_value (float, default=None) – The predicted value indicated at a leaf node. In the classification tree it is the predicted class, and in the regression tree it is the predicted value. NB: Only a leaf node has this attribute value.

  • leaf_proba (float, default=None) – The predicted probability indicated at a leaf node. Only works in the classification tree. NB: Only a leaf node has this attribute value.

  • branch_true (Node, default=None) – The next node in the decision path when the feature value of a sample meets the split rule split_rule.

  • branch_false (Node, default=None) – The next node in the decision path when the feature value of a sample does not meet the split rule split_rule.

class fuzzytrees.fdt_base.SplitRule(feature_idx=None, split_value=None)[source]

Bases: object

A class that encapsulates the data of a split rule, which is one of the attributes of a node (including the root node) in a decision tree.

Parameters
  • feature_idx (int, default=None) – The index of the feature selected as the node representing a split rule.

  • split_value (float, default=None) – The value of the feature indexed by feature_idx that defines the split rule and on which branching decisions are based.
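The way a node and its split rule drive the decision path can be sketched as a simple traversal. SketchSplitRule, SketchNode and predict_one are hypothetical names that only mirror the parameters documented above, and the >= comparison is an assumption for numeric features.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class SketchSplitRule:
    feature_idx: int
    split_value: float


@dataclass
class SketchNode:
    split_rule: Optional[SketchSplitRule] = None
    leaf_value: Any = None
    branch_true: Optional["SketchNode"] = None
    branch_false: Optional["SketchNode"] = None


def predict_one(node: SketchNode, x: Sequence[float]) -> Any:
    """Follow branch_true / branch_false until a leaf is reached."""
    while node.leaf_value is None:
        rule = node.split_rule
        meets_rule = x[rule.feature_idx] >= rule.split_value  # assumed comparison
        node = node.branch_true if meets_rule else node.branch_false
    return node.leaf_value


root = SketchNode(
    split_rule=SketchSplitRule(feature_idx=0, split_value=2.0),
    branch_true=SketchNode(leaf_value="b"),
    branch_false=SketchNode(leaf_value="a"),
)
```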

fuzzytrees.fdts module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

class fuzzytrees.fdts.FuzzyC45Classifier(disable_fuzzy=False, X_fuzzy_dms=None, fuzzification_options=None, criterion_func=<function calculate_entropy>, max_depth=inf, min_samples_split=2, min_impurity_split=1e-07, **kwargs)[source]

Bases: fuzzytrees.fdt_base.BaseFuzzyDecisionTree, fuzzytrees.fdt_base.DecisionTreeInterface

A fuzzy C4.5 decision tree classifier.

The C4.5 algorithm can handle both continuous/numerical and discrete/categorical variables, but can only be used for classification.

Attention

See FuzzyDecisionTreeWrapper for descriptions of all parameters and attributes in this class.

class fuzzytrees.fdts.FuzzyCARTClassifier(disable_fuzzy=False, X_fuzzy_dms=None, fuzzification_options=None, criterion_func=<function calculate_gini>, max_depth=inf, min_samples_split=2, min_impurity_split=1e-07, **kwargs)[source]

Bases: fuzzytrees.fdt_base.BaseFuzzyDecisionTree, fuzzytrees.fdt_base.DecisionTreeInterface

A fuzzy CART decision tree classifier.

The CART algorithm can handle both continuous/numerical and discrete/categorical variables, and can be used for both classification and regression.

Attention

See FuzzyDecisionTreeWrapper for descriptions of all parameters and attributes in this class.

class fuzzytrees.fdts.FuzzyCARTRegressor(disable_fuzzy=False, X_fuzzy_dms=None, fuzzification_options=None, criterion_func=<function calculate_variance>, max_depth=inf, min_samples_split=2, min_impurity_split=1e-07, **kwargs)[source]

Bases: fuzzytrees.fdt_base.BaseFuzzyDecisionTree, fuzzytrees.fdt_base.DecisionTreeInterface

A fuzzy CART decision tree regressor.

The CART algorithm can handle both continuous/numerical and discrete/categorical variables, and can be used for both classification and regression.

Attention

See FuzzyDecisionTreeWrapper for descriptions of all parameters and attributes in this class.

class fuzzytrees.fdts.FuzzyID3Classifier(disable_fuzzy=False, X_fuzzy_dms=None, fuzzification_options=None, criterion_func=<function calculate_entropy>, max_depth=inf, min_samples_split=2, min_impurity_split=1e-07, **kwargs)[source]

Bases: fuzzytrees.fdt_base.BaseFuzzyDecisionTree, fuzzytrees.fdt_base.DecisionTreeInterface

A fuzzy ID3 decision tree classifier.

The ID3 algorithm can only handle discrete/categorical variables and can only be used for classification.

Attention

See FuzzyDecisionTreeWrapper for descriptions of all parameters and attributes in this class.

fuzzytrees.fgbdt module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

class fuzzytrees.fgbdt.FuzzyGBDT(disable_fuzzy, X_fuzzy_dms, fuzzification_options, criterion_func, learning_rate, n_estimators, validation_fraction, n_iter_no_change, max_depth, min_samples_split, min_impurity_split, is_regression)[source]

Bases: object

Base fuzzy gradient boosting decision tree (GBDT) class that encapsulates all base functions to be inherited by all derived classes (and attributes, if required).

Warning

This class should not be used directly. Use derived classes instead.

Parameters
  • disable_fuzzy (bool, default=False) – Set whether the specified fuzzy decision tree uses the fuzzification. If disable_fuzzy=True, the specified fuzzy decision tree is equivalent to a naive decision tree.

  • fuzzification_options (FuzzificationOptions, default=None) – Protocol message class that encapsulates all the options of the fuzzification settings used by the specified fuzzy decision tree.

  • criterion_func ({"mse", "mae"}, default="mse") – The criterion function used by the function that calculates the impurity gain of the target values. NB: Only use a criterion function for decision tree regressor.

  • learning_rate (float, default=0.1) – The step size applied to each tree's contribution when following the negative gradient of the loss during training. It shrinks the contribution of each tree. NB: There is a trade-off between learning_rate and n_estimators.

  • n_estimators (int, default=100) – The number of fuzzy decision trees to be used.

  • validation_fraction (float, default=0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.

  • n_iter_no_change (int, default=None) – n_iter_no_change is used to decide if early stopping will be used to terminate training when validation score is not improving. By default it is set to None to disable early stopping. If set to a number, it will set aside validation_fraction size of the training data as validation and terminate training when validation score is not improving in all of the previous n_iter_no_change numbers of iterations. The split is stratified.

  • max_depth (int, default=3) – The maximum depth of the tree to be trained.

  • min_samples_split (int, default=2) – The minimum number of samples required to split a node. If a node has a sample number above this threshold, it will be split, otherwise it becomes a leaf node.

  • min_impurity_split (float, default=1e-7) – The minimum impurity required to split a node. If a node’s impurity is above this threshold, it will be split, otherwise it becomes a leaf node.

  • is_regression (bool, default=True) – True for regression, False for classification.
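To illustrate how learning_rate and n_estimators interact, here is a minimal gradient boosting sketch for squared loss on a single feature. It uses plain (non-fuzzy) depth-1 regression trees as base learners, which is a simplification chosen for brevity; fit_stump and fit_gbdt are hypothetical names and none of this is the library's implementation.

```python
from statistics import mean
from typing import Callable, List, Sequence


def fit_stump(x: Sequence[float], residuals: Sequence[float]) -> Callable[[float], float]:
    """Fit a depth-1 regression tree on one feature by minimising squared error."""
    best = None
    # Skip the smallest value so that the left side is never empty.
    for threshold in sorted(set(x))[1:]:
        left = [r for xi, r in zip(x, residuals) if xi < threshold]
        right = [r for xi, r in zip(x, residuals) if xi >= threshold]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda xi: rm if xi >= threshold else lm


def fit_gbdt(x: Sequence[float], y: Sequence[float],
             n_estimators: int = 100, learning_rate: float = 0.1) -> Callable[[float], float]:
    """Boost stumps on the residuals, i.e. the negative gradient of squared loss."""
    base = mean(y)
    stumps: List[Callable[[float], float]] = []
    preds = [base] * len(y)
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Each tree contributes only a fraction (learning_rate) of its prediction.
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


model = fit_gbdt([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 3.0, 3.0])
```

A smaller learning_rate makes each tree correct less of the remaining error, so more estimators are needed to reach the same fit.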

_loss_func

The concrete object of the class LossFunction’s derived classes.

Type

LossFunction

_estimators

The collection of sub-estimators as base learners.

Type

ndarray of FuzzyDecisionTreeRegressor

fit(X_train, y_train)[source]

Fit the fuzzy gradient boosting model.

Parameters
  • X_train (array-like of shape (n_samples, n_features)) – Training instances.

  • y_train (array-like of shape (n_samples,)) – Target values (strings or integers in classification, real numbers in regression).

predict(X)[source]

Predict class for X.

Parameters

X (array-like of shape (n_samples, n_features)) – The input samples.

Returns

y_pred – The predicted values.

Return type

ndarray of shape (n_samples,)

class fuzzytrees.fgbdt.FuzzyGBDTClassifier(disable_fuzzy=False, X_fuzzy_dms=None, fuzzification_options=None, criterion_func=<function calculate_variance>, learning_rate=0.1, n_estimators=100, validation_fraction=0.1, n_iter_no_change=None, max_depth=3, min_samples_split=2, min_impurity_split=1e-07)[source]

Bases: fuzzytrees.fgbdt.FuzzyGBDT

Fuzzy gradient boosting decision tree classifier.

Parameters
  • disable_fuzzy (bool, default=False) – Set whether the specified fuzzy decision tree uses the fuzzification. If disable_fuzzy=True, the specified fuzzy decision tree is equivalent to a naive decision tree.

  • fuzzification_options (FuzzificationOptions, default=None) – Protocol message class that encapsulates all the options of the fuzzification settings used by the specified fuzzy decision tree.

  • criterion_func ({"mse", "mae"}, default="mse") – The criterion function used by the function that calculates the impurity gain of the target values. NB: Only use a criterion function for decision tree regressor.

  • learning_rate (float, default=0.1) – The step size applied to each tree's contribution when following the negative gradient of the loss during training. It shrinks the contribution of each tree. NB: There is a trade-off between learning_rate and n_estimators.

  • n_estimators (int, default=100) – The number of fuzzy decision trees to be used.

  • validation_fraction (float, default=0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.

  • n_iter_no_change (int, default=None) – n_iter_no_change is used to decide if early stopping will be used to terminate training when validation score is not improving. By default it is set to None to disable early stopping. If set to a number, it will set aside validation_fraction size of the training data as validation and terminate training when validation score is not improving in all of the previous n_iter_no_change numbers of iterations. The split is stratified.

  • max_depth (int, default=3) – The maximum depth of the tree to be trained.

  • min_samples_split (int, default=2) – The minimum number of samples required to split a node. If a node has a sample number above this threshold, it will be split, otherwise it becomes a leaf node.

  • min_impurity_split (float, default=1e-7) – The minimum impurity required to split a node. If a node’s impurity is above this threshold, it will be split, otherwise it becomes a leaf node.
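The early-stopping rule controlled by n_iter_no_change can be sketched as a check over a sequence of validation losses. The function name early_stopping_iteration and the tol parameter are hypothetical, and treating "not improving" as "no new best loss" is an assumption about the exact criterion.

```python
from typing import Optional, Sequence


def early_stopping_iteration(
    val_losses: Sequence[float],
    n_iter_no_change: Optional[int] = None,
    tol: float = 0.0,
) -> int:
    """Return the number of boosting iterations that would actually be run."""
    if n_iter_no_change is None:
        return len(val_losses)          # early stopping disabled
    best = float("inf")
    stale = 0                           # consecutive iterations without improvement
    for i, loss in enumerate(val_losses, start=1):
        if loss < best - tol:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= n_iter_no_change:
                return i                # stop here
    return len(val_losses)


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5]
stopped_at = early_stopping_iteration(losses, n_iter_no_change=2)
```

In this example the later improvement to 0.5 is never reached, which shows the trade-off of a small n_iter_no_change.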

_loss_func

The concrete object of the class LossFunction’s derived classes.

Type

LossFunction

_estimators

The collection of fitted sub-estimators.

Type

ndarray of FuzzyDecisionTreeRegressor

fit(X_train, y_train)[source]

Fit the fuzzy gradient boosting model.

Parameters
  • X_train (array-like of shape (n_samples, n_features)) – Training instances.

  • y_train (array-like of shape (n_samples,)) – Target values (strings or integers in classification, real numbers in regression).

class fuzzytrees.fgbdt.FuzzyGBDTRegressor(disable_fuzzy=False, X_fuzzy_dms=None, fuzzification_options=None, criterion_func=<function calculate_variance>, learning_rate=0.1, n_estimators=100, validation_fraction=0.1, n_iter_no_change=None, max_depth=3, min_samples_split=2, min_impurity_split=1e-07)[source]

Bases: fuzzytrees.fgbdt.FuzzyGBDT

Fuzzy gradient boosting decision tree regressor.

Parameters
  • disable_fuzzy (bool, default=False) – Set whether the specified fuzzy decision tree uses the fuzzification. If disable_fuzzy=True, the specified fuzzy decision tree is equivalent to a naive decision tree.

  • fuzzification_options (FuzzificationOptions, default=None) – Protocol message class that encapsulates all the options of the fuzzification settings used by the specified fuzzy decision tree.

  • criterion_func ({"mse", "mae"}, default="mse") – The criterion function used by the function that calculates the impurity gain of the target values. NB: Only use a criterion function for decision tree regressor.

  • learning_rate (float, default=0.1) – The step size applied to each tree's contribution when following the negative gradient of the loss during training. It shrinks the contribution of each tree. NB: There is a trade-off between learning_rate and n_estimators.

  • n_estimators (int, default=100) – The number of fuzzy decision trees to be used.

  • validation_fraction (float, default=0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.

  • n_iter_no_change (int, default=None) – n_iter_no_change is used to decide if early stopping will be used to terminate training when validation score is not improving. By default it is set to None to disable early stopping. If set to a number, it will set aside validation_fraction size of the training data as validation and terminate training when validation score is not improving in all of the previous n_iter_no_change numbers of iterations. The split is stratified.

  • max_depth (int, default=3) – The maximum depth of the tree to be trained.

  • min_samples_split (int, default=2) – The minimum number of samples required to split a node. If a node has a sample number above this threshold, it will be split, otherwise it becomes a leaf node.

  • min_impurity_split (float, default=1e-7) – The minimum impurity required to split a node. If a node’s impurity is above this threshold, it will be split, otherwise it becomes a leaf node.

_loss_func

The concrete object of the class LossFunction’s derived classes.

Type

LossFunction

_estimators

The collection of fitted sub-estimators.

Type

ndarray of FuzzyDecisionTreeRegressor

fuzzytrees.frdf module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

class fuzzytrees.frdf.BaseFuzzyRDF(disable_fuzzy, fuzzification_options, criterion_func, n_estimators, max_depth, min_samples_split, min_impurity_split, max_features, multi_process_options)[source]

Bases: object

Base fuzzy random decision forests (RF) class that encapsulates all base functions to be inherited by all derived classes (and attributes, if required). This algorithm is a fuzzy extension of the random decision forests proposed by Tin Kam Ho [1].

Warning

This class should not be used directly. Use derived classes instead.

Attention

Sharing data between processes may not be the best option because of hard-to-predict synchronisation issues; use Pipe or Queue to communicate between processes whenever possible. See also the Python documentation:

As mentioned above, when doing concurrent programming it is usually best to avoid using shared state as far as possible. This is particularly true when using multiple processes. However, if you really do need to use some shared data then multiprocessing provides a couple of ways of doing so.

Parameters
  • disable_fuzzy (bool, default=False) – Set whether the specified fuzzy decision tree uses the fuzzification. If disable_fuzzy=True, the specified fuzzy decision tree is equivalent to a naive decision tree.

  • fuzzification_options (FuzzificationOptions, default=None) – Protocol message class that encapsulates all the options of the fuzzification settings used by the specified fuzzy decision tree.

  • criterion_func ({"gini", "entropy"}, default="gini", for classification; {"mse", "mae"}, default="mse", for regression) – The criterion function used by the function that calculates the impurity gain of the target values, in both classification and regression.

  • n_estimators (int, default=100) – The number of fuzzy decision trees to be used.

  • max_depth (int, default=3) – The maximum depth of the tree to be trained.

  • min_samples_split (int, default=2) – The minimum number of samples required to split a node. If a node has a sample number above this threshold, it will be split, otherwise it becomes a leaf node.

  • min_impurity_split (float, default=1e-7) – The minimum impurity required to split a node. If a node’s impurity is above this threshold, it will be split, otherwise it becomes a leaf node.

  • max_features (int, default=None) – The maximum number of features selected from the training dataset when training each fuzzy decision tree.

  • multi_process_options (MultiProcessOptions, default=None) – Protocol message class that encapsulates all the options of the multi-process settings. If it is left as None, adopt non-parallel computing mode.

_estimators

The collection of sub-estimators as base learners.

Type

ndarray of FuzzyDecisionTreeClassification

_res_func

In classification, get the final result from the classes given by the forest by majority voting method. In regression, calculate the average of the predicted values given by the forest as the final result.

Type

function, default=None

_n_processes

Number of CPU cores requested in parallel computing mode.

Type

int, default=None

Notes

About RF: The first algorithm for random decision forests was created by Tin Kam Ho [1] using the random subspace method [2], which, in Ho’s formulation, is a way to implement the “stochastic discrimination” approach to classification proposed by Eugene Kleinberg.

An extension of the algorithm was developed by Leo Breiman [4] and Adele Cutler [5]. The extension combines Breiman’s “bagging” idea and random selection of features, introduced first by Ho [1] and later independently by Amit and Geman [3], in order to construct a collection of decision trees with controlled variance.

The randomness of RF is reflected in two aspects:

1. RF uses bootstrap sampling to randomly select n samples, with replacement, from the original dataset to train each tree as a base learner, where n is the sample size of the original dataset.

NB: Each training dataset has the same size as the original dataset, but bootstrap sampling may produce duplicate elements within one training dataset or across different training datasets.

2. During the construction of each tree, RF also randomly selects m features of the training dataset, and then, each time a tree node is split, searches those randomly selected features for the best feature and splitting point.

Different RFs have different random feature selection methods. For example, Tin Kam Ho’s RF adopts tree-level random feature selection, i.e. the RF randomly selects m features of the training dataset for subsequently splitting all tree nodes. By contrast, Leo Breiman’s RF adopts node-level random feature selection, i.e. the RF randomly selects m features of the training dataset when splitting a node every time.
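The two sources of randomness described above can be sketched with the standard library alone. The function names bootstrap_indices and random_feature_subset are hypothetical, and the feature subset is drawn once per tree, which corresponds to the tree-level selection attributed to Ho above.

```python
import random
from typing import List


def bootstrap_indices(n_samples: int, rng: random.Random) -> List[int]:
    """Draw n_samples row indices with replacement (duplicates are possible)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features: int, m: int, rng: random.Random) -> List[int]:
    """Draw m distinct feature indices without replacement."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)          # fixed seed only to make the sketch repeatable
rows = bootstrap_indices(6, rng)
cols = random_feature_subset(4, 2, rng)
```

For node-level selection, random_feature_subset would instead be called again at every node split.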

Let M be the total number of features of data and m be the number of selected features. Generally, the value can be tried from the following usual practices:

  • For classification problems, \(m = 1 / 3 * M\);

  • For regression problems, \(m = \log_{2} (M + 1)\);

  • By default, \(m = \sqrt{M}\).
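As a worked example of the three rules of thumb listed above, the sketch below evaluates m for a given M. The function name n_selected_features, the task labels, and the rounding and clamping to the range 1..M are assumptions added for illustration.

```python
import math
from typing import Optional


def n_selected_features(M: int, task: Optional[str] = None) -> int:
    """Evaluate the rule of thumb for m given the total feature count M."""
    if task == "classification":
        m = M / 3
    elif task == "regression":
        m = math.log2(M + 1)
    else:
        m = math.sqrt(M)            # default rule
    # Assumed: round to the nearest integer and keep the result within 1..M.
    return max(1, min(M, round(m)))
```

For M = 64 features, these rules give 21, 6 and 8 features for the classification, regression and default cases respectively.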

References

[1] Ho, T.K., 1995, August. Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition (Vol. 1, pp. 278-282). IEEE.

[2] Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE transactions on pattern analysis and machine intelligence, 20(8), pp.832-844.

[3] Amit, Y. and Geman, D., 1997. Shape quantization and recognition with randomized trees. Neural computation, 9(7), pp.1545-1588.

[4] Breiman, L., 2001. Random forests. Machine learning, 45(1), pp.5-32.

[5] RColorBrewer, S. and Liaw, M.A., 2018. Package ‘randomForest’. University of California, Berkeley: Berkeley, CA, USA.

fit(X_train, y_train)[source]

Fit the fuzzy random decision forest model (in multi-process mode).

Process consists of:

St.1. Randomly resample instances through bootstrap sampling with replacement (WR).

St.2. Randomly select features.

St.3. Construct trees.

St.4. Majority vote (classification) or simple average (regression) to prevent overfitting and reduce variance.
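The final aggregation step (St.4) can be sketched as below, assuming the per-tree predictions are already available. The function name aggregate and the input layout are hypothetical; ties in the vote are broken by collections.Counter, which is an assumption.

```python
from collections import Counter
from statistics import mean
from typing import Any, List, Sequence


def aggregate(per_tree_preds: Sequence[Sequence[Any]], is_regression: bool = False) -> List[Any]:
    """Combine predictions, where per_tree_preds[t][i] is tree t's output for sample i."""
    per_sample = zip(*per_tree_preds)   # regroup so each tuple holds one sample's votes
    if is_regression:
        return [mean(votes) for votes in per_sample]
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]
```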

Parameters
  • X_train (array-like of shape (n_samples, n_features)) – Training instances.

  • y_train (array-like of shape (n_samples,)) – Target values (non-negative integers in classification, real numbers in regression). NB: The input array needs to be of integer dtype, otherwise a TypeError is raised.

predict(X)[source]

Predict results for X.

Parameters

X (array-like of shape (n_samples, n_features)) – Input instances to be predicted.

Returns

y_pred – The predicted values.

Return type

ndarray of shape (n_samples,)

Notes

When to use multiple processes?

Divide a prediction calculation into subunits and run them in multi-process mode, making sure that each subunit is sufficiently complex. Otherwise, when the complexity of each subunit falls below a certain threshold, depending on the hardware and software environment, the time consumed by CPU scheduling will be greater than the time saved by multiple processes. As an example, here is a comparison of the elapsed times for multi-process predict and non-multi-process predict on the dataset provided by sklearn.datasets.load_digits().

100 fuzzy trees with other parameters by default:

  • Time elapsed to load data: 0.049551s;

  • Time elapsed to preprocess fuzzification: 1.6184s;

  • Time elapsed to partition data: 0.0041816s;

  • Time elapsed to train a fuzzy classifier: 5.7213s;

  • Time elapsed to predict by the fuzzy classifier:

      • (Multi-process predict) 8.3837s;

      • (Non-multi-process predict) 0.22926s.

1,000 fuzzy trees with other parameters by default:

  • Time elapsed to load data: 0.049702s;

  • Time elapsed to preprocess fuzzification: 1.6178s;

  • Time elapsed to partition data: 0.0043864s;

  • Time elapsed to train a fuzzy classifier: 52.689s;

  • Time elapsed to predict by the fuzzy classifier:

      • (Multi-process predict) 817.67s;

      • (Non-multi-process predict) 2.2753s.

10,000 fuzzy trees with other parameters by default:

  • Time elapsed to load data: 0.050004s;

  • Time elapsed to preprocess fuzzification: 1.6821s;

  • Time elapsed to partition data: 0.0041745s;

  • Time elapsed to train a fuzzy classifier: 515.57s;

  • Time elapsed to predict by the fuzzy classifier:

      • (Multi-process predict) unknown (probably greater than 100 * 817.67s);

      • (Non-multi-process predict) 23.225s.

As shown in the above experimental results, in multi-process mode, \(\mathrm{WallTime}_{\mathrm{curr}} / \mathrm{WallTime}_{\mathrm{prev}} \approx (\mathrm{NumberEstimators}_{\mathrm{curr}} / \mathrm{NumberEstimators}_{\mathrm{prev}})^2\), while in non-multi-process mode, \(\mathrm{WallTime}_{\mathrm{curr}} / \mathrm{WallTime}_{\mathrm{prev}} \approx \mathrm{NumberEstimators}_{\mathrm{curr}} / \mathrm{NumberEstimators}_{\mathrm{prev}}\). Therefore, predict is not complex enough to be a subunit of multi-process computation, and using multi-process mode on it is usually not the best choice.
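The scaling claim can be checked directly against the timings listed above. The short script below only re-uses those reported numbers; the variable names are illustrative.

```python
# Reported predict times in seconds, keyed by the number of fuzzy trees.
multi_process = {100: 8.3837, 1000: 817.67}
single_process = {100: 0.22926, 1000: 2.2753, 10000: 23.225}

estimator_ratio = 1000 / 100  # each experiment uses 10 times more trees

# Multi-process mode: the time ratio should be close to estimator_ratio ** 2.
multi_ratio = multi_process[1000] / multi_process[100]

# Non-multi-process mode: the time ratio should be close to estimator_ratio.
single_ratio_a = single_process[1000] / single_process[100]
single_ratio_b = single_process[10000] / single_process[1000]

print(f"multi-process ratio:  {multi_ratio:.1f} (quadratic would be {estimator_ratio ** 2:.0f})")
print(f"single-process ratios: {single_ratio_a:.2f}, {single_ratio_b:.2f} (linear would be {estimator_ratio:.0f})")
```

The multi-process ratio comes out near 100 and the non-multi-process ratios near 10, which matches the quadratic and linear relations stated above.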

class fuzzytrees.frdf.BaseFuzzyRDFClassifier(disable_fuzzy, fuzzification_options, criterion_func, n_estimators=100, max_depth=3, min_samples_split=2, min_impurity_split=1e-07, max_features=None, multi_process_options=None)[source]

Bases: fuzzytrees.frdf.BaseFuzzyRDF

Fuzzy random decision forests classifier.

Attention

For classification tasks, the class that is the mode of the classes of the individual trees is returned.

See derived classes for descriptions of all parameters and attributes in this class.

Parameters
  • disable_fuzzy (bool, default=False) – Set whether the specified fuzzy decision tree uses the fuzzification. If disable_fuzzy=True, the specified fuzzy decision tree is equivalent to a naive decision tree.

  • fuzzification_options (FuzzificationOptions, default=None) – Protocol message class that encapsulates all the options of the fuzzification settings used by the specified fuzzy decision tree.

  • criterion_func ({"gini", "entropy"}, default="gini") – The criterion function used by the function that calculates the impurity gain of the target values. NB: Only use a criterion function for decision tree classifier.

  • n_estimators (int, default=100) – The number of fuzzy decision trees to be used.

  • max_depth (int, default=3) – The maximum depth of the tree to be trained.

  • min_samples_split (int, default=2) – The minimum number of samples required to split a node. If the number of samples at a node is above this threshold, the node is split; otherwise it becomes a leaf node.

  • min_impurity_split (float, default=1e-7) – The minimum impurity required to split a node. If a node’s impurity is above this threshold, the node is split; otherwise it becomes a leaf node.

  • max_features (int, default=None) – The maximum number of features from the training dataset considered when training each fuzzy decision tree.

  • multi_process_options (MultiProcessOptions, default=None) – Protocol message class that encapsulates all the options of the multi-process settings. If left as None, non-parallel computing mode is used.

_estimators

The collection of sub-estimators as base learners.

Type

ndarray of FuzzyDecisionTreeClassification

_res_func

In classification, get the final result from the classes given by the forest by majority voting method. In regression, calculate the average of the predicted values given by the forest as the final result.

Type

function, default=None

class fuzzytrees.frdf.BaseFuzzyRDFRegressor(disable_fuzzy, fuzzification_options, criterion_func, n_estimators=100, max_depth=3, min_samples_split=2, min_impurity_split=1e-07, max_features=None, multi_process_options=None)[source]

Bases: fuzzytrees.frdf.BaseFuzzyRDF

Fuzzy random decision forests regressor.

Attention

For regression tasks, the mean or average prediction of the individual trees is returned.

Parameters
  • disable_fuzzy (bool, default=False) – Set whether the specified fuzzy decision tree uses the fuzzification. If disable_fuzzy=True, the specified fuzzy decision tree is equivalent to a naive decision tree.

  • fuzzification_options (FuzzificationOptions, default=None) – Protocol message class that encapsulates all the options of the fuzzification settings used by the specified fuzzy decision tree.

  • criterion_func ({"mse", "mae"}, default="mse") – The criterion function used by the function that calculates the impurity gain of the target values. NB: Only use a criterion function for decision tree regressor.

  • n_estimators (int, default=100) – The number of fuzzy decision trees to be used.

  • max_depth (int, default=3) – The maximum depth of the tree to be trained.

  • min_samples_split (int, default=2) – The minimum number of samples required to split a node. If the number of samples at a node is above this threshold, the node is split; otherwise it becomes a leaf node.

  • min_impurity_split (float, default=1e-7) – The minimum impurity required to split a node. If a node’s impurity is above this threshold, the node is split; otherwise it becomes a leaf node.

  • max_features (int, default=None) – The maximum number of features from the training dataset considered when training each fuzzy decision tree.

  • multi_process_options (MultiProcessOptions, default=None) – Protocol message class that encapsulates all the options of the multi-process settings. If left as None, non-parallel computing mode is used.

_estimators

The collection of sub-estimators as base learners.

Type

ndarray of FuzzyDecisionTreeRegressor

_res_func

In classification, get the final result from the classes given by the forest by majority voting method. In regression, calculate the average of the predicted values given by the forest as the final result.

Type

function, default=None

fuzzytrees.settings module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

class fuzzytrees.settings.ComparisionMode(value)[source]

Bases: enum.Enum

An enumeration.

BOOSTING = 'fgbdt_vs_nfgbdt'
FF3 = 'ff3_vs_naive'
FF4 = 'ff4_vs_naive'
FF5 = 'ff5_vs_naive'
FUZZY = 'fcart_vs_ccart'
MIXED = 'mfgbdt_vs_nfgbdt'
NAIVE = 'my_naive_vs_sklearn_naive'
class fuzzytrees.settings.DirSave(value)[source]

Bases: enum.Enum

An enumeration.

EVAL_DATA = '/Users/geogeliu/PycharmProjects/FuzzyTrees/docs/fuzzy_trees_v001/data_gen/eval_data/'
EVAL_FIGURES = '/Users/geogeliu/PycharmProjects/FuzzyTrees/docs/fuzzy_trees_v001/data_gen/eval_figures/'
MODELS = '/Users/geogeliu/PycharmProjects/FuzzyTrees/docs/fuzzy_trees_v001/data_gen/pkl_models/'
class fuzzytrees.settings.EvaluationType(value)[source]

Bases: enum.Enum

An enumeration.

FUZZY_REG_VS_ERR_ON_CONV_K = 'fuzzy_reg_vs_err_on_conv_k'

fuzzytrees.util_comm module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

fuzzytrees.util_comm.get_cwd_as_prefix()[source]
fuzzytrees.util_comm.get_now_str(tim_str)[source]
fuzzytrees.util_comm.get_timestamp_str()[source]
fuzzytrees.util_comm.get_today_str()[source]

fuzzytrees.util_data_handler module

@author : Anjin Liu, Zhaoqing Liu @email : anjin.liu@uts.edu.au, Zhaoqing.Liu-1@student.uts.edu.au

fuzzytrees.util_data_handler.load_German_credit()[source]

Professor Dr. Hans Hofmann, Institut für Statistik und Ökonometrie, Universität Hamburg, FB Wirtschaftswissenschaften, Von-Melle-Park 5, 2000 Hamburg 13

fuzzytrees.util_data_handler.load_chess()[source]
  1. Database originally generated and described by Alen Shapiro.

  2. Donor/Coder: Rob Holte (holte@uottawa.bitnet). The database was supplied to Holte by Peter Clark of the Turing Institute in Glasgow (pete@turing.ac.uk).

  3. Date: 1 August 1989

fuzzytrees.util_data_handler.load_data_clf(ds_name)[source]

Load a dataset by the specified name.

Parameters

ds_name (str) –

Returns

data

Return type

DataFrame

fuzzytrees.util_data_handler.load_diabetes()[source]
fuzzytrees.util_data_handler.load_iris()[source]
fuzzytrees.util_data_handler.load_vehicle()[source]

Turing Institute Research Memorandum TIRM-87-018 “Vehicle Recognition Using Rule Based Methods” by Siebert, J.P. (March 1987)

fuzzytrees.util_data_handler.load_waveform()[source]

Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth International

fuzzytrees.util_data_handler.load_wine()[source]

fuzzytrees.util_logging module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

fuzzytrees.util_logging.setup_logging(default_path='logging_config.yaml', default_level=20)[source]

Configure the logging module.

How to enable global configuration for logging?

Call this function at the beginning of the program’s main function, and then use logging to get a logger wherever you need to log.

How to log?

In development, set the level of all handlers, e.g., console, file, error, etc., in the log configuration file to DEBUG for debugging purposes.

In production, set the level of each handler in the log configuration file back to a level appropriate for production systems, e.g., console to INFO, file to DEBUG, and error to ERROR.

Parameters
  • default_path (str) – Path to the YAML file used to configure logging.

  • default_level ({logging.CRITICAL, logging.FATAL, logging.ERROR, logging.WARNING, logging.WARN, logging.INFO, logging.DEBUG, logging.NOTSET}) –

Warning

In PyYAML version 5.1+, calling PyYAML’s yaml.load function without specifying the Loader=… parameter is deprecated [6].

References

[6] https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
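The pattern described for setup_logging might look roughly like the sketch below. It is not the fuzzytrees implementation: the helper name setup_logging_sketch, its return value, and the fallback to logging.basicConfig when the file is missing are all assumptions made for illustration. It uses yaml.safe_load, which avoids the deprecated Loader-less yaml.load call mentioned in the warning.

```python
import logging
import logging.config
import os


def setup_logging_sketch(config_path="logging_config.yaml", fallback_level=logging.INFO):
    """Configure logging from a YAML file, or fall back to basicConfig.

    Hypothetical helper for illustration only.
    Returns True if the YAML configuration was applied, otherwise False.
    """
    if os.path.exists(config_path):
        import yaml  # PyYAML; imported lazily so the fallback works without it

        with open(config_path, "rt", encoding="utf-8") as f:
            # safe_load never needs an explicit Loader argument.
            config = yaml.safe_load(f)
        logging.config.dictConfig(config)
        return True
    # Assumed fallback: a plain console configuration at the given level.
    logging.basicConfig(level=fallback_level)
    return False


# Call once at the start of the program, then get loggers wherever needed.
used_file = setup_logging_sketch("no_such_logging_config.yaml")
logger = logging.getLogger(__name__)
logger.info("Logging configured (from file: %s)", used_file)
```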

fuzzytrees.util_plotter module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

fuzzytrees.util_plotter.plot_multi_lines(coordinates, title=None, x_label=None, y_label=None, x_limit=None, y_limit=None, x_ticks=None, y_ticks=None, legends=None, fig_name=None, enable_max_annot=False, enable_min_annot=False)[source]

Plot multiple lines in a figure.

Parameters
  • coordinates (array-like) – The values corresponding to the X-axis (i.e. coordinates[:, 0]) must be numeric and in ascending order.

  • title (str, default=None) –

  • x_label (str, default=None) –

  • y_label (str, default=None) –

  • x_limit (tuple, default=None) –

  • y_limit (tuple, default=None) –

  • legends (array-like, default=None) –

  • fig_name (str, default=None) – fig_name is either a text or byte string giving the name (and the path if the file isn’t in the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped.

fuzzytrees.util_preprocessing_funcs module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

fuzzytrees.util_preprocessing_funcs.degree_of_membership_build(X_df, r_seed, conv_k, fuzzy_reg)[source]

Build the degree of membership set of a feature. That set maps to the specified number of fuzzy sets of the feature. This is the process of transforming a crisp set into a fuzzy set.

@author : Anjin Liu @email : Anjin.Liu@uts.edu.au

Parameters
  • X_df (DataFrame) – The values of one feature of the training input samples. NB: All features must be normalized by feature scaling.

  • r_seed (int) – The random seed.

  • conv_k (int) – The number of convolutions over the input sample.

Returns

  • x_new (array-like of shape (n_samples, n_fuzzy_sets)) – Transformed degree of membership set.

  • centriods

  • degree_of_membership_theta

fuzzytrees.util_preprocessing_funcs.extract_fuzzy_features(X, conv_k=5, fuzzy_reg=0.0)[source]

Extract fuzzy features in feature fuzzification to generate degree of membership sets of each feature.

Attention

Feature fuzzification must be done in the data preprocessing, that is, before training the model and predicting new samples.

@author: Anjin Liu @email: Anjin.Liu@uts.edu.au

fuzzytrees.util_preprocessing_funcs.make_diagonal(x)[source]

Transform a vector into a diagonal matrix.

fuzzytrees.util_preprocessing_funcs.one_hot_encode(y, n_ohe_col=None)[source]

One-hot encode nominal values of target values.

Parameters
  • y (array-like, 1-dimensional, elements must be numerical value and start from 0) –

  • n_ohe_col (int) –
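One-hot encoding of integer labels starting from 0 can be sketched with NumPy as follows. This is a minimal illustration, not the library's code; the name one_hot_encode_sketch is hypothetical, and inferring n_ohe_col from the largest label when it is None is an assumption.

```python
import numpy as np


def one_hot_encode_sketch(y, n_ohe_col=None):
    """One-hot encode a 1-D array of integer labels that start from 0."""
    y = np.asarray(y, dtype=int)
    if n_ohe_col is None:
        # Assumption: use one column per label up to the largest label seen.
        n_ohe_col = int(y.max()) + 1
    encoded = np.zeros((y.shape[0], n_ohe_col))
    # Set a single 1 in each row, at the column given by the label.
    encoded[np.arange(y.shape[0]), y] = 1.0
    return encoded


encoded = one_hot_encode_sketch([0, 2, 1])
print(encoded)
```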

fuzzytrees.util_preprocessing_funcs.resample_bootstrap(X, y, n_subsets, n_samples_sub=None)[source]

Draw a specified number of collections of independent sample subsets from the original sample sets using bootstrap sampling. This is a sampling scheme with replacement (‘WR’ - an element may appear multiple times in the same sample).

Parameters
  • X (sequence of array-like of shape (n_samples, n_features)) – The input samples in the format of indexable data-structure, which can be arrays, lists, dataframes or scipy sparse matrices with consistent first dimension.

  • y (array-like of shape (n_samples,)) – The target values (class labels), which are non-negative integers in classification and real numbers in regression. Its first dimension has to be the same as that of X.

  • n_subsets (int) – The number of collections of subsets to generate.

  • n_samples_sub (int, default=None) – The sample size of each subset to generate. If left as None, this is automatically set to the first dimension of X.

Returns

X_subsets, y_subsets – The tuple of the generated lists of sample subsets. The first element in the tuple is the list of the sample subsets from X and the second is the list of sample subsets from y.

Return type

tuple, where each is a sequence of array-like
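Bootstrap resampling as described here can be sketched by drawing row indices with replacement and applying the same indices to X and y. The function name, the seed parameter, and the use of numpy.random.default_rng are assumptions for illustration; the library may implement this differently.

```python
import numpy as np


def resample_bootstrap_sketch(X, y, n_subsets, n_samples_sub=None, seed=None):
    """Draw n_subsets bootstrap subsets (sampling with replacement) from X and y."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples = X.shape[0]
    if n_samples_sub is None:
        n_samples_sub = n_samples
    rng = np.random.default_rng(seed)
    X_subsets, y_subsets = [], []
    for _ in range(n_subsets):
        # Indices are drawn independently, so a row may appear more than once.
        idx = rng.integers(0, n_samples, size=n_samples_sub)
        X_subsets.append(X[idx])
        y_subsets.append(y[idx])
    return X_subsets, y_subsets


X = np.arange(12).reshape(6, 2)  # row i is [2i, 2i + 1]
y = np.arange(6)
X_subsets, y_subsets = resample_bootstrap_sketch(X, y, n_subsets=3, seed=0)
```

Using one index array for both X and y keeps each sample paired with its target value.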

fuzzytrees.util_preprocessing_funcs.resample_simple_random(X, y, n_subsets, n_samples_sub=None, replace=False)[source]

Randomly draw a specified number of collections of independent sample subsets from the original sample sets using simple random sampling.

Parameters
  • X (sequence of array-like of shape (n_samples, n_features)) – The input samples in the format of indexable data-structure, which can be arrays, lists, dataframes or scipy sparse matrices with consistent first dimension.

  • y (array-like of shape (n_samples,)) – The target values (class labels), which are non-negative integers in classification and real numbers in regression. Its first dimension has to be the same as that of X.

  • n_subsets (int) – The number of collections of subsets to generate.

  • n_samples_sub (int, default=None) – The sample size of each subset to generate. If left as None, this is automatically set to the first dimension of X.

  • replace (bool, default=False) – Whether to resample with replacement. If False, each sampling for a sample subset will implement (sliced) random permutations, which is a simple sampling scheme without replacement (‘WOR’ - no element can be selected more than once in the same sample). If True, each sampling for a sample subset will implement a bootstrapping sampling scheme with replacement (‘WR’ - an element may appear multiple times in the same sample).

Returns

X_subsets, y_subsets – The tuple of the generated lists of sample subsets. The first element in the tuple is the list of the sample subsets from X and the second is the list of sample subsets from y.

Return type

tuple, where each is a sequence of array-like
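The replace=False case, described above as sliced random permutations, could look like the following sketch. The helper name and the seed parameter are hypothetical, and only the without-replacement branch is shown.

```python
import numpy as np


def resample_without_replacement_sketch(X, y, n_subsets, n_samples_sub=None, seed=None):
    """Draw n_subsets subsets without replacement via sliced random permutations."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples = X.shape[0]
    if n_samples_sub is None:
        n_samples_sub = n_samples
    rng = np.random.default_rng(seed)
    X_subsets, y_subsets = [], []
    for _ in range(n_subsets):
        # A permutation contains every index once; slicing keeps them unique.
        idx = rng.permutation(n_samples)[:n_samples_sub]
        X_subsets.append(X[idx])
        y_subsets.append(y[idx])
    return X_subsets, y_subsets


X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_subsets, y_subsets = resample_without_replacement_sketch(
    X, y, n_subsets=2, n_samples_sub=4, seed=0
)
```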

fuzzytrees.util_preprocessing_funcs.to_nominal(x)[source]

Transform values from one-hot encoding to nominal.
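The inverse of one-hot encoding is commonly done by taking the column index of the largest entry in each row. The sketch below assumes that approach; to_nominal_sketch is a hypothetical name and the library's own implementation may differ.

```python
import numpy as np


def to_nominal_sketch(x):
    """Map each one-hot row back to the index of its active column."""
    return np.argmax(np.asarray(x), axis=1)


labels = to_nominal_sketch([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
print(labels)
```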

fuzzytrees.util_tree_criterion_funcs module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

class fuzzytrees.util_tree_criterion_funcs.LeastSquaresFunction[source]

Bases: fuzzytrees.util_tree_criterion_funcs.LossFunction

Function class used in a gradient boosting regressor (Friedman et al., 1998; Friedman 2001).

gradient(y, y_pred)[source]
loss(y, y_pred)[source]

The loss function is the least-squares equation: L(y, F) = (y - F) ^ 2 / 2
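For the loss L(y, F) = (y - F)^2 / 2, the derivative with respect to the prediction F is F - y, so the negative gradient equals the residual y - F that a gradient boosting regressor fits at each stage. The sketch below illustrates this; the function names are hypothetical, and the sign convention used by the library's gradient method is not confirmed here.

```python
import numpy as np


def least_squares_loss(y, y_pred):
    """Element-wise loss L(y, F) = (y - F)^2 / 2."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 0.5 * (y - y_pred) ** 2


def least_squares_gradient(y, y_pred):
    """Derivative of the loss with respect to the prediction: dL/dF = F - y."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return y_pred - y


y_true = [3.0, 0.0]
y_hat = [1.0, 0.5]
loss = least_squares_loss(y_true, y_hat)
grad = least_squares_gradient(y_true, y_hat)
residual = -grad  # the negative gradient is the residual y - F
```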

class fuzzytrees.util_tree_criterion_funcs.LossFunction[source]

Bases: object

Base loss function class that encapsulates all base functions to be inherited by all derived function classes.

Warning

This class should not be used directly. Use derived classes instead.

abstract gradient(y, y_pred)[source]
abstract loss(y, y_pred)[source]
class fuzzytrees.util_tree_criterion_funcs.SoftLeastSquaresFunction[source]

Bases: fuzzytrees.util_tree_criterion_funcs.LossFunction

Function class used in a gradient boosting classifier (Friedman et al., 1998; Friedman 2001).

gradient(y, proba)[source]
loss(y, y_pred)[source]

The loss function (least-squares equation: L(y, F) = (y - F) ^ 2 / 2) is not applicable in classification.

fuzzytrees.util_tree_criterion_funcs.calculate_entropy(y, dm=None)[source]

Calculate the entropy of y.

fuzzytrees.util_tree_criterion_funcs.calculate_gini(y, dm=None)[source]

Calculate the Gini impurity of y.
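As a reference point, the crisp (non-fuzzy) forms of both criteria can be sketched from class frequencies. The sketch ignores the dm (degree of membership) argument, and the base-2 logarithm for entropy is an assumption; the function names are hypothetical.

```python
import numpy as np


def class_probabilities(y):
    """Relative frequency of each distinct value in y."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return counts / counts.sum()


def entropy_sketch(y):
    """Shannon entropy: -sum(p * log2(p)) over the class frequencies."""
    p = class_probabilities(y)
    return float(-np.sum(p * np.log2(p)))


def gini_sketch(y):
    """Gini impurity: 1 - sum(p^2) over the class frequencies."""
    p = class_probabilities(y)
    return float(1.0 - np.sum(p ** 2))


balanced = [0, 0, 1, 1]
pure = [1, 1, 1]
print(entropy_sketch(balanced), gini_sketch(balanced))
print(entropy_sketch(pure), gini_sketch(pure))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed.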

fuzzytrees.util_tree_criterion_funcs.calculate_impurity_gain(y, sub_y_1, sub_y_2, criterion_func, p_subset_true_dm=None, p_subset_false_dm=None)[source]

Calculate the impurity gain, which is equal to the impurity of y minus the impurity of sub_y_1 and sub_y_2, as measured by criterion_func.
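A common way to compute such a gain is to subtract the size-weighted impurity of the two subsets from the impurity of the parent. The sketch below assumes that weighting and uses a crisp Gini criterion; it ignores the degree-of-membership arguments, and all names are hypothetical.

```python
import numpy as np


def gini(y):
    """Gini impurity of the labels in y."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def impurity_gain_sketch(y, sub_y_1, sub_y_2, criterion=gini):
    """Parent impurity minus the size-weighted impurity of the two subsets."""
    n = len(y)
    weighted = (len(sub_y_1) / n) * criterion(sub_y_1) + (len(sub_y_2) / n) * criterion(sub_y_2)
    return criterion(y) - weighted


perfect_split = impurity_gain_sketch([0, 0, 1, 1], [0, 0], [1, 1])
useless_split = impurity_gain_sketch([0, 0, 1, 1], [0, 1], [0, 1])
print(perfect_split, useless_split)
```

A split that separates the classes completely recovers the full parent impurity, while a split that leaves both subsets as mixed as the parent yields no gain.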

fuzzytrees.util_tree_criterion_funcs.calculate_impurity_gain_ratio(y, sub_y_1, sub_y_2, X_sub, criterion_func, p_subset_true_dm=None, p_subset_false_dm=None)[source]

Calculate the impurity gain ratio.

fuzzytrees.util_tree_criterion_funcs.calculate_mae(y_true, y_pred)[source]

Calculate the Mean Absolute Error between y_true and y_pred.

fuzzytrees.util_tree_criterion_funcs.calculate_mean_value(y)[source]

Calculate the mean of y.

Parameters

y (array-like of shape (n_samples, n_labels)) –

Returns

value – The mean values.

Return type

array-like of the shape reduced by one dimension, at least a 0-d float number

fuzzytrees.util_tree_criterion_funcs.calculate_mse(y_true, y_pred)[source]

Calculate the Mean Squared Error between y_true and y_pred.

fuzzytrees.util_tree_criterion_funcs.calculate_proba(y)[source]

Calculate the probabilities of each element in the set.

Attention

Before counting, the elements will be reordered from smallest to largest.

Parameters

y (array-like of shape (n_samples,)) –

fuzzytrees.util_tree_criterion_funcs.calculate_standard_deviation(y)[source]

Calculate the standard deviation of y.

fuzzytrees.util_tree_criterion_funcs.calculate_value_by_majority_vote(y)[source]

Calculate value by majority vote.

Attention

Used in classification decision tree.

fuzzytrees.util_tree_criterion_funcs.calculate_variance(y)[source]

Calculate the variance of y.

fuzzytrees.util_tree_criterion_funcs.calculate_variance_reduction(y, sub_y_1, sub_y_2, criterion_func, p_subset_true_dm=None, p_subset_false_dm=None)[source]

Calculate the variance reduction, which is equal to the impurity of y minus the impurity of sub_y_1 and sub_y_2, as measured by criterion_func.

fuzzytrees.util_tree_criterion_funcs.majority_vote(y_preds)[source]

Get the final classification result by majority voting method.

Parameters

y_preds (array-like of shape (n_samples, n_estimators)) – NB: The input array needs to be of integer dtype, otherwise a TypeError is raised.

Returns

Return type

array-like of shape (n_samples, )
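Majority voting over per-tree predictions can be sketched with numpy.bincount, which counts non-negative integer labels and raises a TypeError for float input, in line with the note on integer dtype. Whether the library uses bincount is an assumption, and majority_vote_sketch is a hypothetical name.

```python
import numpy as np


def majority_vote_sketch(y_preds):
    """Return the most frequent label in each row of an integer array.

    y_preds has shape (n_samples, n_estimators). Ties resolve to the
    smallest label because argmax returns the first maximum.
    """
    y_preds = np.asarray(y_preds)
    return np.array([np.bincount(row).argmax() for row in y_preds])


votes = majority_vote_sketch([[0, 1, 1], [2, 2, 0], [1, 0, 0]])
print(votes)
```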

fuzzytrees.util_tree_criterion_funcs.mean_value(y_preds)[source]

Get the final regression result by averaging method.

Parameters

y_preds (array-like of shape (n_samples, n_estimators, n_labels)) –

Returns

y_pred

Return type

array-like of the shape (n_samples, n_labels) reduced by one dimension, at least array-like of shape (n_samples, )
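Averaging over the estimator axis reduces the input by one dimension, as the return type describes. A minimal sketch, assuming the estimator axis is axis 1 as in the stated input shape; mean_value_sketch is a hypothetical name.

```python
import numpy as np


def mean_value_sketch(y_preds):
    """Average predictions over the estimator axis (axis 1)."""
    return np.mean(np.asarray(y_preds, dtype=float), axis=1)


# Two samples, two estimators, single output.
y_pred = mean_value_sketch([[1.0, 3.0], [2.0, 4.0]])

# Four samples, five estimators, two outputs per sample.
y_pred_multi = mean_value_sketch(np.zeros((4, 5, 2)))
print(y_pred, y_pred_multi.shape)
```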

fuzzytrees.util_tree_split_funcs module

@author : Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au

fuzzytrees.util_tree_split_funcs.split_disc_ds_2_multi(ds, col_idx, split_val)[source]
fuzzytrees.util_tree_split_funcs.split_ds_2_bin(ds, col_idx, split_val)[source]

Split a data set into two subsets by a specified value of a specified feature: If the specified feature is numerical data, split the data set into two subsets based on whether each value of the specified feature is greater than or equal to the split value.

If the specified feature is categorical data, split the data set into two subsets based on whether each value of the specified feature is the same as the split value.

Parameters
  • ds (array-like of shape (n_samples, n_features)) – The current data set to be split.

  • col_idx (int) – The index of the specified column on which the split is based.

  • split_val (int, float, or string) – The specified value of the column indexed as col_idx.

Returns

subset_true, subset_false – Return a tuple of the two split subsets.

Return type

array-like
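The binary split rule described above can be sketched with a boolean mask: a greater-than-or-equal test for numerical split values and an equality test for categorical ones. This is an illustration only; the function name is hypothetical, and deciding the feature type from the Python type of split_val is an assumption.

```python
import numpy as np


def split_ds_2_bin_sketch(ds, col_idx, split_val):
    """Split ds into (subset_true, subset_false) on one column."""
    ds = np.asarray(ds)
    if isinstance(split_val, (int, float)):
        # Numerical feature: rows with value >= split_val satisfy the rule.
        mask = ds[:, col_idx] >= split_val
    else:
        # Categorical feature: rows equal to split_val satisfy the rule.
        mask = ds[:, col_idx] == split_val
    return ds[mask], ds[~mask]


numeric = np.array([[1, 10], [2, 20], [3, 30]])
num_true, num_false = split_ds_2_bin_sketch(numeric, col_idx=0, split_val=2)

categorical = np.array([["a", "x"], ["b", "y"], ["a", "z"]])
cat_true, cat_false = split_ds_2_bin_sketch(categorical, col_idx=0, split_val="a")
```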

fuzzytrees.util_tree_split_funcs.split_ds_2_multi(ds, col_idx, split_val)[source]

Module contents

@author: Zhaoqing Liu @email : Zhaoqing.Liu-1@student.uts.edu.au