tree-test.ipynb

{"cells":[{"metadata":{},"cell_type":"markdown","source":"# Testing the Tree\n\n## Importing the Basics"},{"metadata":{"trusted":true},"cell_type":"code","source":"import numpy as np\nimport random\nfrom matplotlib import pyplot as plt\nfrom rf.decisionTree import DecisionTree\nfrom rf.impurityMeasure import Gini, Entropy, MSE, MAE\nfrom rf.leafFunction import Mode, Mean, Confidence\nfrom rf.splitAlgorithm import CART, ID3, C45\nfrom metric.confusionMatrix import ConfusionMatrix\nfrom metric.regressionScores import RegressionScores\nfrom utility.modelIO import ModelIO\nfrom rf.pruning import ReducedError, CostComplexity, PessimisticError","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Generating Test Data\n\nHere I generate random test data. It's two blocks of normally distributed points, the second one shifted in some dimensions. For classifier tasks each block gets a label; for regressor tasks each sample gets the sum of its coordinates plus a small random value as a target. It's a very simple dummy data set meant for testing the code.\n\nHere one can change the dimensionality and amount of the data."},{"metadata":{"trusted":true},"cell_type":"code","source":"def dataShift(dims):\n    # Pad the base offsets with zeros, shuffle them, and keep the first `dims` entries\n    offSet = [5, 1.5, 2.5]\n    diffLen = abs(len(offSet) - dims)\n    offSet.extend([0] * diffLen)\n    random.shuffle(offSet)\n    return offSet[:dims]\n\n# Initialize some parameters\ntotalAmount = 64000  # samples per block\ndims = 5\nevalAmount = totalAmount // 4\ntrainAmount = totalAmount - evalAmount\noffSet = dataShift(dims)\n\n# Create covariance matrix\ncov = np.eye(dims)  # This creates a covariance matrix with variances 1 and covariances 0\n\n# Generate random multivariate data\noneData = np.random.multivariate_normal(np.zeros(dims), cov, totalAmount)\ntwoData = np.random.multivariate_normal(offSet, cov, totalAmount)\n\n# Split the data into training and evaluation sets\ntrainData = np.vstack((oneData[:trainAmount], twoData[:trainAmount]))\nvalidData = np.vstack((oneData[trainAmount:], 
twoData[trainAmount:]))\n\n# Labels for classification tasks\ntrainLabels = np.hstack((np.zeros(trainAmount), np.ones(trainAmount)))\nvalidLabels = np.hstack((np.zeros(evalAmount), np.ones(evalAmount)))\n\n# Targets for regression tasks\ntrainTargets = np.sum(trainData, axis=1) + np.random.normal(0, 0.1, 2*trainAmount)\nvalidTargets = np.sum(validData, axis=1) + np.random.normal(0, 0.1, 2*evalAmount)\n\n# Shuffle the training data; labels and targets use the same permutation so they stay aligned\ntrainIndex = np.random.permutation(len(trainData))\ntrainData = trainData[trainIndex]\ntrainLabels = trainLabels[trainIndex]\ntrainTargets = trainTargets[trainIndex]","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Creating the Tree\n\nHere the tree is created. One can set the maximum depth of the tree. Depending on the task, we add a different impurity function and a different leaf function. Finally we add the split algorithm and set the feature percentile. A higher percentile looks at more possible splits but slows training down; a lower percentile looks at fewer possible splits and speeds the algorithm up. Depending on the data set, this can have a strong impact on performance."},{"metadata":{"trusted":true},"cell_type":"code","source":"task = 'classifier' # 'classifier'/'regressor'\ntree = DecisionTree(maxDepth=5, minSamplesSplit=2)\nif task == 'regressor':\n    tree.setComponent(MSE())\n    tree.setComponent(Mean())\nelif task == 'classifier':\n    tree.setComponent(Entropy())\n    tree.setComponent(Mode())\n    #tree.setComponent(Confidence())\ntree.setComponent(CART(featurePercentile=90))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Training the Tree\n\nAgain, depending on the task we train the tree with targets or labels. 
Then we make predictions on the validation data and print the tree. The cell after that plots the feature importance."},{"metadata":{"trusted":true},"cell_type":"code","source":"if task == 'regressor':\n    tree.train(trainData, trainTargets)\nelif task == 'classifier':\n    tree.train(trainData, trainLabels)\nprediction = tree.eval(validData)\nprint(tree)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Create bar plot\nplt.bar(np.arange(dims), tree.featureImportance, color='steelblue')\n\n# Add labels and title\nplt.xlabel('Feature Index')\nplt.ylabel('Importance')\nplt.title('Feature Importance')\n\n# Add grid\nplt.grid(True, linestyle='--', alpha=0.6)\n\n# Show plot\nplt.show()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Evaluating Predictions\n\nDepending on the task at hand we create a confusion matrix (classification) or simple metrics (regression). Since the number of classes is fixed to two, we don't need to change anything here."},{"metadata":{"trusted":true},"cell_type":"code","source":"if task == 'regressor':\n    metrics = RegressionScores(numClasses=2)\n    metrics.calcScores(prediction, validTargets, validLabels)\n    print(metrics)\nelif task == 'classifier':\n    confusion = ConfusionMatrix(numClasses=2)\n    confusion.update(prediction, validLabels)\n    confusion.percentages()\n    confusion.calcScores()\n    print(confusion)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Saving and Loading a Tree\n\nTrees can be converted to dictionaries and then saved as a JSON file. This allows us to load them and re-use them. 
JSON is also a plain-text format, which is neat."},{"metadata":{"trusted":true},"cell_type":"code","source":"saver = ModelIO()\nsaver.save(tree, 'test')\nnewTree = saver.load('test')\nprint(newTree)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"prediction = newTree.eval(validData)\n\nif task == 'regressor':\n    newMetrics = RegressionScores(numClasses=2)\n    newMetrics.calcScores(prediction, validTargets, validLabels)\n    print(newMetrics)\nelif task == 'classifier':\n    newConfusion = ConfusionMatrix(numClasses=2)\n    newConfusion.update(prediction, validLabels)\n    newConfusion.percentages()\n    newConfusion.calcScores()\n    print(newConfusion)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Comment\n\nThe tree works pretty well with both regression and classification tasks. Labels shouldn't be one-hot encoded; it works, but it's still rather iffy. Targets should be 1D; I haven't tested with 2D, though it might work. Training can be really fast with a percentile set in the split algorithm; otherwise it can be rather slow. Making predictions is fast and works well enough."}],"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"language_info":{"name":"python","version":"3.10.4","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"vscode":{"interpreter":{"hash":"aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"}}},"nbformat":4,"nbformat_minor":2}