tree-test.ipynb

{"cells":[{"cell_type":"markdown","metadata":{},"source":["# Testing the Tree\n","\n","## Importing the Basics"]},{"cell_type":"code","execution_count":1,"metadata":{"trusted":false},"outputs":[],"source":["import numpy as np\n","from matplotlib import pyplot as plt\n","from machineLearning.metric import ConfusionMatrix, RegressionScores\n","from machineLearning.utility import ModelIO\n","from machineLearning.rf import (\n","    DecisionTree,\n","    Gini, Entropy, MSE, MAE,\n","    Mode, Mean, Confidence, Probabilities,\n","    CART, ID3, C45,\n","    ReducedError, CostComplexity, PessimisticError\n",")"]},{"cell_type":"markdown","metadata":{},"source":["## Generating Test Data\n","\n","Here I generate random test data: two Gaussian blocks, the second shifted relative to the first in a few randomly chosen dimensions. For classifier tasks each block gets a label; for regressor tasks each sample gets the sum of its coordinates plus a small random value as a target. It's a very simple dummy data set meant for testing the code.\n","\n","The dimensionality and the amount of data can be changed here."]},{"cell_type":"code","execution_count":2,"metadata":{"trusted":false},"outputs":[],"source":["def dataShift(dims):\n","    # Shift up to three randomly chosen dimensions; all others stay at 0\n","    offSet = [5, 1.5, 2.5]\n","    diffLen = abs(len(offSet) - dims)\n","    offSet.extend([0] * diffLen)\n","    np.random.shuffle(offSet)\n","    return offSet[:dims]\n","\n","# Initialize some parameters\n","totalAmount = 64000\n","dims = 7\n","evalAmount = totalAmount // 4\n","trainAmount = totalAmount - evalAmount\n","offSet = dataShift(dims)\n","\n","# Create covariance matrix\n","cov = np.eye(dims)  # This creates a covariance matrix with variances 1 and covariances 0\n","\n","# Generate random multivariate data\n","oneData = np.random.multivariate_normal(np.zeros(dims), cov, totalAmount)\n","twoData = np.random.multivariate_normal(offSet, cov, totalAmount)\n","\n","# Split the data into training and evaluation sets\n","trainData = np.vstack((oneData[:trainAmount], 
twoData[:trainAmount]))\n","validData = np.vstack((oneData[trainAmount:], twoData[trainAmount:]))\n","\n","# Labels for classification tasks\n","trainLabels = np.hstack((np.zeros(trainAmount), np.ones(trainAmount)))\n","validLabels = np.hstack((np.zeros(evalAmount), np.ones(evalAmount)))\n","\n","# Targets for regression tasks\n","trainTargets = np.sum(trainData, axis=1) + np.random.normal(0, 0.1, 2*trainAmount)\n","validTargets = np.sum(validData, axis=1) + np.random.normal(0, 0.1, 2*evalAmount)\n","\n","# Shuffle the training data\n","trainIndex = np.random.permutation(len(trainData))\n","trainData = trainData[trainIndex]\n","trainLabels = trainLabels[trainIndex]\n","trainTargets = trainTargets[trainIndex]"]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[],"source":["def scatterPairwise(data, labels, size: float = 10):\n","    num_dims = data.shape[1]\n","    fig, axes = plt.subplots(num_dims, num_dims, figsize=(12, 12))\n","\n","    if len(labels.shape) > 1:\n","        labels = np.argmax(labels, axis=1)\n","    \n","    colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']\n","    point_colors = [colors[label] for label in labels]\n","\n","    for i in range(num_dims):\n","        for j in range(num_dims):\n","            if i == j:\n","                axes[i][j].axis('off')\n","            else:\n","                axes[i][j].scatter(data[:, i], data[:, j], c=point_colors, s=size, alpha=0.5,label='data')\n","                axes[i][j].set_xlabel(f\"Dim {i}\")\n","                axes[i][j].set_ylabel(f\"Dim {j}\")\n","    plt.tight_layout()\n","    plt.show()"]},{"cell_type":"code","execution_count":4,"metadata":{},"outputs":[],"source":["#scatterPairwise(trainData, trainLabels.astype('int'))"]},{"cell_type":"markdown","metadata":{},"source":["## Creating the Tree\n","\n","Here the tree is created. One can set the maximum depth of the tree. Depending on the task, we add a different impurity function and a different leaf function. 
Finally we add the split algorithm and set the feature percentile. Higher values look at more possible splits but decrease speed. Lower values look at fewer possible splits, speeding up the algorithm. Depending on the data set this can have a strong impact on the performance."]},{"cell_type":"code","execution_count":5,"metadata":{"trusted":false},"outputs":[],"source":["task = 'classifier' # 'classifier'/'regressor'\n","tree = DecisionTree(maxDepth=5, minSamplesSplit=12)\n","if task == 'regressor':\n","    tree.setComponent(MSE())\n","    tree.setComponent(Mean())\n","elif task == 'classifier':\n","    tree.setComponent(Entropy())\n","    tree.setComponent(Mode())\n","    #tree.setComponent(Confidence())\n","    #tree.setComponent(Probabilities(2))\n","tree.setComponent(CART(featurePercentile=90))"]},{"cell_type":"markdown","metadata":{},"source":["## Training the Tree\n","\n","Again, depending on the task we train the tree with targets or labels. Then we print the trained tree."]},{"cell_type":"code","execution_count":6,"metadata":{"trusted":false},"outputs":[{"name":"stdout","output_type":"stream","text":["tree 1 |\u001b[0m\u001b[31m\u001b[0m\u001b[0m\u001b[31m \u001b[0m                                                 | 00%\r"]},{"name":"stdout","output_type":"stream","text":["tree 1 |⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿| done ✔                  | 47%\n","—————————————————————— tree: 1/1 ———————————————————————\n","split: CART, impurity: Entropy, leaf: Mode, nodes: 31\n","maxDepth: 5, reached depth: 5, minSamplesSplit: 12\n","························································\n","╴feat: 1 <= 2.22, samples: 96000\n","     ├─feat: 1 <= 1.71, samples: 47473\n","     │   ├─feat: 6 <= 2.00, samples: 45907\n","     │   │   ├─feat: 1 <= 1.18, samples: 44898\n","     │   │   │   └─╴value: 0.0\n","     │   │   │   └─╴value: 0.0\n","     │   │   └─╴feat: 1 <= 1.16, samples: 1009\n","     │   │       └─╴value: 0.0\n","     │   │       
└─╴value: 0.0\n","     │   └─╴feat: 6 <= 1.79, samples: 1566\n","     │       ├─feat: 2 <= 0.72, samples: 1445\n","     │       │   └─╴value: 0.0\n","     │       │   └─╴value: 0.0\n","     │       └─╴feat: 2 <= 0.50, samples: 121\n","     │           └─╴value: 0.0\n","     │           └─╴value: 1.0\n","     └─╴feat: 1 <= 3.00, samples: 48527\n","         ├─feat: 6 <= 1.42, samples: 1600\n","         │   ├─feat: 6 <= 0.54, samples: 668\n","         │   │   └─╴value: 0.0\n","         │   │   └─╴value: 0.0\n","         │   └─╴feat: 2 <= 0.89, samples: 932\n","         │       └─╴value: 1.0\n","         │       └─╴value: 1.0\n","         └─╴feat: 1 <= 3.49, samples: 46927\n","             ├─feat: 6 <= 0.58, samples: 2063\n","             │   └─╴value: 1.0\n","             │   └─╴value: 1.0\n","             └─╴feat: 6 <= 0.89, samples: 44864\n","                 └─╴value: 1.0\n","                 └─╴value: 1.0\n"]}],"source":["if task == 'regressor':\n","    tree.train(trainData, trainTargets)\n","elif task == 'classifier':\n","    tree.train(trainData, trainLabels)\n","print(tree)"]},{"cell_type":"code","execution_count":7,"metadata":{},"outputs":[],"source":["tree.bake()"]},{"cell_type":"code","execution_count":8,"metadata":{},"outputs":[],"source":["prediction = tree.eval(validData)"]},{"cell_type":"markdown","metadata":{},"source":["## Evaluating predictions\n","\n","Depending on the task at hand we create a confusion matrix (classification) or simple metrics (regression). 
Since the number of classes is fixed to two, we don't need to change anything here."]},{"cell_type":"code","execution_count":9,"metadata":{"trusted":false},"outputs":[{"name":"stdout","output_type":"stream","text":["━━━━━━━━━━━━ evaluation ━━━━━━━━━━━━\n","————————— confusion matrix —————————\n","              Class 0     Class 1   \n","····································\n","     Class 0   15968         32     \n","                49%          0%     \n","····································\n","     Class 1     68        15932    \n","                 0%         49%     \n","\n","———————————————————————————————— scores ———————————————————————————————\n","                accuracy       precision      sensitivity      miss rate    \n","·······································································\n","     Class 0     0.997           0.996           0.998           0.002      \n","     Class 1     0.997           0.998           0.996           0.004      \n","·······································································\n","       total     0.997           0.997           0.997           0.003      \n"]}],"source":["if task == 'regressor':\n","    metrics = RegressionScores(numClasses=2)\n","    metrics.calcScores(prediction, validTargets, validLabels)\n","    print(metrics)\n","elif task == 'classifier':\n","    confusion = ConfusionMatrix(numClasses=2)\n","    confusion.update(prediction, validLabels)\n","    confusion.percentages()\n","    confusion.calcScores()\n","    print(confusion)"]},{"cell_type":"markdown","metadata":{},"source":["## Saving and Loading a Tree\n","\n","Trees can be converted to dictionaries and then saved as a json file. This allows us to load them and re-use them. 
Also json is a raw text format, which is neat."]},{"cell_type":"code","execution_count":10,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["—————————————————————— tree: 1/2 ———————————————————————\n","split: CART, impurity: Entropy, leaf: Mode, nodes: 31\n","maxDepth: 5, reached depth: 5, minSamplesSplit: 12\n","························································\n","╴feat: 1 <= 2.22, samples: 96000\n","     ├─feat: 1 <= 1.71, samples: 47473\n","     │   ├─feat: 6 <= 2.00, samples: 45907\n","     │   │   ├─feat: 1 <= 1.18, samples: 44898\n","     │   │   │   └─╴value: 0.0\n","     │   │   │   └─╴value: 0.0\n","     │   │   └─╴feat: 1 <= 1.16, samples: 1009\n","     │   │       └─╴value: 0.0\n","     │   │       └─╴value: 0.0\n","     │   └─╴feat: 6 <= 1.79, samples: 1566\n","     │       ├─feat: 2 <= 0.72, samples: 1445\n","     │       │   └─╴value: 0.0\n","     │       │   └─╴value: 0.0\n","     │       └─╴feat: 2 <= 0.50, samples: 121\n","     │           └─╴value: 0.0\n","     │           └─╴value: 1.0\n","     └─╴feat: 1 <= 3.00, samples: 48527\n","         ├─feat: 6 <= 1.42, samples: 1600\n","         │   ├─feat: 6 <= 0.54, samples: 668\n","         │   │   └─╴value: 0.0\n","         │   │   └─╴value: 0.0\n","         │   └─╴feat: 2 <= 0.89, samples: 932\n","         │       └─╴value: 1.0\n","         │       └─╴value: 1.0\n","         └─╴feat: 1 <= 3.49, samples: 46927\n","             ├─feat: 6 <= 0.58, samples: 2063\n","             │   └─╴value: 1.0\n","             │   └─╴value: 1.0\n","             └─╴feat: 6 <= 0.89, samples: 44864\n","                 └─╴value: 1.0\n","                 └─╴value: 1.0\n"]}],"source":["ModelIO.save(tree, 'tree-test')\n","newTree = ModelIO.load('tree-test')\n","print(newTree)"]},{"cell_type":"code","execution_count":11,"metadata":{"trusted":false},"outputs":[{"name":"stdout","output_type":"stream","text":["━━━━━━━━━━━━ evaluation ━━━━━━━━━━━━\n","————————— confusion matrix —————————\n","      
        Class 0     Class 1   \n","····································\n","     Class 0   15968         32     \n","                49%          0%     \n","····································\n","     Class 1     68        15932    \n","                 0%         49%     \n","\n","———————————————————————————————— scores ———————————————————————————————\n","                accuracy       precision      sensitivity      miss rate    \n","·······································································\n","     Class 0     0.997           0.996           0.998           0.002      \n","     Class 1     0.997           0.998           0.996           0.004      \n","·······································································\n","       total     0.997           0.997           0.997           0.003      \n"]}],"source":["prediction = newTree.eval(validData)\n","\n","if task == 'regressor':\n","    newMetrics = RegressionScores(numClasses=2)\n","    newMetrics.calcScores(prediction, validTargets, validLabels)\n","    print(newMetrics)\n","elif task == 'classifier':\n","    newConfusion = ConfusionMatrix(numClasses=2)\n","    newConfusion.update(prediction, validLabels)\n","    newConfusion.percentages()\n","    newConfusion.calcScores()\n","    print(newConfusion)"]},{"cell_type":"markdown","metadata":{},"source":["## Comment\n","\n","The tree works pretty well with both regression and classification tasks. Labels shouldn't be one-hot encoded; that works, but it's still rather iffy. Targets should be 1D; I haven't tested with 2D, though it might work. Training can be really fast with a percentile set in the split algorithm; otherwise it can be rather slow. 
Making predictions is fast and works well enough."]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.3"},"vscode":{"interpreter":{"hash":"aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"}}},"nbformat":4,"nbformat_minor":2}