Giessen-ML
A machine learning framework written in pure NumPy and used for Belle II PXD data analysis. It contains code for neural networks, random forests (classifier and regressor) and, so far, rudimentary code for a self-organizing map.
The reason for writing this code from scratch is to create a unified and coherent framework for all of our machine learning applications, where all follow a similar design and syntax. The code copies PyTorch's style, because that's what I learned first.
A further reason is the portability of the code. Written in pure NumPy, it can be run nearly anywhere without installing more and more libraries, and all parts can interoperate more easily.
Long term, I am planning to extend this framework with the optional use of something like Numba or CuPy in order to accelerate the code. CuPy allows running Python code on GPUs, with the added benefit of running on AMD GPUs.
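As a rough illustration (this is not yet part of the framework), CuPy is designed as a near drop-in replacement for NumPy, so the optional switch could look something like this:

```python
import numpy as np

try:
    import cupy as cp  # optional GPU backend, not yet used by this framework
    xp = cp
except ImportError:
    xp = np  # fall back to pure NumPy

# The same array code then runs unchanged on either backend.
x = xp.random.randn(1024, 128)
w = xp.random.randn(128, 64)
y = x @ w
```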
Feature List
The following list gives an overview of features and their status. Keep in mind that this project was only recently started, so there are many bugs and a lack of optimization.
- Neural Network
  - Linear (implemented, works great)
  - Dropout (implemented, works)
  - BatchNorm (1D works, 2D backward is wonky)
  - Convolution (2D implemented, 1D planned)
    - with 1D convolutions one can analyse time series
    - 1D convolutions can be used as simplified RNN layers
  - Pooling (min/max and avg are implemented)
  - Loss Functions (implemented, should work, but not fully verified)
  - Optimizers (implemented)
    - SGD, SGD with momentum, and Nesterov are a bit wonky
    - Adagrad, Adadelta, RMSprop, and Adam work properly
  - Activation (implemented, works)
  - Graph (very simple solution implemented)
  - Transposed Convolution (implemented, works)
  - LR Scheduler (implemented)
  - Tensors with autograd (implemented, but too slow to use)
  - Post-Training Quantization (implemented)
- Random Forest
  - DecisionTree (modular)
    - Impurity Measures (Gini, Entropy, MAE, MSE)
    - Split Algorithms (CART, ID3, C4.5)
    - Feature Preselection
    - Pruning
    - Regressor (implemented, untested)
    - Classifier (implemented)
  - RandomForest
    - Adding different trees
    - Voting Algorithms
    - Boosting (ansatz)
      - AdaBoost
      - Gradient Boosting
    - Pruning (ansatz)
      - reduce error
      - reduce complexity
      - reduce overfitting
- Self-Organizing Map (minimally implemented)
  - Rectangular and Hexagonal Maps
  - finding BMUs (implemented)
  - Neighborhood (very simple)
  - Neighborhood Functions (several implemented)
  - DataLoader from the neural network (works)
  - LR scheduler from the neural network (works)
  - calculating the U-matrix (implemented)
  - evaluating the map (minimally implemented)
- all applications can be saved/loaded as/from JSON files
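The JSON persistence mentioned above essentially boils down to converting NumPy arrays into nested lists and back. A minimal sketch of the idea, independent of this framework's actual method names:

```python
import json
import numpy as np

# Hypothetical parameter dictionary of a model; the names are placeholders.
params = {"weight": np.random.randn(4, 3), "bias": np.zeros(3)}

# Save: NumPy arrays are not JSON-serializable, so convert them to lists.
with open("model.json", "w") as f:
    json.dump({k: v.tolist() for k, v in params.items()}, f)

# Load: convert the nested lists back into NumPy arrays.
with open("model.json") as f:
    restored = {k: np.asarray(v) for k, v in json.load(f).items()}
```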
The network code is up and running: forward and backward propagation work with Linear, Convolution, and Activation layers; Dropout and Pooling should work as well. I am pretty sure that there are bugs and mistakes, but on simple test data it already performs well.
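For reference, the math behind a pure-NumPy linear layer is compact; the following is a generic sketch of the forward and backward pass, not a copy of this framework's actual implementation:

```python
import numpy as np

class Linear:
    """Generic fully connected layer: y = x @ W + b."""

    def __init__(self, n_in, n_out):
        # He-style initialization keeps activations in a reasonable range.
        self.W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        # Gradients w.r.t. parameters and input (batch along axis 0).
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        return grad_out @ self.W.T
```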
For random forests and decision trees I found a nearly complete codebase. I reworked that code: first of all I renamed all variables, functions, etc. to more meaningful names. Then I refactored the code a lot and made it more modular, so that users can assemble trees by hand and append them to forests... which makes them less random.
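To illustrate the kind of building blocks involved (the actual class and function names in this repo may differ), here is a self-contained sketch of a Gini impurity measure and a CART-style split search:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive CART-style search for the split minimizing weighted Gini."""
    best = (None, None, np.inf)  # (feature index, threshold, impurity)
    for feat in range(X.shape[1]):
        for thr in np.unique(X[:, feat]):
            left = X[:, feat] <= thr
            if left.all() or not left.any():
                continue  # skip degenerate splits
            w = left.mean()
            score = w * gini(y[left]) + (1 - w) * gini(y[~left])
            if score < best[2]:
                best = (feat, thr, score)
    return best
```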
For the Self-Organizing Map I came up with a solution that works with batches and reuses some code from the neural network. Next up are validation/prediction methods for the SOM.
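The batched BMU search mentioned above essentially reduces to a vectorized distance computation; a minimal sketch of the idea, independent of the framework's API:

```python
import numpy as np

def find_bmus(batch, weights):
    """Return the index of the best-matching unit for each sample.

    batch:   (n_samples, n_features)
    weights: (n_units, n_features) -- the flattened map's codebook
    """
    # Pairwise squared Euclidean distances via broadcasting.
    diff = batch[:, None, :] - weights[None, :, :]
    dists = np.einsum('ijk,ijk->ij', diff, diff)
    return dists.argmin(axis=1)

# Example: 8 samples against a 5x5 map with 3 features per unit.
bmus = find_bmus(np.random.randn(8, 3), np.random.randn(25, 3))
```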
The code is still rather bare-bones in general. There is not too much internal checking within the code, so the user needs to be rather careful. For now I am trying to write proper docstrings and comments. The plan is to have a more coherent naming system and make the code more readable to minimize the need for comments.
Running the Code
There are several test scripts, each with a matching config file:
- network-test.py, config-network
- som-test.py, config-som
- tree-test.py, config-tree
- forrest-test.py, config-forrest
These can be run using the command line:
python network-test.py config-network
Keep in mind that you need at least Python 3.11.
The Python test scripts are intended to be used with PXD data and aren't written in a manner that would be helpful in explaining how to use this code. For that purpose there are four Jupyter notebooks, which show the basics of how to set up and use each of the applications. They rely on dummy data that is generated inside each notebook, and they have the same names as the Python scripts. I recommend starting from there. Proper documentation and some basic tutorials will follow later.
Jupyter Notebooks
This repo contains some Jupyter notebooks which show, in a very simple manner, how to use the code. Users can easily change application parameters, since they are set in their own cells. The notebooks generate dummy data, for which one can set different parameters.
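The dummy data in the notebooks is of the usual toy kind; for instance, a two-class classification set can be generated with a few lines of NumPy (the parameters here are placeholders, the notebooks expose their own):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_per_class = 500

# Two Gaussian blobs with different means as a toy binary problem.
class_a = rng.normal(loc=-1.0, scale=0.8, size=(n_per_class, 2))
class_b = rng.normal(loc=+1.0, scale=0.8, size=(n_per_class, 2))

X = np.vstack([class_a, class_b])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
```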
Getting data files
There are other tools one can use to extract data from ROOT files. Firstly, I recommend using uproot. Otherwise, I wrote a simple class, based on uproot, that can extract PXD data from ROOT files. This code can be found here.
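For the uproot route, reading branches from a ROOT file into NumPy arrays looks roughly like this (the file, tree, and branch names are placeholders):

```python
import uproot

# Open the ROOT file and grab a TTree by name.
with uproot.open("pxd_data.root") as file:
    tree = file["tree"]
    # Read selected branches directly into NumPy arrays.
    arrays = tree.arrays(["cluster_charge", "cluster_size"], library="np")

charge = arrays["cluster_charge"]
```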
Installing as a system-wide library
One has to download the whole repo, navigate there using the terminal, and run the following commands:
python3 setup.py bdist_wheel sdist
and
pip3 install .
Sources
- Neural Network
  - ansatz for linear
  - more on linear
  - general approach
  - ansatz for convolution
  - more on convolution
  - optimizers
  - graph layer
  - more on graphs
  - a complete implementation
  - batchnorm
  - batchnorm
  - on transposed convolution
  - focal loss
  - regression loss
  - regularization
  - how to add l1/l2 regularization
  - activation functions
  - hopfield layer
  - more on hopfield
  - hopfield + mlp
  - rnn - theory
  - rnn - implementation
  - rnn - implementation
  - lstm - implementation
  - autograd
  - more on autograd
  - implementing autograd
  - quantization
- Random Forest
- Self-Organizing Map