Giessen-ML
A machine learning framework written in pure NumPy and used for Belle II PXD data analysis. It contains code for neural networks, random forests (classifier and regressor) and, so far, rudimentary code for a self-organizing map.
The reason for writing this code from scratch is to create a unified and coherent framework for all of our machine learning applications, where all follow a similar design and syntax. The code copies PyTorch's style, because that's what I learned first.
A further reason is the portability of the code. Written in pure NumPy, it can be run nearly anywhere without installing more and more libraries, and all parts can interoperate more easily.
Long term, I am planning to extend this framework with the optional use of something like Numba or CuPy in order to accelerate the code. CuPy allows running Python code on GPUs, with the added benefit of running on AMD GPUs.
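As a rough illustration (this is not yet part of the framework), CuPy is designed as a near drop-in replacement for NumPy, so the optional switch could look something like this:

```python
import numpy as np

try:
    import cupy as cp  # optional GPU backend, not yet used by this framework
    xp = cp
except ImportError:
    xp = np  # fall back to pure NumPy

# The same array code then runs unchanged on either backend.
x = xp.random.randn(1024, 128)
w = xp.random.randn(128, 64)
y = x @ w
```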
Feature List
The following list gives an overview of features and their status. Keep in mind that this project was only recently started, so there are many bugs and a lack of optimization.
- Neural Network
  - Linear (implemented, works great)
  - Dropout (implemented, works)
  - BatchNorm (1D works, 2D backward is wonky)
  - Convolution (2D implemented, 1D planned)
    - with 1D convolutions one can analyse time series
    - 1D convolutions can be used as simplified RNN layers
  - Pooling (min/max and avg are implemented)
  - Loss Functions (implemented, should work, but not fully verified)
  - Optimizers (implemented)
    - SGD, SGD with momentum, and Nesterov are a bit wonky
    - Adagrad, Adadelta, RMSprop, and Adam work properly
  - Activation (implemented, works)
  - Graph (very simple solution implemented)
  - Transposed Convolution (implemented, works)
  - LR Scheduler (implemented)
  - Tensors with autograd (implemented, but too slow to use)
  - Post-Training Quantization (implemented)
- Random Forest
  - DecisionTree (modular)
    - Impurity Measures (Gini, Entropy, MAE, MSE)
    - Split Algorithms (CART, ID3, C4.5)
    - Feature Preselection
    - Pruning
    - Regressor (implemented, untested)
    - Classifier (implemented)
  - RandomForest
    - Adding different trees
    - Voting Algorithms
    - Boosting (ansatz)
      - AdaBoost
      - Gradient Boosting
    - Pruning (ansatz)
      - reduce error
      - reduce complexity
      - reduce overfitting
- Self-Organizing Map (minimally implemented)
  - Rectangular and Hexagonal Maps
  - finding BMUs (implemented)
  - Neighborhood (very simple)
  - Neighborhood Functions (several implemented)
  - DataLoader from the neural network (works)
  - LR scheduler from the neural network (works)
  - calculating the U-matrix (implemented)
  - evaluating the map (minimally implemented)
- all applications can be saved/loaded as/from JSON files
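The JSON persistence mentioned above essentially boils down to converting NumPy arrays into nested lists and back. A minimal sketch of the idea, independent of this framework's actual method names:

```python
import json
import numpy as np

# Hypothetical parameter dictionary of a model; the names are placeholders.
params = {"weight": np.random.randn(4, 3), "bias": np.zeros(3)}

# Save: NumPy arrays are not JSON-serializable, so convert them to lists.
with open("model.json", "w") as f:
    json.dump({k: v.tolist() for k, v in params.items()}, f)

# Load: convert the nested lists back into NumPy arrays.
with open("model.json") as f:
    restored = {k: np.asarray(v) for k, v in json.load(f).items()}
```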
The network code is up and running: forward and backward propagation work with Linear, Convolution, and Activation layers; Dropout and Pooling should work as well. I am pretty sure that there are bugs and mistakes, but on simple test data it already performs well.
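For reference, the math behind a pure-NumPy linear layer is compact; the following is a generic sketch of the forward and backward pass, not a copy of this framework's actual implementation:

```python
import numpy as np

class Linear:
    """Generic fully connected layer: y = x @ W + b."""

    def __init__(self, n_in, n_out):
        # He-style initialization keeps activations in a reasonable range.
        self.W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out):
        # Gradients w.r.t. parameters and input (batch along axis 0).
        self.dW = self.x.T @ grad_out
        self.db = grad_out.sum(axis=0)
        return grad_out @ self.W.T
```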
For random forests and decision trees I found a nearly complete codebase. I reworked that code: first of all I renamed all variables, functions, etc. to more meaningful names. Then I refactored the code a lot and made it more modular, so that users can assemble trees by hand and append them to forests... which makes them less random.
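To illustrate the kind of building blocks involved (the actual class and function names in this repo may differ), here is a self-contained sketch of a Gini impurity measure and a CART-style split search:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive CART-style search for the split minimizing weighted Gini."""
    best = (None, None, np.inf)  # (feature index, threshold, impurity)
    for feat in range(X.shape[1]):
        for thr in np.unique(X[:, feat]):
            left = X[:, feat] <= thr
            if left.all() or not left.any():
                continue  # skip degenerate splits
            w = left.mean()
            score = w * gini(y[left]) + (1 - w) * gini(y[~left])
            if score < best[2]:
                best = (feat, thr, score)
    return best
```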
For the Self-Organizing Map I came up with a solution that works with batches and reuses some code from the neural network. Next up are validation/prediction methods for the SOM.
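The batched BMU search mentioned above essentially reduces to a vectorized distance computation; a minimal sketch of the idea, independent of the framework's API:

```python
import numpy as np

def find_bmus(batch, weights):
    """Return the index of the best-matching unit for each sample.

    batch:   (n_samples, n_features)
    weights: (n_units, n_features) -- the flattened map's codebook
    """
    # Pairwise squared Euclidean distances via broadcasting.
    diff = batch[:, None, :] - weights[None, :, :]
    dists = np.einsum('ijk,ijk->ij', diff, diff)
    return dists.argmin(axis=1)

# Example: 8 samples against a 5x5 map with 3 features per unit.
bmus = find_bmus(np.random.randn(8, 3), np.random.randn(25, 3))
```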
The code is still rather bare-bones in general. There is not too much internal checking within the code, so the user needs to be rather careful. For now I am trying to write proper docstrings and comments. The plan is to have a more coherent naming system and make the code more readable to minimize the need for comments.
Running the Code
There are several test scripts, each with a matching config file:
- network-test.py, config-network
- som-test.py, config-som
- tree-test.py, config-tree
- forrest-test.py, config-forrest
These can be run using the command line:
python network-test.py config-network
Keep in mind that you need at least Python 3.11.
The Python test scripts are intended to be used with PXD data and aren't written in a manner that would be helpful in explaining how to use this code. For that purpose there are four Jupyter notebooks, which show the basics of how to set up and use each of the applications. They rely on dummy data that is generated inside each notebook, and they have the same names as the Python scripts. I recommend starting from there. Proper documentation and some basic tutorials will follow later.
Jupyter Notebooks
This repo contains some Jupyter notebooks which show, in a very simple manner, how to use the code. Users can easily change application parameters, since they are set in their own cells. The notebooks generate dummy data, for which one can set different parameters.
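The dummy data in the notebooks is of the usual toy kind; for instance, a two-class classification set can be generated with a few lines of NumPy (the parameters here are placeholders, the notebooks expose their own):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_per_class = 500

# Two Gaussian blobs with different means as a toy binary problem.
class_a = rng.normal(loc=-1.0, scale=0.8, size=(n_per_class, 2))
class_b = rng.normal(loc=+1.0, scale=0.8, size=(n_per_class, 2))

X = np.vstack([class_a, class_b])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
```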
Getting data files
There are other tools one can use to extract data from ROOT files. Firstly, I recommend using uproot. Otherwise, I wrote a simple class, based on uproot, that can extract PXD data from ROOT files. This code can be found here.
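For the uproot route, reading branches from a ROOT file into NumPy arrays looks roughly like this (the file, tree, and branch names are placeholders):

```python
import uproot

# Open the ROOT file and grab a TTree by name.
with uproot.open("pxd_data.root") as file:
    tree = file["tree"]
    # Read selected branches directly into NumPy arrays.
    arrays = tree.arrays(["cluster_charge", "cluster_size"], library="np")

charge = arrays["cluster_charge"]
```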
Installing as a system-wide library
One has to download the whole repo, navigate there using the terminal, and run the following commands:
python3 setup.py bdist_wheel sdist
and
pip3 install .
Sources
- Neural Network
  - ansatz for linear
  - more on linear
  - general approach
  - ansatz for convolution
  - more on convolution
  - optimizers
  - graph layer
  - more on graphs
  - a complete implementation
  - batchnorm
  - batchnorm
  - on transposed convolution
  - focal loss
  - regression loss
  - regularization
  - how to add l1/l2 regularization
  - activation functions
  - hopfield layer
  - more on hopfield
  - hopfield + mlp
  - rnn - theory
  - rnn - implementation
  - rnn - implementation
  - lstm - implementation
  - autograd
  - more on autograd
  - implementing autograd
  - quantization
- Random Forest
- Self-Organizing Map