This project builds on a previous course project, extending it with Python bindings. For each section, the new implementations are explicitly highlighted in a dedicated subsection called "What's new". The full assignment of this project can be found here.
For each module there is a folder that separates source files from header files; the main files (the old main.cpp and the two new mains, one per module) are in their own folder called main; a CMake file and a setup.py are provided and will be explained later on. When the mains are executed, a text file with the results and an images folder containing the plots are created; they can be found in the StatisticsModule and NumericalIntegrationModule folders.
The main.cpp was written using switch cases to allow the user to decide what to compute.
- For the Statistics module, the user is asked which column of the CSV file to analyse and which analysis to perform.
- For the Numerical Integration module, the user is asked whether to run convergence tests and polynomial tests, and whether to compute the integral of a function given as input.
Since the modules are compilable as standalone libraries, #if USE_MODULEX and #endif directives delimit their includes and code in main.cpp.
In Python, the switch case is done with the match statement, as sketched below.
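For illustration, a minimal sketch of such a match-based menu (the option names here are hypothetical, not the project's actual identifiers):

```python
choice = input("Choose an analysis [mean/median/frequency]: ").strip().lower()
match choice:
    case "mean":
        print("computing the mean...")
    case "median":
        print("computing the median...")
    case "frequency":
        print("computing the frequency count...")
    case _:
        print("Unknown option, please choose again.")
```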
Now there are two separate mains, for better readability. The computation of an integral from user input was not replicated in Python.
This module computes mean, median, standard deviation, variance, frequency count, classification, and correlation analysis on data taken from a CSV file on NBA players from 2003 to 2022, and writes the results to a file named player_data_03_22_analysis.txt.
DataHandler_py.cpp provides a Python interface and binding with Pybind11, creating a module called moduleA. Some methods are not wrapped because they are no longer useful: readData, read_header, getHeader, getColumn, create_output_path, writeResults.
In fact, pandas is used to read the CSV file and store the column names and their data in just a few lines of code, unlike the previous C++ version.
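A minimal sketch of that pandas-based reading step (the CSV file name here is a guess based on the output file mentioned above; adjust it to the actual file in the repository):

```python
import pandas as pd

df = pd.read_csv("player_data_03_22.csv")  # hypothetical file name
columns = df.columns.tolist()              # the column names
ages = df["Age"]                           # one column's data, ready for NumPy
```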
Every statistical operation is computed both with the previous C++ functions and with new functions implemented in Python (using NumPy). The exception is classification, which is done only in Python: the previous function, for some reason, did not output its results, and since the Python version took fewer lines of code, it was decided to just stick with it.
Each function is decorated in order to measure its execution time and compare it between the two programming languages.
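The decorator in the repository may differ; a minimal sketch of this timing approach could look like:

```python
import functools
import time

import numpy as np

def timed(func):
    """Report how long func takes, then return its result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.6f} s")
        return result
    return wrapper

@timed
def mean_py(values):
    # Hypothetical Python counterpart of a wrapped C++ statistic.
    return float(np.mean(values))
```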
Since some columns contain non-numerical values, if the user chooses a statistical operation that can't be computed on the type of values in the chosen column, the program does not exit; instead, the user is warned about it and can choose a suitable operation.
When frequency is chosen, a plot of it and of its distribution is shown. Before every plot, the user is asked to press enter, so that they don't get suddenly overwhelmed by data and plots.
- Various statistical calculations are provided, including mean, median, variance, standard deviation, frequency, and correlation.
These calculations can handle data stored in a vector of std::optional<std::variant<double, std::string>>. The definition of the median is slightly altered in order to adapt it to strings.
- Assertions (assertm) are used to check the validity of input arguments and file operations; they provide a means of catching errors during development. Exceptions are thrown for cases such as invalid arguments, missing columns, or incompatible data types.
- The Boost library is used for accumulators.
Some analyses of the players' age and team are made.
For Age, mean, median, and frequency are computed, and a catplot is drawn to better visualize the execution time of each programming language.
*Figure: Distribution of the frequency counts of Age.*
For Team, since its values are non-numerical, only median and frequency are computed and plotted as before.
*Figure: Distribution of the frequency counts of Team.*
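As a rough illustration of how such a frequency analysis and plot can be produced on the Python side (file and column names as assumed above):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("player_data_03_22.csv")      # hypothetical file name
freq = df["Age"].value_counts().sort_index()   # frequency count of Age

input("Press enter to show the plot...")       # pause, as the mains do
sns.barplot(x=freq.index.astype(str), y=freq.values)
plt.title("Distribution of the frequency counts of Age")
plt.show()
```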
This module approximates integrals with the midpoint, trapezoidal, Simpson, two-point Gauss, and Gauss-Legendre methods.
For this last integration method we used the code provided here, with some edits. For example, we defined some constants, like const double k = M_PI / (n + 0.5) at line 120 of IntegrationMethods.cpp, in order to compute the nodes as nodes[i] = cos(k * (i - 0.25)), avoiding one calculation per iteration of the for loop compared to the code on the website. We tested it and unfortunately it has some memory leaks, so we also provide an implementation of twoPointGauss.
This module is divided into IntegrationMethods.cpp and moduleCfunctions.tpl.hpp, with their headers. The first one has an abstract class Quadrature and one derived class for each method. These classes take the integration extremes a and b and the number of subintervals n; in the constructor they compute the corresponding weights and nodes, and two const methods return them. Moreover, the Gauss-Legendre class also returns the integration extremes, which are needed to compute the integral, since this method works a bit differently from the others (as you can see from the above-linked website). For this reason, in moduleCfunctions.tpl.hpp we defined two templates of Integrate. They differ in the use of width and mean (where the integration extremes are needed) and in the fact that the function is not simply evaluated at the node but at the transformed point width * node + mean, where width = (b - a) / 2 and mean = (a + b) / 2.
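In Python terms, a minimal sketch of this second Integrate variant, assuming the standard affine mapping from [-1, 1] onto [a, b], could look like:

```python
import numpy as np

def integrate_gauss_legendre(f, a, b, n):
    # Nodes and weights are defined on [-1, 1] and mapped onto [a, b].
    nodes, weights = np.polynomial.legendre.leggauss(n)
    width = (b - a) / 2.0   # half-width of the interval
    mean = (a + b) / 2.0    # midpoint of the interval
    # f is evaluated at width * node + mean, not at the raw node:
    return width * np.sum(weights * f(width * nodes + mean))
```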
moduleCfunctions_py.cpp provides a Python interface and binding with Pybind11, creating a module called moduleC. Everything is unchanged except for the GaussLegendre method, which is reimplemented with NumPy (in IntegrationMethods.py) to avoid the above-mentioned memory leaks. Since the numpy.polynomial.legendre module only makes it possible to integrate over the interval [-1, 1], the interval is fixed internally when defining its test in the main.
When possible, thanks to SciPy's integrate functions, a comparison is made between integration with them and with the previous methods, in terms of convergence order, result, error, absolute error, and execution time (thanks to a time decorator). For the midpoint rule a comparison is not possible, since no equivalent function exists in SciPy. The same holds for the two-point Gauss rule, yet a comparison with scipy.integrate.quad is still worthwhile, since it also integrates between two points. However, it would be mathematically inconsistent to compute the convergence order of a method with a fixed number of two nodes, so this is not done.
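A hedged sketch of such a comparison (the function, interval, and sample count are illustrative, not the project's actual test cases):

```python
import numpy as np
from scipy import integrate

f = np.sin
a, b, n = 0.0, np.pi, 101            # odd number of equally spaced samples
x = np.linspace(a, b, n)

trap = integrate.trapezoid(f(x), x)  # trapezoidal counterpart
simp = integrate.simpson(f(x), x=x)  # Simpson counterpart
quad, _ = integrate.quad(f, a, b)    # adaptive quadrature between two points

exact = 2.0                          # integral of sin on [0, pi]
for name, val in [("trapezoid", trap), ("simpson", simp), ("quad", quad)]:
    print(f"{name}: result={val:.10f}, abs error={abs(val - exact):.2e}")
```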
As explained here, with SciPy's Simpson method, for an odd number of equally spaced samples the result is exact if the function is a polynomial of order 3 or less; if the samples are not equally spaced, the result is exact only if the function is a polynomial of order 2 or less. This means that the Simpson method implemented in C++ should be better, because it has no such limitations.
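A quick sanity check of this exactness claim, on an odd number of equally spaced samples (the polynomial and interval here are illustrative):

```python
import numpy as np
from scipy import integrate

f = lambda x: x**3                   # polynomial of order 3
x = np.linspace(0.0, 1.0, 11)        # 11 equally spaced samples (odd count)
approx = integrate.simpson(f(x), x=x)
print(approx, abs(approx - 0.25))    # exact up to floating-point rounding
```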
- The MuParserX library has been employed to facilitate the input of functions as strings, enabling users to compute integrals interactively.
- Static asserts (compile-time checks) have been used to verify that, in the functions Integrate and computeConvergenceOrder, the templates are restricted to classes derived from the abstract class Quadrature. Additionally, runtime asserts have been employed to ensure that the input files are well-defined.
- As in the analysis conducted on slide 11 of the material, the error is assumed to follow the form E(n) = Cn^(-p). By computing the error over an increasing number of intervals, proportional to 2^n, and comparing it with the preceding error on a logarithmic scale, the slope -p of the line connecting the points (2^n, E(2^n)) and (2^(n+1), E(2^(n+1))) is determined; p is the convergence order. A sketch of this procedure is given after this list.
- It was further verified that the midpoint and trapezoidal formulas are exact up to polynomials of degree 1, the Simpson formula up to polynomials of degree 3, and the Gauss formula with n nodes up to polynomials of degree 2n-1 (e.g., the two-point Gauss rule is exact up to degree 3).
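The convergence-order estimate described above can be sketched as follows (the midpoint rule here is a self-contained illustration, not the project's implementation):

```python
import numpy as np

def convergence_orders(rule, f, a, b, exact, kmax=8):
    """Errors for n = 2, 4, 8, ...; consecutive log2 ratios estimate p."""
    errors = np.array([abs(rule(f, a, b, 2**k) - exact)
                       for k in range(1, kmax + 1)])
    # slope between consecutive points on a log-log scale is -p,
    # so log2(E(n) / E(2n)) gives p directly
    return np.log2(errors[:-1] / errors[1:])

def midpoint(f, a, b, n):
    x = a + (np.arange(n) + 0.5) * (b - a) / n
    return (b - a) / n * np.sum(f(x))

print(convergence_orders(midpoint, np.sin, 0.0, np.pi, 2.0))  # ~2 for midpoint
```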
Not only is the convergence order of each method computed, but also its average (10 runs per average); in this way the result is more reliable. Unfortunately, Gauss-Legendre in C++ sometimes gives problems, ruining the plots; clearly, even though it was reimplemented with NumPy on the Python side, something must still be wrong.
*Figure: Average convergence of the numerical integration methods.*
*Figure: Average execution time of the numerical integration methods.*
*Figure: Execution time across different integrals and numbers of subintervals.*
As can be observed from the plots, the methods implemented in Python are faster than the C++ ones, but it must be considered that libraries such as NumPy and SciPy are written in C and Fortran with Python interfaces, which yields high performance behind an easier interface. Overall, the performances are comparable: they differ only minimally, and the user experience is not affected by such discrepancies.
Certainly, using libraries already implemented by professionals is easier, faster, and less error-prone; if they already exist, it's better not to reinvent the wheel.
Three CMake files are provided: one for each of the two modules and one to actually compile everything. The two modules have their own namespaces (called MODULEA and MODULEC) and can be compiled together or independently, by setting the corresponding option to ON from the terminal when compiling. The Statistics module also uses the namespace alias ba for boost::accumulators inside StatOp.cpp.
Since the Boost library is shared, it is linked in the compiling CMake; the muparserx library is only used in the second module and is therefore only linked if that module is compiled.
Anyone in this field should have these libraries installed; if not, Boost can be downloaded from the original webpage and muparserx from this zip file on GitHub.
Since most Boost libraries are header-only, there is often nothing to build, and this is our case; for muparserx, instead, you must create a folder in which to run cmake.
The hardest part was understanding how the three CMake files had to be written, and actually writing them.
Only one CMake file is provided, which creates both modules. A setup.py is also provided, creating a Python package easily installable via pip.
Some Python packages are used: NumPy, SciPy, Matplotlib, seaborn, pandas, and tabulate. They can be installed with pip install [names] --user.
NumPy and SciPy are used for mathematical operations, especially in the Statistics module; pandas for creating dataframes and reading a CSV file; matplotlib and seaborn for plotting; tabulate to create a table of some results (in the Numerical Integration module).
A CMake file is provided in the main directory. Create and enter the build folder:
mkdir build && cd build
Then run cmake, specifying if necessary where the boost and muparserx libraries are installed. The latter is only needed if you want to compile module C. By default both modules are set to OFF; to turn one ON, add the flag -DUSE_MODULEX=ON, where X is A for the Statistics module and C for the Numerical Integration module. So, for example, write:
cmake -DUSE_MODULEA=ON -DUSE_MODULEC=ON ../ -DBOOST_ROOT=/path/to/boost/installation -Dmuparserx_DIR=/opt/muparserx/share/cmake/muparserx
Be careful: while for Boost the root folder is enough, for muparserx you must specify the directory containing the file muparserxConfig.cmake, so the path must end with /share/cmake/muparserx.
Lastly write:
make
After a successful build you'll be prompted to run ./homework.
A setup.py is provided; it implicitly calls CMake, so everything is ready by writing:
python setup.py install --user
Or, even better, just use pip:
pip install --user .
Finally, to run each main:
python main/mainA.py
python main/mainC.py