This repository hosts the MalnpmDB dataset and the malicious package detector MalPacDetector, both described in the paper "MalPacDetector: An LLM-based Malicious npm Package Detector".
- Operating System: Ubuntu 22.04
- Python: Python 3.10.12
- node.js: node.js v18.16.0
$ python3 configure.py

Follow the prompts to configure the project. You can configure:
- datasets path: Where to find npm packages. (default: datasets/MalnpmDB)
- models path: Where to save trained models. (default: models)
- reports path: Where to save prediction result reports. (default: reports)
- features: Where to save extracted features. (default: features)
- feature-positions: Where to save code line position information of extracted features. (default: feature-positions)
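
As a rough illustration of how the defaults listed above behave, the sketch below collects them in a dictionary and falls back to the default whenever an option is left blank. The key names and the `resolve_config` helper are hypothetical; the actual format written by configure.py is not shown in this README.

```python
from pathlib import Path

# Defaults as listed above. The key names are illustrative assumptions,
# not necessarily the keys that configure.py writes.
DEFAULTS = {
    "datasets": "datasets/MalnpmDB",
    "models": "models",
    "reports": "reports",
    "features": "features",
    "feature-positions": "feature-positions",
}


def resolve_config(answers: dict[str, str]) -> dict[str, Path]:
    """Merge user answers with the defaults; blank answers keep the default."""
    resolved = {}
    for key, default in DEFAULTS.items():
        value = answers.get(key, "").strip()
        resolved[key] = Path(value or default)
    return resolved


if __name__ == "__main__":
    # Only the models path is overridden; everything else falls back.
    for key, path in resolve_config({"models": "my-models"}).items():
        print(f"{key}: {path}")
```
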
Then use the following command to set up the project.
$ ./setup.sh

Once the project is set up, you will see the following folders:
- conf: containing configuration and settings files.
- datasets: containing MalnpmDB dataset.
- feature-extract: containing feature extraction code files.
- training: containing training and prediction code files.
If you use the default configuration, you will also see the following folders:
- models: containing trained machine learning models.
- reports: containing npm packages prediction reports.
- features: containing npm packages' features extracted by feature extractor.
- feature-positions: containing feature position information.
First, activate the Python virtual environment:
$ source env/bin/activate

The main Python script is:
- cli.py: trains a machine learning model and predicts npm packages. By specifying different parameters, users can train different models or predict different packages.

The parameters available for each task are listed below:
| Options | Description |
|---|---|
| -h | Show all help information. |
| extract | Extract features. |
| -h | Show help information about extracting features. |
| -d | npm dataset name. |
| train | Train model. |
| -h | Show help information about training models. |
| -m | Malicious npm dataset name. |
| -b | Benign npm dataset name. |
| -o | Model used to train. ("NB", "MLP", "RF", "SVM") |
| -p | Preprocess method. ("none", "standardlize", "min-max-scale") |
| -a | Training or saving model. ("training", "save") |
| -hs | Smoothing of NB to save. |
| -hr | Learning rate of MLP to save. |
| -hl | Number of layers of MLP to save. |
| -hi | Number of iterations of MLP to save. |
| -ho | Optimization algorithm of MLP to save. |
| -ha | Activation function of MLP to save. |
| -he | Number of decision trees of RF to save. |
| -hd | Maximum depth of RF to save. |
| -hg | Gamma of SVM to save. |
| -hc | C of SVM to save. |
| predict | Predict npm packages. |
| -h | Show help information about predicting npm packages. |
| -o | Model used to predict. |
| -d | npm dataset that stores gzip-compressed npm packages. |
| -p | npm package directory path. |
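
The table above flattens three subcommands (extract, train, predict), each with its own options. The argparse sketch below shows one way such a layout could be declared. It is an illustrative reconstruction from the table, not the actual cli.py: the `build_parser` function and the `dest` names are assumptions, and only two of the hyperparameter options are included.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # extract: takes only a dataset name.
    extract = sub.add_parser("extract", help="Extract features.")
    extract.add_argument("-d", dest="dataset", help="npm dataset name.")

    # train: datasets, model, preprocessing, and the training/save switch.
    train = sub.add_parser("train", help="Train model.")
    train.add_argument("-m", dest="malicious", help="Malicious npm dataset name.")
    train.add_argument("-b", dest="benign", help="Benign npm dataset name.")
    train.add_argument("-o", dest="model", choices=["NB", "MLP", "RF", "SVM"])
    train.add_argument("-p", dest="preprocess",
                       choices=["none", "standardlize", "min-max-scale"])
    train.add_argument("-a", dest="action", choices=["training", "save"])
    # Hyperparameter options use multi-letter single-dash names, as in the table.
    train.add_argument("-he", dest="n_trees", type=int)
    train.add_argument("-hd", dest="max_depth", type=int)

    # predict: a model plus either a dataset name or a package directory.
    predict = sub.add_parser("predict", help="Predict npm packages.")
    predict.add_argument("-o", dest="model")
    predict.add_argument("-d", dest="dataset")
    predict.add_argument("-p", dest="package_path")
    return parser


if __name__ == "__main__":
    example = ["train", "-a", "save", "-m", "mal", "-b", "ben",
               "-p", "none", "-o", "RF", "-he", "64", "-hd", "7"]
    print(build_parser().parse_args(example))
```

Declaring each subcommand with its own parser is what lets -o, -d, and -p carry different meanings under train and predict.
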
For convenience, use the following commands to show help information.
# Show all help information.
$ python3 cli.py -h
# Show help information about extracting features.
$ python3 cli.py extract -h
# Show help information about training models.
$ python3 cli.py train -h
# Show help information about predicting npm dataset.
$ python3 cli.py predict -h

The parameters for this step are listed under extract in the table above. The npm dataset should follow this structure:
dataset_name
|__ <package_name-package_version1>.tar.gz
|__ <package_name-package_version2>.tar.gz
|__ ...
|__ <package_name-package_versionn>.tar.gz

Each compressed package should have the following structure, which is the standard npm package layout:
package_name-package_version
|__ package
    |__ package.json
    |__ ...

Use the following command to extract features from an npm dataset.
$ python3 cli.py extract -d <dataset_name>

The parameters related to model settings are stored in conf/settings.json and are listed under train in the table above. This lets users conveniently train different models or use different datasets.
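
Because both extract and train take dataset names, it may be worth checking that a dataset follows the archive layout described earlier before running them. The sketch below is one way to do that with the standard tarfile module. The `check_dataset` helper is hypothetical and not part of MalPacDetector, and it assumes each archive stores package.json under a top-level package directory.

```python
import tarfile
from pathlib import Path


def check_dataset(dataset_dir: str) -> dict[str, str]:
    """Return {archive name: problem} for archives that break the layout."""
    problems = {}
    for archive in sorted(Path(dataset_dir).glob("*.tar.gz")):
        try:
            with tarfile.open(archive, "r:gz") as tar:
                names = tar.getnames()
        except (tarfile.TarError, OSError, EOFError) as exc:
            problems[archive.name] = f"unreadable archive: {exc}"
            continue
        # Assumed layout: package.json sits under a top-level "package" folder.
        if "package/package.json" not in names:
            problems[archive.name] = "missing package/package.json"
    return problems
```

Calling `check_dataset` on a dataset directory returns an empty dictionary when every archive matches the expected layout.
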
Use the following command to train a classifier.
$ python3 cli.py train -a training -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name>

The parameters related to model settings are stored in conf/settings.json and are listed under train in the table above.
Use the following commands to train and save a classifier with the given hyperparameters.
# NB
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hs <smoothing>
# MLP
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hr <learning_rate> -hl <number_of_layers> -hi <number_of_iterations> -ho <optimization_algorithm> -ha <activation_function>
# RF
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -he <number_of_decision_trees> -hd <maximum_depth>
# SVM
$ python3 cli.py train -a save -m <malicious_dataset_name> -b <benign_dataset_name> -p <preprocess_method> -o <model_name> -hg <Gamma> -hc <C>

The parameters for this step are listed under predict in the table above.
Use the following command to predict packages.
$ python3 cli.py predict -o <model_name> -d <dataset_name>

For convenience, you can run the steps above with a single command to predict a single package.
$ python3 cli.py predict -o <model_name> -p <package_path>

Hyperparameter values of the four classifiers. Boldface marks the best hyperparameter value for each model.
| Model | Hyperparameter |
|---|---|
| NB | Smoothing terms: (1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4) |
| MLP | Learning rate: 5 values randomly selected from a uniform distribution over the interval [0.01, 0.2] (0.0505); Number of hidden units: (16, 32, 100, 150); Number of iterations: (400, 600); Optimization algorithm: (lbfgs, adam) |
| RF | Number of decision trees: (16, 32, 64, 100, 128, 256, 512); Maximum depth: (3, 5, 7, 11, 15) |
| SVM | Gamma: (scale, auto, 3 values randomly selected from a normal distribution with mean 0.2 and standard deviation 0.075) (scale); C: 3 values randomly selected from a uniform distribution over the interval [0.5, 2.0] (1.0704) |
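
To make the search space in the table concrete, the sketch below writes it out as plain Python dictionaries and enumerates the candidate configurations per model. It assumes an exhaustive grid over the listed values, which the table does not state. The dictionary keys and the `sample_uniform`, `sample_normal`, and `candidates` helpers are illustrative, and the random seed is arbitrary, so the sampled values will not reproduce the ones reported in the table.

```python
import random
from itertools import product
from math import prod

rng = random.Random(0)  # arbitrary seed, chosen only for repeatability


def sample_uniform(low: float, high: float, n: int) -> list[float]:
    """Draw n values uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(n)]


def sample_normal(mean: float, std: float, n: int) -> list[float]:
    """Draw n values from a normal distribution."""
    return [rng.gauss(mean, std) for _ in range(n)]


# Search space transcribed from the table; key names are assumptions.
GRIDS = {
    "NB": {"smoothing": [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4]},
    "MLP": {
        "learning_rate": sample_uniform(0.01, 0.2, 5),
        "hidden_units": [16, 32, 100, 150],
        "iterations": [400, 600],
        "optimizer": ["lbfgs", "adam"],
    },
    "RF": {
        "n_trees": [16, 32, 64, 100, 128, 256, 512],
        "max_depth": [3, 5, 7, 11, 15],
    },
    "SVM": {
        "gamma": ["scale", "auto", *sample_normal(0.2, 0.075, 3)],
        "C": sample_uniform(0.5, 2.0, 3),
    },
}


def candidates(model: str):
    """Yield every hyperparameter combination for one model as a dict."""
    grid = GRIDS[model]
    for values in product(*grid.values()):
        yield dict(zip(grid.keys(), values))


for model, grid in GRIDS.items():
    print(model, prod(len(v) for v in grid.values()), "candidates")
```

Under the exhaustive-grid assumption this gives 6 candidates for NB, 80 for MLP, 35 for RF, and 15 for SVM.
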
- Dataset: The malicious dataset mal and the benign dataset ben are located in datasets/MalnpmDB and contain 3258 and 4051 packages, respectively.
- Training and Validation Results: Model training and validation results are stored in the training/result directory, in files named ***_validation.csv, where *** is the model name.
Because the paper has not yet been published, and for security reasons, we cannot host the malicious package dataset here. If you need the dataset, please send a request to [email protected].
Bug reports and improvement suggestions are appreciated. Please cite our paper if you use the code or data in your work.
Thanks!