# Guide for compressing your own networks
## Table of contents

- Pruning
- Quantization
- Pruning and Quantization

## Pruning
- Load your pre-trained network:
```python
# get YourPretrainedNetwork and load pre-trained weights for it
# (`checkpoint` holds your pre-trained checkpoint; see the sketch below for one way to obtain it)
model = YourPretrainedNetwork(args).to(device)
model.load_state_dict(checkpoint['state_dict'])
```
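If your pre-trained weights live in a checkpoint file on disk, a minimal sketch of obtaining `checkpoint` might look like the following; the file path is an illustrative placeholder, and the `'state_dict'` key simply mirrors the snippet above.

```python
import torch

# illustrative only: load a checkpoint that was saved as a dict containing a 'state_dict' entry
checkpoint = torch.load('path/to/your_pretrained_checkpoint.pth', map_location=device)
```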
- Set `config_list` and choose a suitable pruner:
```python
from lib.algorithms.pytorch.pruning import (TaylorFOWeightFilterPruner, FPGMPruner, AGPPruner)

# choose a pruner: agp, taylor, or fpgm
if args.pruner == 'agp':
    config_list = [{'sparsity': args.sparsity, 'op_types': ['Conv2d']}]
    pruner = AGPPruner(
        model,
        config_list,
        optimizer,
        trainer,
        criterion,
        num_iterations=1,
        epochs_per_iteration=1,
        pruning_algorithm='taylorfo',
    )
elif args.pruner == 'taylor':
    config_list = [{'sparsity': args.sparsity, 'op_types': ['Conv2d']}]
    pruner = TaylorFOWeightFilterPruner(
        model,
        config_list,
        optimizer,
        trainer,
        criterion,
        sparsifying_training_batches=1,
    )
elif args.pruner == 'fpgm':
    config_list = [{'sparsity': args.sparsity, 'op_types': ['Conv2d']}]
    pruner = FPGMPruner(
        model,
        config_list,
        optimizer,
        dummy_input=torch.rand(1, 3, 64, 64).to(device),
    )
else:
    raise NotImplementedError
```
- `sparsity` specifies the pruning sparsity, ranging from 0.0 to 1.0. A larger sparsity yields a more lightweight model.
- `op_types` specifies the type of operation to prune and can be `Conv2d`, `Conv3d`, or both.
- `optimizer`, `trainer`, and `criterion` are the same as those used when pre-training your network (see the sketch below).
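For reference, the snippet below is a minimal sketch of what `trainer` and `criterion` could look like; it assumes the pruner calls `trainer(model, optimizer, criterion, epoch)` to run one training epoch, and `train_loader` and `device` are placeholders from your own training setup.

```python
import torch

# assumed interface: the pruner calls trainer(model, optimizer, criterion, epoch) for one epoch
criterion = torch.nn.CrossEntropyLoss()

def trainer(model, optimizer, criterion, epoch):
    model.train()
    for data, target in train_loader:          # `train_loader` comes from your own data pipeline
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
```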
- Use the pruner to generate the pruning mask:
```python
# generate and export the pruning mask
pruner.compress()
pruner.export_model(
    os.path.join(args.save_dir, 'model_masked.pth'),
    os.path.join(args.save_dir, 'mask.pth')
)
```
- `model_masked.pth` includes the model weights together with the generated pruning mask.
- `mask.pth` includes only the generated pruning mask.
- Export your pruned model:
```python
from lib.compression.pytorch import ModelSpeedup

# initialize a new model instance and load the pre-trained weights with the pruning mask
model = YourPretrainedNetwork(args).to(device)
model.load_state_dict(torch.load(os.path.join(args.save_dir, 'model_masked.pth')))
masks_file = os.path.join(args.save_dir, 'mask.pth')

# use speedup_model() of ModelSpeedup to automatically export the pruned model
m_speedup = ModelSpeedup(model, torch.rand(input_shape).to(device), masks_file, device)
m_speedup.speedup_model()
```
- `input_shape` denotes the shape of your model input with batch size 1 (see the sketch after this list for an example).
- This automatic export method is susceptible to errors when unrecognized structures are present in your model. To help resolve any bugs that arise during pruning, we have compiled the known issues in our Bug Summary.
- If the errors are too numerous or too hard to resolve, we recommend manually exporting the pruned model by providing the topology of the network. Please refer to this link for more details.
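For illustration, here is one way to set `input_shape` and keep the pruned model; the 64×64 RGB shape mirrors the `dummy_input` of the FPGM example above, the file name `model_pruned.pth` is a placeholder, and saving via `state_dict()` assumes `speedup_model()` has already rewritten the model into its pruned structure.

```python
# illustrative: a single 64x64 RGB input (batch size 1), matching the FPGM dummy_input above
input_shape = (1, 3, 64, 64)

# after speedup_model(), `model` holds the pruned (smaller) structure; save it for fine-tuning
torch.save(model.state_dict(), os.path.join(args.save_dir, 'model_pruned.pth'))
```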
- Fine-tune your pruned model:
- To fine-tune the pruned model, we suggest following your own pre-training process to minimize the performance drop.
- Since the pruned model retains its pre-trained weights and has fewer parameters, a smaller `learning_rate` is often more effective during fine-tuning (see the sketch below).
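As an illustration only, the sketch below fine-tunes the pruned model with a reduced learning rate; the 0.1 scaling factor, the `finetune_epochs` argument, and the reuse of `trainer` and `criterion` from the pruning step are assumptions you should adapt to your own pre-training recipe.

```python
import torch

# assumption: start from a fraction of the pre-training learning rate
finetune_lr = args.lr * 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=finetune_lr, momentum=0.9, weight_decay=1e-4)

for epoch in range(args.finetune_epochs):        # `finetune_epochs` is a placeholder argument
    # reuse the same one-epoch training function and loss as in the pruning step
    trainer(model, optimizer, criterion, epoch)
```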
## Quantization
- Load your pre-trained network:
```python
# get YourPretrainedNetwork and load pre-trained weights for it
model = YourPretrainedNetwork(args).to(device)
model.load_state_dict(checkpoint['state_dict'])
```
- Initialize the dataloaders:
```python
import os
import random

import torch
import torch.utils.data as data
import torchvision.datasets as datasets
import torchvision.transforms as transforms

def get_data_loader(args):
    train_dir = os.path.join(args.data, 'train')
    train_dataset = datasets.ImageFolder(train_dir, transform=transforms.ToTensor())
    train_loader = data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True)

    val_dir = os.path.join(args.data, 'val')
    val_dataset = datasets.ImageFolder(val_dir, transform=transforms.ToTensor())
    val_loader = data.DataLoader(val_dataset, batch_size=args.batch_size, shuffle=False)

    # sample `calib_num` images from the training set for calibration
    n_train = len(train_dataset)
    indices = list(range(n_train))
    random.shuffle(indices)
    calib_sampler = torch.utils.data.sampler.SubsetRandomSampler(indices[:args.calib_num])
    calib_loader = data.DataLoader(train_dataset, batch_size=args.batch_size, sampler=calib_sampler)

    return train_loader, val_loader, calib_loader

train_loader, val_loader, calib_loader = get_data_loader(args)
```
- `calib_loader` uses a subset of the training dataset for calibration during the subsequent quantization.
- Specify `quan_mode` and the output paths for the ONNX model, the TensorRT engine, and the calibration cache:
```python
onnx_path = os.path.join(args.save_dir, '{}_{}.onnx'.format(args.model, args.quan_mode))
trt_path = os.path.join(args.save_dir, '{}_{}.trt'.format(args.model, args.quan_mode))
cache_path = os.path.join(args.save_dir, '{}_{}.cache'.format(args.model, args.quan_mode))

if args.quan_mode == "int8":
    extra_layer_bit = 8
elif args.quan_mode == "fp16":
    extra_layer_bit = 16
elif args.quan_mode == "best":
    extra_layer_bit = -1
else:
    extra_layer_bit = 32
```
- Define the `engine` for inference:
```python
from lib.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT

engine = ModelSpeedupTensorRT(
    model,
    input_shape,
    config=None,
    calib_data_loader=calib_loader,
    batchsize=args.batch_size,
    onnx_path=onnx_path,
    calibration_cache=cache_path,
    extra_layer_bit=extra_layer_bit,
)

if not os.path.exists(trt_path):
    # build the quantized TensorRT engine and export it
    engine.compress()
    engine.export_quantized_model(trt_path)
else:
    # reuse a previously exported TensorRT engine
    engine.load_quantized_model(trt_path)
```
- Use the `engine` for inference:

```python
loss, top1, infer_time = validate(engine, val_loader, criterion)
```
- `engine` is similar to `model` and can run inference on either the GPU or TensorRT.
- While the `eval()` method is necessary for `model` inference, it is not required for `engine`.
- Inference with `engine` returns both the outputs and the inference time (see the sketch below).
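For reference, below is a minimal sketch of a `validate` routine that accepts the `engine`; it assumes the engine exposes an `inference(data)` call returning the outputs together with the per-batch inference time, as described above, and it computes a simple top-1 accuracy in place of your own metric code.

```python
def validate(engine, val_loader, criterion):
    total_loss, total_top1, total_time, n_batches = 0.0, 0.0, 0.0, 0
    for data, target in val_loader:
        # no eval() call is needed for the engine; the assumed API returns (outputs, time)
        output, batch_time = engine.inference(data)
        target = target.to(output.device)
        total_loss += criterion(output, target).item()
        total_top1 += (output.argmax(dim=1) == target).float().mean().item() * 100.0
        total_time += batch_time
        n_batches += 1
    return total_loss / n_batches, total_top1 / n_batches, total_time / n_batches
```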
## Pruning and Quantization
- After completing the Pruning process outlined above, run the pruned model through the Quantization process, as sketched below.
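To make the combination concrete, here is a minimal end-to-end sketch that chains the two sections under the same assumptions as above; the 64×64 input shape is an illustrative placeholder, and `calib_loader`, `onnx_path`, `cache_path`, `trt_path`, and `extra_layer_bit` are prepared exactly as in the Quantization section.

```python
import os
import torch
from lib.compression.pytorch import ModelSpeedup
from lib.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT

# 1. prune: load the masked weights and export the pruned structure (as in the Pruning section)
input_shape = (1, 3, 64, 64)                 # illustrative; use your own input shape with batch size 1
model = YourPretrainedNetwork(args).to(device)
model.load_state_dict(torch.load(os.path.join(args.save_dir, 'model_masked.pth')))
masks_file = os.path.join(args.save_dir, 'mask.pth')
ModelSpeedup(model, torch.rand(input_shape).to(device), masks_file, device).speedup_model()

# (optionally fine-tune the pruned model here, as recommended above)

# 2. quantize: build a TensorRT engine from the pruned model (as in the Quantization section)
engine = ModelSpeedupTensorRT(
    model,
    input_shape,
    config=None,
    calib_data_loader=calib_loader,
    batchsize=args.batch_size,
    onnx_path=onnx_path,
    calibration_cache=cache_path,
    extra_layer_bit=extra_layer_bit,
)
engine.compress()
engine.export_quantized_model(trt_path)
```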