2025-08-19: Paper Summary: Reproducibility Study on Network Deconvolution
The “reproducibility crisis” in scientific research refers to growing concerns over the reliability and credibility of published findings across many fields, including the biomedical, behavioral, and life sciences (Laraway et al. 2019, Fidler et al. 2021). Over the past decade, large-scale reproducibility projects have revealed widespread failures to replicate published findings. For example, in 2015 the Open Science Collaboration reported that a large proportion of replicated studies produced weaker evidence for the original findings despite using the same materials. Machine learning faces a similar issue: researchers may publish impressive new methods, but if others cannot reproduce the results, progress stalls. Reproducibility matters. It’s science’s version of fact-checking.
In this blog post, we break down our recent effort to reproduce the results of the paper Network Deconvolution by Ye et al. (hereafter, the "original study"), published in 2020, which claimed that replacing Batch Normalization (BN) layers with “deconvolution layers” boosts model performance. We began this work in spring 2024 as our “CS895 Deep Learning Fundamentals” class project, then extended it and published it as a journal article in ReScience C, a venue dedicated to making science more reliable through reproducibility.
What is ReScience C?
ReScience C is a platinum open-access, peer-reviewed journal dedicated to promoting reproducibility in computational science by explicitly replicating previously published research. The journal was founded in 2015 by Nicolas Rougier, a team leader at the Institute of Neurodegenerative Diseases in Bordeaux, France, and Konrad Hinsen, a researcher at the French National Centre for Scientific Research (CNRS). It addresses the reproducibility crisis by encouraging researchers to independently reimplement computational studies using open-source software to verify and validate original results.
Unlike traditional journals, ReScience C operates entirely on GitHub, where submissions are managed as issues. This platform facilitates transparent, collaborative, and open peer-review processes. Each submission includes the replication code, data, and documentation, all of which are publicly accessible and subject to community scrutiny.
The journal covers a wide range of domains within computational science, including computational neuroscience, physics, and computer science. ReScience C provides valuable insights into the robustness of scientific findings and promotes a culture of open access and critical evaluation by publishing both successful and unsuccessful replication attempts.
Our approach in reproducing “Network Deconvolution”
We evaluated the claim that the Network Deconvolution (ND) technique improves deep learning model performance compared with BN. We re-ran the paper’s original experiments using the original code, the same datasets, and the same evaluation metrics to determine whether the reported results could be reproduced. Out of 134 test results, 116 (87%) reproduced the original findings within a 10% margin, demonstrating good reproducibility. We also examined the consistency of the reported values, documented discrepancies, and discussed why some results could not be consistently reproduced.
Introduction
BN is a widely used technique in deep learning that accelerates training and enhances prediction performance. However, recent research explores alternatives to BN that may further improve model accuracy. One such method is Network Deconvolution. In 2020, Ye et al. compared model performance with BN and with ND and found that ND can serve as an alternative to BN while improving performance. The technique replaces BN layers with deconvolution layers that aim to remove pixel-wise and channel-wise correlations from the input data. According to their study, these correlations cause a blur effect in convolutional neural networks (CNNs), making it difficult to identify and localize objects accurately. By decorrelating the data before it enters convolutional or fully connected layers, network deconvolution improves the training of CNNs.
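To make the decorrelation idea concrete, here is a minimal sketch of channel-wise ZCA whitening applied to a batch where a BN layer would normally sit. This is only an illustration of the principle, assuming a straightforward eigendecomposition-based whitening; the authors’ actual deconvolution layers also remove pixel-wise correlations over im2col patches and use faster approximations, so their repository remains the reference implementation.

```python
import torch

def zca_whiten(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Channel-wise ZCA whitening of a (N, C, H, W) batch.

    Illustrative sketch only: the original network deconvolution layers also
    decorrelate pixel-wise (over im2col patches) and use faster approximations.
    """
    n, c, h, w = x.shape
    # Treat every pixel as one C-dimensional sample.
    flat = x.permute(0, 2, 3, 1).reshape(-1, c)           # (N*H*W, C)
    flat = flat - flat.mean(dim=0, keepdim=True)          # center each channel
    cov = flat.T @ flat / flat.shape[0]                   # (C, C) channel covariance
    # Inverse square root of the covariance via eigendecomposition.
    eigvals, eigvecs = torch.linalg.eigh(cov)
    inv_sqrt = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T
    whitened = flat @ inv_sqrt                            # decorrelated samples
    return whitened.reshape(n, h, w, c).permute(0, 3, 1, 2)

# Example: decorrelate a batch before it enters a convolutional layer.
x = torch.randn(8, 3, 32, 32)
x_white = zca_whiten(x)
```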
In the original study, Ye et al. evaluated the method on 10 CNN architectures using the benchmark datasets CIFAR-10 and CIFAR-100, and further validated the results on the ImageNet dataset. They reported consistent performance improvements for ND over BN. Motivated by the potential of this method, we attempted to reproduce these results. We used the same datasets and followed the same methods, but incorporated updated versions of software libraries when necessary to resolve compatibility issues, making this a “soft-reproducibility” study. Unlike “hard” reproducibility, which replicates every detail exactly, soft reproducibility offers a practical approach while still assessing the reliability of the original findings.
Methodology
The authors of the original study reported six values per architecture for the CIFAR-10 dataset: three for BN (at 1, 20, and 100 epochs) and three for ND (at 1, 20, and 100 epochs). They reported the CIFAR-100 results in the same way. To assess both reproducibility and consistency, we conducted three runs for each reported value on the CIFAR-10 and CIFAR-100 datasets. For instance, we repeated the experiment with the same hyperparameter settings (Table 1) for batch normalization at 1 epoch on a given architecture three times, recording the outcome of each run. We then averaged the three results and compared the average to the corresponding value from the original study.
Table 1: Hyperparameter settings that we used to reproduce results in Ye et al.
Results
Table 2 shows the results from Table 1 in Ye et al. (Org. Value) and the averaged reproduced values from our study (Rep. Avg) for the CIFAR-10 dataset at 1, 20, and 100 epochs. Architectures: (1) VGG-16, (2) ResNet-18, (3) Preact-18, (4) DenseNet-121, (5) ResNext-29, (6) MobileNet v2, (7) DPN-92, (8) PNASNet-18, (9) SENet-18, (10) EfficientNet (all values are percentages). Values in red indicate that the reproduced result is lower than the original value by more than 10%; values in green indicate that the reproduced value is greater than the original value; values in black indicate that the reproduced value is lower than the original value, but by no more than 10%.
Table 2: Reproduced results for CIFAR-10 dataset.
Table 3 shows the results from Table 1 in Ye et al. and the averaged reproduced values from our study for the CIFAR-100 dataset at 1, 20, and 100 epochs. The architectures, value format, and color codes are the same as in Table 2.
Table 3: Reproduced results for CIFAR-100 dataset.
The results indicate that although network deconvolution generally enhances model performance, there are cases where batch normalization performs better. To assess reproducibility, we applied a 10% accuracy threshold as our evaluation criterion. On the CIFAR-10 and CIFAR-100 datasets, 36 out of 60 values (60%) were reproduced with outcomes that improved on the original values. For the ImageNet dataset, 9 out of 14 values showed better reproduced performance. We identified a few instances, particularly on CIFAR-10 and CIFAR-100, where the reproduced accuracy was lower than originally reported, mostly when models were trained for just 1 epoch. For models trained for 20 and 100 epochs, however, the reproduced results were generally higher and closely aligned with the original study’s accuracy. One exception is the PNASNet-18 architecture, which performed relatively poorly with both batch normalization and network deconvolution.
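As an illustration of this criterion, the snippet below shows how a single reported value could be classified under the color scheme of Tables 2 and 3. The function name and the accuracy numbers are made up for illustration and are not taken from the original study.

```python
# Hypothetical illustration of the evaluation rule used in Tables 2 and 3:
# three reproduction runs are averaged, then compared to the original value.
def classify(original: float, runs: list[float], margin: float = 0.10) -> str:
    """Return the color code for one reported value."""
    rep_avg = sum(runs) / len(runs)
    if rep_avg > original:
        return "green"                       # reproduced value is higher
    if (original - rep_avg) / original <= margin:
        return "black"                       # lower, but within the 10% margin
    return "red"                             # lower by more than 10%

# Made-up accuracies (%) for one architecture / epoch setting.
print(classify(original=92.5, runs=[91.8, 92.1, 92.6]))   # -> "black"
```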
We also report the reproduced top-1 and top-5 accuracy values for BN and ND for the VGG-11, ResNet-18, and DenseNet-121 architectures on the ImageNet dataset. All reproduced results fall within our reproducibility threshold and confirm the main claim of the original study (Tables 4 and 5).
Table 4: Accuracy values reported in Table 2 of the original study and the reproduced values for VGG-11 with 90 epochs (Rep.: reproduced value).
During our reproducibility study, we observed a noticeable difference in training time between BN and ND, which was not reported in the original paper. Training time is a critical factor when building a deep learning architecture and deciding on computing resources. Therefore, we compared the training times of BN and ND observed across the 10 deep learning architectures (Figures 2 and 3).
Figure 2: Training times for each CNN architecture with CIFAR‐10 dataset: (a) with 1 epoch, (b) with 20 epochs, (c) with 100 epochs
Figure 3: Training times for each CNN architecture with CIFAR‐100 dataset: (a) with 1 epoch, (b) with 20 epochs, (c) with 100 epochs
Large gaps in training time between BN and ND are visible for DenseNet-121 and DPN-92. The smallest gap is seen for EfficientNet on CIFAR-10. A similar trend is observed on CIFAR-100, except that ResNet-18 shows the smallest time difference at 20 epochs.
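For readers who want to collect comparable numbers, a minimal timing pattern is sketched below. Here `train_one_epoch`, `model`, and `loader` are placeholders for the training loop in the repository; only the wall-clock measurement is shown.

```python
import time

def timed_training(model, loader, epochs, train_one_epoch):
    """Measure wall-clock training time (seconds) over a fixed number of epochs."""
    start = time.perf_counter()
    for _ in range(epochs):
        train_one_epoch(model, loader)   # placeholder for the actual training step
    return time.perf_counter() - start
```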
Discussion
Our reproducibility study of network deconvolution shows that out of 134 test results, 116 (87%) reproduced the original findings within a 10% margin, demonstrating good reproducibility. Notably, 80 results performed better than in the original study, with higher accuracy scores across different models and datasets. These improvements likely stem from updated software libraries, improved optimization routines, better hardware, and more stable numerical calculations in the newer library versions. While network deconvolution generally outperformed batch normalization, there were exceptions where batch normalization was superior, such as ResNet-18 on certain datasets. Although network deconvolution requires longer training times, the performance gains typically justify the extra computational cost. The technique has since been adopted in applications including image enhancement, medical imaging, and AI image generation, supporting its practical value in real-world scenarios.
Behind the Scenes: Challenges and Solutions
Reproducing the study was not without obstacles. On the software side, we encountered PyTorch compatibility issues, dependency conflicts, and debugging challenges. Working with the ImageNet dataset also proved demanding due to its size (over 160 GB) and changes in its folder structure, which required additional investigation to resolve. Hardware constraints were another factor: we had to upgrade from 16 GB to 80 GB GPUs to handle ImageNet training efficiently. These experiences emphasize that reproducibility is not only about having access to the original code, but also about adapting to evolving tools, datasets, and computational resources.
Why this matters for the Machine Learning Community
Reproducibility studies such as this one are essential for validating foundational claims and ensuring that scientific progress builds on reliable evidence. Changes in software or hardware can unintentionally improve performance, which highlights the importance of providing context when reporting benchmarking results. By making our code and data openly available, we enable others to verify our findings and extend the work. We encourage researchers to take part in replication efforts or contribute to venues such as ReScience C to strengthen the scientific community.
Conclusions
Our study finds that the accuracy results reported in the original paper are reproducible within a 10% threshold. This verifies the authors’ primary claim that network deconvolution improves the performance of deep learning models compared with batch normalization.
Important Links
Original Study:
Read the paper: https://openreview.net/pdf?id=rkeu30EtvS
GitHub repository: https://github.com/yechengxi/deconvolution
Our Reproducibility Study:
Read the paper: https://doi.org/10.5281/zenodo.15321683
GitHub repository: https://github.com/lamps-lab/rep-network-deconvolution
Pre-trained model weights: https://osf.io/hp3ab/files/osfstorage
Review thread: https://github.com/ReScience/submissions/issues/89
Join ReScience C: https://rescience.github.io/write/