The Pitfalls of Public Datasets: A Cautionary Tale for Machine Learning

August 14, 2024, 6:09 am
In the world of machine learning, data is king. It drives algorithms, shapes models, and ultimately determines success or failure. Yet a recent study by MIT researchers has cast a shadow over the reliability of public datasets: they found label error rates of up to 10% in some of the most popular datasets used for training neural networks. That finding is a wake-up call for researchers and practitioners alike.

For over five years, my team and I have been knee-deep in network traffic analysis and machine learning, specifically in developing models to detect cyber attacks. We often relied on public datasets, believing them to be reliable and well-curated. However, our experiences have led us to a stark conclusion: we can no longer trust public datasets.

**The Journey Begins**

Back in 2019, we embarked on a quest to enhance signature-based intrusion detection systems using machine learning. Our goal was ambitious: to create a model capable of identifying new, previously unknown attacks—so-called zero-day threats. To kickstart our research, we chose the CICIDS2017 dataset, a popular public dataset for intrusion detection.

At the time, the decision seemed logical. CICIDS2017 was frequently cited in academic literature, and we believed it would provide a solid foundation for our experiments. The dataset, developed by the Canadian Institute for Cybersecurity, contains over 50 GB of raw data and includes eight pre-processed files with labeled sessions. Each session is characterized by 85 features, ranging from source IP to flow duration.
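
For readers who want to follow along, here is a minimal sketch of how the pre-processed files can be loaded with pandas. The directory layout is an assumption for illustration, and the `skipinitialspace` option is there because distributed copies of the CSVs are often reported to carry stray leading spaces in column names.

```python
from pathlib import Path

import pandas as pd

# Directory holding the eight pre-processed CSV files (path is an assumption).
DATA_DIR = Path("CICIDS2017/MachineLearningCVE")

frames = []
for csv_path in sorted(DATA_DIR.glob("*.csv")):
    # skipinitialspace strips the stray leading spaces some column names carry
    df = pd.read_csv(csv_path, skipinitialspace=True, low_memory=False)
    df["source_file"] = csv_path.name  # keep provenance for debugging later
    frames.append(df)

flows = pd.concat(frames, ignore_index=True)
print(flows.shape)  # (total sessions, features + label + provenance column)
```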

**The Cracks Begin to Show**

As we delved deeper into our research, we encountered several issues with the CICIDS2017 dataset. Initially, we brushed off minor discrepancies, believing that no dataset is perfect. However, as we progressed, the problems became more pronounced.

First, we noticed duplicate features within the dataset. For instance, the feature "Fwd Header Length" appeared alongside "Fwd Header Length.1," both holding identical values. This redundancy raised red flags about the dataset's integrity.
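
A simple pairwise check is enough to surface such duplicates. The sketch below assumes the files have already been concatenated into a single `flows` DataFrame, as in the earlier loading sketch:

```python
# Pairwise check for columns that duplicate another column's values,
# e.g. "Fwd Header Length" vs. "Fwd Header Length.1".
cols = flows.columns.tolist()
duplicates = []
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if flows[a].equals(flows[b]):
            duplicates.append((a, b))

print(duplicates)
# Drop the redundant copies before training
flows = flows.drop(columns=[b for _, b in duplicates])
```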

Next, we discovered that session identifiers, or "Flow IDs," contained null values. Out of nearly half a million records, we were left with only a fraction after cleaning the data. This significant loss of data points compromised our model's training and testing phases.
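
Quantifying the damage is straightforward. The sketch below assumes the session identifier column is named `Flow ID`, which may differ between distributed variants of the dataset:

```python
n_total = len(flows)
n_null = flows["Flow ID"].isna().sum()
print(f"{n_null} of {n_total} sessions have no Flow ID")

# Keep only sessions with a usable identifier
flows = flows.dropna(subset=["Flow ID"])
```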

Moreover, we found non-numeric values within features that were supposed to be strictly numerical, such as "Flow Bytes/s" and "Flow Packets/s." These inconsistencies made it increasingly difficult to trust the dataset.
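
A quick way to locate such entries is to coerce the affected columns to numeric and count what fails to parse or parses to infinity. The column names follow the published feature list; everything else here is illustrative:

```python
import numpy as np

for col in ["Flow Bytes/s", "Flow Packets/s"]:
    as_num = pd.to_numeric(flows[col], errors="coerce")  # unparsable -> NaN
    n_nan = int(as_num.isna().sum())
    n_inf = int(np.isinf(as_num).sum())
    print(f"{col}: {n_nan} unparsable or missing, {n_inf} infinite values")
    # Normalize both kinds of offenders so they can be dropped or imputed later
    flows[col] = as_num.replace([np.inf, -np.inf], np.nan)
```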

**A Deeper Dive into the Dataset's Flaws**

Determined to understand the root of these issues, we decided to analyze the raw traffic data from CICIDS2017. We aimed to recreate the dataset using our own tools. However, our results did not match the original dataset. This discrepancy pointed to potential flaws in the data collection process, specifically in the CICFlowMeter tool used to generate the dataset.
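
The comparison itself can be done by aligning flows on their identifiers and checking a feature side by side. The sketch below is illustrative only; the `regenerated_flows.csv` file and the choice of `Flow Duration` as the compared feature are assumptions:

```python
# Align our regenerated flows with the published ones and compare one feature.
ours = pd.read_csv("regenerated_flows.csv", skipinitialspace=True)

# Flow ID is not unique when the same 5-tuple recurs, so keep one record per
# identifier for this rough comparison.
orig_dedup = flows.drop_duplicates("Flow ID")
ours_dedup = ours.drop_duplicates("Flow ID")

merged = orig_dedup.merge(ours_dedup, on="Flow ID", suffixes=("_orig", "_ours"))
mismatch = merged["Flow Duration_orig"] != merged["Flow Duration_ours"]
print(f"{mismatch.mean():.1%} of matched flows disagree on Flow Duration")
```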

One major issue we uncovered was the incorrect termination of sessions. The dataset defined a session as complete upon the first appearance of a packet with the FIN flag. This flawed logic led to many sessions being inaccurately recorded, resulting in a plethora of sessions with only one or two packets and zero payload length.

Additionally, we discovered that the CICFlowMeter tool did not account for TCP session resets when a packet with the RST flag was received. While the latest version of the tool addressed this issue, the dataset itself remained unchanged, perpetuating the errors.
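
To make both termination problems concrete, here is a minimal sketch of the kind of per-flow state tracking one would expect instead: a flow closes only after both directions have sent a FIN, or immediately on a RST. The flag representation and direction labels are assumptions for illustration; this is not the CICFlowMeter implementation.

```python
class FlowState:
    """Tracks TCP teardown for a single bidirectional flow."""

    def __init__(self):
        self.fin_fwd = False   # FIN seen from the initiator
        self.fin_bwd = False   # FIN seen from the responder
        self.closed = False

    def update(self, direction: str, flags: set) -> None:
        if "RST" in flags:     # a reset tears the connection down at once
            self.closed = True
            return
        if "FIN" in flags:
            if direction == "fwd":
                self.fin_fwd = True
            else:
                self.fin_bwd = True
        # mark the flow finished only once both directions have sent FIN
        if self.fin_fwd and self.fin_bwd:
            self.closed = True


state = FlowState()
state.update("fwd", {"FIN", "ACK"})
print(state.closed)  # False: only one side has closed
state.update("bwd", {"FIN", "ACK"})
print(state.closed)  # True: both sides have closed
```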

**The Snowball Effect of Errors**

As we continued our investigation, we identified even more problems. The timeout value for sessions was inconsistently documented, leading to further confusion. The dataset claimed a timeout of 600 seconds, while the tool used a value of 120 seconds. Such discrepancies can lead to significant differences in session analysis.
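
The effect of the timeout value is easy to demonstrate: the same packet sequence is cut into a different number of flows depending on the threshold. The timestamps below are invented purely for illustration.

```python
def split_into_flows(timestamps, timeout):
    """Group packet timestamps (seconds) of one 5-tuple into flows,
    starting a new flow whenever the idle gap exceeds the timeout."""
    flows, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > timeout:
            flows.append(current)
            current = []
        current.append(ts)
    flows.append(current)
    return flows


packets = [0, 30, 200, 230, 900]            # illustrative arrival times
print(len(split_into_flows(packets, 120)))  # 3 flows: the 170 s and 670 s gaps split
print(len(split_into_flows(packets, 600)))  # 2 flows: only the 670 s gap splits
```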

We also found that the calculation of packet lengths was flawed. The dataset erroneously included padding bytes from Ethernet frames in the TCP packet length, skewing our results. This miscalculation extended to various statistical features, leading to inconsistencies in metrics like "Packet Length Mean" and "Average Packet Size."
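
The usual fix is to derive payload length from the IP header rather than from the captured frame size, since short Ethernet frames are padded up to the 60-byte minimum. Here is a hedged sketch using scapy; the pcap path is a placeholder.

```python
from scapy.all import IP, TCP, rdpcap


def tcp_payload_len(pkt) -> int:
    """Payload length derived from the IP total length and header sizes,
    so Ethernet padding bytes are never counted."""
    ip, tcp = pkt[IP], pkt[TCP]
    return ip.len - ip.ihl * 4 - tcp.dataofs * 4


for pkt in rdpcap("sample.pcap"):      # path is an assumption
    if IP in pkt and TCP in pkt:
        frame_len = len(bytes(pkt))    # what a naive, padding-inclusive count uses
        print(frame_len, tcp_payload_len(pkt))
```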

The final straw came when we examined the features related to subflows. The logic used to calculate these features was convoluted and riddled with errors, further eroding our confidence in the dataset.

**The Bigger Picture**

Our experience with the CICIDS2017 dataset serves as a cautionary tale for the machine learning community. Public datasets, while convenient, can harbor significant flaws that compromise research integrity. As we strive for advancements in technology, we must also prioritize the quality of our data.

The implications of relying on flawed datasets extend beyond academic research. In real-world applications, erroneous data can lead to misguided decisions, ineffective models, and, ultimately, security vulnerabilities. The stakes are high, especially in fields like cybersecurity, where the cost of failure can be catastrophic.

As we move forward, we must adopt a more critical approach to public datasets. Rigorous validation processes, transparency in data collection methods, and continuous updates are essential to ensure data integrity.

In conclusion, while public datasets can provide a valuable starting point, they should not be taken at face value. As researchers and practitioners, we must remain vigilant, questioning the reliability of our data sources. Only then can we build robust models that stand the test of time and contribute meaningfully to our fields. The road ahead may be fraught with challenges, but with a discerning eye, we can navigate the complexities of machine learning and data science.