How to Create a Good Dataset
Preface
When starting to tackle a deep learning problem, the most common issue we encounter is "no data available." Missing data, messy data, unlabeled data, and poor annotation quality all stand in our way before we can even begin solving the problem. To address these issues, this section introduces how to create a good dataset, hoping to provide help to everyone.
Before You Start
Before starting to create a dataset, we first need to answer the following questions:
- What kind of problem do we need to solve in our scenario?
- What kind of data is needed to solve this problem?
- Are there any publicly available datasets similar to our scenario?
- How much data can we collect within a unit of time?
- What is the cost of annotating a unit of data?
Creation Steps
Define the Task
The first step in solving a problem and creating a dataset is to determine what our task is. If we don't clearly understand the task to be solved, all our subsequent work will be wasted.
When defining the task, we must first clarify our scenario: what kind of input leads to what kind of output. Once we clarify the input and output, we can roughly know what kind of problem we're facing.
After clarifying the problem, break it down into one or more algorithmic sub-problems. Assume the corresponding models already exist, and check whether the overall workflow would solve the problem. If so, start planning how to train those algorithms and create the datasets the models require.
Design Data Distribution
The scenarios covered by the dataset are determined by the task objectives. For the recognition targets and scenes that may appear in the task, typical distribution dimensions include language (e.g., Chinese vs. English), grayscale vs. color, weather conditions, and the distribution of the content itself.
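A quick way to audit the distribution dimensions above is to count attribute values over the collected samples. The records and attribute names below are hypothetical placeholders for whatever metadata your scenario tracks:

```python
from collections import Counter

# Hypothetical sample records: each describes one sample's scenario attributes.
samples = [
    {"language": "zh", "color": "color",     "weather": "sunny"},
    {"language": "en", "color": "grayscale", "weather": "rainy"},
    {"language": "zh", "color": "color",     "weather": "cloudy"},
    {"language": "en", "color": "color",     "weather": "sunny"},
]

# Count how often each attribute value appears, to spot under-covered scenarios.
for attr in ("language", "color", "weather"):
    counts = Counter(s[attr] for s in samples)
    print(attr, dict(counts))
```

If one value dominates a dimension (say, 95% sunny images), either collect more of the missing scenario or document the gap as a known limitation.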
Split the Dataset
The training set is used to train the model: the parameter weights of the neural network are learned from its input-output pairs. The validation set is used to compare intermediate models and select the one that performs best on held-out data. The test set is used to evaluate the effectiveness of the final model trained on the training set. A real-life analogy:
- The training set is equivalent to a student's textbook: students master knowledge based on the content in the textbook.
- The validation set is equivalent to a student's homework: through homework, we can know different students' learning progress and improvement speed.
- The test set is equivalent to the final exam: the exam questions are ones never seen before, testing students' ability to draw inferences.
After obtaining the raw dataset, we need to split it. A single sample must not appear in the training, validation, and test sets simultaneously. A 6:2:2 split ratio is commonly used, and each part should cover the various data scenarios as comprehensively as possible.
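The 6:2:2 split above can be sketched as follows. `split_dataset` is an illustrative helper, not part of any particular framework; a fixed seed keeps the split reproducible:

```python
import random

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle and split samples into disjoint train/val/test subsets."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded shuffle -> reproducible split
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

For uneven class distributions, a stratified split (splitting each class with the same ratios, then merging) keeps every subset representative.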
Annotate Data
Data annotation tools are not the main focus of this article; well-known open-source projects such as LabelImg, CVAT, and Label Studio can be used.
Organize Dataset Format
Organize the annotated dataset by folders, and it's recommended not to place more than 1000 files in a single folder. Then create a metadata file "meta.csv".
meta.csv is the metadata file in the HyperAI data specification format. For detailed format explanation, see: Data Format Specification Introduction
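A minimal sketch of generating a metadata file by walking the organized folders. The column name below is illustrative only; the exact required fields are defined by the HyperAI Data Format Specification referenced above:

```python
import csv
import os

def write_meta(root, out_path="meta.csv"):
    """Walk the organized dataset folders and record each file's relative path.

    The "filename" column is a placeholder; consult the data format
    specification for the actual required metadata fields.
    """
    rows = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rows.append({"filename": rel})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["filename"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The same walk is also a convenient place to enforce the folder-size recommendation, e.g., warning when any directory holds more than 1000 files.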
Problems You May Encounter When Creating Datasets
1. How Much Data Do You Need?
Every project is unique. A common rule of thumb is to have roughly 10 times as many samples as model parameters; the more complex the task, the more data is needed.
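As a back-of-the-envelope calculation under that rule of thumb (the helper name and the factor default are illustrative, not a standard):

```python
def suggested_dataset_size(num_params, factor=10):
    """Rough rule of thumb: ~10 samples per trainable parameter."""
    return num_params * factor

# A hypothetical small model with 50,000 trainable parameters:
print(suggested_dataset_size(50_000))  # 500000
```

Treat the result as an order-of-magnitude estimate, not a requirement; transfer learning and data augmentation can substantially reduce the amount of labeled data needed.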
2. I Already Have a Dataset, What's Next?
Don't rush to start training. First get to know the existing dataset: you will almost certainly find errors, invalid entries, and messy parts, and these should be fixed first. The quality of the dataset determines every later step of your machine learning project.
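Two of the most common problems, empty files and byte-identical duplicates, can be flagged mechanically. A minimal sketch (`find_issues` is a hypothetical helper; real cleaning would also cover corrupt files and label errors):

```python
import hashlib
import os

def find_issues(root):
    """Flag empty files and byte-identical duplicates under a dataset folder."""
    seen = {}           # content hash -> first path seen with that content
    empty, duplicates = [], []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in seen:
                duplicates.append((path, seen[digest]))
            else:
                seen[digest] = path
    return empty, duplicates
```

Duplicates matter especially when splitting: a sample duplicated across the training and test sets silently inflates evaluation scores.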
3. What If I Don't Have Enough Data?
If open-source datasets with similar scenarios exist, you can merge them into your own data and use the combination. If the scenario is unusual, first sort out the business scenario, set up proper data collection and tracking, and gather the data yourself.