The general rule is "the more the merrier", but no one likes to manually label data or pay someone else to do so. Graft recommends a minimum of 10 examples of each class (label value) in the data set. If the distribution of classes in the data is uneven, add more labeled examples for the more common classes so that the labels mirror that distribution.
Example 1: Product catalogue of 1000 items with 5 categories evenly distributed
Label (class) value | Minimum labels needed |
---|---|
Jackets | 10 |
Pants | 10 |
Shoes | 10 |
Tops | 10 |
Underwear | 10 |
Example 2: Product catalogue of 1000 items with 5 categories, where Jackets and Shoes each represent 35% of total items
Label (class) value | Minimum labels needed |
---|---|
Jackets | 35 |
Pants | 10 |
Shoes | 35 |
Tops | 10 |
Underwear | 10 |
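
The arithmetic behind both examples can be sketched in a few lines of Python. This is only one reading of "mirror that distribution": the rarest class gets the 10-label minimum and every other class is scaled up in proportion to its item count. The function name `min_labels_per_class` and its `floor` parameter are hypothetical and are not part of Graft.

```python
def min_labels_per_class(class_counts, floor=10):
    """Suggest a minimum number of labels per class.

    Assumption: the rarest class gets `floor` labels and the other
    classes are scaled proportionally to their item counts.
    """
    # Ignore classes with no items; they cannot be labeled.
    nonzero = {label: n for label, n in class_counts.items() if n > 0}
    if not nonzero:
        return {}
    smallest = min(nonzero.values())
    # Integer ceiling division avoids floating-point rounding surprises.
    return {label: -(-floor * n // smallest) for label, n in nonzero.items()}


# Example 2: Jackets and Shoes are 35% of 1000 items each,
# the remaining three categories are 10% each.
catalogue = {"Jackets": 350, "Pants": 100, "Shoes": 350, "Tops": 100, "Underwear": 100}
print(min_labels_per_class(catalogue))
# {'Jackets': 35, 'Pants': 10, 'Shoes': 35, 'Tops': 10, 'Underwear': 10}
```

With an even catalogue (200 items per category) the same function returns 10 for every class, which matches Example 1.
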
If you really don't want to label data, you can use Bootstrapping in Graft to create an initial set of labels, which you can then review using Active Learning. Alternatively, start with as many labels as you can and refine your Enrichment model over time.