Grocery Self Checkout Item Detection

Model Description

Context

This model is a YOLOv11 object detection model fine-tuned from COCO-pretrained weights to identify 17 grocery product categories in a retail self-checkout environment. It detects common grocery items from an overhead, top-down camera perspective, mimicking the view of a mounted self-checkout camera. The intended use case is store-specific automatic item detection, where the model assists with item counting, checkout verification, and loss/theft prevention. The model is best suited to stores whose inventory closely matches the training data; performance will degrade on brands or product types not represented in it.
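
As a usage sketch, the model can be loaded for overhead-camera inference with the Ultralytics Python API. The weights filename and image path below are placeholders, not artifacts named in this card; the class list is taken from the Class Distribution table:

```python
# The 17 product categories, in the alphabetical order used by the
# per-class evaluation table below.
CLASS_NAMES = [
    "alcohol", "candy", "canned_food", "chocolate", "dessert",
    "dried_food", "dried_fruit", "drink", "gum", "instant_drink",
    "instant_noodles", "milk", "personal_hygiene", "puffed_food",
    "seasoner", "stationery", "tissue",
]


def detect_items(weights: str = "best.pt", image: str = "checkout_frame.jpg"):
    """Run the fine-tuned detector on one overhead checkout frame.

    Requires `pip install ultralytics`; both paths are placeholders.
    """
    from ultralytics import YOLO

    model = YOLO(weights)
    result = model.predict(image, conf=0.5)[0]
    # Each detection: (category name, confidence score)
    return [(result.names[int(b.cls)], float(b.conf)) for b in result.boxes]
```

The returned (category, confidence) pairs are what a checkout system would feed into item counting or verification logic.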

Training Data

The training dataset is a subset of the RPC-Dataset (rpc-dataset.github.io), a large-scale retail product checkout dataset of 83,699 images across 200 grocery product classes. The working subset consists of 9,616 images spanning the same 200 classes, sourced via Roboflow (universe.roboflow.com/groceries-jxjfd/grocery-goods).

Annotation Process

The original RPC-Dataset contained 200 product-specific classes, where each class represented a specific product variant (e.g., 100_milk, 101_milk, 102_milk). These classes were collapsed into 17 broader product categories to improve generalization, reduce class imbalance, and better reflect how a self-checkout system categorizes items by type rather than specific SKU. For example, all milk classes were merged into a single milk class, reducing the total class count from 200 to 17. Random samples were reviewed after relabeling to validate annotation quality, with no corrections needed.
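
The class-collapsing step can be sketched as a small relabeling helper. The exact RPC label format is an assumption inferred from the `100_milk` example above (numeric SKU prefix, underscore, category name):

```python
def collapse_label(rpc_label: str) -> str:
    """Collapse a product-specific RPC class (e.g. '100_milk') into its
    broad category ('milk') by dropping the numeric SKU prefix.

    The '<number>_<category>' format is assumed from the example in the
    text; labels without a numeric prefix are returned unchanged.
    """
    prefix, _, category = rpc_label.partition("_")
    return category if prefix.isdigit() and category else rpc_label


# All milk variants merge into a single class:
assert collapse_label("100_milk") == "milk"
assert collapse_label("101_milk") == collapse_label("102_milk") == "milk"
```

Applying such a mapping to every annotation reduces the label space from 200 SKU-level classes to the 17 category-level classes listed below.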

The final dataset used for training, after the annotation process, is available at https://app.roboflow.com/bdata-497-advanced-topics-in-dv-nqagm/grocery-goods-ezyyb.

Class Distribution

| Class Name | Total Count | Training Count | Validation Count | Test Count |
|---|---|---|---|---|
| tissue | 4,813 | 3,369 | 963 | 481 |
| dessert | 4,372 | 3,060 | 874 | 437 |
| drink | 3,760 | 2,632 | 752 | 376 |
| seasoner | 3,199 | 2,239 | 640 | 320 |
| puffed_food | 3,156 | 2,209 | 631 | 316 |
| chocolate | 3,146 | 2,202 | 629 | 315 |
| instant_noodles | 3,033 | 2,123 | 607 | 303 |
| canned_food | 2,714 | 1,900 | 543 | 271 |
| milk | 2,517 | 1,762 | 503 | 252 |
| candy | 2,499 | 1,749 | 500 | 250 |
| personal_hygiene | 2,495 | 1,747 | 499 | 250 |
| instant_drink | 2,492 | 1,744 | 498 | 249 |
| alcohol | 2,381 | 1,667 | 476 | 238 |
| dried_fruit | 2,368 | 1,658 | 474 | 237 |
| dried_food | 2,222 | 1,555 | 444 | 222 |
| gum | 1,923 | 1,346 | 385 | 192 |
| stationery | 1,466 | 1,026 | 293 | 147 |

Train/Validation/Test Split

| Split | Ratio | Count |
|---|---|---|
| Train | 70% | 36,928 |
| Validation | 20% | 10,505 |
| Test | 10% | 5,276 |
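
A quick arithmetic check confirms the listed counts match the stated 70/20/10 ratios:

```python
# Image counts from the split table above.
splits = {"train": 36_928, "validation": 10_505, "test": 5_276}
total = sum(splits.values())  # 52,709

ratios = {name: count / total for name, count in splits.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
# train: 70.1%, validation: 19.9%, test: 10.0%
```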

Data Augmentation

The following augmentations were applied during training to simulate real-world checkout conditions:

| Augmentation | Purpose |
|---|---|
| Rotation | Items placed on belt in any orientation |
| Horizontal/Vertical Flip | Additional orientation variation |
| Mosaic | Multiple items on belt simultaneously |
| HSV Shift (hue, saturation, value) | Simulate varied store lighting |
| Translation & Scale | Camera height and position variation |
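
Each augmentation above maps onto a standard Ultralytics training hyperparameter. The parameter names below are real Ultralytics arguments, but the values are illustrative defaults; the card does not state the exact settings used:

```python
# Ultralytics hyperparameter names for the augmentations in the table.
# Values are illustrative only; the exact settings used are not stated.
AUG_ARGS = {
    "degrees": 90.0,   # rotation range: items in any orientation
    "fliplr": 0.5,     # horizontal flip probability
    "flipud": 0.5,     # vertical flip probability
    "mosaic": 1.0,     # mosaic probability: multiple items per image
    "hsv_h": 0.015,    # hue shift fraction
    "hsv_s": 0.7,      # saturation shift fraction
    "hsv_v": 0.4,      # value (brightness) shift fraction
    "translate": 0.1,  # translation fraction: camera position variation
    "scale": 0.5,      # scale gain: camera height variation
}
```

These would be passed as keyword arguments to `model.train(...)` alongside the other training settings.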

Known Biases and Limitations

  • Dataset is predominantly composed of Chinese grocery product packaging, limiting generalizability to Western or European retail environments
  • Fresh and unpackaged produce (such as fruits or vegetables) are not represented in the dataset
  • Limited lighting variation — real checkout environments may have inconsistent lighting not well represented in training images

Training Procedure

  • Framework: Ultralytics YOLOv11n
  • Hardware: A100 GPU in Google Colab
  • Epochs: 50
  • Batch Size: 64
  • Image Size: 640x640
  • Patience: 50
  • Training Time: ~36.5 minutes (2,189.69 seconds)
  • Preprocessing: Augmentations applied at training time (see Data Augmentation section)
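
Assuming a standard Ultralytics workflow, the run above can be reproduced roughly as follows; the dataset YAML path is a placeholder for the Roboflow export:

```python
# Hyperparameters stated in the Training Procedure section.
TRAIN_ARGS = {
    "epochs": 50,
    "batch": 64,
    "imgsz": 640,
    "patience": 50,  # early-stopping patience
}


def train(data_yaml: str = "grocery-goods.yaml"):
    """Fine-tune COCO-pretrained YOLO11n on the grocery dataset.

    Requires `pip install ultralytics`; the dataset YAML path is a
    placeholder, not a file named in this card.
    """
    from ultralytics import YOLO

    model = YOLO("yolo11n.pt")  # COCO-pretrained nano checkpoint
    return model.train(data=data_yaml, **TRAIN_ARGS)
```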

Evaluation Results

Comprehensive Metrics

All files output from the `runs\detect\train` folder are provided in the files section. The model was evaluated on a held-out test set of 1,928 images containing 9,825 instances across all 17 classes. It demonstrates strong performance across all metrics, achieving near-perfect precision and recall with a mAP50 of 0.992.

| Metric | Value |
|---|---|
| Precision | 0.989 |
| Recall | 0.985 |
| mAP50 | 0.992 |
| mAP50-95 | 0.862 |
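
These aggregates come from the standard Ultralytics validation routine. A sketch of reproducing them, where the checkpoint and dataset YAML paths are placeholders:

```python
# Aggregate test-set metrics reported above, for reference.
REPORTED = {"precision": 0.989, "recall": 0.985,
            "mAP50": 0.992, "mAP50-95": 0.862}


def evaluate(weights: str = "best.pt", data_yaml: str = "grocery-goods.yaml"):
    """Re-run evaluation on the held-out test split.

    Requires `pip install ultralytics`; both paths are placeholders.
    """
    from ultralytics import YOLO

    metrics = YOLO(weights).val(data=data_yaml, split="test")
    return {"mAP50": metrics.box.map50, "mAP50-95": metrics.box.map}
```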

Per-Class Breakdown

| Class | Test Images | Instances | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|---|---|
| all | 1,928 | 9,825 | 0.989 | 0.985 | 0.992 | 0.862 |
| alcohol | 252 | 503 | 0.996 | 0.986 | 0.995 | 0.864 |
| candy | 257 | 502 | 0.988 | 0.980 | 0.990 | 0.815 |
| canned_food | 252 | 545 | 0.982 | 0.996 | 0.990 | 0.877 |
| chocolate | 360 | 700 | 0.982 | 0.984 | 0.993 | 0.833 |
| dessert | 389 | 819 | 0.995 | 0.991 | 0.995 | 0.881 |
| dried_food | 244 | 415 | 0.982 | 0.993 | 0.995 | 0.877 |
| dried_fruit | 263 | 516 | 0.986 | 0.986 | 0.995 | 0.887 |
| drink | 360 | 796 | 0.982 | 0.990 | 0.994 | 0.871 |
| gum | 183 | 360 | 0.989 | 0.979 | 0.991 | 0.812 |
| instant_drink | 271 | 554 | 0.984 | 0.982 | 0.994 | 0.886 |
| instant_noodles | 302 | 614 | 0.989 | 0.997 | 0.995 | 0.888 |
| milk | 256 | 491 | 0.996 | 0.990 | 0.994 | 0.859 |
| personal_hygiene | 255 | 506 | 0.990 | 0.982 | 0.994 | 0.854 |
| puffed_food | 324 | 654 | 0.996 | 1.000 | 0.995 | 0.907 |
| seasoner | 302 | 572 | 0.986 | 0.965 | 0.993 | 0.849 |
| stationery | 162 | 300 | 0.986 | 0.957 | 0.972 | 0.785 |
| tissue | 482 | 978 | 0.999 | 0.994 | 0.995 | 0.909 |

Visual Examples of Classes

(To be added.)

Key Visualizations

Confusion Matrix

F1 Confidence Curve

Training & Validation Loss Curves

Performance Analysis

The model performs consistently well across all 17 classes on the held-out test set, with the lowest mAP50 being stationery at 0.972. The strongest classes were tissue and puffed_food (mAP50-95: 0.909 and 0.907), likely due to their distinct packaging shapes and high training sample counts. The weakest class was stationery (mAP50: 0.972, mAP50-95: 0.785), which is also the smallest class at 1,466 total images, suggesting performance is partially limited by sample size.

Limitations and Biases

When tested on the D2S dataset (in-the-wild images), performance dropped significantly. The model missed entire objects, produced low-confidence detections, and misclassified items; for example, it labeled a water bottle as instant_noodles. This suggests the model may have overfit to the specific visual patterns of the training data, or alternatively reflects a domain gap between Asian grocery packaging (training data) and the European products in D2S. Both explanations are plausible, and further testing on diverse datasets would be needed to distinguish between them.
