You Cannot Collect a Theft Dataset Without Filming the Theft

Real shoplifting footage is rare, privacy-loaded, and unevenly captured. A 571-image synthetic dataset across three theft behaviors and multiple camera views shows how to build coverage you cannot lawfully record.

AnywayLabs · 9 min read · Dataset: Jewelry Theft Detection

To build a model that detects theft, you need footage of theft. To get footage of theft, you either wait for crimes to happen on camera and capture identifiable people committing them, or you stage them. Both routes are constrained, and the constraints are exactly why theft detection models tend to fail on the events that matter. This is the long-tail problem wearing a privacy jacket, and it is the reason synthetic data has a specific job to do here.

This post looks at that problem through a concrete artifact: an open synthetic dataset for jewellery theft detection, covering three theft-related behaviors across multiple camera viewpoints. We will work through why real theft data is structurally hard to assemble, what the dataset covers, and where synthetic generation genuinely helps versus where it does not.

The Problem Is Real, and the Data Is Not Available

Retail theft is not a marginal cost. U.S. retailers reported steep, sustained increases in shoplifting incidents through the mid-2020s, and loss-prevention surveys put the dollar impact in the tens of billions annually. According to the National Retail Federation's 2025 theft and violence report, retailers saw roughly an 18% rise in the average number of shoplifting incidents in 2024 compared with the prior year. High-value categories like jewellery are a prime target for organized retail crime, where coordinated groups go after specific merchandise rather than acting opportunistically.

| Retail-theft signal | Reported context |
| --- | --- |
| Change in average shoplifting incidents | 18% rise in 2024 |
| Report source | NRF 2025 theft and violence reporting |
| Loss scale | Tens of billions annually |

So the demand for automated theft detection is real and growing. The trouble is the training data. A genuine theft event captured cleanly on a store camera is rare relative to the hours of ordinary shopping footage around it, and a single store may go weeks between incidents worth labeling. The events that do get captured are not evenly distributed across camera angles, lighting, store layouts, or theft techniques. The result is the familiar coverage gap: a model trained on whatever real incidents happened to be recorded learns those specific incidents, not the general shape of theft behavior.

There is a second constraint stacked on top of the first. Real surveillance footage is footage of identifiable people. Using it to train a model means collecting, storing, and reusing recognizable images of individuals, much of it of people who were never suspected of anything. That is a privacy and compliance burden before it is an engineering one. The dataset that would teach a model to recognize theft is, in its real form, expensive to label, hard to share, and legally sensitive to retain.

This is the double-bind that defines theft detection. The demand is rising, and the data needed to meet it is rare, skewed, and privacy-loaded. More careful collection does not resolve it. Generation does.

What the Dataset Covers

The synthetic jewellery theft detection dataset contains 571 annotated images, with bounding box annotations in YOLO format. Every image is fully synthetic, which is the point: it carries none of the identity or privacy load of real CCTV footage, because there are no real people in it.

| Coverage axis | Value |
| --- | --- |
| Images | 571 |
| Behavior classes | 3 |
| Resolution | Mixed |
| Annotation format | YOLO bounding boxes |
| Real identifiable people | 0 |
| Real CCTV footage used | 0% |

It defines 3 behavior classes, and the choice of classes is more thoughtful than a single "theft" label. Class 0, stealing, is the active theft action itself. Class 1, picking, is picking up an item, framed explicitly as a potential precursor to theft rather than as theft itself. Class 2, person_with_bag, captures a person carrying a bag in the scene.

| Class ID | Class | Catalog object count | Detection role |
| --- | --- | --- | --- |
| 0 | stealing | 211 | Active theft action |
| 1 | picking | 191 | Precursor action |
| 2 | person_with_bag | 179 | Contextual concealment cue |

That three-way split reflects how theft detection actually works in practice. The valuable signal is rarely a single frozen moment of "stealing." It is a sequence: someone picks up an item, someone is present with a means of concealment, and an action follows. By giving the precursor (picking) and the contextual cue (person_with_bag) their own classes rather than collapsing everything into a binary theft flag, the dataset lets a model learn the structure that precedes and surrounds a theft, which is where early, actionable detection lives. A system that only fires on completed theft is a system that alerts after the merchandise is already gone.
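One way to picture that structure is a small rule over per-frame detections. The sketch below is illustrative only: the `early_alert` function, the window length, and the confidence threshold are assumptions made for this example, not part of the dataset or any catalog tooling. It raises a precursor alert when picking and person_with_bag both appear within a short window, and a theft alert when stealing is detected.

```python
from collections import deque

# Class IDs and names as listed in the dataset's class table.
CLASS_NAMES = {0: "stealing", 1: "picking", 2: "person_with_bag"}


def early_alert(frames, window=30, min_conf=0.5):
    """Yield (frame_index, level) whenever the alert level changes.

    frames: iterable of lists of (class_id, confidence), one list per frame.
    window and min_conf are illustrative values, not tuned settings.
    """
    recent = deque(maxlen=window)  # class names seen in the last `window` frames
    last_level = None
    for idx, detections in enumerate(frames):
        seen = {CLASS_NAMES[c] for c, conf in detections if conf >= min_conf}
        recent.append(seen)
        if "stealing" in seen:
            level = "theft"
            recent.clear()  # start fresh so one incident does not re-trigger
        elif {"picking", "person_with_bag"} <= set().union(*recent):
            level = "precursor"
        else:
            level = None
        if level is not None and level != last_level:
            yield idx, level
        last_level = level


# Toy input: picking, then a bag in view, then the theft action.
toy_frames = [[], [(1, 0.9)], [(2, 0.8)], [(0, 0.95)], []]
alerts = list(early_alert(toy_frames))
```

On this toy sequence the precursor alert fires one frame before the theft alert, which is the early signal a binary theft label could not provide. A production system would more likely use tracking and a learned temporal model than a fixed rule.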

The catalog records the dataset resolution as Mixed, with framings consistent with overhead and angled retail surveillance cameras rather than a single fixed rig. The annotations localize the behavior within the frame rather than merely tagging the image, which is what a detection pipeline driving a real-time security alert requires.

| Catalog field | Value |
| --- | --- |
| Resolution | Mixed |
| Images | 571 |
| Classes | 3 |
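Mixed resolution is less of a problem for YOLO labels than it might sound, because each label line stores `class_id x_center y_center width height` with coordinates normalized to the image size. A minimal sketch of converting one label line to pixel corners follows. The helper name `yolo_to_pixels` and the example values are made up for illustration; the class names come from the dataset's class list.

```python
CLASS_NAMES = {0: "stealing", 1: "picking", 2: "person_with_bag"}


def yolo_to_pixels(line, img_w, img_h):
    """Convert one YOLO label line to (class_name, x_min, y_min, x_max, y_max)."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:])
    # Normalized center/size -> pixel corner coordinates.
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return CLASS_NAMES[class_id], x_min, y_min, x_max, y_max


# The same (made-up) label applied to two different image sizes.
small = yolo_to_pixels("1 0.5 0.5 0.25 0.5", 640, 480)
large = yolo_to_pixels("1 0.5 0.5 0.25 0.5", 1280, 960)
```

The label text is identical in both calls; only the image dimensions passed at load time change the pixel box.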

Why Multiple Camera Views Decide Whether It Works

The dataset spans multiple camera fields of view, including dedicated views labeled FOV1, FOV2, and FOV3 plus jewellery-specific viewpoints. This is the property that most directly attacks the real-world failure mode.

No two stores mount their cameras the same way. Angle, height, distance to the counter, and lens all vary, and each combination changes how a picking motion or a concealment gesture appears in the frame. A model trained on one viewpoint degrades on another, which is precisely the situation a real deployment guarantees: the camera in the store will not match the camera that captured the training data. A theft dataset built from a handful of real incidents inherits whatever angles those incidents happened to be filmed from, and inherits their blind spots wholesale.

Generating the data inverts that. The same behavior can be rendered across multiple viewpoints under controlled variation in lighting, retail background, and body positioning, so the coverage is a deliberate grid rather than an accident of what got recorded. The dataset's own design notes describe exactly this intent: rather than copying real data, it sets out to widen the behavioral distribution beyond the narrow set of real surveillance samples and to introduce controlled variation in how theft actions appear, with the explicit goal of improving robustness to unseen real-world shoplifting behavior.
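A deliberate grid of this kind can be sketched as a Cartesian product over the variation axes. Only the FOV1, FOV2, and FOV3 labels and the three class names come from the dataset description. The lighting and background values below are placeholders invented for the example, and nothing here describes the actual generation pipeline.

```python
from itertools import product

AXES = {
    "behavior": ["stealing", "picking", "person_with_bag"],
    "viewpoint": ["FOV1", "FOV2", "FOV3"],
    "lighting": ["bright", "dim"],              # placeholder values
    "background": ["counter", "display_wall"],  # placeholder values
}


def coverage_grid(axes):
    """Enumerate every combination of axis values as a dict per cell."""
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*(axes[k] for k in keys))]


def missing_cells(grid, rendered):
    """Return the grid cells that have no rendered sample yet."""
    done = {tuple(sorted(r.items())) for r in rendered}
    return [cell for cell in grid if tuple(sorted(cell.items())) not in done]


grid = coverage_grid(AXES)
todo = missing_cells(grid, rendered=grid[:10])  # pretend 10 cells are done
```

The point of the grid is the `missing_cells` check: with recorded footage the gaps are whatever happened not to be filmed, while with generated data they are an explicit to-do list.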

That is the synthetic value proposition stated plainly and correctly. The value is not photorealism for its own sake. It is control over coverage: behaviors, viewpoints, occlusion, and lighting generated to a chosen distribution rather than scavenged from whatever footage was lawfully and luckily available.

The Privacy Advantage Is Structural, Not Cosmetic

It is worth being precise about the privacy benefit, because it is easy to overstate.

Real surveillance theft datasets carry a high privacy cost because they contain the identities of real people. A synthetic dataset carries none, because the people in it do not exist. This is not a marginal improvement in handling; it removes an entire category of obligation. There is no consent to obtain, no identifiable footage to secure, no retention timeline to enforce on images of bystanders, and no barrier to sharing the dataset openly. The dataset card frames the comparison directly: where real CCTV data brings high identity-privacy concern and manual, costly annotation, the synthetic version brings none of the privacy exposure and automated annotation.

This matters beyond compliance paperwork. The privacy load on real theft footage is part of what keeps these datasets small, siloed, and unshareable, which is itself a cause of the coverage gap. Removing the privacy constraint does not just make the data easier to handle. It makes broad, shareable, well-covered theft datasets possible at all.

What the Catalog Evaluation Reports

Evidence discipline cuts both ways. The catalog page reports model results for the linked dataset, so the evaluation numbers can be stated directly.

| Evaluation metric | Catalog value |
| --- | --- |
| Model checkpoint | yolov8m.pt |
| mAP@0.5 | 0.9746 |
| mAP@0.5:0.95 | 0.8532 |
| Precision | 0.9322 |
| Recall | 0.9219 |
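For readers less familiar with these metrics: mAP@0.5 counts a predicted box as a match when its intersection over union (IoU) with a ground-truth box is at least 0.5, and mAP@0.5:0.95 averages over IoU thresholds from 0.5 to 0.95, so it rewards tighter localization. The sketch below shows the IoU computation and derives an F1 score from the reported precision and recall. The F1 value is computed here for illustration and is not a figure reported by the catalog.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two made-up boxes that overlap by half their width.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))

# F1 derived from the catalog's precision and recall.
derived_f1 = f1(0.9322, 0.9219)
```

The two half-overlapping boxes have an IoU of one third, so they would not count as a match at the 0.5 threshold. The derived F1 comes out at roughly 0.927.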

The dataset also notes that performance may vary across camera placements and lighting, and that its bounding boxes do not capture fine-grained body pose the way keypoint annotations would. Those are real constraints. Bounding-box theft detection tells you that a theft-related behavior is occurring in a region of the frame; it does not give you the joint-level pose detail that some behavior-recognition approaches rely on. For many retail-security triggers that is sufficient, but it is a design tradeoff worth stating rather than hiding.
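Variation across camera placements is easy to hide inside a single aggregate score, so it helps to break results out per view. The following is a minimal sketch, assuming each ground-truth object has already been marked as detected or missed by some upstream matching step. The `recall_by_view` helper and the toy records are hypothetical and are not results from the dataset.

```python
from collections import defaultdict


def recall_by_view(records):
    """Compute recall per camera view.

    records: iterable of (view, detected) pairs, one per ground-truth object.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for view, detected in records:
        totals[view] += 1
        hits[view] += int(bool(detected))
    return {view: hits[view] / totals[view] for view in totals}


# Toy records only; these are not measurements.
toy_records = [
    ("FOV1", True), ("FOV1", True),
    ("FOV2", True), ("FOV2", False),
]
per_view = recall_by_view(toy_records)
```

A large spread between views in a breakdown like this would point to the viewpoints that need more coverage or more real fine-tuning footage.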

Where Synthetic Fits

The case for synthetic theft data is narrower and more durable than a claim that it replaces real footage.

Synthetic generation is the practical way to get broad, balanced, privacy-clean coverage of theft behavior across the camera viewpoints a real deployment will actually encounter. It resolves the double-bind that real collection cannot: rising demand for detection, against data that is rare, skewed, and identity-loaded. That is the structural argument, and it holds.

What synthetic data does not do is certify that a model will work in a specific store. The dataset's own recommended use case says this clearly: it is well suited to pretraining a model before fine-tuning on real surveillance data. Synthetic builds the coverage floor and removes the privacy blocker on the bulk of the training set. A measured amount of real, properly governed footage from the target environment closes the residual gap to the specific cameras and lighting in that store. Treating synthetic as the entire solution overstates what it, or any synthetic set, can do alone.
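The pretrain-then-fine-tune split can be sketched with the Ultralytics Python package, which is an assumption here based on the `yolov8m.pt` checkpoint named in the catalog results. The directory layout, epoch counts, learning rate, and function names below are placeholders for illustration, not the dataset authors' recipe.

```python
CLASS_NAMES = {0: "stealing", 1: "picking", 2: "person_with_bag"}


def dataset_yaml(root, names):
    """Build the text of an Ultralytics-style dataset YAML.

    Assumes images live under <root>/images/train and <root>/images/val.
    """
    lines = [f"path: {root}", "train: images/train", "val: images/val", "names:"]
    lines += [f"  {idx}: {name}" for idx, name in sorted(names.items())]
    return "\n".join(lines) + "\n"


def pretrain_then_finetune(synthetic_yaml, real_yaml):
    """Stage 1 on synthetic data, stage 2 on a small governed real set."""
    from ultralytics import YOLO  # lazy import; needs `pip install ultralytics`

    model = YOLO("yolov8m.pt")
    # Stage 1: learn broad behavior and viewpoint coverage from synthetic images.
    model.train(data=synthetic_yaml, epochs=50, imgsz=640)
    # Stage 2: adapt to the target cameras with a lower learning rate.
    model.train(data=real_yaml, epochs=10, imgsz=640, lr0=0.001)
    # Validate on real footage, since that is what deployment will see.
    return model.val(data=real_yaml)


# Hypothetical dataset root, used only to show the generated config text.
config_text = dataset_yaml("datasets/jewelry_theft", CLASS_NAMES)
```

The final validation runs against the real set on purpose: synthetic validation scores show the model learned the synthetic distribution, not that it transfers to the target store.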

For a problem where the demand is climbing, the real data is rare and legally sensitive, and the deployment conditions vary store to store, that division of labor is the honest one. Synthetic data makes a well-covered, shareable theft dataset possible. Real validation makes a deployed model trustworthy.

The dataset is open, YOLO-formatted, and available on Hugging Face. It is a sensible starting point for any team building retail or high-value-store theft detection that has hit the wall between the footage they need and the footage they can lawfully collect.


The Synthetic Jewellery Theft Detection dataset is available on Hugging Face and the AnywayLabs dataset catalog. Retail-crime context references the National Retail Federation's published theft and violence reporting.
