Introduction
In today’s data-driven retail landscape, understanding consumer purchasing behavior at the SKU level has become crucial for brands, retailers, and marketers. Traditional point-of-sale data is often siloed within individual retailers, making it challenging to gain comprehensive insights into cross-basket analytics and brand loyalty patterns. However, with the advent of advanced OCR technology and product matching services, businesses can now extract valuable SKU-level data directly from consumer receipts to fuel downstream analytics and personalization efforts.
Veryfi’s AI-native intelligent document-processing platform offers lightning-fast OCR APIs that transform unstructured receipt data into structured, analyzable information in just 3-5 seconds. (Veryfi OCR API Platform) The platform supports 91 currencies and 38 languages, making it ideal for global retail analysis. (Veryfi Receipt OCR API) This comprehensive guide will walk developers through every step of the process, from uploading receipts to Veryfi’s OCR API to enriching line items with product matching services, ultimately creating a ready-to-query transactions table for advanced analytics.
The Power of Receipt-Based Analytics
Consumer packaged goods (CPG) purchase data has traditionally been controlled by retailers who own the point-of-sale systems. (Veryfi CPG Toolkit) However, receipt-based analytics democratizes access to this valuable information, enabling brands and marketers to understand consumer behavior across multiple retailers and channels.
Veryfi’s CPG Toolkit empowers retail manufacturers and digital marketing companies with real-time tools to unearth consumer spend behavior, brand loyalty, store insights, and much more. (Veryfi CPG Toolkit) This data can then be used to enrich consumer experiences through precisely targeted coupons, vouchers, and loyalty programs at scale.
Key Benefits of SKU-Level Receipt Analysis
- Cross-retailer insights: Understand purchasing patterns across different stores and chains
- Brand loyalty tracking: Identify switching behaviors and competitive dynamics
- Basket composition analysis: Discover product affinities and cross-selling opportunities
- Price sensitivity mapping: Analyze how promotions and discounts affect purchase decisions
- Geographic and demographic segmentation: Correlate purchasing patterns with location and customer profiles
Step 1: Setting Up Your Development Environment
Before diving into the code, you’ll need to set up your development environment and obtain the necessary API credentials from Veryfi.
Prerequisites
- Python 3.7 or higher
- Veryfi API credentials (Client ID, Username, API Key)
- Basic understanding of REST APIs and JSON handling
Installing Required Libraries
# json, base64, hashlib, hmac, and time ship with Python and need no installation
pip install requests pandas  # pandas is used for data manipulation
Authentication Setup
import base64
import hashlib
import hmac
import json
import time

import requests

# Veryfi API credentials
CLIENT_ID = "your_client_id"
USERNAME = "your_username"
API_KEY = "your_api_key"
BASE_URL = "https://api.veryfi.com/api/v8/partner/documents/"
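Hard-coded credentials are convenient for a demo but risky in shared code. A minimal sketch, assuming the credentials are exported as environment variables. The variable names VERYFI_CLIENT_ID, VERYFI_USERNAME, and VERYFI_API_KEY and the helper load_credentials are hypothetical choices for this guide, not names Veryfi requires:

```python
import os

# Hypothetical environment variable names; pick whatever fits your deployment.
REQUIRED_VARS = ("VERYFI_CLIENT_ID", "VERYFI_USERNAME", "VERYFI_API_KEY")


def load_credentials(env=None):
    """Return a dict of credentials, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["VERYFI_CLIENT_ID"],
        "username": env["VERYFI_USERNAME"],
        "api_key": env["VERYFI_API_KEY"],
    }
```

Calling load_credentials() at startup surfaces a missing variable immediately instead of as an authentication error on the first request.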
Step 2: Uploading Receipts to Veryfi’s OCR API
Veryfi’s Receipt OCR API can extract data from receipts in 91 currencies and 38 languages, using optical character recognition to convert receipt images into machine-encoded text. (Veryfi Receipt OCR API) The API replaces manual data entry, which is slow and prone to errors.
Basic Receipt Upload Function
import os

def upload_receipt(image_path):
    """
    Upload a receipt image to Veryfi's OCR API
    """
    # Read and encode the image
    with open(image_path, 'rb') as image_file:
        image_data = base64.b64encode(image_file.read()).decode('utf-8')
    # Prepare the request payload
    payload = {
        'file_data': image_data,
        'file_name': os.path.basename(image_path),  # Portable across path separators
        'categories': ['Grocery', 'Gas Station', 'Restaurant'],
        'auto_delete': False,
        # Verify in the API docs before enabling: boost mode is described as
        # faster processing with less data enrichment
        'boost_mode': 1,
        'external_id': f"receipt_{int(time.time())}"
    }
    # Generate authentication headers
    headers = generate_headers(payload)
    # Make the API request (the timeout keeps a stalled connection from hanging)
    response = requests.post(BASE_URL, json=payload, headers=headers, timeout=30)
    if response.status_code == 201:
        return response.json()
    else:
        raise Exception(f"API request failed: {response.status_code} - {response.text}")
def generate_headers(payload):
    """
    Generate authentication headers for Veryfi API
    """
    timestamp = int(time.time() * 1000)
    signature = generate_signature(payload, timestamp)
    return {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Client-Id': CLIENT_ID,
        'Authorization': f'apikey {USERNAME}:{API_KEY}',
        'X-Veryfi-Request-Timestamp': str(timestamp),
        'X-Veryfi-Request-Signature': signature
    }

def generate_signature(payload, timestamp):
    """
    Generate HMAC signature for API authentication
    """
    # Illustrative only: confirm the exact message format and signing secret
    # in Veryfi's authentication documentation
    payload_string = json.dumps(payload, separators=(',', ':'))
    message = f"{timestamp},{payload_string}"
    signature = hmac.new(
        API_KEY.encode('utf-8'),
        message.encode('utf-8'),
        hashlib.sha256
    ).hexdigest()
    return signature
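Uploads can fail transiently, for example on timeouts or rate limiting. The sketch below wraps any request callable in a retry loop with exponential backoff. The helper name call_with_retry, the set of retryable status codes, and the delays are illustrative assumptions, not documented Veryfi behavior:

```python
import time

# Assumed set of transient HTTP statuses worth retrying.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def call_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable response.

    send must return an object with a status_code attribute
    (for example a requests.Response).
    """
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code not in RETRYABLE_STATUSES:
            return response
        if attempt < max_attempts:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # Out of attempts: hand the last response back to the caller.
    return response
```

Inside upload_receipt, the request could then be issued as call_with_retry(lambda: requests.post(BASE_URL, json=payload, headers=headers, timeout=30)).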
Handling Long CPG Receipts
Veryfi’s CPG receipt support lets users capture a long receipt in one pass, much like taking a panoramic photo, and produces a single stitched image of the receipt. (Veryfi CPG Receipts) This feature works with any type of CPG receipt, from Safeway and Tesco to Wegmans, Giant Eagle, Smart & Final, Five Below, Whole Foods Market, Coles, and other retailers, including the extremely long receipts from CVS.
def process_long_receipt(image_path):
    """
    Process long CPG receipts with enhanced settings
    """
    # Use a context manager so the file handle is always closed
    with open(image_path, 'rb') as image_file:
        image_data = base64.b64encode(image_file.read()).decode('utf-8')
    payload = {
        'file_data': image_data,
        'file_name': image_path.split('/')[-1],
        'categories': ['Grocery'],
        'boost_mode': 1,
        'auto_delete': False,
        'max_pages_to_process': 10,  # Handle multi-page receipts
        'line_items': True,  # Ensure line item extraction
        'external_id': f"long_receipt_{int(time.time())}"
    }
    headers = generate_headers(payload)
    response = requests.post(BASE_URL, json=payload, headers=headers)
    # Returns None on any non-201 response; callers must check for it
    return response.json() if response.status_code == 201 else None
Step 3: Extracting Line Item Data
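Once the receipt is processed, you can retrieve detailed line item information. Veryfi documents a “Get a Line Item” endpoint; the example below instead fetches the whole document by ID and reads its line_items array. The API also lets you assign your own ID to a document (the external_id set during upload), which is useful for mapping the document to an external service or resource. (Veryfi API Documentation)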
Retrieving Line Items
def get_line_items(document_id):
    """
    Retrieve line items from a processed receipt
    """
    url = f"{BASE_URL}{document_id}/"
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Client-Id': CLIENT_ID,
        'Authorization': f'apikey {USERNAME}:{API_KEY}'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        document_data = response.json()
        return extract_line_item_details(document_data)
    else:
        raise Exception(f"Failed to retrieve document: {response.status_code}")

def extract_line_item_details(document_data):
    """
    Extract relevant line item fields for analysis
    """
    # Field names follow this guide; confirm them against the actual response schema
    # The vendor field may be present but null, so fall back to an empty dict
    vendor = document_data.get('vendor') or {}
    line_items = []
    for item in document_data.get('line_items', []):
        line_item = {
            'description': item.get('description', ''),
            'quantity': item.get('quantity', 1),
            'unit_price': item.get('unit_price', 0),
            'total': item.get('total', 0),
            'discount': item.get('discount', 0),
            'upc': item.get('upc', ''),
            'sku': item.get('sku', ''),
            'category': item.get('category', ''),
            'brand': item.get('brand', ''),
            'size': item.get('size', ''),
            'weight': item.get('weight', ''),
            'date': document_data.get('date', ''),
            'vendor': vendor.get('name', ''),
            'store_number': document_data.get('store_number', ''),
            'receipt_id': document_data.get('id', '')
        }
        line_items.append(line_item)
    return line_items
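OCR output may carry missing values as null or numbers as strings, which breaks arithmetic later in the pipeline. A minimal sketch of defensive coercion follows; the helpers to_float and clean_line_item are hypothetical, and the sample line item is made up for illustration rather than taken from real API output:

```python
def to_float(value, default=0.0):
    """Coerce None, numeric strings, and numbers to float."""
    if value is None:
        return default
    try:
        # Strip thousands separators and a leading currency symbol if present
        return float(str(value).replace(",", "").replace("$", "").strip())
    except ValueError:
        return default


def clean_line_item(item):
    """Return a copy of item with numeric fields coerced to float."""
    cleaned = dict(item)
    cleaned["quantity"] = to_float(item.get("quantity"), default=1.0)
    for field in ("unit_price", "total", "discount"):
        cleaned[field] = to_float(item.get(field))
    return cleaned


# Hypothetical line item, for illustration only.
sample = {"description": "OAT MILK 64OZ", "quantity": None,
          "unit_price": "4.99", "total": "$4.99", "discount": None}
cleaned = clean_line_item(sample)
```

Running each extracted item through clean_line_item before building the transactions table keeps later subtraction and aggregation from failing on None values.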
cURL Example for Line Item Retrieval
For developers who prefer cURL, here’s how to retrieve line item data:
curl -X GET "https://api.veryfi.com/api/v8/partner/documents/{document_id}/" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Client-Id: your_client_id" \
-H "Authorization: apikey your_username:your_api_key"
Step 4: Product Matching and Data Enrichment
Veryfi’s RESTful JSON API works out of the box with no training or setup: according to Veryfi, its models are trained on millions of CPG receipts and extract data down to the line-item level. (Veryfi CPG Toolkit) The platform also automatically corrects image distortions such as pincushion and barrel distortion, detects blur, and adjusts for perspective issues.
UPC Product Matching Service
def enrich_with_product_matching(line_items):
    """
    Enrich line items with product matching data
    """
    enriched_items = []
    for item in line_items:
        if item.get('upc'):
            # Call product matching service
            product_data = match_product_by_upc(item['upc'])
            if product_data:
                item.update({
                    'normalized_brand': product_data.get('brand', ''),
                    'normalized_size': product_data.get('size', ''),
                    'normalized_flavor': product_data.get('flavor', ''),
                    'product_category': product_data.get('category', ''),
                    'manufacturer': product_data.get('manufacturer', ''),
                    'ingredients': product_data.get('ingredients', []),
                    'nutritional_info': product_data.get('nutrition', {})
                })
        # Items without a UPC or without a match pass through unchanged
        enriched_items.append(item)
    return enriched_items

def match_product_by_upc(upc):
    """
    Match product information using UPC code
    """
    # This would call Veryfi's product matching service
    # Implementation depends on specific API endpoint; the URL below is a placeholder
    product_match_url = f"https://api.veryfi.com/api/v8/partner/products/upc/{upc}/"
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Client-Id': CLIENT_ID,
        'Authorization': f'apikey {USERNAME}:{API_KEY}'
    }
    response = requests.get(product_match_url, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        return None
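Before spending a lookup on an OCR-extracted code, it can help to check whether it is a well-formed UPC-A. The sketch below applies the standard UPC-A check-digit rule; the function name is_valid_upc_a is a hypothetical helper. Some receipts print shortened or retailer-internal codes, so a failed check is better treated as a reason to review the item than as proof of an OCR error:

```python
def is_valid_upc_a(upc):
    """Return True if upc is a 12-digit string with a correct UPC-A check digit."""
    if not isinstance(upc, str) or len(upc) != 12:
        return False
    if not (upc.isascii() and upc.isdigit()):
        return False
    digits = [int(ch) for ch in upc]
    # Positions 1, 3, ..., 11 (1-based) are weighted by 3; positions 2, 4, ..., 10 by 1
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:10:2])
    expected_check = (10 - total % 10) % 10
    return digits[11] == expected_check
```

For the code 036000291452, the weighted sum is 3 × 14 + 16 = 58, so the expected check digit is 2, which matches the final digit.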
Step 5: Building Your Transactions Database
With enriched line item data, you can now build a comprehensive transactions database that supports advanced analytics and cross-basket analysis.
Database Schema Design
import pandas as pd
from datetime import datetime

def create_transactions_table(enriched_line_items):
    """
    Create a structured transactions table from enriched line items
    """
    transactions = []
    line_counters = {}  # Per-receipt running line number
    for item in enriched_line_items:
        receipt_id = item['receipt_id']
        # Line items carry no line_number by default, so number them per
        # receipt; otherwise every transaction_id on a receipt would collide
        line_number = item.get('line_number')
        if line_number is None:
            line_counters[receipt_id] = line_counters.get(receipt_id, 0) + 1
            line_number = line_counters[receipt_id]
        # Treat missing (None) amounts as zero before doing arithmetic
        total = item.get('total') or 0
        discount = item.get('discount') or 0
        transaction = {
            'transaction_id': f"{receipt_id}_{line_number}",
            'receipt_id': receipt_id,
            'date': item['date'],
            'store_name': item['vendor'],
            'store_number': item['store_number'],
            'product_description': item['description'],
            'upc': item['upc'],
            'sku': item['sku'],
            'brand': item.get('normalized_brand', item.get('brand', '')),
            'category': item.get('product_category', item.get('category', '')),
            'size': item.get('normalized_size', item.get('size', '')),
            'flavor': item.get('normalized_flavor', ''),
            'manufacturer': item.get('manufacturer', ''),
            'quantity': item['quantity'],
            'unit_price': item['unit_price'],
            'total_price': total,
            'discount_amount': discount,
            'final_price': total - discount,
            'created_at': datetime.now().isoformat()
        }
        transactions.append(transaction)
    return pd.DataFrame(transactions)
# Example usage
def process_receipt_to_database(image_path):
    """
    Complete pipeline from receipt image to database record
    """
    # Step 1: Upload and process receipt
    receipt_data = upload_receipt(image_path)
    document_id = receipt_data['id']
    # Step 2: Extract line items
    line_items = get_line_items(document_id)
    # Step 3: Enrich with product matching
    enriched_items = enrich_with_product_matching(line_items)
    # Step 4: Create transactions table
    transactions_df = create_transactions_table(enriched_items)
    return transactions_df
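To make the table queryable with SQL, the rows can be loaded into a database. A minimal sketch using Python's built-in sqlite3 module; the table name, the reduced column set, and the helpers save_transactions and spend_by_brand are illustrative assumptions rather than part of any Veryfi tooling:

```python
import sqlite3

COLUMNS = ("transaction_id", "receipt_id", "store_name", "brand",
           "category", "quantity", "final_price")


def save_transactions(conn, rows):
    """Create the transactions table if needed and upsert rows (dicts)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS transactions (
               transaction_id TEXT PRIMARY KEY,
               receipt_id TEXT, store_name TEXT, brand TEXT,
               category TEXT, quantity REAL, final_price REAL)"""
    )
    placeholders = ", ".join("?" for _ in COLUMNS)
    # INSERT OR REPLACE makes re-running the pipeline on a receipt idempotent
    conn.executemany(
        f"INSERT OR REPLACE INTO transactions VALUES ({placeholders})",
        [tuple(row.get(col) for col in COLUMNS) for row in rows],
    )
    conn.commit()


def spend_by_brand(conn):
    """Return (brand, total_spend) tuples, highest spend first."""
    return conn.execute(
        "SELECT brand, ROUND(SUM(final_price), 2) FROM transactions "
        "GROUP BY brand ORDER BY 2 DESC"
    ).fetchall()
```

With a DataFrame from create_transactions_table, the rows can be passed as transactions_df.to_dict("records"), using a connection such as sqlite3.connect("transactions.db").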
Step 6: Advanced Analytics and Insights
With your transactions database in place, you can now perform sophisticated cross-basket analytics to understand consumer behavior patterns.
Brand Loyalty Analysis
def analyze_brand_loyalty(transactions_df):
    """
    Analyze brand switching and loyalty patterns
    """
    # Aggregate by brand within category; this table has no customer ID,
    # so purchase occasions are counted as distinct receipts
    brand_analysis = transactions_df.groupby(['brand', 'category']).agg({
        'quantity': 'sum',
        'final_price': 'sum',
        'receipt_id': 'nunique'
    }).reset_index()
    brand_analysis.columns = ['brand', 'category', 'total_quantity',
                              'total_spend', 'purchase_occasions']
    # Calculate market share within categories
    brand_analysis['category_share'] = brand_analysis.groupby('category')['total_spend'].transform(
        lambda x: x / x.sum() * 100
    )
    return brand_analysis
from itertools import combinations

def basket_composition_analysis(transactions_df):
    """
    Analyze product affinities and basket composition
    """
    # Group by receipt to analyze basket composition
    basket_analysis = transactions_df.groupby('receipt_id').agg({
        'category': lambda x: list(x.unique()),
        'brand': lambda x: list(x.unique()),
        'final_price': 'sum',
        'quantity': 'sum'
    }).reset_index()
    # Find common category combinations
    category_pairs = []
    for categories in basket_analysis['category']:
        if len(categories) > 1:
            for pair in combinations(categories, 2):
                # Tuples are hashable; lists would make value_counts() fail
                category_pairs.append(tuple(sorted(pair)))
    # Count frequency of category pairs
    pair_counts = pd.Series(category_pairs).value_counts()
    return pair_counts
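Raw pair counts favor categories that are simply popular. A common way to correct for that is lift, the ratio of how often a pair co-occurs to how often it would co-occur if the two categories were independent. The sketch below is one way to compute it with the standard library; the function name pair_lift and the sample baskets are made up for illustration:

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Compute support and lift for each unordered category pair.

    baskets is a list of category collections, one per receipt.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        categories = sorted(set(basket))  # de-duplicate within a receipt
        item_counts.update(categories)
        pair_counts.update(combinations(categories, 2))
    results = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        # lift = P(a and b) / (P(a) * P(b))
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        results[(a, b)] = {"support": support, "lift": lift}
    return results


# Hypothetical baskets, for illustration only.
example = pair_lift([
    ["Chips", "Salsa"],
    ["Chips", "Salsa", "Soda"],
    ["Soda"],
    ["Chips"],
])
```

A lift above 1 suggests the two categories are bought together more often than chance would predict; a lift below 1 suggests the opposite. With only a handful of receipts, these ratios are noisy and should be read with caution.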
Price Sensitivity and Promotion Analysis
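def analyze_price_sensitivity(transactions_df):
    """
    Analyze how discounts affect purchase behavior
    """
    # Work on a copy so the caller's DataFrame is not modified
    df = transactions_df.copy()
    # Discount as a share of the pre-discount price. total_price is treated as
    # pre-discount, consistent with final_price = total_price - discount_amount
    pre_discount = df['total_price'].where(df['total_price'] > 0)
    df['discount_percentage'] = (
        df['discount_amount'] / pre_discount * 100
    ).fillna(0)
    # Group by discount ranges; include_lowest keeps undiscounted (0%) items
    # in the first bucket instead of silently dropping them
    discount_ranges = pd.cut(df['discount_percentage'],
                             bins=[0, 10, 25, 50, 100],
                             labels=['0-10%', '10-25%', '25-50%', '50%+'],
                             include_lowest=True).rename('discount_range')
    promotion_analysis = df.groupby(discount_ranges, observed=False).agg({
        'quantity': 'mean',
        'final_price': 'mean',
        'receipt_id': 'nunique'
    }).reset_index()
    return promotion_analysis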
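The bucketing arithmetic can be checked by hand on a single item. The sketch below assumes total_price is the pre-discount amount, as the final_price calculation in Step 5 implies; the helper names and bucket constants are hypothetical:

```python
import bisect

BUCKET_EDGES = [10, 25, 50]  # inclusive upper bounds, in percent
BUCKET_LABELS = ["0-10%", "10-25%", "25-50%", "50%+"]


def discount_percentage(pre_discount_total, discount):
    """Discount as a percentage of the pre-discount price (0 if price is 0)."""
    if not pre_discount_total:
        return 0.0
    return discount / pre_discount_total * 100


def discount_bucket(pct):
    """Map a percentage to a bucket label; upper bounds are inclusive."""
    return BUCKET_LABELS[bisect.bisect_left(BUCKET_EDGES, pct)]


# An item listed at 5.00 with 1.00 off is a 20% discount.
pct = discount_percentage(5.00, 1.00)
label = discount_bucket(pct)
```

Here the item falls in the 10-25% bucket, and an undiscounted item lands in 0-10% rather than being dropped.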
Step 7: Data Quality and Duplicate Detection
Veryfi’s platform can identify duplicate receipts used to claim coupons, vouchers, or cash back, helping protect campaign investments and distribution volume. (Veryfi CPG Toolkit)
Duplicate Receipt Detection
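def detect_duplicate_receipts(transactions_df):
    """
    Identify potential duplicate receipts for fraud prevention
    """
    # Roll line items up to one row per receipt so that receipts, not
    # individual lines with the same price, are compared
    receipts = transactions_df.groupby(
        ['receipt_id', 'store_name', 'date'], as_index=False
    ).agg(receipt_total=('final_price', 'sum'))
    receipts['receipt_total'] = receipts['receipt_total'].round(2)
    # Group receipts that share store, date, and total
    duplicate_candidates = receipts.groupby(
        ['store_name', 'date', 'receipt_total']
    ).agg(
        receipt_count=('receipt_id', 'nunique'),
        receipt_ids=('receipt_id', list)
    ).reset_index()
    # Flag potential duplicates
    duplicates = duplicate_candidates[duplicate_candidates['receipt_count'] > 1]
    return duplicates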
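Store, date, and total can coincide for two genuinely different purchases. One way to tighten the comparison is to hash a normalized summary of the whole receipt, including its item descriptions. This is a sketch of that idea, not how Veryfi's own duplicate detection works; the function name receipt_fingerprint is hypothetical:

```python
import hashlib


def receipt_fingerprint(store_name, date, total, descriptions):
    """Build a stable SHA-256 fingerprint from normalized receipt fields."""
    parts = [
        store_name.strip().lower(),
        str(date).strip(),
        f"{float(total):.2f}",
        # Sort so that line order does not change the fingerprint
        "|".join(sorted(d.strip().lower() for d in descriptions)),
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
```

Two uploads of the same receipt should produce the same fingerprint even if OCR returns the lines in a different order or with different capitalization. Small OCR differences in the text would still change the hash, so this complements the grouping approach rather than replacing it.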
def validate_data_quality(transactions_df):
    """
    Perform data quality checks on extracted data
    """
    def count_blank(column):
        # Missing values arrive as empty strings as well as NaN/None,
        # so isna() alone would undercount them
        return int(transactions_df[column].fillna('').astype(str).str.strip().eq('').sum())

    quality_report = {
        'total_transactions': len(transactions_df),
        'missing_upc': count_blank('upc'),
        'missing_brand': count_blank('brand'),
        'zero_price': (transactions_df['final_price'] == 0).sum(),
        'negative_quantity': (transactions_df['quantity'] < 0).sum(),
        'date_range': {
            'earliest': transactions_df['date'].min(),
            'latest': transactions_df['date'].max()
        }
    }
    return quality_report
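Another useful check is whether the extracted line items add up to the subtotal printed on the receipt; a gap usually means a line was missed or misread. A minimal sketch, where the function name reconcile_receipt is hypothetical and the availability and name of a subtotal field in the API response are assumptions to verify:

```python
def reconcile_receipt(line_totals, receipt_subtotal, tolerance=0.01):
    """Compare the sum of line totals with the subtotal printed on the receipt."""
    computed = round(sum(line_totals), 2)
    difference = round(computed - receipt_subtotal, 2)
    return {
        "computed_subtotal": computed,
        "difference": difference,
        "matches": abs(difference) <= tolerance,
    }
```

For example, line totals of 4.99, 2.50, and 1.25 reconcile with a subtotal of 8.74, while dropping the last line leaves a difference of -1.25.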
Step 8: Integration with Analytics Platforms
Your enriched transactions data can now be integrated with various analytics and business intelligence platforms for deeper insights.
Export to Common Formats
def export_for_analytics(transactions_df, format_type='csv'):
    """
    Export transactions data for external analytics platforms
    """
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    if format_type == 'csv':
        filename = f'transactions_{timestamp}.csv'
        transactions_df.to_csv(filename, index=False)
    elif format_type == 'json':
        filename = f'transactions_{timestamp}.json'
        transactions_df.to_json(filename, orient='records', date_format='iso')
    elif format_type == 'parquet':
        # Requires a parquet engine such as pyarrow or fastparquet
        filename = f'transactions_{timestamp}.parquet'
        transactions_df.to_parquet(filename, index=False)
    else:
        # Without this branch, filename would be undefined at the return
        raise ValueError(f"Unsupported format_type: {format_type}")
    return filename
def create_analytics_summary(transactions_df):
    """
    Create summary statistics for dashboard consumption
    """
    summary = {
        'total_spend': transactions_df['final_price'].sum(),
        'total_transactions': len(transactions_df),
        'unique_products': transactions_df['upc'].nunique(),
        'unique_brands': transactions_df['brand'].nunique(),
        'unique_stores': transactions_df['store_name'].nunique(),
        # Average spend per receipt, not the number of items per basket
        'average_basket_size': transactions_df.groupby('receipt_id')['final_price'].sum().mean(),
        'top_categories': transactions_df.groupby('category')['final_price'].sum().nlargest(10).to_dict(),
        'top_brands': transactions_df.groupby('brand')['final_price'].sum().nlargest(10).to_dict()
    }
    return summary
Real-World Applications and Use Cases
The SKU-level basket data extracted through this process enables numerous real-world applications across different industries and use cases.
Retail and CPG Manufacturers
Veryfi’s technology provides insights into what consumers buy, where they shop, at what frequency, how much they spend, and much more. (Veryfi CPG Toolkit) This comprehensive view of consumer behavior has proven valuable in understanding market dynamics.
FAQ
What is SKU-level basket data and why is it important for retailers?
SKU-level basket data refers to detailed information about individual products (Stock Keeping Units) purchased together in a single transaction. This data is crucial for understanding consumer purchasing behavior, cross-basket analytics, and brand loyalty patterns. It enables retailers and brands to optimize product placement, develop targeted marketing campaigns, and improve inventory management strategies.
How does Veryfi’s OCR technology extract data from receipts?
Veryfi’s Receipt OCR API uses advanced machine learning and artificial intelligence to convert receipt images into machine-encoded text. The technology can process receipts in 91 currencies and 38 languages, extracting data 200x faster and 10x more accurately than manual human processing. It eliminates the need for manual labor while providing real-time data extraction from unstructured documents.
What is product matching and how does it enhance receipt data?
Product matching is the process of linking extracted receipt line items to standardized product databases using SKU codes, UPC numbers, or product descriptions. This enhancement transforms raw receipt text into structured, actionable data that can be used for market analysis, competitive intelligence, and consumer behavior insights. It enables brands to track their products across multiple retailers and understand market share.
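When a line item has no UPC or SKU, matching on the description is the fallback. The sketch below uses Python's built-in difflib for approximate string matching. The catalog entries, SKU values, and the 0.5 cutoff are made up for illustration, and a production matcher would likely need a larger catalog and a more robust similarity measure:

```python
import difflib

# Hypothetical mini catalog; a real one would come from a product database.
CATALOG = {
    "oatly oat milk original 64 oz": "SKU-001",
    "tostitos scoops tortilla chips 10 oz": "SKU-002",
    "pace chunky salsa medium 16 oz": "SKU-003",
}


def match_description(description, catalog=CATALOG, cutoff=0.5):
    """Return (catalog_name, sku) for the closest catalog entry, or None."""
    candidates = difflib.get_close_matches(
        description.strip().lower(), list(catalog), n=1, cutoff=cutoff
    )
    if not candidates:
        return None
    return candidates[0], catalog[candidates[0]]
```

Receipt text is often abbreviated, so a lower cutoff catches more matches at the cost of more false positives; the right value depends on the catalog.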
How can Veryfi’s CPG Toolkit help with consumer spending analysis?
Veryfi’s CPG Toolkit provides real-time tools for retail manufacturers and digital marketing companies to understand consumer spending behavior and brand loyalty. The toolkit analyzes consumer packaged goods purchases including food, beverages, toiletries, and cleaning products. This data can be used to create precisely targeted coupons, vouchers, and loyalty programs at scale, enriching the consumer experience.
What types of documents can Veryfi’s OCR APIs process besides receipts?
Veryfi’s OCR APIs can extract data from a wide variety of documents including invoices, W2s, W9s, bank checks, business cards, purchase orders, bills of lading, hotel folios, bank statements, credit cards, and ID cards. The platform uses deterministic, day-1 ready AI models that provide accurate data extraction across multiple document types and formats.
How can businesses integrate Veryfi’s document capture capabilities into their applications?
Businesses can integrate Veryfi’s document capture through Veryfi Lens, a software solution that can be embedded into mobile and web applications. Built in native code and optimized for performance, Veryfi Lens handles complexities like frame processing, asset preprocessing, and machine vision challenges. It provides a clean user experience with low memory usage while delivering fast, accurate document capture capabilities.