From Receipt to Insight: How to Extract SKU-Level Basket Data with Veryfi OCR & Product Matching

August 28, 2025

Introduction

Understanding consumer purchasing behavior at the SKU level matters to brands, retailers, and marketers. Traditional point-of-sale data is siloed within individual retailers, which makes cross-basket analytics and brand loyalty patterns hard to see. With OCR technology and product matching services, businesses can now extract SKU-level data directly from consumer receipts to feed downstream analytics and personalization efforts.

Veryfi’s AI-native intelligent document-processing platform offers lightning-fast OCR APIs that transform unstructured receipt data into structured, analyzable information in just 3-5 seconds. (Veryfi OCR API Platform) The platform supports 91 currencies and 38 languages, making it ideal for global retail analysis. (Veryfi Receipt OCR API) This comprehensive guide will walk developers through every step of the process, from uploading receipts to Veryfi’s OCR API to enriching line items with product matching services, ultimately creating a ready-to-query transactions table for advanced analytics.

The Power of Receipt-Based Analytics

Consumer packaged goods (CPG) purchase data has traditionally been controlled by retailers who own the point-of-sale systems. (Veryfi CPG Toolkit) However, receipt-based analytics democratizes access to this valuable information, enabling brands and marketers to understand consumer behavior across multiple retailers and channels.

Veryfi’s CPG Toolkit empowers retail manufacturers and digital marketing companies with real-time tools to unearth consumer spend behavior, brand loyalty, store insights, and much more. (Veryfi CPG Toolkit) This data can then be used to enrich consumer experiences through precisely targeted coupons, vouchers, and loyalty programs at scale.

Key Benefits of SKU-Level Receipt Analysis

Cross-retailer insights: Understand purchasing patterns across different stores and chains
Brand loyalty tracking: Identify switching behaviors and competitive dynamics
Basket composition analysis: Discover product affinities and cross-selling opportunities
Price sensitivity mapping: Analyze how promotions and discounts affect purchase decisions
Geographic and demographic segmentation: Correlate purchasing patterns with location and customer profiles

Step 1: Setting Up Your Development Environment

Before diving into the code, you’ll need to set up your development environment and obtain the necessary API credentials from Veryfi.

Prerequisites

Python 3.7 or higher
Veryfi API credentials (Client ID, Username, API Key)
Basic understanding of REST APIs and JSON handling

Installing Required Libraries

pip install requests
pip install pandas  # For data manipulation
# json, base64, hashlib, hmac, and time ship with Python and need no installation

Authentication Setup

import requests
import json
import base64
import hashlib
import hmac
import time

# Veryfi API credentials
CLIENT_ID = "your_client_id"
USERNAME = "your_username"
API_KEY = "your_api_key"
BASE_URL = "https://api.veryfi.com/api/v8/partner/documents/"
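
Hardcoding credentials in source files makes them easy to leak. A minimal sketch of reading them from the environment instead; the variable names (`VERYFI_CLIENT_ID` and so on) and the helper `load_veryfi_credentials` are assumptions made for this example, not names defined by Veryfi.

```python
import os


def load_veryfi_credentials(env=os.environ):
    """Read API credentials from environment variables.

    The variable names below are an assumed convention for this sketch.
    """
    names = {
        "client_id": "VERYFI_CLIENT_ID",
        "username": "VERYFI_USERNAME",
        "api_key": "VERYFI_API_KEY",
    }
    # Collect every missing variable so the error message lists them all at once
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary can then populate the constants above, for example `CLIENT_ID = load_veryfi_credentials()["client_id"]`.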

Step 2: Uploading Receipts to Veryfi’s OCR API

Veryfi’s Receipt OCR API can extract data from receipts in 91 currencies and 38 languages, using optical character recognition technology to convert receipt images into machine-encoded text. (Veryfi Receipt OCR API) This replaces manual data entry, which is slow and error-prone.

Basic Receipt Upload Function

def upload_receipt(image_path):
    """
    Upload a receipt image to Veryfi's OCR API
    """
    # Read and encode the image
    with open(image_path, 'rb') as image_file:
        image_data = base64.b64encode(image_file.read()).decode('utf-8')

    # Prepare the request payload
    payload = {
        'file_data': image_data,
        'file_name': image_path.split('/')[-1],
        'categories': ['Grocery', 'Gas Station', 'Restaurant'],
        'auto_delete': False,
        'boost_mode': 0,  # 1 speeds up processing but skips data enrichment; check Veryfi's API reference
        'external_id': f"receipt_{int(time.time())}"
    }

    # Generate authentication headers
    headers = generate_headers(payload)

    # Make the API request
    response = requests.post(BASE_URL, json=payload, headers=headers)

    if response.status_code == 201:
        return response.json()
    else:
        raise Exception(f"API request failed: {response.status_code} - {response.text}")

def generate_headers(payload):
    """
    Generate authentication headers for Veryfi API
    """
    timestamp = int(time.time() * 1000)
    signature = generate_signature(payload, timestamp)

    return {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Client-Id': CLIENT_ID,
        'Authorization': f'apikey {USERNAME}:{API_KEY}',
        'X-Veryfi-Request-Timestamp': str(timestamp),
        'X-Veryfi-Request-Signature': signature
    }

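def generate_signature(payload, timestamp):
    """
    Generate HMAC signature for API authentication.

    Illustrative scheme: confirm the signing key, message layout, and digest
    encoding against Veryfi's authentication documentation before relying on it.
    """
    # Serialize compactly so the signed string does not depend on whitespace
    payload_string = json.dumps(payload, separators=(',', ':'))
    message = f"{timestamp},{payload_string}"
    signature = hmac.new(
        API_KEY.encode('utf-8'),
        message.encode('utf-8'),
        hashlib.sha256
    ).hexdigest()
    return signature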
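
Uploads travel over the network, so an occasional transient failure is possible. A minimal sketch of retrying a call with exponential backoff; `call_with_retries` and its parameters are hypothetical helpers for this example, and it retries on any exception, whereas a production version would likely retry only on timeouts and server errors.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying failed calls with exponential backoff.

    Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With the upload function above, a call would look like `call_with_retries(upload_receipt, "receipt.jpg")`. The `sleep` parameter exists so tests can replace the real delay.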

Handling Long CPG Receipts

Veryfi’s CPG receipt support allows customers to capture long receipts in one snap, just like taking a panoramic photo, yielding a single stitched photo of your CPG receipt with ease. (Veryfi CPG Receipts) This feature works with any type of CPG receipt, from Safeway and Tesco to Wegmans, Giant Eagle, Smart & Final, Five Below, Whole Foods Market, Coles, and other retailers – even those extremely long receipts from CVS.

def process_long_receipt(image_path):
    """
    Process long CPG receipts with enhanced settings
    """
    # Read and encode the image, closing the file handle when done
    with open(image_path, 'rb') as image_file:
        image_data = base64.b64encode(image_file.read()).decode('utf-8')

    payload = {
        'file_data': image_data,
        'file_name': image_path.split('/')[-1],
        'categories': ['Grocery'],
        'boost_mode': 0,  # 1 is faster but skips data enrichment
        'auto_delete': False,
        'max_pages_to_process': 10,  # Handle multi-page receipts
        'line_items': True,  # Ensure line item extraction
        'external_id': f"long_receipt_{int(time.time())}"
    }

    headers = generate_headers(payload)
    response = requests.post(BASE_URL, json=payload, headers=headers)

    # Fail loudly, as upload_receipt does, instead of silently returning None
    if response.status_code != 201:
        raise Exception(f"API request failed: {response.status_code} - {response.text}")
    return response.json()
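
Receipt analytics usually involves many images, not just one. A minimal sketch of running a processing function over a folder while collecting failures rather than stopping at the first error; `process_receipt_folder` and the `IMAGE_SUFFIXES` set are assumptions for this example, and the accepted file types should be checked against what your account supports.

```python
from pathlib import Path

# Assumed set of file types for this sketch
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".pdf"}


def process_receipt_folder(folder, process_one):
    """Run process_one(path) on each receipt file in folder.

    Returns (results, failures), both keyed by file name, so that one bad
    image does not abort the whole batch.
    """
    results, failures = {}, {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # Skip files that are not receipt images
        try:
            results[path.name] = process_one(str(path))
        except Exception as exc:
            failures[path.name] = str(exc)
    return results, failures
```

Passing the processing function as an argument keeps the loop independent of any one uploader, for example `process_receipt_folder("receipts/", upload_receipt)`.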

Step 3: Extracting Line Item Data

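Once the receipt is processed, you can retrieve its line items. The code in this step fetches the full document by ID and reads its line_items array; Veryfi also exposes a “Get a Line Item” endpoint for individual items. The API lets you assign your own ID to a document (the external_id set during upload), which is useful for mapping the document to an external service or resource. (Veryfi API Documentation)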

Retrieving Line Items

def get_line_items(document_id):
    """
    Retrieve line items from a processed receipt
    """
    url = f"{BASE_URL}{document_id}/"
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Client-Id': CLIENT_ID,
        'Authorization': f'apikey {USERNAME}:{API_KEY}'
    }

    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        document_data = response.json()
        return extract_line_item_details(document_data)
    else:
        raise Exception(f"Failed to retrieve document: {response.status_code}")

def extract_line_item_details(document_data):
    """
    Extract relevant line item fields for analysis
    """
    line_items = []

    # Fields can be present but null in the JSON, so `or` supplies the fallback
    vendor = document_data.get('vendor') or {}

    for line_number, item in enumerate(document_data.get('line_items') or [], start=1):
        line_item = {
            'line_number': line_number,  # Position on the receipt, used for unique IDs
            'description': item.get('description') or '',
            'quantity': item.get('quantity') or 1,
            'unit_price': item.get('unit_price') or 0,
            'total': item.get('total') or 0,
            'discount': item.get('discount') or 0,
            'upc': item.get('upc') or '',
            'sku': item.get('sku') or '',
            'category': item.get('category') or '',
            'brand': item.get('brand') or '',
            'size': item.get('size') or '',
            'weight': item.get('weight') or '',
            'date': document_data.get('date') or '',
            'vendor': vendor.get('name') or '',
            'store_number': document_data.get('store_number') or '',
            'receipt_id': document_data.get('id') or ''
        }
        line_items.append(line_item)

    return line_items
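
Downstream arithmetic such as subtracting a discount from a total only works on numbers. A minimal sketch of a coercion helper, assuming numeric fields may arrive as numbers, as strings with currency symbols, or as null; `to_number` is a hypothetical name, and it handles only the `$` symbol and comma thousands separators.

```python
def to_number(value, default=0.0):
    """Convert an extracted field to float, returning default when it cannot be parsed."""
    if value is None:
        return default
    if isinstance(value, (int, float)):
        return float(value)
    # Strip a dollar sign and thousands separators before parsing
    cleaned = str(value).replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return default
```

Wrapping fields such as `quantity`, `unit_price`, `total`, and `discount` in `to_number` before building the line item dictionary keeps later calculations from failing on mixed types.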

cURL Example for Line Item Retrieval

For developers who prefer cURL, here’s how to retrieve line item data:

curl -X GET "https://api.veryfi.com/api/v8/partner/documents/{document_id}/" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Client-Id: your_client_id" \
  -H "Authorization: apikey your_username:your_api_key"

Step 4: Product Matching and Data Enrichment

Veryfi’s RESTful JSON API works out of the box with no training or setup on your side: its models are pre-trained on millions of CPG receipts, which supports accuracy down to the line-item level. (Veryfi CPG Toolkit) The platform automatically corrects image distortions such as pincushion and barrel distortion, detects blur, and adjusts for perspective issues.

UPC Product Matching Service

def enrich_with_product_matching(line_items):
    """
    Enrich line items with product matching data
    """
    enriched_items = []

    for item in line_items:
        if item.get('upc'):
            # Call product matching service
            product_data = match_product_by_upc(item['upc'])

            if product_data:
                item.update({
                    'normalized_brand': product_data.get('brand', ''),
                    'normalized_size': product_data.get('size', ''),
                    'normalized_flavor': product_data.get('flavor', ''),
                    'product_category': product_data.get('category', ''),
                    'manufacturer': product_data.get('manufacturer', ''),
                    'ingredients': product_data.get('ingredients', []),
                    'nutritional_info': product_data.get('nutrition', {})
                })

        enriched_items.append(item)

    return enriched_items

def match_product_by_upc(upc):
    """
    Match product information using UPC code
    """
    # Placeholder: this URL is illustrative. Confirm the actual product matching
    # endpoint and response fields for your account before using this function.
    product_match_url = f"https://api.veryfi.com/api/v8/partner/products/upc/{upc}/"

    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Client-Id': CLIENT_ID,
        'Authorization': f'apikey {USERNAME}:{API_KEY}'
    }

    response = requests.get(product_match_url, headers=headers)

    if response.status_code == 200:
        return response.json()
    else:
        return None
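
OCR can misread a digit, and the same product appears on many receipts, so it can help to validate codes and cache lookups before calling a matching service. A minimal sketch using the standard UPC-A check digit rule; `is_valid_upc_a` and `make_cached_matcher` are hypothetical helpers, and rejecting everything that is not a 12-digit UPC-A is an assumption, since receipts may also print EAN-13 or store-specific codes.

```python
def is_valid_upc_a(code):
    """Return True if code is 12 ASCII digits with a correct UPC-A check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    odd_sum = sum(digits[0:11:2])   # Positions 1, 3, ..., 11
    even_sum = sum(digits[1:10:2])  # Positions 2, 4, ..., 10
    check = (10 - (odd_sum * 3 + even_sum) % 10) % 10
    return check == digits[11]


def make_cached_matcher(lookup):
    """Wrap a lookup(upc) function so each valid UPC is looked up only once."""
    cache = {}

    def match(upc):
        if not is_valid_upc_a(upc):
            return None  # Skip the network call for codes that cannot be valid
        if upc not in cache:
            cache[upc] = lookup(upc)
        return cache[upc]

    return match
```

For example, `cached_match = make_cached_matcher(match_product_by_upc)` could replace the direct call inside the enrichment loop.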

Step 5: Building Your Transactions Database

With enriched line item data, you can now build a comprehensive transactions database that supports advanced analytics and cross-basket analysis.

Database Schema Design

import pandas as pd
from datetime import datetime

def create_transactions_table(enriched_line_items):
    """
    Create a structured transactions table from enriched line items
    """
    transactions = []

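    for index, item in enumerate(enriched_line_items):
        transaction = {
            # Fall back to the row position so IDs stay unique when no line_number is present
            'transaction_id': f"{item['receipt_id']}_{item.get('line_number', index)}",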
            'receipt_id': item['receipt_id'],
            'date': item['date'],
            'store_name': item['vendor'],
            'store_number': item['store_number'],
            'product_description': item['description'],
            'upc': item['upc'],
            'sku': item['sku'],
            'brand': item.get('normalized_brand', item.get('brand', '')),
            'category': item.get('product_category', item.get('category', '')),
            'size': item.get('normalized_size', item.get('size', '')),
            'flavor': item.get('normalized_flavor', ''),
            'manufacturer': item.get('manufacturer', ''),
            'quantity': item['quantity'],
            'unit_price': item['unit_price'],
            'total_price': item['total'],
            'discount_amount': item['discount'],
            'final_price': item['total'] - item['discount'],
            'created_at': datetime.now().isoformat()
        }
        transactions.append(transaction)

    return pd.DataFrame(transactions)

# Example usage
def process_receipt_to_database(image_path):
    """
    Complete pipeline from receipt image to database record
    """
    # Step 1: Upload and process receipt
    receipt_data = upload_receipt(image_path)
    document_id = receipt_data['id']

    # Step 2: Extract line items
    line_items = get_line_items(document_id)

    # Step 3: Enrich with product matching
    enriched_items = enrich_with_product_matching(line_items)

    # Step 4: Create transactions table
    transactions_df = create_transactions_table(enriched_items)

    return transactions_df
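
A DataFrame lives in memory, so a ready-to-query table needs to be persisted somewhere. A minimal sketch using SQLite from the standard library; the table layout, the `COLUMNS` subset, and `save_transactions` are assumptions for this example, and rows are expected as plain dictionaries of Python values.

```python
import sqlite3

# Assumed subset of the transaction fields, kept short for the example
COLUMNS = ["transaction_id", "receipt_id", "date", "store_name",
           "brand", "category", "quantity", "final_price"]


def save_transactions(conn, rows):
    """Create the transactions table if needed and upsert rows by transaction_id."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS transactions (
               transaction_id TEXT PRIMARY KEY,
               receipt_id TEXT, date TEXT, store_name TEXT,
               brand TEXT, category TEXT, quantity REAL, final_price REAL)"""
    )
    placeholders = ", ".join(f":{col}" for col in COLUMNS)
    # INSERT OR REPLACE makes re-processing a receipt idempotent
    conn.executemany(
        f"INSERT OR REPLACE INTO transactions ({', '.join(COLUMNS)}) "
        f"VALUES ({placeholders})",
        rows,
    )
    conn.commit()
```

Once loaded, the table answers questions such as `SELECT brand, SUM(final_price) FROM transactions GROUP BY brand` directly in SQL.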

Step 6: Advanced Analytics and Insights

With your transactions database in place, you can now perform sophisticated cross-basket analytics to understand consumer behavior patterns.

Brand Loyalty Analysis

def analyze_brand_loyalty(transactions_df):
    """
    Analyze brand switching and loyalty patterns
    """
    # Aggregate by brand within category; add a customer key here if one is available
    brand_analysis = transactions_df.groupby(['brand', 'category']).agg({
        'quantity': 'sum',
        'final_price': 'sum',
        'receipt_id': 'nunique'
    }).reset_index()

    brand_analysis.columns = ['brand', 'category', 'total_quantity', 
                             'total_spend', 'purchase_occasions']

    # Calculate market share within categories
    brand_analysis['category_share'] = brand_analysis.groupby('category')['total_spend'].transform(
        lambda x: x / x.sum() * 100
    )

    return brand_analysis

def basket_composition_analysis(transactions_df):
    """
    Analyze product affinities and basket composition
    """
    # Group by receipt to analyze basket composition
    basket_analysis = transactions_df.groupby('receipt_id').agg({
        'category': lambda x: list(x.unique()),
        'brand': lambda x: list(x.unique()),
        'final_price': 'sum',
        'quantity': 'sum'
    }).reset_index()

    # Find common category combinations
    from itertools import combinations
    category_pairs = []

    for categories in basket_analysis['category']:
        if len(categories) > 1:
            # Sort first so (A, B) and (B, A) count as the same pair;
            # combinations yields tuples, which are hashable for value_counts
            for pair in combinations(sorted(categories), 2):
                category_pairs.append(pair)

    # Count frequency of category pairs
    pair_counts = pd.Series(category_pairs).value_counts()

    return pair_counts
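
Raw pair counts favor categories that are simply popular. Lift corrects for that by comparing how often two categories appear together with how often they would if they were independent. A minimal sketch in plain Python; `category_pair_lift` is a hypothetical helper, and its input is one collection of categories per receipt.

```python
from collections import Counter
from itertools import combinations


def category_pair_lift(baskets):
    """Return {(cat_a, cat_b): lift} for every category pair seen together.

    lift = P(a and b) / (P(a) * P(b)); values above 1 suggest the pair
    co-occurs more often than independent purchases would predict.
    """
    baskets = [set(basket) for basket in baskets]
    n = len(baskets)
    item_counts = Counter(cat for basket in baskets for cat in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    return {
        pair: (count / n) / ((item_counts[pair[0]] / n) * (item_counts[pair[1]] / n))
        for pair, count in pair_counts.items()
    }
```

With only a handful of receipts, lift values are noisy, so it is worth filtering out pairs with very low counts before acting on them.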

Price Sensitivity and Promotion Analysis

def analyze_price_sensitivity(transactions_df):
    """
    Analyze how discounts affect purchase behavior
    """
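    # Work on a copy so the caller's DataFrame is not modified
    df = transactions_df.copy()

    # Discount as a share of the pre-discount line total. This matches
    # final_price = total_price - discount_amount, where total_price is the gross amount.
    gross = df['total_price'].where(df['total_price'] > 0)  # NaN avoids division by zero
    df['discount_percentage'] = df['discount_amount'] / gross * 100

    # Group by discount ranges; include_lowest keeps undiscounted (0%) rows in the first bin
    discount_ranges = pd.cut(df['discount_percentage'],
                           bins=[0, 10, 25, 50, 100],
                           labels=['0-10%', '10-25%', '25-50%', '50%+'],
                           include_lowest=True)

    promotion_analysis = df.groupby(discount_ranges, observed=False).agg({
        'quantity': 'mean',
        'final_price': 'mean',
        'receipt_id': 'nunique'
    }).reset_index()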

    return promotion_analysis

Step 7: Data Quality and Duplicate Detection

Veryfi’s platform can identify duplicate receipts used to claim coupons, vouchers, or cash back, helping protect campaign investments and distribution volume. (Veryfi CPG Toolkit)

Duplicate Receipt Detection

def detect_duplicate_receipts(transactions_df):
    """
    Identify potential duplicate receipts for fraud prevention
    """
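    # Collapse line items to one row per receipt so duplicates are compared
    # on the receipt total rather than on individual line prices
    receipts = transactions_df.groupby(
        ['receipt_id', 'store_name', 'date'], as_index=False
    )['final_price'].sum()
    receipts['final_price'] = receipts['final_price'].round(2)

    # Receipts from the same store and date with the same total are candidates
    duplicate_candidates = receipts.groupby([
        'store_name', 'date', 'final_price'
    ]).agg(
        receipt_count=('receipt_id', 'nunique'),
        receipt_ids=('receipt_id', list)
    ).reset_index()

    # Flag potential duplicates
    duplicates = duplicate_candidates[duplicate_candidates['receipt_count'] > 1]

    return duplicates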
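
Matching on store, date, and price can flag two different shoppers who happened to spend the same amount. A content fingerprint that also covers the line items is stricter. A minimal sketch, offered as a client-side heuristic and not as Veryfi's own duplicate detection; `receipt_fingerprint` is a hypothetical helper, and it assumes dates begin with `YYYY-MM-DD`.

```python
import hashlib


def receipt_fingerprint(store_name, date, line_items):
    """Hash store, purchase date, and normalized line items into a stable hex digest.

    Item order, letter case, and surrounding whitespace do not affect the result.
    """
    normalized_items = sorted(
        (str(item["description"]).strip().lower(), round(float(item["total"]), 2))
        for item in line_items
    )
    canonical = "|".join(
        [store_name.strip().lower(), str(date)[:10]]  # Keep only the date part
        + [f"{desc}:{total:.2f}" for desc, total in normalized_items]
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the fingerprint alongside each receipt turns duplicate checks into a simple equality lookup. Small OCR differences between two scans of the same receipt will still produce different fingerprints, so this complements the grouping approach and does not replace it.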

def validate_data_quality(transactions_df):
    """
    Perform data quality checks on extracted data
    """
    quality_report = {
        'total_transactions': len(transactions_df),
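        # Extraction defaults missing fields to '', so count empty strings as well as NaN
        'missing_upc': (transactions_df['upc'].isna() | (transactions_df['upc'] == '')).sum(),
        'missing_brand': (transactions_df['brand'].isna() | (transactions_df['brand'] == '')).sum(),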
        'zero_price': (transactions_df['final_price'] == 0).sum(),
        'negative_quantity': (transactions_df['quantity'] < 0).sum(),
        'date_range': {
            'earliest': transactions_df['date'].min(),
            'latest': transactions_df['date'].max()
        }
    }

    return quality_report

Step 8: Integration with Analytics Platforms

Your enriched transactions data can now be integrated with various analytics and business intelligence platforms for deeper insights.

Export to Common Formats

def export_for_analytics(transactions_df, format_type='csv'):
    """
    Export transactions data for external analytics platforms
    """
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

    if format_type == 'csv':
        filename = f'transactions_{timestamp}.csv'
        transactions_df.to_csv(filename, index=False)
    elif format_type == 'json':
        filename = f'transactions_{timestamp}.json'
        transactions_df.to_json(filename, orient='records', date_format='iso')
    elif format_type == 'parquet':
        # Requires a parquet engine such as pyarrow or fastparquet
        filename = f'transactions_{timestamp}.parquet'
        transactions_df.to_parquet(filename, index=False)
    else:
        # Without this branch, filename would be undefined at the return below
        raise ValueError(f"Unsupported format_type: {format_type!r}")

    return filename

def create_analytics_summary(transactions_df):
    """
    Create summary statistics for dashboard consumption
    """
    summary = {
        'total_spend': transactions_df['final_price'].sum(),
        'total_transactions': len(transactions_df),
        'unique_products': transactions_df['upc'].nunique(),
        'unique_brands': transactions_df['brand'].nunique(),
        'unique_stores': transactions_df['store_name'].nunique(),
        # Mean spend per receipt (basket value), not the number of items
        'average_basket_size': transactions_df.groupby('receipt_id')['final_price'].sum().mean(),
        'top_categories': transactions_df.groupby('category')['final_price'].sum().nlargest(10).to_dict(),
        'top_brands': transactions_df.groupby('brand')['final_price'].sum().nlargest(10).to_dict()
    }

    return summary

Real-World Applications and Use Cases

The SKU-level basket data extracted through this process enables numerous real-world applications across different industries and use cases.

Retail and CPG Manufacturers

Veryfi’s technology provides insights into what consumers buy, where they shop, at what frequency, how much they spend, and much more. (Veryfi CPG Toolkit) This comprehensive view of consumer behavior has proven valuable in understanding market dynamics.

FAQ

What is SKU-level basket data and why is it important for retailers?

SKU-level basket data refers to detailed information about individual products (Stock Keeping Units) purchased together in a single transaction. This data is crucial for understanding consumer purchasing behavior, cross-basket analytics, and brand loyalty patterns. It enables retailers and brands to optimize product placement, develop targeted marketing campaigns, and improve inventory management strategies.

How does Veryfi’s OCR technology extract data from receipts?

Veryfi’s Receipt OCR API uses advanced machine learning and artificial intelligence to convert receipt images into machine-encoded text. The technology can process receipts in 91 currencies and 38 languages, extracting data 200x faster and 10x more accurately than manual human processing. It eliminates the need for manual labor while providing real-time data extraction from unstructured documents.

What is product matching and how does it enhance receipt data?

Product matching is the process of linking extracted receipt line items to standardized product databases using SKU codes, UPC numbers, or product descriptions. This enhancement transforms raw receipt text into structured, actionable data that can be used for market analysis, competitive intelligence, and consumer behavior insights. It enables brands to track their products across multiple retailers and understand market share.

How can Veryfi’s CPG Toolkit help with consumer spending analysis?

Veryfi’s CPG Toolkit provides real-time tools for retail manufacturers and digital marketing companies to understand consumer spending behavior and brand loyalty. The toolkit analyzes consumer packaged goods purchases including food, beverages, toiletries, and cleaning products. This data can be used to create precisely targeted coupons, vouchers, and loyalty programs at scale, enriching the consumer experience.

What types of documents can Veryfi’s OCR APIs process besides receipts?

Veryfi’s OCR APIs can extract data from a wide variety of documents including invoices, W2s, W9s, bank checks, business cards, purchase orders, bills of lading, hotel folios, bank statements, credit cards, and ID cards. The platform uses deterministic, day-1 ready AI models that provide accurate data extraction across multiple document types and formats.

How can businesses integrate Veryfi’s document capture capabilities into their applications?

Businesses can integrate Veryfi’s document capture through Veryfi Lens, a software solution that can be embedded into mobile and web applications. Built in native code and optimized for performance, Veryfi Lens handles complexities like frame processing, asset preprocessing, and machine vision challenges. It provides a clean user experience with low memory usage while delivering fast, accurate document capture capabilities.