From Receipt to Insight: How to Extract SKU-Level Basket Data with Veryfi OCR & Product Matching

August 28, 2025

    Introduction

    In today’s data-driven retail landscape, understanding consumer purchasing behavior at the SKU level has become crucial for brands, retailers, and marketers. Traditional point-of-sale data is often siloed within individual retailers, making it challenging to gain comprehensive insights into cross-basket analytics and brand loyalty patterns. However, with the advent of advanced OCR technology and product matching services, businesses can now extract valuable SKU-level data directly from consumer receipts to fuel downstream analytics and personalization efforts.

    Veryfi’s AI-native intelligent document-processing platform offers lightning-fast OCR APIs that transform unstructured receipt data into structured, analyzable information in just 3-5 seconds. (Veryfi OCR API Platform) The platform supports 91 currencies and 38 languages, making it ideal for global retail analysis. (Veryfi Receipt OCR API) This comprehensive guide will walk developers through every step of the process, from uploading receipts to Veryfi’s OCR API to enriching line items with product matching services, ultimately creating a ready-to-query transactions table for advanced analytics.

    The Power of Receipt-Based Analytics

    Consumer packaged goods (CPG) purchase data has traditionally been controlled by retailers who own the point-of-sale systems. (Veryfi CPG Toolkit) However, receipt-based analytics democratizes access to this valuable information, enabling brands and marketers to understand consumer behavior across multiple retailers and channels.

    Veryfi’s CPG Toolkit empowers retail manufacturers and digital marketing companies with real-time tools to unearth consumer spend behavior, brand loyalty, store insights, and much more. (Veryfi CPG Toolkit) This data can then be used to enrich consumer experiences through precisely targeted coupons, vouchers, and loyalty programs at scale.

    Key Benefits of SKU-Level Receipt Analysis

    • Cross-retailer insights: Understand purchasing patterns across different stores and chains
    • Brand loyalty tracking: Identify switching behaviors and competitive dynamics
    • Basket composition analysis: Discover product affinities and cross-selling opportunities
    • Price sensitivity mapping: Analyze how promotions and discounts affect purchase decisions
    • Geographic and demographic segmentation: Correlate purchasing patterns with location and customer profiles

    Step 1: Setting Up Your Development Environment

    Before diving into the code, you’ll need to set up your development environment and obtain the necessary API credentials from Veryfi.

    Prerequisites

    • Python 3.7 or higher
    • Veryfi API credentials (Client ID, Username, API Key)
    • Basic understanding of REST APIs and JSON handling

    Installing Required Libraries

    pip install requests
    pip install pandas  # For data manipulation
    # json, base64, hashlib, hmac, and time ship with Python's standard library and need no install

    Authentication Setup

    import requests
    import json
    import base64
    import hashlib
    import hmac
    import time
    
    # Veryfi API credentials
    CLIENT_ID = "your_client_id"
    USERNAME = "your_username"
    API_KEY = "your_api_key"
    BASE_URL = "https://api.veryfi.com/api/v8/partner/documents/"
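    Hardcoded credentials are easy to leak through version control. As a minimal sketch, the values above could instead be read from environment variables. The variable names used here (VERYFI_CLIENT_ID and so on) and the load_veryfi_credentials helper are assumptions for illustration, not names required by Veryfi.

```python
import os


def load_veryfi_credentials(env=None):
    """Read API credentials from environment variables.

    The variable names are illustrative; pick whatever fits your deployment.
    """
    env = os.environ if env is None else env
    required = {
        "client_id": "VERYFI_CLIENT_ID",
        "username": "VERYFI_USERNAME",
        "api_key": "VERYFI_API_KEY",
    }
    # Fail early with a clear message instead of sending empty credentials
    missing = [name for name in required.values() if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {field: env[name] for field, name in required.items()}
```

    The returned dictionary can then populate CLIENT_ID, USERNAME, and API_KEY at startup.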

    Step 2: Uploading Receipts to Veryfi’s OCR API

    Veryfi’s Receipt OCR API can extract data from receipts in 91 currencies and 38 languages, using optical character recognition technology to convert receipt images into machine-encoded text. (Veryfi Receipt OCR API) This replaces manual data entry, which is slow and prone to errors.

    Basic Receipt Upload Function

    def upload_receipt(image_path):
        """
        Upload a receipt image to Veryfi's OCR API
        """
        # Read and encode the image
        with open(image_path, 'rb') as image_file:
            image_data = base64.b64encode(image_file.read()).decode('utf-8')
    
        # Prepare the request payload
        payload = {
            'file_data': image_data,
            'file_name': image_path.split('/')[-1],
            'categories': ['Grocery', 'Gas Station', 'Restaurant'],
            'auto_delete': False,
            'boost_mode': 0,  # 1 speeds up processing by skipping data enrichment; keep 0 for full line-item detail
            'external_id': f"receipt_{int(time.time())}"
        }
    
        # Generate authentication headers
        headers = generate_headers(payload)
    
        # Make the API request
        response = requests.post(BASE_URL, json=payload, headers=headers)
    
        if response.status_code == 201:
            return response.json()
        else:
            raise Exception(f"API request failed: {response.status_code} - {response.text}")
    
    def generate_headers(payload):
        """
        Generate authentication headers for Veryfi API
        """
        timestamp = int(time.time() * 1000)
        signature = generate_signature(payload, timestamp)
    
        return {
            'Content-Type': 'application/json',
            'Accept': 'application/json',
            'Client-Id': CLIENT_ID,
            'Authorization': f'apikey {USERNAME}:{API_KEY}',
            'X-Veryfi-Request-Timestamp': str(timestamp),
            'X-Veryfi-Request-Signature': signature
        }
    
    def generate_signature(payload, timestamp):
        """
        Generate HMAC signature for API authentication
        """
        payload_string = json.dumps(payload, separators=(',', ':'))
        message = f"{timestamp},{payload_string}"
        signature = hmac.new(
            API_KEY.encode('utf-8'),
            message.encode('utf-8'),
            hashlib.sha256
        ).hexdigest()
        return signature
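    In practice receipts arrive in batches, and individual uploads can fail for transient reasons. The sketch below walks a folder and retries each failed upload with exponential backoff. It is written against any callable that takes a file path, so the upload_receipt function above can be passed in. The upload_folder name, the list of accepted suffixes, and the retry settings are assumptions chosen for illustration.

```python
import time
from pathlib import Path

# Assumed set of file types to pick up; adjust to what your capture flow produces
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".pdf"}


def upload_folder(folder, upload, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Upload every receipt file in `folder`, retrying failures with backoff.

    `upload` is any callable that accepts a file path and returns the parsed
    response, for example upload_receipt. Returns (results, failures), both
    keyed by file name.
    """
    results, failures = {}, {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                results[path.name] = upload(str(path))
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this file and record why
                    failures[path.name] = str(exc)
                else:
                    # Wait 1x, 2x, 4x ... base_delay between attempts
                    sleep(base_delay * 2 ** (attempt - 1))
    return results, failures
```

    A call such as upload_folder("receipts/", upload_receipt) returns the successful responses alongside a record of files that still failed after the final attempt.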

    Handling Long CPG Receipts

    Veryfi’s CPG receipt support lets customers capture a long receipt in one pass, much like taking a panoramic photo, and produces a single stitched image of the receipt. (Veryfi CPG Receipts) This feature works with any type of CPG receipt, from Safeway and Tesco to Wegmans, Giant Eagle, Smart & Final, Five Below, Whole Foods Market, Coles, and other retailers, including the extremely long receipts from CVS.

    def process_long_receipt(image_path):
        """
        Process long CPG receipts with enhanced settings
        """
        # Read and encode the image, closing the file handle afterwards
        with open(image_path, 'rb') as image_file:
            image_data = base64.b64encode(image_file.read()).decode('utf-8')
    
        payload = {
            'file_data': image_data,
            'file_name': image_path.split('/')[-1],
            'categories': ['Grocery'],
            'boost_mode': 0,  # 1 speeds up processing by skipping data enrichment
            'auto_delete': False,
            'max_pages_to_process': 10,  # Handle multi-page receipts
            'line_items': True,  # Ensure line item extraction
            'external_id': f"long_receipt_{int(time.time())}"
        }
    
        headers = generate_headers(payload)
        response = requests.post(BASE_URL, json=payload, headers=headers)
    
        # Returns None on failure, so callers should check the result before use
        return response.json() if response.status_code == 201 else None

    Step 3: Extracting Line Item Data

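    Once the receipt is processed, you can retrieve detailed line item information. Veryfi offers a “Get a Line Item” endpoint; the function below takes the simpler route of fetching the whole document by its ID and reading the line_items array from the response. The API also allows users to assign their own ID to documents (the external_id set during upload), which is useful for mapping the document to an external service or resource. (Veryfi API Documentation)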

    Retrieving Line Items

    def get_line_items(document_id):
        """
        Retrieve line items from a processed receipt
        """
        url = f"{BASE_URL}{document_id}/"
        headers = {
            'Content-Type': 'application/json',
            'Accept': 'application/json',
            'Client-Id': CLIENT_ID,
            'Authorization': f'apikey {USERNAME}:{API_KEY}'
        }
    
        response = requests.get(url, headers=headers)
    
        if response.status_code == 200:
            document_data = response.json()
            return extract_line_item_details(document_data)
        else:
            raise Exception(f"Failed to retrieve document: {response.status_code}")
    
    def extract_line_item_details(document_data):
        """
        Extract relevant line item fields for analysis
        """
        line_items = []
    
        # JSON fields can be present but null, and dict.get defaults only apply
        # to missing keys, so fall back with `or` to keep later arithmetic safe
        vendor = document_data.get('vendor') or {}
    
        for item in document_data.get('line_items') or []:
            line_item = {
                'description': item.get('description') or '',
                'quantity': item.get('quantity') or 1,
                'unit_price': item.get('unit_price') or 0,
                'total': item.get('total') or 0,
                'discount': item.get('discount') or 0,
                'upc': item.get('upc') or '',
                'sku': item.get('sku') or '',
                'category': item.get('category') or '',
                'brand': item.get('brand') or '',
                'size': item.get('size') or '',
                'weight': item.get('weight') or '',
                'date': document_data.get('date') or '',
                'vendor': vendor.get('name') or '',
                'store_number': document_data.get('store_number') or '',
                'receipt_id': document_data.get('id') or ''
            }
            line_items.append(line_item)
    
        return line_items
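    A quick sanity check on extraction quality is to compare the sum of the line totals with the subtotal printed on the receipt. The sketch below assumes the document response carries a subtotal field and that each line item has a total. The reconcile_line_items helper and the 0.05 tolerance are illustrative choices, not part of Veryfi's API.

```python
from decimal import Decimal


def reconcile_line_items(document_data, tolerance=Decimal("0.05")):
    """Compare the sum of line totals against the receipt subtotal.

    Assumes a 'subtotal' field on the document and a 'total' on each line item.
    'ok' is None when the document has no subtotal to compare against.
    """
    items = document_data.get("line_items") or []
    # Decimal avoids floating-point drift when adding currency amounts
    line_sum = sum(
        (Decimal(str(item.get("total") or 0)) for item in items),
        Decimal("0"),
    )
    subtotal = document_data.get("subtotal")
    if subtotal is None:
        return {"line_sum": line_sum, "subtotal": None, "difference": None, "ok": None}
    subtotal = Decimal(str(subtotal))
    difference = line_sum - subtotal
    return {
        "line_sum": line_sum,
        "subtotal": subtotal,
        "difference": difference,
        "ok": abs(difference) <= tolerance,
    }
```

    Receipts where ok is False are good candidates for manual review, since a missed or misread line item usually shows up as a gap between the two figures.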

    cURL Example for Line Item Retrieval

    For developers who prefer cURL, here’s how to retrieve line item data:

    curl -X GET "https://api.veryfi.com/api/v8/partner/documents/{document_id}/" \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -H "Client-Id: your_client_id" \
      -H "Authorization: apikey your_username:your_api_key"

    Step 4: Product Matching and Data Enrichment

    Veryfi’s RESTful JSON API works out of the box with no training or setup, because its models are already trained on millions of CPG receipts for accuracy down to the line-item level. (Veryfi CPG Toolkit) The platform automatically corrects image distortions such as pincushion and barrel distortion, detects blur, and adjusts for perspective issues.

    UPC Product Matching Service

    def enrich_with_product_matching(line_items):
        """
        Enrich line items with product matching data
        """
        enriched_items = []
    
        for item in line_items:
            if item.get('upc'):
                # Call product matching service
                product_data = match_product_by_upc(item['upc'])
    
                if product_data:
                    item.update({
                        'normalized_brand': product_data.get('brand', ''),
                        'normalized_size': product_data.get('size', ''),
                        'normalized_flavor': product_data.get('flavor', ''),
                        'product_category': product_data.get('category', ''),
                        'manufacturer': product_data.get('manufacturer', ''),
                        'ingredients': product_data.get('ingredients', []),
                        'nutritional_info': product_data.get('nutrition', {})
                    })
    
            enriched_items.append(item)
    
        return enriched_items
    
    def match_product_by_upc(upc):
        """
        Match product information using UPC code
        """
        # Placeholder: this would call a product matching service.
        # The URL below is illustrative; the actual endpoint depends on the service you use.
        product_match_url = f"https://api.veryfi.com/api/v8/partner/products/upc/{upc}/"
    
        headers = {
            'Content-Type': 'application/json',
            'Accept': 'application/json',
            'Client-Id': CLIENT_ID,
            'Authorization': f'apikey {USERNAME}:{API_KEY}'
        }
    
        response = requests.get(product_match_url, headers=headers)
    
        if response.status_code == 200:
            return response.json()
        else:
            return None
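    OCR can misread a digit in a barcode number, and a lookup on a corrupted UPC wastes a request. A 12-digit UPC-A code carries a check digit, so it can be validated locally before any matching call. The sketch below implements the standard UPC-A check-digit rule; the is_valid_upc_a name is an illustrative choice, and codes in other formats (such as EAN-13 or retailer-specific SKUs) would need their own handling.

```python
def is_valid_upc_a(code):
    """Return True if `code` is a 12-digit UPC-A string with a correct check digit."""
    if not isinstance(code, str) or len(code) != 12:
        return False
    if not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    odd_sum = sum(digits[0:11:2])   # positions 1, 3, ..., 11
    even_sum = sum(digits[1:10:2])  # positions 2, 4, ..., 10
    # The check digit brings (3 * odd_sum + even_sum + check) to a multiple of 10
    check = (10 - (odd_sum * 3 + even_sum) % 10) % 10
    return check == digits[11]
```

    For example, is_valid_upc_a("036000291452") returns True, while changing the last digit makes it return False. A guard like this could sit in front of match_product_by_upc so that only plausible codes are sent for matching.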

    Step 5: Building Your Transactions Database

    With enriched line item data, you can now build a comprehensive transactions database that supports advanced analytics and cross-basket analysis.

    Database Schema Design

    import pandas as pd
    from datetime import datetime
    
    def create_transactions_table(enriched_line_items):
        """
        Create a structured transactions table from enriched line items
        """
        transactions = []
    
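        # Number lines within each receipt so every transaction_id is unique
        line_counters = {}
    
        for item in enriched_line_items:
            line_number = line_counters.get(item['receipt_id'], 0) + 1
            line_counters[item['receipt_id']] = line_number
            transaction = {
                'transaction_id': f"{item['receipt_id']}_{item.get('line_number', line_number)}",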
                'receipt_id': item['receipt_id'],
                'date': item['date'],
                'store_name': item['vendor'],
                'store_number': item['store_number'],
                'product_description': item['description'],
                'upc': item['upc'],
                'sku': item['sku'],
                'brand': item.get('normalized_brand', item.get('brand', '')),
                'category': item.get('product_category', item.get('category', '')),
                'size': item.get('normalized_size', item.get('size', '')),
                'flavor': item.get('normalized_flavor', ''),
                'manufacturer': item.get('manufacturer', ''),
                'quantity': item['quantity'],
                'unit_price': item['unit_price'],
                'total_price': item['total'],
                'discount_amount': item['discount'],
                'final_price': (item['total'] or 0) - (item['discount'] or 0),  # Treat null amounts as 0
                'created_at': datetime.now().isoformat()
            }
            transactions.append(transaction)
    
        return pd.DataFrame(transactions)
    
    # Example usage
    def process_receipt_to_database(image_path):
        """
        Complete pipeline from receipt image to database record
        """
        # Step 1: Upload and process receipt
        receipt_data = upload_receipt(image_path)
        document_id = receipt_data['id']
    
        # Step 2: Extract line items
        line_items = get_line_items(document_id)
    
        # Step 3: Enrich with product matching
        enriched_items = enrich_with_product_matching(line_items)
    
        # Step 4: Create transactions table
        transactions_df = create_transactions_table(enriched_items)
    
        return transactions_df
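    To make the result queryable with SQL, the rows can be persisted to a database. The sketch below uses Python's built-in sqlite3 module and stores a subset of the columns. The table layout, the save_transactions and spend_by_brand helpers, and the choice of SQLite are assumptions for illustration; a production setup would likely use a warehouse or managed database.

```python
import sqlite3

# Subset of the transactions columns kept in this sketch
COLUMNS = [
    "transaction_id", "receipt_id", "date", "store_name",
    "brand", "category", "quantity", "final_price",
]


def save_transactions(conn, rows):
    """Insert transaction rows (a list of dicts) into a SQLite table.

    transaction_id is the primary key, so re-running the pipeline on the
    same receipt replaces rows instead of duplicating them.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS transactions (
               transaction_id TEXT PRIMARY KEY,
               receipt_id TEXT, date TEXT, store_name TEXT,
               brand TEXT, category TEXT,
               quantity REAL, final_price REAL)"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [tuple(row.get(col) for col in COLUMNS) for row in rows],
    )
    conn.commit()


def spend_by_brand(conn):
    """Example query: total spend per brand, highest first."""
    return conn.execute(
        "SELECT brand, ROUND(SUM(final_price), 2) AS spend "
        "FROM transactions GROUP BY brand ORDER BY spend DESC"
    ).fetchall()
```

    With the pipeline above, the rows can be obtained with transactions_df.to_dict('records') and passed to save_transactions along with a connection such as sqlite3.connect("transactions.db").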

    Step 6: Advanced Analytics and Insights

    With your transactions database in place, you can now perform sophisticated cross-basket analytics to understand consumer behavior patterns.

    Brand Loyalty Analysis

    def analyze_brand_loyalty(transactions_df):
        """
        Analyze brand switching and loyalty patterns
        """
        # Aggregate volume, spend, and distinct receipts per brand within each category
        brand_analysis = transactions_df.groupby(['brand', 'category']).agg({
            'quantity': 'sum',
            'final_price': 'sum',
            'receipt_id': 'nunique'
        }).reset_index()
    
        brand_analysis.columns = ['brand', 'category', 'total_quantity', 
                                 'total_spend', 'purchase_occasions']
    
        # Calculate market share within categories
        brand_analysis['category_share'] = brand_analysis.groupby('category')['total_spend'].transform(
            lambda x: x / x.sum() * 100
        )
    
        return brand_analysis
    
    def basket_composition_analysis(transactions_df):
        """
        Analyze product affinities and basket composition
        """
        # Group by receipt to analyze basket composition
        basket_analysis = transactions_df.groupby('receipt_id').agg({
            'category': lambda x: list(x.unique()),
            'brand': lambda x: list(x.unique()),
            'final_price': 'sum',
            'quantity': 'sum'
        }).reset_index()
    
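        # Find common category combinations
        from itertools import combinations
        category_pairs = []
    
        for categories in basket_analysis['category']:
            if len(categories) > 1:
                for pair in combinations(categories, 2):
                    # Store pairs as sorted tuples: tuples are hashable (lists are
                    # not, which would make value_counts fail) and sorting makes
                    # (A, B) and (B, A) count as the same pair
                    category_pairs.append(tuple(sorted(pair)))
    
        # Count frequency of category pairs
        pair_counts = pd.Series(category_pairs).value_counts()
    
        return pair_counts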
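    Raw pair counts favor categories that are simply popular. A common refinement is lift, which divides how often two categories appear together by how often that would be expected if they were independent. The sketch below computes support and lift from a list of baskets using only the standard library; the pair_lift name and the input shape (one set of categories per receipt) are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Compute support and lift for every category pair.

    `baskets` is a list with one collection of categories per receipt.
    lift = P(A and B) / (P(A) * P(B)); values above 1 suggest the two
    categories appear together more often than chance would predict.
    """
    n = len(baskets)
    item_counts = Counter(cat for basket in baskets for cat in set(basket))
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(set(basket)), 2)
    )
    results = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        expected = (item_counts[a] / n) * (item_counts[b] / n)
        results[(a, b)] = {"support": support, "lift": support / expected}
    return results
```

    The input can be built from the basket table above, for example from the list of category lists in basket_analysis['category'].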

    Price Sensitivity and Promotion Analysis

    def analyze_price_sensitivity(transactions_df):
        """
        Analyze how discounts affect purchase behavior
        """
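        # Work on a copy so the caller's DataFrame is left unchanged
        df = transactions_df.copy()
    
        # Discount as a share of the pre-discount line total. final_price is
        # computed as total_price - discount_amount, so total_price is pre-discount.
        # Zero totals become NaN to avoid division by zero.
        pre_discount_total = df['total_price'].where(df['total_price'] != 0)
        df['discount_percentage'] = df['discount_amount'] / pre_discount_total * 100
    
        # Group by discount ranges; include_lowest keeps undiscounted (0%) rows
        discount_ranges = pd.cut(df['discount_percentage'],
                               bins=[0, 10, 25, 50, 100],
                               labels=['0-10%', '10-25%', '25-50%', '50%+'],
                               include_lowest=True)
    
        promotion_analysis = df.groupby(discount_ranges, observed=False).agg({
            'quantity': 'mean',
            'final_price': 'mean',
            'receipt_id': 'nunique'
        }).reset_index()
    
        return promotion_analysis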

    Step 7: Data Quality and Duplicate Detection

    Veryfi’s platform can identify duplicate receipts used to claim coupons, vouchers, or cash back, helping protect campaign investments and distribution volume. (Veryfi CPG Toolkit)

    Duplicate Receipt Detection

    def detect_duplicate_receipts(transactions_df):
        """
        Identify potential duplicate receipts for fraud prevention
        """
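        # Collapse line items to one row per receipt with its basket total;
        # comparing line-level prices would flag any two items with the same price
        receipts = transactions_df.groupby(
            ['receipt_id', 'store_name', 'date'], as_index=False
        )['final_price'].sum().rename(columns={'final_price': 'receipt_total'})
        receipts['receipt_total'] = receipts['receipt_total'].round(2)
    
        # Receipts sharing store, date, and total are duplicate candidates
        duplicate_candidates = receipts.groupby(
            ['store_name', 'date', 'receipt_total']
        ).agg(
            receipt_count=('receipt_id', 'nunique'),
            receipt_ids=('receipt_id', list)
        ).reset_index()
    
        # Flag potential duplicates
        duplicates = duplicate_candidates[duplicate_candidates['receipt_count'] > 1]
    
        return duplicates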
    
    def validate_data_quality(transactions_df):
        """
        Perform data quality checks on extracted data
        """
        quality_report = {
            'total_transactions': len(transactions_df),
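            # Extracted fields default to '' rather than NaN, so count both as missing
            'missing_upc': (transactions_df['upc'].isna() | (transactions_df['upc'] == '')).sum(),
            'missing_brand': (transactions_df['brand'].isna() | (transactions_df['brand'] == '')).sum(),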
            'zero_price': (transactions_df['final_price'] == 0).sum(),
            'negative_quantity': (transactions_df['quantity'] < 0).sum(),
            'date_range': {
                'earliest': transactions_df['date'].min(),
                'latest': transactions_df['date'].max()
            }
        }
    
        return quality_report
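    Matching on store, date, and total can miss resubmissions of the same receipt, or flag unrelated receipts that happen to share a total. One complementary approach is a content fingerprint built from the receipt's own lines. The sketch below hashes the normalized store name, the purchase date, and the sorted line items. The receipt_fingerprint helper and its normalization rules are assumptions for illustration, and it assumes dates begin with a YYYY-MM-DD prefix.

```python
import hashlib


def receipt_fingerprint(store_name, date, line_items):
    """Build a stable SHA-256 fingerprint from store, date, and line items.

    Line order, letter case, and surrounding whitespace are normalized so the
    same receipt photographed twice tends to produce the same fingerprint.
    """
    lines = sorted(
        f"{(item.get('description') or '').strip().lower()}"
        f"|{float(item.get('total') or 0):.2f}"
        for item in line_items
    )
    # Keep only the date part (assumes a leading YYYY-MM-DD)
    day = str(date or "")[:10]
    raw = "\n".join([(store_name or "").strip().lower(), day, *lines])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

    Two submissions with the same fingerprint are candidates for review rather than confirmed fraud, since a shopper can legitimately buy the same items at the same store twice in one day.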

    Step 8: Integration with Analytics Platforms

    Your enriched transactions data can now be integrated with various analytics and business intelligence platforms for deeper insights.

    Export to Common Formats

    def export_for_analytics(transactions_df, format_type='csv'):
        """
        Export transactions data for external analytics platforms
        """
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    
        if format_type == 'csv':
            filename = f'transactions_{timestamp}.csv'
            transactions_df.to_csv(filename, index=False)
        elif format_type == 'json':
            filename = f'transactions_{timestamp}.json'
            transactions_df.to_json(filename, orient='records', date_format='iso')
        elif format_type == 'parquet':
            # Requires a parquet engine such as pyarrow to be installed
            filename = f'transactions_{timestamp}.parquet'
            transactions_df.to_parquet(filename, index=False)
        else:
            # Without this branch an unknown format would hit an unbound `filename`
            raise ValueError(f"Unsupported format_type: {format_type}")
    
        return filename
    
    def create_analytics_summary(transactions_df):
        """
        Create summary statistics for dashboard consumption
        """
        summary = {
            'total_spend': transactions_df['final_price'].sum(),
            'total_transactions': len(transactions_df),
            'unique_products': transactions_df['upc'].nunique(),
            'unique_brands': transactions_df['brand'].nunique(),
            'unique_stores': transactions_df['store_name'].nunique(),
            'average_basket_size': transactions_df.groupby('receipt_id')['final_price'].sum().mean(),
            'top_categories': transactions_df.groupby('category')['final_price'].sum().nlargest(10).to_dict(),
            'top_brands': transactions_df.groupby('brand')['final_price'].sum().nlargest(10).to_dict()
        }
    
        return summary

    Real-World Applications and Use Cases

    The SKU-level basket data extracted through this process enables numerous real-world applications across different industries and use cases.

    Retail and CPG Manufacturers

    Veryfi’s technology provides insights into what consumers buy, where they shop, at what frequency, how much they spend, and much more. (Veryfi CPG Toolkit) This comprehensive view of consumer behavior has proven valuable in understanding market dynamics.

    FAQ

    What is SKU-level basket data and why is it important for retailers?

    SKU-level basket data refers to detailed information about individual products (Stock Keeping Units) purchased together in a single transaction. This data is crucial for understanding consumer purchasing behavior, cross-basket analytics, and brand loyalty patterns. It enables retailers and brands to optimize product placement, develop targeted marketing campaigns, and improve inventory management strategies.

    How does Veryfi’s OCR technology extract data from receipts?

    Veryfi’s Receipt OCR API uses advanced machine learning and artificial intelligence to convert receipt images into machine-encoded text. The technology can process receipts in 91 currencies and 38 languages, extracting data 200x faster and 10x more accurately than manual human processing. It eliminates the need for manual labor while providing real-time data extraction from unstructured documents.

    What is product matching and how does it enhance receipt data?

    Product matching is the process of linking extracted receipt line items to standardized product databases using SKU codes, UPC numbers, or product descriptions. This enhancement transforms raw receipt text into structured, actionable data that can be used for market analysis, competitive intelligence, and consumer behavior insights. It enables brands to track their products across multiple retailers and understand market share.
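    When a line item has no UPC, matching has to fall back on the printed description. As a rough sketch of that idea, the standard library's difflib can pick the closest entry from a product catalog. The match_by_description helper, the 0.6 cutoff, and the catalog contents are assumptions for illustration. Heavily abbreviated receipt text usually needs a purpose-built matcher rather than plain string similarity.

```python
import difflib


def match_by_description(description, catalog, cutoff=0.6):
    """Return the catalog name closest to a receipt description, or None.

    `catalog` is a list of canonical product names. Comparison is
    case-insensitive; `cutoff` is the minimum similarity ratio (0 to 1).
    """
    normalized = {name.lower(): name for name in catalog}
    hits = difflib.get_close_matches(
        description.lower(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None
```

    Given a hypothetical catalog of ["Coca-Cola Classic 12 oz", "Pepsi Cola 12 oz"], the description "COCA COLA CLASSIC 12OZ" resolves to the first entry, while unrelated text returns None.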

    How can Veryfi’s CPG Toolkit help with consumer spending analysis?

    Veryfi’s CPG Toolkit provides real-time tools for retail manufacturers and digital marketing companies to understand consumer spending behavior and brand loyalty. The toolkit analyzes consumer packaged goods purchases including food, beverages, toiletries, and cleaning products. This data can be used to create precisely targeted coupons, vouchers, and loyalty programs at scale, enriching the consumer experience.

    What types of documents can Veryfi’s OCR APIs process besides receipts?

    Veryfi’s OCR APIs can extract data from a wide variety of documents including invoices, W2s, W9s, bank checks, business cards, purchase orders, bills of lading, hotel folios, bank statements, credit cards, and ID cards. The platform uses deterministic, day-1 ready AI models that provide accurate data extraction across multiple document types and formats.

    How can businesses integrate Veryfi’s document capture capabilities into their applications?

    Businesses can integrate Veryfi’s document capture through Veryfi Lens, a software solution that can be embedded into mobile and web applications. Built in native code and optimized for performance, Veryfi Lens handles complexities like frame processing, asset preprocessing, and machine vision challenges. It provides a clean user experience with low memory usage while delivering fast, accurate document capture capabilities.