# PDF Database Structure Guide

## Overview

This guide explains how to store multiple PDFs per application ID in your database. The schema adds a one-to-many relationship between development applications and their PDF documents, plus a one-to-one record holding the data extracted from each PDF.

## Database Structure

### 1. DevelopmentApplication (Existing)
- Stores CSV data for each application
- One record per application ID
- Contains basic application information (council, decision, dates, etc.)

### 2. PDFDocument (New)
- Stores information about each PDF file
- Links to DevelopmentApplication via ForeignKey
- One record per PDF file
- Contains metadata about the PDF (type, confidence, extraction status)

### 3. ExtractedPDFData (New)
- Stores structured data extracted from PDFs
- Links to PDFDocument via OneToOneField
- Contains all the extracted fields (land description, applicant info, etc.)

## Relationships

```
DevelopmentApplication (1) ←→ (Many) PDFDocument (1) ←→ (1) ExtractedPDFData
```

- One application can have multiple PDFs
- Each PDF has one set of extracted data
- If extraction fails for a PDF, no ExtractedPDFData record is created for it
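
To make the cardinalities concrete, here is a minimal sketch of the same shape in plain SQL, run through Python's built-in `sqlite3` module. It is not the project's Django models: the table names, the `application_pk` column, and the trimmed column lists are assumptions chosen for illustration. The foreign key gives the one-to-many link, and the `UNIQUE` constraint on `pdf_document_id` is what makes the second link one-to-one.

```python
import sqlite3

# Illustrative schema only: table and column names are assumptions, not the
# tables Django generates for this project.
SCHEMA = """
CREATE TABLE development_application (
    id INTEGER PRIMARY KEY,
    application_id TEXT NOT NULL UNIQUE
);
CREATE TABLE pdf_document (
    id INTEGER PRIMARY KEY,
    application_pk INTEGER NOT NULL REFERENCES development_application(id),
    file_name TEXT NOT NULL,
    extraction_status TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE extracted_pdf_data (
    id INTEGER PRIMARY KEY,
    -- UNIQUE makes this link one-to-one instead of one-to-many
    pdf_document_id INTEGER NOT NULL UNIQUE REFERENCES pdf_document(id),
    land_description TEXT
);
"""


def build_demo_db():
    """Create an in-memory database with one application and two PDFs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO development_application (id, application_id) VALUES (1, 'D-101-2023')"
    )
    conn.executemany(
        "INSERT INTO pdf_document (id, application_pk, file_name, extraction_status) "
        "VALUES (?, 1, ?, ?)",
        [(1, "title_search.pdf", "success"), (2, "application_form.pdf", "failed")],
    )
    # Only the successfully extracted PDF gets an extracted-data row.
    conn.execute(
        "INSERT INTO extracted_pdf_data (pdf_document_id, land_description) "
        "VALUES (1, '1 Example Street')"
    )
    conn.commit()
    return conn


def pdfs_with_data(conn):
    """List every PDF with its extracted land description, or None if absent."""
    return conn.execute(
        """
        SELECT d.file_name, d.extraction_status, e.land_description
        FROM pdf_document AS d
        LEFT JOIN extracted_pdf_data AS e ON e.pdf_document_id = d.id
        ORDER BY d.id
        """
    ).fetchall()


if __name__ == "__main__":
    for row in pdfs_with_data(build_demo_db()):
        print(row)
```

The `LEFT JOIN` mirrors the last rule above: a PDF whose extraction failed still appears, with no extracted data attached.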

## How to Use

### 1. Process PDFs and Save to Database

Use the updated `extractPdfData` endpoint:

```bash
GET /files/extract-pdf-data/
```

This will:
- Read the `enriched_with_name.csv` file
- Process each PDF file
- Save extracted data to the database
- Return processing statistics
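
The loop behind those steps can be sketched roughly as follows. This is an illustration, not the endpoint's actual code: the CSV column names `application_id` and `pdf_path` are assumptions about `enriched_with_name.csv`, and `process` stands in for whatever function handles one PDF (for example `process_and_save_pdf_data`). Grouping rows by application ID first is one way to handle several PDFs per application, and catching exceptions per file keeps one bad PDF from aborting the run.

```python
import csv
import io
from collections import defaultdict


def group_pdfs_by_application(csv_file, id_column="application_id", path_column="pdf_path"):
    """Return {application_id: [pdf_path, ...]} from an open CSV file.

    The column names are assumptions; adjust them to the real CSV header.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        app_id = (row.get(id_column) or "").strip()
        pdf_path = (row.get(path_column) or "").strip()
        if app_id and pdf_path:  # skip rows missing either value
            grouped[app_id].append(pdf_path)
    return dict(grouped)


def process_all(grouped, process):
    """Call process(application_id, pdf_path) for every PDF and tally outcomes."""
    stats = {"processed": 0, "failed": 0, "errors": []}
    for app_id, paths in grouped.items():
        for pdf_path in paths:
            try:
                process(app_id, pdf_path)
                stats["processed"] += 1
            except Exception as exc:  # record the failure and keep going
                stats["failed"] += 1
                stats["errors"].append((app_id, pdf_path, str(exc)))
    return stats


if __name__ == "__main__":
    sample = io.StringIO(
        "application_id,pdf_path\n"
        "D-101-2023,pdfs/title_search.pdf\n"
        "D-101-2023,pdfs/application_form.pdf\n"
    )
    grouped = group_pdfs_by_application(sample)
    # A no-op stands in for the real per-PDF function here.
    print(process_all(grouped, lambda app_id, path: None))
```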

### 2. Query PDF Data from Database

#### Get all applications with PDF data:
```bash
GET /database/pdf-data/
```

#### Get specific application's PDF data:
```bash
GET /database/pdf-data/?application_id=D-101-2023
```
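
The same queries can be issued from a script. The sketch below uses only the standard library and rests on two assumptions that this guide does not state: that the server is reachable at `http://localhost:8000` (the Django development server default) and that the endpoint responds with JSON.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: local Django dev server


def pdf_data_url(base_url=BASE_URL, application_id=None):
    """Build the query URL, optionally filtered to one application."""
    url = urljoin(base_url, "/database/pdf-data/")
    if application_id:
        url += "?" + urlencode({"application_id": application_id})
    return url


def fetch_pdf_data(application_id=None, base_url=BASE_URL, timeout=30):
    """GET the endpoint and decode the body, assuming it is JSON."""
    with urlopen(pdf_data_url(base_url, application_id), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URLs; call fetch_pdf_data() once the server is running.
    print(pdf_data_url())
    print(pdf_data_url(application_id="D-101-2023"))
```

Using `urlencode` rather than string concatenation keeps application IDs with spaces or other reserved characters correctly escaped.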

### 3. Use Django Admin

Access the Django admin interface to view and manage PDF data:

1. Go to `/admin/`
2. Navigate to the "ShaoApp" section
3. View:
   - Development Applications
   - PDF Documents
   - Extracted PDF Data

### 4. Programmatic Access

#### Process a single PDF:
```python
from shaoApp.functions import process_and_save_pdf_data

result = process_and_save_pdf_data('D-101-2023', '/path/to/pdf/file.pdf')
print(result)
```

#### Query PDF data:
```python
from shaoApp.models import DevelopmentApplication, PDFDocument, ExtractedPDFData

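# Get the application with all its PDFs.
# .get() raises DevelopmentApplication.DoesNotExist if the ID is unknown.
app = DevelopmentApplication.objects.get(application_id='D-101-2023')
pdfs = app.pdf_documents.all()

for pdf in pdfs:
    print(f"PDF: {pdf.file_name}")
    print(f"Type: {pdf.document_type}")
    print(f"Status: {pdf.extraction_status}")

    # A failed extraction has no ExtractedPDFData row. Accessing the reverse
    # one-to-one then raises, so hasattr() works as an existence check.
    if hasattr(pdf, 'extracted_data'):
        data = pdf.extracted_data
        print(f"Land: {data.land_description}")
        print(f"Applicant: {data.applicant_name}")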
```

## Database Fields

### PDFDocument Fields
- `application`: ForeignKey to DevelopmentApplication
- `file_path`: Path to the PDF file
- `file_name`: Name of the PDF file
- `document_type`: Type of document (Title Search, Application Form, etc.)
- `pdf_type`: PDF type (digital, scanned, mixed)
- `confidence`: Extraction confidence (high, medium, low)
- `text_length`: Length of extracted text
- `pages_processed`: Number of pages processed
- `extraction_status`: Status (pending, success, failed)
- `error_message`: Error message if extraction failed

### ExtractedPDFData Fields
- `pdf_document`: OneToOneField to PDFDocument
- `land_description`: Property address
- `registered_proprietor`: Legal owner
- `encumbrances`: Property restrictions
- `activity_last_125_days`: Title activity in the last 125 days
- `administrative_notices`: Official notices
- `proposed_use`: Intended use of development
- `description`: Development description
- `applicant_name`: Applicant's name
- `contact_name`: Contact person's name
- `contact_address`: Contact address
- `contact_email`: Contact email
- `contact_phone`: Contact phone
- `applicant_address`: Applicant address
- `applicant_email`: Applicant email
- `applicant_phone`: Applicant phone
- `lot_size`: Lot size in m²
- `site_coverage`: Site coverage percentage
- `total_area`: Total building area
- `ground_floor_area`: Ground floor area
- `first_floor_area`: First floor area
- `pos`: Private Open Space
- `spos`: Secluded Private Open Space
- `raw_extracted_data`: JSON backup of extracted data
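
One way the `raw_extracted_data` backup can sit alongside the named columns is to copy only the keys that match known fields and keep the whole payload as JSON. The sketch below assumes the extractor returns a flat dict keyed by the field names above, which this guide does not confirm; `to_model_kwargs` is a hypothetical helper, not part of the project.

```python
import json

# Field names taken from the list above, excluding the link and the backup.
EXTRACTED_FIELDS = (
    "land_description", "registered_proprietor", "encumbrances",
    "activity_last_125_days", "administrative_notices", "proposed_use",
    "description", "applicant_name", "contact_name", "contact_address",
    "contact_email", "contact_phone", "applicant_address", "applicant_email",
    "applicant_phone", "lot_size", "site_coverage", "total_area",
    "ground_floor_area", "first_floor_area", "pos", "spos",
)


def to_model_kwargs(extracted):
    """Split an extraction result into known columns plus a JSON backup.

    Hypothetical helper: unknown keys and empty values are left out of the
    columns but are still preserved in raw_extracted_data.
    """
    kwargs = {
        name: extracted[name]
        for name in EXTRACTED_FIELDS
        if extracted.get(name) not in (None, "")
    }
    # Stored as a JSON string here; if the model uses a JSONField, the dict
    # itself could be passed instead.
    kwargs["raw_extracted_data"] = json.dumps(extracted, ensure_ascii=False, sort_keys=True)
    return kwargs


if __name__ == "__main__":
    sample = {"land_description": "1 Example Street", "applicant_name": "", "extra": "kept in backup"}
    print(to_model_kwargs(sample))
```

Keeping the untouched payload means a field added to the model later can be backfilled without re-running extraction.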

## Example Workflow

1. **Import CSV Data**: Use existing CSV import functionality
2. **Process PDFs**: Run PDF extraction endpoint
3. **View Results**: Check Django admin or use API endpoints
4. **Query Data**: Use programmatic access for analysis

## Testing

Run the test script to verify the database structure:

```bash
cd shao
python test_pdf_database.py
```

## Benefits

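1. **Explicit Relationships**: One-to-many between applications and PDFs, one-to-one between a PDF and its extracted data
2. **Data Integrity**: Extracted values are stored in named fields, with the raw output kept in `raw_extracted_data`
3. **Error Handling**: Failed extractions are recorded through `extraction_status` and `error_message`
4. **Flexibility**: Handles any number of PDFs per application
5. **Queryability**: Applications, PDFs, and extracted fields can be queried through the Django ORM or the API endpoints
6. **Admin Interface**: Visual management through Django admin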

## Migration Notes

- Existing CSV data remains unchanged
- New tables are created alongside existing ones
- The migration only adds tables, so existing rows are not modified or deleted
- Backward compatible with existing functionality

## Troubleshooting

### Common Issues

1. **PDF not found**: Check file paths in CSV
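2. **Extraction failed**: Check the PDF type and content; the `pdf_type` and `error_message` fields on the PDFDocument record show what was detected and why it failed
3. **Database errors**: Verify migrations are applied (`python manage.py showmigrations`, then `python manage.py migrate` if any are pending)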
4. **Memory issues**: Reduce batch size in processing
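
Two small helpers can support the first and last items in that list. Both are hypothetical sketches rather than part of the project: `missing_pdf_paths` reports which paths from the CSV do not exist on disk, and `in_batches` splits a list of PDFs into fixed-size chunks so that only one chunk is handled at a time.

```python
from itertools import islice
from pathlib import Path


def missing_pdf_paths(paths):
    """Return the paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


def in_batches(items, batch_size):
    """Yield lists of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    pdf_paths = ["pdfs/title_search.pdf", "pdfs/application_form.pdf"]
    print("Missing:", missing_pdf_paths(pdf_paths))
    for batch in in_batches(pdf_paths, batch_size=1):
        print("Would process:", batch)
```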

### Debug Commands

```python
# Count the rows in each table
from shaoApp.models import DevelopmentApplication, PDFDocument, ExtractedPDFData
print(f"Applications: {DevelopmentApplication.objects.count()}")
print(f"PDF Documents: {PDFDocument.objects.count()}")
print(f"Extracted Data: {ExtractedPDFData.objects.count()}")

# Check specific application
app = DevelopmentApplication.objects.get(application_id='D-101-2023')
print(f"PDFs for {app.application_id}: {app.pdf_documents.count()}")
```

## Next Steps

1. Test with your existing PDF files
2. Customize extraction prompts if needed
3. Add additional fields to models if required
4. Implement data validation rules
5. Add reporting and analytics features