Overview
I participated in the “Law x Digital” Hackathon (3rd edition) hosted by Japan’s Digital Agency and developed a cross-source legal document search product called “Lawve.” Lawve enables natural language search across e-Gov legal data and user-uploaded documents.
This article focuses on the backend architecture design that I was responsible for. The backend’s primary responsibility is to “automatically sync the state of Gemini File Search API whenever files are added to or removed from Google Cloud Storage (GCS).”
System Architecture
Here is the overall system architecture of Lawve.
```mermaid
flowchart TB
    subgraph User
        U[Browser]
    end
    subgraph GCP
        subgraph Frontend
            FE[Cloud Run<br/>Next.js]
        end
        subgraph Backend
            CR[Cloud Run<br/>FastAPI]
        end
        subgraph SD["Storage & Data"]
            GCS[(Cloud Storage)]
            FS[(Firestore)]
        end
        subgraph EV["Event Infrastructure"]
            EA[Eventarc]
        end
        subgraph AI
            GEMINI[Gemini File<br/>Search API]
        end
    end
    subgraph EXT["External API"]
        EGOV[e-Gov<br/>Law API]
    end
    U --> FE
    FE --> GEMINI
    FE --> FS
    FE --> EGOV
    FE -->|File Upload| GCS
    GCS -->|Event Notification| EA
    EA -->|CloudEvents| CR
    CR -->|Register/Delete| GEMINI
    CR -->|Download| GCS
```
Component Roles
| Component | Role |
|---|---|
| Cloud Run (Next.js) | Frontend: search UI, file upload, search result display |
| Cloud Run (FastAPI) | Backend: syncs Gemini File Search API in response to GCS events |
| Cloud Storage | Document storage, serves as the Single Source of Truth |
| Eventarc | Routes GCS file change events to Cloud Run |
| Gemini File Search API | Provides full-text and semantic search for documents |
| Firestore | Manages document metadata and comments |
| e-Gov Law API | Retrieves legal data |
This article focuses on the backend Cloud Run (FastAPI) design.
Event-Driven Architecture Design
Why Event-Driven?
In a legal document search product, documents need to be automatically registered with Gemini File Search API when placed in GCS. While a synchronous API approach from the frontend was an option, I chose an event-driven architecture for the following reasons:
- Loose coupling: The frontend only needs to place files in GCS without knowing about the backend
- Reliability: Eventarc reliably detects GCS events and delivers them to Cloud Run
- Tool compatibility: Files placed via CLI or scripts are synced in the same way
Event Routing with Eventarc
Eventarc monitors two types of GCS events and routes them to the Cloud Run endpoint.
| Event | Trigger Condition |
|---|---|
| `google.cloud.storage.object.v1.finalized` | When a file is uploaded to GCS |
| `google.cloud.storage.object.v1.deleted` | When a file is deleted from GCS |
The Cloud Run endpoint receives events in CloudEvents format.
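As a sketch of what the endpoint unpacks (the function name is illustrative; the field names follow the CloudEvents binary-mode HTTP binding and the GCS event payload):

```python
def parse_gcs_event(headers: dict, body: dict) -> dict:
    """Unpack a binary-mode CloudEvent delivered by Eventarc.

    In binary content mode the event type travels in the `ce-type` HTTP
    header, while the GCS object payload (bucket, object name) arrives
    as the JSON request body.
    """
    return {
        "event_type": headers.get("ce-type", ""),
        "bucket": body.get("bucket", ""),
        "object_name": body.get("name", ""),
    }
```

In the actual FastAPI endpoint these two dicts would come from the request headers and the parsed JSON body.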
File Upload Processing Flow
When a file is placed in GCS, it is processed through the following flow:
- Event reception and path filtering: Only files with the `file-search/archive/` prefix are processed
- Metadata extraction: Automatically extract `law_id` and `source_type` from the GCS path
- Delete existing documents: Remove documents with the same `source_path` to prevent duplicates
- File download: Download from GCS to a local temporary file
- Register with File Search Store: Upload to Gemini File Search API with metadata
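The five steps above can be sketched as a single handler. The helper names (`extract_metadata`, `delete_existing`, `download_to_tmp`, `register_with_store`) are hypothetical stand-ins for the real client wrappers, so the cloud-facing steps are shown as comments:

```python
ARCHIVE_PREFIX = "file-search/archive/"

def handle_object_finalized(bucket: str, name: str) -> str:
    """Sketch of the upload flow triggered by an object.finalized event."""
    # 1. Path filtering: ignore objects outside the archive prefix
    if not name.startswith(ARCHIVE_PREFIX):
        return "skipped"
    gcs_uri = f"gs://{bucket}/{name}"
    # 2. Metadata extraction from the path convention
    # meta = extract_metadata(name)
    # 3. Delete documents sharing the same source_path (idempotency)
    # delete_existing(source_path=gcs_uri)
    # 4. Download the object to a local temporary file
    # local_path = download_to_tmp(bucket, name)
    # 5. Register the file with the File Search Store, attaching metadata
    # register_with_store(local_path, meta, source_path=gcs_uri)
    return "registered"
```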
File Deletion Processing Flow
When a file is deleted from GCS, the corresponding document is also deleted from Gemini File Search API.
- Event reception and path filtering
- Search File Search Store by metadata: Identify the matching document using the `source_path` metadata
- Delete document: Remove the document from File Search Store
Integration with Gemini File Search API
Unified GCP Ecosystem
Lawve adopts a policy of unifying the entire infrastructure on GCP. While alternatives such as OpenAI's retrieval tooling or a dedicated vector database like Pinecone were options, I chose Gemini File Search API for its seamless integration with Cloud Run, GCS, and Eventarc.
File Search Store Overview
Gemini File Search Store is a managed service that vectorizes and indexes registered documents to provide semantic search. The backend performs the following operations:
- Document registration: Upload files with metadata
- Document search: Search existing documents by metadata
- Document deletion: Remove documents that are no longer needed
Metadata-Based Management
Each document is tagged with three metadata fields for management purposes.
| Metadata | Purpose | Example |
|---|---|---|
| `law_id` | Legal document ID for unique identification | `323AC0000000205` |
| `source_type` | Document classification | `user`, `doc`, `admin` |
| `source_path` | Full GCS path for unique document identification | `gs://bucket/file-search/archive/user/323AC0000000205/medical-law.txt` |
Since `source_path` is the GCS path itself, it uniquely maps GCS files to File Search Store documents.
Design Patterns for Extensibility
Path-Based Metadata Extraction
The backend is designed to process any file placed under a specific GCS path, not limited to legal documents, and derives metadata from a fixed path convention.
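Reconstructed from the `source_path` example in the metadata table (so treat the exact segment names as an assumption), the convention looks like:

```text
file-search/archive/{source_type}/{law_id}/{filename}
```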
The `source_type` and `law_id` are automatically extracted from the path using a regular expression.
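A minimal sketch of that extraction, assuming the path convention implied by the metadata table (the regex itself is my reconstruction, not necessarily the production one):

```python
import re

# Matches file-search/archive/{source_type}/{law_id}/{filename}
PATH_RE = re.compile(
    r"^file-search/archive/(?P<source_type>[^/]+)/(?P<law_id>[^/]+)/[^/]+$"
)

def extract_metadata(object_name: str):
    """Return source_type/law_id extracted from a GCS object name,
    or None when the path does not follow the convention."""
    m = PATH_RE.match(object_name)
    if m is None:
        return None
    return {"source_type": m.group("source_type"), "law_id": m.group("law_id")}
```

Because classification lives entirely in the path, adding a new `source_type` requires no code change, exactly as described above.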
This design allows any type of document, such as internal documents or technical documentation, to be managed through the same mechanism. You can add new classifications simply by changing the source_type, and no backend code changes are required as long as the path convention is followed.
Guaranteeing State Synchronization Between GCS and Gemini File Search API
GCS serves as the Single Source of Truth, and the design principle is to always keep the File Search Store state in sync with GCS.
Idempotent uploads: When the same file is re-uploaded, existing documents are deleted before re-registration. This prevents duplicates while supporting content updates.
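A sketch of the delete-then-register step; the `find`, `delete`, and `upload` callables stand in for hypothetical File Search API wrappers (injected as parameters here purely so the logic can be shown in isolation):

```python
def register_idempotently(local_path: str, metadata: dict, *, find, delete, upload) -> int:
    """Delete any document already registered under the same source_path,
    then upload the new content. Returns the number of documents removed."""
    removed = 0
    # Remove every existing document that shares this file's source_path
    for doc_name in find(metadata["source_path"]):
        delete(doc_name)
        removed += 1
    # Register the new content with its metadata attached
    upload(local_path, metadata)
    return removed
```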
Cascading deletes: When a file is deleted from GCS, the corresponding document is automatically removed from the File Search Store. The system searches by source_path metadata and deletes the matching document.
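The deletion path mirrors the upload path; again, `find` and `delete` are hypothetical wrappers around the File Search Store metadata search and document delete calls:

```python
def handle_object_deleted(bucket: str, name: str, *, find, delete) -> int:
    """Cascade a GCS deletion into the File Search Store: look up
    documents whose source_path metadata matches the deleted object,
    then remove each one. Returns the number of documents deleted."""
    source_path = f"gs://{bucket}/{name}"
    deleted = 0
    for doc_name in find(source_path):
        delete(doc_name)
        deleted += 1
    return deleted
```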
Local Download Design
Registration from GCS to Gemini File Search API is done by first downloading to a local temporary file.
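A sketch of the download step. The `download` parameter defaults to the real GCS client call but can be swapped out; preserving the file extension on the temp file is what keeps MIME detection accurate:

```python
import os
import tempfile

def download_to_tmp(bucket_name: str, object_name: str, *, download=None) -> str:
    """Download a GCS object to a local temp file whose suffix matches
    the original extension (e.g. ".txt", ".pdf")."""
    if download is None:
        # Deferred import: the cloud dependency is only needed at call time
        from google.cloud import storage

        def download(bucket, name, path):
            storage.Client().bucket(bucket).blob(name).download_to_filename(path)

    suffix = os.path.splitext(object_name)[1]  # keep ".txt", ".pdf", ...
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    download(bucket_name, object_name, local_path)
    return local_path
```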
While direct upload would be simpler, downloading locally first provides the following benefits:
- Extensibility: Enables future file transformation steps such as converting Excel to CSV before upload
- MIME detection: Preserving the file extension enables accurate MIME type detection
- Debugging: Allows inspection of local file contents when issues occur
Error Handling and Operational Design
Always Return 200 OK
The Cloud Run endpoint always returns 200 OK to Eventarc requests, even when errors occur.
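The pattern can be sketched as a wrapper that converts any failure into a logged 200; `notify_slack` marks where the Slack Webhook call would hook in, and all names here are illustrative:

```python
import logging

def process_event_safely(handler, event: dict) -> tuple:
    """Run an event handler but always report HTTP 200 to Eventarc,
    so failed events are not redelivered. Errors are logged instead
    (and, in Lawve, forwarded to Slack)."""
    try:
        handler(event)
        return 200, "ok"
    except Exception:
        logging.exception("event processing failed: %s", event.get("name"))
        # notify_slack(event)  # hypothetical Slack Webhook notifier
        return 200, "error logged"
```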
This design prevents Eventarc retries. If the endpoint returns 4xx/5xx, Eventarc resends the event, causing the same error to repeat. This results in duplicate Slack notifications and polluted logs. By returning 200 OK and handling errors through Slack notifications and logging, operational issues are avoided.
Error Monitoring via Slack Notifications
When a File Search Store upload fails, an error notification is sent via Slack Webhook. The notification includes the GCS path and error details, enabling operators to quickly identify the problem.
While this Slack-based monitoring is sufficient for hackathon-scale development, a production environment would require more robust monitoring and recovery mechanisms, such as Cloud Monitoring alerts, Cloud Logging integration, and Dead Letter Queues for reprocessing failed events.
Lazy Initialization for Client Management
Gemini API clients and GCS clients use lazy initialization.
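A minimal sketch of the pattern, assuming the google-genai SDK and a `GEMINI_API_KEY` environment variable (the cached-factory shape is my own; the actual client setup may differ):

```python
import os
from functools import lru_cache

@lru_cache(maxsize=1)
def get_genai_client():
    """Build the Gemini client on first use rather than at import time,
    so the app can boot (and tests can run) without the key configured."""
    api_key = os.environ["GEMINI_API_KEY"]  # fails only when actually needed
    from google import genai  # deferred import keeps startup light

    return genai.Client(api_key=api_key)
```

Because nothing runs at import time, a missing key only surfaces when the first event actually needs the client.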
This allows the application to start even without API keys configured, making it easier to work in test environments or local development without providing all environment variables.
Summary
The Lawve backend was built on the following design principles:
- Loosely coupled automatic sync via event-driven architecture: Using Eventarc to keep the frontend and backend decoupled while automatically reflecting GCS changes in Gemini File Search API
- Metadata-driven extensible design: Automatically extracting metadata from path conventions to support documents beyond legal data
- GCS as Single Source of Truth: Idempotent uploads and cascading deletes ensure the GCS and File Search Store states always match
By combining an event-driven architecture with GCP managed services, I was able to build a reliable document synchronization platform with minimal code. The simplicity of making documents searchable just by placing files in GCS was a significant advantage during the time-limited hackathon development.
The source code is available on GitHub.
Related Articles
- Setting Up a Python Development Environment on Mac with UV (Python environment setup)
- How to Use the OpenAI Response API (using OpenAI API)