A well-designed schema is the foundation of a successful dataset. This guide covers best practices and common patterns.
Core Principles
1. Be Descriptive
Every table, property, and relationship needs clear descriptions. Our AI agents use these to understand what data to extract.
Property(
    name="revenue",
    description="Annual revenue in USD for the most recent fiscal year",
    prop_type="Money"
)
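One way to enforce this principle before creating a dataset is a small pre-flight check. The sketch below is only an illustration: it uses plain dictionaries as hypothetical stand-ins for properties rather than the SDK's own classes, and the three-word minimum is an arbitrary assumption.

```python
# Hypothetical plain-dict stand-ins for schema properties (not the SDK's classes).
properties = [
    {"name": "revenue", "description": "Annual revenue in USD for the most recent fiscal year"},
    {"name": "hq", "description": ""},
    {"name": "ceo"},
]

def missing_descriptions(props, min_words=3):
    """Return names of properties whose description is absent or too short."""
    flagged = []
    for prop in props:
        description = prop.get("description") or ""
        # min_words is an assumed threshold, not a rule from the SDK.
        if len(description.split()) < min_words:
            flagged.append(prop["name"])
    return flagged

undocumented = missing_descriptions(properties)
print(undocumented)
```

Running a check like this on every table keeps undocumented properties from reaching the extraction step.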
2. Use Strong Typing
Leverage our type system to ensure data quality:
from structify.types.property_type import Enum
# Available types
Property(name="website", prop_type="URL")
Property(name="employee_count", prop_type="Integer")
Property(name="valuation", prop_type="Money")
Property(name="founded_date", prop_type="Date")
Property(name="is_public", prop_type="Boolean")
Property(name="profit_margin", prop_type="Float")
Property(name="logo", prop_type="Image")
# Enums for controlled vocabularies
Property(
    name="status",
    description="Current company status",
    prop_type=Enum(Enum=["Active", "Acquired", "Closed", "IPO"])
)
3. Model Relationships Thoughtfully
Relationships should represent meaningful connections:
Relationship(
    name="acquired_by",
    description="The acquiring company in an M&A transaction",
    source_table="company",
    target_table="company",
    properties=[
        RelationshipProperty(
            name="acquisition_date",
            description="Date the acquisition was completed",
            prop_type="Date"
        ),
        RelationshipProperty(
            name="price",
            description="Acquisition price in USD",
            prop_type="Money"
        )
    ]
)
Common Patterns
Hierarchical Relationships
For parent-child structures:
# Company subsidiaries
Relationship(
    name="subsidiary_of",
    description="Child company owned by parent",
    source_table="company",
    target_table="company",
    properties=[
        RelationshipProperty(
            name="ownership_percentage",
            description="Share of the child company owned by the parent, as a percentage",
            prop_type="Float"
        )
    ]
)
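Because a hierarchical relationship points a table at itself, extracted data can accidentally form a loop (A owned by B, B owned by A). The following is a minimal sketch of a cycle check over extracted edges. It assumes the edges have already been exported as (child, parent) name pairs; the function name and data shape are hypothetical and not part of the SDK.

```python
def find_ownership_cycle(edges):
    """Return one cycle as a list of names, or None if the hierarchy is acyclic.

    `edges` is a list of (child, parent) pairs, mirroring subsidiary_of.
    """
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, []).append(parent)

    def visit(node, path):
        # Seeing a node already on the current path means we looped back.
        if node in path:
            return path[path.index(node):] + [node]
        for parent in parents.get(node, []):
            cycle = visit(parent, path + [node])
            if cycle:
                return cycle
        return None

    for start in parents:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None

clean = [("A", "B"), ("B", "C")]
looped = [("A", "B"), ("B", "C"), ("C", "A")]
print(find_ownership_cycle(clean))
print(find_ownership_cycle(looped))
```

The depth-first walk is simple rather than fast; for large hierarchies a visited set would avoid re-exploring shared ancestors.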
Time-Series Data
For tracking changes over time:
Table(
    name="funding_round",
    description="A funding event for a company",
    properties=[
        Property(name="round_type", prop_type=Enum(Enum=["Seed", "Series A", "Series B", "Series C+"])),
        Property(name="amount", prop_type="Money"),
        Property(name="date", prop_type="Date"),
        Property(name="valuation", prop_type="Money")
    ]
)
Relationship(
    name="raised_by",
    description="The company that raised this funding round",
    source_table="funding_round",
    target_table="company"
)
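Modeling each funding event as its own entity is what makes change-over-time questions answerable. As a rough illustration, the sketch below orders a company's rounds by date and computes the valuation step-up between consecutive rounds. The rows are invented sample data in plain dictionaries, not output from the SDK.

```python
from datetime import date

# Hypothetical extracted rows for one company's funding_round entities.
rounds = [
    {"round_type": "Series A", "date": date(2021, 6, 1), "valuation": 40_000_000},
    {"round_type": "Seed", "date": date(2020, 1, 15), "valuation": 8_000_000},
    {"round_type": "Series B", "date": date(2023, 3, 10), "valuation": 120_000_000},
]

def valuation_history(rows):
    """Sort rounds chronologically and report the valuation multiple between consecutive rounds."""
    ordered = sorted(rows, key=lambda r: r["date"])
    history = []
    for previous, current in zip(ordered, ordered[1:]):
        step_up = current["valuation"] / previous["valuation"]
        history.append((previous["round_type"], current["round_type"], step_up))
    return history

history = valuation_history(rounds)
for earlier, later, multiple in history:
    print(f"{earlier} -> {later}: {multiple:.1f}x")
```

If the same figures were stored as a single "latest_valuation" property on the company, this history would be lost each time the value was updated.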
Many-to-Many Relationships
When entities can have multiple connections:
# Board members serving multiple companies
Table(
    name="person",
    description="An individual person",
    properties=[
        Property(name="name", description="Full name"),
        Property(name="title", description="Current professional title")
    ]
)
Relationship(
    name="board_member_of",
    description="Person serves on company board",
    source_table="person",
    target_table="company",
    properties=[
        RelationshipProperty(name="start_date", prop_type="Date"),
        RelationshipProperty(name="end_date", prop_type="Date"),
        RelationshipProperty(name="role", prop_type=Enum(Enum=["Director", "Chairman", "Observer"]))
    ]
)
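Putting start_date and end_date on the relationship, rather than on the person or the company, lets each seat have its own tenure. The sketch below shows the kind of query this enables. It assumes the edges have been exported as tuples with end_date left as None for current seats; the people, companies, and function name are all hypothetical.

```python
from datetime import date

# Hypothetical board_member_of edges: (person, company, start_date, end_date).
# end_date is None while the person still holds the seat.
edges = [
    ("Ada Park", "Acme", date(2019, 1, 1), None),
    ("Ada Park", "Globex", date(2021, 5, 1), date(2023, 5, 1)),
    ("Ben Ortiz", "Acme", date(2020, 3, 1), None),
]

def active_boards(edges, person, on):
    """Companies where `person` held a board seat on the date `on`."""
    return sorted(
        company
        for who, company, start, end in edges
        if who == person and start <= on and (end is None or on <= end)
    )

boards_2022 = active_boards(edges, "Ada Park", date(2022, 1, 1))
boards_2024 = active_boards(edges, "Ada Park", date(2024, 1, 1))
print(boards_2022, boards_2024)
```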
Schema Examples by Industry
Financial Services
tables = [
    Table(
        name="fund",
        description="Investment fund or vehicle",
        properties=[
            Property(name="name"),
            Property(name="aum", description="Assets under management", prop_type="Money"),
            Property(name="strategy", prop_type=Enum(Enum=["Equity", "Debt", "Hybrid", "Crypto"]))
        ]
    ),
    Table(
        name="portfolio_company",
        description="Company in fund's portfolio",
        properties=[
            Property(name="name"),
            Property(name="sector"),
            Property(name="entry_valuation", prop_type="Money")
        ]
    )
]
Healthcare
tables = [
    Table(
        name="clinical_trial",
        description="Medical research study",
        properties=[
            Property(name="trial_id", description="ClinicalTrials.gov ID"),
            Property(name="phase", prop_type=Enum(Enum=["Phase 1", "Phase 2", "Phase 3", "Phase 4"])),
            Property(name="status", prop_type=Enum(Enum=["Recruiting", "Active", "Completed", "Terminated"])),
            Property(name="start_date", prop_type="Date"),
            Property(name="primary_outcome")
        ]
    )
]
E-Commerce
tables = [
    Table(
        name="product",
        description="Item for sale",
        properties=[
            Property(name="sku", description="Stock keeping unit"),
            Property(name="name"),
            Property(name="price", prop_type="Money"),
            Property(name="in_stock", prop_type="Boolean"),
            Property(name="category", prop_type=Enum(Enum=["Electronics", "Clothing", "Home", "Books"]))
        ]
    ),
    Table(
        name="vendor",
        description="Product supplier",
        properties=[
            Property(name="name"),
            Property(name="rating", prop_type="Float"),
            Property(name="location")
        ]
    )
]
Advanced Tips
1. Plan for Growth
Design schemas that can evolve:
- Start with core properties
- Add detail incrementally
- Use consistent naming conventions
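A naming convention is easier to keep when it is checked automatically. The examples in this guide use lowercase snake_case names, so the sketch below flags anything that deviates from that. Treating snake_case as the convention is an assumption drawn from those examples, and the helper name is hypothetical.

```python
import re

# Lowercase words separated by single underscores, e.g. employee_count.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def non_snake_case(names):
    """Return the names that do not follow lowercase snake_case."""
    return [name for name in names if not SNAKE_CASE.match(name)]

offenders = non_snake_case(["employee_count", "foundedDate", "Is_Public", "aum", "round_2"])
print(offenders)
```

The same check can be run over table and relationship names so that new properties added later match the ones already in place.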
2. Balance Normalization
Find the right level of detail:
- Too normalized: Complex to query
- Too denormalized: Redundant data
- Just right: Natural entity boundaries
3. Consider Your Sources
Design for the data you can actually get:
- Public web data: Keep it simple
- Internal documents: Can be detailed
- APIs: Match their structure
4. Validate Early
Test your schema with sample data:
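# Create a test dataset.
# Assumes `client` is an initialized Structify client and that `tables` and
# `relationships` hold your schema. `tables` must include the "company" table
# used by the sample entity below, and KnowledgeGraphParam and EntityParam
# must be imported from the SDK.
client.datasets.create(
    name="test_schema",
    description="Testing schema design",
    tables=tables,
    relationships=relationships
)
# Add sample entities
test_entity = client.entities.add(
    dataset="test_schema",
    kg=KnowledgeGraphParam(
        entities=[EntityParam(
            id=0,
            type="company",
            properties={"name": "Test Corp"}
        )]
    )
)
# Try enrichment on a property that is defined on the "company" table
client.structure.enhance_property(
    entity_id=test_entity.id,
    property_name="description"
)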
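Some schema problems can be caught locally before any dataset is created. A common one is a relationship that names a table the schema never defines. The sketch below checks for that using plain dictionaries as hypothetical stand-ins for the schema objects. The invested_in relationship is invented for the example, and none of this is an SDK call.

```python
def undefined_table_references(table_names, relationships):
    """Return (relationship, table) pairs that point at tables missing from the schema."""
    known = set(table_names)
    problems = []
    for rel in relationships:
        # A set avoids reporting a self-referential relationship twice.
        for table in sorted({rel["source_table"], rel["target_table"]}):
            if table not in known:
                problems.append((rel["name"], table))
    return problems

table_names = ["fund", "portfolio_company"]
relationships = [
    {"name": "invested_in", "source_table": "fund", "target_table": "portfolio_company"},
    {"name": "acquired_by", "source_table": "company", "target_table": "company"},
]
problems = undefined_table_references(table_names, relationships)
print(problems)
```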
Common Mistakes to Avoid
- Don’t forget descriptions: properties without clear descriptions produce poor extraction results.
- Don’t over-constrain enums: leave room for edge cases with an “Other” option.
- Don’t create circular dependencies: self-referential relationships such as subsidiary_of are useful, but make sure the data cannot loop back on itself, for example a company that ends up as its own parent.
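To illustrate the point about enums, the sketch below maps raw extracted strings onto a controlled vocabulary and falls back to "Other" when nothing matches. The status list is taken from the example earlier in this guide with "Other" appended; the function itself is a hypothetical helper, not part of the SDK.

```python
ALLOWED_STATUS = ["Active", "Acquired", "Closed", "IPO", "Other"]

def coerce_to_enum(raw, allowed, fallback="Other"):
    """Map a raw string onto an allowed value, case-insensitively; unknown values become the fallback."""
    lookup = {value.lower(): value for value in allowed}
    return lookup.get(raw.strip().lower(), fallback)

print(coerce_to_enum(" acquired ", ALLOWED_STATUS))
print(coerce_to_enum("Bankrupt", ALLOWED_STATUS))
```

Without the fallback value, an unexpected status such as "Bankrupt" would have to be forced into one of the other categories or dropped.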
Next Steps