A well-designed schema is the foundation of a successful dataset. This guide covers best practices and common patterns.

Core Principles

1. Be Descriptive

Every table, property, and relationship needs a clear description. Our AI agents use these descriptions to understand what data to extract.
Property(
    name="revenue",
    description="Annual revenue in USD for the most recent fiscal year",
    prop_type="Money"
)
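
One way to enforce this before creating a dataset is to scan the schema for missing or very short descriptions. The sketch below uses plain dictionaries as stand-ins for the SDK objects; `find_missing_descriptions` and the 10-character threshold are illustrative assumptions, not part of the Structify API.

```python
# Hypothetical pre-flight check: plain dicts stand in for Table/Property objects.
MIN_DESCRIPTION_LENGTH = 10  # assumed threshold, adjust as needed


def find_missing_descriptions(tables):
    """Return names of tables/properties whose description is absent or too short."""
    problems = []
    for table in tables:
        if len((table.get("description") or "").strip()) < MIN_DESCRIPTION_LENGTH:
            problems.append(table["name"])
        for prop in table.get("properties", []):
            if len((prop.get("description") or "").strip()) < MIN_DESCRIPTION_LENGTH:
                problems.append(f"{table['name']}.{prop['name']}")
    return problems


schema = [
    {
        "name": "company",
        "description": "A private or public company",
        "properties": [
            {"name": "revenue",
             "description": "Annual revenue in USD for the most recent fiscal year"},
            {"name": "sector"},  # no description, so it is flagged
        ],
    }
]

missing = find_missing_descriptions(schema)
print(missing)
```

Running a check like this in CI keeps descriptions from slipping as the schema grows.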

2. Use Strong Typing

Leverage our type system to ensure data quality:
from structify.types.property_type import Enum

# Available types
Property(name="website", prop_type="URL")
Property(name="employee_count", prop_type="Integer")
Property(name="valuation", prop_type="Money")
Property(name="founded_date", prop_type="Date")
Property(name="is_public", prop_type="Boolean")
Property(name="rating", prop_type="Float")
Property(name="logo", prop_type="Image")

# Enums for controlled vocabularies
Property(
    name="status",
    description="Current company status",
    prop_type=Enum(Enum=["Active", "Acquired", "Closed", "IPO"])
)
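
To catch misspelled type names early, a schema can be checked against the set of type strings it is expected to use. The sketch below hard-codes only the types named in this guide; the real type system may accept others, and `unknown_prop_types` is a hypothetical helper rather than an SDK function.

```python
# Type names taken from the examples in this guide; the full list may be longer.
TYPES_IN_THIS_GUIDE = {"URL", "Integer", "Money", "Date", "Boolean", "Float", "Image"}


def unknown_prop_types(properties):
    """Return (name, prop_type) pairs whose string type is not in the known set.

    Properties without a prop_type, or with a non-string one (for example an
    enum object), are skipped.
    """
    return [
        (p["name"], p["prop_type"])
        for p in properties
        if isinstance(p.get("prop_type"), str)
        and p["prop_type"] not in TYPES_IN_THIS_GUIDE
    ]


props = [
    {"name": "website", "prop_type": "URL"},
    {"name": "employee_count", "prop_type": "integer"},  # not spelled as in this guide
    {"name": "founded_date", "prop_type": "Date"},
    {"name": "name"},  # no explicit type
]

typos = unknown_prop_types(props)
print(typos)
```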

3. Model Relationships Thoughtfully

Relationships should represent meaningful connections between entities, and they can carry properties of their own:
Relationship(
    name="acquired_by",
    description="The acquiring company in an M&A transaction",
    source_table="company",
    target_table="company",
    properties=[
        RelationshipProperty(
            name="acquisition_date",
            description="Date the acquisition was completed",
            prop_type="Date"
        ),
        RelationshipProperty(
            name="price",
            description="Acquisition price in USD",
            prop_type="Money"
        )
    ]
)
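
A relationship is only meaningful if both of its endpoints exist in the schema. The following sketch, which uses plain dictionaries rather than SDK objects, lists relationships whose `source_table` or `target_table` does not match a defined table name. `dangling_relationships` and the `founded_by` example are hypothetical.

```python
def dangling_relationships(tables, relationships):
    """Return names of relationships that point at a table missing from the schema."""
    table_names = {t["name"] for t in tables}
    return [
        r["name"]
        for r in relationships
        if r["source_table"] not in table_names
        or r["target_table"] not in table_names
    ]


tables = [{"name": "company"}, {"name": "person"}]
relationships = [
    {"name": "acquired_by", "source_table": "company", "target_table": "company"},
    # "founder" is not a defined table, so this relationship is reported.
    {"name": "founded_by", "source_table": "company", "target_table": "founder"},
]

dangling = dangling_relationships(tables, relationships)
print(dangling)
```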

Common Patterns

Hierarchical Relationships

For parent-child structures:
# Company subsidiaries
Relationship(
    name="subsidiary_of",
    description="Child company owned by parent",
    source_table="company",
    target_table="company",
    properties=[
        RelationshipProperty(
            name="ownership_percentage",
            description="Percentage of the child company owned by the parent",
            prop_type="Float"
        )
    ]
)
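
Once `subsidiary_of` edges have been extracted, walking up the chain gives the ultimate parent. This sketch assumes each company has at most one parent and represents the edges as a plain child-to-parent mapping. The company names are made up, and the cycle guard covers extraction errors where two companies end up listed as each other's parent.

```python
def ultimate_parent(company, parent_of):
    """Follow child -> parent links until a company with no parent is reached."""
    seen = {company}
    current = company
    while current in parent_of:
        current = parent_of[current]
        if current in seen:
            raise ValueError(f"cycle in subsidiary_of chain at {current!r}")
        seen.add(current)
    return current


# Made-up example data: child -> parent
parent_of = {
    "Acme Robotics": "Acme Industrial",
    "Acme Industrial": "Acme Holdings",
}

top = ultimate_parent("Acme Robotics", parent_of)
print(top)
```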

Time-Series Data

For tracking changes over time:
Table(
    name="funding_round",
    description="A funding event for a company",
    properties=[
        Property(name="round_type", prop_type=Enum(Enum=["Seed", "Series A", "Series B", "Series C+"])),
        Property(name="amount", prop_type="Money"),
        Property(name="date", prop_type="Date"),
        Property(name="valuation", prop_type="Money")
    ]
)

Relationship(
    name="raised_by",
    description="Company that raised funds in this round",
    source_table="funding_round",
    target_table="company"
)
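
Modeling each round as its own entity makes it straightforward to order events and compare values over time. A minimal sketch on hand-written sample records (plain dictionaries with made-up numbers, not extracted data):

```python
from datetime import date

# Made-up rounds for a single company; valuations in USD.
rounds = [
    {"round_type": "Series A", "date": date(2021, 6, 1), "valuation": 40_000_000},
    {"round_type": "Seed", "date": date(2020, 3, 15), "valuation": 8_000_000},
    {"round_type": "Series B", "date": date(2023, 1, 10), "valuation": 150_000_000},
]

# Sort chronologically, then pair each round with the one before it.
timeline = sorted(rounds, key=lambda r: r["date"])
step_ups = [
    (later["round_type"], later["valuation"] / earlier["valuation"])
    for earlier, later in zip(timeline, timeline[1:])
]

for round_type, multiple in step_ups:
    print(f"{round_type}: {multiple:.2f}x the previous valuation")
```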

Many-to-Many Relationships

When entities can have multiple connections:
# Board members serving multiple companies
Table(
    name="person",
    description="An individual person",
    properties=[
        Property(name="name", description="Full name"),
        Property(name="title", description="Current professional title")
    ]
)

Relationship(
    name="board_member_of",
    description="Person serves on company board",
    source_table="person",
    target_table="company",
    properties=[
        RelationshipProperty(name="start_date", prop_type="Date"),
        RelationshipProperty(name="end_date", prop_type="Date"),
        RelationshipProperty(name="role", prop_type=Enum(Enum=["Director", "Chairman", "Observer"]))
    ]
)
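
Because a person can sit on several boards and a board has several members, the same set of `board_member_of` edges answers questions in both directions. The sketch below indexes made-up (person, company) pairs both ways using only the standard library.

```python
from collections import defaultdict

# Made-up (person, company) edges for the board_member_of relationship.
edges = [
    ("Ada Park", "Acme Holdings"),
    ("Ada Park", "Globex"),
    ("Sam Ortiz", "Acme Holdings"),
]

boards_by_person = defaultdict(set)
members_by_company = defaultdict(set)
for person, company in edges:
    boards_by_person[person].add(company)
    members_by_company[company].add(person)

print(sorted(boards_by_person["Ada Park"]))
print(sorted(members_by_company["Acme Holdings"]))
```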

Schema Examples by Industry

Financial Services

tables = [
    Table(
        name="fund",
        description="Investment fund or vehicle",
        properties=[
            Property(name="name"),
            Property(name="aum", description="Assets under management", prop_type="Money"),
            Property(name="strategy", prop_type=Enum(Enum=["Equity", "Debt", "Hybrid", "Crypto"]))
        ]
    ),
    Table(
        name="portfolio_company",
        description="Company in fund's portfolio",
        properties=[
            Property(name="name"),
            Property(name="sector"),
            Property(name="entry_valuation", prop_type="Money")
        ]
    )
]

Healthcare

tables = [
    Table(
        name="clinical_trial",
        description="Medical research study",
        properties=[
            Property(name="trial_id", description="ClinicalTrials.gov ID"),
            Property(name="phase", prop_type=Enum(Enum=["Phase 1", "Phase 2", "Phase 3", "Phase 4"])),
            Property(name="status", prop_type=Enum(Enum=["Recruiting", "Active", "Completed", "Terminated"])),
            Property(name="start_date", prop_type="Date"),
            Property(name="primary_outcome")
        ]
    )
]

E-Commerce

tables = [
    Table(
        name="product",
        description="Item for sale",
        properties=[
            Property(name="sku", description="Stock keeping unit"),
            Property(name="name"),
            Property(name="price", prop_type="Money"),
            Property(name="in_stock", prop_type="Boolean"),
            Property(name="category", prop_type=Enum(Enum=["Electronics", "Clothing", "Home", "Books"]))
        ]
    ),
    Table(
        name="vendor",
        description="Product supplier",
        properties=[
            Property(name="name"),
            Property(name="rating", prop_type="Float"),
            Property(name="location")
        ]
    )
]

Advanced Tips

1. Plan for Growth

Design schemas that can evolve:
  • Start with core properties
  • Add detail incrementally
  • Use consistent naming conventions
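
The examples in this guide use lowercase snake_case for table, property, and relationship names. If you adopt the same convention, a short check such as the following can flag names that drift from it. The regular expression and the `non_snake_case` helper are assumptions for illustration, not rules enforced by the platform.

```python
import re

# Lowercase words made of letters and digits, joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_case(names):
    """Return the names that are not lowercase snake_case."""
    return [n for n in names if not SNAKE_CASE.fullmatch(n)]


names = ["funding_round", "employee_count", "entryValuation", "Board Member", "aum"]
offenders = non_snake_case(names)
print(offenders)
```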

2. Balance Normalization

Find the right level of detail:
  • Too normalized: Complex to query
  • Too denormalized: Redundant data
  • Just right: Natural entity boundaries
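
As an illustration of the trade-off, the flat records below repeat vendor details on every product row. Splitting them along the natural entity boundary (product and vendor) removes the duplication while keeping the link between the two. The data is made up and the structures are plain dictionaries, not SDK objects.

```python
# Denormalized: vendor details repeated on each product row (made-up data).
flat_rows = [
    {"sku": "A-1", "name": "Desk Lamp", "vendor": "Lumen Co", "vendor_location": "Austin"},
    {"sku": "A-2", "name": "Floor Lamp", "vendor": "Lumen Co", "vendor_location": "Austin"},
    {"sku": "B-1", "name": "Bookshelf", "vendor": "Oak & Pine", "vendor_location": "Portland"},
]

# Normalized: one record per vendor, one per product, plus an edge list.
vendors = {}
products = []
sold_by = []  # (sku, vendor name) pairs
for row in flat_rows:
    vendors[row["vendor"]] = {"name": row["vendor"], "location": row["vendor_location"]}
    products.append({"sku": row["sku"], "name": row["name"]})
    sold_by.append((row["sku"], row["vendor"]))

print(len(flat_rows), "flat rows ->", len(products), "products,", len(vendors), "vendors")
```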

3. Consider Your Sources

Design for the data you can actually get:
  • Public web data: Keep it simple
  • Internal documents: Can be detailed
  • APIs: Match their structure

4. Validate Early

Test your schema with sample data:
# Create a test dataset
client.datasets.create(
    name="test_schema",
    description="Testing schema design",
    tables=tables,
    relationships=relationships
)

# Add sample entities
test_entity = client.entities.add(
    dataset="test_schema",
    kg=KnowledgeGraphParam(
        entities=[EntityParam(
            id=0,
            type="company",
            properties={"name": "Test Corp"}
        )]
    )
)

# Try enrichment
client.structure.enhance_property(
    entity_id=test_entity.id,
    property_name="description"
)
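
Before spending extraction runs on a new schema, it can also help to check hand-written sample records against the table definitions locally. The sketch below uses plain dictionaries in place of SDK objects and reports properties that the table does not define. `unexpected_properties` is a hypothetical helper and makes no API calls.

```python
def unexpected_properties(entity, tables):
    """Return property names on the entity that its table does not define.

    Raises KeyError if the entity's type is not a table in the schema.
    """
    defined = {t["name"]: {p["name"] for p in t["properties"]} for t in tables}
    allowed = defined[entity["type"]]
    return sorted(set(entity["properties"]) - allowed)


tables = [
    {"name": "company", "properties": [{"name": "name"}, {"name": "revenue"}]},
]
# "ceo" is not defined on the company table, so it is reported.
sample = {"type": "company", "properties": {"name": "Test Corp", "ceo": "Unknown"}}

extras = unexpected_properties(sample, tables)
print(extras)
```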

Common Mistakes to Avoid

  • Don’t forget descriptions: Properties without clear descriptions produce poor extraction results
  • Don’t over-constrain enums: Leave room for edge cases with an “Other” option
  • Don’t create circular dependencies: Be careful with self-referential relationships
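
One way to leave room for edge cases in an enum is to append a catch-all value and map anything unrecognized onto it when post-processing extracted values. A minimal sketch, assuming the catch-all is literally named "Other"; `with_other` and `coerce_to_enum` are hypothetical helpers, not SDK functions.

```python
def with_other(values, fallback="Other"):
    """Return the enum values with the fallback appended if it is missing."""
    return list(values) if fallback in values else [*values, fallback]


def coerce_to_enum(raw, allowed, fallback="Other"):
    """Map a raw string onto an allowed value (case-insensitive), else the fallback."""
    by_lower = {v.lower(): v for v in allowed}
    return by_lower.get(raw.strip().lower(), fallback)


statuses = with_other(["Active", "Acquired", "Closed", "IPO"])
print(statuses)
print(coerce_to_enum("acquired", statuses))
print(coerce_to_enum("In administration", statuses))
```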

Next Steps