Primitive Type Spelling
Question
If Sigil moves away from Unicode primitive type glyphs, should primitive types use:
- lowercase ASCII: `int`, `float`, `bool`, `string`, `char`, `unit`, `never`
- capitalized ASCII: `Int`, `Float`, `Bool`, `String`, `Char`, `Unit`, `Never`
Current findings
Token cost
Using the repo tokenizer harness:
- lowercase and capitalized ASCII primitive names are tied on token cost
- this is true in standalone form and in simple typed contexts like `x:int` vs `x:Int`
- the meaningful token-cost win is ASCII over Unicode, not lowercase over capitalized
Representative results:
| Pair | Standalone | Typed context |
|---|---|---|
| int vs Int | tied across all local tokenizers | tied across all local tokenizers |
| float vs Float | tied across all local tokenizers | tied across all local tokenizers |
| bool vs Bool | tied across all local tokenizers | tied across all local tokenizers |
| string vs String | tied across all local tokenizers | tied across all local tokenizers |
| char vs Char | tied across all local tokenizers | tied across all local tokenizers |
| unit vs Unit | tied across all local tokenizers | tied across all local tokenizers |
| never vs Never | tied across all local tokenizers | tied across all local tokenizers |
Notes:
- `openai_cl100k_base` is exact in the local benchmark harness
- the Llama and Anthropic numbers are local heuristic proxies, so they are directional only
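The comparison behind the table can be sketched as follows. This is only an illustration of the shape of the check, not the repo harness itself: `Outcome`, `compare_pair`, and `capitalize` are hypothetical names, and the token counter is passed in as a closure so that any tokenizer (exact or heuristic) can be plugged in.

```rust
use std::cmp::Ordering;

/// Result of comparing one lowercase/capitalized pair under one tokenizer.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Tied,
    LowercaseCheaper,
    CapitalizedCheaper,
}

/// Compare two spellings using a caller-supplied token counter.
/// `count_tokens` is an assumption: it stands in for whatever the real
/// harness uses to count tokens for a given tokenizer.
pub fn compare_pair(lower: &str, upper: &str, count_tokens: impl Fn(&str) -> usize) -> Outcome {
    match count_tokens(lower).cmp(&count_tokens(upper)) {
        Ordering::Equal => Outcome::Tied,
        Ordering::Less => Outcome::LowercaseCheaper,
        Ordering::Greater => Outcome::CapitalizedCheaper,
    }
}

/// Uppercase the first character of an ASCII name, e.g. "int" -> "Int".
pub fn capitalize(name: &str) -> String {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) => first.to_ascii_uppercase().to_string() + chars.as_str(),
        None => String::new(),
    }
}
```

Running each primitive through `compare_pair` twice, once standalone and once in a typed context such as `x:int` vs `x:Int`, would give one table row per tokenizer.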
Language-design tradeoff
Once Unicode is removed, this becomes a canonicality and readability decision rather than a tokenizer decision.
Lowercase benefits:
- matches the rest of Sigil's keyword-heavy surface
- visually distinguishes built-in primitives from user-defined types
- aligns with many languages where primitive/builtin types are keyword-like
- avoids making primitives look nominal
Capitalized benefits:
- makes all types visually uniform
- reinforces "types look like types" in prose and examples
- aligns with the internal AST names already used by the compiler (`PrimitiveName::Int`, `Bool`, etc.)
Recommendation
Prefer lowercase ASCII primitive names:
int, float, bool, string, char, unit, never
Reasoning:
- Token cost does not justify capitalized forms.
- Lowercase better communicates that these are built-in language primitives, not user-defined nominal types.
- It preserves a useful visual distinction between:
  - built-in primitive vocabulary
  - user-defined types like `User`, `Todo`, `Result`
- It keeps Sigil's surface more keyword-like and regular.
If the project instead decides that "all types should look alike" is the stronger aesthetic rule, then capitalized forms are acceptable from a tokenizer perspective. This is a style choice, not a cost choice.
Migration scope
This rename is mechanically straightforward but broad.
Expected touchpoints:
- lexer token definitions for primitive type spellings
- parser primitive-type matching
- typechecker error formatting
- syntax docs and spec examples
- stdlib signatures
- examples and projects
- test fixtures and parser/typechecker tests
- benchmarks that currently assume Unicode primitive spellings
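As a rough sketch of the parser and diagnostic touchpoints, assuming the lowercase spellings are chosen: the enum below mirrors the `PrimitiveName` variants the compiler already uses, while `from_spelling` and `spelling` are hypothetical helpers, and the `Never` variant is an assumption based on the proposed spelling list.

```rust
use std::fmt;

/// Mirrors the compiler's internal primitive names.
/// `Never` is assumed here; it appears in the proposed spellings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimitiveName {
    Int,
    Float,
    Bool,
    String,
    Char,
    Unit,
    Never,
}

impl PrimitiveName {
    /// Hypothetical helper: accept only the canonical lowercase spelling.
    /// Anything else, including `Int`, falls through to user-defined type lookup.
    pub fn from_spelling(spelling: &str) -> Option<Self> {
        Some(match spelling {
            "int" => Self::Int,
            "float" => Self::Float,
            "bool" => Self::Bool,
            "string" => Self::String,
            "char" => Self::Char,
            "unit" => Self::Unit,
            "never" => Self::Never,
            _ => return None,
        })
    }

    /// The single canonical source spelling for this primitive.
    pub fn spelling(self) -> &'static str {
        match self {
            Self::Int => "int",
            Self::Float => "float",
            Self::Bool => "bool",
            Self::String => "string",
            Self::Char => "char",
            Self::Unit => "unit",
            Self::Never => "never",
        }
    }
}

/// Diagnostics render the same spelling the parser accepts.
impl fmt::Display for PrimitiveName {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.spelling())
    }
}
```

Keeping acceptance and rendering in one place means an error message can never print a spelling that the parser would reject.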
Current codebase indicators:
- lexer currently hard-codes Unicode primitive tokens
- parser maps those tokens to `PrimitiveName::{Int,Float,Bool,String,Char,Unit}`
- typechecker diagnostics currently render Unicode names
- there are more than 1500 Unicode primitive occurrences outside compiler internals that would need source updates
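For the source updates outside the compiler, a minimal sketch of a mechanical rewrite might look like this. The glyphs in `GLYPH_TO_ASCII` are placeholders chosen for illustration, not a confirmed list of Sigil's Unicode spellings, and `rewrite_primitives` is a hypothetical helper.

```rust
/// Placeholder glyph-to-ASCII table. These glyphs are assumptions;
/// substitute the spellings the lexer actually hard-codes.
pub const GLYPH_TO_ASCII: &[(&str, &str)] = &[
    ("ℤ", "int"),
    ("ℝ", "float"),
    ("𝔹", "bool"),
];

/// Replace every listed glyph in `source` and report how many
/// occurrences were rewritten, which helps audit the migration.
pub fn rewrite_primitives(source: &str, table: &[(&str, &str)]) -> (String, usize) {
    let mut out = source.to_string();
    let mut replaced = 0;
    for &(glyph, ascii) in table {
        // Count before replacing so the total reflects actual rewrites.
        replaced += out.matches(glyph).count();
        out = out.replace(glyph, ascii);
    }
    (out, replaced)
}
```

A purely textual replace like this also rewrites glyphs inside string literals and comments, so a token-aware rewrite driven by the lexer would be safer for files where those occur.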
Suggested implementation order
1. Decide the canonical ASCII spellings.
2. Update lexer tokens and parser acceptance.
3. Update diagnostic rendering.
4. Update docs/spec/reference material.
5. Rewrite stdlib, examples, fixtures, and projects.
6. Re-run benchmark and compiler tests.
Non-goal
Do not support both Unicode and ASCII primitive spellings long-term.
That would directly weaken Sigil's canonical-syntax goal.