Primitive Type Spelling

Question

If Sigil moves away from Unicode primitive type glyphs, should primitive types use:

lowercase ASCII: int, float, bool, string, char, unit, never
capitalized ASCII: Int, Float, Bool, String, Char, Unit, Never

Current findings

Token cost

Using the repo tokenizer harness:

lowercase and capitalized ASCII primitive names are tied on token cost
this is true in standalone form and in simple typed contexts like x:int vs x:Int
the meaningful token-cost win is ASCII over Unicode, not lowercase over capitalized

Representative results:

| Pair | Standalone | Typed context |
|---|---|---|
| int vs Int | tied across all local tokenizers | tied across all local tokenizers |
| float vs Float | tied across all local tokenizers | tied across all local tokenizers |
| bool vs Bool | tied across all local tokenizers | tied across all local tokenizers |
| string vs String | tied across all local tokenizers | tied across all local tokenizers |
| char vs Char | tied across all local tokenizers | tied across all local tokenizers |
| unit vs Unit | tied across all local tokenizers | tied across all local tokenizers |
| never vs Never | tied across all local tokenizers | tied across all local tokenizers |

Notes:

openai_cl100k_base is exact in the local benchmark harness
the Llama and Anthropic numbers are local heuristic proxies, so they are directional only

Language-design tradeoff

Once the Unicode glyphs are removed, choosing between lowercase and capitalized names is a canonicality and readability decision, not a tokenizer decision.

Lowercase benefits:

matches the rest of Sigil's keyword-heavy surface
visually distinguishes built-in primitives from user-defined types
aligns with many languages where primitive/builtin types are keyword-like
avoids making primitives look nominal

Capitalized benefits:

makes all types visually uniform
reinforces "types look like types" in prose and examples
aligns with the internal AST names already used by the compiler (PrimitiveName::Int, Bool, etc.)

Recommendation

Prefer lowercase ASCII primitive names:

int, float, bool, string, char, unit, never

Reasoning:

Token cost does not justify capitalized forms.
Lowercase better communicates that these are built-in language primitives, not user-defined nominal types.
It preserves a useful visual distinction between:

- built-in primitive vocabulary
- user-defined types like User, Todo, Result

It keeps Sigil's surface more keyword-like and regular.

If the project instead decides that "all types should look alike" is the stronger aesthetic rule, then capitalized forms are acceptable from a tokenizer perspective. This is a style choice, not a cost choice.

Migration scope

This rename is mechanically straightforward but broad.

Expected touchpoints:

lexer token definitions for primitive type spellings
parser primitive-type matching
typechecker error formatting
syntax docs and spec examples
stdlib signatures
examples and projects
test fixtures and parser/typechecker tests
benchmarks that currently assume Unicode primitive spellings

Current codebase indicators:

lexer currently hard-codes Unicode primitive tokens
parser maps those tokens to PrimitiveName::{Int,Float,Bool,String,Char,Unit}
typechecker diagnostics currently render Unicode names
there are more than 1500 Unicode primitive occurrences outside compiler internals that would need source updates
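
To make the lexer, parser, and diagnostic touchpoints concrete, here is a minimal sketch of a single spelling table that both directions could share. It assumes the recommended lowercase spellings. The enum name comes from the note above, but the `Never` variant, the `spelling` and `from_spelling` helpers, and the `Display` impl are hypothetical and not taken from the Sigil compiler.

```rust
use std::fmt;

/// Mirrors the internal names mentioned in this note.
/// ASSUMPTION: `Never` is added here because it appears in the proposed
/// spelling list; the note does not say the compiler has this variant yet.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimitiveName {
    Int,
    Float,
    Bool,
    String,
    Char,
    Unit,
    Never,
}

impl PrimitiveName {
    /// Hypothetical helper: the one canonical surface spelling per primitive.
    pub fn spelling(self) -> &'static str {
        match self {
            PrimitiveName::Int => "int",
            PrimitiveName::Float => "float",
            PrimitiveName::Bool => "bool",
            PrimitiveName::String => "string",
            PrimitiveName::Char => "char",
            PrimitiveName::Unit => "unit",
            PrimitiveName::Never => "never",
        }
    }

    /// Hypothetical helper: exact, case-sensitive lookup, so `Int` is not
    /// accepted as an alias for `int`.
    pub fn from_spelling(word: &str) -> Option<PrimitiveName> {
        const ALL: [PrimitiveName; 7] = [
            PrimitiveName::Int,
            PrimitiveName::Float,
            PrimitiveName::Bool,
            PrimitiveName::String,
            PrimitiveName::Char,
            PrimitiveName::Unit,
            PrimitiveName::Never,
        ];
        ALL.iter().copied().find(|p| p.spelling() == word)
    }
}

/// Diagnostics render the same spelling the lexer accepts.
impl fmt::Display for PrimitiveName {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.spelling())
    }
}
```

Keeping acceptance and rendering on one table means an error message can never print a spelling that the lexer would reject, which is the property the canonical-syntax goal relies on.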

Suggested implementation order

1. Decide the canonical ASCII spellings.
2. Update the lexer tokens and the parser's primitive-type matching.
3. Update diagnostic rendering.
4. Update docs, spec, and reference material.
5. Rewrite stdlib, examples, fixtures, and projects.
6. Re-run the benchmarks and compiler tests.
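
The bulk rewrite of stdlib, examples, fixtures, and projects could be sketched as a table-driven text substitution. This is an illustration under assumptions, not the project's migration tool: the `rewrite_primitives` function is hypothetical, and because this note does not list the current Unicode glyphs, the caller supplies the old-to-new table.

```rust
/// Hypothetical helper: replace every occurrence of each old spelling with
/// its new spelling. Returns the rewritten text and the number of
/// replacements, which can be checked against the expected occurrence count.
///
/// The table is supplied by the caller; no real Sigil glyphs are assumed here.
pub fn rewrite_primitives(source: &str, table: &[(&str, &str)]) -> (String, usize) {
    let mut text = source.to_string();
    let mut replaced = 0;
    for (old, new) in table {
        // An empty pattern would match between every character; skip it.
        if old.is_empty() {
            continue;
        }
        replaced += text.matches(*old).count();
        text = text.replace(*old, *new);
    }
    (text, replaced)
}
```

With a placeholder glyph, `rewrite_primitives("x:ℤ", &[("ℤ", "int")])` yields `("x:int", 1)`. A plain text substitution like this also rewrites glyphs inside string literals and comments, so a token-aware pass or a manual review of the diff would still be needed before committing the result.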

Non-goal

Do not support both Unicode and ASCII primitive spellings long-term.

That would directly weaken Sigil's canonical-syntax goal.