@LunaStev LunaStev commented Jan 5, 2026

Following the modularization of the parser and codegen, this PR breaks down the monolithic lexer.rs into specialized submodules within front/lexer/src/lexer/. This reorganization separates low-level source navigation from high-level token dispatch and literal parsing, making the lexer significantly easier to maintain and extend.

Key Changes

1. Lexer Modularization

The lexer logic has been split into the following functional components:

  • core.rs: Definitions for the Lexer and Token structures, serving as the foundational types.
  • cursor.rs: Implements low-level source navigation methods such as advance(), peek(), peek_next(), and match_next().
  • scan.rs: The primary entry point for tokenization, containing the main next_token() dispatch logic and character-level matching.
  • ident.rs: Logic for scanning identifiers and mapping them to language keywords.
  • literals.rs: Specialized scanning for string and character literals, including escape sequence handling.
  • trivia.rs: Logic for skipping non-token "trivia" such as whitespace and various comment styles.
  • common.rs: Internal shared imports and utilities used across the lexer submodules.
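
As a rough illustration of what the cursor.rs layer covers, here is a minimal sketch of the four navigation methods named above. The struct fields (`source`, `current`), the `is_at_end` helper, and the `'\0'` end-of-input sentinel are assumptions made for this example and are not taken from the actual implementation.

```rust
/// Hypothetical stand-in for the real `Lexer`; the field names are assumptions.
pub struct Lexer {
    source: Vec<char>,
    current: usize,
}

impl Lexer {
    pub fn new(source: &str) -> Self {
        Lexer { source: source.chars().collect(), current: 0 }
    }

    /// True once every character has been consumed (assumed helper).
    pub fn is_at_end(&self) -> bool {
        self.current >= self.source.len()
    }

    /// Consume and return the current character, or '\0' at end of input.
    pub fn advance(&mut self) -> char {
        let c = self.peek();
        if !self.is_at_end() {
            self.current += 1;
        }
        c
    }

    /// Look at the current character without consuming it.
    pub fn peek(&self) -> char {
        self.source.get(self.current).copied().unwrap_or('\0')
    }

    /// Look one character past the current one without consuming anything.
    pub fn peek_next(&self) -> char {
        self.source.get(self.current + 1).copied().unwrap_or('\0')
    }

    /// Consume the current character only if it equals `expected`.
    pub fn match_next(&mut self, expected: char) -> bool {
        if self.is_at_end() || self.peek() != expected {
            return false;
        }
        self.current += 1;
        true
    }
}
```

With this shape, a dispatcher such as next_token() in scan.rs could recognize a two-character operator like `=>` by calling advance() for the `=` and then match_next('>').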

2. Integration & API Cleanup

  • Exposed Structure: Updated front/lexer/src/lib.rs and mod.rs to correctly export the new modular structure while maintaining a clean public API.
  • External Updates: Adjusted imports in the front/parser crate to align with the new lexer paths, specifically ensuring TokenType and Token are correctly referenced.
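
The export pattern described above might look roughly like the following sketch. It uses inline modules so that it compiles as a single file, whereas the real crate keeps each module in its own file. The `TokenType` variants, the `Token` fields, and the whitespace-splitting `next_token` body are placeholders invented for the example; only the module names and the `lexer::token::TokenType` path shape come from this PR.

```rust
pub mod token {
    // Placeholder variants; the real enum is much larger.
    #[derive(Debug, Clone, PartialEq)]
    pub enum TokenType {
        Word,
        Eof,
    }

    #[derive(Debug, Clone, PartialEq)]
    pub struct Token {
        pub token_type: TokenType,
        pub lexeme: String,
    }
}

pub mod lexer {
    // Private submodule: holds the type definition (core.rs in the PR).
    mod core {
        pub struct Lexer {
            // Visible to sibling submodules of `lexer`, hidden from other crates.
            pub(super) source: Vec<char>,
            pub(super) current: usize,
        }

        impl Lexer {
            pub fn new(source: &str) -> Self {
                Lexer { source: source.chars().collect(), current: 0 }
            }
        }
    }

    // Private submodule: adds methods to the same type (scan.rs in the PR).
    mod scan {
        use super::core::Lexer;
        use super::super::token::{Token, TokenType};

        impl Lexer {
            /// Toy dispatcher: returns whitespace-separated words, then Eof.
            pub fn next_token(&mut self) -> Token {
                while self.current < self.source.len()
                    && self.source[self.current].is_whitespace()
                {
                    self.current += 1;
                }
                let start = self.current;
                while self.current < self.source.len()
                    && !self.source[self.current].is_whitespace()
                {
                    self.current += 1;
                }
                if start == self.current {
                    return Token { token_type: TokenType::Eof, lexeme: String::new() };
                }
                let lexeme: String = self.source[start..self.current].iter().collect();
                Token { token_type: TokenType::Word, lexeme }
            }
        }
    }

    // The only name callers need; the submodule split stays an internal detail.
    pub use self::core::Lexer;
}
```

Under this arrangement, downstream code refers to `lexer::Lexer` and `token::TokenType` and never names `core` or `scan` directly, which is what lets the files be split without changing the public surface.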

3. Behavioral Consistency

  • This is a pure structural refactor. The tokenization logic, keyword recognition, and literal parsing remain behaviorally identical to the previous implementation.
  • The Lexer public interface remains stable to prevent breaking changes in the compiler runner.

Impact

  • Maintainability: Concerns are now clearly separated. For example, changing how upcoming characters are peeked only requires touching cursor.rs, while adding a new keyword only involves ident.rs.
  • Readability: Individual files are now focused and significantly smaller, reducing the overhead for new contributors.
  • Architecture: Completes the project-wide goal of modularizing the frontend crates.
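
To make the maintainability point concrete, here is a hedged sketch of the kind of keyword table that ident.rs is described as owning. The keyword set, the `TokenType` variants, and both function names are hypothetical and chosen only for illustration; the point is that adding a keyword amounts to one extra match arm in one place.

```rust
// Placeholder token kinds; the real enum lives elsewhere and is larger.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenType {
    Fun,
    Var,
    If,
    Else,
    Return,
    Identifier(String),
}

/// Map scanned text to a keyword token, falling back to a plain identifier.
/// Adding a keyword means adding one arm here.
pub fn keyword_or_identifier(text: &str) -> TokenType {
    match text {
        "fun" => TokenType::Fun,
        "var" => TokenType::Var,
        "if" => TokenType::If,
        "else" => TokenType::Else,
        "return" => TokenType::Return,
        other => TokenType::Identifier(other.to_string()),
    }
}

/// Scan one identifier from the start of `source`.
/// Returns the token and the unconsumed remainder, or `None` if `source`
/// does not start with a letter or underscore.
pub fn scan_identifier(source: &str) -> Option<(TokenType, &str)> {
    let first = source.chars().next()?;
    if !(first.is_alphabetic() || first == '_') {
        return None;
    }
    // Byte index of the first character that cannot be part of an identifier.
    let end = source
        .find(|c: char| !(c.is_alphanumeric() || c == '_'))
        .unwrap_or(source.len());
    let (text, rest) = source.split_at(end);
    Some((keyword_or_identifier(text), rest))
}
```

Because the whole identifier is scanned before the lookup, text such as `funny` is classified as an identifier and not as the keyword `fun` followed by `ny`.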

Break down the lexer implementation into logical components to improve
code organization and readability.

Changes:
- **New Module Structure**:
  - `core.rs`: `Lexer` and `Token` struct definitions and entry points.
  - `cursor.rs`: Low-level source navigation (`advance`, `peek`, `match_next`).
  - `scan.rs`: Main token dispatch logic (`next_token`).
  - `ident.rs`: Identifier scanning and keyword mapping.
  - `literals.rs`: String and character literal parsing.
  - `trivia.rs`: Whitespace and comment skipping.
  - `common.rs`: Internal shared imports.
- **Integration**:
  - Updated `front/lexer/src/lib.rs` and `mod.rs` to expose the new structure.
  - Updated imports in `front/parser` to align with the refactored lexer API (explicit `use lexer::token::TokenType` where necessary).

This modularization separates concerns, making the lexer easier to maintain
and extend.

Signed-off-by: LunaStev <luna@lunastev.org>
@LunaStev LunaStev merged commit a8216e6 into wavefnd:master Jan 5, 2026
2 checks passed