Compare commits

...

2 Commits

Author SHA1 Message Date
dl92
01d5532823 Add expression-evaluator: DAGs & state machines tutorial project
Educational calculator teaching FSMs (explicit transition table tokenizer)
and DAGs (recursive descent parser with AST evaluation). Includes CLI with
REPL, graphviz visualization, and 61 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 18:09:42 +00:00
dl92
3a8705ece8 Fix bugs, N+1 queries, and wire settings in persian-tutor
- Replace inline __import__("datetime").timedelta hack with proper import
- Remove unused import random in anki_export.py
- Add error handling for Claude CLI subprocess failures in ai.py
- Fix hardcoded absolute path in stt.py with relative Path resolution
- Fix N+1 DB queries in vocab.get_flashcard_batch and dashboard.get_category_breakdown
  by adding db.get_all_word_progress() batch query
- Wire Ollama model and Whisper size settings to actually update config
  via ai.set_ollama_model() and stt.set_whisper_size()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 15:40:24 +00:00
18 changed files with 1614 additions and 12 deletions

View File

@@ -0,0 +1,42 @@
# Expression Evaluator
## Overview
Educational project teaching DAGs and state machines through a calculator.
Pure Python, no external dependencies.
## Running
```bash
python main.py "3 + 4 * 2" # single expression
python main.py # REPL mode
python main.py --show-tokens --show-ast --trace "expr" # show internals
python main.py --dot "3+4*2" | dot -Tpng -o ast.png # AST diagram
python main.py --dot-fsm | dot -Tpng -o fsm.png # FSM diagram
```
## Testing
```bash
python -m pytest tests/ -v
```
## Architecture
- `tokenizer.py` -- Explicit finite state machine (Mealy machine) tokenizer
- `parser.py` -- Recursive descent parser building an AST (DAG)
- `evaluator.py` -- Post-order tree walker (topological sort evaluation)
- `visualize.py` -- Graphviz dot generation for AST and FSM diagrams
- `main.py` -- CLI entry point with argparse, REPL mode
## Key Design Decisions
- State machine uses an explicit transition table (dict), not implicit if/else
- Unary minus resolved by examining previous token context
- Power operator (`^`) is right-associative (grammar uses right-recursion)
- AST nodes are dataclasses; evaluation uses structural pattern matching
- Graphviz output is raw dot strings (no graphviz Python package needed)
## Grammar
```
expression ::= term ((PLUS | MINUS) term)*
term ::= unary ((MULTIPLY | DIVIDE) unary)*
unary ::= UNARY_MINUS unary | power
power ::= atom (POWER power)?
atom ::= NUMBER | LPAREN expression RPAREN
```

View File

@@ -0,0 +1,87 @@
# Expression Evaluator -- DAGs & State Machines Tutorial
A calculator that teaches two fundamental CS patterns by building them from scratch:
1. **Finite State Machine** -- the tokenizer processes input character-by-character using an explicit transition table
2. **Directed Acyclic Graph (DAG)** -- the parser builds an expression tree, evaluated bottom-up in topological order
## What You'll Learn
| File | CS Concept | What it does |
|------|-----------|-------------|
| `tokenizer.py` | **State Machine** (Mealy machine) | Converts `"3 + 4 * 2"` into tokens using a transition table |
| `parser.py` | **DAG construction** | Builds an expression tree with operator precedence |
| `evaluator.py` | **Topological evaluation** | Walks the tree bottom-up (leaves before parents) |
| `visualize.py` | **Visualization** | Generates graphviz diagrams of both the FSM and AST |
## Quick Start
```bash
# Evaluate an expression
python main.py "3 + 4 * 2"
# => 11
# Interactive REPL
python main.py
# See how the state machine tokenizes
python main.py --show-tokens "(2 + 3) * -4"
# See the expression tree (DAG)
python main.py --show-ast "(2 + 3) * 4"
# *
# +-- +
# | +-- 2
# | `-- 3
# `-- 4
# Watch evaluation in topological order
python main.py --trace "(2 + 3) * 4"
# Step 1: 2 => 2
# Step 2: 3 => 3
# Step 3: 2 + 3 => 5
# Step 4: 4 => 4
# Step 5: 5 * 4 => 20
# Generate graphviz diagrams
python main.py --dot "(2 + 3) * 4" | dot -Tpng -o ast.png
python main.py --dot-fsm | dot -Tpng -o fsm.png
```
## Features
- Arithmetic: `+`, `-`, `*`, `/`, `^` (power)
- Parentheses: `(2 + 3) * 4`
- Unary minus: `-3`, `-(2 + 1)`, `2 * -3`
- Decimals: `3.14`, `.5`
- Standard precedence: parens > `^` > `*`/`/` > `+`/`-`
- Right-associative power: `2^3^4` = `2^(3^4)`
- Correct unary minus: `-3^2` = `-(3^2)` = `-9`
## Running Tests
```bash
python -m pytest tests/ -v
```
## How the State Machine Works
The tokenizer in `tokenizer.py` uses an **explicit transition table** -- a dictionary mapping `(current_state, character_class)` to `(next_state, action)`. This is the same pattern used in network protocol parsers, regex engines, and compiler lexers.
The three states are:
- `START` -- between tokens, dispatching based on the next character
- `INTEGER` -- accumulating digits (e.g., `"12"` so far)
- `DECIMAL` -- accumulating digits after a decimal point (e.g., `"12.3"`)
Use `--dot-fsm` to generate a visual diagram of the state machine.
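The table-driven loop can be condensed to a few lines (illustrative names only; the real table in `tokenizer.py` also tracks positions and distinguishes more character classes):

```python
from enum import Enum, auto

class S(Enum):       # condensed state set: just START and INTEGER
    START = auto()
    INTEGER = auto()

def classify(ch):
    # the table is keyed on character classes, not raw characters
    return "DIGIT" if ch.isdigit() else "OTHER"

# (current_state, char_class) -> (next_state, action)
TABLE = {
    (S.START,   "DIGIT"): (S.INTEGER, "accumulate"),
    (S.START,   "OTHER"): (S.START,   "skip"),
    (S.INTEGER, "DIGIT"): (S.INTEGER, "accumulate"),
    (S.INTEGER, "OTHER"): (S.START,   "emit"),
}

def run(text):
    state, buf, out = S.START, "", []
    for ch in text + "\0":          # sentinel character forces a final emit
        state, action = TABLE[(state, classify(ch))]
        if action == "accumulate":
            buf += ch
        elif action == "emit":
            out.append(buf)
            buf = ""
    return out

print(run("12 34"))  # ['12', '34']
```

Because the rules live in a dictionary rather than in control flow, the same data can drive both the tokenizer loop and the `--dot-fsm` diagram generator.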
## How the DAG Works
The parser in `parser.py` builds an **expression tree** (AST) where:
- **Leaf nodes** are numbers (no dependencies)
- **Interior nodes** are operators with edges to their operands
- **Edges** represent "depends on" relationships
Evaluation in `evaluator.py` walks this tree **bottom-up** -- children before parents. This is exactly a **topological sort** of the DAG: you can only compute a node after all its dependencies are resolved.
Use `--show-ast` to see the tree structure, or `--dot` to generate a graphviz diagram.
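The same "dependencies before dependents" rule drives any DAG evaluation, not just expression trees. As a generic sketch (hypothetical node names, not this project's AST types):

```python
deps = {             # node -> the nodes it depends on (DAG edges)
    "a": [],         # leaf holding 2
    "b": [],         # leaf holding 3
    "c": [],         # leaf holding 4
    "sum": ["a", "b"],
    "prod": ["sum", "c"],
}
values = {"a": 2, "b": 3, "c": 4}
ops = {"sum": lambda x, y: x + y, "prod": lambda x, y: x * y}

def topo_eval(node):
    # post-order: resolve every dependency before the node itself
    args = [topo_eval(d) for d in deps[node]]
    return values[node] if not args else ops[node](*args)

print(topo_eval("prod"))  # (2 + 3) * 4 = 20
```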

View File

@@ -0,0 +1,147 @@
"""
Part 3: DAG Evaluation -- Tree Walker
=======================================
Evaluating the AST bottom-up is equivalent to topological-sort
evaluation of a DAG. We must evaluate a node's children before
the node itself -- just like in any dependency graph.
For a tree, post-order traversal gives a topological ordering.
The recursive evaluate() function naturally does this:
1. Recursively evaluate all children (dependencies)
2. Combine the results (compute this node's value)
3. Return the result (make it available to the parent)
This is the same pattern as:
- make: build dependencies before the target
- pip/npm install: install dependencies before the package
- Spreadsheet recalculation: compute referenced cells first
"""
from parser import NumberNode, BinOpNode, UnaryOpNode, Node
from tokenizer import TokenType
# ---------- Errors ----------
class EvalError(Exception):
pass
# ---------- Evaluator ----------
OP_SYMBOLS = {
TokenType.PLUS: '+',
TokenType.MINUS: '-',
TokenType.MULTIPLY: '*',
TokenType.DIVIDE: '/',
TokenType.POWER: '^',
TokenType.UNARY_MINUS: 'neg',
}
def evaluate(node):
"""
Evaluate an AST by walking it bottom-up (post-order traversal).
This is a recursive function that mirrors the DAG structure:
each recursive call follows a DAG edge to a child node.
Children are evaluated before parents -- topological order.
"""
match node:
case NumberNode(value=v):
return v
case UnaryOpNode(op=TokenType.UNARY_MINUS, operand=child):
return -evaluate(child)
case BinOpNode(op=op, left=left, right=right):
left_val = evaluate(left)
right_val = evaluate(right)
match op:
case TokenType.PLUS:
return left_val + right_val
case TokenType.MINUS:
return left_val - right_val
case TokenType.MULTIPLY:
return left_val * right_val
case TokenType.DIVIDE:
if right_val == 0:
raise EvalError("division by zero")
return left_val / right_val
case TokenType.POWER:
return left_val ** right_val
raise EvalError(f"unknown node type: {type(node)}")
def evaluate_traced(node):
"""
Like evaluate(), but records each step for educational display.
Returns (result, list_of_trace_lines).
The trace shows the topological evaluation order -- how the DAG
is evaluated from leaves to root. Each step shows a node being
evaluated after all its dependencies are resolved.
"""
steps = []
counter = [0] # mutable counter for step numbering
def _walk(node, depth):
indent = " " * depth
        match node:
            case NumberNode(value=v):
                counter[0] += 1  # assign the step number at emit time so numbering stays sequential
                step = counter[0]
                result = v
                display = _format_number(v)
steps.append(f"{indent}Step {step}: {display} => {_format_number(result)}")
return result
case UnaryOpNode(op=TokenType.UNARY_MINUS, operand=child):
child_val = _walk(child, depth + 1)
result = -child_val
counter[0] += 1
step = counter[0]
steps.append(
f"{indent}Step {step}: neg({_format_number(child_val)}) "
f"=> {_format_number(result)}"
)
return result
case BinOpNode(op=op, left=left, right=right):
left_val = _walk(left, depth + 1)
right_val = _walk(right, depth + 1)
sym = OP_SYMBOLS[op]
match op:
case TokenType.PLUS:
result = left_val + right_val
case TokenType.MINUS:
result = left_val - right_val
case TokenType.MULTIPLY:
result = left_val * right_val
case TokenType.DIVIDE:
if right_val == 0:
raise EvalError("division by zero")
result = left_val / right_val
case TokenType.POWER:
result = left_val ** right_val
counter[0] += 1
step = counter[0]
steps.append(
f"{indent}Step {step}: {_format_number(left_val)} {sym} "
f"{_format_number(right_val)} => {_format_number(result)}"
)
return result
raise EvalError(f"unknown node type: {type(node)}")
result = _walk(node, 0)
return result, steps
def _format_number(v):
"""Display a number as integer when possible."""
if isinstance(v, float) and v == int(v):
return str(int(v))
return str(v)

View File

@@ -0,0 +1,163 @@
"""
Expression Evaluator -- Learn DAGs & State Machines
====================================================
CLI entry point and interactive REPL.
Usage:
python main.py "3 + 4 * 2" # evaluate
python main.py # REPL mode
python main.py --show-tokens --show-ast --trace "expr" # show internals
python main.py --dot "3 + 4 * 2" | dot -Tpng -o ast.png
python main.py --dot-fsm | dot -Tpng -o fsm.png
"""
import argparse
import sys
from tokenizer import tokenize, TokenError
from parser import Parser, ParseError
from evaluator import evaluate, evaluate_traced, EvalError
from visualize import ast_to_dot, fsm_to_dot, ast_to_text
def process_expression(expr, args):
"""Tokenize, parse, and evaluate a single expression."""
try:
tokens = tokenize(expr)
except TokenError as e:
_print_error(expr, e)
return
if args.show_tokens:
print("\nTokens:")
for tok in tokens:
print(f" {tok}")
try:
ast = Parser(tokens).parse()
except ParseError as e:
_print_error(expr, e)
return
if args.show_ast:
print("\nAST (text tree):")
print(ast_to_text(ast))
if args.dot:
print(ast_to_dot(ast))
return # dot output goes to stdout, skip numeric result
if args.trace:
try:
result, steps = evaluate_traced(ast)
except EvalError as e:
print(f"Eval error: {e}")
return
print("\nEvaluation trace (topological order):")
for step in steps:
print(step)
print(f"\nResult: {_format_result(result)}")
else:
try:
result = evaluate(ast)
except EvalError as e:
print(f"Eval error: {e}")
return
print(_format_result(result))
def repl(args):
"""Interactive read-eval-print loop."""
print("Expression Evaluator REPL")
print("Type an expression, or 'quit' to exit.")
flags = []
if args.show_tokens:
flags.append("--show-tokens")
if args.show_ast:
flags.append("--show-ast")
if args.trace:
flags.append("--trace")
if flags:
print(f"Active flags: {' '.join(flags)}")
print()
while True:
try:
line = input(">>> ").strip()
except (EOFError, KeyboardInterrupt):
print()
break
if line.lower() in ("quit", "exit", "q"):
break
if not line:
continue
process_expression(line, args)
print()
def _print_error(expr, error):
"""Print an error with a caret pointing to the position."""
print(f"Error: {error}")
if hasattr(error, 'position') and error.position is not None:
print(f" {expr}")
print(f" {' ' * error.position}^")
def _format_result(v):
"""Format a numeric result: show as int when possible."""
if isinstance(v, float) and v == int(v) and abs(v) < 1e15:
return str(int(v))
return str(v)
def main():
arg_parser = argparse.ArgumentParser(
description="Expression Evaluator -- learn DAGs and state machines",
epilog="Examples:\n"
" python main.py '3 + 4 * 2'\n"
" python main.py --show-tokens --trace '-(3 + 4) ^ 2'\n"
" python main.py --dot '(2+3)*4' | dot -Tpng -o ast.png\n"
" python main.py --dot-fsm | dot -Tpng -o fsm.png",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
arg_parser.add_argument(
"expression", nargs="?",
help="Expression to evaluate (omit for REPL mode)",
)
arg_parser.add_argument(
"--show-tokens", action="store_true",
help="Display tokenizer output",
)
arg_parser.add_argument(
"--show-ast", action="store_true",
help="Display AST as indented text tree",
)
arg_parser.add_argument(
"--trace", action="store_true",
help="Show step-by-step evaluation trace",
)
arg_parser.add_argument(
"--dot", action="store_true",
help="Output AST as graphviz dot (pipe to: dot -Tpng -o ast.png)",
)
arg_parser.add_argument(
"--dot-fsm", action="store_true",
help="Output tokenizer FSM as graphviz dot",
)
args = arg_parser.parse_args()
# Special mode: just print the FSM diagram and exit
if args.dot_fsm:
print(fsm_to_dot())
return
# REPL mode if no expression given
if args.expression is None:
repl(args)
else:
process_expression(args.expression, args)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,217 @@
"""
Part 2: DAG Construction -- Recursive Descent Parser
=====================================================
A parser converts a flat list of tokens into a tree structure (AST).
The AST is a DAG (Directed Acyclic Graph) where:
- Nodes are operations (BinOpNode) or values (NumberNode)
- Edges point from parent operations to their operands
- The graph is acyclic because an operation's inputs are always
"simpler" sub-expressions (no circular dependencies)
- It is a tree (a special case of DAG) because no node is shared
This is the same structure as:
- Spreadsheet dependency graphs (cell A1 depends on B1, B2...)
- Build systems (Makefile targets depend on other targets)
- Task scheduling (some tasks must finish before others start)
- Neural network computation graphs (forward pass is a DAG)
Key DAG concepts demonstrated:
- Nodes: operations and values
- Directed edges: from operation to its inputs (dependencies)
- Acyclic: no circular dependencies
- Topological ordering: natural evaluation order (leaves first)
Grammar (BNF) -- precedence is encoded by nesting depth:
expression ::= term ((PLUS | MINUS) term)* # lowest precedence
term ::= unary ((MULTIPLY | DIVIDE) unary)*
unary ::= UNARY_MINUS unary | power
power ::= atom (POWER power)? # right-associative
atom ::= NUMBER | LPAREN expression RPAREN # highest precedence
Call chain: expression -> term -> unary -> power -> atom
This means: + and - bind loosest, then * and /, then unary -, then ^, then parens.
So -3^2 = -(3^2) = -9, matching standard math convention.
"""
from dataclasses import dataclass
from tokenizer import Token, TokenType
# ---------- AST node types ----------
# These are the nodes of our DAG. Each node is either a leaf (NumberNode)
# or an interior node with edges pointing to its children (operands).
@dataclass
class NumberNode:
"""Leaf node: a numeric literal. In DAG terms, a node with no outgoing edges."""
value: float
def __repr__(self):
if self.value == int(self.value):
return f"NumberNode({int(self.value)})"
return f"NumberNode({self.value})"
@dataclass
class BinOpNode:
"""
Interior node: a binary operation with two children.
DAG edges: this node -> left, this node -> right
The edges represent "depends on": to compute this node's value,
we must first compute left and right.
"""
op: TokenType
left: 'NumberNode | BinOpNode | UnaryOpNode'
right: 'NumberNode | BinOpNode | UnaryOpNode'
def __repr__(self):
return f"BinOpNode({self.op.name}, {self.left}, {self.right})"
@dataclass
class UnaryOpNode:
"""Interior node: a unary operation (negation) with one child."""
op: TokenType
operand: 'NumberNode | BinOpNode | UnaryOpNode'
def __repr__(self):
return f"UnaryOpNode({self.op.name}, {self.operand})"
# Union type for any AST node
Node = NumberNode | BinOpNode | UnaryOpNode
# ---------- Errors ----------
class ParseError(Exception):
def __init__(self, message, position=None):
self.position = position
pos_info = f" at position {position}" if position is not None else ""
super().__init__(f"Parse error{pos_info}: {message}")
# ---------- Recursive descent parser ----------
class Parser:
"""
Converts a list of tokens into an AST (expression tree / DAG).
Each grammar rule becomes a method. The call tree mirrors the shape
of the AST being built. When a deeper method returns a node, it
becomes a child of the node built by the caller -- this is how
the DAG edges form.
Precedence is encoded by nesting: lower-precedence operators are
parsed at higher (outer) levels, so they become closer to the root
of the tree and are evaluated last.
"""
def __init__(self, tokens):
self.tokens = tokens
self.pos = 0
def peek(self):
"""Look at the current token without consuming it."""
return self.tokens[self.pos]
def consume(self, expected=None):
"""Consume and return the current token, optionally asserting its type."""
token = self.tokens[self.pos]
if expected is not None and token.type != expected:
raise ParseError(
f"expected {expected.name}, got {token.type.name}",
token.position,
)
self.pos += 1
return token
def parse(self):
"""Entry point: parse the full expression and verify we consumed everything."""
if self.peek().type == TokenType.EOF:
raise ParseError("empty expression")
node = self.expression()
self.consume(TokenType.EOF)
return node
# --- Grammar rules ---
# Each method corresponds to one production in the grammar.
# The nesting encodes operator precedence.
def expression(self):
"""expression ::= term ((PLUS | MINUS) term)*"""
node = self.term()
while self.peek().type in (TokenType.PLUS, TokenType.MINUS):
op_token = self.consume()
right = self.term()
# Build a new BinOpNode -- this creates a DAG edge from
# the new node to both 'node' (left) and 'right'
node = BinOpNode(op_token.type, node, right)
return node
def term(self):
"""term ::= unary ((MULTIPLY | DIVIDE) unary)*"""
node = self.unary()
while self.peek().type in (TokenType.MULTIPLY, TokenType.DIVIDE):
op_token = self.consume()
right = self.unary()
node = BinOpNode(op_token.type, node, right)
return node
def unary(self):
"""
unary ::= UNARY_MINUS unary | power
Unary minus is parsed here, between term and power, so it binds
looser than ^ but tighter than * and /. This gives the standard
math behavior: -3^2 = -(3^2) = -9.
The recursion (unary calls itself) handles double negation: --3 = 3.
"""
if self.peek().type == TokenType.UNARY_MINUS:
op_token = self.consume()
operand = self.unary()
return UnaryOpNode(op_token.type, operand)
return self.power()
def power(self):
"""
power ::= atom (POWER power)?
Right-recursive for right-associativity: 2^3^4 = 2^(3^4) = 2^81.
Compare with term() which uses a while loop for LEFT-associativity.
"""
node = self.atom()
if self.peek().type == TokenType.POWER:
op_token = self.consume()
right = self.power() # recurse (not loop) for right-associativity
node = BinOpNode(op_token.type, node, right)
return node
def atom(self):
"""
atom ::= NUMBER | LPAREN expression RPAREN
The base case: either a literal number or a parenthesized
sub-expression. Parentheses work by recursing back to
expression(), which restarts precedence parsing from the top.
"""
token = self.peek()
if token.type == TokenType.NUMBER:
self.consume()
return NumberNode(float(token.value))
if token.type == TokenType.LPAREN:
self.consume()
node = self.expression()
self.consume(TokenType.RPAREN)
return node
raise ParseError(
f"expected number or '(', got {token.type.name}",
token.position,
)

View File

@@ -0,0 +1,120 @@
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
import pytest
from tokenizer import tokenize
from parser import Parser
from evaluator import evaluate, evaluate_traced, EvalError
def eval_expr(expr):
"""Helper: tokenize -> parse -> evaluate in one step."""
tokens = tokenize(expr)
ast = Parser(tokens).parse()
return evaluate(ast)
# ---------- Basic arithmetic ----------
def test_addition():
assert eval_expr("3 + 4") == 7.0
def test_subtraction():
assert eval_expr("10 - 3") == 7.0
def test_multiplication():
assert eval_expr("3 * 4") == 12.0
def test_division():
assert eval_expr("10 / 4") == 2.5
def test_power():
assert eval_expr("2 ^ 10") == 1024.0
# ---------- Precedence ----------
def test_standard_precedence():
assert eval_expr("3 + 4 * 2") == 11.0
def test_parentheses():
assert eval_expr("(3 + 4) * 2") == 14.0
def test_power_precedence():
assert eval_expr("2 * 3 ^ 2") == 18.0
def test_right_associative_power():
# 2^(2^3) = 2^8 = 256
assert eval_expr("2 ^ 2 ^ 3") == 256.0
# ---------- Unary minus ----------
def test_negation():
assert eval_expr("-5") == -5.0
def test_double_negation():
assert eval_expr("--5") == 5.0
def test_negation_with_power():
# -(3^2) = -9, not (-3)^2 = 9
assert eval_expr("-3 ^ 2") == -9.0
def test_negation_in_parens():
assert eval_expr("(-3) ^ 2") == 9.0
# ---------- Decimals ----------
def test_decimal_addition():
assert eval_expr("0.1 + 0.2") == pytest.approx(0.3)
def test_leading_dot():
assert eval_expr(".5 + .5") == 1.0
# ---------- Edge cases ----------
def test_nested_parens():
assert eval_expr("((((3))))") == 3.0
def test_complex_expression():
assert eval_expr("(2 + 3) * (7 - 2) / 5 ^ 1") == 5.0
def test_long_chain():
assert eval_expr("1 + 2 + 3 + 4 + 5") == 15.0
def test_mixed_operations():
assert eval_expr("2 + 3 * 4 - 6 / 2") == 11.0
# ---------- Division by zero ----------
def test_division_by_zero():
with pytest.raises(EvalError):
eval_expr("1 / 0")
def test_division_by_zero_in_expression():
with pytest.raises(EvalError):
eval_expr("5 + 3 / (2 - 2)")
# ---------- Traced evaluation ----------
def test_traced_returns_correct_result():
tokens = tokenize("3 + 4 * 2")
ast = Parser(tokens).parse()
result, steps = evaluate_traced(ast)
assert result == 11.0
assert len(steps) > 0
def test_traced_step_count():
"""A simple binary op has 3 evaluation events: left, right, combine."""
tokens = tokenize("3 + 4")
ast = Parser(tokens).parse()
result, steps = evaluate_traced(ast)
assert result == 7.0
# NumberNode(3), NumberNode(4), BinOp(+)
assert len(steps) == 3

View File

@@ -0,0 +1,136 @@
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
import pytest
from tokenizer import tokenize, TokenType
from parser import Parser, ParseError, NumberNode, BinOpNode, UnaryOpNode
def parse(expr):
"""Helper: tokenize and parse in one step."""
return Parser(tokenize(expr)).parse()
# ---------- Basic parsing ----------
def test_parse_number():
ast = parse("42")
assert isinstance(ast, NumberNode)
assert ast.value == 42.0
def test_parse_decimal():
ast = parse("3.14")
assert isinstance(ast, NumberNode)
assert ast.value == 3.14
def test_parse_addition():
ast = parse("3 + 4")
assert isinstance(ast, BinOpNode)
assert ast.op == TokenType.PLUS
assert isinstance(ast.left, NumberNode)
assert isinstance(ast.right, NumberNode)
# ---------- Precedence ----------
def test_multiply_before_add():
"""3 + 4 * 2 should parse as 3 + (4 * 2)."""
ast = parse("3 + 4 * 2")
assert ast.op == TokenType.PLUS
assert isinstance(ast.right, BinOpNode)
assert ast.right.op == TokenType.MULTIPLY
def test_power_before_multiply():
"""2 * 3 ^ 4 should parse as 2 * (3 ^ 4)."""
ast = parse("2 * 3 ^ 4")
assert ast.op == TokenType.MULTIPLY
assert isinstance(ast.right, BinOpNode)
assert ast.right.op == TokenType.POWER
def test_parentheses_override_precedence():
"""(3 + 4) * 2 should parse as (3 + 4) * 2."""
ast = parse("(3 + 4) * 2")
assert ast.op == TokenType.MULTIPLY
assert isinstance(ast.left, BinOpNode)
assert ast.left.op == TokenType.PLUS
# ---------- Associativity ----------
def test_left_associative_subtraction():
"""10 - 3 - 2 should parse as (10 - 3) - 2."""
ast = parse("10 - 3 - 2")
assert ast.op == TokenType.MINUS
assert isinstance(ast.left, BinOpNode)
assert ast.left.op == TokenType.MINUS
assert isinstance(ast.right, NumberNode)
def test_power_right_associative():
"""2 ^ 3 ^ 4 should parse as 2 ^ (3 ^ 4)."""
ast = parse("2 ^ 3 ^ 4")
assert ast.op == TokenType.POWER
assert isinstance(ast.left, NumberNode)
assert isinstance(ast.right, BinOpNode)
assert ast.right.op == TokenType.POWER
# ---------- Unary minus ----------
def test_unary_minus():
ast = parse("-3")
assert isinstance(ast, UnaryOpNode)
assert ast.operand.value == 3.0
def test_double_negation():
ast = parse("--3")
assert isinstance(ast, UnaryOpNode)
assert isinstance(ast.operand, UnaryOpNode)
assert ast.operand.operand.value == 3.0
def test_unary_minus_precedence():
"""-3^2 should parse as -(3^2), not (-3)^2."""
ast = parse("-3 ^ 2")
assert isinstance(ast, UnaryOpNode)
assert isinstance(ast.operand, BinOpNode)
assert ast.operand.op == TokenType.POWER
def test_unary_minus_in_expression():
"""2 * -3 should parse as 2 * (-(3))."""
ast = parse("2 * -3")
assert ast.op == TokenType.MULTIPLY
assert isinstance(ast.right, UnaryOpNode)
# ---------- Nested parentheses ----------
def test_nested_parens():
ast = parse("((3))")
assert isinstance(ast, NumberNode)
assert ast.value == 3.0
def test_complex_nesting():
"""((2 + 3) * (7 - 2))"""
ast = parse("((2 + 3) * (7 - 2))")
assert isinstance(ast, BinOpNode)
assert ast.op == TokenType.MULTIPLY
# ---------- Errors ----------
def test_missing_rparen():
with pytest.raises(ParseError):
parse("(3 + 4")
def test_empty_expression():
with pytest.raises(ParseError):
parse("")
def test_trailing_operator():
with pytest.raises(ParseError):
parse("3 +")
def test_empty_parens():
with pytest.raises(ParseError):
parse("()")

View File

@@ -0,0 +1,139 @@
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))
import pytest
from tokenizer import tokenize, TokenType, Token, TokenError
# ---------- Basic tokens ----------
def test_single_integer():
tokens = tokenize("42")
assert tokens[0].type == TokenType.NUMBER
assert tokens[0].value == "42"
def test_decimal_number():
tokens = tokenize("3.14")
assert tokens[0].type == TokenType.NUMBER
assert tokens[0].value == "3.14"
def test_leading_dot():
tokens = tokenize(".5")
assert tokens[0].type == TokenType.NUMBER
assert tokens[0].value == ".5"
def test_all_operators():
"""Operators between numbers are all binary."""
tokens = tokenize("1 + 1 - 1 * 1 / 1 ^ 1")
ops = [t.type for t in tokens if t.type not in (TokenType.NUMBER, TokenType.EOF)]
assert ops == [
TokenType.PLUS, TokenType.MINUS, TokenType.MULTIPLY,
TokenType.DIVIDE, TokenType.POWER,
]
def test_operators_between_numbers():
tokens = tokenize("1 + 2 - 3 * 4 / 5 ^ 6")
ops = [t.type for t in tokens if t.type not in (TokenType.NUMBER, TokenType.EOF)]
assert ops == [
TokenType.PLUS, TokenType.MINUS, TokenType.MULTIPLY,
TokenType.DIVIDE, TokenType.POWER,
]
def test_parentheses():
tokens = tokenize("()")
assert tokens[0].type == TokenType.LPAREN
assert tokens[1].type == TokenType.RPAREN
# ---------- Unary minus ----------
def test_unary_minus_at_start():
tokens = tokenize("-3")
assert tokens[0].type == TokenType.UNARY_MINUS
assert tokens[1].type == TokenType.NUMBER
def test_unary_minus_after_lparen():
tokens = tokenize("(-3)")
assert tokens[1].type == TokenType.UNARY_MINUS
def test_unary_minus_after_operator():
tokens = tokenize("2 * -3")
assert tokens[2].type == TokenType.UNARY_MINUS
def test_binary_minus():
tokens = tokenize("5 - 3")
assert tokens[1].type == TokenType.MINUS
def test_double_unary_minus():
tokens = tokenize("--3")
assert tokens[0].type == TokenType.UNARY_MINUS
assert tokens[1].type == TokenType.UNARY_MINUS
assert tokens[2].type == TokenType.NUMBER
# ---------- Whitespace handling ----------
def test_no_spaces():
tokens = tokenize("3+4")
non_eof = [t for t in tokens if t.type != TokenType.EOF]
assert len(non_eof) == 3
def test_extra_spaces():
tokens = tokenize(" 3 + 4 ")
non_eof = [t for t in tokens if t.type != TokenType.EOF]
assert len(non_eof) == 3
# ---------- Position tracking ----------
def test_positions():
tokens = tokenize("3 + 4")
assert tokens[0].position == 0 # '3'
assert tokens[1].position == 2 # '+'
assert tokens[2].position == 4 # '4'
# ---------- Errors ----------
def test_invalid_character():
with pytest.raises(TokenError):
tokenize("3 & 4")
def test_double_dot():
with pytest.raises(TokenError):
tokenize("3.14.15")
# ---------- EOF token ----------
def test_eof_always_present():
tokens = tokenize("42")
assert tokens[-1].type == TokenType.EOF
def test_empty_input():
tokens = tokenize("")
assert len(tokens) == 1
assert tokens[0].type == TokenType.EOF
# ---------- Complex expressions ----------
def test_complex_expression():
tokens = tokenize("(3 + 4.5) * -2 ^ 3")
types = [t.type for t in tokens if t.type != TokenType.EOF]
assert types == [
TokenType.LPAREN, TokenType.NUMBER, TokenType.PLUS,
TokenType.NUMBER, TokenType.RPAREN, TokenType.MULTIPLY,
TokenType.UNARY_MINUS, TokenType.NUMBER, TokenType.POWER,
TokenType.NUMBER,
]
def test_adjacent_parens():
tokens = tokenize("(3)(4)")
types = [t.type for t in tokens if t.type != TokenType.EOF]
assert types == [
TokenType.LPAREN, TokenType.NUMBER, TokenType.RPAREN,
TokenType.LPAREN, TokenType.NUMBER, TokenType.RPAREN,
]

View File

@@ -0,0 +1,306 @@
"""
Part 1: State Machine Tokenizer
================================
A tokenizer (lexer) converts raw text into a stream of tokens.
This implementation uses an EXPLICIT finite state machine (FSM):
- States are named values (an enum), not implicit control flow
- A transition table maps (current_state, input_class) -> (next_state, action)
- The main loop reads one character at a time and consults the table
This is the same pattern used in:
- Network protocol parsers (HTTP, TCP state machines)
- Regular expression engines
- Compiler front-ends (lexers for C, Python, etc.)
- Game AI (enemy behavior states)
Key FSM concepts demonstrated:
- States: the "memory" of what we're currently building
- Transitions: rules for moving between states based on input
- Actions: side effects (emit a token, accumulate a character)
- Mealy machine: outputs depend on both state AND input
"""
from dataclasses import dataclass
from enum import Enum
# ---------- Token types ----------
class TokenType(Enum):
NUMBER = "NUMBER"
PLUS = "PLUS"
MINUS = "MINUS"
MULTIPLY = "MULTIPLY"
DIVIDE = "DIVIDE"
POWER = "POWER"
LPAREN = "LPAREN"
RPAREN = "RPAREN"
UNARY_MINUS = "UNARY_MINUS"
EOF = "EOF"
@dataclass
class Token:
type: TokenType
value: str # raw text: "42", "+", "(", etc.
position: int # character offset in original expression
def __repr__(self):
return f"Token({self.type.name}, {self.value!r}, pos={self.position})"
OPERATOR_MAP = {
'+': TokenType.PLUS,
'-': TokenType.MINUS,
'*': TokenType.MULTIPLY,
'/': TokenType.DIVIDE,
'^': TokenType.POWER,
}
# ---------- FSM state definitions ----------
class State(Enum):
"""
The tokenizer's finite set of states.
START -- idle / between tokens, deciding what comes next
INTEGER -- accumulating digits of an integer (e.g. "12" so far)
DECIMAL -- accumulating digits after a decimal point (e.g. "12.3" so far)
"""
START = "START"
INTEGER = "INTEGER"
DECIMAL = "DECIMAL"
class CharClass(Enum):
"""
Character classification -- groups raw characters into categories
so the transition table stays small and readable.
"""
DIGIT = "DIGIT"
DOT = "DOT"
OPERATOR = "OPERATOR"
LPAREN = "LPAREN"
RPAREN = "RPAREN"
SPACE = "SPACE"
EOF = "EOF"
UNKNOWN = "UNKNOWN"
class Action(Enum):
"""
What the FSM does on a transition. In a Mealy machine, the output
(action) depends on both the current state AND the input.
"""
ACCUMULATE = "ACCUMULATE"
EMIT_NUMBER = "EMIT_NUMBER"
EMIT_OPERATOR = "EMIT_OPERATOR"
EMIT_LPAREN = "EMIT_LPAREN"
EMIT_RPAREN = "EMIT_RPAREN"
EMIT_NUMBER_THEN_OP = "EMIT_NUMBER_THEN_OP"
EMIT_NUMBER_THEN_LPAREN = "EMIT_NUMBER_THEN_LPAREN"
EMIT_NUMBER_THEN_RPAREN = "EMIT_NUMBER_THEN_RPAREN"
EMIT_NUMBER_THEN_DONE = "EMIT_NUMBER_THEN_DONE"
SKIP = "SKIP"
DONE = "DONE"
ERROR = "ERROR"
@dataclass(frozen=True)
class Transition:
next_state: State
action: Action
# ---------- Transition table ----------
# This is the heart of the state machine. Every (state, char_class) pair
# maps to exactly one transition: a next state and an action to perform.
# Making this a data structure (not nested if/else) means we can:
# 1. Inspect it programmatically (e.g. to generate a diagram)
# 2. Verify completeness (every combination is covered)
# 3. Understand the FSM at a glance
TRANSITIONS = {
# --- START: between tokens, dispatch based on character class ---
(State.START, CharClass.DIGIT): Transition(State.INTEGER, Action.ACCUMULATE),
(State.START, CharClass.DOT): Transition(State.DECIMAL, Action.ACCUMULATE),
(State.START, CharClass.OPERATOR): Transition(State.START, Action.EMIT_OPERATOR),
(State.START, CharClass.LPAREN): Transition(State.START, Action.EMIT_LPAREN),
(State.START, CharClass.RPAREN): Transition(State.START, Action.EMIT_RPAREN),
(State.START, CharClass.SPACE): Transition(State.START, Action.SKIP),
(State.START, CharClass.EOF): Transition(State.START, Action.DONE),
# --- INTEGER: accumulating digits like "123" ---
(State.INTEGER, CharClass.DIGIT): Transition(State.INTEGER, Action.ACCUMULATE),
(State.INTEGER, CharClass.DOT): Transition(State.DECIMAL, Action.ACCUMULATE),
(State.INTEGER, CharClass.OPERATOR): Transition(State.START, Action.EMIT_NUMBER_THEN_OP),
(State.INTEGER, CharClass.LPAREN): Transition(State.START, Action.EMIT_NUMBER_THEN_LPAREN),
(State.INTEGER, CharClass.RPAREN): Transition(State.START, Action.EMIT_NUMBER_THEN_RPAREN),
(State.INTEGER, CharClass.SPACE): Transition(State.START, Action.EMIT_NUMBER),
(State.INTEGER, CharClass.EOF): Transition(State.START, Action.EMIT_NUMBER_THEN_DONE),
# --- DECIMAL: accumulating digits after "." like "123.45" ---
(State.DECIMAL, CharClass.DIGIT): Transition(State.DECIMAL, Action.ACCUMULATE),
(State.DECIMAL, CharClass.DOT): Transition(State.START, Action.ERROR),
(State.DECIMAL, CharClass.OPERATOR): Transition(State.START, Action.EMIT_NUMBER_THEN_OP),
(State.DECIMAL, CharClass.LPAREN): Transition(State.START, Action.EMIT_NUMBER_THEN_LPAREN),
(State.DECIMAL, CharClass.RPAREN): Transition(State.START, Action.EMIT_NUMBER_THEN_RPAREN),
(State.DECIMAL, CharClass.SPACE): Transition(State.START, Action.EMIT_NUMBER),
(State.DECIMAL, CharClass.EOF): Transition(State.START, Action.EMIT_NUMBER_THEN_DONE),
}
# ---------- Errors ----------
class TokenError(Exception):
def __init__(self, message, position):
self.position = position
super().__init__(f"Token error at position {position}: {message}")
# ---------- Character classification ----------
def classify(ch):
"""Map a single character to its CharClass."""
if ch.isdigit():
return CharClass.DIGIT
if ch == '.':
return CharClass.DOT
if ch in OPERATOR_MAP:
return CharClass.OPERATOR
if ch == '(':
return CharClass.LPAREN
if ch == ')':
return CharClass.RPAREN
if ch.isspace():
return CharClass.SPACE
return CharClass.UNKNOWN
# ---------- Main tokenize function ----------
def tokenize(expression):
"""
Process an expression string through the state machine, producing tokens.
The main loop:
1. Classify the current character
2. Look up (state, char_class) in the transition table
3. Execute the action (accumulate, emit, skip, etc.)
4. Move to the next state
5. Advance to the next character
After all tokens are emitted, a post-processing step resolves
unary minus: if a MINUS token appears at the start, after an operator,
or after LPAREN, it is re-classified as UNARY_MINUS.
"""
state = State.START
buffer = [] # characters accumulated for the current token
buffer_start = 0 # position where the current buffer started
tokens = []
pos = 0
# Append a sentinel so EOF is handled uniformly in the loop
chars = expression + '\0'
while pos <= len(expression):
ch = chars[pos]
char_class = CharClass.EOF if pos == len(expression) else classify(ch)
if char_class == CharClass.UNKNOWN:
raise TokenError(f"unexpected character {ch!r}", pos)
# Look up the transition
key = (state, char_class)
transition = TRANSITIONS.get(key)
if transition is None:
raise TokenError(f"no transition for state={state.name}, input={char_class.name}", pos)
action = transition.action
next_state = transition.next_state
# --- Execute the action ---
if action == Action.ACCUMULATE:
if not buffer:
buffer_start = pos
buffer.append(ch)
elif action == Action.EMIT_NUMBER:
tokens.append(Token(TokenType.NUMBER, ''.join(buffer), buffer_start))
buffer.clear()
elif action == Action.EMIT_OPERATOR:
tokens.append(Token(OPERATOR_MAP[ch], ch, pos))
elif action == Action.EMIT_LPAREN:
tokens.append(Token(TokenType.LPAREN, ch, pos))
elif action == Action.EMIT_RPAREN:
tokens.append(Token(TokenType.RPAREN, ch, pos))
elif action == Action.EMIT_NUMBER_THEN_OP:
tokens.append(Token(TokenType.NUMBER, ''.join(buffer), buffer_start))
buffer.clear()
tokens.append(Token(OPERATOR_MAP[ch], ch, pos))
elif action == Action.EMIT_NUMBER_THEN_LPAREN:
tokens.append(Token(TokenType.NUMBER, ''.join(buffer), buffer_start))
buffer.clear()
tokens.append(Token(TokenType.LPAREN, ch, pos))
elif action == Action.EMIT_NUMBER_THEN_RPAREN:
tokens.append(Token(TokenType.NUMBER, ''.join(buffer), buffer_start))
buffer.clear()
tokens.append(Token(TokenType.RPAREN, ch, pos))
elif action == Action.EMIT_NUMBER_THEN_DONE:
tokens.append(Token(TokenType.NUMBER, ''.join(buffer), buffer_start))
buffer.clear()
elif action == Action.SKIP:
pass
elif action == Action.DONE:
pass
elif action == Action.ERROR:
raise TokenError(f"unexpected {ch!r} in state {state.name}", pos)
state = next_state
pos += 1
# --- Post-processing: resolve unary minus ---
# A MINUS is unary if it appears:
# - at the very start of the token stream
# - immediately after an operator (+, -, *, /, ^) or LPAREN
    # This context is not tracked by the character-level FSM (its state only
    # records what kind of token is being built), so a post-pass over the
    # emitted tokens resolves it.
_resolve_unary_minus(tokens)
tokens.append(Token(TokenType.EOF, '', len(expression)))
return tokens
def _resolve_unary_minus(tokens):
"""
Convert binary MINUS tokens to UNARY_MINUS where appropriate.
Why this isn't in the FSM: the FSM processes characters one at a time
and only tracks what kind of token it's currently building (its state).
But whether '-' is unary or binary depends on the PREVIOUS TOKEN --
information the FSM doesn't track. This is a common real-world pattern:
the lexer handles most work, then a lightweight post-pass adds context.
"""
unary_predecessor = {
TokenType.PLUS, TokenType.MINUS, TokenType.MULTIPLY,
TokenType.DIVIDE, TokenType.POWER, TokenType.LPAREN,
TokenType.UNARY_MINUS,
}
for i, token in enumerate(tokens):
if token.type != TokenType.MINUS:
continue
if i == 0 or tokens[i - 1].type in unary_predecessor:
tokens[i] = Token(TokenType.UNARY_MINUS, token.value, token.position)
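The pattern above (a pure-data transition table driven by a small loop) transfers directly to other tokenizing problems. A self-contained sketch with a toy two-state table, not the tokenizer's own, shows the whole mechanism:

```python
from enum import Enum, auto

class S(Enum):
    # Toy machine that extracts runs of digits from arbitrary text.
    START = auto()
    NUM = auto()

# (state, input_is_digit) -> (next_state, action), mirroring TRANSITIONS
TOY = {
    (S.START, True):  (S.NUM,   "accumulate"),
    (S.START, False): (S.START, "skip"),
    (S.NUM,   True):  (S.NUM,   "accumulate"),
    (S.NUM,   False): (S.START, "emit"),
}

def runs_of_digits(text):
    """Drive the toy Mealy machine over text, emitting runs of digits."""
    state, buf, out = S.START, [], []
    for ch in text + "\0":  # sentinel, as in tokenize() above
        state, action = TOY[(state, ch.isdigit())]
        if action == "accumulate":
            buf.append(ch)
        elif action == "emit":
            out.append("".join(buf))
            buf.clear()
    return out

print(runs_of_digits("ab12 345"))  # ['12', '345']
```

Because TOY is plain data, the same table could be checked for completeness or rendered as a diagram, just as fsm_to_dot does for TRANSITIONS.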

View File

@@ -0,0 +1,200 @@
"""
Part 4: Visualization -- Graphviz Dot Output
==============================================
Generate graphviz dot-format strings for:
1. The tokenizer's finite state machine (FSM)
2. Any expression's AST (DAG)
3. Text-based tree rendering for the terminal
No external dependencies -- outputs raw dot strings that can be piped
to the 'dot' command: python main.py --dot "3+4*2" | dot -Tpng -o ast.png
"""
from parser import NumberNode, BinOpNode, UnaryOpNode, Node
from tokenizer import TRANSITIONS, State, CharClass, Action, TokenType
# ---------- FSM diagram ----------
# Human-readable labels for character classes
_CHAR_LABELS = {
CharClass.DIGIT: "digit",
CharClass.DOT: "'.'",
CharClass.OPERATOR: "op",
CharClass.LPAREN: "'('",
CharClass.RPAREN: "')'",
CharClass.SPACE: "space",
CharClass.EOF: "EOF",
}
# Short labels for actions
_ACTION_LABELS = {
Action.ACCUMULATE: "accum",
Action.EMIT_NUMBER: "emit num",
Action.EMIT_OPERATOR: "emit op",
Action.EMIT_LPAREN: "emit '('",
Action.EMIT_RPAREN: "emit ')'",
Action.EMIT_NUMBER_THEN_OP: "emit num+op",
Action.EMIT_NUMBER_THEN_LPAREN: "emit num+'('",
Action.EMIT_NUMBER_THEN_RPAREN: "emit num+')'",
Action.EMIT_NUMBER_THEN_DONE: "emit num, done",
Action.SKIP: "skip",
Action.DONE: "done",
Action.ERROR: "ERROR",
}
def fsm_to_dot():
"""
Generate a graphviz dot diagram of the tokenizer's state machine.
Reads the TRANSITIONS table directly -- because the FSM is data (a dict),
we can programmatically inspect and visualize it. This is a key advantage
of explicit state machines over implicit if/else control flow.
"""
lines = [
'digraph FSM {',
' rankdir=LR;',
' node [shape=circle, fontname="Helvetica"];',
' edge [fontname="Helvetica", fontsize=10];',
'',
' // Start indicator',
' __start__ [shape=point, width=0.2];',
' __start__ -> START;',
'',
]
# Collect edges grouped by (src, dst) to merge labels
edge_labels = {}
for (state, char_class), transition in TRANSITIONS.items():
src = state.name
dst = transition.next_state.name
char_label = _CHAR_LABELS.get(char_class, char_class.name)
action_label = _ACTION_LABELS.get(transition.action, transition.action.name)
label = f"{char_label} / {action_label}"
edge_labels.setdefault((src, dst), []).append(label)
# Emit edges
for (src, dst), labels in sorted(edge_labels.items()):
combined = "\\n".join(labels)
lines.append(f' {src} -> {dst} [label="{combined}"];')
lines.append('}')
return '\n'.join(lines)
# ---------- AST diagram ----------
_OP_LABELS = {
TokenType.PLUS: '+',
TokenType.MINUS: '-',
TokenType.MULTIPLY: '*',
TokenType.DIVIDE: '/',
TokenType.POWER: '^',
TokenType.UNARY_MINUS: 'neg',
}
def ast_to_dot(node):
"""
Generate a graphviz dot diagram of an AST (expression tree / DAG).
Each node gets a unique ID. Edges go from parent to children,
showing the directed acyclic structure. Leaves are boxed,
operators are ellipses.
"""
lines = [
'digraph AST {',
' node [fontname="Helvetica"];',
' edge [fontname="Helvetica"];',
'',
]
counter = [0]
def _visit(node):
nid = f"n{counter[0]}"
counter[0] += 1
match node:
case NumberNode(value=v):
label = _format_number(v)
lines.append(f' {nid} [label="{label}", shape=box, style=rounded];')
return nid
case UnaryOpNode(op=op, operand=child):
label = _OP_LABELS.get(op, op.name)
lines.append(f' {nid} [label="{label}", shape=ellipse];')
child_id = _visit(child)
lines.append(f' {nid} -> {child_id};')
return nid
case BinOpNode(op=op, left=left, right=right):
label = _OP_LABELS.get(op, op.name)
lines.append(f' {nid} [label="{label}", shape=ellipse];')
left_id = _visit(left)
right_id = _visit(right)
lines.append(f' {nid} -> {left_id} [label="L"];')
lines.append(f' {nid} -> {right_id} [label="R"];')
return nid
_visit(node)
lines.append('}')
return '\n'.join(lines)
# ---------- Text-based tree ----------
def ast_to_text(node, prefix="", connector=""):
"""
Render the AST as an indented text tree for terminal display.
Example output for (2 + 3) * 4:
        *
        +-- +
        |   +-- 2
        |   `-- 3
        `-- 4
"""
match node:
case NumberNode(value=v):
label = _format_number(v)
case UnaryOpNode(op=op):
label = _OP_LABELS.get(op, op.name)
case BinOpNode(op=op):
label = _OP_LABELS.get(op, op.name)
lines = [f"{prefix}{connector}{label}"]
children = _get_children(node)
for i, child in enumerate(children):
is_last_child = (i == len(children) - 1)
        if connector:
            # Extend the prefix: children of a "+-- " node see "|   "
            # (the vertical bar continues); children of a "`-- " (last)
            # node see plain spaces.
            child_prefix = prefix + ("|   " if connector == "+-- " else "    ")
        else:
            child_prefix = prefix
        child_connector = "`-- " if is_last_child else "+-- "
child_lines = ast_to_text(child, child_prefix, child_connector)
lines.append(child_lines)
return '\n'.join(lines)
def _get_children(node):
match node:
case NumberNode():
return []
case UnaryOpNode(operand=child):
return [child]
case BinOpNode(left=left, right=right):
return [left, right]
return []
def _format_number(v):
if isinstance(v, float) and v == int(v):
return str(int(v))
return str(v)
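The counter-in-a-list trick in ast_to_dot is worth seeing in isolation: a closure cannot rebind an outer local without `nonlocal`, but it can mutate a list. This sketch applies the same pattern to a plain nested-tuple tree, not the module's Node classes:

```python
def tree_to_dot(tree):
    """Render a nested ('op', left, right) / leaf tree as graphviz dot text."""
    lines = ["digraph AST {"]
    counter = [0]  # mutable cell: the closure increments it for unique IDs

    def visit(node):
        nid = f"n{counter[0]}"
        counter[0] += 1
        if isinstance(node, tuple):
            op, left, right = node
            lines.append(f'  {nid} [label="{op}"];')
            lines.append(f'  {nid} -> {visit(left)} [label="L"];')
            lines.append(f'  {nid} -> {visit(right)} [label="R"];')
        else:
            lines.append(f'  {nid} [label="{node}", shape=box];')
        return nid

    visit(tree)
    lines.append("}")
    return "\n".join(lines)

dot = tree_to_dot(("+", 2, ("*", 3, 4)))
# Graphviz ignores statement order, so emitting a child's node line
# before the edge that points at it is fine.
```

In modern Python a `nonlocal` integer works just as well; the one-element list form predates it and still reads clearly.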

View File

@@ -6,9 +6,18 @@ import ollama
 DEFAULT_OLLAMA_MODEL = "qwen2.5:7b"
+_ollama_model = DEFAULT_OLLAMA_MODEL
 
-def ask_ollama(prompt, system=None, model=DEFAULT_OLLAMA_MODEL):
+def set_ollama_model(model):
+    """Change the Ollama model used for fast queries."""
+    global _ollama_model
+    _ollama_model = model
+
+def ask_ollama(prompt, system=None, model=None):
     """Query Ollama with an optional system prompt."""
+    model = model or _ollama_model
     messages = []
     if system:
         messages.append({"role": "system", "content": system})
@@ -24,6 +33,8 @@ def ask_claude(prompt):
         capture_output=True,
         text=True,
     )
+    if result.returncode != 0:
+        raise RuntimeError(f"Claude CLI failed (exit {result.returncode}): {result.stderr.strip()}")
     return result.stdout.strip()
@@ -34,8 +45,9 @@ def ask(prompt, system=None, quality="fast"):
         return ask_ollama(prompt, system=system)
 
-def chat_ollama(messages, system=None, model=DEFAULT_OLLAMA_MODEL):
+def chat_ollama(messages, system=None, model=None):
     """Multi-turn conversation with Ollama."""
+    model = model or _ollama_model
     all_messages = []
     if system:
         all_messages.append({"role": "system", "content": system})
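The change boils down to one small pattern: a module-level current setting that an explicit argument can still override. In isolation (names illustrative, not the project's):

```python
_DEFAULT_MODEL = "qwen2.5:7b"
_current_model = _DEFAULT_MODEL

def set_model(name):
    """Point subsequent calls at a different model."""
    global _current_model
    _current_model = name

def ask(prompt, model=None):
    # `model=None` (rather than a default baked into the signature) is what
    # lets set_model() take effect: defaults are evaluated once, at import.
    model = model or _current_model
    return f"[{model}] {prompt}"

assert ask("hi") == "[qwen2.5:7b] hi"
set_model("llama3")
assert ask("hi") == "[llama3] hi"              # setter now respected
assert ask("hi", model="phi3") == "[phi3] hi"  # explicit arg still wins
```

This is exactly why the old `model=DEFAULT_OLLAMA_MODEL` signature could never pick up a changed setting.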

View File

@@ -1,7 +1,6 @@
 """Generate Anki .apkg decks from vocabulary data."""
 import genanki
-import random
 
 # Stable model/deck IDs (generated once, kept constant)
 _MODEL_ID = 1607392319

View File

@@ -7,6 +7,7 @@ import time
 import gradio as gr
 
 import ai
+import db
 from modules import vocab, dashboard, essay, tutor, idioms
 from modules.essay import GCSE_THEMES
@@ -214,6 +215,15 @@ def do_anki_export(cats_selected):
     return path
 
+def update_ollama_model(model):
+    ai.set_ollama_model(model)
+
+def update_whisper_size(size):
+    from stt import set_whisper_size
+    set_whisper_size(size)
+
 def reset_progress():
     conn = db.get_connection()
     conn.execute("DELETE FROM word_progress")
@@ -491,6 +501,10 @@ with gr.Blocks(title="Persian Language Tutor") as app:
     export_btn.click(fn=do_anki_export, inputs=[export_cats], outputs=[export_file])
 
+    # Wire model settings
+    ollama_model.change(fn=update_ollama_model, inputs=[ollama_model])
+    whisper_size.change(fn=update_whisper_size, inputs=[whisper_size])
+
     gr.Markdown("### Reset")
     reset_btn = gr.Button("Reset All Progress", variant="stop")
     reset_status = gr.Markdown("")

View File

@@ -2,7 +2,7 @@
 import json
 import sqlite3
-from datetime import datetime, timezone
+from datetime import datetime, timedelta, timezone
 from pathlib import Path
 
 import fsrs
@@ -148,6 +148,13 @@ def get_word_counts(total_vocab_size=0):
     }
 
+def get_all_word_progress():
+    """Return all word progress as a dict of word_id -> progress dict."""
+    conn = get_connection()
+    rows = conn.execute("SELECT * FROM word_progress").fetchall()
+    return {row["word_id"]: dict(row) for row in rows}
+
 def record_quiz_session(category, total_questions, correct, duration_seconds):
     """Log a completed flashcard session."""
     conn = get_connection()
@@ -203,7 +210,7 @@ def get_stats():
     today = datetime.now(timezone.utc).date()
     for i, row in enumerate(days):
         day = datetime.fromisoformat(row["d"]).date() if isinstance(row["d"], str) else row["d"]
-        expected = today - __import__("datetime").timedelta(days=i)
+        expected = today - timedelta(days=i)
         if day == expected:
             streak += 1
         else:
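The get_all_word_progress addition is the classic N+1 fix: one SELECT builds a dict, and per-word database calls become dictionary lookups. A self-contained illustration (toy schema, not the app's full table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows indexable by column name, as in db.py
conn.execute("CREATE TABLE word_progress (word_id TEXT PRIMARY KEY, stability REAL)")
conn.executemany("INSERT INTO word_progress VALUES (?, ?)",
                 [("w1", 12.0), ("w2", 3.5)])

# One query up front instead of one query per word:
all_progress = {row["word_id"]: dict(row)
                for row in conn.execute("SELECT * FROM word_progress")}

assert all_progress["w1"]["stability"] == 12.0
assert all_progress.get("w99") is None  # unseen words simply miss the dict
```

For N words the callers now do one query and N O(1) lookups instead of N separate queries.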

View File

@@ -19,17 +19,17 @@ def get_category_breakdown():
     """Return progress per category as list of dicts."""
     vocab = load_vocab()
     categories = get_categories()
+    all_progress = db.get_all_word_progress()
 
     breakdown = []
     for cat in categories:
         cat_words = [e for e in vocab if e["category"] == cat]
-        cat_ids = {e["id"] for e in cat_words}
         total = len(cat_words)
         seen = 0
         mastered = 0
-        for wid in cat_ids:
-            progress = db.get_word_progress(wid)
+        for e in cat_words:
+            progress = all_progress.get(e["id"])
             if progress:
                 seen += 1
                 if progress["stability"] and progress["stability"] > 10:

View File

@@ -84,8 +84,9 @@ def get_flashcard_batch(count=10, category=None):
     remaining = count - len(due_entries)
     if remaining > 0:
         seen_ids = {e["id"] for e in due_entries}
+        all_progress = db.get_all_word_progress()
         # Prefer unseen words
-        unseen = [e for e in pool if e["id"] not in seen_ids and not db.get_word_progress(e["id"])]
+        unseen = [e for e in pool if e["id"] not in seen_ids and e["id"] not in all_progress]
         if len(unseen) >= remaining:
             fill = random.sample(unseen, remaining)
         else:
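With the batch dict in hand, "unseen" becomes a pure membership test, and the surrounding selection logic is plain Python. A small standalone trace with made-up IDs:

```python
import random

random.seed(0)  # deterministic sampling for the example

pool = [{"id": f"w{i}"} for i in range(6)]
seen_ids = {"w0", "w1"}                     # already queued as due cards
all_progress = {"w2": {"stability": 4.2}}   # words with any review history

# Prefer words never studied: not already queued and absent from progress
unseen = [e for e in pool
          if e["id"] not in seen_ids and e["id"] not in all_progress]
assert [e["id"] for e in unseen] == ["w3", "w4", "w5"]

remaining = 2
fill = random.sample(unseen, remaining) if len(unseen) >= remaining else unseen
assert len(fill) == remaining
```

Membership tests against a dict (or set) are O(1), so this also scales better than the per-word queries it replaced.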

View File

@@ -1,13 +1,17 @@
 """Persian speech-to-text wrapper using sttlib."""
 import sys
+from pathlib import Path
 
 import numpy as np
 
-sys.path.insert(0, "/home/ys/family-repo/Code/python/tool-speechtotext")
+# sttlib lives in sibling project tool-speechtotext
+_sttlib_path = str(Path(__file__).resolve().parent.parent / "tool-speechtotext")
+sys.path.insert(0, _sttlib_path)
 from sttlib import load_whisper_model, transcribe, is_hallucination
 
 _model = None
+_whisper_size = "medium"
 
 # Common Whisper hallucinations in Persian/silence
 PERSIAN_HALLUCINATIONS = [
@@ -18,11 +22,19 @@ PERSIAN_HALLUCINATIONS = [
 ]
 
-def get_model(size="medium"):
+def set_whisper_size(size):
+    """Change the Whisper model size. Reloads on next transcription."""
+    global _whisper_size, _model
+    if size != _whisper_size:
+        _whisper_size = size
+        _model = None
+
+def get_model():
     """Load Whisper model (cached singleton)."""
     global _model
     if _model is None:
-        _model = load_whisper_model(size)
+        _model = load_whisper_model(_whisper_size)
     return _model