# Edge Detection
Sovara detects dataflow between LLM calls using content-based matching. This document explains how edges are created in the dataflow graph.
## Overview
The edge detection system answers one question: "Which LLM outputs were used as input to this LLM call?"
When an LLM produces output, we store all text strings from the response. When a new LLM call is made, we check if any previously stored strings appear in the input. If so, we create an edge between those nodes.
## Core Architecture

### Content Registry

The content registry lives in `string_matching.py` and stores tokenized output strings for each node:
```python
# Maps session_id -> {node_id -> [[word_lists]]}
_session_outputs: Dict[str, Dict[str, List[List[str]]]] = {}
```
Key properties:

1. Session-scoped: Outputs are only matched within the same session
2. In-memory: No persistence is needed (LLM outputs are already cached in the database)
3. Tokenized: Text is split into words for efficient longest-match computation
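A minimal sketch of how this registry might be populated and queried. The `_register` and `_candidates` helpers are hypothetical, introduced only to illustrate the structure; just the `_session_outputs` shape comes from the code above:

```python
from typing import Dict, List

_session_outputs: Dict[str, Dict[str, List[List[str]]]] = {}

def _register(session_id: str, node_id: str, words: List[str]) -> None:
    # A node can emit several output strings; each is stored as its own word list.
    _session_outputs.setdefault(session_id, {}).setdefault(node_id, []).append(words)

def _candidates(session_id: str) -> Dict[str, List[List[str]]]:
    # Lookups never cross session boundaries.
    return _session_outputs.get(session_id, {})

_register("sess-1", "node_1", ["42"])
assert "node_1" in _candidates("sess-1")
assert _candidates("sess-2") == {}  # session-scoped: nothing leaks across sessions
```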
### String Matching Module

The matching logic is in `src/sovara/runner/string_matching.py`:
```python
find_source_nodes(session_id, input_dict, api_type) -> List[str]
# Returns node_ids whose outputs appear in this input

store_output_strings(session_id, node_id, output_obj, api_type) -> None
# Stores output strings for future matching
```
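A hedged usage sketch of the two functions. The `"openai"` value for `api_type` and the exact body shapes are assumptions for illustration, not documented behavior:

```python
from sovara.runner.string_matching import find_source_nodes, store_output_strings

session_id = "sess-1"

# Call 1 returned "42"; store its response strings for future matching.
output_obj = {"choices": [{"message": {"role": "assistant", "content": "42"}}]}
store_output_strings(session_id, "node_1", output_obj, api_type="openai")

# Call 2's request embeds that output, so node_1 should be reported as a source.
input_dict = {"messages": [{"role": "user", "content": "Add 1 to 42"}]}
source_ids = find_source_nodes(session_id, input_dict, api_type="openai")
# -> ["node_1"], provided the match clears the coverage/length thresholds
```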
### Text Extraction

Text is extracted from HTTP request/response bodies in `patching_utils.py`:

- `extract_input_text(input_dict, api_type)` - Extracts all strings from the request body
- `extract_output_text(output_obj, api_type)` - Extracts all strings from the response body
Both functions recursively extract all string values from the JSON, regardless of the API format (OpenAI, Anthropic, etc.).
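A minimal sketch of the recursive extraction described above (`extract_all_strings` is a hypothetical stand-in, not the actual helper):

```python
from typing import Any, List

def extract_all_strings(obj: Any) -> List[str]:
    """Recursively collect every string value from a decoded JSON body."""
    if isinstance(obj, str):
        return [obj]
    if isinstance(obj, dict):
        return [s for v in obj.values() for s in extract_all_strings(v)]
    if isinstance(obj, list):
        return [s for v in obj for s in extract_all_strings(v)]
    return []  # numbers, booleans, and None carry no matchable text

body = {"choices": [{"message": {"content": "42"}, "finish_reason": "stop"}]}
print(extract_all_strings(body))  # ['42', 'stop']
```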
## How It Works

### Example Flow
```python
# LLM call 1
response1 = llm("Output the number 42")  # Returns "42"
# -> Stored: node_1 -> ["42", "assistant", "stop", ...]

# LLM call 2
response2 = llm(f"Add 1 to {response1}")  # Input contains "42"
# -> Input text: "Add 1 to 42..."
# -> Match found: "42" in input
# -> Edge created: node_1 -> node_2
```
### Matching Algorithm

Matching uses a word-level longest contiguous match via `difflib.SequenceMatcher`:
```python
def is_content_match(output_words, input_words):
    match_len = compute_longest_match(output_words, input_words)
    if match_len > 0 and len(output_words) > 0:
        output_coverage = match_len / len(output_words)
        if output_coverage > 0.5 and match_len > MIN_MATCH_WORDS:
            return True
    return False
```
Key features (see the sketch after this list):

- Tokenization: Text is cleaned (HTML stripped, lowercased) and split into words
- Coverage threshold: The match must cover >50% of the output
- Minimum length: The match must exceed `MIN_MATCH_WORDS` (default: 3)
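A minimal sketch of the pieces `is_content_match()` relies on, assuming `compute_longest_match()` wraps `SequenceMatcher.find_longest_match` over word lists; the tokenizer here is a simplified stand-in for the cleaning step:

```python
import re
from difflib import SequenceMatcher
from typing import List

def tokenize(text: str) -> List[str]:
    # Simplified cleaning: strip HTML tags, lowercase, split into words.
    return re.findall(r"\w+", re.sub(r"<[^>]+>", " ", text).lower())

def compute_longest_match(output_words: List[str], input_words: List[str]) -> int:
    # SequenceMatcher accepts any sequences of hashable items, including word lists.
    sm = SequenceMatcher(None, output_words, input_words, autojunk=False)
    return sm.find_longest_match(0, len(output_words), 0, len(input_words)).size

out = tokenize("The answer is <b>forty two</b>.")
inp = tokenize("Given that the answer is forty two, add one.")
print(compute_longest_match(out, inp))  # 5 ("the answer is forty two")
```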
## Integration with Monkey Patches

Each monkey patch (`httpx`, `requests`, MCP, `genai`) calls the string-matching functions:
```python
# In httpx_patch.py
source_node_ids = find_source_nodes(session_id, input_dict, api_type)
store_output_strings(session_id, node_id, output, api_type)

send_graph_node_and_edges(
    node_id=node_id,
    source_node_ids=source_node_ids,  # Edges!
    ...
)
```
## Caching and Reruns

When an LLM call is intercepted:

1. Cache lookup: `DB.get_in_out()` hashes the input
   - Cache hit: Use the cached output
   - Cache miss: Call the LLM and store the result
2. Edge detection: `find_source_nodes()` checks for matches
3. Store output: `store_output_strings()` saves strings for future matching
4. Graph update: `send_graph_node_and_edges()` notifies the server
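A hypothetical sketch of these steps wired together. A plain dict stands in for the database, and the hashing shown here is illustrative; in Sovara the hashing happens inside `DB.get_in_out()`:

```python
import hashlib
import json

from sovara.runner.string_matching import find_source_nodes, store_output_strings

_cache = {}  # stand-in for the database cache

def handle_llm_call(session_id, node_id, input_dict, api_type, call_llm):
    key = hashlib.sha256(json.dumps(input_dict, sort_keys=True).encode()).hexdigest()
    output = _cache.get(key)                      # 1. cache lookup
    if output is None:
        output = call_llm(input_dict)             #    miss: call the LLM...
        _cache[key] = output                      #    ...and store the result
    source_node_ids = find_source_nodes(session_id, input_dict, api_type)  # 2. edges
    store_output_strings(session_id, node_id, output, api_type)            # 3. store
    # 4. send_graph_node_and_edges(node_id=node_id, source_node_ids=source_node_ids, ...)
    return output
```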
Reruns work deterministically because:

- The same `session_id` means cache lookups find previous entries
- The content registry is rebuilt as calls are replayed
- UI edits to inputs/outputs are respected
## Next Steps
- API patching - How LLM APIs are intercepted
- Testing - Running the test suite