# centminmod/sonoma-dusk-sky-alpha-evaluation

Updated ranking with Claude Sonnet 4.5 and newer models at https://github.com/centminmod/claude-sonnet-4.5-evaluation

Sonoma Dusk Alpha and Sonoma Sky Alpha are two newly released stealth LLM models with 2 million token context windows, and Qwen 3 Max was also just released. I wanted to test them on code analysis of my csfa.sh nftables wrapper script and its GitHub workflow action test, against the other LLM models I use. This is for code analysis, not code generation. Code analysis is useful for understanding code bases, writing documentation, troubleshooting code, and planning.

Update:

CSFA (v1.3.1) is a CSF-like wrapper for nftables that provides familiar ConfigServer Security & Firewall commands mapped to modern nftables equivalents. The project uses a single Bash script (csfa.sh) that manages firewall rules through a dedicated inet table called "csfa".
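
To make that mapping concrete, here is a minimal sketch, not taken from csfa.sh itself, of the general pattern described above: a dedicated inet table named "csfa" with its own base chain, so a CSF-style deny command translates into an nftables rule that stays isolated from other rulesets. The priority value and example address are illustrative assumptions.

```bash
# Minimal sketch of the pattern described above (not the actual csfa.sh code).
# Create an isolated "csfa" inet table and an input base chain.
nft add table inet csfa
nft 'add chain inet csfa input { type filter hook input priority 10 ; policy accept ; }'

# CSF-style "deny this IP" mapped to an nftables drop rule inside the csfa table.
nft add rule inet csfa input ip saddr 203.0.113.10 drop

# Inspect only this table's rules, with handles, for later targeted removal.
nft -a list table inet csfa
```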

I have paid subscriptions and accounts with:

  • OpenAI ChatGPT Plus
  • Claude AI Max $100
  • Gemini AI Pro
  • T3 Chat
  • OpenRouter AI
  • KiloCode

I tested 13 LLM models on code analysis and summarization, then used 5 LLM models to rank all 13 models' responses.

The 5 LLM models that performed the response evaluation rankings are:

  • Claude Code Opus 4.1
  • ChatGPT GPT-5 Thinking
  • Gemini 2.5 Pro Web
  • Grok 4 via T3 Chat
  • Sonoma Sky Alpha via KiloCode

The 13 LLM models evaluated by the above 5 models, with usage costs, are:

| Model | Cost | Input Tokens | Output Tokens | Cache Hits |
|---|---|---|---|---|
| OpenAI Codex GPT-5 Medium Thinking | included in subscription | n/a | n/a | n/a |
| OpenAI ChatGPT GPT-5 Thinking | included in subscription | n/a | n/a | n/a |
| Claude Code Opus 4.1 | included in subscription | n/a | n/a | n/a |
| Claude AI Web Opus 4.1 Thinking | included in subscription | n/a | n/a | n/a |
| KiloCode Claude Sonnet 4 | $0.240 | 80,596 | 4,822 | 38,818 |
| Google Gemini 2.5 Pro Web | included in subscription | n/a | n/a | n/a |
| KiloCode Sonoma Dusk Alpha | $0.000 | 66,302 | 3,049 | 32,168 |
| KiloCode Sonoma Sky Alpha | $0.000 | 31,761 | 2,684 | 397 |
| KiloCode MoonshotAI Kimi K2 0905 | $0.020 | 30,763 | 1,484 | 0 |
| KiloCode xAI Grok Code Fast 1 | $0.000 | 30,649 | 1,025 | 576 |
| KiloCode Qwen3 Coder | $0.010 | 34,309 | 1,422 | 0 |
| OpenRouter Qwen3 Max | $0.039 | 17,635 | 2,981 | 0 |
| KiloCode Mistral Medium 3.1 | $0.040 | 76,355 | 2,460 | 0 |

Note:

You can easily replicate these tests by asking LLM models to summarize/analyse your own code bases or scripts and saving their responses to markdown files, then feeding those responses into LLM models for evaluation; see the sketch below.
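
For example, the collection step might look like the following minimal sketch, assuming each model's analysis has been saved as a markdown file under a hypothetical responses/ directory; the prompt wording and file layout are illustrative, not the exact workflow used here.

```bash
#!/usr/bin/env bash
# Sketch: bundle the saved model responses into one evaluation prompt that can be
# pasted into (or piped to) whichever LLM performs the ranking.
set -euo pipefail

{
  echo "Rank the following code-analysis responses for accuracy, thoroughness and presentation."
  echo "Score each response 0-100 and output a ranked table."
  echo
  for f in responses/*.md; do
    echo "## Response: $(basename "$f" .md)"
    cat "$f"
    echo
  done
} > evaluation-prompt.md

wc -c evaluation-prompt.md   # sanity check against the evaluator's context window
```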

Overall Findings

Average Overall Scores Across Evaluators (ranked)

Claude Code Opus 4.1 would have ranked 1st if not for ChatGPT GPT-5 Thinking's probable bias toward a lower score; it would have swapped places with OpenRouter Qwen3 Max, which would have moved down to 2nd place.

| Model | ChatGPT GPT-5 Thinking | Claude Code Opus 4.1 | Gemini 2.5 Pro Web | Grok 4 via T3 Chat | Sonoma Sky Alpha via KiloCode | Avg Score | Rank |
|---|---|---|---|---|---|---|---|
| OpenRouter Qwen3 Max | 93.0 | 96.0 | 98.0 | 97.0 | 93.0 | 95.4 | 1 |
| Claude Code Opus 4.1 | 84.0 | 98.0 | 97.0 | 98.0 | 97.0 | 94.8 | 2 |
| KiloCode xAI Grok Code Fast 1 | 87.0 | 95.0 | 99.0 | 95.0 | 95.0 | 94.2 | 3 |
| Claude AI Web Opus 4.1 Thinking | 87.0 | 88.0 | 90.0 | 94.0 | 90.0 | 89.8 | 4 |
| KiloCode Claude Sonnet 4 | 82.0 | 94.0 | 88.0 | 93.0 | 92.0 | 89.8 | 5 |
| Google Gemini 2.5 Pro Web | 89.0 | 85.0 | 94.0 | 91.0 | 88.0 | 89.4 | 6 |
| OpenAI ChatGPT GPT-5 Thinking | 94.0 | 82.0 | 92.0 | 86.0 | 86.0 | 88.0 | 7 |
| OpenAI Codex GPT-5 Medium Thinking | 92.0 | 75.0 | 75.0 | 89.0 | 81.0 | 82.4 | 8 |
| KiloCode Sonoma Sky Alpha | 83.0 | 62.0 | 78.0 | 87.0 | 85.0 | 79.0 | 9 |
| KiloCode Mistral Medium 3.1 | 80.0 | 78.0 | 80.0 | 78.0 | 68.0 | 76.8 | 10 |
| KiloCode Sonoma Dusk Alpha | 85.0 | 72.0 | 60.0 | 84.0 | 83.0 | 76.8 | 11 |
| KiloCode Qwen3 Coder | 82.0 | 65.0 | 65.0 | 73.0 | 77.0 | 72.4 | 12 |
| KiloCode MoonshotAI Kimi K2 0905 | 82.0 | 68.0 | 55.0 | 68.0 | 71.0 | 68.8 | 13 |

avg score bar chart

heatmap chart

Which Of The 5 AI Evaluators Most Accurately Ranked LLM Models?

Looking at the average score bar chart above, I was also curious which of the 5 AI evaluators most accurately ranked the 13 LLM models.

To find the most accurate evaluator, I compared each of the five evaluators' individual ranking lists to the final ranking list derived from the average scores. I calculated the sum of the absolute differences in rank for each model across all 13 positions. The evaluator with the lowest total "difference score" is the one whose rankings are most aligned with the overall consensus.

For example, if the average rank for a model is #1 and an evaluator ranks it #2, the difference is 1. If another model is ranked #7 on average but #9 by the evaluator, the difference is 2. These differences are summed up for all 13 models.
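
As a sketch, this total difference score can be computed with standard tools; the two-column TSV files assumed below (model name and rank) are hypothetical, one for the consensus ranking and one per evaluator.

```bash
#!/usr/bin/env bash
# Sketch: sum of absolute rank differences between each evaluator's ranking and
# the consensus ranking derived from the average scores (lower = closer match).
set -euo pipefail

consensus="consensus-ranks.tsv"   # columns: model<TAB>rank

for evaluator in grok4.tsv opus41.tsv sonoma-sky.tsv gemini25.tsv gpt5.tsv; do
  # Join each evaluator's ranks to the consensus ranks on model name,
  # then sum |consensus_rank - evaluator_rank| across all 13 models.
  score=$(join -t $'\t' <(sort "$consensus") <(sort "$evaluator") \
    | awk -F'\t' '{ d = $2 - $3; total += (d < 0 ? -d : d) } END { print total }')
  printf '%s\t%s\n' "$evaluator" "$score"
done | sort -t $'\t' -k2,2n   # lowest total = most aligned with the consensus
```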

Here are the total difference scores for each of the five evaluators. A lower score indicates a closer match to the final average ranking, making Grok 4 via T3 Chat the most accurate evaluator for rankings. OpenAI ChatGPT GPT-5 Thinking was the biggest outlier by a significant margin.

| Evaluator | Total Difference Score |
|---|---|
| Grok 4 via T3 Chat | 8 |
| Claude Code Opus 4.1 | 14 |
| Sonoma Sky Alpha via KiloCode | 16 |
| Google Gemini 2.5 Pro Web | 20 |
| OpenAI ChatGPT GPT-5 Thinking | 48 |

Consensus Top Performers

The evaluation reveals a remarkably strong consensus among all 5 evaluating models regarding the top-tier performers:

  1. Claude Code Opus 4.1 - Achieved the highest or near-highest scores from 4 of the 5 evaluators (98, 97, 98, 97), with only ChatGPT GPT-5 Thinking penalizing it for "unverifiable specifics"
  2. OpenRouter Qwen3 Max - Consistently ranked 2nd or 3rd across all evaluators, praised for exceptional formatting and comprehensive coverage
  3. KiloCode xAI Grok Code Fast 1 - Strong 3rd place consensus, particularly lauded by Gemini for its Mermaid diagrams and technical depth

Clear Performance Tiers

The rankings reveal distinct performance clusters:

Elite Tier (94-95 average)

  • Claude Code Opus 4.1, OpenRouter Qwen3 Max, KiloCode xAI Grok Code Fast 1
  • Characterized by: Professional formatting, comprehensive technical depth, visual aids (diagrams/tables), specific code references

High Performer Tier (89-90 average)

  • KiloCode Claude Sonnet 4, Claude AI Web Opus 4.1 Thinking, Google Gemini 2.5 Pro Web
  • Characterized by: Good structure and coverage but less technical depth or minor accuracy issues

Mid-Range Tier (77-88 average)

  • OpenAI ChatGPT GPT-5 Thinking, OpenAI Codex GPT-5 Medium Thinking, KiloCode Sonoma Sky/Dusk Alpha
  • Characterized by: Solid basics but lacking implementation details or comprehensive analysis

Lower Tier (69-77 average)

  • KiloCode Mistral Medium 3.1, KiloCode Qwen3 Coder, KiloCode MoonshotAI Kimi K2 0905
  • Characterized by: Brief summaries, surface-level coverage, missing critical technical details

Key Evaluation Criteria Trends

All evaluators consistently valued:

  1. Technical Accuracy - Specific line references, correct implementation details, avoiding speculation
  2. Thoroughness - Coverage of v1.3.1 fixes, CI pipeline details, temporary rule mechanisms
  3. Presentation Quality - Use of tables, diagrams, structured formatting, clear organization
  4. Practical Value - Inclusion of examples, limitations, recommendations
  5. Depth vs. Brevity Balance - Models too brief (Kimi K2, Qwen3 Coder) consistently ranked lower

Notable Divergences

The most significant disagreement occurred with ChatGPT GPT-5 Thinking's evaluation, which:

  • Ranked itself #1 (94) while others placed it mid-tier (82-92 range)
  • Heavily penalized Claude Code Opus 4.1 for "unverifiable specifics" (ranked 12th), while three of the other four evaluators ranked it 1st (Gemini placed it 3rd)
  • Showed potential self-bias by favoring OpenAI models (ranked both OpenAI models in top 3)

Evaluation Methodology Insights

Different evaluators emphasized different aspects:

  • Gemini 2.5 Pro valued visual elements heavily (99 score for Grok's Mermaid diagrams)
  • Claude Code Opus 4.1 provided the most granular scoring with detailed strengths/weaknesses
  • Grok 4 balanced accuracy and thoroughness equally in scoring
  • Sonoma Sky Alpha used a three-factor system (accuracy, thoroughness, overall)

Critical Success Factors

Analysis reveals the following factors that separated the top performers from the rest:

  1. Structured Documentation - Winners used executive summaries, clear sections, tables
  2. Visual Communication - Diagrams and formatted tables significantly boosted scores
  3. Specific Technical Details - References to exact mechanisms (systemd-run, flock) were valued highly; see the sketch after this list
  4. Balanced Coverage - Both script functionality AND CI testing details needed
  5. Actionable Insights - Recommendations, limitations, and use cases added value
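
The sketch below illustrates, under assumed names and values (lock path, TTL, example IP), the kind of systemd-run plus flock pattern the evaluators rewarded models for calling out; it is not the actual csfa.sh implementation.

```bash
#!/usr/bin/env bash
# Sketch of the mechanisms referenced above: flock serialises concurrent rule
# changes, and a transient systemd timer removes a temporary rule when its TTL ends.
set -euo pipefail

TTL=3600                 # assumed TTL in seconds
IP=203.0.113.10          # example address
exec 9>/run/csfa.lock    # assumed lock file path
flock 9                  # block until no other invocation is modifying the table

# Add the temporary deny rule and capture its handle via nft's --echo/--handle output.
handle=$(nft -e -a add rule inet csfa input ip saddr "$IP" drop | awk '{ print $NF }')

# One-shot transient unit deletes exactly that rule after TTL seconds.
systemd-run --on-active="${TTL}s" --unit="csfa-tmp-${IP//./-}" \
  nft delete rule inet csfa input handle "$handle"
```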

Surprising Findings

  1. Self-Evaluation Bias: Models evaluating themselves showed varying degrees of objectivity, with Claude Code Opus 4.1 showing high self-awareness (ranked itself appropriately) while ChatGPT showed potential positive bias
  2. Brevity Penalty: All extremely concise responses (under 500 words) were consistently ranked in bottom third, suggesting evaluators prefer comprehensive analysis
  3. Visual Impact: The inclusion of even one diagram (Mermaid) could boost rankings by 5-10 points
  4. Version-Specific Focus: Models that explicitly addressed v1.3.1 OUTPUT chain cleanup scored notably higher

Overall Conclusion

The evaluation consensus strongly indicates that comprehensive, technically accurate, and well-structured responses with visual aids are universally valued across different AI evaluators. The top three models (Claude Code Opus 4.1, OpenRouter Qwen3 Max, and KiloCode xAI Grok Code Fast 1) achieved their positions through a combination of depth, accuracy, and presentation quality rather than any single factor. The consistency of these rankings across diverse evaluators suggests these qualities represent objective markers of response quality for technical code analysis tasks.

The Rankings


Claude Code Opus 4.1 Evaluations

| Rank | Model Name | Score | Accuracy | Thoroughness | Key Strengths | Key Weaknesses |
|---|---|---|---|---|---|---|
| 1 | Claude Code Opus 4.1 | 98/100 | Excellent | Excellent | Professional report format, executive summary, detailed architecture analysis, comprehensive command tables, specific v1.3.1 fixes, performance characteristics | Could include more code examples |
| 2 | OpenRouter Qwen3 Max | 96/100 | Excellent | Excellent | Exceptional formatting, comprehensive feature tables, includes recommendations and limitations, effective use of visual elements | Slightly verbose in places |
| 3 | KiloCode xAI Grok Code Fast 1 | 95/100 | Excellent | Excellent | Mermaid diagrams, detailed command structure, usage examples, identifies limitations clearly | Could be more concise |
| 4 | KiloCode Claude Sonnet 4 | 94/100 | Excellent | Excellent | Line-by-line references, detailed technical architecture, CI testing innovations section | Less accessible for non-technical readers |
| 5 | Claude AI Web Opus 4.1 Thinking | 88/100 | Very Good | Very Good | Well-structured, clear feature categorization, good v1.3.1 coverage | Less technical depth than top tier |
| 6 | Google Gemini 2.5 Pro Web | 85/100 | Very Good | Very Good | Excellent explanations for broader audience, well-formatted sections | Could include more technical details |
| 7 | OpenAI ChatGPT GPT-5 Thinking | 82/100 | Good | Good | Clear structure and explanations, good coverage | Minor inaccuracies |
| 8 | KiloCode Mistral Medium 3.1 | 78/100 | Good | Good | Good relationship explanation, clear example workflows | Lacks technical depth |
| 9 | OpenAI Codex GPT-5 Medium Thinking | 75/100 | Good | Good | Concise bullet-point format, good coverage | Relatively brief, lacks detailed analysis |
| 10 | KiloCode Sonoma Dusk Alpha | 72/100 | Good | Good | Comprehensive feature list, good CI coverage | Less detailed analysis |
| 11 | KiloCode MoonshotAI Kimi K2 0905 | 68/100 | Good | Moderate | Brief but accurate, covers key points | Lacks depth and examples |
| 12 | KiloCode Qwen3 Coder | 65/100 | Good | Moderate | Concise summary, covers main features | Minimal technical detail |
| 13 | KiloCode Sonoma Sky Alpha | 62/100 | Good | Moderate | Very concise, covers basics | Least comprehensive analysis |

ChatGPT GPT-5 Thinking Evaluations

| # | Model | Accuracy | Thoroughness | Notable strengths | Notable misses / issues | Overall |
|---|---|---|---|---|---|---|
| 1 | OpenAI ChatGPT GPT-5 Thinking | 95 | 92 | Clear mapping of features: -method, jq auto/fallback, base chains @ priority 10, temp rules w/ timers, status/check/watch; CI matrix + systemd container captured well. | Minor omissions (artifact upload detail). | 94 |
| 2 | OpenRouter Qwen3 Max | 92 | 94 | Very complete walkthrough of commands, temp directions/ports, handle tracking + timer cleanup; CI steps incl. traffic watch and artifact archiving. | A few generalities, but no hard inaccuracies. | 93 |
| 3 | OpenAI Codex GPT-5 Medium | 93 | 90 | Concise and correct: dedicated table/priority, ip/port ops, temp TTL with json tracked handles + timers, status/check/test; CI's OUTPUT-cleanup focus noted. | Skipped artifact step specifics. | 92 |
| 4 | Google Gemini 2.5 Pro Web | 90 | 88 | Solid coverage of CLI, temp rules, directions/ports, systemd timers; got CI matrix + AlmaLinux/systemd right. | Fewer CI details (e.g., artifacts). | 89 |
| 5 | KiloCode xAI Grok Code Fast 1 | 86 | 88 | Broad, structured rundown (ports/ranges/UDP, directions, timers, CI phases). | Some speculation/marketing tone; a bit repetitive. | 87 |
| 6 | Claude AI Web Opus 4.1 Thinking | 88 | 86 | Accurate high-level script + CI overview; calls out timer-based expiry across chains. | Less depth on watch/status/check and artifacts. | 87 |
| 7 | KiloCode Sonoma Dusk Alpha | 84 | 85 | Balanced summary of script + CI; touches v1.3.1 OUTPUT cleanup, directions, ports. | Light on specifics (handles, exact flags). | 85 |
| 8 | KiloCode Sonoma Sky Alpha | 82 | 84 | Good breadth; includes container + systemd details and CI flow. | Adds a run-time estimate (speculative). | 83 |
| 9 | KiloCode MoonshotAI Kimi K2 0905 | 85 | 78 | Correct essentials (ports/directions, timers, tracking). | Briefer CI coverage; few implementation details. | 82 |
| 10 | KiloCode Qwen3 Coder | 84 | 80 | Reliable overview of key capabilities + CI matrix/systemd. | Skims advanced diagnostics (watch/status/check). | 82 |
| 11 | KiloCode Mistral Medium 3.1 | 83 | 76 | Accurate but compact; hits main commands + CI phases. | Leaves out OUTPUT-cleanup emphasis + artifacts. | 80 |
| 12 | Claude Code Opus 4.1 | 78 | 95 | Extremely thorough; dives into design/flows, multi-port sets, directionality, timers. | Several unverifiable specifics: exact LOC counts, a "~2x faster" JSON claim, and a named "security fix" not evident in code, hence an accuracy penalty. | 84 |
| 13 | KiloCode Claude Sonnet 4 | 75 | 90 | Wide coverage with lots of feature bullets (ports/directions, timers, IPv6, checks). | Cites fake line numbers and overly precise anchors (e.g., csfa.sh:1015) that don't match the file, hence an accuracy hit. | 82 |

Gemini 2.5 Pro Web Evaluations

| Model | Accuracy & Thoroughness Analysis | Rating (0-100) |
|---|---|---|
| KiloCode xAI Grok Code Fast 1 | Exceptional. Provided the deepest technical breakdown, correctly identifying not just the features but the implementation details (e.g., systemd-run for timers, flock for locking). Its unique inclusion of Mermaid diagrams to visualize the script's logic and CI workflow demonstrated a superior level of comprehension. | 99 |
| OpenRouter Qwen3 Max | Exceptional. Presented a highly polished, readable, and comprehensive report. It excelled at structuring the information logically with clear headings and summaries. It correctly identified and emphasized the criticality of the v1.3.1 OUTPUT chain cleanup test, and its "Strengths" and "Limitations" sections added significant analytical value. | 98 |
| Claude Code Opus 4.1 | Exceptional. Formatted its output as a professional executive summary and technical report. It was extremely thorough, covering architecture, design patterns, and specific v1.3.1 bug fixes. The formal structure and level of detail were unparalleled, making it feel like a human-written document. | 97 |
| Google Gemini 2.5 Pro Web | Excellent. Provided a very clear and well-written explanation of how the mechanisms work, particularly the "clever" use of systemd-run for temporary rules. It correctly highlighted the importance of the systemd-enabled container in the CI for this feature to be testable. The analysis was both deep and easy to understand. | 94 |
| OpenAI ChatGPT GPT-5 | Excellent. A very thorough and accurate analysis. It was one of the few to explicitly mention the flock command for preventing race conditions. The "Net effect" summary was a great touch for synthesizing the script's overall purpose. It correctly understood all key mechanisms in both the script and the CI file. | 92 |
| Claude AI Web Opus 4.1 | Excellent. A strong, detailed analysis that correctly identified the most critical test in the CI pipeline (v1.3.1 OUTPUT chain cleanup). It clearly explained the core capabilities and how the CI workflow validates them, showing a solid understanding of the project's technical depth. | 90 |
| KiloCode Claude Sonnet 4 | Very Good. A well-structured and detailed analysis that used tables and code blocks effectively. Its unique feature of linking to specific line numbers in the script was a nice touch, although not explicitly requested. It provided a very granular and accurate breakdown of the CI test coverage. | 88 |
| KiloCode Mistral Medium 3.1 | Good. A solid, high-level summary that was accurate and well-organized. It correctly identified the relationship between the script and the CI file and mentioned the key technologies (systemd, jq), but lacked the implementation details and depth of the higher-rated models. | 80 |
| KiloCode Sonoma Sky Alpha | Good. A competent and accurate summary. It successfully identified the main features and the purpose of the matrix strategy in the CI workflow. However, it remained at a surface level and did not delve into the more complex mechanisms like the systemd timer callbacks or file locking. | 78 |
| OpenAI Codex GPT-5 Medium | Good. Very concise and to the point. It served as an excellent quick reference, accurately listing the key features. It correctly noted the v1.3.1 focus on OUTPUT chain cleanup but didn't explain the "why" or "how" with the same clarity as top-tier models. | 75 |
| KiloCode Qwen3 Coder | Fair. An accurate but very high-level summary. It correctly listed the main features and CI test areas but was too brief to be considered a thorough analysis. It provided a good "at a glance" overview. | 65 |
| KiloCode Sonoma Dusk Alpha | Fair. This summary was too brief. While the information was accurate, it lacked sufficient detail to be truly useful. It identified the main components but missed almost all of the nuance regarding the advanced features and the complexity of the CI tests. | 60 |
| KiloCode MoonshotAI Kimi K2 0905 | Needs Improvement. Extremely concise to the point of being a TL;DR. It impressively managed to mention the "critical 35-second TTL test," showing some depth, but the overall summary was too sparse. It sacrificed thoroughness for brevity. | 55 |

Grok 4 via T3 Chat Evaluations

| Rank | Model | Accuracy (0-100) | Thoroughness (0-100) | Overall Rating (0-100) | Key Strengths | Key Weaknesses |
|---|---|---|---|---|---|---|
| 1 | Claude Code Opus 4.1 | 98 | 98 | 98 | Highly detailed with tables, breakdowns, and v1.3.1 specifics; accurate code references and CI coverage. | Slightly verbose in sections. |
| 2 | OpenRouter Qwen3 Max | 97 | 97 | 97 | Comprehensive with overviews, tables, examples, and recommendations; excellent CI breakdown. | Minor repetition in summaries. |
| 3 | KiloCode xAI Grok Code Fast 1 | 95 | 95 | 95 | Accurate with Mermaid diagrams, breakdowns, and examples; strong on CI flow. | Misses some internal helpers (e.g., flock locking). |
| 4 | Claude AI Web Opus 4.1 Thinking | 94 | 93 | 94 | Concise yet thorough; good focus on v1.3.x features and CI innovations. | Less emphasis on code internals like parsing logic. |
| 5 | KiloCode Claude Sonnet 4 | 93 | 92 | 93 | Detailed line references and breakdowns; strong on CI's advanced tests. | Some sections feel list-heavy without synthesis. |
| 6 | Google Gemini 2.5 Pro Web | 92 | 90 | 91 | Clear, structured summary with examples; good balance of features. | Omits CI's matrix and artifact details. |
| 7 | OpenAI Codex GPT-5 Medium Thinking | 90 | 88 | 89 | Solid high-level summary; covers modes and CI phases well. | Lacks depth on v1.3.1 fixes and code helpers. |
| 8 | KiloCode Sonoma Sky Alpha | 88 | 85 | 87 | Accurate summaries with examples; covers both files evenly. | Superficial on CI's temp rule tests; no diagrams. |
| 9 | OpenAI ChatGPT GPT-5 Thinking | 87 | 84 | 86 | Good purpose explanation and net effect; mentions README potential. | High-level; misses specifics like port parsing or CI sleeps. |
| 10 | KiloCode Sonoma Dusk Alpha | 85 | 82 | 84 | Concise feature lists; covers commands well. | Brief; ignores CI's debugging aspects like journalctl. |
| 11 | KiloCode Mistral Medium 3.1 | 80 | 75 | 78 | Basic overview with example; covers relationship. | Very short; inaccurate on some features (e.g., no mention of directions). |
| 12 | KiloCode Qwen3 Coder | 75 | 70 | 73 | Brief bullet points; covers basics. | Inaccurate (e.g., claims IPv6 in CI but misses details); superficial. |
| 13 | KiloCode MoonshotAI Kimi K2 0905 | 70 | 65 | 68 | Ultra-brief; captures essence. | Lacks depth; minor inaccuracies (e.g., oversimplifies temp rules). |

Sonoma Sky Alpha via KiloCode Evaluations

| Rank | Model Name | Accuracy Score (0-100) | Thoroughness Score (0-100) | Overall Score (0-100) | Key Strengths | Key Weaknesses |
|---|---|---|---|---|---|---|
| 1 | Claude Code Opus 4.1 | 98 | 95 | 97 | Detailed architecture, tables/diagrams, v1.3.1 fixes, full CI phases; code snippets. | Slightly verbose; minor repetition in strengths. |
| 2 | KiloCode xAI Grok Code Fast 1 | 97 | 92 | 95 | In-depth internals (handles, timers), Mermaid diagrams, CI flow; accurate command examples. | Less emphasis on limitations; no explicit v1.3.1 bug list. |
| 3 | OpenRouter Qwen3 Max | 95 | 90 | 93 | Structured sections, examples, CI critical tests; holistic view with recommendations. | Overlooks some CI artifacts details; assumes more than states. |
| 4 | KiloCode Claude Sonnet 4 | 96 | 88 | 92 | Strong CI coverage (timing tests, multi-chain), feature tables; code line refs. | Script section lighter on monitoring commands. |
| 5 | Claude AI Web Opus 4.1 Thinking | 94 | 85 | 90 | Good temp rules depth, CI innovations; clear summary. | Misses port proto parsing details; briefer on IPv6. |
| 6 | Google Gemini 2.5 Pro Web | 93 | 82 | 88 | Balanced breakdown, temp rules mechanism; CI strategy focus. | Generic on parsing fallback; no visuals or edge cases like UDP. |
| 7 | OpenAI ChatGPT GPT-5 Thinking | 92 | 80 | 86 | Clean feature lists, CI phases; practical examples. | Skips handle management; undercovers v1.3.1 cleanup test. |
| 8 | KiloCode Sonoma Sky Alpha | 91 | 78 | 85 | Concise summaries, CI timeline; method-specific artifacts. | Brief on internals (e.g., flock); no tables/diagrams. |
| 9 | KiloCode Sonoma Dusk Alpha | 90 | 75 | 83 | Quick overview, test coverage; CI characteristics. | Surface-level; omits direction/port combos in detail. |
| 10 | OpenAI Codex GPT-5 Medium Thinking | 89 | 72 | 81 | Solid basics, v1.3.1 focus; CI functional tests. | Less structured; misses geo-IP fallback logic. |
| 11 | KiloCode Qwen3 Coder | 88 | 65 | 77 | Core/advanced features split; CI environment. | Too concise; skips monitoring/validation depth. |
| 12 | KiloCode MoonshotAI Kimi K2 0905 | 87 | 55 | 71 | High-level features, CI validation; brief. | Minimal coverage; no specifics on timers or CI steps. |
| 13 | KiloCode Mistral Medium 3.1 | 85 | 50 | 68 | Basic overview, relationship explanation; example. | Shallow; ignores enhanced features like directions. |