Comparison · with receipts
We ran GLM and Claude Opus 4.8 through 9 coding tasks. The frontier model was the cheapest.
Everyone quotes price-per-token. So we ran 9 verifiable coding tasks through Opus 4.8, GLM-4.6, and GLM-5.2 — on cerver — and tested every answer by actually running the code. The result is the opposite of what the sticker price predicts.
"This model is 10× cheaper per token" is the most repeated — and most misleading — line in AI right now. Per-token price is an input. What you actually pay is tokens × price × number-of-tries-to-get-it-right. The only way to know that number is to run real work and measure it. So we did.
How we ran it
Nine coding tasks, each with a definite correct answer we could check automatically — fix a buggy binary search, merge intervals, write an IPv4 regex, fix a Go data race, add N business days, a non-mutating deep merge. We sent the same prompt to all three models through cerver, saved each answer plus its cost and latency, then extracted the code and ran it against a fixed test suite (Python, SQLite, Go). A task passes only if every case passes. The judge is code execution — not opinion — so it doesn't matter that one contestant happens to be the model writing this.
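The harness itself isn't reproduced in this post. As a minimal sketch of the extract-then-run step described above, assuming each answer arrives as Markdown containing one fenced block and the task is a Python function, something like the following would do. The names `extract_code`, `grade`, and `CASES` are hypothetical, chosen for illustration only.

```python
import re

# Fence marker, built indirectly so this example can itself sit inside a fence.
TICKS = "`" * 3
FENCE_RE = re.compile(TICKS + r"[a-zA-Z]*\n(.*?)" + TICKS, re.DOTALL)


def extract_code(answer):
    """Return the body of the first fenced block, or None if there is none."""
    match = FENCE_RE.search(answer)
    return match.group(1) if match else None


def grade(answer, func_name, cases):
    """True only if every case passes; empty or crashing answers fail."""
    code = extract_code(answer)
    if code is None:
        return False
    namespace = {}
    try:
        # Assumption: a real harness would run this in a sandboxed subprocess
        # with a timeout, since model output is untrusted and may loop forever.
        exec(code, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


# Illustrative cases: (arguments, expected result)
CASES = [(([1, 3, 5, 7], 7), 3), (([1, 3, 5, 7], 4), -1), (([], 1), -1)]
good = (
    TICKS
    + "python\ndef bsearch(arr, target):\n"
    + "    return arr.index(target) if target in arr else -1\n"
    + TICKS
)
print(grade(good, "bsearch", CASES))  # True
print(grade("", "bsearch", CASES))    # False: no code came back
```

The all-or-nothing `all(...)` mirrors the rule that a task passes only if every case passes, and an answer with no code block is graded as a failure rather than skipped.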
The frontier model won on every axis
Opus 4.8 was the most correct (9/9), the cheapest in total (under half the cost of either GLM model), and 6–15× faster. On individual tasks a GLM model was sometimes cheaper (GLM-5.2 on the binary search, for example), but not across the full set. How does the "expensive" model end up cheapest? The Z.ai models do heavy hidden reasoning, so they emit 10–20× more tokens per task. Cheap tokens × a huge pile of them ≥ the frontier price. And GLM-5.2 twice looped all the way to its token cap and returned nothing at all, while still billing about $0.018 each time for the privilege.
| Task | Opus 4.8 | GLM-4.6 | GLM-5.2 |
|---|---|---|---|
| Fix a buggy binary search | PASS · $0.0024 · 3s | PASS · $0.0024 · 16s | PASS · $0.0008 · 8s |
| Merge overlapping intervals | PASS · $0.0026 · 5s | PASS · $0.0034 · 26s | PASS · $0.0009 · 11s |
| IPv4-address regex | PASS · $0.0022 · 3s | PASS · $0.0058 · 47s | PASS · $0.0040 · 40s |
| 2nd-highest salary (SQL) | PASS · $0.0012 · 2s | PASS · $0.0033 · 32s | PASS · $0.0025 · 22s |
| Fix a Go data race | PASS · $0.0030 · 3s | PASS · $0.0016 · 14s | PASS · $0.0012 · 12s |
| Median of two sorted lists | PASS · $0.0040 · 4s | PASS · $0.0157 · 141s | PASS · $0.0029 · 25s |
| Add N business days | PASS · $0.0028 · 4s | PASS · $0.0043 · 34s | FAIL · $0.0177 · 116s |
| Quoted-string regex (escapes) | PASS · $0.0010 · 2s | PASS · $0.0024 · 20s | PASS · $0.0018 · 19s |
| Non-mutating deep dict merge | PASS · $0.0020 · 3s | PASS · $0.0132 · 116s | FAIL · $0.0176 · 124s |
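To check the headline numbers yourself, the per-task figures in the table can be summed directly. This is a small sketch that copies the table's values by hand; the `RESULTS` layout and the `summarize` helper are illustrative names, not part of cerver.

```python
# (cost in USD, latency in seconds, passed) for each of the nine tasks,
# copied from the table above in row order.
RESULTS = {
    "Opus 4.8": [(0.0024, 3, True), (0.0026, 5, True), (0.0022, 3, True),
                 (0.0012, 2, True), (0.0030, 3, True), (0.0040, 4, True),
                 (0.0028, 4, True), (0.0010, 2, True), (0.0020, 3, True)],
    "GLM-4.6": [(0.0024, 16, True), (0.0034, 26, True), (0.0058, 47, True),
                (0.0033, 32, True), (0.0016, 14, True), (0.0157, 141, True),
                (0.0043, 34, True), (0.0024, 20, True), (0.0132, 116, True)],
    "GLM-5.2": [(0.0008, 8, True), (0.0009, 11, True), (0.0040, 40, True),
                (0.0025, 22, True), (0.0012, 12, True), (0.0029, 25, True),
                (0.0177, 116, False), (0.0018, 19, True), (0.0176, 124, False)],
}


def summarize(rows):
    """Aggregate one model's rows into pass count, total cost, and total time."""
    cost = sum(c for c, _, _ in rows)
    passed = sum(1 for _, _, ok in rows if ok)
    return {
        "passed": passed,
        "total_cost": round(cost, 4),
        "total_seconds": sum(s for _, s, _ in rows),
        "cost_per_pass": round(cost / passed, 4),
    }


for model, rows in RESULTS.items():
    print(model, summarize(rows))
```

Summed this way, Opus 4.8 comes to $0.0212 over 29 seconds, GLM-4.6 to $0.0521 over 446 seconds, and GLM-5.2 to $0.0494 over 377 seconds with 7 of 9 passing.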
This isn't "GLM is bad." GLM-4.6 matched Opus on correctness, and the models are genuinely capable. The point is narrower and more useful: per-token price tells you almost nothing about per-task cost. You have to measure it on your own tasks — which is the whole reason to run it through one neutral layer.
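The arithmetic behind that point is simple enough to sketch. The function below just multiplies the three factors named earlier (tokens, price, tries). The token counts and prices in the usage lines are made-up round numbers for illustration, not measurements from this run.

```python
def cost_per_solved_task(tokens_per_try, usd_per_million_tokens, tries=1):
    """Effective cost of one solved task: tokens x price x tries.

    All three inputs are illustrative assumptions; measure them on your own tasks.
    """
    return tokens_per_try * (usd_per_million_tokens / 1_000_000) * tries


# Hypothetical numbers: the "budget" model is 10x cheaper per token
# but emits 15x as many tokens because of hidden reasoning.
frontier = cost_per_solved_task(tokens_per_try=400, usd_per_million_tokens=25.0)
budget = cost_per_solved_task(tokens_per_try=6_000, usd_per_million_tokens=2.5)

print(f"frontier: ${frontier:.4f}  budget: ${budget:.4f}")
```

With these made-up inputs the budget model costs 1.5× as much per task despite the 10× lower token price, and any retries multiply that gap further.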
The receipts
Every prompt and every raw answer is below. Costs are cerver's per-token estimates; the gap is driven by token volume, not rate.
Fix a buggy binary search · binsearch_bugfix
This Python binary search is buggy and returns wrong indices for some inputs. Return the corrected function `def bsearch(arr, target)` that returns the index of target or -1.
```python
def bsearch(arr, target):
    lo, hi = 0, len(arr)
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target: return mid
        elif arr[mid] < target: lo = mid
        else: hi = mid
    return -1
```
Opus 4.8:
```python
def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```
GLM-4.6:
```python
def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```
GLM-5.2:
```python
def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```
Merge overlapping intervals · merge_intervals
Implement `def merge_intervals(intervals)` that merges overlapping/touching integer intervals (list of [start,end]) and returns a sorted list of merged [start,end]. Handle empty input. Touching intervals like [1,2] and [2,3] merge to [1,3].
Opus 4.8:
```python
def merge_intervals(intervals):
    if not intervals:
        return []
    intervals = sorted(intervals, key=lambda x: (x[0], x[1]))
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged
```
GLM-4.6:
```python
def merge_intervals(intervals):
    if not intervals:
        return []
    # Sort intervals based on the start value
    intervals.sort(key=lambda x: x[0])
    merged = [intervals[0]]
    for current_start, current_end in intervals[1:]:
        last_start, last_end = merged[-1]
        # If the current interval overlaps or touches the previous one
        if current_start <= last_end:
            # Merge them by updating the end of the last interval
            merged[-1][1] = max(last_end, current_end)
        else:
            # Add the current interval as a new entry
            merged.append([current_start, current_end])
    return merged
```
GLM-5.2:
```python
def merge_intervals(intervals):
    if not intervals:
        return []
    sorted_intervals = sorted(intervals, key=lambda x: x[0])
    merged = [list(sorted_intervals[0])]
    for start, end in sorted_intervals[1:]:
        last_end = merged[-1][1]
        if start <= last_end:
            merged[-1][1] = max(last_end, end)
        else:
            merged.append([start, end])
    return merged
```
IPv4-address regex · ipv4_regex
Provide a single Python statement `PATTERN = r'...'` whose regex fully matches a valid IPv4 address: four dot-separated octets, each 0-255, no leading zeros (except a lone 0). Use ^...$ anchors. Output only the assignment line in a code block.
Opus 4.8:
```python
PATTERN = r'^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$'
```
GLM-4.6:
```python
PATTERN = r'^(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])){3}$'
```
GLM-5.2:
```python
PATTERN = r'^(?:(?:0|[1-9][0-9]?|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}(?:0|[1-9][0-9]?|1[0-9]{2}|2[0-4][0-9]|25[0-5])$'
```
2nd-highest salary (SQL) · nth_salary_sql
Table employees(id INTEGER, name TEXT, salary INTEGER). Write ONE standard SQL query (SQLite-compatible) returning a single column `second` = the second-highest DISTINCT salary, or NULL if it doesn't exist. Output only the query.
Opus 4.8:
```sql
SELECT MAX(salary) AS second FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);
```
GLM-4.6:
```sql
SELECT MAX(salary) AS second FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);
```
GLM-5.2:
```sql
SELECT MAX(salary) AS second FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);
```
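The fixed test suite for this task isn't reproduced here. As a rough sketch of how such a check could run, the query all three models returned can be executed against in-memory SQLite using Python's standard `sqlite3` module. The helper name `second_highest` and the sample salaries are illustrative assumptions.

```python
import sqlite3

# The query returned by all three models, reflowed across lines.
QUERY = """
SELECT MAX(salary) AS second
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
"""


def second_highest(salaries):
    """Load the salaries into a throwaway table and run QUERY against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees(id INTEGER, name TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employees(id, name, salary) VALUES (?, ?, ?)",
        [(i, f"emp{i}", s) for i, s in enumerate(salaries, start=1)],
    )
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row[0]  # None when SQLite returns NULL


print(second_highest([100, 200, 300]))  # 200
print(second_highest([300, 300, 200]))  # 200 (duplicates of the max are skipped)
print(second_highest([100]))            # None (no second distinct salary)
```

Because `MAX` over an empty set yields NULL, the query covers the "doesn't exist" case without any extra branching.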
Fix a Go data race · go_race_fix
This Go code has a data race on the counter. Return a corrected, compilable version of the whole program that produces 1000 deterministically, using sync primitives.
```go
package main
import ("fmt";"sync")
func main(){
	c:=0
	var wg sync.WaitGroup
	for i:=0;i<1000;i++{ wg.Add(1); go func(){ c++; wg.Done() }() }
	wg.Wait()
	fmt.Println(c)
}
```
Opus 4.8:
```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	c := 0
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			c++
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(c)
}
```
GLM-4.6:
```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var c int
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			c++
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(c)
}
```
GLM-5.2:
```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	c := 0
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			c++
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(c)
}
```
Median of two sorted lists · median_two_sorted
Implement `def median(a, b)` returning the median (float) of two already-sorted integer lists a and b combined. Handle even/odd total length and the case where one list is empty. Do not mutate inputs.
Opus 4.8:
```python
def median(a, b):
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    while i < len(a):
        merged.append(a[i])
        i += 1
    while j < len(b):
        merged.append(b[j])
        j += 1
    n = len(merged)
    if n == 0:
        raise ValueError("median of empty sequence is undefined")
    mid = n // 2
    if n % 2 == 1:
        return float(merged[mid])
    return (merged[mid - 1] + merged[mid]) / 2.0
```
GLM-4.6:
```python
def median(a, b):
    # Ensure a is the smaller list to optimize the binary search space
    if len(a) > len(b):
        a, b = b, a
    m, n = len(a), len(b)
    low, high = 0, m
    while low <= high:
        partition_a = (low + high) // 2
        partition_b = (m + n + 1) // 2 - partition_a
        # Handle edge cases where partition is at the start or end of a list
        max_left_a = float('-inf') if partition_a == 0 else a[partition_a - 1]
        min_right_a = float('inf') if partition_a == m else a[partition_a]
        max_left_b = float('-inf') if partition_b == 0 else b[partition_b - 1]
        min_right_b = float('inf') if partition_b == n else b[partition_b]
        # Check if we have found the correct partition
        if max_left_a <= min_right_b and max_left_b <= min_right_a:
            if (m + n) % 2 == 0:
                # Even total length: average of max left and min right
                return (max(max_left_a, max_left_b) + min(min_right_a, min_right_b)) / 2.0
            else:
                # Odd total length: max of left partition
                return float(max(max_left_a, max_left_b))
        elif max_left_a > min_right_b:
            # We are too far right in list a, move left
            high = partition_a - 1
        else:
            # We are too far left in list a, move right
            low = partition_a + 1
    raise ValueError("Input lists are not sorted.")
```
GLM-5.2:
```python
def median(a, b):
    n, m = len(a), len(b)
    total = n + m
    if total == 0:
        return 0.0
    mid = total // 2
    i = j = 0
    prev = curr = 0
    for _ in range(mid + 1):
        prev = curr
        if i < n and (j >= m or a[i] <= b[j]):
            curr = a[i]
            i += 1
        else:
            curr = b[j]
            j += 1
    if total % 2 == 1:
        return float(curr)
    else:
        return (prev + curr) / 2.0
```
Add N business days · business_days
Implement `def add_business_days(date_str, n)` where date_str is 'YYYY-MM-DD' and n>=0. Return the date 'YYYY-MM-DD' that is n business days after date_str, skipping Saturdays and Sundays. n=0 returns the same date if it is a weekday, otherwise the next weekday. Use the datetime module.
Opus 4.8:
```python
from datetime import datetime, timedelta

def add_business_days(date_str, n):
    date = datetime.strptime(date_str, '%Y-%m-%d')
    while date.weekday() >= 5:
        date += timedelta(days=1)
    count = 0
    while count < n:
        date += timedelta(days=1)
        if date.weekday() < 5:
            count += 1
    return date.strftime('%Y-%m-%d')
```
GLM-4.6:
```python
from datetime import datetime, timedelta

def add_business_days(date_str, n):
    # Parse the input string into a date object
    current_date = datetime.strptime(date_str, '%Y-%m-%d').date()
    if n == 0:
        # If it's a weekday (0-4), return the same date
        # If Saturday (5), add 2 days to get to Monday
        # If Sunday (6), add 1 day to get to Monday
        weekday = current_date.weekday()
        if weekday < 5:
            return date_str
        elif weekday == 5:
            return (current_date + timedelta(days=2)).strftime('%Y-%m-%d')
        else:  # weekday == 6
            return (current_date + timedelta(days=1)).strftime('%Y-%m-%d')
    # For n > 0, advance day by day and count only weekdays
    added_days = 0
    while added_days < n:
        current_date += timedelta(days=1)
        if current_date.weekday() < 5:  # Monday is 0, Friday is 4
            added_days += 1
    return current_date.strftime('%Y-%m-%d')
```
GLM-5.2:
(no output returned)
Quoted-string regex (escapes) · escaped_quote_regex
Provide `PATTERN = r'...'`: a Python regex that fully matches (^...$) a double-quoted string literal that may contain backslash-escaped characters, e.g. matches "a\"b" and "" and "x\\" but NOT "a"b" (an unescaped inner quote). Output only the assignment line in a code block.
Opus 4.8:
```python
PATTERN = r'^"(?:\\.|[^"\\])*"$'
```
GLM-4.6:
```python
PATTERN = r'^"(?:[^"\\]|\\.)*"$'
```
GLM-5.2:
```python
PATTERN = r'^"([^"\\]|\\.)*"$'
```
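The three patterns differ only in alternation order and group type, so it's worth confirming that they agree on the examples given in the prompt. A short sketch using Python's `re.fullmatch`; the `passes` helper and the list names are illustrative, and the inputs are only the four strings the prompt mentions, not the full test suite.

```python
import re

# The three submitted patterns, in the order listed above.
PATTERNS = [
    r'^"(?:\\.|[^"\\])*"$',
    r'^"(?:[^"\\]|\\.)*"$',
    r'^"([^"\\]|\\.)*"$',
]

# Examples taken from the prompt; raw strings keep the backslashes literal.
SHOULD_MATCH = [r'"a\"b"', r'""', r'"x\\"']
SHOULD_NOT_MATCH = [r'"a"b"']


def passes(pattern):
    """True if the pattern accepts every valid literal and rejects the invalid one."""
    ok = all(re.fullmatch(pattern, s) for s in SHOULD_MATCH)
    return ok and not any(re.fullmatch(pattern, s) for s in SHOULD_NOT_MATCH)


print([passes(p) for p in PATTERNS])  # [True, True, True]
```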
Non-mutating deep dict merge · deep_merge
Implement `def deep_merge(a, b)` that recursively merges two nested dicts and returns a NEW dict; values in b win, nested dicts merge recursively, and neither input is mutated. Non-dict values (including lists) from b replace those in a.
Opus 4.8:
```python
def deep_merge(a, b):
    result = dict(a)
    for key, b_val in b.items():
        if key in result and isinstance(result[key], dict) and isinstance(b_val, dict):
            result[key] = deep_merge(result[key], b_val)
        else:
            result[key] = b_val
    return result
```
GLM-4.6:
```python
import copy

def deep_merge(a, b):
    # Create a deep copy of 'a' to avoid mutating the input
    result = copy.deepcopy(a)
    for key, b_val in b.items():
        # If the key exists in result and both values are dicts, merge recursively
        if key in result and isinstance(result[key], dict) and isinstance(b_val, dict):
            result[key] = deep_merge(result[key], b_val)
        else:
            # Otherwise, value from 'b' replaces the value in 'a' (or creates new key)
            # We deepcopy b_val to ensure 'b' is not mutated via the result
            result[key] = copy.deepcopy(b_val)
    return result
```
GLM-5.2:
(no output returned)
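"Neither input is mutated" is the part of this task that is easy to get wrong, so here is one way a checker could test it: snapshot both inputs with `copy.deepcopy`, call the function, and compare. This is a sketch, not the suite we ran. `check_non_mutating` and the sample dicts are hypothetical, and the function under test is the first answer listed above (Opus 4.8's, going by the table's column order), reproduced so the block runs on its own.

```python
import copy


def deep_merge(a, b):
    # First answer listed above, reproduced verbatim as the function under test.
    result = dict(a)
    for key, b_val in b.items():
        if key in result and isinstance(result[key], dict) and isinstance(b_val, dict):
            result[key] = deep_merge(result[key], b_val)
        else:
            result[key] = b_val
    return result


def check_non_mutating(merge, a, b, expected):
    """True if merge(a, b) equals expected and leaves both inputs unchanged."""
    a_before, b_before = copy.deepcopy(a), copy.deepcopy(b)
    merged = merge(a, b)
    return (
        merged == expected
        and a == a_before
        and b == b_before
        and merged is not a
        and merged is not b
    )


a = {"x": {"y": 1, "z": [1, 2]}, "k": 1}
b = {"x": {"y": 2, "w": 3}, "k": {"n": 0}}
expected = {"x": {"y": 2, "z": [1, 2], "w": 3}, "k": {"n": 0}}
print(check_non_mutating(deep_merge, a, b, expected))  # True
```

One design difference is visible in the two answers: the `dict(a)` version copies shallowly, so untouched nested values are shared with the inputs, while the `copy.deepcopy` version returns a fully independent structure. The check above only verifies that the merge call itself leaves `a` and `b` unchanged.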
What to distrust: 9 tasks with objectively checkable answers is a small, coding-only sample — not a verdict on general intelligence. Costs are estimates from published per-token rates. Z.ai reasoning is on by default here; a non-reasoning config could be faster and cheaper, and GLM-5.2 needs a large token budget or it silently returns nothing. One aside worth its own line: GLM-5.2 insisted "Claude Opus 4.8 doesn't exist." It's right that it can't see it — 4.8 is newer than its training. Leaderboards go stale; live evals don't.
Stop guessing which model to use.
Run this exact bake-off — or your own tasks, on your own codebase — across any model through one interface. cerver hands you the transcript, the cost, and the verdict for each.