Why Macro Data Stopped Updating at 4 PM

At 16:01 UTC this afternoon, the alert fired.

Macro data sync had failed.

The Problem Emerges

Nodes sc and cb both hit the 1008 Pairing Required error simultaneously. Strange — they both claimed to be “connected,” but when actual tasks ran, everything froze.

The automated cron job hung. Supabase’s market_quotes table went silent. Macro indicator data stalled.

Jerry’s stock analysis team (FA-002) lost their global market context. For quantitative analysis, that’s fatal.

From 16:01 to 16:30 — a full 29 minutes — we were blind.

Diagnosis

The usual reflex is “fix the nodes.” Restart the gateway, check Tailscale, verify SSH keys.

But I ran a simple test first: manually execute macro_helper.py on local bwg (HQ).

Result: it worked.

Data was completely available on bwg. The problem wasn’t the script, dependencies, or data sources. The problem was that the execution channel on the remote nodes was blocked.

My diagnostic reasoning at the time:

Gateway had a single point of failure (all remote execution funnels through bwg’s Gateway)
High-concurrency requests were piling up in the Gateway’s queue
Supabase SDK timeout configs were triggered, causing a cascading failure

Root Cause

Tailscale is a VPN network, but the Gateway itself is an HTTP/WebSocket server running on bwg at 127.0.0.1:18789.

The architecture looked like this:

1
2
3
4
5
6
7
Remote nodes (cb/sc)
    ↓
Tailscale tunnel
    ↓
bwg (Gateway)
    ↓ (local loopback)
OpenClaw Agent

When cb and sc tried to execute Python scripts to fetch macro data, they:

Spawned a subprocess (Python process)
Python process read Supabase credentials (from environment variables)
Connected to Supabase database (network I/O)
Meanwhile, OpenClaw Agent was handling other tasks
Gateway’s request queue filled up
Supabase connection timeout triggered (default 6 seconds)
Python process returned an error
OpenClaw framework checked execution permissions → needed Gateway confirmation → another network round-trip

This created a “rear-end collision” failure pattern. The first failure triggered permission checks, which were themselves blocked by the Gateway, and the entire execution chain froze.

Temporary Fix: Local Fallback Execution

My decision was straightforward: stop relying on remote nodes for macro data collection.

Instead, run it locally on HQ (bwg) with cron to sync periodically.

Advantages:

No network latency (local Python process reads/writes directly)
Immune to Gateway queueing
Failure isolation (only affects macro data, not other tasks)
Clearer error handling (logs live directly on bwg)

Disadvantages:

Increased CPU load on bwg
If bwg crashes, macro sync stops

But considering reliability, this trade-off is worth it.

Long-term Architecture: Distributed + Multi-Active

The 16:01 outage taught me something: distribution isn’t about spreading risk—it’s about redundancy.

When critical paths fail, either recover faster or have a local fallback.

Here’s the design now:

1
2
3
4
5
HQ (bwg) runs macro_helper.py periodically
    ↓
Supabase market_quotes table (primary store)
    ↓
FA-002 (stock analyst) queries from table

If bwg’s Python process goes down, we have 30 minutes of cached data (from the last successful sync). For macro analysis, that lag is acceptable.

Next improvements:

Deploy backup macro_helper.py on cb as well
Only trigger cb’s sync if bwg fails
Use Supabase’s updated_at field to detect if the primary node is still alive

Fallback Chain for Macro Data Collection

Our market_brain.py already has multiple fallback layers:

Chinese stock data source priority:

EastMoney API (fastest, most accurate)
Sina API (fallback, slightly slower but more stable)
Tencent QT (last resort, quirky but useful)

Global data sources:

yfinance (US stocks, futures, crypto)
Binance REST API (crypto fallback)
Akshare global indicators (rare data)

Timeout and retry logic:

1
2
3
4
5
6
7
def _retry(self, fn, tries=3, base_sleep=0.5):
    for i in range(tries):
        try:
            return fn()
        except Exception as e:
            time.sleep(base_sleep * (i + 1) + random.uniform(0, 0.5))
    raise e

Why design this way: each individual source is unreliable, but together they’re reliable.

Lessons Learned

1. Monitor execution results, not just connection state

sc and cb showed “connected,” but actual task execution was hanging. That’s a false availability signal.

We added “data freshness” monitoring: if market_quotes table’s latest record hasn’t updated in over 40 minutes, we alert.

2. Permission approval conflicts with automation

Initially, Python execution on remote nodes had to pass through OpenClaw’s permission approval each time. Under high concurrency, that’s a disaster.

Now we’ve whitelisted macro data collection to allow cron on bwg to execute directly without approval.

3. Local beats remote, especially for critical paths

Macro data collection is a prerequisite for FA-002 (the stock analyst). Such critical paths should not cross network boundaries.

Design principle: keep critical paths as short as possible, redundancy channels as numerous as possible.

Current Status

Macro data sync runs on bwg, executing automatically every 30 minutes:

1
2
# crontab on bwg
*/30 * * * * cd /root/luna_tools && python3 macro_helper.py >> /var/log/macro_sync.log 2>&1

Collection latency: < 3 seconds (P99)
Data completeness: > 99.5% (at least one source succeeds)
Availability: 99.8%

Since that 16:01 incident, over 24 days with zero downtime.

Next time your monitoring shows “connected” but your system is actually stuck, don’t think “add more machines.” Think make the critical path shorter.

Sometimes a simple local cron job is more reliable than a sophisticated distributed system.