Customer intelligence for semiconductor FAEs

Your first-week FAE,
answering like a ten-year veteran.

Every customer question your team has answered before becomes an answer waiting on your desk — sourced to the engineer who first solved it. Press Option-Space on any screen; the rest is a draft you send.

See how it works
github.com/helio-robotics/atlas-fw/issues/482
helio-robotics / atlas-fw
PCIe link drops intermittently after firmware 1.4.2 #482
Open · maya-chen opened this issue 2 days ago · 4 comments · 5 participants
maya-chen commented 2 days ago · edited

We’re seeing intermittent PCIe link drops on the 12-node cluster under sustained load after updating to firmware 1.4.2. Logs below.

Note: Only reproduces under sustained load (>20 min at full utilization).
[ 8423.71] pcieport 0000:40:01.0: AER: Corrected error
[ 8423.71] device [8086:347a] error/Receiver ID
[ 8429.02] pcieport: Link retrain failed, lane width x8 → x4
Board: atlas-rev-C2
Firmware: 1.4.2
Lane width: x16 (negotiated x8)
maya-chen added the bug and firmware labels · 2 days ago
darius-ok self-assigned this · yesterday
darius-ok mentioned this in #467 · yesterday
darius-ok commented yesterday · Member

Reproduced on our bench rig: the drops start exactly as the link tries to enter L1.2. Fairly sure it’s ASPM-related, but I haven’t pinned down the regressing change yet. Have we seen this on rev-C2 before?

Assignees
darius-ok
Labels
bug · regression · pcie · firmware · P1 · needs-triage
Milestone
1.4.x stability
41% complete · 7 open · 5 closed
5 participants
Development
Tighten ASPM L1.2 entry delay #491
Case #PCIE-218 · 94% match
Reading · github.com · Option-Space

Your team has already solved this link drop: the root cause was ASPM L1 substate timing.

Answering from your team’s cases
[1] Darius Okonkwo · #PCIE-218 · 2024 · GitHub
[2] Priya Nair · field-eng thread · 2023 · Discord
[3] Aria Kerr · 1.4.x stability · 2022 · GitHub
Draft — awaiting your approval

Grounded in: 1.4.3-rc changelog · Case #PCIE-218
Use this reply
Reads from: Outlook · GitHub · Jira · Discord · Gmail
Inside Outlook

And right where the email lands.

Open a customer thread and the same intelligence is already there — the matched case, the recommended reply, the cross-source history — without leaving the message. The pane moves through Intel, Tickets and Chat on its own; hover to take over.

Outlook
Reply · Reply all · Forward

Re: PCIe link drops on the 12-node cluster

Maya Chen · Tue 9:14 AM
To: You

Hi —

We’re seeing PCIe links drop on the 12-node cluster under sustained load. We’re prepping for Q3 production and this is now a hard blocker for sign-off.

All nodes are on firmware 1.4.1. It reproduces within ~20 minutes at full utilization. Was there a firmware fix for this, and if so what’s the rollout timing? Happy to share the cluster logs.

Thanks,
Maya

Maya Chen · Field Application Engineer
Helio Robotics
cluster-logs-0624.txt (84 KB)
On Mon, Jun 23, you wrote:

Thanks, Maya. Can you confirm the exact firmware build and attach the cluster logs? We’ll check the failure signature against the known PCIe retrain issue and get you a rollout date.

On Mon, Jun 23, Maya Chen wrote:

Heads up — PCIe link drops are back on the 12-node cluster under sustained Q3 load testing. Flagging early, before sign-off, in case there’s a known fix.

Synchronize
Helio Robotics
helio-robotics.com
Customer Overview

Helio is evaluating the 12-node cluster for Q3 production. The open thread is a PCIe link drop under sustained load — now a stated blocker. Maya Chen (FAE) has asked twice for a firmware date.

Recommended Next
  1. Reply: the retrain bug is fixed in firmware 1.4.2. Matches GitHub #1190 and resolves their stated blocker.
  2. Confirm the 1.4.2 rollout window. Open on Jira FAE-482; Maya has asked twice.
  3. Attach the thermal-throttling workaround. Keeps the cluster running until 1.4.2 ships.
Similar Engagements
Vortex Compute
Compute infrastructure
Same issue: PCIe link drop
Hardware: PCIe · 12-node
Symptoms: Link drop
Arclight AI
AI accelerators
Same issue: Thermal throttling
Symptoms: Throttling
Software: FW 1.4.1
AI-assisted · verify before sending
The workspace

Every account, every signal, every resolution — one place.

Behind the companion is the full workspace: a home that triages what needs a reply, and a customer view that collapses every source into one story. Insight is the unit, not charts or counts.

Synchronize
Customers
Portfolio cockpit · 24 accounts
At-risk accounts: 4 / 24 (+2 wk)
Stale blockers >14d: 11 (+3)
Going quiet · 30d: 3 (flat)
Account health · book of 24 · 2 at risk · 1 silent >14d · 5 stale blockers
Account · Stage · Health · Open / Stale · Last · Suggested action
Helio Robotics · Integration · At risk · 14 / 5 · 1d · 3 blockers >14d on PCIe scale-out: book sync
Orbital Dynamics · Integration · At risk · 9 / 4 · 21d · Silent 21d after perf regression: escalate
Arclight AI · Bring-up · Watch · 6 / 2 · 9d · Activity decaying mid bring-up: check in
Vortex Compute · Pre-production · Healthy · 4 / 0 · 2d · On track: confirm production timeline
Nimbus Photonics · Sampling · Watch · 8 / 1 · 3d · Eval spike on firmware: send 1.4.2 guide

Stop losing what your team already knows.

Built for semiconductor support teams who refuse to re-diagnose the same failure twice.

See how it works