| 1 | +--- |
| 2 | +name: ops-inspector |
| 3 | +description: AIOps-style one-click inspection skill for CloudBase resources. Use this skill when users need to diagnose errors, check resource health, inspect logs, or run a comprehensive health check across cloud functions, CloudRun services, databases, and other CloudBase resources. |
| 4 | +version: 2.16.1 |
| 5 | +alwaysApply: false |
| 6 | +--- |
| 7 | + |
| 8 | +## Standalone Install Note |
| 9 | + |
| 10 | +If only the current skill is installed in this environment, start from the CloudBase main entry and use the published `cloudbase/references/...` paths for sibling skills. |
| 11 | + |
| 12 | +- CloudBase main entry: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/SKILL.md` |
| 13 | +- Current skill raw source: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/ops-inspector/SKILL.md` |
| 14 | + |
| 15 | +Keep local `references/...` paths for files that ship with the current skill directory. When this file points to a sibling skill such as `cloud-functions` or `cloudrun-development`, use the standalone fallback URL shown next to that reference. |
| 16 | + |
| 17 | +## Activation Contract |
| 18 | + |
| 19 | +### Use this first when |
| 20 | + |
| 21 | +- The user wants to check the health or status of CloudBase resources (cloud functions, CloudRun, databases, storage, etc.). |
| 22 | +- The user reports errors, failures, or abnormal behavior and wants a quick diagnosis. |
| 23 | +- The user asks for an "inspection", "health check", "巡检" (inspection), "诊断" (diagnosis), or "troubleshooting" of their CloudBase environment. |
| 24 | +- The user wants to review recent error logs across services. |
| 25 | + |
| 26 | +### Read before writing code if |
| 27 | + |
| 28 | +- The inspection reveals code-level issues in cloud functions or CloudRun services — then read the relevant implementation skill before suggesting fixes. |
| 29 | +- The user wants to fix a problem found during inspection rather than just diagnose it. |
| 30 | + |
| 31 | +### Then also read |
| 32 | + |
| 33 | +- Cloud function issues -> `../cloud-functions/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloud-functions/SKILL.md`) |
| 34 | +- CloudRun issues -> `../cloudrun-development/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudrun-development/SKILL.md`) |
| 35 | +- Database issues -> `../relational-database-tool/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/relational-database-tool/SKILL.md`) or `../no-sql-web-sdk/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/no-sql-web-sdk/SKILL.md`) |
| 36 | +- Platform overview -> `../cloudbase-platform/SKILL.md` (standalone fallback: `https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudbase-platform/SKILL.md`) |
| 37 | + |
| 38 | +### Do NOT use for |
| 39 | + |
| 40 | +- Deploying new resources or writing application code. This skill is read-only and diagnostic. |
| 41 | +- Replacing proper monitoring/alerting infrastructure. It provides point-in-time inspection, not continuous monitoring. |
| 42 | +- Directly fixing problems — it diagnoses and recommends; actual fixes should use the appropriate implementation skill. |
| 43 | + |
| 44 | +### Common mistakes / gotchas |
| 45 | + |
| 46 | +- Running a full inspection without first confirming the environment is bound (`auth` tool must show logged-in and env-bound state). |
| 47 | +- Ignoring CLS log service status — if CLS is not enabled, `queryLogs` will fail; always check first with `queryLogs(action="checkLogService")`. |
| 48 | +- Searching logs without a time range — this can return excessive or irrelevant results. Always scope searches to a relevant time window. |
| 49 | +- Treating a single error log as the root cause without correlating across resources. A function error may stem from a database or config issue. |
| 50 | + |
| 51 | +### Minimal checklist |
| 52 | + |
| 53 | +- [ ] Environment is bound and accessible (`envQuery(action="info")`) |
| 54 | +- [ ] CLS log service is enabled (`queryLogs(action="checkLogService")`) |
| 55 | +- [ ] All target resources are listed before diving into details |
| 56 | +- [ ] Time range is specified for any log searches |
| 57 | +- [ ] Findings are summarized with severity levels and actionable recommendations |
| 58 | + |
| 59 | +--- |
| 60 | + |
| 61 | +## How to use this skill (for a coding agent) |
| 62 | + |
| 63 | +### Inspection Modes |
| 64 | + |
| 65 | +The skill supports two modes based on user intent: |
| 66 | + |
| 67 | +| Mode | When to use | Scope | |
| 68 | +|------|-------------|-------| |
| 69 | +| **Full inspection** | User asks for a general health check / 巡检 (inspection) / 全面检查 (comprehensive check) | All resource types in the environment | |
| 70 | +| **Targeted inspection** | User reports a specific error or asks about a specific resource | One resource type or a specific resource | |
| 71 | + |
| 72 | +### Full Inspection Workflow |
| 73 | + |
| 74 | +Follow these steps in order for a comprehensive environment health check: |
| 75 | + |
| 76 | +**Step 1 — Environment Check** |
| 77 | + |
| 78 | +``` |
| 79 | +envQuery(action="info") |
| 80 | +``` |
| 81 | + |
| 82 | +Confirm the environment is accessible. Record the `envId` for console link generation. |
| 83 | + |
| 84 | +**Step 2 — Log Service Status** |
| 85 | + |
| 86 | +``` |
| 87 | +queryLogs(action="checkLogService") |
| 88 | +``` |
| 89 | + |
| 90 | +If CLS is not enabled, note this as a **warning** — log-based diagnosis will be unavailable. Recommend enabling CLS in the console: `https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log` |
| 91 | + |
| 92 | +**Step 3 — Cloud Functions Inspection** |
| 93 | + |
| 94 | +``` |
| 95 | +queryFunctions(action="listFunctions") |
| 96 | +``` |
| 97 | + |
| 98 | +For each function, check: |
| 99 | +- **Status**: Is the function in an active/deployed state? |
| 100 | +- **Recent errors**: `queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<recent>")` |
| 101 | +- **Common issues**: |
| 102 | + - Timeout errors (execution exceeded limit) |
| 103 | + - Memory limit exceeded |
| 104 | + - Runtime errors (unhandled exceptions) |
| 105 | + - Cold start frequency |
| 106 | + |
| 107 | +**Step 4 — CloudRun Services Inspection** |
| 108 | + |
| 109 | +``` |
| 110 | +queryCloudRun(action="list") |
| 111 | +``` |
| 112 | + |
| 113 | +For each service, check: |
| 114 | +- **Status**: Is the service running? |
| 115 | +- **Detail**: `queryCloudRun(action="detail", detailServerName="<name>")` |
| 116 | +- **Common issues**: |
| 117 | + - Service not running (scaled to zero or crashed) |
| 118 | + - Image pull failures |
| 119 | + - OOMKilled events |
| 120 | + - Health check failures |
| 121 | + |
| 122 | +**Step 5 — Error Log Aggregation** (if CLS is enabled) |
| 123 | + |
| 124 | +``` |
| 125 | +queryLogs(action="searchLogs", queryString="ERROR", service="tcb", startTime="<24h-ago>", limit=50) |
| 126 | +queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", startTime="<24h-ago>", limit=50) |
| 127 | +``` |
| 128 | + |
| 129 | +Look for patterns: |
| 130 | +- Repeated error messages (same error many times) |
| 131 | +- Cascading failures (errors in multiple services around the same time) |
| 132 | +- Timeout patterns |
| 133 | + |
| 134 | +**Step 6 — Summary Report** |
| 135 | + |
| 136 | +Generate a structured report: |
| 137 | + |
| 138 | +```markdown |
| 139 | +# CloudBase Resource Inspection Report |
| 140 | + |
| 141 | +**Environment**: ${envId} |
| 142 | +**Inspection Time**: ${timestamp} |
| 143 | + |
| 144 | +## Overall Health: ✅ Healthy / ⚠️ Warnings Found / ❌ Issues Found |
| 145 | + |
| 146 | +### Cloud Functions |
| 147 | +| Function | Status | Recent Errors | Severity | |
| 148 | +|----------|--------|---------------|----------| |
| 149 | +| ... | ... | ... | ... | |
| 150 | + |
| 151 | +### CloudRun Services |
| 152 | +| Service | Status | Issues | Severity | |
| 153 | +|---------|--------|--------|----------| |
| 154 | +| ... | ... | ... | ... | |
| 155 | + |
| 156 | +### Error Log Summary |
| 157 | +- Total errors in last 24h: N |
| 158 | +- Top error patterns: ... |
| 159 | + |
| 160 | +## Recommendations |
| 161 | +1. ... |
| 162 | +2. ... |
| 163 | + |
| 164 | +## Console Links |
| 165 | +- Cloud Functions: https://tcb.cloud.tencent.com/dev?envId=${envId}#/scf |
| 166 | +- CloudRun: https://tcb.cloud.tencent.com/dev?envId=${envId}#/platform-run |
| 167 | +- Logs: https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log |
| 168 | +``` |
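The console links in the report differ only in the hash route, so they can be generated from the recorded `envId`. This is a small sketch using the three URLs from the template above; the `console_links` function and the `CONSOLE_PAGES` mapping are illustrative names, not an existing API.

```python
from urllib.parse import quote

CONSOLE_BASE = "https://tcb.cloud.tencent.com/dev"

# Hash routes copied from the report template above.
CONSOLE_PAGES = {
    "Cloud Functions": "scf",
    "CloudRun": "platform-run",
    "Logs": "devops/log",
}


def console_links(env_id: str) -> dict[str, str]:
    """Build the console URL for each page of the given environment."""
    env = quote(env_id, safe="")  # escape anything unsafe in a query value
    return {
        label: f"{CONSOLE_BASE}?envId={env}#/{page}"
        for label, page in CONSOLE_PAGES.items()
    }
```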
| 169 | + |
| 170 | +### Targeted Inspection Workflow |
| 171 | + |
| 172 | +When the user specifies a resource type or a specific resource: |
| 173 | + |
| 174 | +1. **Cloud function errors**: `queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<time>", endTime="<time>")` then `queryLogs(action="searchLogs", queryString="functionName:<name> AND level:ERROR", ...)` |
| 175 | +2. **CloudRun errors**: `queryCloudRun(action="detail", detailServerName="<name>")` then `queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", ...)` |
| 176 | +3. **Database issues**: Use `querySqlDatabase(action="getContext")` for MySQL or `readNoSqlDatabaseStructure(action="listCollections")` for NoSQL |
| 177 | +4. **General error search**: `queryLogs(action="searchLogs", queryString="<error-keyword>", ...)` |
| 178 | + |
| 179 | +### AIOps Methodology |
| 180 | + |
| 181 | +This skill follows AIOps principles for intelligent inspection: |
| 182 | + |
| 183 | +1. **Data Collection**: Gather logs and resource states via MCP tools |
| 184 | +2. **Pattern Recognition**: Identify recurring errors, anomaly patterns, and correlations across services |
| 185 | +3. **Root Cause Hypothesis**: Based on error patterns, suggest likely root causes (e.g., a function timeout may be caused by a database query bottleneck) |
| 186 | +4. **Actionable Recommendations**: Provide specific, prioritized remediation steps with links to relevant skills and console pages |
| 187 | + |
| 188 | +### Severity Levels |
| 189 | + |
| 190 | +| Level | Icon | Meaning | |
| 191 | +|-------|------|---------| |
| 192 | +| Critical | ❌ | Service is down or data is at risk; requires immediate action | |
| 193 | +| Warning | ⚠️ | Errors detected but service is still partially functional; investigate soon | |
| 194 | +| Info | ℹ️ | Noteworthy observation that is not an error; informational status only | |
| 195 | +| Healthy | ✅ | Resource is operating normally | |
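The overall health line in the report can be derived from the worst severity found across all resources. The sketch below assumes that Critical maps to "Issues Found" and Warning maps to "Warnings Found"; both that mapping and the numeric ordering of the levels are assumptions, since the skill does not spell them out.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table above; the numeric order is an assumption."""
    HEALTHY = 0
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def overall_health(findings: list[Severity]) -> str:
    """Roll per-resource severities up into the report's overall health line."""
    # An empty list means nothing was flagged, so treat it as healthy.
    worst = max(findings, default=Severity.HEALTHY)
    if worst == Severity.CRITICAL:
        return "❌ Issues Found"
    if worst == Severity.WARNING:
        return "⚠️ Warnings Found"
    return "✅ Healthy"
```

Taking the maximum means a single critical finding determines the headline, even if every other resource is healthy.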
| 196 | + |
| 197 | +### Preferred Tool Map |
| 198 | + |
| 199 | +| Operation | MCP Tool Call | |
| 200 | +|-----------|---------------| |
| 201 | +| Check environment | `envQuery(action="info")` | |
| 202 | +| Check CLS status | `queryLogs(action="checkLogService")` | |
| 203 | +| List cloud functions | `queryFunctions(action="listFunctions")` | |
| 204 | +| Get function detail | `queryFunctions(action="getFunctionDetail", functionName="<name>")` | |
| 205 | +| Get function logs | `queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<time>", endTime="<time>")` | |
| 206 | +| Get function log detail | `queryFunctions(action="getFunctionLogDetail", requestId="<id>")` | |
| 207 | +| List CloudRun services | `queryCloudRun(action="list")` | |
| 208 | +| Get CloudRun detail | `queryCloudRun(action="detail", detailServerName="<name>")` | |
| 209 | +| Search CLS logs | `queryLogs(action="searchLogs", queryString="<query>", service="tcb\|tcbr", startTime="<time>", endTime="<time>")` | |
| 210 | +| Check NoSQL structure | `readNoSqlDatabaseStructure(action="listCollections")` | |
| 211 | +| Check MySQL status | `querySqlDatabase(action="getContext")` | |
| 212 | + |
| 213 | +### Common CLS Query Patterns |
| 214 | + |
| 215 | +| Scenario | queryString | |
| 216 | +|----------|-------------| |
| 217 | +| All errors | `ERROR` | |
| 218 | +| Function timeout | `timeout OR 超时` | |
| 219 | +| Function OOM | `OOM OR "out of memory" OR 内存超限` | |
| 220 | +| CloudRun crash | `crash OR OOMKilled OR Error` | |
| 221 | +| Specific function errors | `functionName:<name> AND level:ERROR` | |
| 222 | +| 5xx HTTP errors | `statusCode:>499` | |
| 223 | +| Cold start issues | `coldStart OR 冷启动` | |
| 224 | + |
| 225 | +### Time Range Guidance |
| 226 | + |
| 227 | +- **Quick check**: Last 1 hour (`startTime` = 1 hour ago) |
| 228 | +- **Standard inspection**: Last 24 hours |
| 229 | +- **Trend analysis**: Last 7 days |
| 230 | +- **Specific incident**: Narrow to the reported time window |
| 231 | + |
| 232 | +Always format `startTime`/`endTime` as `YYYY-MM-DD HH:mm:ss`, e.g., `"2025-01-15 00:00:00"`. This is the ISO 8601 date and time joined by a space rather than the `T` separator. |
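The look-back windows above can be turned into `startTime`/`endTime` strings with a small helper. This is a sketch only: the window names (`quick`, `standard`, `trend`) are made up for illustration, and the helper formats whatever `datetime` the caller passes in, because the timezone the tools expect is not stated here.

```python
from datetime import datetime, timedelta

# Hypothetical window names mapped to the look-back periods listed above.
WINDOWS = {
    "quick": timedelta(hours=1),
    "standard": timedelta(hours=24),
    "trend": timedelta(days=7),
}

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # same shape as "2025-01-15 00:00:00"


def time_range(window: str, now: datetime) -> tuple[str, str]:
    """Return (startTime, endTime) strings for a named inspection window."""
    start = now - WINDOWS[window]
    return start.strftime(TIME_FORMAT), now.strftime(TIME_FORMAT)
```

For a specific incident, pass the reported start and end times through `strftime(TIME_FORMAT)` directly instead of using a named window.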
| 233 | + |
| 234 | +## Related Skills |
| 235 | + |
| 236 | +- `cloud-functions` — Cloud function development, deployment, and debugging |
| 237 | +- `cloudrun-development` — CloudRun backend deployment and management |
| 238 | +- `cloudbase-platform` — General platform knowledge and console navigation |
| 239 | +- `relational-database-tool` — MySQL database management and diagnostics |