feat(aws_durable_execution): persist trace context across suspend/resume#17773
joeyzhao2018 wants to merge 46 commits into
Performance SLOs
Comparing candidate joey/cross-invocation-tracecontext-propagation (db67ae1) with baseline joey/apm-ai-toolkit/aws-durable-execution-sdk-python (92f0e53)
All reported benchmarks passed their time and memory SLOs (✅); memory vs baseline ranged from +3.9% to +5.5%. Flagged results:
📈 Performance Regressions (2 suites)
- iastaspects (118/118 ✅)
  - ljust_aspect: Time 587.609µs (SLO: <700.000µs, -16.1%), vs baseline 📈 +16.7%
  - stringio_aspect: Time 4.598ms (SLO: <5.000ms, -8.0%), vs baseline 📈 +18.0%
- iastaspectsospath (24/24 ✅)
  - ospathbasename_aspect: Time 522.536µs (SLO: <700.000µs, -25.4%), vs baseline 📈 +23.2%
🟡 Near SLO Breach (7 suites)
- djangosimple (28/28 ✅)
  - profiler: Memory 60.391MB (SLO: <61.000MB, 🟡 -1.0%), vs baseline +4.9%
- iastpropagation (8/8 ✅)
  - propagation_enabled: Memory 41.229MB (SLO: <42.000MB, 🟡 -1.8%), vs baseline +5.5%
- otelspan (22/22 ✅)
  - start-finish-telemetry: Time 91.263ms (SLO: <93.000ms, 🟡 -1.9%), vs baseline -0.4%
- packagesupdateimporteddependencies (24/24 ✅, 1 unstable)
  - import_many: Time 169.027µs (SLO: <170.000µs, 🟡 -0.6%), vs baseline +0.4%
de57fad to f873a4f
3eddaeb to f33d7fa
6c14d36 to ca81f78
f33d7fa to 9a25ef3
e0f7a28 to fed5e48
Introduce ``ddtrace/contrib/internal/aws_durable_execution_sdk_python/trace_checkpoint.py``
which appends a synthetic ``_datadog_{N}`` STEP operation to the durable
execution log on every ``SuspendExecution``. The payload is a JSON dict of
the propagation headers for the active trace, with the per-span volatile
fields (``x-datadog-parent-id``, ``traceparent``'s parent segment) rewritten
to point at the durable-execution root span — either the grandparent of the
current ``aws.durable.execute`` span (first invocation) or the parent id
already stored in the latest prior checkpoint (replays).
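The parent-id rewrite described above can be sketched as follows. This is a minimal illustration, not the code in ``trace_checkpoint.py``: the function name ``anchor_headers`` and its signature are assumptions made for the example. It relies only on the standard header shapes (``x-datadog-parent-id`` is a decimal span id, and the third dash-separated field of ``traceparent`` is the 16-hex-digit parent id).

```python
from typing import Dict


def anchor_headers(headers: Dict[str, str], root_span_id: int) -> Dict[str, str]:
    """Return a copy of `headers` whose per-span fields point at `root_span_id`.

    Hypothetical helper for illustration; the real module may differ.
    """
    anchored = dict(headers)  # never mutate the caller's dict

    # Datadog style: the parent id is a decimal string.
    if "x-datadog-parent-id" in anchored:
        anchored["x-datadog-parent-id"] = str(root_span_id)

    # W3C style: version-traceid-parentid-flags, parent id is 16 hex digits.
    traceparent = anchored.get("traceparent")
    if traceparent:
        parts = traceparent.split("-")
        if len(parts) == 4:
            parts[2] = format(root_span_id, "016x")
            anchored["traceparent"] = "-".join(parts)

    return anchored
```

Because every invocation rewrites the volatile fields to the same root span id, two checkpoints taken from different child spans of the same trace end up with the same parent reference.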
Diffing the new headers against the stored payload of the highest-N existing
``_datadog_*`` operation suppresses redundant writes; only ``x-datadog-parent-id``
and the ``dd=p:`` entry of ``tracestate`` are stripped before comparison so
sampling priority, decision-maker, origin, and propagation tags still trigger
a fresh save when they change.
Wired into ``_traced_durable_execution`` so a single checkpoint is written per
invocation, only on the suspend path (workflows that return or fail
terminally would never read the checkpoint).
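The suspend-only wiring might look roughly like this. The ``SuspendExecution`` class below is a local stand-in for the SDK's exception, and ``traced_execution`` and ``save_checkpoint`` are hypothetical names for the example rather than the integration's actual functions.

```python
from typing import Any, Callable


class SuspendExecution(Exception):
    """Stand-in for the SDK's suspend signal (assumption for this sketch)."""


def traced_execution(handler: Callable[..., Any], save_checkpoint: Callable[[], None]) -> Callable[..., Any]:
    """Wrap `handler` so `save_checkpoint` runs once, and only on suspend."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return handler(*args, **kwargs)
        except SuspendExecution:
            try:
                save_checkpoint()
            except Exception:
                # A failed checkpoint must not change workflow behavior.
                pass
            raise  # always re-raise so the runtime still sees the suspend

    return wrapper
```

Normal returns and terminal failures pass straight through without touching the checkpoint writer.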
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
24afe33 to d283ef5
    try:
        setattr(state, _STATE_NEXT_N_ATTR, n + 1)
    except Exception:
        log.debug("Could not advance checkpoint counter", exc_info=True)
I'm seeing a bunch of debug logs on fatal exceptions. Should they be debug, considering that level will not show by default? I understand the fear of spamming warn lines, but I think I would find it even more confusing for a feature to fail completely with no error/warning log showing at all.
This comment is on this specific line, but it applies to a bunch of the debug logs.
Allow me to push back a bit on this one. There are a few reasons I can't ignore.
- After all, we're not customer code. I don't want WARN/ERROR lines from us to land in their application logs and page their on-call for things they can't act on. The unwritten contract for a library running in someone else's runtime is to stay quiet unless something the customer can act on is broken.
- Graceful degradation, not feature loss. If these paths fail, the worst outcome is that a Datadog observability feature degrades: a checkpoint doesn't get pre-marked, a tag doesn't get set. Customer workflow correctness is untouched. WARN/ERROR for "an observability library lost some observability" is louder than the impact warrants.
- The "no signal at all" concern has a better mitigation than log level: telemetry. If we want Datadog to see when an integration degrades, that's what integration-error telemetry is for. It is visible to us and not piped into customer logs.
  - But we haven't turned on instrumentation telemetry for serverless yet.
  - But we are working on it, and @emmettbutler actually has a few PRs for that. We also need that to see how customers use the integration, so I added them here.
To add to this, can we make the debug logs a bit more actionable? From the message alone it's difficult to debug and dig into root causes. We should include some info about the state and, in this case, why we could not advance (this could be captured in exc_info, but that might not be the case everywhere).
Updated with some details to help debugging.
In general, we are trying to make sure that:
- the metric tags of integration telemetry carry the dimensions we want to aggregate on
- the logs carry the runtime values needed to reproduce locally
mabdinur left a comment
Left some nits; the only blocking comment is about using native threads. From a tracing standpoint the direction looks good. Thanks for driving this.
    *which* code path failed without us having to fish through customer
    logs. Keep the set of values bounded: it becomes a metric tag.
    """
    try:
Is try/except needed? Do we expect this to fail?
_record_integration_error is called from inside except blocks whose entire job is to keep telemetry failures from breaking the workflow — including as the last statement of the outermost catch-all in maybe_save_trace_context_checkpoint. If telemetry_writer.add_count_metric ever raises (shutdown, post-fork, etc.), that exception propagates out and breaks the user's durable execution. The try/except is what makes the function's "must never raise" contract real.
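A minimal sketch of that "must never raise" contract, using an injected callable in place of the real telemetry writer so the example stays self-contained. The metric name, the tag key, and the set of known error types are hypothetical.

```python
from typing import Callable, Tuple

Tags = Tuple[Tuple[str, str], ...]

# Hypothetical bounded set of tag values; anything else collapses to "other".
_KNOWN_ERROR_TYPES = frozenset({"checkpoint_write", "counter_advance", "header_extract"})


def record_integration_error(emit_count: Callable[[str, int, Tags], None], error_type: str) -> None:
    """Emit one count metric for `error_type`; swallow any failure while emitting."""
    tag_value = error_type if error_type in _KNOWN_ERROR_TYPES else "other"
    try:
        emit_count("integration_errors", 1, (("error_type", tag_value),))
    except Exception:
        # Telemetry is best effort: a failure here (shutdown, post-fork, ...)
        # must not propagate into the user's durable execution.
        pass
```

Since this function is called from inside other `except` blocks, swallowing its own failures is what keeps the outer handlers from leaking an exception.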
    pass


    _CHECKPOINT_NAME_PREFIX = "_datadog_"
Do we have to deal with size limits for checkpoint names or headers? Would using _dd be more efficient?
I think we can benefit from a bit of explicitness here. One reason is that these operation names will be viewed directly by customers, not just be visible to them. They will certainly look at those checkpoints; I can even picture customers looking up trace IDs from these checkpoints and jumping directly to our traces. Another reason is that _dd has a higher chance of collision. I once looked into a support case where the customer's services also had a ton of names starting with dd*.
As for the size limits, I put more details in my design doc. In short, I think the size of our data is negligible.
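To make the naming scheme concrete, here is a small sketch of how the next `_datadog_{N}` index could be derived from existing operation names. The helper name `next_checkpoint_n` and the choice to start numbering at 0 are assumptions for the example.

```python
import re
from typing import Iterable

_CHECKPOINT_NAME_PREFIX = "_datadog_"
_CHECKPOINT_NAME_RE = re.compile(re.escape(_CHECKPOINT_NAME_PREFIX) + r"(\d+)")


def next_checkpoint_n(operation_names: Iterable[str]) -> int:
    """Return one more than the highest N among `_datadog_{N}` names, or 0 if none."""
    highest = -1
    for name in operation_names:
        match = _CHECKPOINT_NAME_RE.fullmatch(name)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest + 1
```

The full-match on the explicit prefix means customer operations such as `_dd_5` or `dd-step` are ignored, which is the collision concern raised above.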
    import hashlib
    import json
    import threading
We should avoid using the threading library directly; this can cause stability issues in forked environments. We should use this module instead: https://github.com/DataDog/dd-trace-py/blob/main/ddtrace/internal/threads.py
Fixed — swapped to ddtrace.internal.threads.Lock.
Quick follow-up: is this something most Python devs would catch on instinct, or is it pretty specific to ddtrace's constraints (gevent monkey-patching, forked workers)? If it's the latter, would it be worth a lint rule or CI grep that flags import threading / threading.Lock in new code under ddtrace/? Let me know and I can open a PR for that.
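One possible shape for such a check, sketched with the standard `ast` module rather than a grep so that comments and strings don't trigger it. The function name and the way it would be wired into CI are assumptions, not an existing rule in the repository.

```python
import ast
from typing import List


def find_threading_imports(source: str, filename: str = "<string>") -> List[int]:
    """Return line numbers of `import threading` / `from threading import ...`."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            # Covers `import threading` and `import threading as t`.
            if any(alias.name == "threading" for alias in node.names):
                hits.append(node.lineno)
        elif isinstance(node, ast.ImportFrom):
            # level == 0 restricts the match to absolute imports.
            if node.level == 0 and node.module == "threading":
                hits.append(node.lineno)
    return sorted(hits)
```

A CI job could run this over changed files under `ddtrace/` and fail when the returned list is non-empty.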
…ableContext and has .state
…tability issues in forked enviornments
Description
This adds Datadog trace-context checkpointing for AWS Durable Execution workflows so traces can continue across Lambda invocations when a workflow suspends and later resumes.
When a durable handler raises `SuspendExecution`, the integration now appends an async synthetic `_datadog_{N}` STEP containing the current propagation headers. On replay, `datadog-lambda-python` can read the latest checkpoint and reactivate the trace context before the workflow continues. That part is in the datadog-lambda-python PR #818.
The checkpoint writer also:
Testing
`tests/contrib/aws_durable_execution_sdk_python/test_trace_checkpoint.py` for header stabilization, parent-id anchoring/reuse, `traceparent` rewriting, replay diff suppression, checkpoint numbering, concurrent allocation, and no-op failure paths.
Risks
Low to medium. This only writes Datadog checkpoint metadata on the `SuspendExecution` path, and failures in the checkpoint writer are swallowed so workflow execution should not be affected.
The main behavioral risks are:
- `_datadog_*` prefix
- `state.operations`, `create_checkpoint`, or operation parent IDs
Note
To avoid creating too many checkpoints, we are excluding the `dd=p:` part when comparing the tracecontext.
Diffing the new headers against the stored payload of the highest-N existing `_datadog_*` operation suppresses redundant writes; only `x-datadog-parent-id` and the `dd=p:` entry of `tracestate` are stripped before comparison so sampling priority, decision-maker, origin, and propagation tags still trigger a fresh save when they change.