Date: 2026-01-28
Project: /Users/liu/workspace/taesd
- Read `README.md` to understand the repo.
- Write reproducible scripts using coremltools to convert FLUX.1 VAE and FLUX.2 VAE to Core ML (`.mlmodelc`), fixed input size 768x768.
- Validate Core ML results match PyTorch.
- Start with NCHW, then validate NHWC (`[1, 768, 768, 16]` / `[1, 768, 768, 32]`); later corrected to 96x96 latent (since VAE downsamples by 8).
- Focus only on decoder (remove encoder support).
- Run the scripts and fix errors.
- Provide validation code for review.
- Write Swift code to load NHWC `.mlmodelc` and validate it matches Python output.
- Run Swift validation for FLUX.1/2 (NHWC).
- Switch conversion to Float16, re-run Python CoreML validation and Swift validation.
- Verify ANE-only execution (not possible to prove in this environment).
- Write a replay log; then asked to write this WORKLOG.
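The shape correction noted in the task list (96x96 latent rather than 768x768) follows from the 8x downsampling. A minimal sketch of that arithmetic, assuming a square image and a batch size of 1; `latent_shape` is a hypothetical helper, not a function from the repo:

```python
def latent_shape(image_hw: int, latent_channels: int, layout: str = "nchw",
                 downsample: int = 8) -> tuple:
    """Return the batch-1 decoder input shape for a square image.

    latent_channels is 16 for FLUX.1 and 32 for FLUX.2, per the shapes in this log.
    """
    if image_hw % downsample != 0:
        raise ValueError(f"{image_hw} is not divisible by {downsample}")
    hw = image_hw // downsample  # e.g. 768 // 8 == 96
    if layout == "nchw":
        return (1, latent_channels, hw, hw)
    if layout == "nhwc":
        return (1, hw, hw, latent_channels)
    raise ValueError(f"unknown layout: {layout}")


print(latent_shape(768, 16, "nhwc"))   # FLUX.1, NHWC
print(latent_shape(768, 32, "nchw"))   # FLUX.2, NCHW
print(latent_shape(1024, 16, "nchw"))  # FLUX.1 at 1024x1024
```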
- Added `scripts/convert_flux_vae_coreml.py`
  - Decoder-only conversion for FLUX.1/2
  - Supports NCHW/NHWC
  - TorchScript tracing before conversion
  - ML Program output (`.mlpackage`) + compiled `.mlmodelc`
  - coremlc fallback compile when MLModelProxy fails
  - Validation against PyTorch with error metrics
  - Added precision switch (`float32`/`float16`) and tolerance handling
- Added `scripts/export_flux_decoder_io.py`
  - Exports deterministic NHWC input/output tensors to `.bin` for Swift validation
- Added `scripts/validate_coreml.swift`
  - Loads `.mlmodelc`, runs inference, compares output to reference
  - Handles float16 outputs and non-contiguous strides
- Updated `scripts/convert_flux_vae_coreml.py` multiple times to:
  - Remove encoder support
  - Fix repo import
  - Force `source="pytorch"` and TorchScript tracing
  - Save ML Program as `.mlpackage`
  - Compile via `coremlc` when needed
  - Avoid `/var/folders` temp permission issues
  - Add precision selection and default tolerances
- Latent input is 96x96 for 768x768 image (1/8 downsample), not 768x768.
- Decoder-only conversion required setting `encoder_path=None`.
- coremltools compile failed in sandbox; used `xcrun coremlc compile`.
- Swift validation initially incorrect due to float16 output and strides; fixed.
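The float16/stride issue in the last note can be illustrated in a few lines. This is a sketch of the general pitfall, not the Swift code from `scripts/validate_coreml.swift`: when an output buffer pads its rows, reading it as a dense array returns padding values in place of data. The 2x3 array, the padded row stride of 4, and the helper `read_strided_f16` are all illustrative assumptions; strides are expressed in elements.

```python
import struct


def read_strided_f16(buf: bytes, shape, strides):
    """Gather a 2-D float16 array from buf, honoring per-dimension element strides."""
    rows, cols = shape
    row_stride, col_stride = strides  # counted in elements, not bytes
    out = []
    for r in range(rows):
        for c in range(cols):
            offset = (r * row_stride + c * col_stride) * 2  # 2 bytes per float16
            (value,) = struct.unpack_from("<e", buf, offset)
            out.append(value)
    return out


# A logical 2x3 array stored with each row padded to 4 elements.
padded = [1.0, 2.0, 3.0, 0.0, 4.0, 5.0, 6.0, 0.0]
buf = struct.pack("<8e", *padded)

correct = read_strided_f16(buf, shape=(2, 3), strides=(4, 1))
naive = list(struct.unpack_from("<6e", buf, 0))  # ignores the row padding

print(correct)
print(naive)
```

The naive read picks up a padding element in the middle of the data, which is the kind of mismatch that shows up as a large validation error even though the model output is fine.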
Install deps in venv:
pip install coremltools numpy torch
Convert + validate with coremltools (float32):
python scripts/convert_flux_vae_coreml.py --variant flux1 --layout nchw --precision float32 --validate
python scripts/convert_flux_vae_coreml.py --variant flux2 --layout nchw --precision float32 --validate
python scripts/convert_flux_vae_coreml.py --variant flux1 --layout nhwc --precision float32 --validate
python scripts/convert_flux_vae_coreml.py --variant flux2 --layout nhwc --precision float32 --validate
Export deterministic NHWC IO for Swift:
python scripts/export_flux_decoder_io.py --variant flux1
python scripts/export_flux_decoder_io.py --variant flux2
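For orientation, here is a rough sketch of what a deterministic `.bin` export and read-back can look like. The on-disk format used by `scripts/export_flux_decoder_io.py` is not recorded in this log; the sketch assumes raw little-endian float32 with no header, and the helper names (`make_latent`, `write_f32_bin`, `read_f32_bin`) are hypothetical.

```python
import os
import random
import struct
import tempfile


def make_latent(count: int, seed: int = 0) -> list:
    """Deterministic pseudo-random values: the same seed always yields the same tensor."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


def write_f32_bin(path: str, values: list) -> None:
    """Write values as raw little-endian float32 with no header (assumed format)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_f32_bin(path: str) -> list:
    """Read a raw little-endian float32 file back into a list of floats."""
    with open(path, "rb") as f:
        data = f.read()
    return list(struct.unpack(f"<{len(data) // 4}f", data))


# Small stand-in for a [1, 96, 96, 16] NHWC latent, to keep the demo fast.
count = 1 * 4 * 4 * 16
latent = make_latent(count, seed=0)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.bin")
    write_f32_bin(path, latent)
    size = os.path.getsize(path)
    restored = read_f32_bin(path)

print(size, len(restored))
```

A headerless format means the reader has to know the shape and dtype out of band, which is presumably why the Swift validator takes the latent channel count as an argument.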
Convert + validate with coremltools (float16, adjusted tolerances):
python scripts/convert_flux_vae_coreml.py --variant flux1 --layout nchw --precision float16 --validate --atol 0.02
python scripts/convert_flux_vae_coreml.py --variant flux2 --layout nchw --precision float16 --validate --atol 0.03
python scripts/convert_flux_vae_coreml.py --variant flux1 --layout nhwc --precision float16 --validate --atol 0.02
python scripts/convert_flux_vae_coreml.py --variant flux2 --layout nhwc --precision float16 --validate --atol 0.03
Swift validation (NHWC, float16):
swift -Xfrontend -module-cache-path -Xfrontend /tmp/swift-module-cache scripts/validate_coreml.swift coreml_out/flux1_decoder_nhwc_float16.mlmodelc coreml_io/flux1/input.bin coreml_io/flux1/output.bin 16
swift -Xfrontend -module-cache-path -Xfrontend /tmp/swift-module-cache scripts/validate_coreml.swift coreml_out/flux2_decoder_nhwc_float16.mlmodelc coreml_io/flux2/input.bin coreml_io/flux2/output.bin 32
Optional Python validation using CompiledMLModel with CPU+NE:
- See the Python snippet in the previous replay log message.
Float32 (Python coremltools validation):
- FLUX.1 decoder NCHW: max_abs ~2.03e-06
- FLUX.2 decoder NCHW: max_abs ~2.06e-04
- FLUX.1 decoder NHWC: max_abs ~2.44e-06
- FLUX.2 decoder NHWC: max_abs ~1.67e-04
Float16 (Python coremltools validation):
- FLUX.1 decoder NCHW: max_abs 0.01793 (atol 0.02)
- FLUX.2 decoder NCHW: max_abs 0.02299 (atol 0.03)
- FLUX.1 decoder NHWC: max_abs 0.01973 (atol 0.02)
- FLUX.2 decoder NHWC: max_abs 0.02479 (atol 0.03)
Swift (NHWC, float16):
- FLUX.1 decoder NHWC: max_abs 0.004456, mae 0.000260, rmse 0.000347
- FLUX.2 decoder NHWC: max_abs 0.005844, mae 0.000406, rmse 0.000538
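The three metrics reported above have standard definitions. A minimal sketch of how they can be computed over flattened outputs, together with a pass/fail check against an absolute tolerance; the function names are hypothetical, and the scripts in this repo may compute them differently (for example with NumPy):

```python
import math


def error_metrics(reference, candidate):
    """Return max absolute error, mean absolute error, and RMSE for two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    n = len(diffs)
    return {
        "max_abs": max(diffs),
        "mae": sum(diffs) / n,
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
    }


def within_atol(metrics, atol):
    """Pass when the worst-case element error does not exceed the tolerance."""
    return metrics["max_abs"] <= atol


reference = [0.0, 0.0, 0.0, 0.0]
candidate = [0.01, -0.02, 0.0, 0.01]
metrics = error_metrics(reference, candidate)
print(metrics, within_atol(metrics, atol=0.02))
```

Because `max_abs` is driven by a single worst element while `mae` and `rmse` average over the whole tensor, a float16 model can sit close to its `atol` and still have a very small average error, as in the results listed here.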
- coremltools exposes `ComputeUnit.CPU_AND_NE` but no `NEURAL_ENGINE_ONLY`.
- No runtime reporting to prove ANE-only execution in this environment.
- Result correctness validated with CPU+NE (not exclusive to ANE).
Date: 2026-01-28 (continued)
- Clarify NHWC vs NCHW for Core ML / ANE; user confirmed ANE prefers channels-first and asked to keep NHWC work as reference.
- Run NCHW benchmarks for float32/float16 and clean up NHWC artifacts.
- Explore ANE performance optimization: ML Program FP16, float16 I/O, neuralnetwork format, and compute units.
- Verify performance discrepancies between Python vs Swift; determine if sandbox overhead affected timings.
- Switch focus to FLUX.2; generate Core ML models and enable Swift benchmarking.
- Generate 1024x1024 models for FLUX.1/2 and compare performance; then add flexible shape models supporting 768 and 1024.
- Added NHWC reference implementation in `taesd_nhwc.py` (NHWC wrappers for Conv/GroupNorm/Upsample) and `--nhwc-model` option in `scripts/convert_flux_vae_coreml.py`.
- Rebuilt NHWC float32/float16 models, inspected MIL op counts (extra transposes only at input/output), and validated correctness.
- Cleaned NHWC and bench artifacts when requested.
- Added conversion options:
  - `--convert-to {mlprogram, neuralnetwork}`
  - `--io-precision {auto,float16,float32}`
  - lowered min deployment target for neuralnetwork
  - float16 I/O validation handling
- Generated ANE-optimized ML Program (float16 compute + float16 I/O) for FLUX.1 and FLUX.2; compiled `.mlmodelc` using `coremlc`.
- Added benchmarking scripts:
  - `scripts/benchmark_coreml_flux1.py` (CoreML compute units)
  - `scripts/benchmark_coreml_flux1_python.py` (Python timing script)
- Added/updated Swift harness for Instruments/benchmarking:
  - `scripts/run_coreml_instruments.swift`
    - supports compute units, dtype, input-hw, latent-channels
    - prints input shape/dtype
    - reports avg/median/p10/p90/min/max over N iters
    - fixed behavior to use `--input-hw` for enumerated shapes by default
- Demonstrated that sandboxed Python timing was ~10x slower than local runs; user confirmed ~20 ms in local Swift/Python.
- FLUX.1 fixed 768: ~20 ms (CPU+NE, float16) in local environment.
- FLUX.2 fixed 768: ~21 ms (CPU+NE, float16) in local environment.
- FLUX.1 fixed 1024: ~36.2 ms (CPU+NE, float16).
- FLUX.2 fixed 1024: ~38.3 ms (CPU+NE, float16).
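The timing figures above come from harnesses that report avg/median/p10/p90/min/max over N iterations. A small Python sketch of that kind of summary follows. It is not the code from `scripts/run_coreml_instruments.swift` or the Python benchmark scripts; the linear-interpolation percentile, the warmup count, and the function names are assumptions.

```python
import statistics
import time


def percentile(sorted_values, q):
    """Linearly interpolated percentile of pre-sorted values, q in [0, 100]."""
    if not sorted_values:
        raise ValueError("no samples")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def summarize(samples_ms):
    """Summary statistics in the same spirit as the harness output."""
    s = sorted(samples_ms)
    return {
        "avg": statistics.fmean(s),
        "median": statistics.median(s),
        "p10": percentile(s, 10),
        "p90": percentile(s, 90),
        "min": s[0],
        "max": s[-1],
    }


def benchmark(fn, iters=200, warmup=10):
    """Time fn() in milliseconds, discarding warmup runs before measuring."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return summarize(samples)


# Stand-in workload; a real run would call the Core ML prediction here.
print(benchmark(lambda: sum(range(1000)), iters=50, warmup=5))
```

Reporting the median and p10/p90 alongside the average makes one-off stalls visible, which helps when comparing environments such as the sandbox and a local run.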
- Implemented enumerated shape support via `--input-hw-list 768,1024` in converter.
- Generated ML Program float16 models with float16 I/O:
  - `coreml_out/ane_opt_flex/flux1_decoder_nchw_float16_hw768_1024_iofloat16.mlmodelc`
  - `coreml_out/ane_opt_flex/flux2_decoder_nchw_float16_hw768_1024_iofloat16.mlmodelc`
- User benchmarks for flex models:
- FLUX.1 flex: 768 ~21.84 ms, 1024 ~38.38 ms
- FLUX.2 flex: 768 ~22.94 ms, 1024 ~40.57 ms (CPU+NE, float16, 200 iters)
- Removed: `coreml_out/ane_opt/bench`, `coreml_out/ane_opt_256`, `coreml_out/ane_opt_512`, `coreml_out/ane_opt_flux2/bench`
- Removed all NHWC artifacts when requested.
- Core ML/ANE prefers channels-first; NHWC graph not beneficial beyond input/output.
- coremltools MLModel predict in sandbox showed large overhead vs Swift/Python locally; confirmed local results are fast.
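- Flexible (enumerated) shapes do not significantly degrade performance vs fixed shapes: the flex models ran roughly 2 ms slower per inference in the benchmarks above (e.g. FLUX.1 1024: ~38.38 ms flex vs ~36.2 ms fixed).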
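The fixed-vs-flex comparison can be made concrete from the timings logged in this entry. The sketch below only does arithmetic on those numbers; note that the fixed 768 figures are logged as approximate (~20 ms, ~21 ms), so the 768 percentages are rough.

```python
# Timings in milliseconds, copied from the benchmark notes in this log.
FIXED_MS = {
    ("flux1", 768): 20.0,
    ("flux2", 768): 21.0,
    ("flux1", 1024): 36.2,
    ("flux2", 1024): 38.3,
}
FLEX_MS = {
    ("flux1", 768): 21.84,
    ("flux2", 768): 22.94,
    ("flux1", 1024): 38.38,
    ("flux2", 1024): 40.57,
}


def overhead_pct(fixed_ms: float, flex_ms: float) -> float:
    """Relative slowdown of the flexible-shape model, in percent."""
    return (flex_ms - fixed_ms) / fixed_ms * 100.0


for key in sorted(FIXED_MS):
    variant, hw = key
    delta = FLEX_MS[key] - FIXED_MS[key]
    pct = overhead_pct(FIXED_MS[key], FLEX_MS[key])
    print(f"{variant} {hw}: +{delta:.2f} ms ({pct:.1f}%)")
```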
- Discussed ANE performance optimization ideas for FLUX.1/2; user asked to try 4-bit palettization, 8-bit palettization, int8 weights, and int8 activations (QAT), focusing on 1024x1024 first.
- Implemented additional converter options and scripts to support quantization/benching:
  - `scripts/convert_flux_vae_coreml.py`: added `--input-hw-list`, `--palettize-nbits`, `--quantize-weight-int8`, `--convert-to`, `--io-precision`; supports enumerated shapes and naming suffixes.
  - `scripts/benchmark_coreml_flux1.py` and `scripts/benchmark_coreml_flux1_python.py` for CoreML timing.
  - `scripts/qat_flux_decoder_int8.py` for QAT int8 weights + activations using coremltools LinearQuantizer.
  - `scripts/download_soa_subset.py` to fetch ~20 images from `madebyollin/soa-aesthetic` for calibration.
  - `scripts/roundtrip_compare.py` to save round-trip outputs for visual inspection.
- Generated CoreML artifacts for 1024:
  - Palettized 4-bit and 8-bit (`coreml_out/ane_opt_pal4_1024`, `coreml_out/ane_opt_pal8_1024`).
  - Int8 weight-only (`coreml_out/ane_opt_int8w_1024`).
  - QAT int8 weights+activations (`coreml_out/ane_opt_qat_int8_full`).
- User-reported Swift benchmarks (1024):
- Palettization 4/8-bit: ~40 ms (no speedup).
- Int8 weights: ~40 ms (no speedup).
- QAT int8 weights+activations: ~36.3 ms (faster than fp16 baseline ~38.4 ms).
- QAT quality check (CoreML CPU compare): max_abs ~0.34, MAE ~0.015, RMSE ~0.020 across small dataset; accuracy did not improve vs earlier runs.
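As background for the int8 results, here is a sketch of the basic quantize/dequantize arithmetic. It shows symmetric per-tensor linear quantization only; it is not quantization-aware training and not what `scripts/qat_flux_decoder_int8.py` or coremltools does internally. The symmetric scheme, the [-127, 127] range, and the function names are assumptions for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes and the shared scale."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.37, 0.0, 0.2, 0.81, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.6f} max_err={max_err:.6f}")
```

The rounding error per element is bounded by half the scale, so a tensor with a few large outliers gets a coarse grid for every other value. That is one common reason activation quantization costs more accuracy than weight-only quantization.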
- Saved round-trip images for inspection to `coreml_io/roundtrip_compare/` with subdirs: `original`, `fp16_baseline`, `pal4`, `pal8`, `int8w`, `qat_int8`.
- Installed Python deps in `_env`: `requests`, `datasets`, `pillow` (and dependencies).
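For readers unfamiliar with palettization, which the 4-bit and 8-bit experiments above rely on: the idea is to replace each weight with an index into a small lookup table of 2^nbits values. The sketch below uses plain 1-D k-means to build the table, which is one common approach. How `--palettize-nbits` builds its table is not recorded in this log, so treat the function `palettize` and its parameters as illustrative.

```python
def nearest(value, centroids):
    """Index of the lookup-table entry closest to value."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


def palettize(weights, nbits, iters=20):
    """Cluster weights into 2**nbits centroids; return (lookup_table, indices)."""
    if nbits < 1:
        raise ValueError("nbits must be at least 1")
    k = 2 ** nbits
    lo, hi = min(weights), max(weights)
    if lo == hi:
        return [lo], [0] * len(weights)
    # Start with centroids evenly spaced over the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        indices = [nearest(w, centroids) for w in weights]
        for j in range(k):
            members = [w for w, idx in zip(weights, indices) if idx == j]
            if members:  # leave empty clusters where they are
                centroids[j] = sum(members) / len(members)
    indices = [nearest(w, centroids) for w in weights]
    return centroids, indices


weights = [0.0, 0.1, 1.0, 1.1]
table, indices = palettize(weights, nbits=1)
restored = [table[i] for i in indices]

print(table, indices, restored)
```

Palettization shrinks the stored weights, but the weights are still expanded to float values at compute time. That would be consistent with the lack of speedup reported above, though this log does not confirm the cause.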