feat: Update roadmap and fix bug in regression post.

mchav · mchav · commit 0d0cfb60fc0b · 2025-11-11T12:24:29.000-08:00
diff --git a/_posts/2025-11-09-linear-regression.md b/_posts/2025-11-09-linear-regression.md
@@ -60,7 +60,7 @@ df <- D.readCsv "../data/housing.csv"
 Real-world data is messy. Sometimes values are missing, and we need to deal with that:
 
 ```haskell
-let meanTotalBedrooms = df |> D.filterJust "total_bedrooms" |> D.mean
+let meanTotalBedrooms = df |> D.filterJust "total_bedrooms" |> D.mean (F.col @Double "total_bedrooms")
 ```
 
 **Translation:** "Hey DataFrame, take our data, filter out the rows where `total_bedrooms` is missing, then calculate the mean of what's left."
@@ -279,3 +279,6 @@ Now that you've mastered the basics:
 - Experiment with more complex feature engineering
 - Learn about train/test splits and model validation
 - Explore Hasktorch's neural network modules
+
+## Get involved
+Wanna help contribute to data science in Haskell?
diff --git a/_posts/2025-11-09-roadmap.md b/_posts/2025-11-09-roadmap.md
@@ -22,31 +22,33 @@ This roadmap outlines the strategic direction for building a complete, productio
 By 2027, DataHaskell will provide a high-performance, end-to-end data science toolkit that enables practitioners to build reliable machine learning systems from data ingestion through model deployment.
 
 ### Core Principles
-1. **Type Safety First**: Leverage Haskell's type system to catch errors at compile time
-2. **Interoperability**: Seamless integration between ecosystem components
-3. **Performance**: Match or exceed Python/R performance benchmarks
-4. **Ergonomics**: Intuitive APIs that lower the barrier to entry
-5. **Production Ready**: Focus on reliability, monitoring, and deployment
+1. **Interoperability**: Seamless integration between ecosystem components
+2. **Performance**: Match or exceed Python/R performance benchmarks
+3. **Ergonomics**: Intuitive APIs that lower the barrier to entry
+4. **Production Ready**: Focus on reliability, monitoring, and deployment
+5. **Type Safety**: Leverage Haskell's type system (where possible) to catch errors at compile time
 
 ---
 
 ## Current State Assessment
 
-### 🟢 Strengths
-- **dataframe** (v0.1 launch March 5): Modern, type-safe dataframe library with IHaskell integration
+### Strengths
+- **dataframe**: Modern dataframe library with IHaskell integration
 - **Hasktorch**: Mature deep learning library with PyTorch backend and GPU support
 - **distributed-process**: Battle-tested distributed computing framework
+- **IHaskell**: Haskell kernel for Jupyter notebooks
 - Strong functional programming foundations
 - Excellent parallelism and concurrency primitives
 
-### 🟡 Gaps to Address
+### Gaps to Address
+- No community of maintainers and contributors
 - Fragmented visualization ecosystem
 - Limited data I/O format support
 - Incomplete documentation and tutorials
 - Sparse integration examples between major libraries
 - Limited model deployment tooling
 
-### 🔴 Critical Needs
+### Critical Needs
 - Unified onboarding experience
 - Comprehensive benchmarking against Python/R
 - Production deployment patterns
@@ -62,7 +64,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 **Owner**: dataframe team
 
 **Goals**:
-- ✅ Complete dataframe v0.1 release (March 2026)
+- Complete dataframe v1 release (March 2026)
 - Establish dataframe as the standard tabular data library
 - Performance parity with Pandas/Polars for common operations
 
@@ -95,8 +97,8 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 
 **Goals**:
 - Advanced data manipulation features
-- Integration with database systems
-- Time series support
+- Computing on files larger than memory
+- Integration with cloud database systems
 
 **Deliverables**:
 1. **Advanced Operations**
@@ -106,18 +108,13 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
    - Complex joins (anti, semi)
    - Reshaping operations (melt, cast)
 
-2. **Database Connectivity**
+2. **Cloud Database Connectivity**
+   - Read files from AWS/GCP/Azure
    - PostgreSQL integration
    - SQLite support
    - Query pushdown optimization
    - Streaming query results
 
-3. **Time Series Extensions**
-   - Date/time indexing
-   - Resampling operations
-   - Time-based rolling windows
-   - Timezone handling
-
 ---
 
 ## Pillar 2: Statistical Computing & Visualization
@@ -126,14 +123,13 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 **Owner**: Community (needs maintainer)
 
 **Goals**:
-- Establish comprehensive statistics library
+- Create a unified machine learning library on top of Hasktorch and statistics
 - Create unified plotting API
 
 **Deliverables**:
-1. **statistics-next** (modernize existing library)
-   - Descriptive statistics
-   - Hypothesis testing (t-test, ANOVA, chi-square)
-   - Linear regression
+1. **statistics**
+   - Extend hypothesis testing (t-test, ANOVA)
+   - Simple regression models (linear and logistic)
    - Generalized linear models (GLM)
    - Survival analysis basics
    - Integration with dataframe
@@ -173,7 +169,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 **Owners**: Hasktorch + dataframe teams
 
 **Goals**:
-- Seamless dataframe → tensor pipeline
+- Improve dataframe → tensor pipeline
 - Example-driven documentation
 
 **Deliverables**:
@@ -183,7 +179,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
    - GPU memory management
    - Batch loading utilities
 
-2. **ML Workflow Examples**
+2. **ML Workflow Examples with new unified library**
    - End-to-end classification (Iris, MNIST)
    - Regression examples (California Housing)
    - Time series forecasting
@@ -297,7 +293,6 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 
 **Deliverables**:
 1. **DataHaskell Website Revamp**
-   - Modern design
    - Clear getting started guide
    - Library comparison matrix
    - Migration guides (from Python, R)
@@ -330,7 +325,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
    - Example project templates
 
 2. **IDE Support Improvements**
-   - VSCode extension enhancements
+   - VSCode IHaskell support, with the DataHaskell stack supported out of the box
    - HLS integration guides
    - Debugging workflows
    - IHaskell kernel improvements
@@ -380,41 +375,22 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 
 ---
 
-## Integration Priority Matrix
-
-### Critical Integrations (Start Immediately)
-1. **dataframe ↔ Hasktorch**: Data → Training pipeline
-2. **dataframe ↔ IHaskell**: Interactive analysis
-3. **dataframe ↔ statistics**: Analysis workflow
-
-### High Priority (Q2-Q3 2026)
-4. **dataframe ↔ distributed-process**: Distributed operations
-5. **Hasktorch ↔ distributed-process**: Distributed training
-6. **statistics ↔ visualization**: Plot statistical results
-
-### Medium Priority (Q4 2026)
-7. **All ↔ model deployment**: Production pipeline
-8. **All ↔ monitoring**: Observability
-
----
-
 ## Success Metrics
 
 ### Q2 2026
-- [ ] dataframe v0.1 released with 500+ downloads/month
+- [ ] dataframe v1 released
 - [ ] 3 complete end-to-end tutorials published
 - [ ] Performance benchmarks showing ≥70% of Pandas speed
 - [ ] 5 integration examples between major libraries
 
 ### Q4 2026
 - [ ] 10,000+ total library downloads/month across ecosystem
-- [ ] 20+ companies using DataHaskell in production
-- [ ] 50+ active contributors
+- [ ] 5+ active contributors
 - [ ] Performance parity (≥90%) with Pandas for common operations
 - [ ] Complete ML workflow from data to deployment documented
 
 ### Q2 2027
-- [ ] 100+ companies using DataHaskell
+- [ ] 2+ companies using DataHaskell
 - [ ] DataHaskell track at major Haskell conference
 - [ ] 3+ published case studies
 - [ ] Comprehensive distributed computing examples
@@ -431,7 +407,6 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 ### Maintainer Coordination
 - **Monthly sync**: All pillar leads (1 hour)
 - **Quarterly planning**: Full maintainer group (2 hours)
-- **Annual retreat**: Strategic direction (virtual or in-person)
 
 ### Funding Needs (Optional but Helpful)
 1. **Infrastructure**
@@ -441,7 +416,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 
 2. **Developer Support**
    - Part-time technical writer
-   - Maintainer stipends (Haskell Foundation)
+   - Maintainer stipends or grants
    - Summer of Haskell projects
 
 3. **Events**
@@ -490,11 +465,10 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 **Criteria**:
 1. Unmaintained for >6 months
 2. Better alternative exists
-3. Low usage (<100 downloads/month)
-4. Creates confusion in ecosystem
+3. Creates confusion in ecosystem
 
 ### Version Compatibility Policy
-- Support last 2 GHC versions
+- Support last 2 major GHC versions
 - Semantic versioning (PVP)
 - Deprecation warnings for 2 releases before removal
 - Compatibility matrix published on website
@@ -504,7 +478,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 ## Communication Plan
 
 ### Internal (Maintainers)
-- **Slack/Discord channel**: Daily async communication
+- **Discord channel**: Daily async communication
 - **GitHub Discussions**: Technical decisions, RFCs
 - **Monthly video call**: Roadmap progress, blockers
 - **Quarterly planning session**: Next phase priorities
@@ -518,75 +492,6 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 
 ---
 
-## Near-Term Action Items (Next 30 Days)
-
-### For dataframe maintainer (mchav)
-1. [ ] Finalize v0.1 release checklist
-2. [ ] Write Parquet support specification
-3. [ ] Create 3 dataframe ↔ Hasktorch examples
-4. [ ] Set up benchmark infrastructure
-
-### For Hasktorch team
-1. [ ] Test dataframe integration patterns
-2. [ ] Document tensor conversion APIs
-3. [ ] Create example pipeline notebook
-4. [ ] Identify distributed training requirements
-
-### For distributed-process team
-1. [ ] Prototype distributed dataframe operations
-2. [ ] Document deployment patterns
-3. [ ] Create cluster setup guide
-4. [ ] Design fault-tolerance strategy
-
-### For community coordinator
-1. [ ] Set up monthly call schedule
-2. [ ] Create Discord/Slack workspace
-3. [ ] Draft website redesign plan
-4. [ ] Reach out to potential contributors
-
-### For all
-1. [ ] Review and comment on this roadmap
-2. [ ] Identify personal capacity for next 6 months
-3. [ ] Claim ownership of specific deliverables
-4. [ ] Share roadmap with broader community
-
----
-
-## Appendix A: Related Projects to Consider
-
-### Existing Haskell Projects
-- **Frames**: Alternative dataframe (potential collaboration/consolidation?)
-- **hmatrix**: Linear algebra (ensure compatibility)
-- **statistics**: Statistical computing (modernization candidate)
-- **Chart/hvega**: Visualization (integration targets)
-- **postgresql-simple**: Database connectivity
-- **accelerate**: Array processing with GPU support
-
-### External Integration Targets
-- **Apache Arrow**: Zero-copy data interchange
-- **DuckDB**: Embedded analytical database
-- **ONNX**: Model interchange format
-- **MLflow**: ML lifecycle management
-
----
-
-## Appendix B: Glossary
-
-**Critical Path**: dataframe → statistics → ML toolkit → distributed operations
-**Integration Points**: Where libraries share data structures or APIs
-**Zero-Copy**: Data sharing without duplication in memory
-**Type-Safe**: Compile-time guarantees about data structure and operations
-
----
-
-## Appendix C: Version History
-
-| Version | Date | Changes | Author |
-|---------|------|---------|--------|
-| 1.0 | Nov 2026 | Initial comprehensive roadmap | DataHaskell coordinators |
-
----
-
 ## How to Use This Roadmap
 
 This is a **living document**. We will:
@@ -595,8 +500,6 @@ This is a **living document**. We will:
 - Celebrate milestones publicly
 - Adapt based on community feedback
 
-**Contributing**: See [CONTRIBUTING.md] for how to propose changes to this roadmap.
-
 **Questions?** Open a discussion on GitHub or join our community calls.
 
 ---