Commit 0d0cfb6

feat: Update roadmap and fix bug in regression post.
1 parent c344a06 commit 0d0cfb6

2 files changed: +34 -128 lines changed
_posts/2025-11-09-linear-regression.md

Lines changed: 4 additions & 1 deletion
@@ -60,7 +60,7 @@ df <- D.readCsv "../data/housing.csv"
 Real-world data is messy. Sometimes values are missing, and we need to deal with that:

 ```haskell
-let meanTotalBedrooms = df |> D.filterJust "total_bedrooms" |> D.mean
+let meanTotalBedrooms = df |> D.filterJust "total_bedrooms" |> D.mean (F.col @Double "total_bedrooms")
 ```

 **Translation:** "Hey DataFrame, take our data, filter out the rows where `total_bedrooms` is missing, then calculate the mean of what's left."
@@ -279,3 +279,6 @@ Now that you've mastered the basics:
 - Experiment with more complex feature engineering
 - Learn about train/test splits and model validation
 - Explore Hasktorch's neural network modules
+
+## Get involved
+Wanna help contribute to data science in Haskell?

_posts/2025-11-09-roadmap.md

Lines changed: 30 additions & 127 deletions
@@ -22,31 +22,33 @@ This roadmap outlines the strategic direction for building a complete, productio
 By 2027, DataHaskell will provide a high-performance, end-to-end data science toolkit that enables practitioners to build reliable machine learning systems from data ingestion through model deployment.

 ### Core Principles
-1. **Type Safety First**: Leverage Haskell's type system to catch errors at compile time
-2. **Interoperability**: Seamless integration between ecosystem components
-3. **Performance**: Match or exceed Python/R performance benchmarks
-4. **Ergonomics**: Intuitive APIs that lower the barrier to entry
-5. **Production Ready**: Focus on reliability, monitoring, and deployment
+1. **Interoperability**: Seamless integration between ecosystem components
+2. **Performance**: Match or exceed Python/R performance benchmarks
+3. **Ergonomics**: Intuitive APIs that lower the barrier to entry
+4. **Production Ready**: Focus on reliability, monitoring, and deployment
+5. **Type Safety**: Leverage Haskell's type system (where possible) to catch errors at compile time

 ---

 ## Current State Assessment

-### 🟢 Strengths
-- **dataframe** (v0.1 launch March 5): Modern, type-safe dataframe library with IHaskell integration
+### Strengths
+- **dataframe**: Modern dataframe library with IHaskell integration
 - **Hasktorch**: Mature deep learning library with PyTorch backend and GPU support
 - **distributed-process**: Battle-tested distributed computing framework
+- **IHaskell**: A Haskell kernel for Jupyter notebooks.
 - Strong functional programming foundations
 - Excellent parallelism and concurrency primitives

-### 🟡 Gaps to Address
+### Gaps to Address
+- No community of maintainers and contributors
 - Fragmented visualization ecosystem
 - Limited data I/O format support
 - Incomplete documentation and tutorials
 - Sparse integration examples between major libraries
 - Limited model deployment tooling

-### 🔴 Critical Needs
+### Critical Needs
 - Unified onboarding experience
 - Comprehensive benchmarking against Python/R
 - Production deployment patterns
@@ -62,7 +64,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 **Owner**: dataframe team

 **Goals**:
-- Complete dataframe v0.1 release (March 2026)
+- Complete dataframe v1 release (March 2026)
 - Establish dataframe as the standard tabular data library
 - Performance parity with Pandas/Polars for common operations

@@ -95,8 +97,8 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to

 **Goals**:
 - Advanced data manipulation features
-- Integration with database systems
-- Time series support
+- Computing on files larger than memory
+- Integration with Cloud database systems

 **Deliverables**:
 1. **Advanced Operations**
@@ -106,18 +108,13 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 - Complex joins (anti, semi)
 - Reshaping operations (melt, cast)

-2. **Database Connectivity**
+2. **Cloud database Connectivity**
+- Read files from AWS/GCP/Azure
 - PostgreSQL integration
 - SQLite support
 - Query pushdown optimization
 - Streaming query results

-3. **Time Series Extensions**
-- Date/time indexing
-- Resampling operations
-- Time-based rolling windows
-- Timezone handling
-
 ---

 ## Pillar 2: Statistical Computing & Visualization
@@ -126,14 +123,13 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 **Owner**: Community (needs maintainer)

 **Goals**:
-- Establish comprehensive statistics library
+- Create a unified machine learning library on top of Hasktorch and Statistics
 - Create unified plotting API

 **Deliverables**:
-1. **statistics-next** (modernize existing library)
-- Descriptive statistics
-- Hypothesis testing (t-test, ANOVA, chi-square)
-- Linear regression
+1. **statistics**
+- Extend hypothesis testing (t-test, ANOVA)
+- Simple regression models (linear and logistic)
 - Generalized linear models (GLM)
 - Survival analysis basics
 - Integration with dataframe
@@ -173,7 +169,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 **Owners**: Hasktorch + dataframe teams

 **Goals**:
-- Seamless dataframe → tensor pipeline
+- Improve dataframe → tensor pipeline
 - Example-driven documentation

 **Deliverables**:
@@ -183,7 +179,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 - GPU memory management
 - Batch loading utilities

-2. **ML Workflow Examples**
+2. **ML Workflow Examples with new unified library**
 - End-to-end classification (Iris, MNIST)
 - Regression examples (California Housing)
 - Time series forecasting
@@ -297,7 +293,6 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to

 **Deliverables**:
 1. **DataHaskell Website Revamp**
-- Modern design
 - Clear getting started guide
 - Library comparison matrix
 - Migration guides (from Python, R)
@@ -330,7 +325,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 - Example project templates

 2. **IDE Support Improvements**
-- VSCode extension enhancements
+- VSCode IHaskell support with dataHaskell stack supported out the box
 - HLS integration guides
 - Debugging workflows
 - IHaskell kernel improvements
@@ -380,41 +375,22 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to

 ---

-## Integration Priority Matrix
-
-### Critical Integrations (Start Immediately)
-1. **dataframe ↔ Hasktorch**: Data → Training pipeline
-2. **dataframe ↔ IHaskell**: Interactive analysis
-3. **dataframe ↔ statistics**: Analysis workflow
-
-### High Priority (Q2-Q3 2026)
-4. **dataframe ↔ distributed-process**: Distributed operations
-5. **Hasktorch ↔ distributed-process**: Distributed training
-6. **statistics ↔ visualization**: Plot statistical results
-
-### Medium Priority (Q4 2026)
-7. **All ↔ model deployment**: Production pipeline
-8. **All ↔ monitoring**: Observability
-
----
-
 ## Success Metrics

 ### Q2 2026
-- [ ] dataframe v0.1 released with 500+ downloads/month
+- [ ] dataframe v1 released
 - [ ] 3 complete end-to-end tutorials published
 - [ ] Performance benchmarks showing ≥70% of Pandas speed
 - [ ] 5 integration examples between major libraries

 ### Q4 2026
 - [ ] 10,000+ total library downloads/month across ecosystem
-- [ ] 20+ companies using DataHaskell in production
-- [ ] 50+ active contributors
+- [ ] 5+ active contributors
 - [ ] Performance parity (≥90%) with Pandas for common operations
 - [ ] Complete ML workflow from data to deployment documented

 ### Q2 2027
-- [ ] 100+ companies using DataHaskell
+- [ ] 2+ companies using DataHaskell
 - [ ] DataHaskell track at major Haskell conference
 - [ ] 3+ published case studies
 - [ ] Comprehensive distributed computing examples
@@ -431,7 +407,6 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 ### Maintainer Coordination
 - **Monthly sync**: All pillar leads (1 hour)
 - **Quarterly planning**: Full maintainer group (2 hours)
-- **Annual retreat**: Strategic direction (virtual or in-person)

 ### Funding Needs (Optional but Helpful)
 1. **Infrastructure**
@@ -441,7 +416,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to

 2. **Developer Support**
 - Part-time technical writer
-- Maintainer stipends (Haskell Foundation)
+- Maintainer stipends or grants
 - Summer of Haskell projects

 3. **Events**
@@ -490,11 +465,10 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 **Criteria**:
 1. Unmaintained for >6 months
 2. Better alternative exists
-3. Low usage (<100 downloads/month)
-4. Creates confusion in ecosystem
+3. Creates confusion in ecosystem

 ### Version Compatibility Policy
-- Support last 2 GHC versions
+- Support last 2 major GHC versions
 - Semantic versioning (PVP)
 - Deprecation warnings for 2 releases before removal
 - Compatibility matrix published on website
@@ -504,7 +478,7 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to
 ## Communication Plan

 ### Internal (Maintainers)
-- **Slack/Discord channel**: Daily async communication
+- **Discord channel**: Daily async communication
 - **GitHub Discussions**: Technical decisions, RFCs
 - **Monthly video call**: Roadmap progress, blockers
 - **Quarterly planning session**: Next phase priorities
@@ -518,75 +492,6 @@ By 2027, DataHaskell will provide a high-performance, end-to-end data science to

 ---

-## Near-Term Action Items (Next 30 Days)
-
-### For dataframe maintainer (mchav)
-1. [ ] Finalize v0.1 release checklist
-2. [ ] Write Parquet support specification
-3. [ ] Create 3 dataframe ↔ Hasktorch examples
-4. [ ] Set up benchmark infrastructure
-
-### For Hasktorch team
-1. [ ] Test dataframe integration patterns
-2. [ ] Document tensor conversion APIs
-3. [ ] Create example pipeline notebook
-4. [ ] Identify distributed training requirements
-
-### For distributed-process team
-1. [ ] Prototype distributed dataframe operations
-2. [ ] Document deployment patterns
-3. [ ] Create cluster setup guide
-4. [ ] Design fault-tolerance strategy
-
-### For community coordinator
-1. [ ] Set up monthly call schedule
-2. [ ] Create Discord/Slack workspace
-3. [ ] Draft website redesign plan
-4. [ ] Reach out to potential contributors
-
-### For all
-1. [ ] Review and comment on this roadmap
-2. [ ] Identify personal capacity for next 6 months
-3. [ ] Claim ownership of specific deliverables
-4. [ ] Share roadmap with broader community
-
----
-
-## Appendix A: Related Projects to Consider
-
-### Existing Haskell Projects
-- **Frames**: Alternative dataframe (potential collaboration/consolidation?)
-- **hmatrix**: Linear algebra (ensure compatibility)
-- **statistics**: Statistical computing (modernization candidate)
-- **Chart/hvega**: Visualization (integration targets)
-- **postgresql-simple**: Database connectivity
-- **accelerate**: Array processing with GPU support
-
-### External Integration Targets
-- **Apache Arrow**: Zero-copy data interchange
-- **DuckDB**: Embedded analytical database
-- **ONNX**: Model interchange format
-- **MLflow**: ML lifecycle management
-
----
-
-## Appendix B: Glossary
-
-**Critical Path**: dataframe → statistics → ML toolkit → distributed operations
-**Integration Points**: Where libraries share data structures or APIs
-**Zero-Copy**: Data sharing without duplication in memory
-**Type-Safe**: Compile-time guarantees about data structure and operations
-
----
-
-## Appendix C: Version History
-
-| Version | Date | Changes | Author |
-|---------|------|---------|--------|
-| 1.0 | Nov 2026 | Initial comprehensive roadmap | DataHaskell coordinators |
-
----
-
 ## How to Use This Roadmap

 This is a **living document**. We will:
@@ -595,8 +500,6 @@ This is a **living document**. We will:
 - Celebrate milestones publicly
 - Adapt based on community feedback

-**Contributing**: See [CONTRIBUTING.md] for how to propose changes to this roadmap.
-
 **Questions?** Open a discussion on GitHub or join our community calls.

 ---
