UmdTask430_Data605_Spring2026_txtai_for_market_research_1#452

Open
Gauravp2104 wants to merge 7 commits into gpsaggese:master from Gauravp2104:UmdTask430_DATA605_Spring2026_txtai_for_market_research_1

@Gauravp2104

Related to #430

Progress update 1

  • Project template files (Dockerfile, requirements.txt, shell scripts)
  • README with architecture overview and setup instructions

Next steps

  • Implement txtai embeddings pipeline
  • Add data ingestion tools (NewsAPI, SEC EDGAR, web scraper)
  • Build out individual agents (sentiment, diligence, web research, earnings, regulatory)

Reviewers: @gpsaggese @protocorn
Assignees: @Gauravp2104 @SanjanaK1801

Gaurav Prakash and others added 3 commits April 1, 2026 10:22
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Architecture:
- Hot tier: KeyDB for caching and live data
- Warm tier: PostgreSQL + pgvector for filings and embeddings
- Cold tier: MinIO for raw document archive
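
The tiering above suggests a read-through lookup: check the hot cache first, then the warm store, then the cold archive. A minimal sketch of that idea, assuming plain dicts as stand-ins for KeyDB, PostgreSQL, and MinIO; `TieredStore` and its `get` method are hypothetical names for illustration, not code from this PR:

```python
from typing import Dict, Optional


class TieredStore:
    """Read-through lookup across hot, warm, and cold tiers (illustrative only)."""

    def __init__(self) -> None:
        # Plain dicts stand in for KeyDB (hot), PostgreSQL (warm), MinIO (cold).
        self.hot: Dict[str, str] = {}
        self.warm: Dict[str, str] = {}
        self.cold: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        # 1. Hot tier: cheapest lookup, return immediately on a hit.
        if key in self.hot:
            return self.hot[key]
        # 2. Warm, then cold: on a hit, copy the value into the hot tier
        #    so the next read is served from cache (an assumed policy).
        for tier in (self.warm, self.cold):
            if key in tier:
                self.hot[key] = tier[key]
                return tier[key]
        # 3. Not found in any tier.
        return None
```

A real client would also need expiry on the hot tier and error handling per backend; the sketch only shows the lookup order.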

Components:
- Storage clients (KeyDB, PostgreSQL, MinIO) with connection pooling
- FilingsManager for high-level warm tier operations
- SEC EDGAR collector with full pipeline support
- Data collectors for news, web, and social sources
- Ingestion pipeline orchestrator
- Docker Compose for local infrastructure

Scripts:
- run_sec_collector.py: CLI for SEC filings collection

Infrastructure:
- docker-compose.yml: KeyDB, pgvector, MinIO services
- sql/init.sql: Database schema with vector indexes
- .gitignore: Comprehensive exclusions for Python, IDE, secrets
@protocorn
Collaborator

This PR currently includes a very large number of unrelated file changes.

Your PR is expected to include only your own project folder under:
class_project/data605/Spring2026/projects/

Please remove unrelated repository-wide changes and ensure that your PR only contains the files required for your project submission.

Gaurav Prakash and others added 4 commits May 7, 2026 01:47
Adds an end-to-end agentic search system on top of the existing collector
storage. New components:

- app/agents/research_agent.py: streaming agent with router, SEC/news
  sub-agents, and an extractive synthesizer (LLM-pluggable).
- app/api/server.py: FastAPI service with /research (sync) and
  /research/stream (SSE) endpoints.
- app/ui/research.py: Streamlit UI showing live agent trace then
  collapsing to a clean answer + sources view.
- scripts/eval_research.py: benchmark harness reporting p50/p95/p99
  latency, routing accuracy, and retrieval health.
- scripts/run_all_collectors.py, run_sec_bulk.py, backfill_txtai_from_chunks.py,
  check_storage_status.py: bulk collection + index utilities.
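
For readers unfamiliar with the metrics the eval harness reports, here is a rough sketch of how p50/p95/p99 latency and routing accuracy could be computed. It assumes nearest-rank percentiles and exact-match route labels; `percentile` and `summarize` are hypothetical helpers, and scripts/eval_research.py may compute these differently:

```python
import math
from typing import Dict, List, Sequence


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending, non-empty sequence."""
    # Rank is 1-based: the smallest rank covering pct percent of the samples.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(
    latencies_ms: List[float],
    predicted_routes: List[str],
    expected_routes: List[str],
) -> Dict[str, float]:
    """Summarize one benchmark run (assumed input shapes, for illustration)."""
    ordered = sorted(latencies_ms)
    # A route counts as correct only when it matches the expected label exactly.
    correct = sum(p == e for p, e in zip(predicted_routes, expected_routes))
    return {
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "routing_accuracy": correct / len(expected_routes),
    }
```

Nearest-rank is chosen here because it always returns an observed latency rather than an interpolated one, which keeps tail numbers easy to trace back to a specific query.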

Removes the social and web collectors; the project now sources data only
from SEC EDGAR and news APIs (NewsAPI + Alpha Vantage). Updates the
embeddings search() helper to expose per-chunk metadata (ticker,
filing_type, filing_date) by joining txtai's data column.
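
The metadata exposure described above can be pictured roughly as follows. This is a sketch under the assumption that each search hit is a dict whose `data` field holds the chunk metadata as a JSON string (or an already-parsed dict); `flatten_hit` and the sample field layout are hypothetical, and the PR's actual search() helper may differ:

```python
import json
from typing import Any, Dict

# Metadata keys named in the commit message.
METADATA_FIELDS = ("ticker", "filing_type", "filing_date")


def flatten_hit(hit: Dict[str, Any]) -> Dict[str, Any]:
    """Lift per-chunk metadata out of a hit's `data` field (assumed layout)."""
    raw = hit.get("data")
    if isinstance(raw, str):
        # Stored as a JSON string; an empty string means no metadata.
        meta = json.loads(raw) if raw else {}
    else:
        # Already a dict, or missing entirely.
        meta = raw or {}
    result = {"id": hit.get("id"), "text": hit.get("text"), "score": hit.get("score")}
    for field in METADATA_FIELDS:
        # Missing metadata becomes None rather than raising.
        result[field] = meta.get(field)
    return result
```

Returning `None` for absent fields lets callers such as a sources view render partial metadata without special-casing older chunks.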

RUN_INSTRUCTIONS.md gains a quickstart section for new users covering
docker-compose, collectors, API/UI bring-up, and the eval harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updated README to reflect project focus and removed outdated architecture details.
2 participants