Competitive Programmer | Computer Science Student | Scientific Researcher

Organizations

@PET-SI-UFU @C4AI

costadev00/README.md

Matheus Costa Monteiro

AI Research Engineer | LLM Data Pipelines | Model Evaluation | M.Sc. Student at USP | C4AI

I am an AI Research Engineer focused on building reliable data and evaluation pipelines for Large Language Models.

I currently work on research and engineering for Brazilian Portuguese language models: dataset curation, tokenization workflows, model evaluation, post-training datasets, and reproducible tooling for LLM development.

My background combines machine learning engineering, graph algorithms, competitive programming, backend systems, and applied research in heterogeneous GPU scheduling for AI workloads.


Featured work

Python library and CLI for token counting, token distribution analysis, and dataset inspection in LLM data workflows.

Focus: LLM data pipelines, tokenization, Hugging Face datasets, and reproducible reports.

Comparative study of list scheduling heuristics for heterogeneous GPU environments inspired by AI training workloads.

Algorithms: DLS, HEFT, HEFT LA, PEFT, IHEFT, and IPEFT.

Repository with implementations, experiments, reports, and study material related to NLP, deep learning, and modern AI systems.

Focus: Transformers, multimodal models, NLP experiments, and reproducible learning projects.

Public datasets and resources related to Brazilian Portuguese LLM development, instruction data, evaluation, and data-centric AI workflows.

Focus: dataset curation, pretraining data, instruction tuning data, and LLM evaluation.


Background

I have experience in software engineering, machine learning engineering, and academic research.

My work has involved:

Artificial Intelligence: LLM data pipelines, dataset curation, model evaluation, post-training datasets, tokenization workflows, and reproducible ML tooling.
Research: Scheduling algorithms, graph theory, heterogeneous GPU environments, performance evaluation, and AI workload optimization.
Software Engineering: Backend systems, Python tooling, APIs, automation, JavaScript applications, TypeScript applications, and production-oriented development.
Competitive Programming: Algorithms, data structures, optimization, problem solving, and ICPC-style contests.


Pinned

  1. C4AI/token-counter Public

    Python library + CLI for token counting and token distribution analysis in text datasets for LLM data workflows.

    Python, 8 stars

  2. wikipedia-dump Public

    Extracts Brazilian Portuguese (pt-BR) Wikipedia dumps, updated through Feb 2026. Datasets: https://huggingface.co/datasets/costadev00/wikipedia-pt-br-extract and https://huggingface.co/datasets/costadev00/wikibooks-pt-br-extract

    Python, 5 stars

  3. competitive_programming Public

    Classic algorithms that I use in competitions

    C++, 6 stars, 1 fork

  4. scheduling-algorithms-for-llm-training-in-heterogeneous-gpu Public

    Matheus Monteiro's research work on algorithm optimization and artificial intelligence. Paper: https://repositorio.ufu.br/handle/123456789/47466

    Jupyter Notebook, 1 star

  5. jogo-da-velhAI Public

    Artificial intelligence built to play tic-tac-toe ("jogo da velha"). It implements the adversarial search algorithm MinMaxAlfaBeta (minimax with alpha-beta pruning). The goal of this project is educational, intended for presentation to…

    Python, 19 stars

  6. MCC_UFU Public

    Repository created to share code from the Mathematics for Computer Science course, part of the Information Systems program at the Universidade Federal de Uberlândia

    C++, 10 stars