Data Engineer / AI Developer (Python, Airflow, LLMs)
Location: Remote (preferably CET time zone)
About Startup Researcher
Startup Researcher is building the most reliable intelligence platform for private tech investments — combining media, data, and AI.
Our products span:
* Media Hub → editorial news & signals on startups, investors, and markets
* Data Hub → a live database of startups, investors, and funding rounds
* Data Engine → the backend AI system that collects, cleans, and structures the world’s startup data
We’re hiring a dedicated developer for the Data Engine — the core AI-powered pipeline behind our entire ecosystem.
Role Overview
You’ll be the main developer of the Data Engine repository, responsible for designing, maintaining, and evolving our end-to-end ETL pipelines.
You’ll work closely with the founders, experimenting with LLM-based extraction, web scraping, and data enrichment pipelines — and have full freedom to explore, test, and deploy new ideas.
This is a hands-on, highly autonomous, and exploratory role for a builder who loves both Python craftsmanship and AI research.
Responsibilities
* Design and build modular ETL pipelines using Airflow (2.x / 3.x): ingest → normalize → enrich → load (a minimal sketch of this shape follows the list).
* Develop custom scrapers for startup-related sources (e.g., news websites, VC portfolios, startup websites…).
* Integrate LLMs (e.g., Gemini) to extract structured information (e.g., funding rounds, acquisitions) from unstructured text.
* Transform raw text data (HTML, RSS, JSON, PDFs) into structured database entities aligned with our Data Hub schema.
* Maintain clean, production-ready code — versioned, tested, and easy to extend for future engineers.
* Collaborate asynchronously: share progress, results, and new ideas for improvement.
* Continuously research new AI/LLM techniques relevant to automated data extraction and summarization.
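To give a feel for the ingest → normalize → enrich → load shape mentioned above, here is a minimal sketch using Airflow's TaskFlow API (2.x-style). The DAG name, schedule, source, and record fields are illustrative assumptions, not our production code.

```python
# Minimal sketch of a four-stage pipeline; everything here is placeholder data.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def startup_news_pipeline():
    @task
    def ingest() -> list[dict]:
        # Fetch raw items from a source (RSS feed, sitemap, API, ...).
        return [{"url": "https://example.com/article", "html": "<html>...</html>"}]

    @task
    def normalize(items: list[dict]) -> list[dict]:
        # Strip boilerplate, deduplicate, and map items to a common record shape.
        return [{"url": item["url"], "text": item["html"]} for item in items]

    @task
    def enrich(records: list[dict]) -> list[dict]:
        # e.g., call an LLM to extract entities such as funding rounds.
        return records

    @task
    def load(records: list[dict]) -> None:
        # Upsert the structured records into the Data Hub database.
        ...

    load(enrich(normalize(ingest())))


startup_news_pipeline()
```

In practice each stage would sit behind reusable hooks, retries, and tests, but the task boundaries stay the same.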
Example Projects You’ll Work On
* Build a news scraper that extracts articles from multiple African tech media sites, cleans the content, and loads it into our CMS.
* Create a funding-round detector: use keyword filters plus LLM reasoning to identify and structure funding events from news articles, then load the data into the Data Hub (see the sketch after this list).
* Enrich startup profiles by scraping and standardizing external data before loading it into the Data Hub.
* Implement AI-powered classification of startups by sector, stage, and market using embeddings and few-shot models.
* Build a data validation framework to detect duplicates or inconsistencies across incoming pipelines.
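As an illustration of the funding-round detector idea (a cheap keyword pre-filter followed by an LLM extraction pass), here is a hedged sketch. The prompt, model name, environment variable, and output fields are assumptions made for this example, not our actual schema or prompts.

```python
# Rough sketch: keyword pre-filter + Gemini extraction. Placeholder prompt and fields.
import json
import os
import re

import google.generativeai as genai  # assumes the google-generativeai SDK

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # env var name is illustrative

FUNDING_KEYWORDS = re.compile(
    r"\b(raises?|raised|funding round|pre-seed|seed|series [a-d])\b", re.IGNORECASE
)

PROMPT = (
    "Extract the funding event from this article as JSON with keys "
    "startup, amount, currency, round, investors. Return only JSON.\n\n{text}"
)


def detect_funding_round(article_text: str) -> dict | None:
    # Keyword pre-filter so LLM calls are only spent on likely candidates.
    if not FUNDING_KEYWORDS.search(article_text):
        return None

    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is illustrative
    response = model.generate_content(PROMPT.format(text=article_text))

    try:
        return json.loads(response.text)
    except json.JSONDecodeError:
        # In a real pipeline, malformed output would go to a retry or review queue.
        return None
```

A production version would also validate the extracted fields against the Data Hub schema before loading.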
Must-Have Skills
* Excellent Python (async programming, typing, modular design).
* Strong experience with Airflow and ETL orchestration; knowledge of Airflow 3.x is a huge plus.
* Hands-on experience with LLMs / Generative AI — fine-tuning, prompt design, or API integration.
* Web scraping expertise (BeautifulSoup, Requests, Selenium, or async methods).
* Experience with Git / GitHub workflows, CI/CD, and code versioning best practices.
* Strong understanding of data transformation & normalization principles.
* Comfortable working autonomously, taking ownership of a complex repository.
Bonus Points
* Own projects or demos involving LLMs, embeddings, or data pipelines.
* Familiarity with startup ecosystems — funding rounds, VCs, acquisitions, exits.
* Experience with Docker and PostgreSQL (SQLAlchemy, Alembic).
* Interest in data accuracy, knowledge graphs, or AI-driven data enrichment.
Annual-based salary
Rabat, Rabat-Salé-Kénitra, Morocco