Roadmap

This document outlines the evolution path from the current architecture (a passive, API-driven scanner spike) to a fully operational system. The architecture provides a solid foundation for incremental enhancement across all components.

Principles for development

  • De-risk First: Each phase validates a core assumption before investing in scale.

  • Operational Reality: Features are prioritized based on real deployment needs.

  • Incremental Scaling: The architecture allows for modular upgrades without refactoring the entire system.

Phased development

       +----------------------------------+
       |    Current State:                |
       |    API Scanner Spike             |
       | (Core Integration Validated)     |
       +----------------------------------+
                       |
                       v
       +----------------------------------+
       |   Phase 1: Hardening & I/O       |
       |   Handle real data, be reliable  |
       +----------------------------------+
         |               |               |
         v               v               v
    +-----------+  +-----------+  +-----------+
    | Bulk      |  | Result    |  | Robust    |
    | Targets   |  | Caching   |  | Error     |
    | Input     |  | Layer     |  | Handling  |
    +-----------+  +-----------+  +-----------+
         |               |               |
         +---------------+---------------+
                       |
                       v
           +---------------------------+
           |   Phase 2: Enhanced       |
           |   Discovery               |
           |   Scale the provider layer|
           +---------------------------+
             |           |           |
             v           v           v
       +----------+ +----------+ +----------+
       | Censys   | | Local/   | | Provider |
       | & Shodan | | Private  | | Fallback |
       | Providers| | Datasets | | Logic    |
       +----------+ +----------+ +----------+
             |           |           |
             +-----------+-----------+
                       |
                       v
           +---------------------------+
           |   Phase 3: Operational    |
           |   Integration             |
           |   Connect to workflow     |
           +---------------------------+
             |           |           |
             v           v           v
        +----------+ +------------+ +------------------+
        | Active   | | Automation | | Desk Integration |
        | Probe    | | &          | | for Auto-        |
        | Module   | | Scheduling | | Fingerprint      |
        +----------+ +------------+ +------------------+
             |           |           |
             +-----------+-----------+
                       |
                       v
       +----------------------------------+
       |   Full Operational Scanner       |
       |   (Mature, Multi-Source Tool)    |
       +----------------------------------+

Phase 1: System hardening and I/O scaling

Goal: Transform the spike into a reliable tool for running against thousands of targets, integrating seamlessly with operational data pipelines.

IO/Targets: Bulk Input
Support CIDR ranges (192.168.0.0/24), CSV files, and integration with asset inventories.
Rationale: The spike’s simple target list doesn’t scale. Real-world scans are defined by network blocks and dynamic lists.
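
A minimal sketch of the target-expansion step, using Python’s standard ipaddress and csv modules; the function names and the assumed "ip" CSV column are illustrative, not part of the current codebase.

    import csv
    import ipaddress
    from typing import Iterator

    def expand_cidr(block: str) -> Iterator[str]:
        # "192.168.0.0/24" -> the 254 usable host addresses
        for host in ipaddress.ip_network(block, strict=False).hosts():
            yield str(host)

    def load_csv_targets(path: str, column: str = "ip") -> Iterator[str]:
        # One target per row, under a configurable column header (assumed "ip").
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                if row.get(column):
                    yield row[column].strip()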

Providers: Result Caching
Implement a persistent cache (e.g., SQLite) for API responses, keyed by (provider, query, target).
Rationale: Dramatically reduces API quota usage on repeat scans, increases speed, and allows offline analysis of past results.
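
A sketch of what the persistent cache could look like, assuming JSON-serialized API responses and the (provider, query, target) composite key; the table schema is hypothetical.

    import json
    import sqlite3
    import time

    class ResultCache:
        def __init__(self, path: str = "scan_cache.db"):
            self.db = sqlite3.connect(path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS results ("
                " provider TEXT, query TEXT, target TEXT,"
                " response TEXT, fetched_at REAL,"
                " PRIMARY KEY (provider, query, target))"
            )

        def get(self, provider, query, target, max_age=86400.0):
            row = self.db.execute(
                "SELECT response, fetched_at FROM results"
                " WHERE provider=? AND query=? AND target=?",
                (provider, query, target)).fetchone()
            if row and time.time() - row[1] < max_age:
                return json.loads(row[0])  # fresh hit: skip the API call
            return None

        def put(self, provider, query, target, response):
            self.db.execute(
                "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?, ?)",
                (provider, query, target, json.dumps(response), time.time()))
            self.db.commit()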

Controller: Robust Error Handling
Implement retries with exponential backoff for transient API failures, comprehensive logging, and graceful degradation.
Rationale: The spike assumes a perfect API. Real operations require the scanner to be resilient and to report failures clearly.
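
One way to express the retry policy is a decorator; the exception types, attempt limit, and delays below are placeholder assumptions.

    import functools
    import logging
    import random
    import time

    def with_backoff(max_attempts=5, base_delay=1.0):
        """Retry transient failures with exponential backoff plus jitter."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                for attempt in range(1, max_attempts + 1):
                    try:
                        return fn(*args, **kwargs)
                    except (ConnectionError, TimeoutError) as exc:
                        if attempt == max_attempts:
                            raise  # degrade gracefully upstream
                        delay = base_delay * 2 ** (attempt - 1) + random.random()
                        logging.warning("attempt %d failed (%s); retry in %.1fs",
                                        attempt, exc, delay)
                        time.sleep(delay)
            return wrapper
        return decorator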

IO/Output: Structured Outputs
Extend the JSONL output to support multiple formats: CSV for analysts, Syslog for SIEM integration, and direct database insertion.
Rationale: Results must feed into different parts of the operational pipeline (analyst review, alerting, asset management).
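
A sketch of a pluggable writer layer alongside the existing JSONL output; the flat, homogeneous result dicts are an assumption.

    import csv
    import json

    def write_jsonl(results: list[dict], path: str) -> None:
        with open(path, "w") as fh:
            for record in results:
                fh.write(json.dumps(record) + "\n")

    def write_csv(results: list[dict], path: str) -> None:
        # Analyst-friendly flat export; assumes every dict has the same keys.
        if not results:
            return
        with open(path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(results[0]))
            writer.writeheader()
            writer.writerows(results)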

Engine: Parallel Execution
Refactor planner.py and the provider calls to use async/threading while respecting per-provider rate limits.
Rationale: Scanning 10,000 IPs sequentially is impractical; parallelization is essential for performance.
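
A sketch of the concurrency model with asyncio, using a per-provider semaphore as a stand-in for a real rate limiter; the query body is stubbed out.

    import asyncio

    async def scan(targets, providers, per_provider_limit=5):
        # One semaphore per provider caps in-flight requests independently.
        limits = {p: asyncio.Semaphore(per_provider_limit) for p in providers}

        async def query(provider, target):
            async with limits[provider]:
                await asyncio.sleep(0.1)  # stand-in for the real API call
                return {"provider": provider, "target": target}

        tasks = [query(p, t) for t in targets for p in providers]
        return await asyncio.gather(*tasks)

    results = asyncio.run(scan(["198.51.100.7", "203.0.113.9"], ["netlas"]))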

Phase 2: Enhanced discovery & provider layer

Goal: Increase discovery coverage and resilience by scaling the provider layer beyond the initial Netlas integration.

Providers/: Multi-Provider Support
Implement censys.py and shodan.py providers against the same base.py interface.
Rationale: Different datasets have different coverage; using multiple sources increases the likelihood of finding a target.
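
A sketch of what the shared base.py contract might look like; the class and method names are assumptions about the interface, not a description of the existing code.

    from abc import ABC, abstractmethod

    class BaseProvider(ABC):
        """Hypothetical shape of the contract every provider implements."""

        name: str = "base"

        @abstractmethod
        def search(self, query: str) -> list[dict]:
            """Run a provider-native query and return normalized records."""

    class CensysProvider(BaseProvider):
        name = "censys"

        def search(self, query: str) -> list[dict]:
            raise NotImplementedError("wire the Censys API client in here")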

Providers/: Private Dataset Provider
Create a provider that queries internal, passive data sources (e.g., internal network sensor logs, Zeek data).
Rationale: Critical for scanning internal, non-routable networks that public APIs cannot see.
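
A sketch of a local provider over newline-delimited JSON sensor records, reusing the hypothetical BaseProvider from the previous sketch; the matching logic is deliberately naive.

    import json

    class LocalDatasetProvider(BaseProvider):  # BaseProvider: sketch above
        name = "local"

        def __init__(self, log_path: str):
            self.log_path = log_path

        def search(self, query: str) -> list[dict]:
            # Naive substring match over newline-delimited JSON records.
            hits = []
            with open(self.log_path) as fh:
                for line in fh:
                    record = json.loads(line)
                    if query in json.dumps(record):
                        hits.append(record)
            return hits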

Engine/Planner: Intelligent Query Planning
Enhance planner.py to generate the most efficient query for each provider (e.g., batching IPs, using bulk search endpoints).
Rationale: Optimizes for API cost and speed.
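
A sketch of the batching idea; the batch size and the ip: query syntax are provider-specific assumptions.

    def plan_bulk_queries(ips: list[str], batch_size: int = 100) -> list[str]:
        # Collapse many single-IP lookups into a few OR-joined bulk queries.
        batches = [ips[i:i + batch_size] for i in range(0, len(ips), batch_size)]
        return [" OR ".join(f"ip:{ip}" for ip in batch) for batch in batches]

    queries = plan_bulk_queries([f"10.0.0.{n}" for n in range(1, 251)])
    # 250 targets -> 3 API calls instead of 250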

Providers/Base: Provider Fallback & Blending Logic
Implement logic to try providers in sequence, or to blend their results, when a probe fails or returns empty.
Rationale: Ensures a single API outage or data gap doesn’t break a scan; increases reliability.
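
A sketch of the fallback chain, with blending as a de-duplicating union over a hypothetical "target" field.

    def search_with_fallback(providers, query, blend=False):
        merged, seen = [], set()
        for provider in providers:  # ordered by preference
            try:
                results = provider.search(query)
            except Exception:
                continue  # provider outage: fall through to the next source
            for record in results:
                key = record.get("target")
                if key not in seen:
                    seen.add(key)
                    merged.append(record)
            if merged and not blend:
                break  # first non-empty source wins unless blending
        return merged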

Phase 3: Operational integration

Goal: Integrate the scanner into the full Desk → Forge → Scanner workflow and add controlled active capabilities.

New Component: Active Probe Provider Module
Create a new provider that, when configured and authorized, performs safe, minimal active probes (e.g., TCP SYN, basic HTTP GET). Key point: it implements the same base.py interface.
Rationale: For internal networks or targets with no passive data, a controlled active check is necessary. Keeping it as a provider isolates the risk and complexity.
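
A sketch of a minimal active probe provider, again against the hypothetical BaseProvider shape from Phase 2. A full TCP connect stands in for SYN probing, which would require raw sockets and elevated privileges.

    import socket

    class ActiveProbeProvider(BaseProvider):  # hypothetical contract from above
        name = "active"

        def __init__(self, ports=(80, 443), timeout=2.0):
            self.ports, self.timeout = ports, timeout

        def search(self, query: str) -> list[dict]:
            # Here the query is interpreted as a single target host.
            open_ports = []
            for port in self.ports:
                try:
                    with socket.create_connection((query, port), self.timeout):
                        open_ports.append(port)
                except OSError:
                    pass  # closed, filtered, or unreachable
            return [{"target": query, "open_ports": open_ports}]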

Controller: Automation & Scheduling
Package the scanner for scheduled runs (Docker image, systemd service) and integrate with workflow tools (Apache Airflow, Rundeck) for regular scanning jobs.
Rationale: The scanner must run reliably without manual intervention to provide continuous visibility.
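
For scheduled runs, a minimal Airflow 2.x DAG could wrap the CLI; the DAG name and the scanner command line are assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="nightly_fingerprint_scan",  # hypothetical job name
        schedule="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ):
        BashOperator(
            task_id="run_scanner",
            bash_command="scanner --targets targets.csv --out results.jsonl",
        )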

Integration: Obsidian Desk Integration
Build automation to consume a Desk’s Artefact List and propose or generate a draft fingerprint.yaml.
Rationale: Closes the loop between analysis (Desk) and detection (Scanner), dramatically speeding up fingerprint creation.
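
A sketch of generating a draft fingerprint.yaml from artefacts; the artefact fields and the fingerprint schema shown here are invented for illustration.

    import yaml  # PyYAML

    def draft_fingerprint(name: str, artefacts: list[dict]) -> str:
        # Each artefact is assumed to carry a probe type and an observed value.
        probes = [{"type": a["type"], "match": a["value"],
                   "source": "desk-artefact"} for a in artefacts]
        return yaml.safe_dump({"name": name, "draft": True, "probes": probes},
                              sort_keys=False)

    print(draft_fingerprint("example-panel",
                            [{"type": "http_title", "value": "Admin Login"}]))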

Engine/Evaluator: Confidence Scoring
Extend the evaluator.py and evidence.py model to output a confidence score (e.g., HIGH for 3/3 probes, MEDIUM for 1/1 unique probe) alongside the boolean match.
Rationale: Helps operators prioritize findings; a HIGH-confidence match on a critical vulnerability is a P0 ticket.
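
A sketch of the scoring rule described above; the thresholds are illustrative and would need tuning against real findings.

    def confidence(matched: int, total: int, unique: bool = False) -> str:
        # e.g., 3/3 probes -> HIGH; a single matching unique probe -> MEDIUM
        if total == 0 or matched == 0:
            return "NONE"
        if matched == total and total >= 3:
            return "HIGH"
        if unique or matched / total >= 0.5:
            return "MEDIUM"
        return "LOW"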

Non-goals

To prevent scope creep, the following are explicitly out of scope for the foreseeable future:

  • Graphical User Interface (GUI): The tool is and will remain a CLI/API-driven service.

  • Real-Time Streaming Analysis: The scanner operates on defined target lists and schedules, not as a live packet analysis engine.

  • Exploitation or Post-Exploitation: The scanner’s purpose is identification and fingerprinting. It will never contain payloads or exploit code.

  • Stealth or Evasion Techniques: The active probe module will perform straightforward, identifiable requests. It is a diagnostic tool, not a penetration testing tool.

Success metrics

The transition from spike to operational tool will be measured by:

  1. Reliability: A scan of 10,000 targets runs for 24 hours without crashing, while respecting all API limits.

  2. Coverage: Successfully identifies targets across multiple data sources (2+ public APIs, 1 internal source).

  3. Integration: Produces outputs consumed by at least two other operational systems (e.g., SIEM, ticketing system).

  4. Speed: Time from fingerprint definition (fingerprint.yaml) to first results in production is under 1 hour.

This roadmap uses the spike architecture as a stable foundation, detailing what to build next while respecting the clear separation of concerns already defined. The path is incremental, de-risked, and focused on operational utility.