Roadmap
This document outlines the path from the current architecture, a passive, API-driven scanner spike, to a fully operational system. The architecture is designed so that each component can be enhanced incrementally.
Principles for development
De-risk First: Each phase validates a core assumption before investing in scale.
Operational Reality: Features are prioritized based on real deployment needs.
Incremental Scaling: The architecture allows for modular upgrades without refactoring the entire system.
Phased development

             +-----------------------------------+
             |          Current State:           |
             |         API Scanner Spike         |
             |   (Core Integration Validated)    |
             +-----------------------------------+
                               |
                               v
             +-----------------------------------+
             |     Phase 1: Hardening & I/O      |
             |   Handle real data, be reliable   |
             +-----------------------------------+
               |               |               |
               v               v               v
         +-----------+   +-----------+   +-----------+
         |   Bulk    |   |  Result   |   |  Robust   |
         |  Targets  |   |  Caching  |   |   Error   |
         |   Input   |   |   Layer   |   | Handling  |
         +-----------+   +-----------+   +-----------+
               |               |               |
               +---------------+---------------+
                               |
                               v
                 +---------------------------+
                 |     Phase 2: Enhanced     |
                 |         Discovery         |
                 | Scale the provider layer  |
                 +---------------------------+
               |               |               |
               v               v               v
         +-----------+   +-----------+   +-----------+
         |  Censys   |   |  Local/   |   | Provider  |
         | & Shodan  |   |  Private  |   | Fallback  |
         | Providers |   | Datasets  |   |   Logic   |
         +-----------+   +-----------+   +-----------+
               |               |               |
               +---------------+---------------+
                               |
                               v
                 +---------------------------+
                 |   Phase 3: Operational    |
                 |        Integration        |
                 |    Connect to workflow    |
                 +---------------------------+
               |               |               |
               v               v               v
        +-------------+ +-------------+ +-------------+
        |   Active    | | Automation  | |    Desk     |
        |    Probe    | |      &      | | Integration |
        |   Module    | | Scheduling  | |  for Auto-  |
        |             | |             | | Fingerprint |
        +-------------+ +-------------+ +-------------+
               |               |               |
               +---------------+---------------+
                               |
                               v
             +-----------------------------------+
             |     Full Operational Scanner      |
             |    (Mature, Multi-Source Tool)    |
             +-----------------------------------+

System hardening and I/O scaling
Goal: Transform the spike into a reliable tool for running against thousands of targets, integrating seamlessly with operational data pipelines.

| Component | Enhancement | Rationale |
|---|---|---|
| IO/Targets | Bulk Input: Support CIDR ranges ( | The spike’s simple target list doesn’t scale. Real-world scans are defined by network blocks and dynamic lists. |
| Providers | Result Caching: Implement a persistent cache (e.g., SQLite) for API responses. Key by | Dramatically reduces API quota usage for repeat scans, increases speed, and allows offline analysis of past results. |
| Controller | Robust Error Handling: Implement retries with exponential backoff for transient API failures, comprehensive logging, and graceful degradation. | The spike assumes a perfect API. Real operations require the scanner to be resilient and report failures clearly. |
| IO/Output | Structured Outputs: Extend JSONL output to support multiple formats (CSV for analysts, Syslog for SIEM integration, direct database insertion). | Results must feed into different parts of the operational pipeline (analyst review, alerting, asset management). |
| Engine | Parallel Execution: Refactor the | Scanning 10,000 IPs sequentially is impractical. Parallelization is essential for performance. |

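
Two of the Phase 1 items, bulk CIDR input and retries with exponential backoff, can be sketched with the Python standard library alone. This is a minimal illustration, not the scanner's actual code: `expand_targets`, `with_retries`, and `TransientError` are hypothetical names, and the delay parameters are arbitrary defaults chosen for the example.

```python
import ipaddress
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, HTTP 429/5xx)."""


def expand_targets(entries):
    """Yield individual IP strings from single addresses and CIDR blocks."""
    for entry in entries:
        entry = entry.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments in a target file
        # strict=False tolerates host bits being set, e.g. "192.0.2.5/30".
        network = ipaddress.ip_network(entry, strict=False)
        # Iterating a network yields every address in the block,
        # including the network and broadcast addresses.
        for address in network:
            yield str(address)


def with_retries(call, max_attempts=4, base_delay=0.5, max_delay=30.0,
                 sleep=time.sleep):
    """Run call(); on TransientError, wait and retry with a doubled delay."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


if __name__ == "__main__":
    print(list(expand_targets(["198.51.100.7", "192.0.2.0/30"])))
```

Because `expand_targets` is a generator, a large block such as a /16 is never materialized in memory. Passing `sleep` as a parameter keeps the retry logic testable without real waiting.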
Enhanced discovery & provider layer
Goal: Increase discovery coverage and resilience by scaling the provider layer beyond the initial Netlas integration.

| Component | Enhancement | Rationale |
|---|---|---|
| Providers/ | Multi-Provider Support: Implement | Different datasets have different coverage. Using multiple sources increases the likelihood of finding a target. |
| Providers/ | Private Dataset Provider: Create a provider that queries internal, passive data sources (e.g., internal network sensor logs, Zeek data). | Critical for scanning internal, non-routable networks that public APIs cannot see. |
| Engine/Planner | Intelligent Query Planning: Enhance | Optimizes for API cost and speed. |
| Providers/Base | Provider Fallback & Blending Logic: Implement logic to try providers in sequence or blend results if a probe fails or returns empty. | Ensures a single API outage or data gap doesn’t break a scan. Increases reliability. |

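
The fallback and blending behavior described for Providers/Base could look roughly like the sketch below. Everything here is an assumption made for illustration: the `lookup(target)` method, the `ProviderError` exception, the `(ip, port)` merge key, and the `StaticProvider` stand-in are hypothetical and do not reflect the scanner's real provider interface. The provider names are used only as labels, and no external API is called.

```python
class ProviderError(Exception):
    """Hypothetical error a provider raises on an outage or quota failure."""


class StaticProvider:
    """Stand-in provider for the demo; a real one would query an API."""

    def __init__(self, name, data=None, fail=False):
        self.name = name
        self._data = data or {}
        self._fail = fail

    def lookup(self, target):
        if self._fail:
            raise ProviderError(f"{self.name} unavailable")
        return list(self._data.get(target, []))


def lookup_with_fallback(providers, target):
    """Try providers in order; return the first non-empty result."""
    errors = {}
    for provider in providers:
        try:
            records = provider.lookup(target)
        except ProviderError as exc:
            errors[provider.name] = str(exc)  # keep the failure for reporting
            continue
        if records:
            return provider.name, records, errors
    return None, [], errors


def lookup_blended(providers, target):
    """Query every provider and merge records that share (ip, port)."""
    merged = {}
    for provider in providers:
        try:
            records = provider.lookup(target)
        except ProviderError:
            continue  # one failing source must not break the blend
        for record in records:
            key = (record["ip"], record["port"])
            # The first provider to report a key supplies its fields.
            entry = merged.setdefault(key, {**record, "sources": []})
            entry["sources"].append(provider.name)
    return list(merged.values())


if __name__ == "__main__":
    providers = [
        StaticProvider("netlas", fail=True),
        StaticProvider("censys"),
        StaticProvider("shodan", {"192.0.2.10": [{"ip": "192.0.2.10", "port": 443}]}),
    ]
    print(lookup_with_fallback(providers, "192.0.2.10"))
```

Fallback minimizes API spend because it stops at the first useful answer, while blending trades extra queries for broader coverage and records which sources agreed on each finding.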
Operational integration
Goal: Integrate the scanner into the full Desk → Forge → Scanner workflow and add controlled active capabilities.

| Component | Enhancement | Rationale |
|---|---|---|
| New Component | Active Probe Provider Module: Create a new provider that, when configured and authorized, performs safe, minimal active probes (e.g., TCP SYN, basic HTTP GET). Key: It implements the same | For internal networks or targets with no passive data, a controlled active check is necessary. Keeping it as a provider isolates the risk and complexity. |
| Controller | Automation & Scheduling: Package the scanner for scheduled runs (Docker image, systemd service). Integrate with workflow tools (Apache Airflow, Rundeck) for regular scanning jobs. | The scanner must run reliably without manual intervention to provide continuous visibility. |
| Integration | Obsidian Desk Integration: Build automation to consume a Desk’s | Closes the loop between analysis (Desk) and detection (Scanner), dramatically speeding up the fingerprint creation process. |
| Engine/Evaluator | Confidence Scoring: Extend the | Helps operators prioritize findings. A |

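
To illustrate how an active probe can stay behind the provider boundary and remain off unless explicitly configured and authorized, here is a hedged sketch. The class name, the `lookup` signature, the record shape, and the allowlist of authorized networks are all assumptions for the example. It also substitutes a plain TCP connect for the TCP SYN probe mentioned in the table, since a raw SYN requires elevated privileges and a connect is the simplest identifiable check.

```python
import ipaddress
import socket


class ActiveProbeProvider:
    """Hypothetical provider that checks reachability with a TCP connect."""

    name = "active-probe"

    def __init__(self, authorized_networks, enabled=False, timeout=2.0):
        self.enabled = enabled  # off by default: must be switched on explicitly
        self.authorized = [ipaddress.ip_network(n) for n in authorized_networks]
        self.timeout = timeout

    def _is_authorized(self, target):
        address = ipaddress.ip_address(target)
        return any(address in network for network in self.authorized)

    def lookup(self, target, ports=(80, 443)):
        """Return one record per port that accepts a TCP connection."""
        if not self.enabled:
            raise PermissionError("active probing is disabled in configuration")
        if not self._is_authorized(target):
            raise PermissionError(f"{target} is outside the authorized networks")
        records = []
        for port in ports:
            try:
                # A full connect, closed immediately; no payload is sent.
                with socket.create_connection((target, port), timeout=self.timeout):
                    records.append({"ip": target, "port": port, "source": self.name})
            except OSError:
                continue  # closed, filtered, or timed out: not reported as open
        return records


if __name__ == "__main__":
    probe = ActiveProbeProvider(["127.0.0.0/8"], enabled=True, timeout=0.5)
    print(probe.lookup("127.0.0.1", ports=(80,)))
```

Both guards run before any packet leaves the host, so a misconfigured target list fails loudly instead of probing an unapproved network.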
Non-goals
To prevent scope creep, the following are explicitly out of scope for the foreseeable future:
Graphical User Interface (GUI): The tool is and will remain a CLI/API-driven service.
Real-Time Streaming Analysis: The scanner operates on defined target lists and schedules, not as a live packet analysis engine.
Exploitation or Post-Exploitation: The scanner’s purpose is identification and fingerprinting. It will never contain payloads or exploit code.
Stealth or Evasion Techniques: The active probe module will perform straightforward, identifiable requests. It is a diagnostic tool, not a penetration testing tool.
Success metrics
The transition from spike to operational tool will be measured by:
Reliability: Can run a scan of 10,000 targets over 24 hours without crashing, respecting all API limits.
Coverage: Successfully identifies targets across multiple data sources (2+ public APIs, 1 internal source).
Integration: Produces outputs consumed by at least two other operational systems (e.g., SIEM, ticketing system).
Speed: Time from fingerprint definition (fingerprint.yaml) to first results in production is under 1 hour.
This roadmap uses the spike architecture as a stable foundation and details what to build next while respecting the separation of concerns the spike already defines. The path is incremental, de-risked, and focused on operational utility.