# Search
Search in Crossfeed is powered by Elasticsearch. The code can be found in the backend
directory.
When running Crossfeed locally, the Elasticsearch cluster is run as the
crossfeed_es_1
Docker container. When deployed, we run an Elasticsearch cluster managed
by Amazon Elasticsearch Service.
## Directory structure
The file tasks/es-client.ts
handles interfacing with the Elasticsearch cluster.
## Configuration
To configure properties for Elasticsearch, you can modify
environment variables in .env
in the root directory.
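For example, the Elasticsearch endpoint is set through an environment variable along these lines (the variable name and value here are illustrative; check .env for the actual entries):

```
# Illustrative .env entry: point the backend at the local Elasticsearch container
ELASTICSEARCH_ENDPOINT=http://es:9200
```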
If you need to configure Elasticsearch for deployment, you should update the
env.yml
file. You may also need to update parameters in AWS SSM, as several
environment variables use values that are stored in SSM.
## Kibana
Kibana is a tool that helps visualize and query data stored in Elasticsearch. By default, Kibana is disabled because it adds a lot of overhead to local development and isn't normally required for running Crossfeed locally.
If you want to view a local version of Kibana (for example, to inspect the data in the
local Elasticsearch instance), first uncomment the "kib" section of docker-compose.yml,
re-launch Crossfeed, and then navigate to http://localhost:5601.
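For reference, the "kib" section is a standard Kibana service definition. A minimal sketch of what it may look like (the image tag and settings here are illustrative; see docker-compose.yml for the actual commented-out definition):

```yaml
# Illustrative sketch of the "kib" service in docker-compose.yml
kib:
  image: docker.elastic.co/kibana/kibana:7.9.0
  environment:
    - ELASTICSEARCH_HOSTS=http://es:9200
  ports:
    - 5601:5601
  depends_on:
    - es
```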
## Syncing with the database
All data is populated in the database by other scans; synchronization between the database and Elasticsearch is handled by the searchSync
scan.
The searchSync
scan retrieves all domains, services, vulnerabilities, and webpages that need to be synced, then bulk
uploads them to Elasticsearch. Afterwards, it sets the syncedAt
column on these entities so that they are not synced again
until another scan updates them.
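A minimal sketch of this flow, using the official @elastic/elasticsearch client (the entity shape and helper name here are illustrative, not the actual searchSync implementation):

```typescript
import { Client } from '@elastic/elasticsearch';

// Illustrative entity shape; the real entities live in the backend's models.
interface Domain {
  id: string;
  name: string;
  syncedAt: Date | null;
}

const client = new Client({ node: process.env.ELASTICSEARCH_ENDPOINT });

async function syncDomains(domains: Domain[]): Promise<void> {
  // Bulk API body: one "index" action line followed by the document itself.
  const body = domains.flatMap((domain) => [
    { index: { _index: 'domains', _id: domain.id } },
    domain
  ]);
  await client.bulk({ body });

  // After a successful upload, set syncedAt on these entities in the
  // database so they are skipped until another scan updates them again.
}
```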
## Indexes and mapping
We use a single index called "domains"; its name might change due to reindexing, so the current name is stored as the DOMAINS_INDEX constant in es-client.ts.
The domains index has a mapping. To create or update the mapping, run npm run syncdb
from the backend
directory. This calls
ESClient.syncDomainsIndex(), which updates the index's mapping if the index exists, or creates a new index if it doesn't.
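In outline, that create-or-update logic looks roughly like the following (a sketch using the @elastic/elasticsearch client, not the exact ESClient code):

```typescript
import { Client } from '@elastic/elasticsearch';

const DOMAINS_INDEX = 'domains';

async function syncDomainsIndex(client: Client, mapping: object): Promise<void> {
  const { body: exists } = await client.indices.exists({ index: DOMAINS_INDEX });
  if (exists) {
    // The index already exists: update its mapping in place.
    await client.indices.putMapping({ index: DOMAINS_INDEX, body: mapping });
  } else {
    // The index doesn't exist yet: create it with the mapping.
    await client.indices.create({
      index: DOMAINS_INDEX,
      body: { mappings: mapping }
    });
  }
}
```

Note that Elasticsearch only allows compatible mapping changes to be applied in place; incompatible changes require creating a new index and reindexing, which is why the index name can change and is kept in the DOMAINS_INDEX constant.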
Both services
and vulnerabilities
are stored with the
nested field type. This means that they are stored on the same domain
document, so adding services or vulnerabilities requires updating and reindexing the entire domain document.
However, webpages
are stored with the join field type. This means
that each webpage is stored as a separate document in the "domains" index, but contains a value for the parent_join
field indicating that
the webpage is a child of a domain document. This makes it more efficient to add or remove single webpages, since doing so doesn't require
reindexing all the webpages of a given domain.
So that the webpage fields don't conflict with fields in regular parent domain records, fields in webpage records are stored with the
webpage_
prefix
(see schema here).
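Putting these pieces together, a simplified illustration of the mapping shape (the real mapping in es-client.ts has many more fields):

```typescript
// Simplified sketch of the domains index mapping; field names other than
// parent_join and the webpage_ prefix are illustrative.
const mapping = {
  properties: {
    name: { type: 'text' },
    // services and vulnerabilities are nested inside the domain document:
    services: { type: 'nested' },
    vulnerabilities: { type: 'nested' },
    // webpages are separate child documents linked via the join field:
    parent_join: {
      type: 'join',
      relations: { domain: 'webpage' }
    },
    // webpage fields carry the webpage_ prefix to avoid clashes with
    // fields on parent domain documents:
    webpage_url: { type: 'keyword' },
    webpage_body: { type: 'text' }
  }
};
```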
## Building search queries
The search query is built by the buildRequest function on the frontend. As of now, the logic there roughly corresponds to:
```
(
  (
    (has a domain matching query) OR
    (has a webpage with body matching query)
  )
  AND (matches filters)
)
```
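In Elasticsearch's query DSL, that structure translates to roughly the following bool query (a simplified sketch; the actual query built by buildRequest is more involved, and the field names here are illustrative):

```typescript
const searchTerm = 'example';

const query = {
  bool: {
    must: [
      {
        bool: {
          should: [
            // (has a domain matching query)
            { match: { name: searchTerm } },
            // (has a webpage with body matching query); has_child reaches
            // into the webpage child documents joined via parent_join.
            {
              has_child: {
                type: 'webpage',
                query: { match: { webpage_body: searchTerm } },
                // inner_hits surfaces the matching webpage snippets that
                // are shown alongside each domain result.
                inner_hits: {
                  highlight: { fields: { webpage_body: {} } }
                }
              }
            }
          ]
        }
      }
    ],
    // AND (matches filters)
    filter: [{ term: { fromRootDomain: 'example.gov' } }]
  }
};
```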
Search results are individual domains, but they may include snippets of webpage bodies when the query matched webpage content. For example:
![search result](./img/search result.png)
## Webpage scraping
Webpage scraping is done by the webscraper
scan. This scan uses the scrapy
Python library to follow and scrape all links, while observing
rate limits and respecting robots.txt.
When a webpage is scraped, basic information such as its URL and status code is stored in the database through the Webpage
model. However,
webpage contents and headers are not stored in the database; instead, they are uploaded directly to Elasticsearch.
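As a sketch, uploading a scraped page directly to Elasticsearch as a child document might look like this (illustrative; the actual upload logic lives in the webscraper scan and es-client.ts):

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: process.env.ELASTICSEARCH_ENDPOINT });

// Illustrative helper: index one scraped webpage as a child of its domain.
async function indexWebpage(domainId: string, url: string, bodyText: string) {
  await client.index({
    index: 'domains',
    // Child documents must live on the same shard as their parent, so the
    // request is routed by the parent domain's id.
    routing: domainId,
    body: {
      webpage_url: url,
      webpage_body: bodyText,
      parent_join: { name: 'webpage', parent: domainId }
    }
  });
}
```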