Dark Web: Scanning the dark web like a threat intel company


Back in 2015 (before the dark web was “trendy”) I started writing code to scan the dark web (specifically Tor). It grew out of a conversation with Daniel Cuthbert (@danielcuthbert) about whether you could use the same techniques for mapping infrastructure on the surface web (the normal bit of the web) against dark web servers.

Over the years the number of companies selling “Dark Web” data as part of their platform has grown, but it doesn’t have to be expensive or complicated to build your own “Dark Web Threat Intel” platform.

DISCLAIMER: You will find some websites on Tor that contain illicit/illegal material. Be sensible about what you view, and be willing to accept the risk/liability if you do something stupid.

We have 4 objectives that we want to achieve.

  1. Validate a list of dark web servers
  2. Build your Tor infrastructure (proxies, load balancers)
  3. Run your Tor scraper
  4. Analyse the results

To achieve our goals we are going to use the following technology (in no particular order).

  • Python
  • Docker
  • Scrapy
  • Python
  • Some more Python

NOTE: All code in this post is based on Python 3.x

The good news for you, is that I’ve done most of the hard work to get this up and running and all you need to do is to follow the instructions as we go. All the code is up in the GitHub repo which you can find HERE.

And off we go…

Validate a list of dark web servers

All servers that run within the Tor network have the TLD .onion, but unlike the normal domains/URLs that you might be used to seeing, these can look like a jumble of random characters and numbers. That said, there are rules governing what a .onion address can look like. Addresses now come in two different lengths: either 16 characters or 56 characters long (not including the .onion suffix), and they will only contain the letters a-z and the numbers 2-7. The code below shows how to use Python and the re library to extract .onion addresses from blocks of text.

import re

def extract_onion_links(html):
    # match both 16-character (v2) and 56-character (v3) addresses
    short = re.compile(r'[a-z2-7]{16}\.onion', re.DOTALL | re.MULTILINE)
    longer = re.compile(r'[a-z2-7]{56}\.onion', re.DOTALL | re.MULTILINE)
    links = re.findall(short, html)
    links.extend(re.findall(longer, html))
    return set(links)

This code is used in the web scraper we will talk about later, but the function is the same regardless of what data you feed it.
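As a quick sanity check, here is the function run over a sample block of text (repeated here so the snippet runs standalone; the 16-character address is made up for illustration):

```python
import re

def extract_onion_links(html):
    # match both 16-character (v2) and 56-character (v3) addresses
    short = re.compile(r'[a-z2-7]{16}\.onion')
    longer = re.compile(r'[a-z2-7]{56}\.onion')
    links = re.findall(short, html)
    links.extend(re.findall(longer, html))
    return set(links)

sample = 'Mirror: http://abcdefghij234567.onion (plus some noise)'
print(extract_onion_links(sample))  # {'abcdefghij234567.onion'}
```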

Now we have the code, we need a list to start us off; in this instance I used the Hunchly Dark Web Report. Once you’ve signed up you get daily emails which give you a link to download the latest lists. Download the list (it comes as a Microsoft Excel document), and then copy the Hidden Services from the Up and Down sheets into a text document.

Now we need to sort those, validate them against the regular expressions above and remove any duplicates. The Python code for this is below; you will need to change the input filename and the output filename.

import re

def extract_onion_links(html):
    short = re.compile(r'[a-z2-7]{16}\.onion', re.DOTALL | re.MULTILINE)
    longer = re.compile(r'[a-z2-7]{56}\.onion', re.DOTALL | re.MULTILINE)
    links = re.findall(short, html)
    links.extend(re.findall(longer, html))
    return set(links)

# change yourfilename.txt to match the path and filename of the text file you just created
input_text = open('yourfilename.txt', 'r').read()
links = extract_onion_links(input_text)

# change thenameofyouroutputlist.txt to the name of the file you want as output
output = open('thenameofyouroutputlist.txt', 'w')
for l in links:
    output.write(l + '\n')
output.close()

All being well you should have a neat list of .onion addresses that all match the rules defined above. If you want to skip this part then you can download the list HERE.

Build your Tor infrastructure

Now we have a list of .onion addresses we need to build some infrastructure so we can run our web scraper. Over the years I have messed around with the best way to scan services within Tor at a sensible rate (not too fast, not too slow). In the end my preferred way is to use Docker to build a set of servers running Tor and Privoxy, with a load balancer in front of them.

Random side note (not to be taken seriously): Over the years I’ve built/designed and managed some high volume e-commerce sites with load balancers and I’m 90% sure most of the world’s internet problems can be solved with a load balancer…just saying…

To build the infrastructure part of this project we need a host machine that we can install Docker on. For my testing I used a VM running Ubuntu 20.04 with 2GB of RAM, a 20GB HDD and 1 processor. It really doesn’t need a lot of resources; the web scraper ran through over 5000 addresses in about 8 hours.

For the purposes of this blog post I’m not going to go into detail about installing Docker and docker-compose. Instead I will just provide some handy links below.

NOTE: If at this point you are like, “oh god, I have to build VMs and install stuff”, don’t worry: you can skip this part (well, most of this blog) and download the results.json file from the repo.

Random side note (to be taken seriously): I have to give massive thanks to Digital Ocean for the sheer number of install guides they produce for a whole range of things, it makes my life much easier

Once you have your VM up and running (or your laptop whatever you fancy), you just need to build the docker images and get them running. The config files are HERE and can be stored in the same directory. Once you have downloaded them you can just run:

docker-compose up

This should build you 3 Tor containers and an HAProxy container; HAProxy is responsible for directing traffic from your web scraper to each of the Tor containers and for making sure that they are accepting traffic.

The HA proxy exposes two TCP ports that you can connect to:

  • 8118: This is the port for the proxy
  • 8888: This is just a HA proxy stats web page under /stats

You need to make a note of the IP address that your docker containers are running on as you will need to configure the web scraper with the correct proxy address in the next section.

Run your Tor scraper

All being well you will now have a list of .onion addresses and a running set of infrastructure that we can proxy requests through. Now onto probably the most interesting part: our own Tor web scraper.

For this use case the Tor web scraper has been written using the Python library scrapy (which is awesome by the way). We feed the web scraper a list of .onion addresses and it attempts to connect to each one.

If the connection succeeds, a number of extractors built into the web scraper look for the following artifacts in the HTML content of the first page it connected to.

  • Bitcoin addresses
  • Email addresses
  • PGP blocks (signature or public blocks)
  • Google tracking codes
  • .onion addresses

In addition it also attempts to get the following:

  • Favicon hash (md5)
  • Robots.txt (Disallowed entries only)
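The favicon hash mentioned above is just an MD5 digest of the raw favicon bytes; a minimal sketch (the function name here is mine, not the scraper's):

```python
import hashlib

def favicon_hash(content: bytes) -> str:
    # MD5 of the raw favicon bytes -- two servers serving the same
    # favicon file will produce the same hash, giving a pivot point
    return hashlib.md5(content).hexdigest()

print(favicon_hash(b''))  # d41d8cd98f00b204e9800998ecf8427e
```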

We also store the following information from the web scraper response:

  • HTTP headers
  • Web page encoding
  • Web page title
  • List of HTTP redirects (if any)

NOTE: Running web scrapers (or any kind of connection) into the Tor network is not 100% reliable; you will miss live servers, or they might be offline at the time, etc.

The extractors that are used are all based on regular expressions, if you are interested in how those work you can find them HERE.
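The exact patterns live in the repo, but to give a flavour, a Bitcoin address extractor along the same lines might look like this (this sketch only covers legacy base58 addresses, not bech32 bc1... ones; the sample address is the well-known genesis block address):

```python
import re

# Legacy (P2PKH/P2SH) addresses: start with 1 or 3, followed by
# 25-34 base58 characters (base58 excludes 0, O, I and l)
btc_pattern = re.compile(r'\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b')

def extract_bitcoin(html):
    return set(btc_pattern.findall(html))

sample = 'Send donations to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa please'
print(extract_bitcoin(sample))  # {'1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa'}
```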

To get the web scraper running on your own environment the easiest way is to clone the entire GitHub repo which will give you access to all the files relating to this blog post.

git clone https://github.com/catalyst256/CyberNomadResources

Now you just need to navigate into the directory for this blog post.

cd CyberNomadResources/Darkweb/ScanningTheDarkWeb/WebScraper

and then we can install the Python libraries;

pip3 install -r requirements.txt

Before you run the web scraper for the first time you need to update the IP address of the proxy. In the settings.py file for the web scraper, scroll to the end and update the HTTP_PROXY variable with the IP address of your VM running the docker containers.
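For context, Scrapy routes a request through a proxy whenever request.meta['proxy'] is set; a minimal downloader middleware along those lines might look like this (the class name, variable and IP address are illustrative, not the repo's actual code):

```python
# settings.py (illustrative) -- point this at the host running the
# docker containers; 8118 is the HAProxy port from the previous section
HTTP_PROXY = 'http://192.0.2.10:8118'

class TorProxyMiddleware:
    """Route every outgoing request through the Tor proxy pool."""

    def __init__(self, proxy=HTTP_PROXY):
        self.proxy = proxy

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours meta['proxy']
        request.meta['proxy'] = self.proxy
        return None  # continue through the middleware chain
```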

Once that update is done you should be ready to run the web scraper. To do this we need to pass it two arguments at the command line.

  • The name of the file containing the list of servers
  • The name of the output file we want to save

Assuming you downloaded the whole repo from GitHub then your command should look like this:

cd Darkweb/ScanningTheDarkWeb/WebScraper
scrapy crawl explorer -a filename=../ServerList/validated-server-list.txt -o ../Results/results.json

If the Python gods have been kind this should start working its way through the list of servers. Scrapy is verbose with its logging by default, so don’t worry too much when you see mountains of text fly across your terminal screen.

Once the web scraper is finished you will have a JSON file with an entry for each web server that was running (and returned the right HTTP status code).

The screenshots below show a couple of examples taken from the results.json file.


Now we have the output from our web scraper we can move onto the last objective.

Analyse the results

It would be impossible in this blog post to work through all the different scenarios you could explore with the data we have collected. Instead we are going to answer a small subset of questions out of all the possibilities.

We are going to answer these questions:

  • How many servers returned data?
  • How many servers run off a Raspberry PI?
  • How many servers share the same favicon.ico hash?
  • How many artifacts did we find (email, pgp, bitcoin, google codes)?

NOTE: The data is all stored in a JSON file, so you could load it into a NoSQL database or something like Elasticsearch/Splunk.

The first step is to load the JSON data from the file so that it is available to use. Again using Python we just need a simple block of code.

import json

raw = 'Darkweb/ScanningTheDarkWeb/Results/results.json'
with open(raw, 'r') as raw_json:
    data = json.load(raw_json)

Now we can move forward and answer the questions we are interested in.

How many servers returned data?

print('How many servers returned data? {0}'.format(len(data)))
How many servers returned data? 2664

How many servers run off a Raspberry PI?

raspberries = 0
for d in data:
    if d.get('headers').get('server'):
        if 'Raspbian' in d['headers']['server']:
            raspberries += 1
print('How many servers run off a Raspberry PI? {0}'.format(raspberries))
How many servers run off a Raspberry PI? 3

How many servers share the same favicon.ico hash?

favicons = {}
count = 0
for d in data:
    if d.get('favicon'):
        if favicons.get(d['favicon']):
            favicons[d['favicon']] += 1
        else:
            favicons[d['favicon']] = 1
for k, v in favicons.items():
    if v > 1:
        count += 1
print('How many servers share the same favicon.ico hash? {0}'.format(count))
How many servers share the same favicon.ico hash? 108

How many artifacts did we find (email, pgp, bitcoin, google codes)?

artifacts = {'email': 0, 'pgp': 0, 'bitcoin': 0, 'google': 0}
for d in data:
    if d.get('email'):
        artifacts['email'] += len(d['email'])
    if d.get('pgp'):
        artifacts['pgp'] += len(d['pgp'])
    if d.get('bitcoin'):
        artifacts['bitcoin'] += len(d['bitcoin'])
    if d.get('google'):
        artifacts['google'] += len(d['google'])
print(artifacts)
{'email': 843, 'pgp': 19, 'bitcoin': 120884, 'google': 120}


This is just a small subset of the type of questions you can answer with this data; the end goal would be to find ways to pivot between the data collected by the web scraper and other sources to build profiles and personas of either the infrastructure hosting a particular website or the individuals running them.

Moving forward I will probably keep adding functionality to the web scraper and look at ways to enrich the data more (I have more tricks up my sleeve). I might even make the data available as a free service depending on the costs involved.

General: Back to Blogging…

I started writing blog posts back in 2011, before I moved into “Cyber” and before I had been introduced to the world of Maltego and Python. Over the last few years my blogging has died down a lot because the stuff I’ve been working on hasn’t been for public consumption (sorry about that).

What has happened is that over time I’ve ended up with blog posts in two different places, over 100 code repos (some public, mostly private) all over the place, and generally just a different outlook on how I want to write, code and share information.

With that in mind I have a new domain, and I will migrate the blog posts from Medium and my old blog (where they still apply, or I might update some) and organise all my code repos. This will no doubt take me a while, but at least it will all be in one place again and we can move forward towards the 10 year anniversary of when I started blogging (ignoring the years without posts).

I’m going to be focusing on the things I enjoy doing, so the blog will contain lots of content relating to:

  • Python
  • OSINT type stuff
  • Maltego
  • Darkweb (I’ve spent years mapping darkweb services)
  • More Python…

OSINT: Certificate Transparency Lists

NOTE: The wonderful people over at OSINT Curious wrote a similar blog post about this last year. The link is HERE. I recommend giving it a read if you want to know more about how to “hunt” using information from certificates.

It’s 2020 and I made myself a promise to try and write more blog posts about things, so here is the first for the year. SSL/TLS certificates have long been a great resource to identify, pivot from and hunt for infrastructure on the internet. Certificate Transparency lists were first introduced by Google in 2013, after DigiNotar was compromised in 2011. They basically provide public logs of certificates issued by trusted certificate authorities and are a great source of intel for a number of functions (such as, but not limited to):

  • Detection of phishing websites
  • Brand/Reputation Monitoring
  • Tracking malicious infrastructure
  • Bug Bounties

For a long time newly registered/observed domains have been an indicator of potential malicious activity; for example, if ebaythisisfake.com was registered you could guess that this was going to be a phishing website. The issue with watching for newly registered domains (it’s still a good indicator) is that a domain could be purchased and then not used for days/months/years.

The wonderful thing about using the CT lists is that the fact a certificate has been issued usually (probably 95% of the time) means that the infrastructure is active and in use. It also means there is active DNS for that domain (required for requesting a certificate) and an IP address to investigate. It also allows you to catch when a legitimate domain has been compromised and is being used for malicious purposes, for example ebay.legitimatedomain.com, which is why CT lists are great for catching phishing websites.
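As a toy illustration of that ebay.legitimatedomain.com case, a keyword check over the domains attached to a new certificate might look like this (the keyword list and domains are made up):

```python
KEYWORDS = ['ebay', 'paypal']

def flag_domains(domains, keywords=KEYWORDS):
    # Flag any certificate domain containing a watched brand keyword,
    # catching both typosquats and compromised legitimate domains
    return [d for d in domains if any(k in d.lower() for k in keywords)]

new_certs = ['ebay.legitimatedomain.com', 'www.example.com',
             'secure-paypal-login.net']
print(flag_domains(new_certs))  # ['ebay.legitimatedomain.com', 'secure-paypal-login.net']
```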

Since the initial launch, a number of companies have started collecting the information published on the CT lists (why not, it’s free), either providing it as part of an existing service (cyber threat intel), providing tools to look up specific domains (Facebook, for example), or giving people access to the raw data in an easy-to-consume way (this is the one we are interested in).

It’s worth noting that while the CT lists are publicly available, the sheer volume of data they generate means it’s much easier to use a feed from someone else than build your own, unless of course you want to.

For the rest of this post we are going to focus on one specific provider, Calidog (https://certstream.calidog.io/), who provide the data for free and maintain libraries for a number of programming languages as well as a command line tool.

The command line tool is Python based, so you can install it with pip (we use pip3 because Python 2.7 is end of life).

pip3 install certstream

Once it’s been installed you can then run “certstream” from the command line, running it without any options will just connect and start pulling down the firehose of certificate information.

certstream just being run from the command line

Straight away you can start to see the certificate information flying across your terminal. The default output just shows a timestamp, the certificate authority and the common name of the certificate being registered. OK, it’s interesting, but not very helpful (for our needs anyway). If you run certstream --full you will get the same information as well as “all domains”, which lists any other domain names assigned to that certificate.

certstream with the --full switch

This “all domains” information is really useful if you are, for example, tracking a domain or keyword that uses Cloudflare or similar platforms that provide SSL certificates. If you were to run certstream --full | grep cloudflare you would see all the new certificates for services behind Cloudflare.

certstream with --full and grepping for cloudflare

Let’s have a look at some use cases. If you were interested in bug bounties, you could use CT lists to discover new infrastructure being created by your target company. If you were to run certstream --full | grep tesla.com, for example, you would see any new certificates containing the string “tesla.com”. You could even remove the “.com” part and search for the word tesla; this would generate more false positives but would also show third-party services that Tesla might be using (for example tesla.glassdoor.com). Similarly, running certstream --full | grep amazon would show any new certificates (as well as some false positives) generated by Amazon.

certstream with --full and grepping for amazon

What about brand/reputation monitoring, where someone registers a certificate that will be part of a phishing website? This is important if you have customers that have to log into your website or enter some kind of credentials. You can easily do this at the command line using similar commands to before, such as certstream --full | grep [keyword].

The problem with using the command line in this way is that it’s great for short-term monitoring, but if you want to include this as part of an overall monitoring solution or longer-term research you will need to code up a solution. The good news is that there are libraries available on the calidog.io website with code examples, and the data pulled from the CT lists (via calidog.io) is JSON formatted, so it’s easy to use (well, sometimes at least).

For this example we are going to use the Python library and extend the code example to allow the use of multiple keywords. The output will write to the command line but can easily be changed to send either the full JSON or the parts you want to another platform (NoSQL, Elastic, Splunk, Slack etc. etc.).

This is the code currently: it takes a list of keywords and then looks for those in each newly registered certificate; if there is a match it prints it out to the command line. The code uses “paypal”, “ebay” & “amazon” as example keywords but you could change these to anything you want.

    import logging
    import sys
    import datetime
    import certstream

    keyword = ['paypal', 'amazon', 'ebay']

    def print_callback(message, context):
        logging.debug("Message -> {}".format(message))

        # ignore keep-alive messages from the stream
        if message['message_type'] == "heartbeat":
            return

        if message['message_type'] == "certificate_update":
            domains = message['data']['leaf_cert']['all_domains']
            if [k for k in keyword if k in ' '.join(domains)]:
                sys.stdout.write(u"[{}] {} {} \n".format(datetime.datetime.now().strftime('%m/%d/%y %H:%M:%S'), domains[0], ','.join(domains)))

    logging.basicConfig(format='[%(levelname)s:%(name)s] %(asctime)s - %(message)s', level=logging.INFO)

    certstream.listen_for_events(print_callback, url='wss://certstream.calidog.io/')

Now, the code won’t necessarily output something straight away, as it’s only looking for the keywords you’ve specified, but in the screenshot below you can see the output from about 5 minutes using the example keywords.

Example output of the above python script after 5 minutes

If you want to use the data retrieved from the CT list to pivot into other data sources, you can expand the number of fields returned (or use the whole thing) to get the certificate serial number, fingerprint etc.

All the available fields are shown in the Github repository for the calidog.io Python library which is available at the link below:

Python Code:

I’ve made the code used in this post a bit more robust and pushed it to my “Junk” Github repo which you can find HERE.

To run the code you first need to make sure you are using Python 3 and have installed the certstream library. To install it, just type the following in a terminal/command line.

pip3 install certstream

Once the certstream python library is installed you need to create a text file with your list of “keywords”, a single keyword per line. For example, if you wanted to look for new certificate requests for Amazon, eBay and PayPal, just create a file (it doesn’t matter what the filename is) containing:

amazon
ebay
paypal

To run the code just execute the following;

python3 osint-certstream.py [filename]

So if your keywords filename is keywords.txt for example you would just run;

python3 osint-certstream.py keywords.txt

To stop the code at any time you can just CTRL + C to kill it.

Let me know if you have any issues. The code works, but it’s not exactly production ready; it just gives you an idea of what is possible.

OSINT: Getting email addresses from PGP Signatures

This morning I was reviewing some new Tor onion addresses one of my scripts had collected over the weekend; a lot of the websites I collect are either junk or no longer working.

One of the ones I checked was working, and on the website there was a PGP signature block. That got me thinking: if you don’t know someone’s email address, can you determine it from the PGP signature block?

Well the answer is yes, yes you can, and with a little time and some Python (who doesn’t love Python on a Monday morning) you can work out the email address(es) associated with just a PGP signature. The magic comes in determining the Key ID associated with a PGP signature and then using that Key ID to query a PGP key server to find the email address.

The code currently only works on PGP signatures in the same format as the one below (I may update it to work on other formats):

DISCLAIMER: I found this on the “darkweb” so I hold no responsibility/liability for what you do with it.




There is a specific RFC for PGP formats, which you can find HERE. I’ve not read it, to be honest (it’s not that kind of Monday morning). In order to get the Key ID you need to follow a few steps.

Firstly we need to strip out the start and end of the signature block (the bits with -----).

regex_pgp = re.compile(
    r"-----BEGIN [^-]+-----([A-Za-z0-9+\/=\s]+)-----END [^-]+-----", re.MULTILINE)
matches = regex_pgp.findall(m)[0]

Then we need to decode the signature, which is base64 encoded.

b64 = base64.b64decode(matches)

Convert that output to hex, so you can pull out the values you need.

hx = binascii.hexlify(b64)

Then get the values you need.

keyid = hx.decode()[48:64]

Once you have the Key ID you can just make a web call to a PGP key server to search for the key.

server = 'http://keys.gnupg.net/pks/lookup?search=0x{0}&fingerprint=on&op=index'.format(keyid)

Then because well I’m lazy I used regex to find any email address.

regex_email = re.compile(r'([\w.-]+@[\w.-]+\.\w+)', re.DOTALL | re.MULTILINE)
email = re.findall(regex_email, resp.text)
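Putting the steps above together into a single helper (the function name is mine, and this assumes a clean signature block without armor headers or a CRC line):

```python
import re
import base64
import binascii

def keyid_from_signature(block):
    # 1. strip the BEGIN/END armor lines and keep the base64 body
    regex_pgp = re.compile(
        r"-----BEGIN [^-]+-----([A-Za-z0-9+\/=\s]+)-----END [^-]+-----",
        re.MULTILINE)
    body = regex_pgp.findall(block)[0]
    # 2. decode the base64 body back to raw bytes
    raw = base64.b64decode(body)
    # 3. hexlify and slice out the 16 hex characters of the Key ID
    hx = binascii.hexlify(raw)
    return hx.decode()[48:64]
```

The returned Key ID can then be dropped straight into the key server lookup URL shown above.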

The full script can be found HERE

NOTE: I’ve not tested it exhaustively but it works enough for my needs.

OSINT: Etag you’re it…

Investigators, malware analysts and a whole host of other people spend time mapping out infrastructure that may or may not belong to criminal entities. Some of the time this infrastructure is hidden behind services that are designed to provide anonymity (think Tor and Cloudflare) which makes any kind of attribution difficult.

Now I’m a data magpie, I have a tendency to collect data and work out later what I’m going to do with it. A while back someone suggested ETag (HTTP Header) as something worth collecting, so I thought I would have a look at potential uses.

ETags are part of the HTTP response header and are used for web cache validation. An ETag is an opaque identifier assigned by a web server to a specific version of a resource found at a URL. How ETags are generated isn’t specified in the HTTP specification, but generated ETags are supposed to use a “collision resistant hash function” (source: https://en.wikipedia.org/wiki/HTTP_ETag).

I wanted to see if you could use ETags as a way to fingerprint web servers; however, ETags are optional, so not all web servers will return one in the HTTP response header. Let’s look at an example. I have an image hosted in an Amazon S3 bucket, and this S3 bucket is also fronted by Cloudflare (mostly for caching and performance). The HTTP headers for both requests are below:

Direct request:

HTTP/1.1 200 OK

Accept-Ranges: bytes
Content-Length: 1150
Content-Type: image/x-icon
Date: Thu, 25 Oct 2018 09:06:25 GMT
ETag: "e910c46eac10d2dc5ecf144e32b688d6"
Last-Modified: Mon, 06 Aug 2018 12:18:30 GMT
Server: AmazonS3
x-amz-id-2: LxghTaCpRKQhaP69Qm942BrdyhN87m+SIJTh1xz403c5nVEGj4Y7fsPtNrZ2uedLPRL3zCIeHas=
x-amz-request-id: 5116840933538C62

Through Cloudflare:

HTTP/1.1 200 OK

CF-RAY: 46f3876dbb4b135f-LHR
Cache-Control: public, max-age=14400
Connection: keep-alive
Content-Encoding: gzip
Content-Type: image/x-icon
Date: Thu, 25 Oct 2018 09:06:46 GMT
ETag: W/"e910c46eac10d2dc5ecf144e32b688d6"
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Expires: Thu, 25 Oct 2018 13:06:46 GMT
Last-Modified: Mon, 06 Aug 2018 12:18:30 GMT
Server: cloudflare
Set-Cookie: __cfduid=d5bd0309883f2470956c30bdbca9c99751540458406; expires=Fri, 25-Oct-19 09:06:46 GMT; path=/; domain=.sneakersinc.net; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
x-amz-id-2: vJsCSPCMJFdBmxdPemrtoX07SCKHliZicOpUYM32Do9TPnCFFLiZxsNXeNo3SK/lv/90sqDBPa0=
x-amz-request-id: CDC9A8EF960ABE44

From this example you can see that the ETags are the same, except that the Cloudflare-returned ETag has an additional “W/” at the front. According to the Wikipedia article, a “W/” at the start of an ETag marks it as a “weak ETag validator”, whereas the direct ETag (not through Cloudflare) is a “strong ETag validator”.
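To compare ETags collected via different paths (direct vs through a CDN), you first want to strip the weak-validator prefix and the quotes. A small helper along these lines (the function is my own sketch, using the header values from the example above):

```python
def normalize_etag(value):
    # Strip the weak-validator prefix (W/) and surrounding quotes so
    # ETags seen directly and via a CDN can be compared
    value = value.strip()
    if value.startswith('W/'):
        value = value[2:]
    return value.strip('"')

direct = '"e910c46eac10d2dc5ecf144e32b688d6"'    # straight from S3
cached = 'W/"e910c46eac10d2dc5ecf144e32b688d6"'  # via Cloudflare
print(normalize_etag(direct) == normalize_etag(cached))  # True
```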

A search on Shodan shows over 23 million results where ETags appear, which gives us a big data set to compare against. Let’s look at another example: if we search Shodan for the ETag "122f-537eaccb76800" we find 263 results, which shows that ETags aren’t necessarily unique. However, if you look at the results, the web page titles are all the same: “Test Page for the Apache HTTP Server on Fedora”. So it seems that the ETag does relate to the content of the web page, which is in line with the information from Wikipedia.

So, in some instances the ETag is unique and in others it might not be, but it does give you another pivot point to work from.

Let’s have another look at an example: the ETag "dd8094-2c-3e9564c23b600" provides 16,042 results. That’s a lot, but if you look at the results (not all of them, obviously) you will notice that they all belong to the same organisation, Team Cymru. So while the ETag in this example doesn’t provide a unique match, it does provide a unique organisation, which again gives you an additional pivot point.

I’m going to continue looking at ways to map ETags to web servers; if nothing else it’s an interesting way to match content being hosted on a web server. Have a look at this ETag: "574d9ce4-264".

Maltego: Meta Search Engines

The other day on the train I was reading “The Tao of Open Source Intelligence” by Stewart Bertram, and there is a section where he talks about “Metasearch Engines”. Basically, when you use search engines like Google or Bing you are searching their database of results. Metasearch engines typically aggregate the results from multiple different search engines, which is useful when performing OSINT-related searches, or just if you want to save time.

A while back I came across searx:

Searx is a free internet metasearch engine which aggregates results from more than 70 search services. Users are neither tracked nor profiled. Additionally, searx can be used over Tor for online anonymity.

Now I like searx for a number of reasons:

  • It’s written in Python (Flask).
  • You can run it in a Docker container.
  • You can create your own (or add) search engines.
  • You can search for images, files, bittorrents, etc.
  • It has the ability to output in JSON, CSV and RSS.

The part of searx that really interests me is the ability to add your own search engines. I’ve yet to try this, but in theory not only can you add “normal” search engines, you should also be able to add your own internal data sources. For example, if you have an Elasticsearch index you could write a simple API to query it and then add it as a search engine within searx.

You could in theory (I am going to test this) remove all the normal search engines and use searx as a search engine over all your internal data sources (some assembly required).

Now what’s any of this got to do with Maltego? Well, Maltego (out of the box) has some transforms for passing entities through search engines, but sometimes you just want to “Google” something and see what you get back (well, that might just be me).

The idea was to run a docker instance of searx, and then create a local Maltego transform to run a query and return the first 10 pages of results.

Lets break down the steps:

  1. Build a docker image for searx.
  2. Write a local transform to query searx and return the results.

Before I get into the good stuff, let me quickly explain local transforms (in case you don’t know). Local transforms work the same as remote transforms except that they run locally on your machine. Local transforms have some benefits over remote transforms but also some downsides (this is not an exhaustive list).


Benefits:

  • You get all the benefits of the processing power of your local machine.
  • You can run transforms that query local files on your machine; for example, if you want a transform to parse a pcap file you can (more difficult with remote transforms).
  • It’s much quicker to develop/make changes to local transforms, and you don’t have any other infrastructure requirements.

Downsides:

  • It’s more difficult to share local transforms with others (sharing is caring).
  • You have to worry about library dependencies and all that stuff.

Ok enough of that stuff, lets get onto the interesting stuff.

First off we need to get a Docker image for searx, luckily this is really easy as the author’s documentation is really good. The instructions for this can be found HERE.

All being well you should have a running instance of searx which should look like the image below:

Have a play with searx, it’s well worth the time and has lots of cool features.

The local Maltego transform (it’s written in Python 3.x) is pretty simple: pass it your query (based on a Maltego Phrase entity) and it will return the results as URL entities.

Paterva provide a Python library for local transforms, and similar to the remote TRX library it was written in Python 2.x so I tweaked it to work with Python 3.x.

Here is the code for the transform (Github repo linked at the end).

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
import sys
import json
from settings import BaseConfig as settings
from MaltegoTransform import *

def metasearch(query):
    m = MaltegoTransform()
    for page in range(1, settings.MAX_PAGES):
        url = '{0}{1}&format=json&pageno={2}'.format(
            settings.SEARX, query, page)
        response = requests.post(url).json()
        for r in response['results']:
            ent = m.addEntity('maltego.URL', r['url'])
            ent.addAdditionalFields('url', 'URL', True, r['url'])
            if r.get('title'):
                ent.addAdditionalFields('title', 'Title', True, r['title'])
            if r.get('content'):
                ent.addAdditionalFields('content', 'Content', True, r['content'])
    m.returnOutput()

if __name__ == '__main__':
    metasearch(sys.argv[1])

Now the transform maps across the URL, the title of the page (if it exists) and the content (which is a snippet of text, similar to Google results) as an additional field.

Below is a screenshot of what it looks like in Maltego.

To get the transform working, clone the Github repo (link below) and then add the new transform; if you’ve never done it before, Maltego provide instructions HERE.

One of the things I love about Maltego is it doesn’t need to be complicated, creating transforms is easy, and then using them is just as simple. The ability to define what the tool can do for you, makes it powerful and versatile.

The Github repo is HERE

Maltego: AWS Lambda

One of the awesome things about Maltego (and Paterva, the company that makes it) is that they allow people like me to host remote Maltego transforms (Transform Host Server) using a mixture of their Community iTDS server and your own infrastructure.

For a few years now I’ve been running remote Maltego Transforms on an AWS instance (t2.micro). All you need to get it up and working is a Linux server, Apache2 (you could probably use nginx) and a little bit of time. Paterva provide all the files and installation notes you need HERE and it works a treat.

If you work in tech you can’t really escape the flood of people talking about “Serverless” deployments and all that kind of stuff, so I thought it would be cool to try to recreate the Transform Host Server using Python 3.x, Flask (instead of Bottle), Zappa, and AWS Lambda functions.

Paterva’s original Maltego.py (the Python library for remote transforms) is written in Python 2.x, which means the first job was to tweak the file so Python 3.x didn’t get all funny about print statements (the good news is the tweaked version is available in the Github repo below).

Once that was done, I then needed to recreate the web server component (which was originally written in Python’s bottle framework) to make use of the awesome Python Flask framework. This probably took the longest as I had to work out the differences between Flask & Bottle.

A skeleton Flask server is shown below; the transform I wrote to prove the concept simply connects to a website and returns the status code (200, 404 etc. etc.).

#!/usr/bin/env python

from flask import Flask, request
from maltego import MaltegoMsg
from transforms import trx_getstastuscode

app = Flask(__name__)

@app.route('/', methods=['GET'])
def lambda_test():
    return 'Hello World'

@app.route('/tds/robots', methods=['GET', 'POST'])
def get_robots():
    return trx_getstastuscode(MaltegoMsg(request.data))

if __name__ == "__main__":
    app.run()

The lambda_test function is just there so I could make sure it was working; you can (and probably should) remove it.

Now the next step was to deploy this to AWS’s Lambda service. Being lazy, I decided to try Zappa:

Zappa makes it super easy to build and deploy server-less, event-driven Python applications (including, but not limited to, WSGI web apps) on AWS Lambda + API Gateway. Think of it as “serverless” web hosting for your Python apps. That means infinite scaling, zero downtime, zero maintenance — and at a fraction of the cost of your current deployments!

Zappa is super easy to use, just follow the instructions and make sure to use Python virtual environments (not like someone we won’t mention, who forgot). Provide Zappa with some AWS credentials that have the right level of access and within minutes you will be deploying your new app as an AWS Lambda function (seriously, it only takes a few minutes). It’s important that you take note of the URL provided at the end of the deployment as you will need it in the final stage.
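For reference, running zappa init walks you through generating a zappa_settings.json; a minimal one looks something like the sketch below. The region, bucket and project names are placeholders, and app_function assumes your Flask file is app.py containing app = Flask(__name__):

```json
{
    "dev": {
        "app_function": "app.app",
        "aws_region": "eu-west-1",
        "s3_bucket": "zappa-maltego-deploys",
        "project_name": "maltego-tds"
    }
}
```

With that in place, zappa deploy dev pushes the app and prints the API Gateway URL, and zappa update dev pushes subsequent changes.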

The final stage of this great masterpiece is to configure your account on Paterva’s Community iTDS server to point to your new transform. The documentation is HERE if you’ve never done this before. Just one thing to note: the Transform URL is the URL output by Zappa above (it should stay the same no matter how many times you deploy).

The nice thing about using AWS’s Lambda functions is that it’s really easy and quick to deploy, and the pricing model works great if you aren’t expecting heavy usage (1 million requests per month or 400,000 GB-seconds per month on the free tier). Now there is no reason for you not to be hosting Maltego transforms for the world to share…

All the files you need to deploy your first Maltego Lambda function are in my Github repo below, clone the repo, configure your virtual environment (there is a handy requirements.txt to help) and off you go.

Simple example of how to use AWS’s Lambda functions to host Maltego remote transforms

OSINT: Email Verification API

When I started writing Python my focus was on building tools, then I realised that the tools I built I never actually used, mostly because at the time my job wasn’t security related and it was more of a hobby. Now I work in security and the majority of the code I write is focused around collecting, processing and displaying data. I’m lucky I love the work I do, for me the “fun” is around solving problems and because of that the code I write (and I still write a LOT of it) has evolved.

Open Source Intelligence (OSINT) is my new “hobby”; there is some crossover into my job, but for the most part I write code to perform OSINT related functions. One of the things I discovered on my OSINT journey is that there are lots of APIs, data feeds and other sources out there that you can use, but for a lot of them you have to pay, and in some cases it’s not always clear how the data is collected or stored. When faced with issues like this I tend to work on the following principle: “if in doubt, write it yourself”.

A while back I discovered a Python library that is just awesome when you want to create your own APIs, called flask_api (an extension of the Flask framework), and I’ve been using it a lot lately to create new “things”.

Today I wanted to share with you my email verification API, essentially you pass an email address to it and it will check to see if it can determine whether or not the email is “valid”. It will also check to see if the target email server is configured as a “catch-all” (in other words any email address at the specified domain will be accepted).
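My actual code is in the repo linked below, but as a rough illustration of the underlying technique (a syntax check followed by an SMTP RCPT TO probe), here is a stdlib-only sketch. The MX record lookup step is left out, and all the function names here are my own, not the API’s:

```python
import re
import smtplib

# A cheap first pass before touching the network: does it look like an email?
EMAIL_RE = re.compile(r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')

def check_syntax(email):
    """Return True if the address at least looks like an email."""
    return bool(EMAIL_RE.match(email))

def smtp_verify(mx_host, email, sender='probe@example.com'):
    """Ask the target mail server whether it accepts RCPT TO for the address.
    A 250 reply suggests the mailbox exists; a 550 suggests it does not."""
    server = smtplib.SMTP(mx_host, 25, timeout=10)
    try:
        server.helo()
        server.mail(sender)
        code, _ = server.rcpt(email)
        return code == 250
    finally:
        server.quit()

def is_catch_all(mx_host, domain):
    """Probe an unlikely mailbox; if it is accepted, the server is
    probably configured as a catch-all."""
    return smtp_verify(mx_host, 'zz-unlikely-mailbox-9f3a@' + domain)
```

The catch-all check is just the same probe pointed at a mailbox that almost certainly doesn’t exist; if the server accepts that, a 250 for your target address proves very little.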

All of this is returned in lovely JSON formatted output, which makes it really easy to use as part of a script or another service, and even easier to dump the returned data into a database (if you like that sort of thing). To ease installation there is also a Docker container available.

The README.md in the Github repository has the necessary installation instructions, but it’s nice and simple once you’ve cloned the repo.

  • pip install -r requirements.txt
  • python server.py

Then to test it you simply use this URL:

http://localhost:8080/api/v1/verify/?q=anon@anon.com (replacing anon@anon.com with your target email address). If all works, you should get something similar to the screenshot below.


The code can be found HERE

Any questions, queries, feedback etc. etc. let me know.

Beginners Guide to OSINT – Chapter 1

DISCLAIMER: I’m not an Open Source Intelligence (OSINT) professional (not even close). I’ve done some courses (got a cert), written some code and spent far too much time using Maltego. OSINT is a subject I enjoy; it’s like doing a jigsaw puzzle with most of the pieces missing. This blog series is MY interpretation of how I do (and view) OSINT related things. You are more than welcome to disagree or ignore what I say.

The first chapter in the OSINT journey is going to cover the subject of “What is OSINT and what can we use it for?”. Sorry, it’s the non-technical one, but I promise not to make it too long or boring.

What is OSINT??

OSINT is defined by Wikipedia as:

“Open-source intelligence (OSINT) is intelligence collected from publicly available sources. In the intelligence community (IC), the term “open” refers to overt, publicly available sources (as opposed to covert or clandestine sources); it is not related to open-source software or public intelligence.” (source)

For the purpose of this blog series we are going to be talking about OSINT from online sources, so basically anything you can find on the internet. The key point for me about OSINT is that it (in my opinion) only relates to information you can find for free. Having to pay to get access to information such as an API or raw data isn’t really OSINT material. You are essentially paying someone else to collect the data (that’s the OSINT part) and then just accessing their data. I’m not saying that’s wrong or should be a reason not to use data from paid sources, it’s just (and again just my opinion) not really OSINT in its truest form.

Pitfalls of OSINT

Before we go any further I just wanted to clarify something about collecting data via OSINT. This is something that I often talk to people about, and there are varying opinions on it. When you collect data via OSINT methods it’s important to remember that the data is only as good as the source you collect it from. The simple rule is “Don’t trust the source, don’t use it”.

You also need to consider the way that the data is collected. Let me explain a bit more; consider this scenario (totally made up).

You spot someone (within a corporate environment) called Ronny emailing a file called “secretinformation.docx” to an external email address of ronnythespy@madeupemail.com. You decide to do some “OSINT” to work out if the two Ronnies are the same person. Using a tool or chunk of code (in a language you don’t know) you decide that you have enough information to link the two Ronnies together.

Corporate Ronny takes you to court to claim unfair dismissal, and during the proceedings you are asked (as the expert witness) how the information was collected. You can explain the process you followed (run the code or click on the tool), but can you explain how the tool or chunk of code provided you with that information, or the methods it used to collect it (were they lawful, for example)?

For me, that’s the biggest consideration when using OSINT material if you want it to provide true value to what you are trying to accomplish. Being able to collect the information is one thing; validating the methods or techniques used to collect it is another. Again, this is a conversation I have had many, many times, and I work on this simple principle, “if in doubt, create it yourself”, which basically means I have to (get to) write some code or build a tool.

This quote essentially sums up everything I just said, “In OSINT, the chief difficulty is in identifying relevant, reliable sources from the vast amount of publicly available information.” (source)

What is OSINT good for?

Absolutely everything!! Well, ok, nearly everything, but there are a lot of ways that OSINT can be used for fun or within your current job. Here are some examples:

  • Company Due Diligence
  • Recruitment
  • Threat Intelligence
  • Fraud & Theft
  • Marketing
  • Missing Persons

What are we going to cover??

At the moment I’ve got a few topics in mind to cover in this blog series, I am open to suggestions or ideas so if you have anything let me know and I will see what I can do. Here are the topics I’ve come up with so far (which is subject to change).

  • Image Searching
  • Social Media
  • Internet Infrastructure
  • Companies
  • Websites

Hopefully you found this blog post of use (or interesting), leave a comment if you want me to cover another subject or have any questions/queries/concerns/complaints.

Building your own Whois API Server

So it’s been a while since I’ve blogged anything not because I haven’t been busy (I’ve actually been really busy), but more because a lot of the things I work on now I can’t share (sorry). However every now and again I end up coding something useful (well I think it is) that I can share.

I’ve been looking at Domain Squatting recently and needed a way to codify whois lookups for domains. There are loads of APIs out there but you have to pay and I didn’t want to, so I wrote my own.

It’s a lightweight Flask application that accepts a domain, does a whois lookup and then returns a nice JSON response. Nothing fancy, but it will run quite happily on a low spec AWS instance or on a server in your internal environment. I built it to get around having to fudge whois on a Windows server (let’s not go there).
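The real code uses the pythonwhois library (see below), but to show the shape of the idea, here is a stdlib-only sketch that shells out to the system whois binary and flattens the reply into a JSON-friendly dict. The parsing is deliberately naive; pythonwhois does a much better job:

```python
import json
import subprocess

def parse_whois(raw):
    """Pull simple 'Key: value' pairs out of raw whois output,
    skipping comment lines and empty values."""
    record = {}
    for line in raw.splitlines():
        line = line.strip()
        if ':' in line and not line.startswith(('%', '#', '>>>')):
            key, _, value = line.partition(':')
            if value.strip():
                # keep the first occurrence of each key
                record.setdefault(key.strip(), value.strip())
    return record

def whois_json(domain):
    """Run the system whois binary and return the result as JSON."""
    raw = subprocess.run(['whois', domain],
                         capture_output=True, text=True).stdout
    return json.dumps(parse_whois(raw), indent=2)
```

Wrapping whois_json in a Flask route is then a few lines, which is essentially all the real application does.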

In order to run the Flask application you need the following Python libraries (everything else is standard Python libraries).

  • Flask
  • pythonwhois (this is needed for the pwhois command that is used in the code)

To run the server just download the code (link at the bottom of the page) and then run:

python whois-server.py

The server runs on port 9119 (you can change this) and you can submit a query like this:


You will get a response like the picture below:

From here you can either roll it into your own tool set or just use it for fun (not sure what sort of fun you are into but..).

You can find the code in my GitHub repo MyJunk or if you just want this code it’s here.

There may well be some bugs, but I haven’t found any yet; it runs best on Linux (or Mac OSX).

Any questions, queries etc. etc. you know where to find me.