Understanding private proxies for crawling

Disclaimer: Since I started working with web crawling, I have noticed it can be used in unethical ways. The line is not always clear, and it depends a lot on interpretation. I wrote this with the intention of helping people learn, so be responsible with this knowledge.

Along with CAPTCHAs, reCAPTCHAs and fingerprinting, IP blocking is one of the techniques sites use most to keep crawlers away.

The main way to bypass this protection is to use proxies. However, using a public proxy is almost always inefficient because of blacklists, which keep track of these public IPs.

In this context, some people feel inclined to use Tor as the proxy, but besides being very slow, Tor is a resource meant to help people under freedom constraints, and you should think twice before using it for trivial purposes.

So you decide to rent private proxies, but there are many other things you should know to make the best decision. Below I will discuss a few of them.

Proxy protocol

The most used protocol for proxies is HTTP/HTTPS, but sometimes you want to connect using another protocol, such as FTP, ICQ, Whois, etc., and then you should use a TCP-layer proxy protocol: SOCKS.

There are mainly two types of SOCKS: SOCKS4, which supports only TCP, and SOCKS5, which supports both TCP and UDP, is more secure, and also supports authentication.

SOCKS is very powerful and can also be used to connect to HTTP/HTTPS services. However, not many providers on the internet let us connect to any destination port we want; it is almost always restricted to ports 443 and 80.

Proxy exclusivity

There are three levels of proxy exclusivity: public, shared and dedicated.

The public ones, as said before, are used by lots of people and thus get blocked very fast, either because they end up in a blacklist or simply because they get noticed by the host faster (since many people have probably used them for the same purpose).

Shared proxies are used by more than one user and have some of the downsides of public proxies, but they can be a good choice if you intend to use them on a not-so-popular domain.

Dedicated proxies are the most expensive of all, but they also give the best results in the general case, and they are strongly recommended for popular domains such as social media and e-commerce.

Connection type

When talking about crawlers, you usually connect to a proxy using some library (like Python Requests) and must pass the proxy you want to use in the function parameters. Usually, proxy providers give you an IP and port to connect to, in the form IP:port. Depending on the authentication method, you may also need to add your credentials, like user:passwd@IP:port.
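With Python Requests this boils down to passing a `proxies` mapping to the request call. A minimal sketch follows; the host, port, and credentials are placeholders, not a real provider:

```python
# Placeholder socket and credentials -- substitute what your provider gives you
PROXY_HOST = "proxy.example.com"
PROXY_PORT = 8080
USER, PASSWD = "user", "passwd"

# Requests takes a mapping from URL scheme to proxy URL; the same syntax
# also accepts SOCKS URLs (socks5://...) when the pysocks extra is installed.
proxies = {
    "http": f"http://{USER}:{PASSWD}@{PROXY_HOST}:{PROXY_PORT}",
    "https": f"http://{USER}:{PASSWD}@{PROXY_HOST}:{PROXY_PORT}",
}

# With a real proxy, the request below would go out through it:
# import requests
# response = requests.get("https://api.ipify.org", proxies=proxies)
# print(response.text)  # should print the proxy's IP, not yours
```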

However, the connection to this IP can happen in various ways. The most common is the direct connection: if you connect to that proxy and then make a request to, for instance, ipify, the answer must be that same IP you connected to directly. When the proxy provider adopts this approach, they tend to give you an endpoint with the list of your proxies.

When providers want to give you a plethora of IPs, they cannot hand you every possible IP in their database, so they give you a socket (IP + port) that works like a gateway to their service. That is how back-connected and rotating proxies work.

Back-connected proxies are just many proxies behind a single socket. If you connect to a back-connected proxy and then make a request to ipify, the result will not be the same as your socket.

Back-connected proxies can either retrieve a proxy from a randomized pool or update your proxies periodically. This last type is usually called a Rotating Proxy.
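If your provider hands you a plain list of proxies instead of a rotating gateway, you can do the rotation yourself on the client side. A minimal round-robin sketch (the proxy addresses below are made up):

```python
from itertools import cycle

# Hypothetical proxy sockets, as a provider's endpoint might list them
proxy_list = [
    "",
    "",
    "",
]

# cycle() repeats the list forever, so each request can grab the next proxy
rotation = cycle(proxy_list)

def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(rotation)

picks = [next_proxy() for _ in range(4)]
# picks[3] wraps around to the first proxy again
```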

Network type

When you get internet installed at home, you must contract an Internet Service Provider, or ISP, which gives you an IP each time you connect. If you check your IP online, you will see that it is registered as coming from some ISP.

There are other types of IPs, bought just to serve as proxies. They are known as Datacenter IPs, and they are the most common kind of proxy sold in the market today.

Datacenter IPs have their downsides, one of them being that they are relatively easy for well-known IP providers to block. The Luminati proxy service came up with a solution for this, which uses residential IPs from people who use Hola VPN, as said here.

Residential IPs use a back-connected socket (or Super Proxy, as Luminati calls it), and every time you connect they change your IP, drawing from a pool of millions of IPs.

There is another solution, called Mobile IPs, which is very similar to Residential but not as popular.

Other relevant features to check

While renting a proxy, there are still many other things to pay attention to.

Bandwidth: You must know how much data passes through the proxies and whether your proxy provider has any limit on data traffic.

Subnets: When talking about Datacenter IPs, this may be the most important thing to pay attention to, because many firewalls block the entire subnet you are on.

Replacement: You need an option to replace your proxies if they get blocked, even if the replacement happens only once a month. Otherwise, you will pay for a service you cannot use.

IP location: This mostly depends on the crawled sites; some of them (especially government sites) allow access only from a given location.

Authentication system: Most proxy services allow connection by IP or by username + password. However, when the authentication uses IPs, you must check whether the number of allowed IPs is limited.

Customer service: To me, this is one of the most important features, since you will almost always have some doubt or problem while using the proxy service.

My experience with proxies

I have used a few proxy services in the last year and will share my impressions below. Most of the services I did not use were ruled out because they did not answer my tickets, or took several days to do so.

Blazing proxies: We still use it in most of our applications because it supports SOCKS5 (and we can choose the port we connect to), has lots of subnets, a fair price, fair customer service, and no bandwidth limitations.

Luminati: I was amazed by their product because they are the vanguard of Residential proxies, which are virtually unblockable. Even their Datacenter IPs are fresh and constantly replaced. They have a Proxy Management System that you can configure in lots of ways. But their most incredible feature is their customer service, which gave me my best customer experience so far. Their downsides are the high prices (still fair for the product they sell) and the bureaucracy to contract some services, which is also fair, given the nature of those services.

RotatingProxies: A back-connected proxy service that changes its IPs every 5 minutes. They use residential IPs, but I found the proxies were of low quality and almost always already blocked by the service I tried to connect to. They also take a long time to activate your account.

SharedProxies: A cheap service, but with very few subnets. They also charge extra if you want to connect using more than one IP.


Proxies are almost always needed for large-scale web crawling, and it can be painful to choose the best service if you do not know what you need. You must be able to define your priorities and constraints and know the choices available in the market well; this post is a starting point for making that choice.

GnuPG small guide

GnuPG, or GNU Privacy Guard, is software that implements the OpenPGP standard — in which PGP stands for Pretty Good Privacy.

It is a tool used for — drum roll — privacy. By privacy, you should understand sending private messages between users, storing data safely, being sure that a message has a given origin, and so on.

For those who think privacy is a waste of time, stuff for paranoid tech guys, I suggest reading this post on Stack Exchange.

Although we live in a free era, freedom is sometimes menaced by dictatorships and state surveillance, and we do not know when this privacy knowledge will go from “nice to have” to “must have”.

PGP was created in 1991 by Phil Zimmermann to securely store messages and files, and no license was required for non-commercial use. In 1997 it was proposed as a standard at the IETF, and thus OpenPGP emerged.

Uses for the GPG

GPG has some nice features, some used more frequently than others:

  • Signing and verifying commits: With this, you can always know the owner of a commit, as well as let others be sure you wrote some piece of code. The Git Horror Story is a tale that tries to instill the habit of signing/verifying commits into greenhorn programmers. That said, some people have their own issues with commit signatures.
  • Sending and receiving encrypted messages: Remember when, as a kid, you created encoded messages and thought no one would ever understand them¹? Well, bad news: everyone could read them. But now you can create messages that no one will understand², for real.
  • Checking signatures: It is possible to know (with a great degree of confidence) that some file comes from someone specific. For instance, consider downloading Tor: it is a secure browser, but if you download a tampered copy, it will serve the exact opposite of its purpose. But calm down, no need to cry: with GPG you can check the integrity of any software.

Come with us to learn cool stuff!

¹ No? Oh… sorry for your childhood. :'(
² Maybe it is possible to break a 4096 RSA encryption.

Creating your key

First of all, you must download and install the GPG tools. Then you can check the installation with

> gpg --version
gpg (GnuPG) 1.4.20
License GPLv3+: GNU GPL version 3 or later [...]

The next step is to create a key for your use, which is very easy! The step-by-step can be seen here or here. Briefly, it is

> gpg --gen-key

And here are some things you must pay attention to during the creation of the key:

  • I would choose an RSA key. I suggest you read this to understand why.
  • Be sure to choose the maximum bit length for your key, if you want it to be safer. At the time this article is being written, the maximum is 4096.
  • It is nice to set an expiration date. You may lose your key, or die (as we all will some day), and then it is good practice to let others know that the key is not being used anymore.
  • Avoid comments on your key, as they are redundant almost every time.
  • It is nice to use strong digest algorithms like SHA512. I suggest you study and create a nice gpg.conf to make sure you are using the best configuration.
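As an illustration, a minimal gpg.conf following those practices might contain lines like the ones below. These options are adapted from common best-practice guides; check your GPG version's manual before copying them:

```
# Prefer strong digests when signing and certifying
personal-digest-preferences SHA512
cert-digest-algo SHA512
default-preference-list SHA512 SHA384 SHA256 AES256 AES192 AES ZLIB BZIP2 ZIP Uncompressed

# Display long key IDs and full fingerprints when listing keys
keyid-format 0xlong
with-fingerprint
```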

Here you can find a lot of other good practices for your day-to-day GPG use.

Assuming you finished the creation of your key, you can check it all with

> gpg --list-keys
pub   4096R/746F8A65 2017-05-24 [expires: 2018-05-24]
      Key fingerprint = 014C F6E9 C2E0 12A2 4187  F108 178A C6CD 746F 8A65
uid                  Lucas Almeida Aguiar <lucas.tamoios@gmail.com>
sub   4096R/AFC85A01 2017-05-24 [expires: 2018-05-24]

As a brief summary: pub stands for the “public” key; then you have the key length (4096 bits) with an R for RSA, a slash, and the short fingerprint, followed by the creation and expiration dates. The short fingerprint takes the last 8 hex digits of your actual fingerprint. The uid is what you wrote a few minutes ago, when I told you not to write a comment. We will talk about the subs later.

For now, pay attention: the uid is not enough for you to believe someone is who they say they are. Anyone can create a key with any name or e-mail. To be sure someone really is who they claim to be, you must check their fingerprint. We will cover this more deeply when discussing the web of trust.

Working with keys

Generally, you have not only your own keys but also other people’s public keys, which you use to verify signatures and to send them encrypted stuff. You have the power to edit your key, and to change how you see others’ keys, with

> gpg --edit-key 746F8A65

The hash in front of the command is just the short fingerprint of the uid you want to edit.

> gpg --edit-key 746F8A65
gpg (GnuPG) 1.4.20; Copyright (C) 2015 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Secret key is available.

pub  4096R/746F8A65  created: 2017-05-24  expires: 2018-05-24  usage: SC  
                     trust: ultimate      validity: ultimate
sub  4096R/AFC85A01  created: 2017-05-24  expires: 2018-05-24  usage: E   
sub  4096R/B2CD6DC9  created: 2017-05-24  expires: 2018-05-24  usage: S   
[ultimate] (1). Lucas Almeida Aguiar <lucas.tamoios@gmail.com>

Pay attention to the letters in the usage attribute; they mean:
S for sign
E for encrypt
C for certify

The certify usage is the most powerful of them because it can create, trust and revoke keys.

The gpg --edit-key command allows you to change passwords, trust keys, sign keys, change the expiration date, and more.

Trusting keys

PGP has a decentralized trust model, called the web of trust. It allows you to trust keys even without a central server, as there is in X.509. The kinds of trust you can assign to keys are:

  • Ultimate: is only used for your own keys.
  • Full (I trust fully): is used for keys you really trust. Anyone who trusts you will also trust all of your fully trusted keys. Be careful with it.
  • Marginal (I trust marginally): if you set a key as marginally trusted, it is as if you only trust 1/3 of the keys it trusts. An example taken from here: if you set Alice’s, Bob’s and Peter’s keys to ‘Marginal’ and they all sign Ed’s key, Ed’s key will be valid.
  • Unknown: is the default state.
  • Undefined (Don’t know): has the same meaning as ‘Unknown’, but actively set by the user.
  • Never (I do NOT trust): same as ‘Unknown’/‘Undefined’, but meaning you know that the key owner does not accurately verify other keys before signing them.
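The counting rule behind ‘Full’ and ‘Marginal’ can be illustrated with a toy computation. This is a simplification of what GPG actually does (real validity also depends on certification depth and on the marginals-needed/completes-needed options, which default to 3 and 1):

```python
# Toy model: a key is valid if it is signed by at least one fully
# trusted key, or by at least three marginally trusted keys.
FULLS_NEEDED = 1
MARGINALS_NEEDED = 3

def key_is_valid(signers, trust):
    """signers: names that signed the key; trust: name -> 'full'/'marginal'."""
    fulls = sum(1 for s in signers if trust.get(s) == "full")
    marginals = sum(1 for s in signers if trust.get(s) == "marginal")
    return fulls >= FULLS_NEEDED or marginals >= MARGINALS_NEEDED

trust = {"alice": "marginal", "bob": "marginal", "peter": "marginal"}

# Ed's key is signed by three marginally trusted keys -> valid
print(key_is_valid(["alice", "bob", "peter"], trust))   # True
# Signed by only two of them -> not valid
print(key_is_valid(["alice", "bob"], trust))            # False
```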

To trust someone you must first import their key. You can import a raw file containing the public key with

> gpg --import john_doe.asc

or download it from a keyserver

> gpg --keyserver hkp://pgp.mit.edu --search-keys "john_doe@example.com"

Then, when you run --list-keys, the key you imported should be there.

Trusting keys is a serious issue. If you start trusting everyone without checking, you will probably end up being trusted as “Never” yourself. I recommend you only trust a key when you are sure it belongs to its owner and you have checked the key’s fingerprint with them.

Check the fingerprint of the key you want to trust

> gpg --fingerprint john_doe@example.com

Edit the key you want to trust

> gpg --list-keys
pub   3072D/930F2A9E 2017-06-20 [expires: 2017-06-21]
      Key fingerprint = ...
> gpg --edit-key 930F2A9E
pub  4096R/930F2A9E  created: 2017-06-20 expires: 2018-06-20 usage: SC  
gpg> trust

Then you must set the trust level, according to the list described above. See here for more details.


The aim of this post was to give an overview of GPG; the links I put here can serve as a bootstrap for further learning.

Linux Asynchronous I/O

Disclaimer: This post is a compilation of my study notes and I am not an expert on this subject (yet). I put the sources linked for a deeper understanding.

When we talk about asynchronous tasks, what comes to mind is almost always a task running in a separate thread. But, as I will make clear below, such a task is usually blocking, and not async at all.

Another interesting misunderstanding occurs between nonblocking and asynchronous (and likewise between blocking and synchronous): people tend to think these pairs of words are always interchangeable. We will discuss below why this is wrong.

The models

So here we will discuss the four I/O models available under Linux. They are:

  • blocking I/O
  • nonblocking I/O
  • I/O multiplexing
  • asynchronous I/O

Blocking I/O

The most common way to get information from a file descriptor is synchronous and blocking I/O. The application sends a request and waits until the data is copied from the kernel to user space.

An easy way to implement multitasking is to create several threads, each one making blocking requests.
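A sketch of that pattern in Python, using a pipe as the file descriptor: os.read blocks its thread until data arrives, while the main thread keeps running:

```python
import os
import threading

r, w = os.pipe()
received = []

def worker():
    # os.read blocks this thread until the writer side produces data
    received.append(os.read(r, 1024))

t = threading.Thread(target=worker)
t.start()

# The main thread is free to do other work, then feeds the pipe
os.write(w, b"hello")
t.join()
print(received)  # [b'hello']
os.close(r); os.close(w)
```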

Non-blocking I/O

This is a synchronous and non-blocking way to get data.

When a device is opened with the O_NONBLOCK option, any unsuccessful try to read from it fails immediately with the error EWOULDBLOCK or EAGAIN. The application then retries the read until the file descriptor is ready for reading.

This method is very wasteful of CPU, and maybe that is the reason it is rarely used.
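In Python the same behaviour can be observed by putting a pipe into nonblocking mode: a read with no data available fails immediately with EAGAIN (surfaced as BlockingIOError) instead of blocking:

```python
import os

r, w = os.pipe()
os.set_blocking(r, False)  # equivalent to setting O_NONBLOCK on the fd

try:
    os.read(r, 1024)       # nothing has been written yet
    outcome = "read returned"
except BlockingIOError:    # errno EAGAIN / EWOULDBLOCK
    outcome = "would block"

os.write(w, b"data")
data = os.read(r, 1024)    # now the fd is ready, so the read succeeds
print(outcome, data)       # would block b'data'
os.close(r); os.close(w)
```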

I/O multiplexing or select/poll

This method is an asynchronous and blocking way to get data.

The multiplexing, in POSIX, is done with the functions select() and poll(), which register one or more file descriptors to be read and then block the thread. As soon as a file descriptor becomes available, select returns and it is possible to copy the data from the kernel to user space.

I/O multiplexing is almost the same as blocking I/O, with the difference that it is possible to wait for multiple file descriptors to become ready at a time.
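A minimal select() example in Python, waiting on two pipes at once and reading whichever becomes ready:

```python
import os
import select

r1, w1 = os.pipe()
r2, w2 = os.pipe()

# Only the second pipe gets data
os.write(w2, b"ready")

# select blocks until at least one registered fd is readable
readable, _, _ = select.select([r1, r2], [], [])

results = {fd: os.read(fd, 1024) for fd in readable}
print(results == {r2: b"ready"})  # True: only r2 was ready
for fd in (r1, w1, r2, w2):
    os.close(fd)
```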

Asynchronous I/O

This one is the method that is, at the same time, asynchronous and non-blocking.

The async functions tell the kernel to do all the work and report back only when the entire message is ready in the kernel for the user (or already in user space).

There are two asynchronous models:

  • The fully asynchronous model, using the POSIX functions that begin with aio_ or lio_;
  • The signal-driven model, that uses SIGIO to signal when the file descriptor is ready.

One of the main differences between the two is that the first copies the data from the kernel to user space itself, while the second lets the user do that.

The confusion

The POSIX states that

Asynchronous I/O Operation is an I/O operation that does not of itself cause the thread requesting the I/O to be blocked from further use of the processor. This implies that the process and the I/O operation may be running concurrently.

So the third and fourth models are, indeed, asynchronous, the third being blocking, since after registering the file descriptors it waits for them.

I believe there is almost always room for discussion when it comes to the use of any term. But the very fact that there is divergence on whether blocking I/O and synchronous I/O are the same thing shows us that we have to be cautious when using these terms.

To finish, an image is worth a thousand words, and a table even more, so let us look at this:

                 blocking              nonblocking
synchronous      blocking I/O          nonblocking I/O
asynchronous     I/O multiplexing      asynchronous I/O

Further words

When dealing with asynchronous file descriptors, it is important to keep track of how many of them the application can have open at a time. The system-wide limit is easily checked with

$ cat /proc/sys/fs/file-max

Be sure to check it for the right user; the per-process limit can be seen with ulimit -n.

Other References

I/O Multiplexing, by shichao
Boost application performance using asynchronous I/O, by M. Jones
Asynchronous I/O and event notification on linux
The Open Group Base Specifications Issue 7, by The IEEE and The Open Group

Web crawlers 101

In this post I will show some interesting tools for developing web crawlers.

– Basic knowledge of Python
– Basic knowledge of HTML
– Basic knowledge of HTTP

For those who do not know, a web crawler is basically a piece of software that visits several web pages looking for some kind of information.

For example, suppose you want to find a job as a freelancer, say, developing crawlers. Maybe you already have other projects and do not want to keep wandering through job-listing sites every day. Wouldn’t it be nice to automate this process and even make it fun? And you would have a good experience to mention in your interview! 😉

Here we will build a simple crawler for 99freelas and grab every job that contains some keywords we will define. Yes, I know there is a search filter for that, but remember this post has didactic purposes only.

Shall we start?

To start, it is important to have some tool to inspect the code we are going to scrape. I particularly like to use Chrome DevTools, which you can understand much better here, but feel free to use the web development tools of your preference.

Using Chrome DevTools, go to the 99freelas page and press F12 on your keyboard. The new window that appears shows the DevTools I mentioned. Before proceeding, try to understand a bit of how this tool works!

But when will we start extracting the data?

Easy! First let us install the dependencies. I advise using a Python virtual environment:

$ pip install requests beautifulsoup4  

Now we can start our script to fetch data from the site:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.99freelas.com.br/projects'

# URL query-string parameters, used as filters
params = {
    'categoria': 'web-e-desenvolvimento',
}

# Making the request
resposta = requests.get(URL, params)

Before proceeding, let us play a bit with the response:

resposta.status_code
resposta.content
resposta.text

The first will show the HTTP status of the response. resposta.content and resposta.text both return the information fetched from the site, but the former gives the result in bytes and the latter in unicode.

We will use BeautifulSoup to explore the page’s HTML, which is much easier than writing a plain-text parser for resposta.content.

Let us fetch the first entry of the table on the 99freelas site. Pressing Ctrl+Shift+C and clicking on the first entry of the table, the element inspector will probably appear on the screen. In the HTML we will look for the <div class="projects-result"> tag and, inside it, fetch each element of the list:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.99freelas.com.br/projects'

# Search filters on the site, which become the URL query string
params = {
    # In the URL this will appear as ?categoria=web-e-desenvolvimento
    'categoria': 'web-e-desenvolvimento',
}

# Making the request to the full URL:
# https://www.99freelas.com.br/projects?categoria=web-e-desenvolvimento
resposta = requests.get(URL, params)

# Because with BeautifulSoup it is easier to analyze the page
site = bs(resposta.content, 'html.parser')

# We extract the table with BeautifulSoup
tabela = site.find(attrs={'class': 'projects-result'})

# And now we extract all the list elements in the table
vagas = tabela.find_all('li')

Each BeautifulSoup element is explorable. I advise playing a bit with the BeautifulSoup elements before we proceed…

Let us select all the jobs from clients with four stars or more whose description contains the word “crawler”.

# ... Continuing...
vagas = tabela.find_all('li')

# Which terms will be searched?
termos = ['crawler', 'spider']
vagas_interessantes = []

# And we sweep all the jobs looking for the selected terms
for vaga in vagas:
    # Look up the description class (note the conversion to text!)
    descricao = vaga.find(attrs={'class': 'description'}).text
    avaliacao = float(vaga.find(attrs={'class': 'avaliacoes-star'})['data-score'])
    link_vaga = vaga.find(attrs={'class': 'title'}).find('a')['href']

    if any(termo in descricao for termo in termos) and avaliacao >= 4:
        vagas_interessantes.append({
            'link_vaga': link_vaga,
            'descricao': descricao,
        })
All the interesting jobs (with their links) will be in the vagas_interessantes list. We can now paginate the search, create some kind of notification when an interesting job appears, among countless other possibilities. Have fun!
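As a starting point for pagination, here is a sketch that only builds the paginated URLs. It assumes, hypothetically, that the site accepts a `pagina` query-string parameter; inspect the real site's pagination before relying on this:

```python
from urllib.parse import urlencode

URL = 'https://www.99freelas.com.br/projects'

def paginated_urls(base_url, params, pages):
    """Build one URL per results page (hypothetical 'pagina' parameter)."""
    urls = []
    for page in range(1, pages + 1):
        # Copy the filters and add the page number to the query string
        query = dict(params, pagina=page)
        urls.append(base_url + '?' + urlencode(query))
    return urls

urls = paginated_urls(URL, {'categoria': 'web-e-desenvolvimento'}, 3)
# Each URL could then be fetched with requests.get and parsed as above
print(urls[0])
# https://www.99freelas.com.br/projects?categoria=web-e-desenvolvimento&pagina=1
```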