Understanding private proxies for crawling

Disclaimer: Since I started working with web crawling, I have noticed that it can be used in unethical ways. The line is not always clear, and it depends a lot on interpretation. I wrote this to help people learn, so please be responsible with this knowledge.

Along with CAPTCHAs, reCAPTCHA and fingerprinting, IP blocking is one of the techniques sites use most often to keep crawlers away.

The main way to bypass this protection is to use proxies. However, using a public proxy is almost always ineffective because of blacklists, which keep track of these public IPs.

In this context, some people feel inclined to use Tor as the proxy. Besides being very slow, Tor is a resource meant to help people whose freedom is restricted, and you should think twice before using it for trivial purposes.

So you decide to rent private proxies, but there are many other things you should know to make the best decision. Below I discuss a few things that may help you.

Proxy protocol

The most used protocol for proxies is HTTP/HTTPS, but sometimes you need to connect over another protocol, such as FTP, ICQ or Whois. In that case you need a proxy that works at the TCP level, which is what the SOCKS protocol provides.

There are two main versions of SOCKS: SOCKS4, which supports only TCP, and SOCKS5, which supports both TCP and UDP and adds authentication, making it the more secure of the two.

SOCKS proxies are very powerful and can also be used to reach HTTP/HTTPS services. However, few providers let you connect to whatever destination port you want; almost always you are restricted to ports 443 and 80.
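
As a minimal sketch of how the protocol choice shows up in a crawler, the helper below builds the proxies mapping that the Python Requests library accepts. The name build_proxies is hypothetical (it is not part of any library), and the host and ports are placeholders taken from an address range reserved for documentation. Note that SOCKS support in Requests is an optional extra, installed with pip install requests[socks].

```python
def build_proxies(scheme, host, port):
    """Return a proxies mapping in the shape Requests expects.

    scheme is "http" for an HTTP proxy, or "socks5" / "socks5h" for SOCKS5
    ("socks5h" asks the proxy to resolve hostnames instead of the client).
    """
    if scheme not in ("http", "https", "socks5", "socks5h"):
        raise ValueError(f"unsupported proxy scheme: {scheme}")
    proxy_url = f"{scheme}://{host}:{port}"
    # The same proxy is used for both plain HTTP and HTTPS destinations.
    return {"http": proxy_url, "https": proxy_url}


# Placeholder address: 203.0.113.0/24 is reserved for documentation.
http_proxies = build_proxies("http", "203.0.113.10", 8080)
socks_proxies = build_proxies("socks5", "203.0.113.10", 1080)
```

Either mapping can then be passed to a request, for example requests.get(url, proxies=socks_proxies).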

Proxy exclusivity

There are three levels of proxy exclusivity: public, shared and dedicated.

Public proxies, as mentioned above, are used by lots of people and thus get blocked very quickly, either because they are on a blacklist or simply because the host notices them sooner (probably because many people have already used them for the same purpose).

Shared proxies are used by more than one customer and have some of the downsides of public proxies, but they can be a good choice if you intend to crawl a domain that is not very popular.

Dedicated proxies are the most expensive of all, but they also give the best results in the general case and are strongly recommended for popular domains such as social media and e-commerce sites.

Connection type

When writing crawlers, you usually connect to a proxy through some library (like Python Requests) and pass the proxy you want to use as a parameter of the request function. Proxy providers usually give you an IP and port to connect to, in the form IP:port. Depending on the authentication method, you will also need to put your credentials in front of the address, in the form user:passwd@IP:port.
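
The sketch below shows one way to assemble such a proxy address and hand it to Requests. The helper names, credentials and address are made up for illustration. Percent-encoding the credentials is my own precaution, since characters like @ or : inside a password would otherwise break the URL.

```python
from urllib.parse import quote


def proxy_url(host, port, user=None, password=None, scheme="http"):
    """Build scheme://user:passwd@host:port, percent-encoding the credentials."""
    auth = ""
    if user is not None and password is not None:
        # "@" or ":" inside the credentials would otherwise corrupt the URL.
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    return f"{scheme}://{auth}{host}:{port}"


def get_through_proxy(url, proxy, timeout=10):
    """Fetch url through the given proxy URL using Requests."""
    import requests  # imported here so proxy_url works without Requests installed

    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=timeout)


# Hypothetical credentials and a documentation-only address.
example = proxy_url("203.0.113.10", 8080, user="user", password="p@ss:word")
```

With IP-based authentication there are no credentials, so proxy_url("203.0.113.10", 8080) is enough.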

However, the connection through this IP can happen in several ways. The most common is the direct connection: if you connect to that proxy and then make a request to, for instance, ipify, the answer is the same IP you connected to. Providers that adopt this approach tend to give you an endpoint that returns the list of your proxies.

When providers want to give you access to a plethora of IPs, they cannot hand you every IP in their database, so they give you a socket (IP + port) that works as a gateway to their service. That’s how back-connected and rotating proxies work.

Back-connected proxies are simply many proxies reachable through a single socket. If you connect to a back-connected proxy and then make a request to ipify, the IP in the result will not be the IP of the socket you connected to.

A back-connected proxy can pick a proxy from a randomized pool for each connection or swap your proxies periodically. The latter type is usually called a rotating proxy.
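
One rough way to tell these connection types apart is to send a few requests through the same socket and compare the exit IPs that ipify reports. The sketch below assumes ipify's plain-text endpoint at https://api.ipify.org; the function names and labels are my own, and a short test cannot distinguish a slowly rotating proxy from a fixed one.

```python
def classify_gateway(gateway_ip, exit_ips):
    """Classify a proxy from the exit IPs observed over several requests.

    gateway_ip: the IP you connect to.
    exit_ips: the IPs reported by an IP-echo service such as ipify.
    """
    if not exit_ips:
        raise ValueError("need at least one observed exit IP")
    distinct = set(exit_ips)
    if distinct == {gateway_ip}:
        return "direct"
    if len(distinct) == 1:
        return "back-connected, stable exit IP"
    return "back-connected, changing exit IP"


def observe_exit_ips(proxy, attempts=5, timeout=10):
    """Ask ipify for our public IP several times through the same proxy."""
    import requests  # imported here so classify_gateway has no dependencies

    proxies = {"http": proxy, "https": proxy}
    return [
        requests.get("https://api.ipify.org", proxies=proxies, timeout=timeout).text.strip()
        for _ in range(attempts)
    ]
```

A call such as classify_gateway("203.0.113.10", observe_exit_ips("http://203.0.113.10:8080")) would then label the proxy, where the address is again only a placeholder.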

Network type

When you get internet at home, you sign up with an Internet Service Provider, or ISP, which assigns you an IP each time you connect. If you look up your IP, you will see it registered as belonging to some ISP.

Other IPs are bought just to serve as proxies. They are known as Datacenter IPs, and they are the most common type of proxy sold on the market today.

Datacenter IPs have their downsides, one of them being that they are relatively easy to block, since they come from well-known IP providers. The Luminati proxy service came up with a solution for this: it uses residential IPs from people who use Hola VPN.

Residential IPs are reached through a back-connected socket (or Super Proxy, as Luminati calls it), which changes your IP on every connection, drawing from a pool of millions of IPs.

Another option is Mobile IPs, which are very similar to Residential IPs but not as popular.

Other relevant features to check

When renting a proxy, there are still several other things to pay attention to.

Bandwidth: You need to know how much data will pass through the proxies and whether your proxy provider limits data traffic.

Subnets: For Datacenter IPs, this may be the most important thing to pay attention to, because many firewalls block the entire subnet your IP belongs to.

Replacement: You need an option to replace proxies that get blocked, even if replacement is only allowed once a month. Otherwise, you will pay for a service you cannot use.

IP location: This depends mostly on the sites being crawled; some of them (especially government sites) only allow access from IPs in a given location.

Authentication system: Most proxy services let you authenticate either by IP or with a username and password. When authentication is by IP, check whether the number of IPs you can authorize is limited.

Customer service: To me, this is one of the most important features, since you will almost always have some question or problem while using the proxy service.
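
The subnet point is easy to check once a provider hands you a proxy list. As a small sketch, the code below counts how many IPs fall into each subnet using Python's standard ipaddress module. Grouping by /24 is an assumption on my part (firewalls may block wider or narrower ranges), and the addresses are documentation-only placeholders standing in for a real list.

```python
import ipaddress
from collections import Counter


def subnet_counts(proxy_ips, prefix=24):
    """Count how many proxy IPs fall into each /prefix IPv4 subnet."""
    counts = Counter()
    for ip in proxy_ips:
        # strict=False lets us pass a host address and get its network back.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts


# Placeholder addresses standing in for a provider's proxy list.
proxy_ips = ["198.51.100.7", "198.51.100.42", "203.0.113.5"]
counts = subnet_counts(proxy_ips)
```

A list where most proxies share one or two subnets is more exposed to a single subnet-wide block than one spread across many.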

My experience with proxies

I have used a few proxy services over the last year and share my impressions below. Most of the services I ended up not using were ruled out because they did not answer my tickets or took several days to do so.

Blazing proxies: We still use it in most of our applications because it supports SOCKS5 (and we can choose the port we connect to), has lots of subnets, a fair price, fair customer service and no bandwidth limits.

Luminati: I was amazed by their product because they are at the forefront of Residential proxies, which in my experience are virtually unblockable. Even their Datacenter IPs are fresh and constantly replaced. They have a Proxy Management System that you can configure in lots of ways. But their most incredible feature is their customer service, which was my best customer experience so far. Their downsides are the high prices (which are still fair for the product they sell) and the bureaucracy involved in contracting some services, which is also fair given the nature of those services.

RotatingProxies: A back-connected proxy service that changes its IPs every 5 minutes. They use residential IPs, but I found the proxies to be of poor quality, and they were almost always already blocked by the service I was trying to reach. They also take a long time to activate your account.

SharedProxies: A cheap service, but with very few subnets. They also charge extra if you connect from more than one IP.


Proxies are almost always needed for large-scale web crawling, and choosing the best service can be painful if you do not know what you need. You must define your priorities and constraints and know the options on the market well; this post is a starting point for making that choice.

Web crawlers 101

In this post I will show some interesting tools for building web crawlers. To follow along, you should have:

– Basic knowledge of Python
– Basic knowledge of HTML
– Basic knowledge of HTTP

For those who do not know, a web crawler is basically a piece of software that visits many web pages looking for some kind of information.

For example, suppose you want to find work as a freelancer, say, building crawlers. Maybe you already have other projects and do not want to browse the job sites every day. Wouldn't it be nice to automate that process and make it fun as well? You would also get a good experience to mention in your interview! 😉

Here we will build a simple crawler for 99freelas and collect all the job postings that contain some keywords we define. Yes, I know the site has a search filter for that, but remember that this post is for teaching purposes only.

Shall we begin?

To start, it is important to have a tool for inspecting the code we are going to scrape. Personally, I like to use Chrome DevTools, but feel free to use the web development tools of your choice.

With Chrome DevTools, go to the 99freelas page and press F12 on your keyboard. The new panel that appears is the DevTools I mentioned. Before moving on, try to get a feel for how this tool works!

But when do we start extracting the data?

Hold on! First let's install the dependencies. I recommend using a Python virtual environment:

$ pip install requests beautifulsoup4

Now we can start our script to fetch the data from the site:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.99freelas.com.br/projects'

# Query string parameters of the URL, used as a filter
params = {
    'categoria': 'web-e-desenvolvimento',
}

# Making the request
resposta = requests.get(URL, params=params)

Before moving on, let's play a bit with the response.

resposta.status_code shows the HTTP status of the response. resposta.content and resposta.text both return the data fetched from the site, but the former gives the result as bytes and the latter as unicode text.
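
The difference between bytes and text can be seen without making any request. The sketch below uses a hand-made payload as a stand-in for resposta.content and decodes it explicitly, which is roughly what Requests does for resposta.text using the encoding it infers for the response. The UTF-8 encoding here is an assumption for the example.

```python
# Stand-in for resposta.content: the raw bytes a server might send.
content = "requisição concluída".encode("utf-8")

# Stand-in for resposta.text: the same payload decoded into a str.
text = content.decode("utf-8")

# Accented characters take more than one byte in UTF-8, so the byte
# string is longer than the decoded text.
print(type(content).__name__, len(content))
print(type(text).__name__, len(text))
```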

We will use BeautifulSoup to explore the page's HTML, which is much easier than writing a raw-text parser for resposta.content.

Let's fetch the first entry of the table on the 99freelas site. Press Ctrl+Shift+C and click the first entry of the table, and the element inspector should appear on screen. In the HTML we will look for the div <div class="projects-result"> and, inside that tag, fetch each element of the list:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.99freelas.com.br/projects'

# Search filters for the site, which make up the URL's query string
params = {
    # In the URL this becomes ?categoria=web-e-desenvolvimento
    'categoria': 'web-e-desenvolvimento',
}

# Making the request to the full URL:
# https://www.99freelas.com.br/projects?categoria=web-e-desenvolvimento
resposta = requests.get(URL, params=params)

# BeautifulSoup makes the page easier to analyze
site = bs(resposta.content, 'html.parser')

# Extract the table with BeautifulSoup
tabela = site.find(attrs={'class': 'projects-result'})

# And now extract every list element in the table
vagas = tabela.find_all('li')

Each BeautifulSoup element can be explored. I recommend playing a bit with the
BeautifulSoup elements before we continue…
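
To experiment without hitting the site, you can feed BeautifulSoup a small hand-written snippet. The markup below only imitates the class names used in this post (projects-result, title, description, avaliacoes-star); the tags, links and texts are invented, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Invented markup that imitates the class names used in this post.
SAMPLE_HTML = """
<div class="projects-result">
  <ul>
    <li>
      <h1 class="title"><a href="/project/crawler-de-precos">Crawler de preços</a></h1>
      <div class="description">Preciso de um crawler em Python.</div>
      <span class="avaliacoes-star" data-score="4.5"></span>
    </li>
    <li>
      <h1 class="title"><a href="/project/loja-virtual">Loja virtual</a></h1>
      <div class="description">Site de e-commerce.</div>
      <span class="avaliacoes-star" data-score="3.0"></span>
    </li>
  </ul>
</div>
"""

site = BeautifulSoup(SAMPLE_HTML, "html.parser")
tabela = site.find(attrs={"class": "projects-result"})
vagas = tabela.find_all("li")

# Each element supports the same find/find_all calls as the whole page.
primeira = vagas[0]
titulo = primeira.find(attrs={"class": "title"}).get_text(strip=True)
link = primeira.find(attrs={"class": "title"}).find("a")["href"]
nota = float(primeira.find(attrs={"class": "avaliacoes-star"})["data-score"])
```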

Let's select every posting whose client has four stars or more and whose description contains one of the terms we care about, such as “crawler”.

# ... Continuing...
vagas = tabela.find_all('li')

# Which terms will we search for?
termos = ['crawler', 'spider']
vagas_interessantes = []

# Scan every posting looking for the selected terms
for vaga in vagas:
    # Look up the description class (note the conversion to text!)
    descricao = vaga.find(attrs={'class': 'description'}).text
    avaliacao = float(vaga.find(attrs={'class': 'avaliacoes-star'})['data-score'])
    link_vaga = vaga.find(attrs={'class': 'title'}).find('a')['href']

    # Keep postings that mention any term and have four stars or more
    if any(termo in descricao for termo in termos) and avaliacao >= 4:
        vagas_interessantes.append({
            'link_vaga': link_vaga,
            'descricao': descricao
        })
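
The filtering rule can also be pulled out into a small function, which makes it easy to try on sample data without any request. This is only a sketch: the name vaga_interessante is hypothetical, and matching case-insensitively is my own addition, not something the loop above does.

```python
def vaga_interessante(descricao, avaliacao, termos, nota_minima=4.0):
    """Return True when the description mentions any of the terms and the
    client's rating is at least nota_minima."""
    texto = descricao.lower()
    menciona_termo = any(termo.lower() in texto for termo in termos)
    return menciona_termo and avaliacao >= nota_minima


termos = ["crawler", "spider"]
```

Inside the loop, the condition would then read if vaga_interessante(descricao, avaliacao, termos):.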

All the interesting postings (with their links) will be in the vagas_interessantes list. From here we can paginate the search, create some kind of notification when an interesting posting shows up, among countless other possibilities. Have fun!
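
As a starting point for pagination, the sketch below generates one URL per results page. The page parameter name is a guess on my part, not something confirmed for 99freelas, so check the real query string in DevTools before relying on it.

```python
from urllib.parse import urlencode

URL = "https://www.99freelas.com.br/projects"


def page_urls(base_url, params, pages, page_param="page"):
    """Yield one URL per results page.

    page_param is an assumed name for the page number in the query string.
    """
    for number in range(1, pages + 1):
        query = urlencode({**params, page_param: number})
        yield f"{base_url}?{query}"


urls = list(page_urls(URL, {"categoria": "web-e-desenvolvimento"}, pages=3))
```

Each URL can then be fetched and parsed with the same code used for the first page.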