Lucas' Blog

cat /dev/thoughts

Understanding Private Proxies for Crawling

Disclaimer: Since I started working with web crawling, I have noticed that it can be used in unethical ways. The line is not always clear, and it depends a lot on interpretation. I wrote this to help people learn, so please be responsible with this knowledge.

Along with CAPTCHAs, reCAPTCHA and fingerprinting, IP blocking is one of the techniques sites use most often to keep crawlers away.

The main way to bypass this protection is to use proxies. However, using a public proxy is almost always ineffective, because blacklists keep track of these public IPs.

In this context, some people feel inclined to use Tor as the proxy. Besides being very slow, Tor is a resource meant to help people whose freedom is restricted, so you should think twice before using it for trivial purposes.

So you decide to rent private proxies, but there are many other things you should know to make the best decision. Below I discuss a few that may help you.

Proxy protocol

The most used protocol for proxies is HTTP/HTTPS, but sometimes you want to connect using another protocol, such as FTP, ICQ or Whois. In that case you should use a proxy protocol that works at the TCP level, which is SOCKS.

There are two main versions of SOCKS: SOCKS4, which supports only TCP, and SOCKS5, which supports both TCP and UDP and adds authentication, which makes it the more secure of the two.

SOCKS proxies are very powerful and can also be used to connect to HTTP/HTTPS services. However, few providers let us connect to whatever destination port we want; the destination is almost always restricted to ports 443 and 80.
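As a rough sketch of what using SOCKS5 for a non-HTTP protocol can look like, the snippet below sends a Whois query through a SOCKS5 proxy. It assumes the third-party PySocks package is installed and that your provider allows destination port 43. The function names, the default server whois.iana.org and the proxy address in the usage note are illustrative choices, not something a given provider prescribes.

```python
import socket


def whois_query_bytes(domain: str) -> bytes:
    """Encode a Whois query: the domain name followed by CRLF."""
    return domain.strip().encode("ascii") + b"\r\n"


def whois_via_socks5(domain, proxy_host, proxy_port,
                     username=None, password=None,
                     whois_server="whois.iana.org", timeout=10.0):
    """Send a Whois query through a SOCKS5 proxy and return the raw answer.

    whois_server is an assumption for illustration; use the server that is
    authoritative for the TLD you are querying.
    """
    # PySocks is a third-party package (pip install PySocks). It is imported
    # here so that the pure helper above stays usable without it.
    import socks

    sock = socks.socksocket(socket.AF_INET, socket.SOCK_STREAM)
    sock.set_proxy(socks.SOCKS5, proxy_host, proxy_port,
                   username=username, password=password)
    sock.settimeout(timeout)
    try:
        # Whois listens on TCP port 43; the proxy must allow this port.
        sock.connect((whois_server, 43))
        sock.sendall(whois_query_bytes(domain))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks).decode("utf-8", errors="replace")
```

A call would look like whois_via_socks5("example.com", "123.1.12.10", 1080, "user", "passwd"), where the address and credentials are placeholders. If the provider only allows ports 80 and 443, the connect step is where this is expected to fail.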

Proxy exclusivity

There are three levels of proxy exclusivity: public, shared and dedicated.

Public proxies, as we said before, are used by lots of people and thus get blocked very fast, either because they are on a blacklist or simply because the host notices them sooner (probably because many people have already used them for the same purpose).

Shared proxies are used by more than one customer and have some of the downsides of public proxies, but they can be a good choice if you intend to crawl a domain that is not very popular.

Dedicated proxies are the most expensive of all, but they also give the best results in the general case, and they are strongly recommended for popular domains such as social media and e-commerce.

Connection type

When talking about crawlers, you usually connect to a proxy through some library (like Python's Requests) and pass the proxy you want to use as a function parameter. Proxy providers usually give you an IP and port to connect to, like 123.1.12.10:8080. Depending on the authentication method, you will also need to include your credentials, like user:passwd@123.1.12.10:8080.
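To make this concrete, here is a minimal sketch of how such a proxy string can be turned into the mapping that Requests accepts in its proxies parameter. The helper name build_proxies is made up for this post, and the IP, port and credentials are the placeholder values used above.

```python
from urllib.parse import quote


def build_proxies(host, port, user=None, password=None, scheme="http"):
    """Build a proxies mapping in the shape the Requests library expects.

    Credentials are percent-encoded so that characters such as '@' or ':'
    inside a password do not break the proxy URL.
    """
    auth = ""
    if user is not None:
        auth = quote(user, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    url = f"{scheme}://{auth}{host}:{port}"
    # The same proxy is used for both plain HTTP and HTTPS targets.
    return {"http": url, "https": url}
```

With Requests installed, a request through the proxy would then be something like requests.get("https://api.ipify.org", proxies=build_proxies("123.1.12.10", 8080, "user", "passwd"), timeout=10). Passing scheme="socks5" should also work for SOCKS proxies, provided Requests was installed with its SOCKS extra.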

However, the connection to this IP can happen in various ways. The most common is the direct connection: if you connect to that proxy and then make a request to, for instance, ipify, the answer is the same IP you connected to. When a proxy provider adopts this approach, they tend to give you an endpoint that returns the list of your proxies.

When providers want to give you a plethora of IPs, they cannot hand you every IP in their database, so they give you a socket (IP + port) that works as a gateway to their service. That’s how back-connected and rotating proxies work.

Back-connected proxies are simply many proxies reachable through a single socket. If you connect to a back-connected proxy, let’s say 123.1.12.10:8080, and then make a request to ipify, the IP in the result will not be the IP of your socket.

A back-connected proxy can pick a proxy at random from a pool, or change the proxy assigned to you periodically. The latter type is usually called a Rotating Proxy.
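The ipify check described above can be sketched as follows. This is only an illustration: fetch_ip stands for any function you supply that makes a request through the proxy and returns the IP the target site saw, and both function names are invented for this post.

```python
def sample_exit_ips(fetch_ip, attempts=5):
    """Call fetch_ip several times and collect the exit IPs it reports.

    fetch_ip is a caller-supplied function (an assumption of this sketch)
    that sends a request through the proxy to an IP echo service such as
    ipify and returns the IP address as a string.
    """
    return [fetch_ip() for _ in range(attempts)]


def classify_connection(gateway_ip, exit_ips):
    """Return 'direct' if every exit IP equals the IP we connected to,
    otherwise 'back-connected'."""
    distinct = set(exit_ips)
    if not distinct:
        raise ValueError("at least one observed exit IP is required")
    if distinct == {gateway_ip}:
        return "direct"
    return "back-connected"
```

With Requests, fetch_ip could be something like lambda: requests.get("https://api.ipify.org", proxies=proxies, timeout=10).text. Telling a random pool apart from a periodically rotating proxy would also need timestamps, which this sketch leaves out.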

Network type

When you install internet at home, you must sign a contract with an Internet Service Provider, or ISP, which gives you an IP each time you connect to the internet. If you look up your IP online, you will see that it is registered as belonging to that ISP.

There are other IPs that are bought just to serve as proxies. They are known as Datacenter IPs, and they are the most common type of proxy sold on the market today.

Datacenter IPs have their downsides, one of them being that IP ranges from well-known providers are relatively easy to block. The Luminati Proxy service came up with a solution for this: it uses residential IPs from people who use Hola VPN, as said here.

Residential IPs use a back-connected socket (or Super Proxy, as Luminati calls it), so every time you connect you get a different IP from a pool of millions of IPs.

There is another solution called Mobile IPs, which is very similar to Residential IPs but not as popular.

Other relevant features to check

When renting a proxy, there are still many other things to pay attention to.

Bandwidth: You must know how much data passes through the proxies and whether your proxy provider sets any limit on data traffic.

Subnets: For Datacenter IPs, this may be the most important feature to pay attention to, because lots of firewalls block the entire subnet your IP belongs to.

Replacement: You need an option to replace your proxies if they get blocked, even if replacement is only allowed once a month. Otherwise, you will pay for a service you cannot use.

IP location: This depends mostly on the sites you crawl; some of them (especially government sites) only allow access from IPs in a given location.

Authentication system: Most proxy services allow authentication by IP or by username + password. However, when authentication is by IP, check whether the number of IPs you can authorize is limited.

Customer service: To me, this is one of the most important features, since you will almost always have some question or problem while using the proxy service.
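For the subnet point above, a quick way to judge a proxy list is to count how many distinct subnets it spans. The sketch below groups IPv4 addresses by /24, which is an assumption about how a firewall might block; real firewalls may use wider or narrower ranges. The function name and the sample IPs are made up for illustration.

```python
import ipaddress
from collections import Counter


def subnet_counts(ips, prefix=24):
    """Count how many of the given IPv4 addresses fall into each subnet.

    prefix=24 groups addresses that share their first three octets; this is
    an illustrative default, not a rule every firewall follows.
    """
    counts = Counter()
    for ip in ips:
        # strict=False lets us pass a host address and get its network back.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts
```

For example, subnet_counts(["123.1.12.10", "123.1.12.77", "123.1.13.5"]) puts the first two addresses in 123.1.12.0/24 and the third in 123.1.13.0/24. A list where most proxies share one subnet can be lost in a single block.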

My experience with proxies

This post is about a topic that changes a lot, and the experiences below may be out of date.

I have used a few proxy services in the last year and share my impressions below. Most of the services I did not use were ruled out because they did not answer my tickets or took several days to do so.

Blazing proxies: We still use it in most of our applications because it supports SOCKS5 (and we can choose the port we connect to), has lots of subnets, a fair price, fair customer service and no bandwidth limits.

Luminati: I was amazed by their product because they are at the vanguard of Residential proxies, which are virtually unblockable. Even their Datacenter IPs are fresh and constantly replaced. They have a Proxy Management System that you can configure in lots of ways. But their most incredible feature is their customer service, which was my best customer experience so far. Their downsides are the high prices (which are still fair for the product they sell) and the bureaucracy involved in contracting some services, which is also fair given the nature of those services.

RotatingProxies: A back-connected proxy service that changes its IPs every 5 minutes. They use residential IPs, but I found the proxies were of low quality and were almost always already blocked by the service I tried to connect to. They also take a long time to activate your account.

SharedProxies: A cheap service, but with very few subnets. They also charge extra if you try to connect to it from more than one IP.

[Edited 2019-06-03] ProxyRack: They have rotating residential proxies with cheap prices and thousands of IPs. The caveat is that to cancel the service you must send them an e-mail.

Conclusion

Proxies are almost always needed for large-scale web crawling, and choosing the best service can be painful if you do not know what you need. You must be able to define your priorities and constraints and know the choices available in the market well; this post is a starting point for making that choice.
