Lucas' Blog

cat /dev/thoughts

A Basic Guide About Encodings

As Joel Spolsky well said[1], it is almost unacceptable for a developer to be reckless enough to ignore the existence of encodings, when everything we deal with is information and encodings are the basis for representing that information.

I have faced lots of problems rooted in encodings without being aware of it, and even when I realized encodings existed, I was told the subject was almost impossible to learn. It is indeed very confusing, but I found out that much of the confusion comes from the terminology and from how inconsistently many sources use it.

This post will lay out a basic structure and a few definitions, and then work through examples and the most common questions about encodings. Feel free to suggest more questions in the comments.

A basic structure

To understand what encodings are, it is necessary to understand the flow of information in which the idea of a character turns into bytes (and vice versa). RFC 2130[2] defines three abstraction levels; here, based on the Unicode model[3], we will put one level before the others and make them four, to improve the understanding:

  1. Abstract Character Repertoire (ACR)
  2. Coded Character Set (CCS)
  3. Character Encoding Scheme (CES)
  4. Transfer Encoding Syntax (TES)

The idea of a letter, for example, passes through these four layers to become a piece of information manageable by a computer. Let's see what each of these layers means and, by the end, illustrate them with a few examples.

Abstract Character Repertoire

An abstract character, or simply a character, is a minimal unit of text that has semantic value[5]. Thus, an Abstract Character Repertoire is a set of characters to be encoded, such as an alphabet or a symbol set.

{, ~, ã, and a are examples of characters. But a character is an abstract definition: the character is not this particular a, but any representation of that letter with the same meaning, so its glyphs may differ, in another font, for instance. It is also possible for apparently identical glyphs to represent two different characters, like the Latin letter A and the Greek uppercase alpha Α (check it, they have different codes).

Examples of ACRs are the Unicode/10646 repertoire (below we will explain why we treat Unicode and ISO 10646 as the same) and the Western European alphabets and symbols of Latin-1 (CS 00697). We must pay attention that these standards sometimes use the same name for the repertoire and for the character set.

Coded Character Set

A Coded Character Set (CCS) is a mapping from a set of abstract characters to a set of integers. Examples of coded character sets are ISO 10646, US-ASCII, and ISO-8859 series.[2]

Character Encoding Scheme

A Character Encoding Scheme (CES) is a mapping from a Coded Character Set or several coded character sets to a set of octets. Examples of Character Encoding Schemes are ISO 2022 and UTF-8. A given CES is typically associated with a single CCS; for example, UTF-8 applies only to ISO 10646.[2]

The Unicode Model[3] defines two different layers within this one, but this post will consider it as only one for simplicity’s sake.

Transfer Encoding Syntax

It is frequently necessary to transform encoded text into a format which is transmissible by specific protocols. The Transfer Encoding Syntax (TES) is a transformation applied to character data encoded using a CCS and possibly a CES to allow it to be transmitted. Examples of Transfer Encoding Syntaxes are Base64 Encoding, gzip encoding, and so forth.

Examples

Suppose we have the character Á, representing the Latin letter A with an acute accent. It belongs to several different character repertoires, for example, the Unicode/10646 and Latin-1 repertoires. Let's use Unicode/10646.

Looking into the Coded Character Set table (also called a code page, charset table, etc.) we find that the integer representing it is U+00C1 (the Unicode standard puts U+ before all code point representations). The next step is to convert it into octets using a Character Encoding Scheme (this is what "encoding a string" means): with UTF-8 we get 0xC381, with UTF-16 (little-endian, with a byte order mark) we get 0xFFFEC100, and so on for any other encoding scheme defined for the chosen CCS. For this example, let's use UTF-8.

The use of a Transfer Encoding Syntax is usually tied to the transmission medium. MIME, for instance, may define the TES in the Content-Transfer-Encoding header[4]; with Base64, we would have w4E=.

Normally we deal mostly with the ACR, CCS, and CES, calling it the encoding process, and leave the TES to be handled by the machine.

So we have:

  1. (ACR) Abstract character Á
  2. (CCS) Code point U+00C1
  3. (CES) The octets 0xC381
  4. (TES) Base64 w4E=

The decoding just follows the inverse path.
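To make the flow concrete, here is a minimal sketch in Go (Go string literals are UTF-8-encoded, so the CES step is simply the conversion to bytes):

package main

import (
    "encoding/base64"
    "fmt"
)

func main() {
    s := "Á" // 1. (ACR) the abstract character

    // 2. (CCS) the Unicode code point assigned to it
    for _, r := range s {
        fmt.Printf("code point: U+%04X\n", r) // U+00C1
    }

    // 3. (CES) the octets produced by the UTF-8 encoding scheme
    fmt.Printf("UTF-8 octets: 0x%X\n", []byte(s)) // 0xC381

    // 4. (TES) a transfer encoding, here Base64 over the UTF-8 octets
    fmt.Println("base64:", base64.StdEncoding.EncodeToString([]byte(s))) // w4E=
}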

Some elements that may cause confusion

The explanation above seems quite simple, so where does the confusion live? I will try to list a few sources of it below.

The charset name in MIME header

According to the RFC 2130[2]

The term ‘Character Set’ means many things to many people. Even the MIME registry of character sets registers items that have great differences in semantics and applicability.

This causes a lot of confusion, like when you read Content-Type: text/plain; charset=ISO-8859-1 in an HTML page, because “in MIME, the Coded Character Set and Character Encoding Scheme are specified by the Charset parameter to the Content-Type header field”[2].

What are the other Coded Character Sets? I only know Unicode!

There are lots of CCSs. In the beginning, there were standards like EBCDIC and ASCII (whose names are the same across all the layers), and other CCSs that complemented ASCII for other languages arose later, like Cyrillic (ISO 8859-5) and Latin-1 (ISO 8859-1). The effort to unify them gave birth to Unicode and ISO 10646, which wrapped the other standards. There is also another CCS standard used in China, called GB18030[9].

It is important to emphasize that Unicode and ISO 10646 are now synchronized[8], so the names are almost interchangeable.

Are there CCSs inside other CCSs? Is Latin-1 inside Unicode?

As the technical report for Unicode Character Encoding Model[3] notes:

Subsetting is a major formal aspect of ISO/IEC 10646. The standard includes a set of internal catalog numbers for named subsets and further makes a distinction between subsets that are fixed collections and those that are open collections, defined by a range of code positions. Open collections are extended any time an addition to the repertoire gets encoded in a code position between the range limits defining the collection. When the last of its open code positions is filled, an open collection automatically becomes a fixed collection.

So, there are various Latin CCSs, and there is also a subset of Unicode/10646 called Latin. It is not that Latin-1 (ISO 8859-1) itself is inside Unicode, but all of its characters are, and they form a collection.

And what is “plain text”?

The use of the expression "plain text" is controversial[1], but it is still widely used, and it may be useful to at least try to understand what it should mean.

When ASCII was first adopted, a byte had 256 possible code positions but only 128 were used, which left 128 spare positions. So a lot of people had a lot of different ideas about what to do with the spare positions: Western Europe created the Latin series, the Russians created Cyrillic, and almost every other culture made its own version. Despite this Tower of Babel, almost all of these encodings preserved the ASCII characters with the same octets. Newer encodings like UTF-8 also followed this convention.

According to Jukka Korpela[6]

ASCII has been used and is used so widely that often the word ASCII refers to “text” or “plain text” in general, even if the character code is something else! The words “ASCII file” quite often mean any text file as opposed to a binary file.

However, be aware that using ASCII characters does not guarantee that your string will be read correctly by everyone.

Conclusion

It seems encodings are not so hard to understand after all, right? Next time your text looks like a mess on the screen, you will know what to do.

Also, the next time you have to choose which encoding to use, remember that the recommendation of RFC 2130 is to use the CCS ISO 10646 and the encoding UTF-8 as defaults.

References

[1] The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky
[2] RFC 2130 - The Report of the IAB Character Set Workshop
[3] Unicode Character Encoding Model
[4] RFC 1341 - The Content-Transfer-Encoding Header Field
[5] Wikipedia - Character encoding
[6] A tutorial on character code issues, by Jukka “Yucca” Korpela
[7] RFC 3629 - UTF-8, a transformation format of ISO 10646
[8] Wikipedia - Unicode
[9] Wikipedia - GB 18030

Building a Chat From Scratch With Go and MQTT

Recently I started learning Go and about messaging protocols, and as I think it is easier to learn something while putting it into practice, I made a very basic chat that I want to share with you.

The Go documentation is, at the same time, complete and scarce. Every package has its own GoDocs, but sometimes with very little explanation, so I needed to dig further to better understand the MQTT Go library. I thought I could write a blog post to share what I learned and expand the community's material.

For this tutorial, I used the Go language, the Go package for the MQTT protocol, and RabbitMQ with its MQTT plugin; the code was inspired by this example.

What about all those MQs?

First, I need to explain the difference between brokers and protocols. There are three main messaging protocols: AMQP, MQTT, and STOMP. Each of them has its pros and cons, and situations in which it is the better choice.

Then there are the brokers, like RabbitMQ, Mosquitto, NSQ, and ZeroMQ. Brokers are programs that may implement one or more messaging protocols and are used to pass messages between the publisher and the subscriber (we will understand these names better below).

In this tutorial, I used MQTT v3.1.1 with RabbitMQ. I also played with AMQP but found MQTT much easier and more interesting, especially because of its use cases (IoT).

The setup

First I had to install Go, then I installed RabbitMQ with

sudo apt-get install rabbitmq-server

and installed these two Go libraries

go get github.com/akamensky/argparse
go get github.com/eclipse/paho.mqtt.golang

Then I just enabled the MQTT plugin on RabbitMQ and created two users (user1 and user2, for which username and password are the same)

sudo rabbitmq-plugins enable rabbitmq_mqtt
rabbitmqctl add_user user1 user1
rabbitmqctl add_user user2 user2
rabbitmqctl set_permissions -p / user1 ".*" ".*" ".*"
rabbitmqctl set_permissions -p / user2 ".*" ".*" ".*"

The code

The main loop first gets the user and password from the input arguments and assembles the full service URL:

func main() {
    user, passwd := parseUserArgs()
    fullUrl := fmt.Sprintf("mqtt://%s:%s@localhost:1883/test", user, passwd)
    uri, err := url.Parse(fullUrl)
    failOnError(err, "Failed to parse given URL")

    forever := make(chan bool)
    go listen(uri)
    go poolMessage(uri, user)
    <-forever
}

Then a channel is created to keep the program running, and two goroutines are started: one to listen to messages from the broker (the subscriber), and another to read messages from the input and send them to the broker (the publisher).

The subscriber side is very simple: it creates a client connected to the given URI and, every time a message arrives, the callback function is called and prints it on the screen:

func showMessage(client mqtt.Client, msg mqtt.Message) {
    fmt.Printf("* %s\n", string(msg.Payload()))
}

func listen(uri *url.URL) {
    client := connect(uri)
    client.Subscribe(parseTopic(uri), QOS_AT_MOST_ONCE, showMessage)
}

The first parameter of Subscribe is the topic, which is like the channel we are listening to. Topics are very interesting and can even have a hierarchy, which makes it easier to broadcast messages and share specific rules; more about topics can be read here. The topic we are using here is the path of our URI, test.
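For instance, with a hypothetical hierarchy like house/kitchen/temperature, the standard MQTT wildcards can be used when subscribing (+ matches exactly one level, # matches all remaining levels):

client.Subscribe("house/kitchen/temperature", QOS_AT_MOST_ONCE, showMessage) // one specific sensor
client.Subscribe("house/+/temperature", QOS_AT_MOST_ONCE, showMessage)      // the temperature of any room
client.Subscribe("house/#", QOS_AT_MOST_ONCE, showMessage)                  // everything under house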

The second argument is the Quality of Service, that is, the level of confidence that our message will be delivered. There are three levels (the constants used in these snippets are sketched after the list):

  • At most once (0): the level we used in our chat, also known as fire-and-forget. It sends a message and doesn't wait for any kind of confirmation; thus, messages are delivered once or not at all;
  • At least once (1): the message is delivered and, if no acknowledgment comes back after a while, it is sent again; thus, messages are delivered one or more times;
  • Exactly once (2): this is the slowest QoS because it has a four-part handshake, which assures the message is delivered exactly once, no more, no less.
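The QOS_AT_MOST_ONCE constant used throughout these snippets is part of the boilerplate not shown in the post; presumably it is defined along these lines (paho's Subscribe and Publish take the QoS as a byte):

const (
    QOS_AT_MOST_ONCE  byte = 0 // fire-and-forget
    QOS_AT_LEAST_ONCE byte = 1 // acknowledged, may duplicate
    QOS_EXACTLY_ONCE  byte = 2 // four-part handshake
)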

The third argument is the callback function, in this case showMessage, which is called with the client and the message to be printed.

The publisher, in its turn, waits until a message is typed and then sends it to the broker; this is poolMessage:

func sendMessage(msg string, uri *url.URL) {
    client := connect(uri)
    RETAIN_MESSAGE := false
    client.Publish(parseTopic(uri), QOS_AT_MOST_ONCE, RETAIN_MESSAGE, msg)
}

func poolMessage(uri *url.URL, user string) {
    for {
        r := bufio.NewReader(os.Stdin)
        msg, _ := r.ReadString('\n')
        msg = fmt.Sprintf("%s: %s", user, strings.TrimSpace(msg))
        sendMessage(msg, uri)
    }
}

I used bufio, and not fmt.Scanf, because I found it a lot easier to read input containing spaces from the terminal with the former. After reading the input, the message is passed to a function that sends it.

Publish's parameters are very similar to Subscribe's; the only difference is RETAIN_MESSAGE. When this argument is set to true, the broker stores the last message, and every time a new user subscribes, it receives that retained message.

I experimented with setting the parameter to true to see how it worked and had some trouble trying to remove the retained message afterwards. As I did not want to receive that message every time I connected, I discovered I had to overwrite it with a null value.

I tried to publish a message with a null value, without success, and searched for a RabbitMQ client interface, but didn't find any either. The solution I ended up with was using the Mosquitto client

mosquitto_pub -n -r -t "test"

Here I publish (mosquitto_pub) a null message (-n) as retained (-r) to the topic "test" (-t "test").

But I'm still uncomfortable not knowing how to do that with Go's MQTT library.
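For the record, the MQTT 3.1.1 specification says that a retained PUBLISH with a zero-byte payload removes the existing retained message, so in principle the same cleanup should be possible from Go; here is a sketch I have not verified against RabbitMQ:

token := client.Publish("test", QOS_AT_MOST_ONCE, true, "") // retained flag + empty payload
token.Wait()
failOnError(token.Error(), "Failed to clear the retained message")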

The last piece of this code is the connect function, which is quite simple: it builds a few options (mqtt.ClientOptions) and connects a new client. The loop with token.WaitTimeout waits until the connection is established, checking it every microsecond:

func connect(uri *url.URL) mqtt.Client {
    opts := createClientOptions(uri)
    client := mqtt.NewClient(opts)
    token := client.Connect()
    for !token.WaitTimeout(time.Microsecond) {
    }
    failOnError(token.Error(), "Failed while connecting")
    return client
}

The options are built using the data passed in the URI, telling where the broker is, who the user is, and what their password is. It is also possible to call SetClientID to keep a session for a unique client (I did not use it here for simplicity's sake):

func createClientOptions(uri *url.URL) *mqtt.ClientOptions {
    password, _ := uri.User.Password()
    name := uri.User.Username()
    opts := mqtt.NewClientOptions()
    opts.AddBroker(fmt.Sprintf("tcp://%s", uri.Host))
    opts.SetUsername(name)
    opts.SetPassword(password)
    return opts
}

There was some boilerplate I just skipped. The complete code can be seen here.
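For reference, the skipped helpers are presumably along these lines (an assumption based on how they are used above, not the actual code; they rely on the standard log and strings packages):

func failOnError(err error, msg string) {
    if err != nil {
        log.Fatalf("%s: %s", msg, err)
    }
}

func parseTopic(uri *url.URL) string {
    // The topic is the path component of the URI: "test" in mqtt://.../test
    return strings.TrimPrefix(uri.Path, "/")
}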

Conclusion

This is quite a simple tutorial, and I know it did not cover the whole subject. Feel free to share any ideas or questions.

Understanding Private Proxies for Crawling

Disclaimer: since I started working with web crawling, I have noticed it can be used in unethical ways. The line is not always clear, and it depends a lot on interpretation. I wrote this with the intention of helping people learn, so be responsible with this knowledge.

Along with CAPTCHAs, reCAPTCHAs, and fingerprinting, IP blocking is one of the techniques sites use most to keep crawlers away.

The main way to bypass this protection is to use proxies. However, using a public proxy is almost always inefficient because of blacklists, which keep track of these public IPs.

In this context, some people feel inclined to use Tor as the proxy, but besides being very slow, Tor is a resource aimed at helping people under freedom constraints, and you should think twice before using it for trivial purposes.

Then you decide to rent private proxies, but there are many other things you should know to make the best decision. I will discuss a few of them below.

Proxy protocol

The most used protocol for proxies is HTTP/HTTPS, but sometimes you want to connect using another protocol, such as FTP, ICQ, Whois, etc., and then you should use a TCP-layer proxy protocol, which is the SOCKS protocol.

There are mainly two types of SOCKS: SOCKS4, which supports only TCP, and SOCKS5, which supports both TCP and UDP, is more secure, and also supports authentication.

SOCKS is very powerful and can also be used to connect to HTTP/HTTPS services. However, there are not many providers on the internet that let us connect to whatever destination port we want; it is almost always restricted to ports 443 and 80.

Proxy exclusivity

There are three levels of proxy exclusivity: public, shared and dedicated.

The public ones, as we said before, are used by lots of people and thus get blocked very fast, either because they are in a blacklist or simply because the host notices them sooner (since many people have probably used them for the same purpose).

Shared proxies are used by more than one user and have some of the public proxies' downsides, but they can be a good choice if you intend to use them on a not-so-popular domain.

Dedicated proxies are the most expensive of all, but they also give the best results in the general case, and they are highly recommended for popular domains such as social media and e-commerce.

Connection type

When talking about crawlers, you usually connect to a proxy using some library (like Python Requests) and must pass the proxy you want to use in the function parameters. Usually, proxy providers give you an IP and port to connect to, like 123.1.12.10:8080. Depending on the authentication method, you may also need to include your credentials, like user:passwd@123.1.12.10:8080.
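As a minimal sketch of the same idea in Go with net/http (the proxy address and credentials are placeholders):

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // Hypothetical proxy with username/password authentication.
    proxyURL, err := url.Parse("http://user:passwd@123.1.12.10:8080")
    if err != nil {
        panic(err)
    }

    // Route every request of this client through the proxy.
    client := &http.Client{
        Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
    }

    // ipify echoes the IP the request came from, i.e. the proxy's exit IP.
    resp, err := client.Get("https://api.ipify.org")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}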

However, the connection to this IP can happen in various ways. The most common is the direct connection: if you connect to that proxy and then make a request to, for instance, ipify, the answer must be the same IP you connected to. When the proxy provider adopts this approach, they tend to give you an endpoint with the list of your proxies.

When these providers want to give you a plethora of IPs, they cannot hand you every IP in their database, so they give you a socket (IP + port) that works like a gateway to their service. That's how back-connected and rotating proxies work.

Back-connected proxies are just various proxies behind a single socket. If you connect to a back-connected proxy, let's say 123.1.12.10:8080, and then make a request to ipify, the result will not be the same as your socket.

Back-connected proxies can retrieve a proxy from a pool of randomized proxies or update your proxies periodically. The latter type is usually called a rotating proxy.

Network type

When you get internet at home, you must contract an Internet Service Provider, or ISP, which gives you an IP each time you connect. If you check your IP on the internet, you will see that it is registered as coming from some ISP.

There are other types of IPs, bought just to serve as proxies. They are known as datacenter IPs, and they are the most common kind of proxy sold in the market today.

Datacenter IPs have their downsides, one of them being that they are relatively easy to block when they come from well-known IP providers. The Luminati proxy service found a solution for this: it uses residential IPs from people who use Hola VPN, as said here.

Residential IPs use a back-connected socket (or Super Proxy, as Luminati calls it), in which every time you connect your IP is taken from a pool of millions of IPs.

There is another solution called mobile IPs, which is very similar to residential IPs but not as popular.

Other relevant features to check

While renting a proxy, there are still many other things to pay attention to.

Bandwidth: you must know how much data passes through the proxies and whether your proxy provider imposes any limit on data traffic.

Subnets: when talking about datacenter IPs, this may be the most important thing to pay attention to, because lots of firewalls block the entire subnet you are in.

Replacement: you need an option to replace your proxies if they get blocked, even if the replacement is only done once a month. Otherwise, you will pay for a service you cannot use.

IP location: this is mostly related to the crawled sites; some of them (especially government sites) only allow access from a given location.

Authentication system: most proxy services allow authentication by IP or by username + password. However, when the authentication uses IPs, you must check whether the number of allowed IPs is limited.

Customer service: to me, this is one of the most important features, since you will almost always have some doubt or problem while using the proxy service.

My experience with proxies

This post is about a topic that changes a lot, and the experiences below may be out of date.

I have used a few proxy services over the last year and will share my impressions below. Most of the proxies I did not use were ruled out because they did not answer my tickets or took several days to do so.

Blazing proxies: we still use it in most of our applications because it supports SOCKS5 (and we can choose the port we connect to), has lots of subnets, a fair price, fair customer service, and no bandwidth limitations.

Luminati: I was amazed by their product because they are the vanguard of residential proxies, which are virtually unblockable. Even their datacenter IPs are fresh and constantly replaced. They have a proxy management system you can configure in lots of ways. But their most incredible feature is their customer service, which was my best customer experience so far. Their downsides are the high prices (still fair for the product they sell) and the bureaucracy to contract some services, which is also fair, given the nature of those services.

RotatingProxies: a back-connected proxy service that changes its IPs every 5 minutes. They use residential IPs, but I found the proxies lacked quality and were almost always already blocked by the service I tried to connect to. They also take a long time to activate your account.

SharedProxies: a cheap service but with very few subnets. They also charge you if you try to connect using more than one IP.

[Edited 2019-06-03] ProxyRack: they have rotating residential proxies with cheap prices and thousands of IPs. The caveat is that when you need to cancel the service, you must send them an e-mail.

Conclusion

Proxies are almost always needed for large-scale web crawling, and it can be painful to choose the best service if you do not know what you need. You must be able to define your priorities and constraints and know the options in the market well; this post is a starting point for making that choice.

GnuPG Small Guide

GnuPG, or GNU Privacy Guard, is a piece of software that implements the OpenPGP standard, in which PGP stands for Pretty Good Privacy.

It is a tool used for (drum roll) privacy. By privacy, you should understand sending private messages between users, storing any data safely, being sure that a message has a given origin, and more.

For those who think that privacy is a waste of time, stuff for paranoid tech guys, I suggest reading this post on Stack Exchange.

Although we live in a free era, freedom is sometimes menaced by dictatorships and state surveillance, and we do not know when this privacy knowledge will go from "nice to have" to "must have".

PGP was created in 1991 by Phil Zimmermann to securely store messages and files, and no license was required for its non-commercial use. In 1997 it was proposed as a standard to the IETF, and thus OpenPGP emerged.

Uses for the GPG

GPG has some nice features, some used more frequently than others:

  • Signing and verifying commits: with this, you can always know the owner of a commit, as well as let others be sure you wrote some piece of code. The Git Horror Story is a tale that tries to instill the sense for signing/verifying commits into greenhorn programmers. That said, some people have their own issues with commit signatures.
  • Sending and receiving encrypted messages: remember when, as a kid, you created encoded messages and thought no one would ever understand them¹? Well, bad news: everyone could read them. But now you can create messages that no one will understand², for real (a small example follows the notes below).
  • Checking signatures: it is possible to know (with a high degree of confidence) that some file comes from someone specific. For instance, consider downloading Tor: it is a secure browser, but if you download a tampered copy, it will serve the exact opposite of its purpose. But calm down, no need to cry: with GPG you can check the integrity of any software.

Come with us to learn cool stuff!

¹ No? Oh… sorry for your childhood. :'(
² Maybe it is possible to break 4096-bit RSA encryption.
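As a taste of the last two items, here is a minimal round trip, assuming a hypothetical correspondent john_doe@example.com whose public key you have imported:

> gpg --encrypt --recipient john_doe@example.com message.txt   # creates message.txt.gpg
> gpg --decrypt message.txt.gpg                                # what the recipient runs
> gpg --verify some-software.tar.gz.asc some-software.tar.gz   # check a detached signature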

Creating your key

First of all, you must download and install the GPG tools. Then you can check the installation with

> gpg --version
gpg (GnuPG) 1.4.20
License GPLv3+: GNU GPL version 3 or later [...]

The next step is to create a key for your own use, which is very easy! The step-by-step can be seen here or here. Briefly, it is

> gpg --gen-key

And here are some things you must pay attention to when creating the key:

  • I would rather choose the RSA kind of key. I suggest reading this to understand why.
  • Be sure to choose the maximum bit length for your key if you want it to be safer. At the time of writing, the maximum is 4096.
  • It is nice to set an expiration date. You may happen to lose your key or to die (as we all will some day), and then it is good practice to let others know that the key is not in use anymore.
  • Avoid comments on your key, as they are redundant almost every time.
  • It is nice to use strong digest algorithms like SHA512. I suggest you understand and create a nice gpg.conf, like the sketch below, to make sure you are using the best configuration.
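A minimal sketch of the digest-related part of such a gpg.conf (these are real GnuPG options; the preference list shown is a common recommendation, not the only valid one):

# Use SHA512 when signing keys, and prefer strong digests for messages
cert-digest-algo SHA512
personal-digest-preferences SHA512 SHA384 SHA256
# Preferences recorded in newly created keys
default-preference-list SHA512 SHA384 SHA256 AES256 AES192 AES ZLIB BZIP2 ZIP Uncompressed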

Here you can find a lot of other good practices for your GPG day-to-day use.

Assuming you have finished creating your key, you can check it with

> gpg --list-keys
pub   4096R/746F8A65 2017-05-24 [expires: 2018-05-24]
      Key fingerprint = 014C F6E9 C2E0 12A2 4187  F108 178A C6CD 746F 8A65
uid                  Lucas Almeida Aguiar <lucas.tamoios@gmail.com>
sub   4096R/AFC85A01 2017-05-24 [expires: 2018-05-24]

As a brief summary: pub stands for "public" key; then you have the key length (4096 bits) with an R for RSA, a slash, and the short key ID; then the creation and expiration dates. The short key ID consists of the last 8 hex digits of your actual fingerprint. The uid is what you wrote a few minutes ago, when I told you not to write a comment. Let's talk about the subs later.

For now, pay attention: the uid is not enough for you to believe that people are who they say they are. Anyone can create a key with any name or e-mail. To be sure someone really is who they claim to be, you must check their fingerprint. We will cover this in more depth when discussing the web of trust.

Working with keys

Generally, you have not only your own keys but also other people's public keys, which you use to verify signatures and to send them encrypted stuff. You have the power to edit your key, and to change how you see others' keys, with

> gpg --edit-key 746F8A65

The hash in front of the command is just the short key ID of the key you want to edit.

> gpg --edit-key 746F8A65
gpg (GnuPG) 1.4.20; Copyright (C) 2015 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Secret key is available.

pub  4096R/746F8A65  created: 2017-05-24  expires: 2018-05-24  usage: SC
                     trust: ultimate      validity: ultimate
sub  4096R/AFC85A01  created: 2017-05-24  expires: 2018-05-24  usage: E
sub  4096R/B2CD6DC9  created: 2017-05-24  expires: 2018-05-24  usage: S
[ultimate] (1). Lucas Almeida Aguiar <lucas.tamoios@gmail.com>

Pay attention to the letters in the usage attribute; they mean:

  • S for sign
  • E for encrypt
  • C for certify

The certify usage is the most powerful of them because it can create, trust, and revoke keys.

gpg --edit-key allows you to change passwords, trust keys, sign keys, change the expiration date, and more.

Trusting keys

PGP has a decentralized trust model, called the web of trust. It allows you to trust keys even without a central server, unlike X.509. The kinds of trust you can set on keys are:

  • Ultimate: only used for your own keys.
  • Full (I trust fully): used for keys you really trust. Anyone who trusts you will also trust all of your fully trusted keys. Be careful with it.
  • Marginal (I trust marginally): if you set a key as marginally trusted, it is like you only trust 1/3 of the keys it trusts. An example taken from here: if you set Alice's, Bob's, and Peter's keys to "Marginal" and they all sign Ed's key, Ed's key will be valid.
  • Unknown: the default state.
  • Undefined (Don't know): has the same meaning as "Unknown" but actively set by the user.
  • Never (I do NOT trust): same as "Unknown/Undefined", but meaning you know that the key owner does not accurately verify other keys before signing them.

To trust someone, you must first import their key. You can import a raw file containing the public key with

> gpg --import john_doe.asc

or fetch it from a keyserver

> gpg --keyserver hkp://pgp.mit.edu --search-keys "john_doe@example.com"

Then when you --list-keys, the key you imported should be there.

Trusting keys is a serious issue. If you start to trust everyone without checking, you will probably end up being trusted as "Never" yourself. I recommend you only trust a key when you are sure it belongs to its owner and have checked the key's fingerprint with them.

Check the fingerprint of the key you want to trust

> gpg --fingerprint john_doe@example.com

Edit the key you want to trust

> gpg --list-keys
pub   3072D/930F2A9E 2017-06-20 [expires: 2017-06-21]
      Key fingerprint = ...
> gpg --edit-key 930F2A9E
...
pub  3072D/930F2A9E  created: 2017-06-20  expires: 2017-06-21  usage: SC
...
gpg> trust

Then you must set the trust level, according to the list described above. See here for more details.

Conclusion

The aim of this post was to give an overview of GPG; the links I put here can serve as a bootstrap for further learning.

Linux Asynchronous I/O

Disclaimer: This post is a compilation of my study notes and I am not an expert on this subject (yet). I put the sources linked for a deeper understanding.

When we talk about asynchronous tasks, what comes to our minds is almost always a task running in a separate thread. But, as I will make clear below, this task is usually blocking, and not async at all.

Another interesting misunderstanding occurs between nonblocking and asynchronous (and between blocking and synchronous as well): people tend to think these pairs of words are always interchangeable. We will discuss below why this is wrong.

The models

So here we will discuss the four I/O models available under Linux. They are:

  • blocking I/O
  • nonblocking I/O
  • I/O multiplexing
  • asynchronous I/O

Blocking I/O

The most common way to get information from a file descriptor is synchronous and blocking I/O. The application sends a request and waits until the data is copied from the kernel to the user.

An easy way to implement multitasking is to create various threads, each one making blocking requests.

Non-blocking I/O

This is a synchronous and non-blocking way to get data.

When a device is opened with the O_NONBLOCK option, any unsuccessful try to read from it returns the error EWOULDBLOCK or EAGAIN. The application then keeps retrying until the file descriptor is ready for reading.

This method is very wasteful, and maybe that's the reason it is rarely used.
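A minimal sketch of this loop in Go, using the raw syscall package (the FIFO path is a placeholder; for a FIFO, EAGAIN is returned while a writer is connected but no data is available yet):

package main

import (
    "fmt"
    "syscall"
    "time"
)

func main() {
    // Open the descriptor in non-blocking mode.
    fd, err := syscall.Open("/tmp/myfifo", syscall.O_RDONLY|syscall.O_NONBLOCK, 0)
    if err != nil {
        panic(err)
    }
    defer syscall.Close(fd)

    buf := make([]byte, 4096)
    for {
        n, err := syscall.Read(fd, buf)
        if err == syscall.EAGAIN {
            // Not ready yet: sleep and retry. This polling is the wasteful part.
            time.Sleep(100 * time.Millisecond)
            continue
        }
        if err != nil {
            panic(err)
        }
        if n == 0 {
            break // end of file: the writer closed the FIFO
        }
        fmt.Printf("read %d bytes\n", n)
    }
}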

I/O multiplexing or select/poll

This method is an asynchronous and blocking way to get data.

Multiplexing, in POSIX, is done with the functions select() and poll(), which register one or more file descriptors to be read and then block the thread. As soon as a file descriptor becomes available, select returns and it is possible to copy the data from the kernel to the user.

I/O multiplexing is almost the same as blocking I/O, with the difference that it is possible to wait for multiple file descriptors to be ready at a time.
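A small sketch of the poll() variant in Go, through the golang.org/x/sys/unix package (watching only stdin here to keep it short; a real program would register many descriptors):

package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    // Register the file descriptors to watch for readability.
    fds := []unix.PollFd{
        {Fd: 0, Events: unix.POLLIN}, // stdin
    }

    // Blocks until at least one registered descriptor is ready (-1 = no timeout).
    n, err := unix.Poll(fds, -1)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%d descriptor(s) ready\n", n)

    if fds[0].Revents&unix.POLLIN != 0 {
        buf := make([]byte, 4096)
        // Only now is the data copied from the kernel to the user.
        n, _ := unix.Read(0, buf)
        fmt.Printf("read %d bytes\n", n)
    }
}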

Asynchronous I/O

This is the method that is, at the same time, asynchronous and non-blocking.

The async functions tell the kernel to do all the work and report back only when the entire message is ready in the kernel for the user (or already in user space).

There are two asynchronous models:

  • The fully asynchronous model, using the POSIX functions that begin with aio_ or lio_;
  • The signal-driven model, which uses SIGIO to signal when the file descriptor is ready.

One of the main differences between these two is that the first copies the data from the kernel to the user, while the second lets the user do that.

The confusion

POSIX states that

Asynchronous I/O Operation is an I/O operation that does not of itself cause the thread requesting the I/O to be blocked from further use of the processor. This implies that the process and the I/O operation may be running concurrently.

So the third and fourth models really are asynchronous, the third one being blocking, since after registering the file descriptors it waits for them.

I believe there is almost always room for discussion when it comes to the use of any term. But the mere fact that there is divergence on whether blocking I/O and synchronous I/O are the same thing shows that we have to be cautious when using these terms.

To finish, an image is worth a thousand words, and a table even more, so let us summarize the four models:

  • blocking I/O: synchronous and blocking
  • nonblocking I/O: synchronous and non-blocking
  • I/O multiplexing: asynchronous and blocking
  • asynchronous I/O: asynchronous and non-blocking

Further words

When dealing with asynchronous file descriptors, it is important to take into account how many of them the application can keep open at a time. The system-wide limit is easily checked with

$ cat /proc/sys/fs/file-max

Be sure to also check the per-process limit (ulimit -n) as the right user.

Other References

I/O Multiplexing, by shichao
Boost application performance using asynchronous I/O, by M. Jones
Asynchronous I/O and event notification on linux
The Open Group Base Specifications Issue 7, by The IEEE and The Open Group