Working with Strings in Solidity

This is the first in a series of blog posts we’re bringing to you directly from the trenches, going into the nitty-gritty technical detail of some of the things we’re currently doing with the Protocol.

Today’s article comes from Alex Pinto, a recent addition to our blockchain engineering team who’s been spending the past few weeks getting up to speed on using Solidity, and will take us through some of the challenges and particularities of the language.

Today I give you a post about programming for the Ethereum blockchain using the Solidity language. I won’t follow any plan in doing this: my objective is only to write about my obstacles in learning this language and the practical difficulties I encounter in my daily work.

I want the freedom to write about any topic without having first to introduce preliminary material, as I’d have to do if I were writing a textbook. If you notice me talking about things I have not explained before, that is by design. Leave me a comment below and I’ll come back to them in a later post.

Basic access

Today, I want to talk about strings in Solidity. At first glance, Solidity is similar in syntax to JavaScript and other C-like languages. Because of that, a newcomer with a grounding in one of several widespread languages can quickly grasp what a Solidity program does. Nevertheless, the devil is in the proverbial details, and Solidity hides some unforeseen difficulties there. That is the case with the type string and the related type bytes.

Both of these are dynamic array types, which means that they can store data of an arbitrary size. Each element of a variable of type bytes is, unsurprisingly, a single byte. Each element of a variable of type string is, one would think, a character of the string. So far so good, but first appearances are deceiving. Someone coming from other languages might expect the string type to provide several useful functions, like:

-determining the string’s length
-reading or changing the character at a given location in the string
-joining two strings
-extracting part of a string

Bad news: Solidity’s string does none of this! If we need any of the above, we have to do it manually.

So, let’s explore some of these difficulties and see what we can do about them. I open Remix and type the following code in a new file called string.sol.
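A minimal sketch of such a file, reconstructed from the description that follows, might look like this. The contract name String and the names store and getStore come from the text; the setter name setStore, the licence comment and the version pragma are assumptions, as is the 0.5 or later compiler, which requires data locations to be written out explicitly.

```solidity
// SPDX-License-Identifier: MIT
// Assumption: any compiler from 0.5.0 onwards.
pragma solidity >=0.5.0 <0.9.0;

contract String {
    // State variable: always lives in storage.
    string store;

    // Copies the argument from memory into storage.
    // (setStore is an assumed name; the text only says "a method to set it".)
    function setStore(string memory _value) public {
        store = _value;
    }

    // Returns a memory copy of the stored string.
    function getStore() public view returns (string memory) {
        return store;
    }
}
```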

The right side of the screen, in Remix, is taken by the developer’s area. In the Compile tab, I check the Auto-Compile option, so that Remix will notify me of errors and code-analysis warnings as I write my code. The static code-analysis is controlled by the options in the tab Analysis, and I usually have all options selected.

In the current case, Remix will report two warnings of the same kind: the methods I have written can potentially have a high-to-infinite gas cost. I will ignore that in this post.

The above contract is very minimal. It defines a state variable store of type string, a method to set it and a method to get it. Let’s test it.

In the Run tab, I hit Deploy and if there are no problems with the contract, a new area will appear below that button with the address where the contract is located and the functions that are available.

Below the working area, Remix shows a detailed record of the transaction’s result. Initially, it shows only a line indicating the account that deployed the contract, the contract and method that was called, i.e. String.(constructor), and how much Ether was passed to the execution (this is shown in Wei, the smallest unit of Ether, corresponding to 10^-18 Ether). We can expand it by clicking on the header, revealing logs, execution and transaction costs, available gas, final result, etc.

At this point, I just want to press the getStore button on the right, and notice how the result is shown beneath it.

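My next experiment is a function that creates a three-character string with new string(3) and then fills it in one character at a time by index. This is a well-intentioned effort, but it does not work. Remix is kind enough to immediately point out 4 errors and 1 warning: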

Two of these are on the same line: string newString = new string(3);

Warning: Variable is declared as a storage pointer. Use an explicit “storage” keyword to silence this warning.
TypeError: Type string memory is not implicitly convertible to expected type string storage pointer

The other three occur in the following lines, eg newString[0] = "A"; and are all of the same type:

TypeError: Index access for string is not possible.

To understand the first issue, I have to tell you about data location. Writing to the blockchain is very expensive. Every node that runs the transaction has to do the same writing, which makes the transaction more expensive and the blockchain bigger. When a node downloads a block containing this transaction, it will incur larger storage costs because of this writing. In Ethereum, every transaction has an associated cost, called gas, to incentivise programmers to be as economical as possible.

When writing a contract, authors have a choice of where their data live. Memory is cheap: it costs relatively little gas, but the data are volatile and are lost once the function finishes executing. Storage is the most expensive, and is strictly needed for contract state, which must persist from one function call to the next. There is also a calldata location: a read-only, non-persistent area that holds the arguments passed to an external function call. This is the cheapest location to use, but it has a limited size. In particular, that means that functions may be limited in their number of arguments.

Every data type has a default location. This is from the Solidity documentation:

Forced data location:
-parameters (not return) of external functions: calldata
-state variables: storage
Default data location:
-parameters (also return) of functions: memory
-all other local variables: storage

Notice the subtlety: function parameters are stored in memory by default, unless the function is external, in which case they stay in calldata. In practice, this means that a function that is perfectly alright when public can suddenly have too many arguments when made external.
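The three locations can be seen side by side in a small sketch. All names here are hypothetical, and the code assumes a 0.5 or later compiler, where the location of every string parameter has to be spelled out rather than left to the defaults quoted above.

```solidity
// SPDX-License-Identifier: MIT
// Assumption: any compiler from 0.5.0 onwards.
pragma solidity >=0.5.0 <0.9.0;

contract Locations {
    // State variable: forced to storage, persists between calls.
    string stored;

    // Public function: the argument is copied into memory first,
    // then written to storage.
    function keepPublic(string memory _value) public {
        stored = _value;
    }

    // External function: the argument is read from calldata
    // and written to storage.
    function keepExternal(string calldata _value) external {
        stored = _value;
    }
}
```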

Now, let’s come back to our code and examine the line

string newString = new string(3);

This is a local variable inside the function, and so by default it is in storage. The new keyword is used to specify the initial size of a memory dynamic array. Memory arrays cannot be resized. On the other hand, we can change the size of a storage dynamic array by changing its length property, but can’t use new with them.

This is the source of our error. In this case, all we want to do with this string is create it and return it to the outside. Let the outside world decide what to do with it, and whether it is temporary only or important enough to persist on the blockchain. In this example, the storage is not important, and the string will be created in memory. To do that, we add the memory keyword in the declaration, like this: string memory newString = new string(3);
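In context, that declaration could sit in a function like the following sketch. The contract and function names are hypothetical, and the code assumes a 0.5 or later compiler. It only allocates and returns the string; filling it in is a separate problem.

```solidity
// SPDX-License-Identifier: MIT
// Assumption: any compiler from 0.5.0 onwards.
pragma solidity >=0.5.0 <0.9.0;

contract MemoryString {
    // Allocates a string of three bytes in memory and hands it back
    // to the caller; nothing is written to storage.
    function makeBlank() public pure returns (string memory) {
        string memory newString = new string(3);
        return newString;
    }
}
```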

Direct access to strings: equivalence with bytes

Let’s look at the second kind of error now. This one is simple and unavoidable: Solidity does not currently allow index access to strings. From the FAQ:

string is basically identical to bytes only that it is assumed to hold the UTF-8 encoding of a real string. Since string stores the data in UTF-8 encoding it is quite expensive to compute the number of characters in the string (the encoding of some characters takes more than a single byte). Because of that,
string s; s.length;
is not yet supported and not even index access s[2]

The alternative is to first transform the string into bytes, and then access it directly. This works because string is an array type, albeit with some restrictions.

But there is a trap to watch out for. bytes stores raw data; string stores UTF-8 characters. The following code does not always return the number of characters in _s:
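A minimal sketch of such a function, with the hypothetical name byteLength and assuming a 0.5 or later compiler, could be:

```solidity
// SPDX-License-Identifier: MIT
// Assumption: any compiler from 0.5.0 onwards.
pragma solidity >=0.5.0 <0.9.0;

contract ByteLength {
    // Returns the number of BYTES in the UTF-8 encoding of _s,
    // which equals the number of characters only for pure ASCII input.
    function byteLength(string memory _s) public pure returns (uint256) {
        return bytes(_s).length;
    }
}
```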

The problem here occurs if _s contains any character that takes more than 1 byte to represent in UTF-8. In that case, the function returns the length of the byte representation of the input string, which will be greater than the number of characters.

This also has an impact when trying to address a particular character of the string, as we cannot predict at which location the character’s bytes will be. We have to parse the string linearly, identifying any multi-byte character, or else make sure we restrict our input to characters of fixed length. If we work exclusively with ASCII strings, for example, we’ll be safe.
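The linear parse can be sketched as follows. This is only an illustration, not a validated UTF-8 decoder: it assumes the input is well-formed UTF-8 and counts code points by skipping continuation bytes, which always have the bit pattern 10xxxxxx. The names are hypothetical and the compiler is assumed to be 0.5 or later.

```solidity
// SPDX-License-Identifier: MIT
// Assumption: any compiler from 0.5.0 onwards.
pragma solidity >=0.5.0 <0.9.0;

contract CharCount {
    // Counts UTF-8 code points by walking the bytes once.
    // Every character starts with exactly one byte that is NOT a
    // continuation byte, so counting those bytes counts characters.
    function charCount(string memory _s) public pure returns (uint256 count) {
        bytes memory b = bytes(_s);
        for (uint256 i = 0; i < b.length; i++) {
            // 0xC0 masks the top two bits; 0x80 means "10xxxxxx".
            if ((uint8(b[i]) & 0xC0) != 0x80) {
                count++;
            }
        }
    }
}
```

The cost grows with the length of the input, which is the expense the FAQ quote above alludes to.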

Returning to our previous function, this works:
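A sketch of a working version follows. It builds the value as bytes and converts it to string only at the end. The first assignment comes from the error listing earlier in the post; the function name and the letters B and C are assumptions, as is the 0.5 or later compiler.

```solidity
// SPDX-License-Identifier: MIT
// Assumption: any compiler from 0.5.0 onwards.
pragma solidity >=0.5.0 <0.9.0;

contract BuildString {
    // bytes allows index access, so the characters are written there
    // and the result is converted to string on the way out.
    function makeString() public pure returns (string memory) {
        bytes memory newString = new bytes(3);
        newString[0] = "A";
        newString[1] = "B"; // assumed value
        newString[2] = "C"; // assumed value
        return string(newString);
    }
}
```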

By contrast, the following code, which tries to set the third character of a string to X, gives the wrong result when it receives multi-byte characters.
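A sketch of that kind of function is shown here. The names are hypothetical, the length check is an addition of this sketch, and the compiler is assumed to be 0.5 or later.

```solidity
// SPDX-License-Identifier: MIT
// Assumption: any compiler from 0.5.0 onwards.
pragma solidity >=0.5.0 <0.9.0;

contract ThirdChar {
    // Overwrites the byte at index 2. This is the third CHARACTER only
    // if the first two characters are one byte each (e.g. ASCII).
    function setThirdToX(string memory _s) public pure returns (string memory) {
        bytes memory b = bytes(_s);
        require(b.length >= 3, "input too short");
        b[2] = "X";
        return string(b);
    }
}
```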

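This returns “AbXdef” for an input of “Abcdef”, but returns “XbÁnç!” for an input of “€bÁnç!”. The € sign takes three bytes in UTF-8, so the byte at index 2 belongs to it rather than to the third character, and overwriting it corrupts the first character instead of replacing the Á.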

Conclusion

There is still much more that could be said about this topic, but this post is long enough already, so I’ll wrap up. The key concept regarding the type string is that it is an array of bytes holding UTF-8 encoded characters, and can be seamlessly converted to bytes. This conversion is the only way of manipulating the string at all. But it is important to note that UTF-8 characters do not map one-to-one onto bytes. The conversion in either direction will be accurate, but there is no direct relation between a byte index and the corresponding character index. For most things, there may be an advantage in representing the string directly as the type bytes (avoiding conversions), while being very careful with characters that UTF-8 encodes in more than one byte.

That’s enough for now. See you another day, with more steps in this coding adventure.

About the Author

Alex is a software engineer at Aventus, working on the blockchain engineering team. He has 20 years of experience working in technology, completing a PhD in Computer Science as well as a post-doctorate in Cryptography. As part of his research, Alex has published papers on Kolmogorov Complexity, Cryptography, Database Anonymization and Code Obfuscation.

Alex also spent seven years lecturing at the University Institute of Maia, including directing the degree programmes for BSc Computer Science and Information Systems and Software.