A Rubyist's Walk Along the C-side (Part 4): Primitive Data Types

This is an article in a multi-part series called “A Rubyist’s Walk Along the C-side”.

In the previous article, we saw how to call Ruby methods in C extensions. In this article, we’ll look at the primitive data types in the Ruby C API.

In Ruby, everything is an object. However, that is not true when writing a C extension as not all types are created equal. For the more “primitive” types, there are often more efficient ways to manipulate them than to call the Ruby method. For example, there’s a more efficient way to get an element at a particular index in Ruby arrays in C than to call Array#[].

Primitive data types

Every type that is listed in the union of the RVALUE struct is a primitive data type (if you’re interested in what the RVALUE struct is and how the garbage collector works in Ruby, take a look at my article titled “Garbage Collection in Ruby”). The primitive data types are the following:

  • Array
  • Bignum
  • Class
  • Complex
  • File
  • Float
  • Hash
  • MatchData
  • Object
  • Rational
  • Regexp
  • String
  • Struct
  • Symbol

All the types above are “heap allocated” in Ruby, meaning that they require memory allocation and are managed by the garbage collector. However, Ruby also has the concept of immediates, which don’t require an object allocation at all! Fixnum (i.e. a small integer value) is represented as an immediate. But how does it store data without allocating an object? Remember that the VALUE type is an unsigned long? Fixnum takes advantage of that by setting a special bit in the VALUE, which leaves the remaining bits free to store the integer directly. In addition to Fixnum, Ruby’s true, false, and nil values are also represented using immediates.
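
To make the tagging idea concrete, here is a minimal sketch in plain C that does not use the Ruby headers. It assumes the scheme CRuby uses for Fixnum, where the integer is shifted left by one bit and the lowest bit is set to 1. Heap objects are aligned in memory, so a real object pointer never has that bit set. The names model_value, model_int_to_fixnum, model_is_fixnum, and model_fixnum_to_int are made up for this illustration and are not part of the Ruby C API.

```c
#include <stdint.h>

/* Hypothetical stand-in for Ruby's VALUE: one pointer-sized unsigned word. */
typedef uintptr_t model_value;

/* Tag a small integer: shift left by one bit and set the lowest bit to 1.
 * The integer must fit in one bit less than the width of a pointer. */
model_value model_int_to_fixnum(intptr_t n) {
    return ((model_value)n << 1) | 1;
}

/* A word whose lowest bit is set holds an integer, not a pointer. */
int model_is_fixnum(model_value v) {
    return (int)(v & 1);
}

/* Undo the tagging: remove the tag bit, then halve the result. */
intptr_t model_fixnum_to_int(model_value v) {
    return ((intptr_t)v - 1) / 2;
}
```

With this encoding the integer 21 is stored as the word 43, and no memory is allocated for it.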

In this article, we’ll only be covering the following types:

  • Special constants (nil, true, false)
  • Fixnum
  • Array
  • Hash
  • String
  • Symbol

Exploring the other types is left as an exercise for the reader!

Constants

The Ruby C API has many builtin constants for our convenience. By convention, Ruby core modules are prefixed with rb_m (e.g. rb_mKernel for the Kernel module), classes are prefixed with rb_c (e.g. rb_cObject for the Object class), and exceptions are prefixed with rb_e (e.g. rb_eRuntimeError for the RuntimeError exception). You can find the list of builtin modules, classes, and exceptions here.

Special constants

There are special constants for Ruby’s true, false, and nil values. We can use Qtrue, Qfalse, and Qnil in our C code for each of the Ruby values. We’ve seen Qnil in the previous articles when we want to return nil from a Ruby method.

In Ruby, all values are considered truthy except false and nil. Ruby’s C API provides the RTEST macro, which evaluates to false (zero) in C if the value is nil or false, and to true (nonzero) otherwise.

VALUE my_obj = ...;
if (RTEST(my_obj)) {
    // my_obj is not Qfalse or Qnil
} else {
    // my_obj is either Qfalse or Qnil
}

Fixnum

If you recall Fixnum from Ruby 2.3 and earlier, it’s a type used to represent small integers efficiently. Since Ruby 2.4, Fixnum and Bignum have been merged to form the Integer class so we no longer have to differentiate the two when using Ruby. However, the two types are still distinct in the C API since they are represented differently internally.

To convert a C long to a Fixnum, use the LONG2FIX macro. Similarly, to convert a Fixnum back to a long, use the FIX2LONG macro. The example below shows how to use these two macros.

// Create Ruby fixnum zero
VALUE zero_ruby = LONG2FIX(0);

// Convert Ruby fixnum to C long
long zero_c = FIX2LONG(zero_ruby);
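
One thing to keep in mind is that a Fixnum gives up one bit for its tag, so not every C long fits in one. As far as I know, LONG2FIX does not check the range for us in a normal build, and the API offers LONG2NUM for values that may need a Bignum. The sketch below is plain C without the Ruby headers, and model_fits_in_fixnum is a hypothetical helper that shows the range we would have to stay within.

```c
#include <limits.h>

/* Hypothetical helper: does n still fit after one bit is lost to the tag? */
int model_fits_in_fixnum(long n) {
    return n >= LONG_MIN / 2 && n <= LONG_MAX / 2;
}
```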

Array

Creating arrays

There are two ways to create a Ruby array:

  1. rb_ary_new: This creates a new, empty Ruby array.
    VALUE my_arr = rb_ary_new();
    
  2. rb_ary_new_capa: This creates a new, empty Ruby array with a specific capacity. This is more efficient than rb_ary_new if the number of elements is known ahead of time since no resizing will be needed within the capacity.
    // New Ruby array with capacity of 100
    VALUE my_arr = rb_ary_new_capa(100);
    

Adding to arrays

If we want to add one element to the array, use rb_ary_push. It accepts two arguments and returns the original array ary:

  1. ary: The array to append to.
  2. item: The Ruby object to be added to the array.
// Function prototype for rb_ary_push
VALUE rb_ary_push(VALUE ary, VALUE item);
// Creating a new array and pushing a fixnum
VALUE my_array = rb_ary_new();
rb_ary_push(my_array, LONG2FIX(42));

To more efficiently add a large number of elements from a C array to a Ruby array, we can use rb_ary_cat. It accepts three arguments and returns the original array ary:

  1. ary: The array to append to.
  2. argv: The C array of Ruby objects added to the array.
  3. len: The number of elements to add.
// Function prototype for rb_ary_cat
VALUE rb_ary_cat(VALUE ary, const VALUE *argv, long len);
// Creating a new array and pushing three elements
VALUE my_array = rb_ary_new();
VALUE ruby_constants[3] = { Qtrue, Qfalse, Qnil };
rb_ary_cat(my_array, ruby_constants, 3);

Removing from arrays

Just like in Ruby, we can remove from Ruby arrays using functions like rb_ary_pop and rb_ary_shift. Exploring these functions is left as an exercise for the reader. You can find the list of exported Ruby array functions in array.h and the implementation in array.c.

Indexing arrays

We can use RARRAY_LEN, RARRAY_PTR, and RARRAY_AREF to get the length, backing C array pointer, and an element at a specific index of a Ruby array, respectively. Here are examples of how they’re used (my_array is assumed to be a Ruby array that already exists and has at least 42 elements):

// Get the length of my_array
long len = RARRAY_LEN(my_array);

// Get the backing C array of my_array
VALUE *elements = RARRAY_PTR(my_array);
// Read the first element of my_array
VALUE first_element = elements[0];
// Set the 42nd element to the fixnum 0
elements[41] = LONG2FIX(0);

// Another way to read the first element of my_array
VALUE same_element = RARRAY_AREF(my_array, 0);

We should be careful when using RARRAY_PTR and RARRAY_AREF. RARRAY_PTR may return a different pointer after elements are added or removed from the array since Ruby may decide to resize the backing C array. Reading or writing to the original pointer may lead to undefined behavior such as segmentation faults. Unlike Ruby’s Array#[], RARRAY_AREF does not check the index that is passed in so we must ensure the index is in the range 0 <= index < RARRAY_LEN. Any indexes out of the range will result in undefined behavior including returning a garbage value or a segmentation fault.
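
To see why a saved pointer can go stale, here is a small model of a growable array in plain C. It is only a sketch of the general technique and not how CRuby lays out its arrays. The names model_array, model_array_push, and model_array_get are hypothetical. The push function may move the backing storage when it grows, and the get function performs the bounds check that RARRAY_AREF leaves to us.

```c
#include <stdlib.h>

/* Hypothetical miniature of a growable array; not CRuby's actual layout. */
typedef struct {
    long *ptr;   /* backing C array, may move when the array grows */
    long len;    /* number of elements in use */
    long capa;   /* number of elements allocated */
} model_array;

/* Append one item, doubling the backing storage when it is full.
 * Returns 0 on success and -1 if memory could not be allocated. */
int model_array_push(model_array *ary, long item) {
    if (ary->len == ary->capa) {
        long new_capa = ary->capa ? ary->capa * 2 : 4;
        long *new_ptr = realloc(ary->ptr, (size_t)new_capa * sizeof(long));
        if (new_ptr == NULL) return -1;
        ary->ptr = new_ptr;   /* any pointer saved earlier may now dangle */
        ary->capa = new_capa;
    }
    ary->ptr[ary->len++] = item;
    return 0;
}

/* Bounds-checked read. Returns 0 and stores the element in *out,
 * or returns -1 if index is outside 0 <= index < len. */
int model_array_get(const model_array *ary, long index, long *out) {
    if (index < 0 || index >= ary->len) return -1;
    *out = ary->ptr[index];   /* re-read ary->ptr on every access */
    return 0;
}
```

The same habit applies to the real API: fetch RARRAY_PTR again after any call that could add or remove elements, and compare the index against RARRAY_LEN before calling RARRAY_AREF.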

Hash

Creating hashes

Creating hashes is very simple: just use rb_hash_new, which accepts no arguments and returns the hash.

// Function prototype for rb_hash_new
VALUE rb_hash_new(void);
// Creating a new hash
VALUE my_hash = rb_hash_new();

Look up

To look up an entry in the hash, we can use rb_hash_aref which is the implementation for Hash#[]. It accepts two arguments and returns the value of the key (or the default value if none is found):

  1. hash: The hash object.
  2. key: The key object.
// Function prototype for rb_hash_aref
VALUE rb_hash_aref(VALUE hash, VALUE key);
// Lookup my_key from my_hash
VALUE my_val = rb_hash_aref(my_hash, my_key);

Set

To set an entry in the hash, we can use rb_hash_aset which is the implementation for Hash#[]=. It accepts three arguments and returns the value val:

  1. hash: The hash object.
  2. key: The key object.
  3. val: The value to set at key.
// Function prototype for rb_hash_aset
VALUE rb_hash_aset(VALUE hash, VALUE key, VALUE val);

// Set my_key to my_val in my_hash
rb_hash_aset(my_hash, my_key, my_val);

Iteration

To iterate over every key/value pair in the hash, we can use rb_hash_foreach. Just like iterating through a hash in Ruby with Hash#each, we should not insert to or delete from the hash while iterating. rb_hash_foreach accepts three arguments and does not return anything:

  1. hash: The hash to iterate over.
  2. func: The callback function that is called for every key/value pair in the hash. This function must accept three arguments and return either ST_CONTINUE to continue iterating, ST_STOP to stop iterating, or ST_DELETE to delete the current entry. The function signature looks like the following:
    int rb_foreach_func(VALUE key, VALUE value, VALUE arg);
    
    1. key: The key of the entry.
    2. value: The value of the corresponding key.
    3. arg: The value that is passed into farg during the rb_hash_foreach call.
  3. farg: Any data that we want to pass into the func callback as the third argument. This could be anything and does not have to be a valid Ruby object.
// Function prototype for rb_hash_foreach
void rb_hash_foreach(VALUE hash, rb_foreach_func *func, VALUE farg);
// Iterate over every key/value pair in my_hash
int my_hash_iter_func(VALUE key, VALUE value, VALUE arg) {
    // Implementation goes here
    return ST_CONTINUE;
}
rb_hash_foreach(my_hash, my_hash_iter_func, 0);
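
The callback protocol is easier to follow with a stripped-down model. The sketch below is plain C and does not use the Ruby headers. Everything prefixed with model_ or MODEL_ is a hypothetical stand-in, with MODEL_CONTINUE and MODEL_STOP playing the roles of ST_CONTINUE and ST_STOP. It shows how the extra argument can carry a pointer to C state into the callback and how returning the stop value ends the iteration early. With the real rb_hash_foreach, farg is a VALUE, so a pointer is usually cast to VALUE at the call site and cast back inside the callback.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the iteration protocol. */
enum model_iter_result { MODEL_CONTINUE, MODEL_STOP };

typedef struct { const char *key; long value; } model_entry;
typedef enum model_iter_result model_iter_func(const char *key, long value, void *arg);

/* Call func for each entry in order until it asks to stop. */
void model_foreach(const model_entry *entries, size_t count,
                   model_iter_func *func, void *arg) {
    for (size_t i = 0; i < count; i++) {
        if (func(entries[i].key, entries[i].value, arg) == MODEL_STOP) break;
    }
}

/* Callback: add each value to the long that arg points to,
 * and stop at the first negative value. */
enum model_iter_result sum_until_negative(const char *key, long value, void *arg) {
    long *total = arg;
    (void)key;  /* the key is not needed for summing */
    if (value < 0) return MODEL_STOP;
    *total += value;
    return MODEL_CONTINUE;
}
```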

String

Creating strings

There are too many ways to create strings. I have a whole article written about the common ways to create strings. Which one we use will depend on the situation, and if the wrong one is used, subtle and catastrophic bugs can be introduced like “The Ruby inplace bug”.

Appending to Ruby strings

We can use rb_str_cat or rb_str_cat_cstr to append to a string.

rb_str_cat accepts three arguments and returns the original string str:

  1. str: The string to append to.
  2. ptr: Pointer to a character buffer.
  3. len: Number of bytes of the character buffer to append.
// Function prototype for rb_str_cat
VALUE rb_str_cat(VALUE str, const char *ptr, long len);
// Appending "Hello world!" to a Ruby string my_string
size_t string_length = 12;
char *c_str = malloc(string_length);
// The C string c_str may or may not contain a null terminator
memcpy(c_str, "Hello world!", string_length);
rb_str_cat(my_string, c_str, string_length);
free(c_str);

rb_str_cat_cstr is simpler to use than rb_str_cat but the caveat is that our C string must be null-terminated. It accepts two arguments and returns the original string str:

  1. str: The string to append to.
  2. ptr: Pointer to a C string that must be null-terminated.
// Function prototype for rb_str_cat_cstr
VALUE rb_str_cat_cstr(VALUE str, const char *ptr);
// Appending "Hello world!" to a Ruby string my_string
rb_str_cat_cstr(my_string, "Hello world!");

Reading and writing to strings

Just like how we can get the backing C array from a Ruby array, we can similarly get the C character array that backs the Ruby string. We can use StringValuePtr to get the backing character array and RSTRING_LEN to get the length of the string. Note that despite the name RSTRING_LEN, it behaves like String#bytesize and not String#length. The difference is that String#bytesize will return the number of bytes that the string occupies and String#length will return the number of characters. These two values will differ when multi-byte characters exist in the string.

// Get the length of my_string
long length = RSTRING_LEN(my_string);

// Get the backing C character array
char *buff = StringValuePtr(my_string);
// Change the 11th character of my_string to 'a'
buff[10] = 'a';
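
To illustrate the difference between bytes and characters, here is a small sketch in plain C that counts the characters in a UTF-8 buffer. It assumes the buffer holds valid UTF-8, and utf8_char_count is a name made up for this example rather than a Ruby C API function.

```c
#include <stddef.h>

/* Count characters in a UTF-8 buffer by skipping continuation bytes,
 * which always have the bit pattern 10xxxxxx. Assumes valid UTF-8. */
size_t utf8_char_count(const char *buf, size_t byte_len) {
    size_t chars = 0;
    for (size_t i = 0; i < byte_len; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80) chars++;
    }
    return chars;
}
```

For the string "héllo" encoded as UTF-8, the buffer is 6 bytes long but holds 5 characters, so an index into the buffer is a byte offset and not a character position.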

We can also use StringValueCStr to get a pointer to the backing character buffer. Unlike StringValuePtr, StringValueCStr will raise an ArgumentError if the Ruby string contains null characters in the middle (i.e. the string cannot be treated as a C string) and will ensure the string is properly null-terminated. Using StringValueCStr will allow us to safely use C functions that require null-terminated strings like strcat, strlen, strcmp, etc.
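
The reason null characters in the middle matter is that C string functions stop at the first one. The sketch below is plain C, and has_embedded_nul is a hypothetical helper that models the kind of check described above for a buffer with a known length. It is not the implementation Ruby uses.

```c
#include <string.h>

/* Return 1 if the first len bytes of buf contain a null byte, 0 otherwise.
 * Such a buffer would be cut short by strlen and similar functions. */
int has_embedded_nul(const char *buf, size_t len) {
    return memchr(buf, '\0', len) != NULL;
}
```

For example, strlen reports 2 for the 5-byte buffer "ab\0cd", silently dropping the last two bytes.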

Symbol

We’ve used rb_intern many times to get an ID type to call methods. In fact, this ID type is the backing implementation of a symbol and has better performance by avoiding allocating the symbol object itself. Let’s see how rb_intern works again:

// Function prototype for rb_intern
ID rb_intern(const char *name);
// Getting the ID of "hello"
ID hello = rb_intern("hello");
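
To give some intuition for why an ID is cheap, here is a toy intern table in plain C. It is a simplified model and CRuby’s real symbol table works differently. The names model_intern, model_names, and MODEL_MAX_NAMES are hypothetical. The point is that the same name always maps to the same small integer, so comparing two IDs is a single integer comparison and no object is allocated.

```c
#include <string.h>

#define MODEL_MAX_NAMES 64

/* Hypothetical intern table: a name's ID is its slot number plus one. */
static const char *model_names[MODEL_MAX_NAMES];
static unsigned long model_name_count = 0;

/* Return the ID for name, registering it on first use.
 * Returns 0 when the table is full. The caller must keep name alive,
 * because only the pointer is stored. */
unsigned long model_intern(const char *name) {
    for (unsigned long i = 0; i < model_name_count; i++) {
        if (strcmp(model_names[i], name) == 0) return i + 1;
    }
    if (model_name_count == MODEL_MAX_NAMES) return 0;
    model_names[model_name_count] = name;
    return ++model_name_count;
}
```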

Creating Ruby symbols

However, since ID is not a VALUE type, it is not a Ruby object and we cannot pass an ID back to Ruby. To create the Ruby symbol, we can use rb_id2sym which accepts one argument and returns the Ruby symbol:

  1. x: The ID of the symbol.
// Function prototype for rb_id2sym
VALUE rb_id2sym(ID x);
// Converting ID "hello" into Ruby symbol :hello
VALUE hello = rb_id2sym(rb_intern("hello"));

Checking types

In Ruby, we often take advantage of duck typing in our code. However, our C code often has assumptions on the Ruby type of an object and may misbehave when a type we don’t expect is passed in. When writing C extension code (especially in public APIs in gems), it is often a good idea to check the type of the parameters passed in. We can use RB_TYPE_P to check the type and Check_Type to enforce the type.

RB_TYPE_P

RB_TYPE_P accepts two arguments:

  1. obj: The object.
  2. t: The type. You can see the list of Ruby types that can be passed in.

It will return true if the object is of the type t and false otherwise.

// Function prototype for RB_TYPE_P
bool RB_TYPE_P(VALUE obj, enum ruby_value_type t);
// Demo of checking whether an object is a fixnum
VALUE my_obj = ...;
if (RB_TYPE_P(my_obj, T_FIXNUM)) {
    // my_obj is a Ruby fixnum
} else {
    // my_obj is not a Ruby fixnum
}

Check_Type

Check_Type accepts the same arguments as RB_TYPE_P but will raise a TypeError if the object is not of the correct type.

// Function prototype for Check_Type
void Check_Type(VALUE obj, enum ruby_value_type t);
// Demo of ensuring object is a fixnum
VALUE my_obj = ...;
// Raise TypeError if my_obj is not a fixnum
Check_Type(my_obj, T_FIXNUM);
// my_obj is for sure a fixnum

Conclusion

In this article, we discussed Ruby’s primitive types. Specifically, we took a deeper look at Ruby’s immediate types, arrays, strings, hashes, and symbols.

There was quite a lot of information to unpack! But be sure to take the time to try out these data types yourself to make sure you understand how they work. Having a solid understanding of the primitive data types is important as we’ll be using them very frequently in future articles. In the next article, we’ll look at using various scopes of variables using the C API.