My take on modernizing C

I have been throwing around some ideas for a module system for a new language, one that takes ease of implementation into account. Despite what several people have suggested, I want to cover all of my bases and want a language that allows for incremental and parallel compilation. I can see two ways of implementing modules, each with varying degrees of implementation hurdles and ease of user experience. In embedded, static libraries are a big thing, and I want them to be first class in the language / build process, not tacked on. Existing C libraries can still be used, but users are forced to create interface "modules" in the new language, which expose the C library's type information in the source language.
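
As a sketch of what such an interface module might look like (the extern syntax and all names here are invented for illustration; nothing about the C interop surface is settled):

// uart.src - a hand-written interface module exposing an existing C
// UART driver's types and functions to the new language.
package UartDriver;

// Declarations mirroring the C library's public functions.
extern fn uart_init(baud: u32) -> i32;
extern fn uart_write(byte: u8);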

Compile Per Source

At a basic level, there will be one compiler process instance per source file in a project. This means that specifying a source file and its inputs is enough to compile it. Here is an example of the top of a source file:

// Package declaration
package Test;

// Imports
import "foldera\folderb\foo";
import "foldera\bar";

// Using statements
using Foo.Core as FooCore;
using Bar.Core as BarCore;

// Where did this one come from?
using TestCore as Core;

We see a few things here. All source files start with a package declaration as their first statement, followed by imports and using statements. Each file is a module, and multiple modules can be grouped into a package.

Modules (files) must be imported based on relative file paths at the compile site, regardless of whether they belong to the same package. Why not just make a bunch of different files part of the same package and import the package, disregarding file paths, like many languages do? Because that would require the compiler to operate at a package or project level in order to figure out which files a given source has access to for public types. Importing based on file paths eases implementation by letting all of a source file's dependencies be discovered by following the import chain, as sketched below. We give up usability / conciseness for easier compiler implementation. In turn, parallelization becomes easy, because one compiler instance can be spawned per source file.
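
Here is a rough sketch of that chain discovery, with an invented file layout and paths assumed (for this sketch) to be relative to the project root:

// main.src - the file the compiler is invoked on
package Test;
import "foldera\folderb\foo";   // first link in the chain: foo

// foldera\folderb\foo - read while compiling main
package Foo;
import "foldera\bar";           // next link discovered: bar

// foldera\bar - imports nothing, so the chain ends here
package Bar;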

The exact access mechanism exposed through the package declaration and using statements is not central to the discussion, and it is just an idea right now, but essentially: a source file can access its own package (Test) and any imported files belonging to the same package without specifying the package, while all other public interfaces can only be used by naming them in a using statement.
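
A minimal sketch of that rule; only the package / import / using forms come from above, and the function syntax is invented:

// helpers.src - same package as main.src below
package Test;

fn helper_fn() {
    // ...
}

// main.src
package Test;
import "helpers";               // same package (Test)
import "foldera\folderb\foo";   // different package (Foo)

using Foo.Core as FooCore;      // required for the foreign package

fn main() {
    helper_fn();                // from helpers.src: no using needed
    FooCore.do_thing();         // only reachable via the using alias
}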

For the example's sake, we can now see that the modules (files) foo and bar expose the Foo.Core and Bar.Core packages, but what exactly is TestCore and where does it come from? A library input. When a library is created (from a different project), its packages and modules are bundled into the library file, which contains the public interface. The original file paths have no meaning outside that project, so we don't want them in the library, and there has to be another mechanism for gaining access to the library's public interface. Therefore, when referencing source files of the same project, you are required to import based on the relative path of the source, but libraries are provided to the compiler as an argument (a library file path). Inside the source, to use the library, you just write a using statement as above.
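
Assuming a compiler flag along these lines (the flag name and library file format are invented for illustration), the pairing might look like:

// Hypothetical invocation - the library is an argument, not an import:
//   compiler main.src --lib deps\testcore.lib
//
// main.src
package Test;

// No import statement for the library; its public interface is
// reached purely through the using statement:
using TestCore as Core;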

Why not just import "deps\libraryx" like the source files? The issue is that a library can reference a public interface that is part of another library. That creates a scenario where it is not easy for the compiler to find the other dependent libraries, whereas modules (files) can simply be followed in a chain like header files. Therefore, to remain consistent with this mechanism, a library and all of its dependent libraries are just provided to the compiler as arguments.
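
Continuing the invented invocation from above, if testcore.lib itself references a public interface from mathcore.lib, both are passed explicitly rather than discovered by the compiler:

// Hypothetical invocation listing a library and its dependency:
//   compiler main.src --lib deps\testcore.lib --lib deps\mathcore.lib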

Next, when we follow the import file chain, we can have mutually dependent source files. The compiler is invoked once per input file and produces one output file. However, there is a case where the public interface in the input file changes the output of another source file in the import chain. When the compiler generates its output, it knows whether an imported file's output will change based on the input file. So if we imagine a build system sitting on top of the compiler, we need a mechanism to tell it: "we know we need to compile A because it changed; A depends on B (which has not changed), but because they are mutually dependent, changing A will also change B's output." Easy: as part of compiling the input file, the compiler can just spit out a list of files that need to be recompiled, through standard out, a pipe, or a file. The build system can then take this information and fire up another compiler instance to compile B.
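
A sketch of that flow with two mutually dependent files; the recompile-list output format is invented for illustration:

// a.src - its public interface just changed
package Test;
import "b";

// b.src - unchanged, but its output depends on a's public interface
package Test;
import "a";

// Compiling a.src (hypothetical invocation and output):
//   compiler a.src
//   recompile: b.src   <- consumed by the build system, which then
//                         spawns another compiler instance for b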

Is such a system perfect? No. Will there be compilation duplication because the compiler works at a file level versus a project level? Yes. Will it be slower than compiling at a project level? Yes. However, look at it this way: this is not a full-on magical compile system like C#, Go, Rust, etc., but it's nowhere near as bad as the header system used in C/C++ right now. If C/C++ (embedded) projects can tolerate current compilation speeds, this mechanism will at worst match them and will likely do much better. This type of module system has many benefits over the header system, in that we wouldn't need to worry about all the crusty header stuff like definition collisions. Additionally, the compiler doesn't necessarily need to use source every time it compiles - it may be able to use cached object files to discover the public interfaces of packages / modules.

The other way to build a module system would be to compile at the project level. That way, we would not need to keep reloading public interface definitions on each compiler invocation and could just add to the known types as we incrementally compile outdated files. Many modern languages use some form of this compilation mechanism. I have been tossing around ideas about how to implement this as well, but I find it much harder to parallelize than the aforementioned mechanism.

Now, before closing: the scope of this discussion is a new language targeting embedded microcontrollers. As stated previously, the projects will be much smaller than what you are going to find on modern PCs. Is it enough to just compile everything every time and make the compiler fast enough to cover 99% of projects? Probably. Are all embedded projects going to be under 100 KLOC? Probably. However, I cannot guarantee either of those two statements. Microcontroller technology advances every year with bigger-memory parts, and they are still developed in C, "bare metal". Compiling the entire project every time is an option, but I would prefer to come up with an incremental / parallel mechanism if it is easy to do and promises compile speeds no worse than traditional C/C++ embedded projects.