Serge Vakulenko (vak) wrote, 2025-04-04 09:18 am

Arrays in B

A couple of years ago some cool dude wrote a compiler for the B language targeting x86_64 under Linux. It generates assembly code for the GNU assembler. However, bugs turned up in the compiler. I set out to get it into shape and fixed a few things. Then I unexpectedly ran into a serious problem with arrays. The author had implemented arrays the way C does. It turned out that this does not work. Arrays are represented differently in B. In particular, B allows assigning a value to an array variable. In C that is impossible: the name of an array is a constant.

I asked Grok, and at first it did not get it. But once I showed it that B allows code like "auto a[3]; a = 123;", it understood and produced a detailed explanation. I will quote it in full.
You’re absolutely right, and I stand corrected—my earlier statement about B preventing assignment to an array name like auto a[3]; a = 123; was inaccurate. Let’s clarify this based on how B actually works, drawing from its design and implementation as documented in historical sources like the B Reference Manual and surviving code.

B’s Behavior with Array Assignment

In B, when you declare:
    auto a[3];
a is allocated as a local variable on the stack, and the [3] specifies that 3 additional words of memory are reserved immediately following a. However, unlike C, where an array name is a non-assignable symbol tied to a fixed memory block, B treats a as a regular variable—a word that initially holds the address of the first element of the 3-word block. Because B is typeless and lacks the strict type enforcement of C, a is not a "constant pointer" or an immutable array name; it’s just a variable that can be reassigned.

So, this is perfectly legal in B:
    auto a[3];
    a = 123;
After this, a no longer points to the original 3-word block—it now contains the value 123. The 3 words allocated for the array are still there on the stack, but a no longer references them unless you explicitly reset it (e.g., with something like a = &a + 1, though B’s syntax and stack behavior make this tricky).

Why This Works in B

  1. Typeless Nature:
    • In B, all variables are words (18 bits on PDP-7, 16 bits on PDP-11), and there’s no distinction between integers, pointers, or array names at the language level. a is just a word, and you can store anything in it, address or not.
  2. Array Declaration Mechanics:
    • When auto a[3]; is compiled, B’s compiler (like the one Thompson wrote for the PDP-7 or PDP-11) allocates 4 words total on the stack: 1 for a itself (holding the address of the first element) and 3 for the array elements. Initially, a is set to the address of a[0] (i.e., &a + 1 in a conceptual sense), but nothing in B’s semantics prevents overwriting that value.
    • Contrast this with C, where int a[3]; makes a a compile-time symbol tied to the array’s base address, not a variable that can be assigned to.
  3. No Lvalue Protection:
    • B doesn’t enforce immutability for array names. In C, a = 123; fails because a isn’t an lvalue in that context—it’s an array type, not a modifiable object. B has no such restriction, so a behaves like any other variable.
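
The layout described in the list above can be sketched in C by simulating word-addressed memory as an array. This is only an illustration of the quoted explanation, not code from any B compiler: the names `word`, `mem`, `b_auto_vector` and `b_index` are hypothetical, and the bump allocator grows upward purely for simplicity.

```c
#include <assert.h>
#include <stddef.h>

typedef long word;              /* hypothetical stand-in for a machine word */

enum { MEM_WORDS = 256 };
static word mem[MEM_WORDS];     /* simulated word-addressed memory */
static size_t next_free = 1;    /* trivial bump allocator; word 0 is unused */

/* Models "auto a[n];" as described above: reserve 1 + n words and store
   the index of the first element in the variable's own word.
   Returns the index of the word that plays the role of "a". */
size_t b_auto_vector(size_t n)
{
    size_t var = next_free;
    assert(next_free + 1 + n <= (size_t)MEM_WORDS);
    next_free += 1 + n;
    mem[var] = (word)(var + 1); /* a = &a + 1 */
    return var;
}

/* Models a[i] as *(a + i): the CURRENT value of a is read on every access,
   so overwriting a changes where a[i] lands. */
word *b_index(size_t var, word i)
{
    word target = mem[var] + i;
    assert(target >= 0 && target < MEM_WORDS);  /* B itself checks nothing */
    return &mem[target];
}
```

With this model, `mem[a] = 123;` followed by `*b_index(a, 0) = 20;` stores 20 in word 123 of the simulated memory, while the three words originally reserved for the vector stay allocated but unreferenced.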

Example and Implications

Here’s what happens step-by-step:
    auto a[3];
    a[0] = 10; /* Sets first element to 10; a holds address of a[0] */
    a = 123; /* Overwrites a with 123 */
    a[0] = 20; /* Undefined behavior: treats 123 as an address */
  • Initially, a points to the stack-allocated block, and a[0] = 10 writes 10 to that location.
  • After a = 123, a now contains 123, which could be interpreted as an address if used as a pointer.
  • a[0] = 20 would attempt to write 20 to memory location 123, likely causing a crash or memory corruption on real hardware, since 123 isn’t a valid stack address.
This flexibility (or lack of safety) reflects B’s minimalist design—it was built for programmers who understood the machine’s memory layout and didn’t need hand-holding.
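
The four statements can be traced on a simulated memory in C. This is a minimal sketch under stated assumptions: memory is a plain array of words, the variable a is assumed to live at word 200 (the constant `A` is made up for the example), and a bounds check stands in for the crash that real hardware might produce.

```c
#include <assert.h>

typedef long word;                  /* hypothetical stand-in for a machine word */

enum { MEM_WORDS = 256, A = 200 };  /* A: assumed location of the variable a */
static word mem[MEM_WORDS];         /* simulated word-addressed memory */

/* Runs the four B statements from the example on the simulated memory. */
void run_example(void)
{
    mem[A] = A + 1;         /* auto a[3];  a holds the index of a[0]; elements are words 201..203 */
    mem[mem[A] + 0] = 10;   /* a[0] = 10;  writes word 201 */
    mem[A] = 123;           /* a = 123;    words 201..203 are now unreferenced */
    assert(mem[A] >= 0 && mem[A] < MEM_WORDS);  /* the simulation checks; B would not */
    mem[mem[A] + 0] = 20;   /* a[0] = 20;  writes word 123 */
}
```

After `run_example()` the value 10 is still sitting in word 201, and the 20 has landed in word 123, which has nothing to do with the vector.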

Comparison to C Again

Your earlier analogy, int buf[3], *a = buf; in C, still holds as a way to mimic B’s behavior, but with a twist:
  • In C:
        int buf[3];
        int *a = buf;
        a[0] = 10; /* Works: modifies buf[0] */
        a = (int *)123; /* Works with a cast: a now points to address 123 */
    Here, a is explicitly a pointer, separate from buf, and can be reassigned.
  • In B:
        auto a[3];
        a[0] = 10; /* Works: modifies first element */
        a = 123; /* Works: a now holds 123 */
    The difference is that B doesn’t distinguish between buf and a—there’s no separate array object; a is both the "pointer" and the entry point to the array, and it’s mutable.
C’s int a[3]; prevents a = 123; to enforce array integrity, a refinement over B’s free-for-all approach.
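
A compilable C version of the comparison, as a hedged sketch: the function `demo` and the second array `other` are made up for the illustration, and the pointer is retargeted to a real array instead of address 123 so that the example is safe to run.

```c
#include <assert.h>

/* Returns 1 if the pointer could be retargeted while buf kept its contents. */
int demo(void)
{
    int buf[3] = {0, 0, 0};
    int other[3] = {7, 8, 9};
    int *a = buf;       /* plays the role of B's variable a */

    assert(a == buf);   /* initially a refers to the first element of buf */
    a[0] = 10;          /* writes buf[0] */
    a = other;          /* legal: a is an ordinary pointer variable */
    /* buf = other; */  /* would not compile: an array name is not a modifiable lvalue */
    a[0] = 20;          /* writes other[0]; buf is untouched */
    return buf[0] == 10 && other[0] == 20;
}
```

A compiler that wants B semantics has to give every vector this extra pointer cell, which is exactly what a plain C-style array layout lacks.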

Historical Context

This behavior aligns with B’s origins on the PDP-7 and PDP-11, where memory was scarce, and the language trusted programmers to manage it. The B Reference Manual (circa 1972) confirms that array names are variables, not immutable symbols, and examples from early Unix code in B (before the C rewrite) show similar liberties with memory manipulation.

Conclusion

You’re correct: B allows a = 123; after auto a[3];, unlike C’s array declarations. This makes B’s "arrays" more like pointers with pre-allocated space than C’s distinct array types. It’s a subtle but critical difference, highlighting B’s raw, low-level nature versus C’s added structure. Thanks for catching that—it’s a great example of how B’s simplicity shaped its quirks!
