[personal profile] vak
A couple of years ago a very capable guy wrote a compiler for the B language targeting x86_64 on Linux. It generates assembly code for the GNU assembler. The compiler turned out to have bugs, though. I set out to get it into shape and fixed a few things. Then I unexpectedly ran into a serious problem with arrays. The author had implemented arrays the way C does, and it turned out that this doesn't work: B represents arrays differently. In particular, B allows assigning a value to an array variable. In C that is impossible: an array name is a constant.

I asked Grok, and at first it didn't get it. But once I showed it that B allows code like "auto a[3]; a = 123;", it understood and produced a detailed explanation. Here it is in full.
You’re absolutely right, and I stand corrected—my earlier statement about B preventing assignment to an array name like auto a[3]; a = 123; was inaccurate. Let’s clarify this based on how B actually works, drawing from its design and implementation as documented in historical sources like the B Reference Manual and surviving code.

B’s Behavior with Array Assignment

In B, when you declare:
    auto a[3];
a is allocated as a local variable on the stack, and the [3] specifies that 3 additional words of memory are reserved immediately following a. However, unlike C, where an array name is a non-assignable symbol tied to a fixed memory block, B treats a as a regular variable—a word that initially holds the address of the first element of the 3-word block. Because B is typeless and lacks the strict type enforcement of C, a is not a "constant pointer" or an immutable array name; it’s just a variable that can be reassigned.

So, this is perfectly legal in B:
    auto a[3];
    a = 123;
After this, a no longer points to the original 3-word block—it now contains the value 123. The 3 words allocated for the array are still there on the stack, but a no longer references them unless you explicitly reset it (e.g., with something like a = &a + 1, though B’s syntax and stack behavior make this tricky).

Why This Works in B

  1. Typeless Nature:
    • In B, all variables are words (18 bits on PDP-7, 16 bits on PDP-11), and there’s no distinction between integers, pointers, or array names at the language level. a is just a word, and you can store anything in it, address or not.
  2. Array Declaration Mechanics:
    • When auto a[3]; is compiled, B’s compiler (like the one Thompson wrote for the PDP-7 or PDP-11) allocates 4 words total on the stack: 1 for a itself (holding the address of the first element) and 3 for the array elements. Initially, a is set to the address of a[0] (i.e., &a + 1 in a conceptual sense), but nothing in B’s semantics prevents overwriting that value.
    • Contrast this with C, where int a[3]; makes a a compile-time symbol tied to the array’s base address, not a variable that can be assigned to.
  3. No Lvalue Protection:
    • B doesn’t enforce immutability for array names. In C, a = 123; is rejected because an array is not a modifiable lvalue: a has array type, so it cannot appear on the left side of an assignment. B has no such restriction, so a behaves like any other variable.

Example and Implications

Here’s what happens step-by-step:
    auto a[3];
    a[0] = 10; /* Sets first element to 10; a holds address of a[0] */
    a = 123; /* Overwrites a with 123 */
    a[0] = 20; /* Undefined behavior: treats 123 as an address */
  • Initially, a points to the stack-allocated block, and a[0] = 10 writes 10 to that location.
  • After a = 123, a now contains 123, which could be interpreted as an address if used as a pointer.
  • a[0] = 20 would attempt to write 20 to memory location 123, likely causing a crash or memory corruption on real hardware, since 123 isn’t a valid stack address.
This flexibility (or lack of safety) reflects B’s minimalist design—it was built for programmers who understood the machine’s memory layout and didn’t need hand-holding.

Comparison to C Again

Your earlier analogy, int buf[3], *a = buf; in C, still holds as a way to mimic B’s behavior, but with a twist:
  • In C:
        int buf[3];
        int *a = buf;
        a[0] = 10; /* Works: modifies buf[0] */
        a = (int *)123; /* Needs an explicit cast in C; a now holds address 123 */
    Here, a is explicitly a pointer, separate from buf, and can be reassigned.
  • In B:
        auto a[3];
        a[0] = 10; /* Works: modifies first element */
        a = 123; /* Works: a now holds 123 */
    The difference is that B has no separate name for the block, such as buf: a is both the "pointer" and the only entry point to the array, and it’s mutable.
C’s int a[3]; prevents a = 123; to enforce array integrity, a refinement over B’s free-for-all approach.

Historical Context

This behavior aligns with B’s origins on the PDP-7 and PDP-11, where memory was scarce, and the language trusted programmers to manage it. The B Reference Manual (circa 1972) confirms that array names are variables, not immutable symbols, and examples from early Unix code in B (before the C rewrite) show similar liberties with memory manipulation.

Conclusion

You’re correct: B allows a = 123; after auto a[3];, unlike C’s array declarations. This makes B’s "arrays" more like pointers with pre-allocated space than C’s distinct array types. It’s a subtle but critical difference, highlighting B’s raw, low-level nature versus C’s added structure. Thanks for catching that—it’s a great example of how B’s simplicity shaped its quirks!

Date: 2025-04-04 17:35 (UTC)
From: [personal profile] chaource
https://www.t3x.org/

You might enjoy this too!

Date: 2025-04-04 18:04 (UTC)
From: [personal profile] spamsink
It explains things well. Looks like it was fed a lot of material.

Date: 2025-04-05 20:10 (UTC)
From: [personal profile] ymz5
So it turns out a revolution in software engineering is unfolding before our eyes, and possibly in how it is taught as well. :)