Advanced JavaScript Code Search with Orama

By Marco Ippolito

Editor’s note: This is a cross-post written by Developer Experience Engineer, Marco Ippolito. Marco has his own blog at Medium.

Learn how to build a powerful code search tool using Orama

Searching for specific pieces of code in large JavaScript files can be a daunting task, especially in complex projects. As developers, we’ve all experienced the frustration of sifting through hundreds or even thousands of lines of code, trying to locate that elusive function or variable.

Traditional text editors or IDEs may offer basic search functionalities, but they often fall short of providing accurate and contextually relevant results. The main issue is that a plain text search cannot tell where a variable is declared, nor can it distinguish between declaration kinds such as “const”, “let”, function declarations, or class properties. This limitation leads to wasted time and effort when navigating code.

In this article, we’ll learn how to build a powerful code search tool using Orama, an incredible full-text search engine.

Abstract Syntax Tree

When searching for specific code elements within JavaScript files, relying solely on string occurrences is insufficient. The reason is that JavaScript code can be written in various styles and can include complex structures that go beyond simple text patterns. To perform precise and accurate searches, we need a more sophisticated approach, which is where the Abstract Syntax Tree becomes essential.
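To make the limitation concrete, here is a minimal sketch of a plain substring search. The four-line source text and the findOccurrences helper are made up for this illustration and are not part of the project:

```javascript
// Hypothetical source text, used only for this illustration.
const source = [
  'const greet = "ciao";',
  'let greeting = greet + "!";',
  "// greet the user",
  "console.log(greeting);",
].join("\n");

// Naive search: report every line that contains the term as a substring.
function findOccurrences(text, term) {
  return text
    .split("\n")
    .map((content, index) => ({ line: index + 1, content }))
    .filter(({ content }) => content.includes(term));
}

const occurrences = findOccurrences(source, "greet");
console.log(occurrences.map(({ line }) => line)); // [ 1, 2, 3, 4 ]
```

Every line matches: the result mixes a declaration, a usage, a comment, and a different identifier that merely starts with the same letters. Nothing in it tells us which hit is the const declaration.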

An Abstract Syntax Tree is a powerful representation of a code file’s syntax and structure. It breaks down the code into a hierarchical tree of nodes, each representing a distinct syntactic element, such as functions, variables, expressions, loops, and more. This tree structure allows us to analyze the code’s context, relationships, and scope, providing a more accurate understanding of its components.

Take this one-line snippet as an example of code to transform into an Abstract Syntax Tree:

const hello = "Ciao";

Here’s the AST representation:

{
  "type": "Program",
  "start": 0,
  "end": 22,
  "body": [
    {
      "type": "VariableDeclaration",
      "start": 0,
      "end": 21,
      "declarations": [
        {
          "type": "VariableDeclarator",
          "start": 6,
          "end": 20,
          "id": {
            "type": "Identifier",
            "start": 6,
            "end": 11,
            "name": "hello"
          },
          "init": {
            "type": "Literal",
            "start": 14,
            "end": 20,
            "value": "Ciao",
            "raw": "\"Ciao\""
          }
        }
      ],
      "kind": "const"
    }
  ],
  "sourceType": "script"
}

The Abstract Syntax Tree may seem overwhelming, as even a tiny code snippet creates a large and intricate JSON representation. But don’t worry; we’ll simplify it, making it easy to work with and explore its potential.
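As a first taste of what the tree gives us, the sketch below reads the declaration kind, name, and value straight from the AST shown above. The tree is retyped by hand with the position fields omitted, and listTopLevelVariables is a hypothetical helper that only handles simple top-level declarations:

```javascript
// Abridged copy of the AST shown above (start/end positions omitted).
const ast = {
  type: "Program",
  body: [
    {
      type: "VariableDeclaration",
      kind: "const",
      declarations: [
        {
          type: "VariableDeclarator",
          id: { type: "Identifier", name: "hello" },
          init: { type: "Literal", value: "Ciao", raw: '"Ciao"' },
        },
      ],
    },
  ],
};

// Collect { kind, name, value } for each simple top-level variable declaration.
function listTopLevelVariables(program) {
  const found = [];
  for (const statement of program.body) {
    if (statement.type !== "VariableDeclaration") continue;
    for (const declarator of statement.declarations) {
      // Destructuring patterns have no single name, so this sketch skips them.
      if (declarator.id.type !== "Identifier") continue;
      const init = declarator.init;
      found.push({
        kind: statement.kind,
        name: declarator.id.name,
        // Only literal initializers carry a plain value.
        value: init && init.type === "Literal" ? init.value : undefined,
      });
    }
  }
  return found;
}

const variables = listTopLevelVariables(ast);
console.log(variables); // [ { kind: 'const', name: 'hello', value: 'Ciao' } ]
```

Unlike a text match, the answer here says which kind of declaration introduced the name, which is exactly the information we want to search on.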

Let’s get started

First, we’ll read the content of the file we want to search using the node:fs module. Then, we’ll use Acorn, a widely used JavaScript parser, to generate the AST:

// index.js
import { parse } from "acorn";
import { readFileSync } from "node:fs";

// read the file content
const file = readFileSync("public/index.js", "utf-8");
// parse the file content into AST
const ast = parse(file, { ecmaVersion: "latest", locations: true });

Once we generate the AST, we have a complex JavaScript object representing the code’s structure. To simplify navigation and analysis, we flatten the AST into an array, reducing complexity.

Before we flatten the tree, we need to decide where to look for child nodes. Because the AST is hierarchical, children can live under many different properties, so we limit the traversal to a known set of them. The fieldsToTraverse array lists the properties we explore when searching for further child nodes:

// const.js
export const fieldsToTraverse = [
  "body",
  "declarations",
  "arguments",
  "expressions",
  "quasis",
  "property",
  "object",
  "id",
  "init",
  "expression",
  "callee",
  "params",
  "key",
  "value",
  "left",
  "right",
];

We also want a list of fields to pick from each node. The fieldsToPick array includes the properties we are interested in.

// const.js
// "loc" is included so each entry keeps its source location
export const fieldsToPick = ["name", "kind", "value", "type", "loc"];

Now, we can create a function that traverses the AST to flatten it. This function recursively navigates through the hierarchical structure of the AST and converts it into a flat array representation. As it traverses the tree, it identifies nodes of interest (fieldsToTraverse) and extracts relevant information based on the specified properties (fieldsToPick).

The flattened array is then organized to preserve the relationships between nodes, making it easier to access and search for specific code elements.

// traverse.js
import { randomUUID } from "node:crypto";
import { fieldsToTraverse, fieldsToPick } from "./const.js";

export function traverse(node, results, parentId, parentType, field, kind) {
  // give each node a uuid
  const id = randomUUID();
  // set custom properties
  const base = {
    parentType,
    parentId,
    id,
    field,
  };
  // pick properties we are interested in
  const extraFields = fieldsToPick.reduce((acc, prop) => {
    if (node[prop]) {
      // value can contain child nodes or actual values
      if (prop === "value" && typeof node[prop] === "object") {
        return acc;
      }
      acc[prop] = node[prop];
    }
    return acc;
  }, {});
  // merge custom properties with the ones from node
  const res = { ...base, ...extraFields };
  // don't push the node if it's empty
  if (Object.keys(extraFields).length !== 0) {
    // add the node to array
    results.push(res);
  }
  const type = res.type;
  // if it's a variable declaration, get its kind (let, const, var);
  // otherwise inherit the kind from the enclosing declaration
  if (type === "VariableDeclaration") {
    kind = res.kind;
  } else if (kind) {
    res.kind = kind;
  }
  // traverse the children inside the properties we are interested in
  for (const field of fieldsToTraverse) {
    const child = node[field];
    if (child) {
      // if it is an array, traverse every element
      if (Array.isArray(child)) {
        for (const el of child) {
          traverse(el, results, id, type, field, kind);
        }
      } else {
        traverse(child, results, id, type, field, kind);
      }
    }
  }
}

We have created an array of objects that looks like this:

[
  {
    // root
    parentType: undefined,
    parentId: undefined,
    id: '1697e177-b48f-422d-8918-49de8c14a4c2',
    field: undefined,
    type: 'Program',
    loc: SourceLocation { start: [Position], end: [Position] }
  },
  {
    // VariableDeclaration
    parentType: 'Program',
    parentId: '1697e177-b48f-422d-8918-49de8c14a4c2',
    id: '9ca1e405-9777-41e0-8050-4f543374cffc',
    field: 'body',
    kind: 'const',
    type: 'VariableDeclaration',
    loc: SourceLocation { start: [Position], end: [Position] }
  },
  {
    // VariableDeclarator of kind const
    parentType: 'VariableDeclaration',
    parentId: '9ca1e405-9777-41e0-8050-4f543374cffc',
    id: '92411202-b341-49e1-838e-6aa2c6afd51e',
    field: 'declarations',
    type: 'VariableDeclarator',
    loc: SourceLocation { start: [Position], end: [Position] },
    kind: 'const'
  },
  {
    // name of the variable
    parentType: 'VariableDeclarator',
    parentId: '92411202-b341-49e1-838e-6aa2c6afd51e',
    id: '12d76f20-7210-4568-9acf-dabea67c3ca5',
    field: 'id',
    name: 'hello',
    type: 'Identifier',
    loc: SourceLocation { start: [Position], end: [Position] },
    kind: 'const'
  },
  {
    // value of the variable
    parentType: 'VariableDeclarator',
    parentId: '92411202-b341-49e1-838e-6aa2c6afd51e',
    id: '7f86024d-98cd-444f-aed5-b7f32f574ff1',
    field: 'init',
    value: 'Ciao',
    type: 'Literal',
    loc: SourceLocation { start: [Position], end: [Position] },
    kind: 'const'
  }
]

We have successfully traversed and flattened the Abstract Syntax Tree. Each entry corresponds to an individual node of the code’s Abstract Syntax Tree, and therefore to a specific piece of the JavaScript code. Each node is linked to its parent through the parentType and parentId properties, which preserve the hierarchical relationship between nodes.
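Because every entry carries its parent’s id, the hierarchy can still be recovered from the flat array. The sketch below walks from a node up to the root; the entries are abridged, the short ids (n1, n2, ...) stand in for the real UUIDs, and ancestorTypes is a hypothetical helper:

```javascript
// Abridged flattened nodes; the short ids stand in for the real UUIDs.
const nodes = [
  { id: "n1", parentId: undefined, type: "Program" },
  { id: "n2", parentId: "n1", type: "VariableDeclaration", kind: "const" },
  { id: "n3", parentId: "n2", type: "VariableDeclarator", kind: "const" },
  { id: "n4", parentId: "n3", type: "Identifier", name: "hello", kind: "const" },
  { id: "n5", parentId: "n3", type: "Literal", value: "Ciao", kind: "const" },
];

// Index the entries by id so each parent lookup is a single Map access.
const byId = new Map(nodes.map((node) => [node.id, node]));

// Follow parentId links until we run out of parents (the root has none).
function ancestorTypes(node) {
  const types = [];
  let current = byId.get(node.parentId);
  while (current) {
    types.push(current.type);
    current = byId.get(current.parentId);
  }
  return types;
}

const identifier = nodes.find((node) => node.name === "hello");
console.log(ancestorTypes(identifier));
// [ 'VariableDeclarator', 'VariableDeclaration', 'Program' ]
```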

The additional properties name, kind, value, type, and loc provide valuable information about each code element. For instance, name holds the name of a variable or function, kind indicates the kind of declaration (e.g., “const” or “let”), and value contains a literal value, such as the one assigned to a variable. The type property specifies the type of node (e.g., “Identifier”, “Literal”, or “MemberExpression”), and loc represents the source code location of the node.

Now we want to be able to search and analyze the entries based on their properties, identify specific code patterns, locate variable declarations, and analyze function calls.

Orama: the lightning-fast search engine

Orama is a fast, batteries-included, full-text search engine entirely written in TypeScript, with zero dependencies.

By providing our array of AST nodes to Orama, we gain the ability to execute complex queries on the data.

Let’s install Orama:

npm i @orama/orama

First of all, we create an Orama database by defining the structure of our data:

// index.js
import { create, insertMultiple, search } from "@orama/orama";
import { parse } from "acorn";
import { readFileSync } from "node:fs";
import { traverse } from "./traverse.js";

const file = readFileSync("public/index.js", "utf-8");
const ast = parse(file, { ecmaVersion: "latest", locations: true });

const nodes = [];
traverse(ast, nodes);

const db = await create({
  schema: {
    parentType: "string",
    parentId: "string",
    id: "string",
    field: "string",
    name: "string",
    type: "string",
    kind: "string",
  }
});

Then we insert our nodes inside the database:

// index.js
import { create, insertMultiple, search } from "@orama/orama";
import { parse } from "acorn";
import { readFileSync } from "node:fs";
import { traverse } from "./traverse.js";

const file = readFileSync("public/index.js", "utf-8");
const ast = parse(file, { ecmaVersion: "latest", locations: true });
const nodes = [];
traverse(ast, nodes);

const db = await create({
  schema: {
    parentType: "string",
    parentId: "string",
    id: "string",
    field: "string",
    name: "string",
    type: "string",
    kind: "string",
  }
});

await insertMultiple(db, nodes);

And… that’s it! Once we’ve added our AST nodes to Orama, searching for specific information becomes straightforward!

Let’s use a more complex code snippet to perform our queries on:

// public/index.js
class SayHello {
  sayHello = "ciao";
  constructor(sayHello) {
    this.sayHello = sayHello;
    console.log(this.sayHello);
  }
}
new SayHello("ciao");

function sayHello(name) {
  const greet = "ciao";
  console.log(`${greet} ${name}`);
}

let greet = "ciao";

const name = sayHello("Marco");

The code snippet contains several variables with repeated names and different scopes, which can lead to confusion when searching using traditional search filters.

For instance, the name sayHello is used for a property of the class SayHello, for a parameter of its constructor, and for a function, which can make it hard to track its usage and assignments.

Similarly, the variable greet is declared twice: once as a local variable inside the function sayHello and once as a global variable outside the function.

Let’s use Orama to search for the SayHello class declaration by filtering on parentType:

// index.js
// find where SayHello Class is declared
const {
  hits: [classDeclaration],
} = await search(db, {
  term: "sayHello",
  properties: ["name"],
  where: {
    parentType: "ClassDeclaration",
  },
});

Output:

{
  "id": "4d8e85bf-74b8-4dbd-8e92-dbbca3ff7b78",
  "score": 1.8931447341375505,
  "document": {
    "parentType": "ClassDeclaration",
    "parentId": "601ee3d4-9b6a-4247-8763-deab2544a256",
    "id": "4d8e85bf-74b8-4dbd-8e92-dbbca3ff7b78",
    "field": "id",
    "name": "SayHello",
    "type": "Identifier",
    // where it is located inside the file
    "loc": {
      "start": {
        "line": 1,
        "column": 6
      },
      "end": {
        "line": 1,
        "column": 14
      }
    }
  }
}

Acorn provides the loc (location) field on each AST node when the locations option is enabled. This information allows us to determine precisely where each token appears in the source file.
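One way to put loc to work is to map a search hit back to the text it points at. The sketch below assumes Acorn’s convention of 1-based line numbers and 0-based columns, and source text that uses "\n" line endings; textAt is a hypothetical helper and the source is an abridged version of the snippet above:

```javascript
// Abridged version of public/index.js, enough to resolve the hit above.
const source = ["class SayHello {", '  sayHello = "ciao";', "}"].join("\n");

// loc as returned in the search hit above.
const loc = { start: { line: 1, column: 6 }, end: { line: 1, column: 14 } };

// Return the source text covered by a loc (lines are 1-based, columns 0-based).
function textAt(text, { start, end }) {
  const lines = text.split("\n");
  if (start.line === end.line) {
    return lines[start.line - 1].slice(start.column, end.column);
  }
  // Multi-line range: tail of the first line, whole middle lines, head of the last.
  const first = lines[start.line - 1].slice(start.column);
  const middle = lines.slice(start.line, end.line - 1);
  const last = lines[end.line - 1].slice(0, end.column);
  return [first, ...middle, last].join("\n");
}

console.log(textAt(source, loc)); // SayHello
```

The same line and column pair is what an editor integration would need to jump to the match.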

We can also search for a variable specifying if we are looking for a “let” or a “const”:

// index.js
// search for a variable declared with let
const { hits: [greetLet] } = await search(db, {
  term: "greet",
  where: {
    parentType: "VariableDeclarator",
    kind: "let",
  },
});

Output:

{
  "id": "0d0978cd-9b2f-409d-bc1a-ee3299727ffa",
  "score": 2.8535879407490152,
  "document": {
    "parentType": "VariableDeclarator",
    "parentId": "6f734655-7864-4c80-a99c-ffdcd1d02cee",
    "id": "0d0978cd-9b2f-409d-bc1a-ee3299727ffa",
    "field": "id",
    "name": "greet",
    "kind": "let",
    "type": "Identifier",
    // location inside the file
    "loc": {
      "start": {
        "line": 15,
        "column": 4
      },
      "end": {
        "line": 15,
        "column": 9
      }
    }
  }
}

Conclusion

This was a fun experiment where we used Orama and Acorn together to make it easier to explore and understand complex code. The combination of their features worked well, and it would be cool to create a plugin for Visual Studio Code using this system. With such a plugin, we could improve code search, analysis, and navigation, making coding a lot more efficient and enjoyable for developers.

You can download the source code from https://github.com/marco-ippolito/orama-ast.