By Marco Ippolito
Editor’s note: This is a cross-post written by Developer Experience Engineer, Marco Ippolito. Marco has his own blog at Medium.
Learn how to build a powerful code search tool using Orama
Searching for specific pieces of code in large JavaScript files can be a daunting task, especially in complex projects. As developers, we’ve all experienced the frustration of sifting through hundreds or even thousands of lines of code, trying to locate that elusive function or variable.
Traditional text editors or IDEs may offer basic search functionalities, but they often fall short of providing accurate and contextually relevant results. The main issue is that a plain text search cannot tell where a variable is declared, nor can it differentiate between kinds of declarations such as “const”, “let”, “function”, or class properties. This limitation leads to wasted time and effort when navigating code.
In this article, we’ll learn how to build a powerful code search tool using Orama, an incredible full-text search engine.
Abstract Syntax Tree
When searching for specific code elements within JavaScript files, relying solely on string occurrences is insufficient. The reason is that JavaScript code can be written in various styles and can include complex structures that go beyond simple text patterns. To perform precise and accurate searches, we need a more sophisticated approach, which is where the Abstract Syntax Tree becomes essential.
An Abstract Syntax Tree is a powerful representation of a code file’s syntax and structure. It breaks down the code into a hierarchical tree of nodes, each representing a distinct syntactic element, such as functions, variables, expressions, loops, and more. This tree structure allows us to analyze the code’s context, relationships, and scope, providing a more accurate understanding of its components.
To see the transformation of code into an Abstract Syntax Tree, take this one-line snippet:
const hello = "Ciao";
Here’s the AST representation:
{
  "type": "Program",
  "start": 0,
  "end": 22,
  "body": [
    {
      "type": "VariableDeclaration",
      "start": 0,
      "end": 21,
      "declarations": [
        {
          "type": "VariableDeclarator",
          "start": 6,
          "end": 20,
          "id": {
            "type": "Identifier",
            "start": 6,
            "end": 11,
            "name": "hello"
          },
          "init": {
            "type": "Literal",
            "start": 14,
            "end": 20,
            "value": "Ciao",
            "raw": "\"Ciao\""
          }
        }
      ],
      "kind": "const"
    }
  ],
  "sourceType": "script"
}
The Abstract Syntax Tree may seem overwhelming, as even a tiny code snippet creates a large and intricate JSON representation. But don’t worry; we’ll simplify it, making it easy to work with and explore its potential.
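To make the structure above more concrete, here is a minimal sketch that reads the declaration kind, the variable name, and the assigned value straight from that object. The `ast` literal is a trimmed, hand-copied version of the JSON above (offsets omitted), and `describeDeclarations` is a hypothetical helper name used only for illustration; it is not part of Acorn.

```javascript
// Trimmed copy of the AST shown above (start/end offsets omitted for brevity).
const ast = {
  type: "Program",
  body: [
    {
      type: "VariableDeclaration",
      kind: "const",
      declarations: [
        {
          type: "VariableDeclarator",
          id: { type: "Identifier", name: "hello" },
          init: { type: "Literal", value: "Ciao", raw: '"Ciao"' },
        },
      ],
    },
  ],
  sourceType: "script",
};

// Hypothetical helper: list the top-level variable declarations of a Program node.
// It only handles simple identifiers, not destructuring patterns.
function describeDeclarations(program) {
  const found = [];
  for (const statement of program.body) {
    if (statement.type !== "VariableDeclaration") continue;
    for (const declarator of statement.declarations) {
      found.push({
        kind: statement.kind,
        name: declarator.id.name,
        // init is null when there is no initializer, e.g. `let x;`
        value: declarator.init ? declarator.init.value : undefined,
      });
    }
  }
  return found;
}

console.log(describeDeclarations(ast));
```

Note how the kind lives on the VariableDeclaration node while the name lives two levels deeper, on the Identifier. That gap is why a plain text match cannot tell a “const” from a “let”.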
Let’s get started
First, we’ll read the content of the file we want to search in using the node:fs module. Then, we’ll leverage Acorn, a widely-used JavaScript library, to generate the AST:
// index.js
import { parse } from "acorn";
import { readFileSync } from "node:fs";
// read the file content
const file = readFileSync("public/index.js", "utf-8");
// parse the file content into AST
const ast = parse(file, { ecmaVersion: "latest", locations: true });
Once we generate the AST, we have a complex JavaScript object representing the code’s structure. To simplify navigation and analysis, we flatten the AST into an array, reducing complexity.
Because the AST is hierarchical, before flattening it we need to decide which properties can contain child nodes. The fieldsToTraverse array lists the properties we descend into when looking for further child nodes:
// const.js
export const fieldsToTraverse = [
  "body",
  "declarations",
  "arguments",
  "expressions",
  "quasis",
  "property",
  "object",
  "id",
  "init",
  "expression",
  "callee",
  "params",
  "key",
  "value",
  "left",
  "right",
];
We also want a list of fields to pick from each node. The fieldsToPick array includes the properties we are interested in.
// const.js
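// "loc" is included so that each entry keeps its source location, as in the output shown later
export const fieldsToPick = ["name", "kind", "value", "type", "loc"];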
Now, we can create a function that traverses the AST to flatten it. This function recursively navigates through the hierarchical structure of the AST and converts it into a flat array representation. As it traverses the tree, it identifies nodes of interest (fieldsToTraverse) and extracts relevant information based on the specified properties (fieldsToPick).
The flattened array is then organized to preserve the relationships between nodes, making it easier to access and search for specific code elements.
// traverse.js
import { randomUUID } from "node:crypto";
import { fieldsToTraverse, fieldsToPick } from "./const.js";
export function traverse(node, results, parentId, parentType, field, kind) {
  // give each node a uuid
  const id = randomUUID();
  // set custom properties
  const base = {
    parentType,
    parentId,
    id,
    field,
  };
  // pick properties we are interested in
  const extraFields = fieldsToPick.reduce((acc, prop) => {
    if (node[prop]) {
      // "value" can hold a child node or an actual value: skip child nodes
      if (prop === "value" && typeof node[prop] === "object") {
        return acc;
      }
      acc[prop] = node[prop];
    }
    return acc;
  }, {});
  // merge custom properties with the ones from node
  const res = { ...base, ...extraFields };
  // don't push the node if it's empty
  if (Object.keys(extraFields).length !== 0) {
    // add the node to array
    results.push(res);
  }
  const type = res.type;
  // if it's a variable declaration get its kind (let, const, var),
  // otherwise inherit the kind of the enclosing declaration
  if (type === "VariableDeclaration") {
    kind = res.kind;
  } else if (kind) {
    res.kind = kind;
  }
  // traverse the children inside the properties we are interested in
  for (const field of fieldsToTraverse) {
    const child = node[field];
    if (child) {
      // if it is an array, traverse every element
      if (Array.isArray(child)) {
        for (const el of child) {
          traverse(el, results, id, type, field, kind);
        }
      } else {
        traverse(child, results, id, type, field, kind);
      }
    }
  }
}
For the one-line const hello = "Ciao"; snippet, the resulting array of objects looks like this:
[
  {
    // root
    parentType: undefined,
    parentId: undefined,
    id: '1697e177-b48f-422d-8918-49de8c14a4c2',
    field: undefined,
    type: 'Program',
    loc: SourceLocation { start: [Position], end: [Position] }
  },
  {
    // VariableDeclaration
    parentType: 'Program',
    parentId: '1697e177-b48f-422d-8918-49de8c14a4c2',
    id: '9ca1e405-9777-41e0-8050-4f543374cffc',
    field: 'body',
    kind: 'const',
    type: 'VariableDeclaration',
    loc: SourceLocation { start: [Position], end: [Position] }
  },
  {
    // VariableDeclarator of kind const
    parentType: 'VariableDeclaration',
    parentId: '9ca1e405-9777-41e0-8050-4f543374cffc',
    id: '92411202-b341-49e1-838e-6aa2c6afd51e',
    field: 'declarations',
    type: 'VariableDeclarator',
    loc: SourceLocation { start: [Position], end: [Position] },
    kind: 'const'
  },
  {
    // name of the variable
    parentType: 'VariableDeclarator',
    parentId: '92411202-b341-49e1-838e-6aa2c6afd51e',
    id: '12d76f20-7210-4568-9acf-dabea67c3ca5',
    field: 'id',
    name: 'hello',
    type: 'Identifier',
    loc: SourceLocation { start: [Position], end: [Position] },
    kind: 'const'
  },
  {
    // value of the variable
    parentType: 'VariableDeclarator',
    parentId: '92411202-b341-49e1-838e-6aa2c6afd51e',
    id: '7f86024d-98cd-444f-aed5-b7f32f574ff1',
    field: 'init',
    value: 'Ciao',
    type: 'Literal',
    loc: SourceLocation { start: [Position], end: [Position] },
    kind: 'const'
  }
]
We have successfully traversed and flattened the Abstract Syntax Tree. Each entry in the array corresponds to an individual node of the code’s AST, that is, a specific piece of the JavaScript code. Each node is linked to its parent through the parentType and parentId properties, which preserve the hierarchical relationship between nodes.
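Because every entry carries a parentId, the original hierarchy can be rebuilt from the flat array whenever it is needed. The following is a minimal sketch under a few assumptions: the `nodes` sample is hand-written in the shape of the array above (short ids instead of UUIDs, loc omitted), and `ancestorsOf` is a hypothetical helper, not something provided by Acorn or Orama.

```javascript
// Hand-written sample in the shape of the flattened array above.
const nodes = [
  { id: "n1", parentId: undefined, parentType: undefined, type: "Program" },
  { id: "n2", parentId: "n1", parentType: "Program", field: "body", type: "VariableDeclaration", kind: "const" },
  { id: "n3", parentId: "n2", parentType: "VariableDeclaration", field: "declarations", type: "VariableDeclarator", kind: "const" },
  { id: "n4", parentId: "n3", parentType: "VariableDeclarator", field: "id", type: "Identifier", name: "hello", kind: "const" },
];

// Hypothetical helper: follow parentId links from a node up to the root
// and return the node types met along the way, nearest parent first.
function ancestorsOf(flatNodes, id) {
  const byId = new Map(flatNodes.map((node) => [node.id, node]));
  const chain = [];
  let current = byId.get(id);
  while (current && current.parentId !== undefined) {
    current = byId.get(current.parentId);
    if (current) chain.push(current.type);
  }
  return chain;
}

console.log(ancestorsOf(nodes, "n4"));
// → [ 'VariableDeclarator', 'VariableDeclaration', 'Program' ]
```

The same walk could be used to tell two identifiers with the same name apart by the chain of nodes that encloses each of them.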
The additional properties name, kind, value, type, and loc provide valuable information about each code element. For instance, name holds the name of a variable or function, kind indicates the kind of declaration (e.g., “const” or “let”), and value contains the literal value assigned to a variable. The type property specifies the type of node (e.g., “Identifier”, “Literal”, “MemberExpression”), and loc represents the source code location of the node.
Now we want to be able to search and analyze the entries based on their properties, identify specific code patterns, locate variable declarations, and analyze function calls.
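As a point of comparison, a query of this kind can be written by hand with Array.prototype.filter. The sketch below assumes a small hand-written sample in the shape of the flattened array (short ids, loc omitted), and `findDeclarations` is a hypothetical helper name. It only performs exact, case-sensitive matching and returns results unranked.

```javascript
// Hand-written sample: two declarations named "greet" with different kinds.
const flatNodes = [
  { id: "a1", parentType: "VariableDeclarator", field: "id", type: "Identifier", name: "greet", kind: "const" },
  { id: "a2", parentType: "VariableDeclarator", field: "init", type: "Literal", value: "ciao", kind: "const" },
  { id: "b1", parentType: "VariableDeclarator", field: "id", type: "Identifier", name: "greet", kind: "let" },
  { id: "b2", parentType: "VariableDeclarator", field: "init", type: "Literal", value: "ciao", kind: "let" },
];

// Hypothetical helper: find the identifiers that name a variable
// declared with the given kind ("const", "let" or "var").
function findDeclarations(entries, name, kind) {
  return entries.filter(
    (entry) =>
      entry.type === "Identifier" &&
      entry.parentType === "VariableDeclarator" &&
      entry.field === "id" &&
      entry.name === name &&
      entry.kind === kind
  );
}

console.log(findDeclarations(flatNodes, "greet", "let"));
```

This works for a single fixed question, but every new question needs a new predicate, and there is no tokenization or relevance scoring.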
Orama: the lightning-fast search engine
Orama is a fast, batteries-included, full-text search engine entirely written in TypeScript, with zero dependencies.
By providing our array of AST nodes to Orama, we gain the ability to execute complex queries on the data.
Let’s install Orama:
npm i @orama/orama
First of all, we create an Orama database by defining the structure of our data:
// index.js
import { create, insertMultiple, search } from "@orama/orama";
import { parse } from "acorn";
import { readFileSync } from "node:fs";
import { traverse } from "./traverse.js";
const file = readFileSync("public/index.js", "utf-8");
const ast = parse(file, { ecmaVersion: "latest", locations: true });
const nodes = [];
traverse(ast, nodes);
const db = await create({
  schema: {
    parentType: "string",
    parentId: "string",
    id: "string",
    field: "string",
    name: "string",
    type: "string",
    kind: "string",
  }
});
Then we insert our nodes inside the database:
// index.js
await insertMultiple(db, nodes);
And… that’s it! Once we’ve added our AST nodes to Orama, searching for specific information becomes straightforward!
Let’s use a more complex code snippet to perform our queries on:
// public/index.js
class SayHello {
  sayHello = "ciao";
  constructor(sayHello) {
    this.sayHello = sayHello;
    console.log(this.sayHello);
  }
}
new SayHello("ciao");
function sayHello(name) {
  const greet = "ciao";
  console.log(`${greet} ${name}`);
}
let greet = "ciao";
const name = sayHello("Marco");
The code snippet contains several variables with repeated names and different scopes, which can lead to confusion when searching using traditional search filters.
For instance, sayHello is defined as a property of the class SayHello, as a parameter of its constructor, and as a function, which makes it hard to track its usage and assignments.
Similarly, greet is declared twice: as a local variable inside the function sayHello and as a global variable outside it.
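To see why plain string matching is ambiguous here, the sketch below counts the occurrences of sayHello in the snippet with a case-insensitive regular expression, roughly what an editor’s “find” does. The source lines are copied by hand from the snippet above; nothing here depends on Acorn or Orama.

```javascript
// Lines copied by hand from public/index.js above.
const sourceLines = [
  "class SayHello {",
  '  sayHello = "ciao";',
  "  constructor(sayHello) {",
  "    this.sayHello = sayHello;",
  "    console.log(this.sayHello);",
  "  }",
  "}",
  'new SayHello("ciao");',
  "function sayHello(name) {",
  '  const greet = "ciao";',
  "  console.log(`${greet} ${name}`);",
  "}",
  'let greet = "ciao";',
  'const name = sayHello("Marco");',
];

// Count every textual occurrence, ignoring case.
const occurrences = sourceLines.join("\n").match(/sayHello/gi) ?? [];
// Collect the lines that contain at least one occurrence.
const matchingLines = sourceLines.filter((line) => /sayHello/i.test(line));

console.log(`${occurrences.length} occurrences on ${matchingLines.length} lines`);
// → 9 occurrences on 8 lines
```

None of these matches says whether it is a class name, a class property, a parameter, a function declaration, or a call.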
Let’s use Orama to find the SayHello class declaration by filtering on parentType:
// index.js
// find where SayHello Class is declared
const {
  hits: [classDeclaration],
} = await search(db, {
  term: "sayHello",
  properties: ["name"],
  where: {
    parentType: "ClassDeclaration",
  },
});
Output:
{
  "id": "4d8e85bf-74b8-4dbd-8e92-dbbca3ff7b78",
  "score": 1.8931447341375505,
  "document": {
    "parentType": "ClassDeclaration",
    "parentId": "601ee3d4-9b6a-4247-8763-deab2544a256",
    "id": "4d8e85bf-74b8-4dbd-8e92-dbbca3ff7b78",
    "field": "id",
    "name": "SayHello",
    "type": "Identifier",
    // where it is located inside the file
    "loc": {
      "start": {
        "line": 1,
        "column": 6
      },
      "end": {
        "line": 1,
        "column": 14
      }
    }
  }
}
Acorn adds the loc (location) field to AST nodes when the locations option is enabled, as we did when parsing the file. This information allows us to determine precisely where each token appears in the source file.
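As an illustration of what loc makes possible, the sketch below turns a hit into an editor-style file:line:column label and slices the matching text out of the source. The `hit` object is a hand-written sample containing only the fields used here, and `describeLocation` is a hypothetical helper. It relies on Acorn reporting 1-based lines and 0-based columns.

```javascript
// Hand-written sample in the shape of the search hit above.
const hit = {
  document: {
    name: "SayHello",
    loc: {
      start: { line: 1, column: 6 },
      end: { line: 1, column: 14 },
    },
  },
};
// First lines of the searched file, copied by hand.
const source = 'class SayHello {\n  sayHello = "ciao";\n}\n';

// Hypothetical helper: build a "file:line:column" label and extract the token text.
function describeLocation(filename, sourceText, loc) {
  const lines = sourceText.split("\n");
  // lines are 1-based, columns are 0-based
  const line = lines[loc.start.line - 1];
  const text =
    loc.start.line === loc.end.line
      ? line.slice(loc.start.column, loc.end.column)
      : line.slice(loc.start.column); // multi-line node: keep only its first line
  // editors usually display 1-based columns, hence the + 1
  const label = `${filename}:${loc.start.line}:${loc.start.column + 1}`;
  return { label, text };
}

console.log(describeLocation("public/index.js", source, hit.document.loc));
// → { label: 'public/index.js:1:7', text: 'SayHello' }
```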
We can also search for a variable specifying if we are looking for a “let” or a “const”:
// index.js
// search for a variable that is declared with let
const { hits: [greetLet] } = await search(db, {
  term: "greet",
  where: {
    parentType: "VariableDeclarator",
    kind: "let",
  },
});
Output:
{
  "id": "0d0978cd-9b2f-409d-bc1a-ee3299727ffa",
  "score": 2.8535879407490152,
  "document": {
    "parentType": "VariableDeclarator",
    "parentId": "6f734655-7864-4c80-a99c-ffdcd1d02cee",
    "id": "0d0978cd-9b2f-409d-bc1a-ee3299727ffa",
    "field": "id",
    "name": "greet",
    "kind": "let",
    "type": "Identifier",
    // location inside the file
    "loc": {
      "start": {
        "line": 15,
        "column": 4
      },
      "end": {
        "line": 15,
        "column": 9
      }
    }
  }
}
Conclusion
This was a fun experiment where we used Orama and Acorn together to make it easier to explore and understand complex code. The combination of their features worked well, and it would be cool to create a plugin for Visual Studio Code using this system. With such a plugin, we could improve code search, analysis, and navigation, making coding a lot more efficient and enjoyable for developers.
You can download the source code from https://github.com/marco-ippolito/orama-ast.