Tackling a web crawler architecture design problem

Tomoharu Tsutsumi
Jan 27, 2024


Hi, architects. I worked through a new architecture problem and summarized what was good and what was bad about my answer.

The problem

My first answer

The ideal answer

Reverse Index Service: This service manages the inverted index, which maps keywords to the documents that contain them. It is used to quickly look up which documents are relevant to a given search query.

Document Service: This service manages the retrieval of documents. Once the Reverse Index Service identifies which documents are needed, the Document Service fetches the actual content of those documents.
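The way these two services work together can be sketched roughly as follows. This is a minimal in-memory illustration, not the design from the ideal answer itself: the class names, method names (`add`, `lookup`, `store`, `fetch`), and the `search` helper are all hypothetical, and a real system would put each service behind its own storage and network API.

```python
from collections import defaultdict


class ReverseIndexService:
    """Hypothetical sketch: maps each keyword to the IDs of documents containing it."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, doc_id, text):
        # Naive tokenization (assumption): lowercase and split on whitespace.
        for word in text.lower().split():
            self._index[word].add(doc_id)

    def lookup(self, query):
        """Return the IDs of documents that contain every word in the query."""
        words = query.lower().split()
        if not words:
            return set()
        # Copy the first posting set so the intersection never mutates the index.
        result = set(self._index.get(words[0], set()))
        for word in words[1:]:
            result &= self._index.get(word, set())
        return result


class DocumentService:
    """Hypothetical sketch: returns the stored content for given document IDs."""

    def __init__(self):
        self._docs = {}

    def store(self, doc_id, content):
        self._docs[doc_id] = content

    def fetch(self, doc_ids):
        # Unknown IDs are skipped rather than raising an error.
        return {d: self._docs[d] for d in sorted(doc_ids) if d in self._docs}


def search(index_service, document_service, query):
    # Step 1: the index decides which documents match.
    # Step 2: the document service fetches their content.
    return document_service.fetch(index_service.lookup(query))
```

With this sketch, a search is two calls in sequence: `lookup` narrows the query down to document IDs, and `fetch` loads only those documents, so the index never has to hold full document content.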

My first answer had several mistakes.

・A database was not necessary, because saving indexes and snippets was enough to satisfy the specification.

・I didn't know what a reverse index was (see the comparison below).

Standard Index:
・Document1: [word1, word2, word3]
・Document2: [word2, word4]
Reverse Index:
・word1: [Document1]
・word2: [Document1, Document2]
・word3: [Document1]
・word4: [Document2]

・I thought the index service would be part of the web crawler rather than a separate service.

・I didn't include the Document Service or queues.

・A CDN is not necessary, because there is no static content in this specification.
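To make the reverse index and the role of a queue more concrete, here is a small sketch that turns the standard index from the comparison above into the reverse index. It assumes a single in-process `queue.Queue` standing in for a real message queue, and the function names `crawl` and `build_reverse_index` are made up for illustration. The point is only that the crawler hands pages off to a queue and a separate indexing step consumes them.

```python
from collections import defaultdict
from queue import Empty, Queue


def crawl(pages, page_queue):
    """Stand-in for the crawler: it only enqueues pages and does no indexing."""
    for doc_id, words in pages.items():
        page_queue.put((doc_id, words))


def build_reverse_index(page_queue):
    """Stand-in for the index service: drains the queue and inverts the mapping."""
    reverse_index = defaultdict(list)
    while True:
        try:
            doc_id, words = page_queue.get_nowait()
        except Empty:
            break  # queue is drained
        for word in words:
            # Skip duplicates so each document appears once per word.
            if doc_id not in reverse_index[word]:
                reverse_index[word].append(doc_id)
    return dict(reverse_index)


# The standard index from the comparison above.
standard_index = {
    "Document1": ["word1", "word2", "word3"],
    "Document2": ["word2", "word4"],
}

page_queue = Queue()
crawl(standard_index, page_queue)
reverse_index = build_reverse_index(page_queue)
# reverse_index now maps word1 -> [Document1], word2 -> [Document1, Document2],
# word3 -> [Document1], word4 -> [Document2]
```

Because the crawler only writes to the queue, the indexing step can be scaled or replaced without touching the crawler, which is one reason to keep the two apart.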

Feel free to reach out to me on LinkedIn, which you can find below. Looking forward to connecting!

https://www.linkedin.com/in/tomoharu-tsutsumi-56051a126/


Tomoharu Tsutsumi

Senior Software Engineer at two industry-leading startups (Go | Ruby | TypeScript | JavaScript | Gin | Echo | Rails | React | Redux | Next)