Tackling a web crawler architecture design problem
Hi, architects. I worked through a new architecture problem and summarized what I got right and what I got wrong.
The problem
My first answer
The ideal answer
Reverse Index Service: This service manages the inverted index, which maps keywords to the documents that contain them. It is used to quickly look up which documents are relevant to a given search query.
Document Service: This service manages the retrieval of documents. Once the Reverse Index Service identifies which documents are needed, the Document Service fetches the actual content of those documents.
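To make the split between the two services concrete, here is a minimal in-memory sketch. It is only an illustration under assumptions of mine: the class names, the method names (`add`, `lookup`, `store`, `fetch`), the whitespace tokenization, and the "every query word must match" rule are all hypothetical and not part of the original answer.

```python
from dataclasses import dataclass, field


@dataclass
class ReverseIndexService:
    """Hypothetical service: maps each keyword to the IDs of documents containing it."""
    index: dict[str, set[str]] = field(default_factory=dict)

    def add(self, doc_id: str, text: str) -> None:
        # Naive tokenization (lowercase + whitespace split) is an assumption.
        for word in text.lower().split():
            self.index.setdefault(word, set()).add(doc_id)

    def lookup(self, query: str) -> set[str]:
        # Assumed semantics: a document matches only if it contains every query word.
        words = query.lower().split()
        if not words:
            return set()
        result = set(self.index.get(words[0], set()))
        for word in words[1:]:
            result &= self.index.get(word, set())
        return result


@dataclass
class DocumentService:
    """Hypothetical service: returns the stored content for given document IDs."""
    documents: dict[str, str] = field(default_factory=dict)

    def store(self, doc_id: str, content: str) -> None:
        self.documents[doc_id] = content

    def fetch(self, doc_ids: set[str]) -> dict[str, str]:
        # Unknown IDs are skipped rather than raising an error.
        return {d: self.documents[d] for d in sorted(doc_ids) if d in self.documents}


def search(query: str, index_service: ReverseIndexService,
           document_service: DocumentService) -> dict[str, str]:
    # Step 1: the index says which documents are relevant.
    # Step 2: the document service fetches their content.
    return document_service.fetch(index_service.lookup(query))


index_service = ReverseIndexService()
document_service = DocumentService()
for doc_id, content in {"Document1": "word1 word2 word3",
                        "Document2": "word2 word4"}.items():
    index_service.add(doc_id, content)
    document_service.store(doc_id, content)

print(search("word2", index_service, document_service))
```

Keeping the lookup and the fetch in separate components mirrors the two-step flow described above: the index answers "which documents?", and only then is the content retrieved.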
My first answer had some mistakes.
・A database was not necessary because saving indexes and snippets was enough to satisfy the specification.
・I didn't know what a reverse (inverted) index is (see the example below).
Standard Index:
・Document1: [word1, word2, word3]
・Document2: [word2, word4]
Reverse Index:
・word1: [Document1]
・word2: [Document1, Document2]
・word3: [Document1]
・word4: [Document2]
・I thought the index service would be part of the web crawler.
・I didn't include the document service or the queues.
・A CDN is not necessary because there is no static content in this specification.
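The relationship between the standard index and the reverse index in the example above can be sketched in a few lines. This is a minimal illustration, not the method from the ideal answer; the function name `build_reverse_index` is hypothetical, and the data is the same Document1/Document2 example.

```python
from collections import defaultdict


def build_reverse_index(standard_index: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert a document -> words mapping into a word -> documents mapping."""
    reverse: defaultdict[str, list[str]] = defaultdict(list)
    for document, words in standard_index.items():
        # dict.fromkeys removes repeated words while keeping their order,
        # so a document is listed at most once per word.
        for word in dict.fromkeys(words):
            reverse[word].append(document)
    return dict(reverse)


standard_index = {
    "Document1": ["word1", "word2", "word3"],
    "Document2": ["word2", "word4"],
}

reverse_index = build_reverse_index(standard_index)
for word, documents in reverse_index.items():
    print(f"{word}: {documents}")
```

With the standard index, answering "which documents contain word2?" means scanning every document. With the reverse index it is a single dictionary lookup, which is why the search path goes through it.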