This project enables mirroring that replicates MongoDB Atlas data into Microsoft Fabric OneLake in near real time.
It involves two main steps:
- Creating the Fabric mount point, also called the Landing Zone or MirrorDB.
- Deploying the code to run in an App Service using the Deploy to Azure button below, or to a VM by cloning the Git repo and running the `app.py` script.
Once started, the app continuously tracks changes in MongoDB Atlas and replicates them to the MirrorDB created in Step 1.
To create the MirrorDB, we use the Fabric UI.
Follow the steps below for the MirrorDB (LZ) creation.
1. Click “+ New Item” in your workspace.
2. In the pop-up window that opens, select “Mirrored Database (preview)”.
3. Give a name for your new MirrorDB.
4. The new MirrorDB creation will begin and the screen shows its progress.
5. The MirrorDB creation is complete when you see the screen showing that replication is “Running”. Note that you can also get the LandingZone URL from this screen.
6. You can also verify the LandingZone folder created within the MirrorDB in Azure Storage Explorer.
Step 2 is essentially executing the ARM template by clicking the Deploy to Azure button. But first, we need to have the parameters required by the ARM template ready.
- Keep the MongoDB `Connection uri`, `Database name` and `Collection name` handy for input to the ARM template. Note that you can give multiple collections as an array `[col1, col2]`, or give `all` for all collections in a database.
- Install Azure Storage Explorer. Connect to Azure Storage by selecting `Attach to a resource` -> `ADLS Gen2 container or directory` -> `Sign in using OAuth`. Select your Azure login id and on the next screen give the `Blob container or directory URL` as `https://onelake.blob.fabric.microsoft.com/<workspace name in Fabric>`. Once connected you can see the workspace under `Storage Accounts` -> `(Attached Containers)` -> `Blob Containers`. Double-click your workspace and you should see the MirrorDB folder. You should also have a `LandingZone` folder within the `Files` folder. You can always check this folder for the parquet files that get replicated to OneLake and shown as tables in OneLake.
- Authentication is through a Service Principal, so go to `App Registrations` in the Azure portal and register a new app. Also create a new secret in the app. Keep the Tenant Id, App Id and the value of the secret for input to the ARM template. The secret value must be copied when it is created; you will not be able to see it later. (A minimal sketch for verifying this service principal follows this list.)
- Go to Fabric Workspace -> Manage Access -> type in the new app name -> select the app and give it `admin` permissions to write to this workspace. Select `Add people or groups` if you do not already have the app in your list.
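To sanity-check the service principal before deploying, a minimal sketch along the following lines can list the LandingZone folder over OneLake's ADLS Gen2 (DFS) endpoint. The environment variable names, workspace and MirrorDB folder values here are placeholders for illustration, not names used by the app itself:

```python
# Minimal sketch: verify the service principal can reach the Fabric workspace
# over OneLake's ADLS Gen2 (dfs) endpoint. Placeholder names throughout.
import os

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Assumed environment variable names -- substitute your own values.
credential = ClientSecretCredential(
    tenant_id=os.environ["TENANT_ID"],
    client_id=os.environ["APP_ID"],
    client_secret=os.environ["APP_SECRET"],
)

# OneLake exposes an ADLS Gen2 endpoint; the Fabric workspace acts as the file system.
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=credential,
)
workspace = service.get_file_system_client("<workspace name in Fabric>")

# List whatever is under the MirrorDB's Files/LandingZone folder
# (replace <MirrorDB folder> with the folder you see in Storage Explorer).
for path in workspace.get_paths(path="<MirrorDB folder>/Files/LandingZone"):
    print(path.name)
```

If this listing succeeds, the Tenant Id, App Id, secret and workspace permissions are good to use as ARM template inputs.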
Refer to `.env_example` for the parameters needed.
If running locally or on a VM, clone this repo, populate all parameters in `.env_example` and rename it to `.env`. Then just run `python app.py` to start the app and thus start replication from the MongoDB Atlas collections to the OneLake MirrorDB's tables.
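Purely as an illustration of what the app needs at startup, loading the `.env` could look like the sketch below; the variable names are assumptions made for this example, so take the real names from `.env_example`:

```python
# Illustrative only: load replication parameters from .env before starting.
# The variable names below are assumptions -- use the names in .env_example.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

config = {
    "mongo_uri": os.getenv("MONGO_CONNECTION_URI"),
    "database": os.getenv("MONGO_DATABASE"),
    # A single collection, an array like [col1, col2], or "all"
    "collections": os.getenv("MONGO_COLLECTIONS"),
    "tenant_id": os.getenv("TENANT_ID"),
    "app_id": os.getenv("APP_ID"),
    "app_secret": os.getenv("APP_SECRET"),
    "landing_zone_url": os.getenv("LANDING_ZONE_URL"),
}

missing = [name for name, value in config.items() if not value]
if missing:
    raise SystemExit(f"Missing .env parameters: {missing}")
```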
Clicking the button below will take you to the Azure portal; fill in the values for the parameters along with the App Service specific parameters and click Deploy. Once the App Service is deployed, go to the resource and watch the `Log stream` to see whether the app has started. After the App Service is deployed, it takes 10 - 15 minutes for the app to be deployed onto the App Service and started.
Please don't click on the App URL before the deployment is complete and logs from the app start showing up.
Click below to start your App Service for MongoDB to Fabric replication:
- Please note that the code creates two threads for each collection (one for initial_sync and one for delta_sync), so if we have large collections (~10 million+ records) we should be judicious in selecting the compute size of the App Service or VM. As a high-level benchmark, a compute of 4 CPUs and 16 GiB of memory might work for 5 such collections with a high throughput of, say, 1000 records/second. Beyond that, we should monitor the performance and threads and check the CPU usage. (A simplified sketch of this thread layout appears after the troubleshooting notes below.)
- Azure Storage Explorer is your starting point for troubleshooting. Use the files below that start with an underscore to get vital information. (They are not copied to OneLake because they start with an underscore "_".) Also note that these are pickle files; you can view them using the command "python -m pickle _maxid.pkl" in a terminal.
a. _max_id: Tells you the maximum _id value captured before initial sync began. Any _id > this _max_id comes from real-time sync; all records with _id <= this _max_id are copied as part of initial_sync.
b. _resume_token: Contains the last resume token of the real-time change event copied to the LZ. Thus, you see this file only if at least one real-time changes parquet file was written to the LZ.
c. _initial_sync_status: Indicates whether initial_sync is complete. "Y" in this file indicates that initial_sync is complete.
d. _metadata.json: Has the primary key, which is always "_id". This file must exist in a replicated folder/table for mirroring to work.
e. _last_id: The "_id" value of the last record of the last initial sync batch file written to the LZ. This file is deleted when initial sync is completed.
f. _internal_schema: One of the very first files written; it has the schema of the records in the collection being replicated.
- Restartability of the App Service/replication is guaranteed only if the _resume_token file is present. This is because, if initial sync is not completed and we restart the App Service, the delta changes that came in the interim were being accumulated in TEMP parquet files on the App Service and will be lost. Thus, as a best practice, if the process fails before initial sync is completed, it is advised to delete all files in the collection folder using Azure Storage Explorer and restart the process so that it can get a new max _id and start initial_sync again. Once initial_sync is completed and the _resume_token file is created, we can restart without any worries, as the app will pick up changes from the last resume_token via the change stream (see the resume-token sketch below).
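To picture the thread layout mentioned in the sizing note above, the outline below shows one initial_sync thread and one delta_sync thread per collection. It is a simplified sketch with placeholder function and collection names, not the actual `app.py` code:

```python
# Simplified outline of the thread layout: one initial_sync thread and one
# delta_sync thread per replicated collection (placeholder functions).
import threading

def initial_sync(collection_name: str) -> None:
    # Placeholder: copy records with _id <= _max_id in batches to the LandingZone.
    ...

def delta_sync(collection_name: str) -> None:
    # Placeholder: tail the MongoDB change stream and write delta parquet files.
    ...

collections = ["col1", "col2"]  # e.g. the collections passed to the ARM template

threads = []
for name in collections:
    for target in (initial_sync, delta_sync):
        thread = threading.Thread(target=target, args=(name,), daemon=True)
        thread.start()
        threads.append(thread)

# With ~5 large collections this is already 10+ worker threads, which is why
# compute sizing (e.g. 4 CPUs / 16 GiB) should be monitored at high throughput.
for thread in threads:
    thread.join()
```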
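The restart behaviour described above relies on MongoDB change stream resume tokens. A minimal pymongo sketch of resuming from a saved token is shown below; the connection string, database/collection names and the local pickle file name are placeholders and assumptions, not the app's actual values:

```python
# Minimal sketch of resuming a MongoDB change stream from a saved resume token.
# Connection string, database/collection names and pickle path are placeholders.
import pickle
from pathlib import Path

from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
collection = client["<database>"]["<collection>"]

token_file = Path("_resume_token.pkl")  # assumed local copy of the LZ file
resume_token = pickle.loads(token_file.read_bytes()) if token_file.exists() else None

# If a token exists, the stream resumes right after the last replicated change;
# otherwise it starts from "now", which is why initial_sync must complete first.
with collection.watch(resume_after=resume_token) as stream:
    for change in stream:
        # ... write the change to a parquet file in the LandingZone ...
        resume_token = stream.resume_token  # token covering this event
        token_file.write_bytes(pickle.dumps(resume_token))
```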