diff --git a/Project-plan.md b/Project-plan.md
index 7e14d27..84c24fa 100644
--- a/Project-plan.md
+++ b/Project-plan.md
@@ -1,7 +1,67 @@
 # Various steps to work on the project
+The project is divided into the following stages and sub-stages:
+
+### 1. Collecting Data
+
+- **Objective**: Gather job offers and company information from multiple sources.
+- **Sources**:
+  - [The Muse API](https://www.themuse.com/developers/api/v2)
+  - [Adzuna API](https://developer.adzuna.com/)
+  - Web scraping of StepStone using Selenium and BeautifulSoup.
+- **Tools**:
+  - Requests library for API interaction (see the sketch after this file's changes).
+  - Postman (for testing the API calls).
+  - Web scraping techniques.
+
+### 2. Data Modeling
+
+- **Objective**: Create a data lake or database to store collected data.
+- **Approaches**:
+  - NoSQL database (Elasticsearch).
+- **Tools**:
+  - Elasticsearch.
+  - UML diagram for data model visualization.
+
+### 3. Data Consumption
+
+<details>
+<summary>Click to expand</summary>
+In our present scenario, we get data from three sources: the Muse API, the Adzuna API, and StepStone.
+</details>
+
+- **Objective**: Analyze the collected data to derive insights about the job market.
+- **Analysis Tasks**:
+  - Number of offers per company.
+  - Sectors with the highest recruitment.
+  - Ideal job criteria (location, technologies, sector, level).
+- **Tools**:
+  - Dash for visualization.
+  - Elasticsearch for statistics.
+  - Queries fed by the database(s).
+
+### 4. Going into Production
+
+- **Objective**: Deploy project components and create APIs for data retrieval.
+- **Components**:
+  - API using FastAPI or Flask.
+  - Docker containers for each component. Steps for the dockerization are available in the [Docker-image file](Docker-image-integration.md).
+- **Tools**:
+  - FastAPI or Flask for API development.
+  - Docker for containerization.
+  - Docker Compose for container orchestration.
+
+### 5. Automation of Flow (future work)
+
+- **Objective**: Automate data retrieval from sources.
+- **Tools**:
+  - Apache Airflow for workflow automation.
+  - A Python file defining the DAG.
+
+These instructions were given to us by the course co-host and are available in the [Datascientist.com project doc](https://docs.google.com/document/d/1glRF8HtyNqcHnZud8KqeJYLdC07_MqjuFGJVOuw7gBc/edit).
+# Contributions
 ## Step 0
 ### Framing (first meeting):
 | Task | Description and links|
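Stage 1 above lists the Requests library together with The Muse and Adzuna APIs. As a rough illustration of what that collection step can look like, here is a minimal sketch that pulls one page of offers from The Muse public API; the endpoint, parameters, and field names follow the public documentation and are assumptions rather than the project's actual extraction code.

```python
# Minimal sketch: fetch one page of job offers from The Muse public API.
# Endpoint and response fields are taken from the public docs and may need
# adjusting to match the project's own extraction scripts.
import requests

MUSE_URL = "https://www.themuse.com/api/public/jobs"

def fetch_muse_jobs(page: int = 1) -> list[dict]:
    """Return the list of job offers from one results page."""
    response = requests.get(MUSE_URL, params={"page": page}, timeout=10)
    response.raise_for_status()
    return response.json().get("results", [])

if __name__ == "__main__":
    for job in fetch_muse_jobs(page=1):
        company = job.get("company", {}).get("name", "unknown")
        print(f"{job.get('name')} at {company}")
```

The Adzuna API follows the same request pattern but additionally requires an application id and key passed as query parameters.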
diff --git a/README.md b/README.md
index ba11537..44fce66 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ This project aims to showcase skills in data engineering by gathering and analyz
 ```bash
 cd Job-Market-Project
 ```
-3. **Set Up Virtual Environment (Optional):** It's a good practice to work within a virtual environment to manage dependencies. In our case, we have created a Python virtual environment using `virtualenv` or `conda`:
+3. **Set Up Virtual Environment (Optional):** It's a good practice to work within a virtual environment to manage dependencies. In our case, we have created a Python virtual environment using `virtualenv` (which can be installed through `pip install virtualenv`) or `conda`:
 ```bash
 # Using virtualenv
 python -m venv env
@@ -34,6 +34,10 @@ This project aims to showcase skills in data engineering by gathering and analyz
 conda create --name myenv
 conda activate myenv
 ```
+   **Deactivate the Virtual Environment:** When you're done working on your project, you can deactivate the virtual environment to return to the global Python environment.
+   ```bash
+   deactivate
+   ```

 5. **Install Dependencies:** Install the required Python packages specified in the requirements.txt file:
 ```bash
@@ -47,9 +51,6 @@ This project aims to showcase skills in data engineering by gathering and analyz

 8. **Access FastAPI Application:** Once your FastAPI application is running, we can access it in our browser by navigating to `http://localhost:8000` (assuming we're running it locally).

-## Launch on Binder
- [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunp77/Job-Market-project/main)
-
 ## Project structure:

 ```
@@ -96,65 +97,17 @@ Job-Market-project/
 └── UserStories.md                  # User stories file
 ```

-## Project Stages
-
-The project is divided into the following stages and sub-stages
-
-### 1. Collecting Data
-
-- **Objective**: Gather job offers and company information from multiple sources.
-- **Sources**:
-  - [The Muse API](https://www.themuse.com/developers/api/v2)
-  - [Adzuna API](https://developer.adzuna.com/)
-  - Web Scraping from stepstone using selenium and beautifulsoup.
-- **Tools**:
-  - Requests library for API interaction.
-  - Postman tool (for testing)
-  - Web scraping techniques.
-
-### 2. Data Modeling
-
-- **Objective**: Create a data lake or database to store collected data.
-- **Approaches**:
-  - NoSQL Database (Elastic search)
-- **Tools**:
-  - Elasticsearch
-  - UML Diagram for data model visualization.
-
-### 3. Data Consumption
-
-<details>
-<summary>Click to expand</summary>
- In our present scenario, we get data from 3 sources, MUSE API, Adjurna API, and Stepstone.
-</details>
-
-- **Objective**: Analyze the collected data to derive insights about the job market.
-- **Analysis Tasks**:
-  - Number of offers per company.
-  - Sectors with the highest recruitment.
-  - Ideal job criteria (location, technologies, sector, level).
-- **Tools**:
-  - Dash for visualization.
-  - Elasticsearch for statistics.
-  - Queries fed by the database(s).
-
-### 4. Going into Production
-
-- **Objective**: Deploy project components and create APIs for data retrieval.
-- **Components**:
-  - API using FastAPI or Flask.
-  - Docker containers for each component. Steps for the dockerization are available in [Docker-image file](Docker-image-integration.md).
-- **Tools**:
-  - FastAPI or Flask for API development.
-  - Docker for containerization.
-  - Docker Compose for container orchestration.
-
-### 5. Automation of Flow (future work)
-
-- **Objective**: Automate data retrieval from sources.
-- **Tools**:
-  - Apache Airflow for workflow automation.
-  - Python file defining the DAG.
+## Elasticsearch Integration
+
+In this project, we use Elasticsearch as our primary database for storing, retrieving, and analyzing structured and unstructured data. Elasticsearch is a distributed, RESTful search and analytics engine designed for horizontal scalability, real-time search, and full-text queries, which fits the indexing and analysis needs of this project. We interact with it from Python through the `elasticsearch` client library, which can be installed with:
+```bash
+pip install elasticsearch
+```
+- The `db_connection.py` script shows how Python code connects to Elasticsearch, performs data operations, and integrates Elasticsearch into our project workflow.
+- Docker plays a crucial role in our project by containerizing Elasticsearch and simplifying the management of deployment environments.
+- The [docker-compose.yml](docker-compose.yml) file defines the Docker services required to run Elasticsearch and Kibana in isolated containers.
+- Docker Compose orchestrates the deployment of these services, ensuring consistent and reproducible environments across development and deployment stages. By containerizing Elasticsearch, we gain portability and can deploy the same setup in different environments with minimal configuration.
+

 ## Docker Images

@@ -180,4 +133,7 @@
 This project is licensed under the [GNU General Public License v3.0](LICENSE).

 > [Link to the docs for project](https://docs.google.com/document/d/1glRF8HtyNqcHnZud8KqeJYLdC07_MqjuFGJVOuw7gBc/edit)
-
+
+## Launch on Binder
+ [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunp77/Job-Market-project/main)
\ No newline at end of file
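The Elasticsearch Integration section added above describes connecting from Python with the `elasticsearch` client. As a point of reference, here is a minimal sketch of such a connection, assuming the 8.x client and the default local port exposed by the compose setup; the index name and document fields are illustrative and are not taken from `db_connection.py`.

```python
# Minimal sketch: connect to a local Elasticsearch instance, index one document,
# and run a simple match query. Host, index name, and fields are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed default local port

offer = {"title": "Data Engineer", "company": "ExampleCorp", "location": "Berlin"}
es.index(index="job_offers", document=offer)   # add one job offer
es.indices.refresh(index="job_offers")         # make it immediately searchable

result = es.search(index="job_offers", query={"match": {"title": "engineer"}})
print(result["hits"]["total"]["value"], "matching offers")
```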
diff --git a/database-selection.md b/database-selection.md
index f3b3eef..b2c75c8 100644
--- a/database-selection.md
+++ b/database-selection.md
@@ -15,7 +15,9 @@ By leveraging these advantages, organizations and researchers can gain valuable
 Steps to launch the Elastic search and load data,
 1. Run the docker-compose.yml as below where the repo present
-docker-compose up -d.
+   ```bash
+   docker-compose up -d
+   ```
 2. open kibana in any browser using :: http://localhost:5601/
 3. To access it, open pane like,
diff --git a/scripts/database/db_connection.py b/scripts/database/db_connection.py
index 2411eb1..cb3980a 100644
--- a/scripts/database/db_connection.py
+++ b/scripts/database/db_connection.py
@@ -172,7 +172,7 @@ def handler():

     # get script base path
     script_dir = os.path.dirname(os.path.realpath(__file__))
-    read_file_path = script_dir.replace("\scripts\database", "\data\processed_data")
+    read_file_path = script_dir.replace("\\scripts\\database", "\\data\\processed_data")
     print(read_file_path)
     # call the below function to load the dataset successfully to ES database
     ss_dataset(read_file_path, es)
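For context on this last hunk: sequences such as `\s` and `\d` are not valid escape sequences in Python string literals, so the old strings only worked because the interpreter currently passes invalid escapes through with a DeprecationWarning (a SyntaxWarning on recent versions), and this behavior is expected to become an error in a future release. Doubling the backslashes makes the literal backslashes explicit. The replacement is still Windows-specific; a cross-platform sketch using `pathlib` (assuming the same `scripts/database` and `data/processed_data` layout) could build the path instead:

```python
from pathlib import Path

# Resolve <repo>/data/processed_data relative to this script instead of
# rewriting a Windows-style path string; also works on Linux and macOS.
script_dir = Path(__file__).resolve().parent            # <repo>/scripts/database
read_file_path = script_dir.parents[1] / "data" / "processed_data"
print(read_file_path)
```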