Building block view

Warning

🚧 This section is still in active development and is subject to changes 🚧

This document describes the core parts of the software, what each of the “building blocks” are for Seedcase, and how they connect and interact with one another. This is the “blueprint” for the design of Seedcase, so it is more detailed than the other design documents but not so detailed as to be describing actual code. To clarify how much detail each part will be described in, there are two established terms we use:

The first section covers the overall view of the main parts inside Seedcase. The second section goes deeper and describes the contents within each main part, but only as black boxes.

White box overall system, Level 1

This section is a high level overview of the internals of Seedcase and is visually represented with the “C4 Container” diagram below. Each part within the diagram is a black box representation of the core internal parts of Seedcase.

C4 Container diagram showing the major internal parts of Seedcase.
Core internal parts of Seedcase (as black boxes).
Building Block Description
Web Interface User Interface as a web portal for interacting with the Seedcase-managed Data Resource.
User Authentication Managing and controlling who gets access to what sections of the Data Resource.
User and Data Management Layer The programmatic layer connecting the web interface to the backend environment.
Backend Environment This enclosed environment contains the inputted data, admin details, changelog, and raw data.
External Secure Server The external server where requested data is sent to (only needed if the user requesting access is an external partner).

This list describes the individual black box parts in more detail:

  • Web Interface: This is the interface that users will interact with, displayed as HTML pages. Different sections of the Web Interface are available to different users depending on their permission level (for instance, admin users will see more pages related to managing data projects).
  • User Authentication: This is the part that handles authenticating users who log into the Web Portal of the Data Resource instance managed by Seedcase. Depending on the user’s assigned permissions, they will see different pages within the Web Interface.
  • User and Data Management Layer: This is the programmatic interface that connects the Web Portal to the Backend Environment. While all actions will be achieved through the Web Portal, advanced users will be able to directly connect to this layer (for instance, through the terminal, Python, or R).
  • Backend Environment: This enclosed environment will contain the database and file storage for the Data Resource. Raw data will be stored here, as well as administrative data files such as a listing of the data projects.
  • External Secure Server: This external system is where requested data will be sent to depending on needs of the user.

Building block, Level 2

Web Portal (white box)

C4 Component diagram of Web Portal interface.
Black boxes within the Web Portal part.
Building Block Description
Landing Page The initial page that a user sees when they visit the seedcase. It provides an overview of the seedcase project and includes links to other important pages.
Login Pages It is used to authenticate users before they can access protected areas. It includes fields for users to enter their login credentials, such as a username and password.
Project List Pages It displays a list of all data projects that a user has access to. Each project is typically displayed as a separate item in a list and includes basic information about the project such as its name and status.
Project Information Pages
Data List Pages It displays a list of all data sets of one project that a user has access to. Each data set is typically displayed as a separate item in a list and includes basic information about the data such as its description, source, and date of creation.
Data Detail Pages It provides detailed information about a specific data set. This includes any relevant metadata.
Upload Data Pages It allows authorized users to upload new data sets to the seedcase. It includes fields for users to enter information about the data set, as well as a file upload field to select the data file.
Download Data Pages It allows authorized users to download data sets from the seedcase. It includes options to filter the data set based on various criteria, such as date range or specific fields.
Data QC Process Pages It provides information about the quality control process used to ensure that data sets are accurate and reliable. It provides a list of the quality control functions. It also provides the result to determine if a data set passes or fails.
Data Request Pages Submitting a request to use data from the Data Resource. Includes the list of variables selected from the data, some descriptions on the proposed use of the data, as well as information on subsetting the data.

Data catalog

  • Is a list or table with variable name, a brief description, and a link to further details that includes the data dictionary entry for the variable. The data dictionary would contain basic aggregate statistics (like number of missingness, general distribution of the data values) as well as a changelog of the variable. Alongside the list of variables are buttons to select them for a data request application. The variable list or table would be built from the underlying set of files that contain the documentation.

Data projects

  • The page contains a list of all data projects connected to the Data Resource. This would include the title of the project, the name of the applicant, and who else will have access to the data, as well as a short description written as part of the data project application. If the researcher is interested, they should be able to click through to a more detailed description of the individual project which will display a list of the variables that the data project has been granted access to as well as simple aggregations of the variables.

User Authentication (white box)

user id
Black boxes within the User Authentication part.
Building Block Description
Connect Remote ID Server Connect to remote user authentication server
Local Authentication Use seedcase local user authentication server

User and Data Management Layer (white box)

user data management

API Layer

Black boxes related to API Layer within the User and Data Management Layer part.
Building Block Description
API Security This implements measures to protect an API from unauthorized access, data theft, and other security threats.
Dataset CRUD* API This API endpoint allows creating, reading, updating, and deleting the dataset in the system.
Data file CRUD* API This API endpoint allows creating, reading, updating, and deleting the data file in the system.
List Projects Metadata This API endpoint allows listing all existing projects, it could use parameters to display related project metadata.
List Project Log This API endpoint allows displaying a log of all activity related to a specific project in the system.
Researchers CRUD* API This API endpoint allows creating, reading, updating, and deleting a researcher user in the system.

*CRUD: An acronym for Create, Read, Update, and Delete. These four actions are the core operations for data storage on computers.

Admin management

Black boxes related to Admin Management within the User and Data Management Layer part.
Building Block Description
Project Management These admin pages allow administrators to manage projects in the system. This includes creating new projects, updating existing projects, and assigning researchers to projects.
Project Dataset Management These admin pages allow administrators to manage the datasets associated with a specific project. This includes adding new datasets to the project, updating existing datasets, and managing the access permissions for each dataset.
Data QC Process Management These admin pages allow administrators to manage the quality control (QC) process for data sets in the system. This includes setting thresholds for pass/fail criteria and managing the process for reviewing and approving data sets that have been cleaned and prepared.
User Role Management These admin pages allow administrators to manage the access permissions for users in the system. This includes defining roles and permissions for different types of users, such as administrators, researchers, and data analysts, and managing the access permissions for each user.
API Security Management These admin pages allow administrators to manage the security measures that are in place to protect the API for the system.
External Connection Management These admin pages allow administrators to manage the connections between the system and external data sources or applications. This includes setting up new connections, updating existing connections, and managing the access permissions for each connection.

These are some potential endpoints for the API. For almost all of them, there are some overlapping features. For instance, each endpoint accepts (either as a requirement or optionally):

  • Authorization (string): A security code that is used to authenticate the user. This security code should be generated and provided to the user when they create an account.
Common API response status codes shared by (almost all) endpoints.
Response status code Description
400 Bad Request The request was malformed or invalid.
401 Unauthorized The authentication code provided is invalid or missing.
500 Internal Server Error There was an error processing the request.

Upload data

  • POST /data/raw/: This endpoint allows users to upload raw research data to a project.
Response status code Description
201 Created The research data was successfully posted.
404 Not Found The project with the specified ID was not found.

Upload raw data

  • POST /data/raw/file: This endpoint allows users to upload raw data files to a project.
Response status code Description
201 Created The raw data file was successfully uploaded.

Generate data for a data project

  • POST /projects/{project_id}/data/: This endpoint allows users to post data to be generated for a data project, and provides a location for the user to download the generated data after it has been created.
    • project_id (string, required): The unique identifier of the data project that the data is being generated for.
Response status code Description
202 Accepted The request was accepted and the data generation process has started.
404 Not Found The project with the specified ID was not found.

Run QC function

  • POST /data/raw/qc/{function_name}: This endpoint allows users to call a specific data cleaning function. The user must provide the name of the function they wish to call as a URL parameter and any additional parameters required by the function in the request body.
    • function_name (string, required): The name of the data cleaning function to call.
Response status code Description
201 process completed The request was successful and the data cleaning function was executed.

Get list of data projects

  • GET /projects/: This endpoint allows users to retrieve a list of data projects based on the specified filter parameters.
    • status (string, optional): Filter the list of projects based on their status. Possible values are “proposed”, “ongoing”, and “completed”. If not specified, all projects will be returned.
Response status code Description
200 OK The request was successful.

Get metadata

  • GET /metadata/: This endpoint allows users to retrieve metadata for all the data contained in the Data Resource.
  • GET /projects/{project_id}/metadata: This endpoint allows users to retrieve metadata, like project description and variables requested, for the data project.
Response status code Description
200 OK The request was successful.

Get list of registered users

  • GET /users/: Get list of users registered within the Data Resource.
  • GET /projects/{project_id}/users/: Get list of users assigned to a specific project.
    • project_id (string, required): The ID of the project.
Response status code Description
200 OK The request was successful.

Assign permissions to users

  • POST /users/{user_id}/permissions/: This endpoint allows (authorized) users to add, remove, or update user permissions for a project.
    • user_id (string, required): The ID of the user.
    • Additional parameters in the request body, as a JSON object with the following fields:
      • user_email (string): The email address of the user whose permissions will be updated.
      • action (string): The action to be performed on the user’s permissions. Valid values are “add”, “remove”, or “update”.
      • permission (string): The permission to be granted or revoked. Valid values are “read” or “write”. If the action is “update”, the user must also provide the new value of the permission field.
Response status code Description
200 OK The request was successful and the user permissions were updated
404 Not Found The user with the specified ID was not found.

Get the changelog

  • GET /log/: This endpoint allows users to retrieve the history log of the Data Resource based on certain criteria. The user must provide at least one of the following parameters in the query string:
    • dataset_id (string, optional): The ID of the dataset for which to retrieve the history log.
    • date_from (string, optional): The starting date for which to retrieve the history log (formatted as YYYY-MM-DD).
    • date_to (string, optional): The ending date for which to retrieve the history log (formatted as YYYY-MM-DD). If multiple parameters are provided, the endpoint will return all history logs that match any of the provided criteria.
Blackboxes related to data entry within the User and Data Management Layer part.
Response status code Description
200 OK The request was successful and the history log was retrieved.
404 Not Found No history logs were found for the specified criteria.

Backend Environment (whitebox)

The input data will be quite heterogeneous and so the Seedcase backend will need to be composed of multiple components:

  • Raw data files as plain text
  • Cleaning and processing programming scripts
  • A formal database structure (e.g. SQL), that is ACID (Atomic, Consistent, Isolated, and Durable) compliant for data integrity and that handles heterogeneous data types
  • A Common Data Model that is standardized and structured for use by the database
  • A version control system to track changes to the raw data and processing scripts, for recordkeeping, auditing, and transparency
  • A data version numbering system, based on the CalVer style, to help track which data the user is working with and to allow for comparison between versions
  • A changelog describing the changes
  • A data dictionary linked to the variables contained in the database, modified manually by the users but generated automatically

C4 Component diagram showing the Backend Environment Container.
Black boxes within the Backend Environment part.
Building Block Description
Raw File Storage Raw data that requires processing before entering the database, preferably as CSV or other plain text format.
Large File Storage Data files (like images) that are too big or can’t be stored in the database. A variable for the file path will be used in the database instead.
Relational Database PostgreSQL relational database containing the data, as well as some metadata, data projects, and user information.
Black boxes of tables in Relational Database within the Backend Environment part.
Building Block Description
User information Details about registered users and their permission levels, viewable by admin users.
Resource data Research data collected, uploaded, and store by users of Data Resource.
Data project information Submitted requests for data as a data project, with details like project description and list of variables.
Changelog Changes made to data recorded as a table.