Building Cloud Services: Spring Data
Link to project here.
One of the important functions we have to learn to build into our cloud services is taking advantage of the vast amount of storage resources available to store data on a client's behalf, and then being able to search that data and find what a particular client needs.
This process looks very similar to sending object data over HTTP, where we used the Jackson library to automatically convert objects into JSON to be placed into an HTTP body, and then converted back into an object on the receiving end. Instead of just plainly converting to JSON, we have to convert the data into a format that fits the storage model of the given database. When using SQL databases, for example, we could assign each member variable to a column in a table, and each object would occupy a single row. The conversion from object to row for storage, and from row back to object on retrieval, is called Object-Relational Mapping (ORM).
Spring Repositories
Java Persistence API
The Java Persistence API (JPA) is a tool that helps us with ORM, using annotations to describe how instances of our objects should be stored in a database. Back to our video service example:
```java
@Entity
public class Video {
    @Id
    @GeneratedValue(strategy=GenerationType.AUTO)
    private long id;
    private String name;
    private String category;
    // ... Getters/Setters ...
}
```
The first thing we have to do is tell JPA that this is a type of data we would like to store in our database, which is done via the `@Entity` annotation. In order to effectively store, search for, and access these objects, we can mark a member variable with the `@Id` annotation to designate it as the object's unique identifier. The `@GeneratedValue(strategy=GenerationType.AUTO)` annotation tells JPA to automatically generate that identifier for us.
Creating Repositories
Once we have a series of classes annotated with `@Entity`, whose instances carry unique identifiers, we can interact with the database through JPA and Spring by creating a repository.
```java
@Repository
public interface VideoRepo extends CrudRepository<Video, Long> { ... }
```
All we have to do is create a public interface that extends one of the provided repository interfaces, such as `CrudRepository`, and parameterize it with the type we are storing, `Video`, as well as the type of its unique identifier, `Long`. `CrudRepository` is a standard interface for a repository that can save objects to a database; CRUD stands for "Create, Read, Update and Delete". Once we have specified the interface we would like and annotated it with `@Repository`, Spring Data automatically generates an implementation containing all of the complex logic needed to save instances of classes annotated with `@Entity` into the database.
The `CrudRepository` interface comes with a series of default methods for performing CRUD operations via each `Video` object's unique identifier. If we need to query by other member variables, we must define additional methods in the interface.
```java
@Repository
public interface VideoRepo extends CrudRepository<Video, Long> {
    public List<Video> findByName(String n);
    public List<Video> findByNameAndCategory(String n, String cat);
}
```
Behind the scenes, Spring will automatically create implementations of the methods defined in our interface, deriving the query from each method's signature and name:
- `public List<Video> findByName(String n);`
  - `List<Video>` - the expected return type.
  - `findBy` - the query method.
  - `Name` - look in `video.name`.
  - `n` - compare `video.name` to this.
- `public List<Video> findByNameAndCategory(String n, String cat);`
  - `List<Video>` - the expected return type.
  - `findBy` - the query method.
  - `Name` - look in `video.name`.
  - `Category` - look in `video.category`.
  - `n` - compare `video.name` to this.
  - `cat` - compare `video.category` to this.
Many kinds of complex search mechanisms can be added simply by defining new methods and giving them appropriate names. More information on this can be found in the docs here.
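For instance, all of the following could be generated purely from their names; these particular methods are illustrative additions, not part of the original project:

```java
import java.util.List;
import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Repository;

@Repository
public interface VideoRepo extends CrudRepository<Video, Long> {
    // Substring match against video.name
    public List<Video> findByNameContaining(String fragment);

    // Everything in a category, sorted alphabetically by name
    public List<Video> findByCategoryOrderByNameAsc(String category);

    // Count the matches instead of returning them
    public long countByCategory(String category);
}
```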
All that's left is to tell the Spring application to search for repositories in our code via `@EnableJpaRepositories()` in our `Application.java` file.
```java
@EnableAutoConfiguration
@EnableJpaRepositories(basePackageClasses = VideoRepo.class) // New
@ComponentScan(basePackages = "com.mobile")
@Configuration
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}
```
This will then automatically create implementations of any repositories it finds and auto-wire them where needed:
```java
@Controller
public class VideoService {
    @Autowired
    private VideoRepo videos;
    // ... request mappings ...
}
```
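As a rough sketch of what those request mappings might look like when they use the auto-wired repository (the paths and handler names below are hypothetical, not from the original project):

```java
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.*;

@Controller
public class VideoService {

    @Autowired
    private VideoRepo videos;

    // Hypothetical endpoint: store a video sent as JSON in the request body
    @RequestMapping(value = "/video", method = RequestMethod.POST)
    public @ResponseBody Video addVideo(@RequestBody Video v) {
        return videos.save(v); // CrudRepository.save() persists the entity
    }

    // Hypothetical endpoint: search by name using the derived query method
    @RequestMapping(value = "/video/search", method = RequestMethod.GET)
    public @ResponseBody List<Video> findByName(@RequestParam("name") String name) {
        return videos.findByName(name);
    }
}
```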
Understanding SQL Injection Attacks
Below is an insecure method that allows users to search our database for their own private videos:
```java
public class VideoService {
    public List<Video> myVideos(String name) {
        User u = getUser(...);
        String query = "SELECT * FROM video WHERE owner='" + u.getName()
                + "' AND name='" + name + "'";
        return execute(query);
    }
}
```
Let's say the user is `coursera` and passes in the name `sql` to the method. It will result in the following query:

```sql
SELECT * FROM video WHERE owner='coursera' AND name='sql'
```
This is exactly what was intended. But if the user passes `foo' OR 'a'='a` in place of `sql`:

```sql
SELECT * FROM video WHERE owner='coursera' AND name='foo' OR 'a'='a'
```
This results in additional logic being executed, which is potentially very dangerous. In this case, `owner='coursera' AND name='foo'` is evaluated together, `'a'='a'` always evaluates to true, and the two are joined by OR, so the attacker gains access to every video in the database. Arbitrary logic can be injected into the query, allowing the attacker to do all kinds of things to the database, such as changing or deleting data.
So whenever we create any mechanism that mixes client data with executable logic, it must be done in a way that prevents the client from taking control. Luckily, if we define queries solely through the methods provided by Spring, such as `findBy` or `findAll`, we are guaranteed immunity to SQL injection attacks. Keep in mind that these kinds of attacks can affect all kinds of systems, not just SQL databases.
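When a hand-written query is truly unavoidable, the standard defense is a parameterized query, where the driver binds client input as data rather than splicing it into the SQL text. A minimal plain-JDBC sketch, assuming a configured `DataSource` and the `video` table from the example above:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class SafeVideoQueries {

    private final DataSource dataSource; // assumed to be wired up elsewhere

    public SafeVideoQueries(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // The '?' placeholders are bound by the JDBC driver, so client input is
    // always treated as a value and can never alter the query's structure.
    public int countMyVideos(String owner, String name) throws SQLException {
        String query = "SELECT COUNT(*) FROM video WHERE owner = ? AND name = ?";
        try (Connection c = dataSource.getConnection();
             PreparedStatement ps = c.prepareStatement(query)) {
            ps.setString(1, owner);
            ps.setString(2, name);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }
}
```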
Spring Data REST
Spring Data REST is a way of automatically making our data repositories accessible via a RESTful interface. Spring Data REST configuration is defined in a class called `RepositoryRestMvcConfiguration`, which can be imported into the application's configuration.
Note: this step is unnecessary if you use Spring Boot's auto-configuration. Spring Boot automatically enables Spring Data REST when you include `spring-boot-starter-data-rest` in your list of dependencies and your app is flagged with either `@SpringBootApplication` or `@EnableAutoConfiguration`.
To customize the configuration, register a `RepositoryRestConfigurer` (or extend `RepositoryRestConfigurerAdapter`) and implement or override the `configure…` methods relevant to your use case.
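As a minimal sketch, assuming the adapter class from the Spring Data REST versions contemporary with this course, a configurer that moves the generated endpoints under the `/api` base path seen in the example responses below might look like this:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.data.rest.core.config.RepositoryRestConfiguration;
import org.springframework.data.rest.webmvc.config.RepositoryRestConfigurerAdapter;

@Configuration
public class RestConfig extends RepositoryRestConfigurerAdapter {

    // Override only the configure… method we care about; here we move every
    // generated repository endpoint under the /api base path.
    @Override
    public void configureRepositoryRestConfiguration(RepositoryRestConfiguration config) {
        config.setBasePath("/api");
    }
}
```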
Spring Boot will automatically generate paths for routing requests to the repositories based on the name of the entity each one holds. For example, the default values for `Video` will be:
- path = "videos"
- itemResourceRel = "video"
- collectionResourceRel = "videos"
If we GET from `/videos`:

```json
"_links": {
    "self": {
        "href": "http://localhost:8080/api/videos"
    },
    "videos": {                  <-- collectionResourceRel
        "href": "http://localhost:8080/api/videos"
    }
}
```
If we GET from `/videos/1`:

```json
"_links": {
    "self": {
        "href": "http://localhost:8080/api/videos/1"
    },
    "video": {                   <-- itemResourceRel
        "href": "http://localhost:8080/api/videos/1"
    }
}
```
`@RepositoryRestResource` can be used to change these details if needed:
```java
@RepositoryRestResource(collectionResourceRel="video-collection", path="custom-video-path")
public interface VideoRepo extends CrudRepository<Video, Long> {
    //... stuff ...
}
```
NoSQL Databases
A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. There are several varieties of NoSQL databases:
- Key-Value Store
- Big Table
- Document Store
- Graph
Denormalization
Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. Denormalization, by contrast, is the process of trying to improve the read performance of a database, at the expense of some write performance, by adding redundant copies of data or by grouping data.
At the most fundamental level, NoSQL databases can be thought of as giant hash tables. Using a hash table as an example:
```java
Map<User, List<Purchase>> purchasesByUser = ...;

// If we want to get a list of purchases by user
List<Purchase> ps = purchasesByUser.get(user);

// If we want to get users by purchase, we have to scan the entire map
List<User> us = new ArrayList<>();
for (Map.Entry<User, List<Purchase>> e : purchasesByUser.entrySet()) {
    for (Purchase p : e.getValue()) {
        // some logic...
        if (more_logic) us.add(e.getKey());
    }
}
```
Looking up information in the form the Map is already keyed by, such as the list of purchases for a given user, is very straightforward. If we need the information in some other form, such as the list of users that have made a given purchase, we have to do a lot of inefficient work.
To increase performance, we can denormalize this data:
```java
Map<User, List<Purchase>> purchasesByUser = ...;
Map<Purchase, List<User>> usersByPurchases = ...;

// If we want to get a list of purchases by user
List<Purchase> ps = purchasesByUser.get(user);

// If we want to get a list of users by purchase
List<User> us = usersByPurchases.get(purchase);
```
By duplicating the data in a different form, we substantially increase the performance of the query. Of course, this comes with the downside that whenever details need to be updated, they must be updated in two places instead of one.
Optimizing for Query Patterns
Let's say we have a comments page where each comment carries its content, a username, and the user's country:
- We could implement this with a set of user entities, and a separate set of all the comments for a particular page, each with a key pointing to the user that created the comment. When we load a page, we can query for the comments of the page, and then, using the user keys from the query, retrieve the relevant user information.
- We could also directly embed the user information with each comment. When the page is loaded, we only need to search for comments that belong to a particular page.
- We could take it a step further by storing all the comments for a page and their information as a single aggregated entity. That way, instead of searching a set of all comments for the ones that belong to a specific page, we just query for the page itself.
Approach 1 allows us to keep our data normalized, and whenever a change takes place, we only need to change data in one location. The downside, however, is that whenever the page loads, the server needs to make two separate queries in sequence to retrieve the data it needs.
With approaches 2 and 3, we can achieve very fast read speeds. However, when an update needs to occur, such as a user changing their country of origin, we have to update this information in multiple places. These approaches are better applied to data that rarely changes.
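As a concrete sketch of approach 2, assuming a document-store style entity (the class and field names here are hypothetical):

```java
// Hypothetical comment entity for approach 2: the user's details are
// embedded (denormalized) in each comment, so loading a page takes one
// query, but a change to the user must be written to every comment.
public class Comment {
    private String pageId;   // page this comment belongs to
    private String content;
    private String username; // copied from the user entity
    private String country;  // copied from the user entity
    // ... getters/setters ...
}
```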
Ultimately it's about tailoring the design of your database to the needs of the application.
Write Contention & Sharding
In the same comments section as before, we now want to include a comment count at the top of the page:
- Each time the page loads, we can count the number of comments for a given page. Obviously this will have very slow performance, especially for very large databases.
- We could have a separate set that stores the count for the number of comments on a page. Whenever a comment is added, that count gets incremented by 1. This will provide an extremely fast read. However, when large volumes of comments are added simultaneously, each updater will have to lock the count, update it and then release it before another updater can update it again. This will cause a write bottleneck.
- We could also take advantage of a technique called sharding. Here, we create subsets of the data (shards) to enable parallel writes based on the number of shards we have. Instead of a single value that holds the comment count, it can be split into N shards, which allows N updaters to run simultaneously. Whenever a request comes in for the comment count, we simply sum up the values in all N shards. This is somewhat of a middle ground between approaches 1 and 2 (see the sketch below).
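To make the sharding idea concrete, here is a minimal in-memory sketch of a sharded counter; in a real system each shard would be its own row or entity in the database, but the split-write, summed-read logic is the same:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

public class ShardedCounter {
    private final AtomicLong[] shards;

    public ShardedCounter(int numShards) {
        shards = new AtomicLong[numShards];
        for (int i = 0; i < numShards; i++) shards[i] = new AtomicLong();
    }

    // Writers pick a random shard, so N writers rarely contend on the same value.
    public void increment() {
        shards[ThreadLocalRandom.current().nextInt(shards.length)].incrementAndGet();
    }

    // Reads sum every shard: slightly more work per read, but writes scale.
    public long count() {
        long total = 0;
        for (AtomicLong shard : shards) total += shard.get();
        return total;
    }
}
```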
Implementing MongoDB
When implementing MongoDB in our example video application, given all of the Spring abstractions we have been using, the changes needed are surprisingly minimal.
Aside from adding the dependencies to our build file, all we need to do is change our `VideoRepo` interface to extend `MongoRepository` rather than `CrudRepository`, and then replace the JPA annotations on our `@Entity` object with their MongoDB equivalents.
```java
@Repository
public interface VideoRepo extends MongoRepository<Video, Long> { ... }
```
```java
@Document // Was @Entity; @Document is the MongoDB equivalent
public class Video {
    @Id
    // Removed @GeneratedValue()
    private long id;
    // ...
}
```
Implementing Amazon DynamoDB
Our repository interface needs just one change from the original: since we also have to change the type of the ID to `String`, the interface's ID type parameter must match. The bulk of the work is on our `@Entity` object, where we replace the annotations with DynamoDB-specific ones.
```java
@Repository
public interface VideoRepo extends CrudRepository<Video, String> { ... } // ID was Long
```
```java
// Was @Entity
@DynamoDBTable(tableName="Videos")
public class Video {

    private String id; // Was Long
    private String name;

    // Constructors...

    @DynamoDBHashKey
    @DynamoDBAutoGeneratedKey
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    @DynamoDBAttribute
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    // other attributes...
}
```